CB Queryoptimization 01
Query Optimization
INDEX OF FIGURES
INTRODUCTION
DV Cookbook Series
QUERY OPTIMIZATION
Before You Start
The Denodo Platform Compared to Relational Databases
THE LIFE OF A QUERY
ANALYZING QUERY PERFORMANCE
Execution Trace
Query Plan
Exploring an Execution Trace
Query Plans vs. View Trees
DELEGATION TO SOURCES (PUSH-DOWN)
The Benefits of Delegation
Showcase: The Performance Effects of Delegation
Optimizing for Delegation
Delegation Through Caching
JOINS
Join Methods
Hash Joins
Merge Joins
Nested Joins
Nested Parallel Joins
Join Methods and Data Source Query Capabilities
Inspecting the Join Method Chosen by the Optimizer
Nested Joins in Query Traces
Manually Specifying the Join Method
N-joins
Join Types
Inner Joins
Outer Joins
Cross Joins
Join Type vs. Join Method
MEMORY USAGE
Query Complexity vs. Dataset Size
Row Shipping vs. Row Hoarding (Streaming vs. Blocking)
Caching and Memory Usage
Java Heap Memory
Swapping
Analyzing Memory Usage
Interesting Memory Patterns
Server Memory Settings
STATIC OPTIMIZATION
Enabling Static Optimization
View Definition vs. Query Conditions
Sample Query Rewriting Scenarios
Branch Pruning
Join Reordering
Join Type Change
Post-pruning Delegation
Aggregation Pushdown
Partial Aggregation Pushdown
Join Pushdown
Data Movement With Partitioning
DYNAMIC OPTIMIZATION
View Statistics
Statistics Used by the Optimizer
How to Gather Statistics
How Often Should You Gather Statistics?
DATA MOVEMENT
Data Movement as an Optimization Technique
Rationale and Benefits
DATA SOURCE OPTIMIZATIONS
Connection Pools
Pass-through Credentials
Base Views From SQL Query
DENODO PLATFORM THREAD MODEL
VQL CONSIDERATIONS
Sorting
Aggregations
The DISTINCT Clause
Functions
Delegable vs. Non-delegable Functions
Delegable Custom Functions
The CONTEXT Clause
OTHER CONSIDERATIONS
SUMMARY
DV COOKBOOK SERIES
This Data Virtualization Cookbook is part of a series of in-depth manuals showcasing
modern uses of data virtualization in the enterprise. In these books we explain common,
useful data virtualization patterns, from the big-picture, business-decision-making level, to
the architectural considerations, to the low-level, technical details and example data sets.
The Denodo Cookbooks will enable you to understand the architecture of current data
virtualization best practices and get you ready to apply these patterns to your own business
cases, ensuring you get the maximum value out of your data processing infrastructure.
Query Optimization
Query optimization is a critical aspect of any data virtualization project. The Denodo Platform
provides state-of-the-art, automated optimization but every architect involved in a data
virtualization solution should be familiar with the techniques that are automatically applied to
every query received by the Denodo Platform. In many cases, out of the box, the Denodo
Platform will meet or even exceed the performance requirements of the project at hand, but
in some complex scenarios, some degree of manual tuning can bring the platform’s
performance to the next level. This Cookbook will describe the areas of the Denodo Platform
that include query-performance components and it will cover their internal behavior, what
tuning opportunities they offer to the query designer, and in what situations each one should
be used.
This is an introductory guide to help new users understand performance optimization in the
Denodo Platform, but it is not exhaustive; users are encouraged to further explore these
topics by reading other technical materials available from Denodo, such as the Virtual
DataPort Administration Guide, the Virtual DataPort Advanced VQL Guide, and the Denodo
Community Knowledge Base.
The Denodo Platform optimizer is an extremely advanced piece of engineering that can apply
optimizations whose functions might not be immediately clear to developers or architects.
While these optimizations are applied in a very intelligent way, the optimizer does not
operate in a vacuum: it needs view statistics, such as the number of rows or the number of
distinct values per column, to make its optimization decisions.
The descriptions of the dynamic optimizer that are detailed in this book assume that the
optimizer has access to the statistics of the views involved in the queries; it is the
responsibility of the Denodo Platform users to actually gather these statistics in a timely
manner. Without them, the dynamic optimizer cannot do its job. So remember: enable and
gather the statistics for your views, if you want to have a fully functioning optimizer!
If you see queries that are not being optimized as they should be, check to see 1) if
the dynamic optimizer is operating, and 2) if the view statistics for the views
involved in the query are up-to-date.
For starters, the Denodo Platform holds no data itself: it retrieves the data in real time from
the data sources (or the cache system, if the data has been previously cached) every time a
query is received from a data consumer. This and other differences make the data
virtualization layer different in its operation than other traditional information processing
systems. Particularly in the area of query performance optimization, the concerns that
developers and administrators should keep in mind are different than in typical database
scenarios: optimizing queries is very different when you own the data versus when you have
to retrieve the data from multiple, disparate remote systems.
The main motif throughout this book will be the optimization of all operations with two goals
in mind:
• Reducing the amount of data processing performed at the data virtualization layer.
• Reducing the amount of data transferred from the data sources to the data virtualization layer.
These two ideas are key in bringing the performance of a data virtualization solution in line
with that of a traditional physical data scenario, and they are very different than the
considerations that are taken into account by the architecture and design of relational
databases and data warehouses. Many techniques are applied by the optimizer to this effect,
and most of them try to follow the philosophy of making the data sources work as much as
possible in preprocessing the data, and transferring the absolute minimum amount of data
necessary from the sources into the Denodo Platform. This book will review all of these
techniques, it will explain the different parts of the query optimization pipeline in the Denodo
Platform, and it will showcase the areas that query designers and administrators should pay
attention to when designing high performance data virtualization solutions.
A note about this process: the description below is a simplified model that does not
exactly match the way that the Denodo Platform executes queries, but it is an
excellent, high-level mental model for understanding how the Denodo Platform
works. The actual implementation may reorder operations, merge steps, and
execute some in parallel, but the model explained below is very useful for
understanding the actual behavior of the data virtualization software.
1. The query is received by the server and checked for acceptance (for example, by
verifying that the user has the required privileges over the views involved).
2. Once the query has been accepted into the server, it gets parsed. This process
involves transforming the string into a data structure that can be analyzed and
modified by the query optimization and execution pipeline.
3. The parsed query is then fed to the static optimizer. In this step the general structure
of the query is analyzed, and it is rewritten in a way that preserves the semantics of
the query (the results of the rewritten query will be exactly the same as the results of
the original query) yet improves the performance (for example, by removing
unnecessary conditions or subqueries, reordering operations, etc.). We will see the
details of this optimization later.
4. The statically optimized query is then passed on to the query plan generator. VQL is
a declarative language, meaning that the user specifies what results are expected,
but not how to compute them. The exact list of steps that need to be executed to get
those results is called a query execution plan, and it is the responsibility of the query
plan generator to analyze the query and come up with a plan. It is important to note
here that multiple query execution plans can compute the same result set yet
demonstrate different performance characteristics.
5. The candidate execution plans are evaluated by the dynamic (cost-based) optimizer,
which uses the gathered view statistics to estimate the cost of each plan and
selects the one expected to perform best.
6. The selected plan is then passed to the execution engine, which takes the plan and
executes its steps to retrieve the data from the sources, combine and transform the
results, and prepare the result sets for the data consumer.
7. After the execution of the query has finished, and all the associated resources have
been freed, the query is considered complete.
Every query received by the Denodo server is processed in this way, although in some cases
the process stops early, such as if the query is rejected for security reasons, if the query is not
correctly formatted, if errors appear in accessing the sources, if calculations result in
arithmetic errors (for example, a division by zero), etc. A query can fail in many ways, but the
rest of this Cookbook focuses on how to optimize queries that finish their executions
correctly, highlighting the static and dynamic optimization steps, and how they affect the
execution step in combination with the decisions taken when designing the query and
defining the virtual views used by it.
The first step to optimizing the performance of a query is to analyze what is happening when
we execute it. This is an often-overlooked step, as it is very common to treat assumptions
about the runtime behavior as fact, which renders any corrective actions much less
effective than they could be. When approaching optimization problems, try to gather actual
data as much as possible, and draw conclusions from the gathered data.
Execution Trace
Our main tool in this query analysis phase will be the execution trace. The execution trace of
a query is a textual or graphical representation of the actual execution of its query plan. If we
think of the query plan as a template for the execution of a query, the trace is akin to filling
that template with actual values that describe a single execution of the query. As such, it
gives us a detailed view of all the things that happened when running the query.
If the “Execution Trace” button is disabled, make sure that the option “Execute
with TRACE” is selected in the previous screen.
• If the query was executed through the VQL Shell, the trace can be examined by
clicking on the “Execution Trace” button of the “Query Results” tab.
If the “Execution Trace” button is disabled, you will need to add the TRACE keyword
at the end of your query to make sure this information is recorded. For example, if
you are running the query SELECT * FROM my_view, modify it to
SELECT * FROM my_view TRACE.
As this method shows the execution trace of a query that is still being executed,
there will be nodes that have not been executed yet.
Query Plan
These steps help us to get the actual execution trace of a query while or after it is run. The
contents of this trace are a specific instance of the query plan with values related to that
execution. In some cases, we might be interested in inspecting the query execution plan
without actually executing the query (for example, because we might not want to wait for a
long-running query, or we might not want to hit the source systems at that moment). We can
do so by clicking on the “Execute” button of a view and then the “Query plan” button.
Remember: this gives you a view of the execution plan of the query but not the details of a
specific execution; we get a view of the template, but the details are not filled in. Regardless
of that, it is always useful to take a look at the query plan if we cannot get the full execution
trace.
The graphical representation of the execution trace has two parts: the tree overview and the
details panel. You should be very familiar with all the details in this screen, as they will be
used across the whole book and will be of critical importance in your day-to-day interactions
with the Denodo Platform.
Each node in the tree can represent a different type of operation, indicated by its icon and
background color. In the Virtual DataPort Administration Guide, you can find a comprehensive
reference covering all the available types.
• Type: The type of operation that this node executes (for example, a join, an
aggregation, an interface view, etc.).
• Execution time: The node’s total execution time. This enables you to search for
bottlenecks in the execution of any query. Always check your assumptions about
possible problems against the actual timing displayed in each node.
• Execution timestamps: Three timestamps are available: start, end, and response time,
which mark the start of the execution of the node, the end of the execution of the
node, and the time when the first result row was returned, respectively. The most
common use for these is checking the difference between the start time and the
response time.
• State: The status of the node. For example, the node may be running (PROCESSING),
finished (OK), waiting for other nodes (WAITING), unable to finish (ERROR), etc.
• Completed: The panel will display Yes if the node has finished successfully.
Otherwise, it will display No.
• Search conditions: The conditions used in this node. It’s useful to review this
information because the static and dynamic optimizations applied to the query plan
could have modified the original conditions.
• Detailed row numbers: A list of how many rows were discarded by the filter conditions,
and how many were discarded because they were duplicates.
• Memory details, which are extremely useful for diagnosing slow queries:
• Memory limit reached: This message will be displayed if the memory used by the
node, for intermediate results, reached the limits configured for this view.
• Swapping: If the panel displays Yes, intermediate results could not fit in the
allocated memory space and were therefore stored in the physical hard drive,
slowing the process down. If the panel displays No, then no disk swapping has
occurred.
The query plan of a query is a tree-like structure, very similar to the view hierarchy of the view
that is being queried. The two can match in their structure but they don’t have to; the role of
the optimization pipeline in the Denodo Platform is to identify opportunities for reducing the
complexity and the number of executed steps, to improve query performance. The more the
structure of a query plan matches the structure of the view being queried, the fewer
optimization techniques have been applied to the query. An extreme optimization case
occurs when a query is fully delegated to the data source, and the Denodo Platform does not
perform any additional steps; we will see examples of this in the next section.
We have reviewed the big picture of what happens to queries that are received by the
Denodo Platform and how to inspect the inner workings of the queries during and after their
execution. Now we can start talking about the different factors that impact performance and
how to use them to our advantage.
The first important technique that the Denodo Platform uses for optimizing query
performance is delegation to sources (or push-down). In this context, “delegation to sources”
refers specifically to the delegation of processing (or operations) to the data sources; that is,
making the data sources perform as much data processing as possible.
In a data virtualization scenario, several software elements can potentially have data-
processing capabilities, at both the source level and the data virtualization level. The simplest
case is a relational database that acts as a source for the virtualization layer—both systems
have similar capabilities, so any operation on the data could be performed by either the
source database or the data virtualization layer.
• One way to execute the join would be for the data virtualization layer to pull the data
from both tables and then do the join in-memory.
• Another way would be to push the join down to the source database, so that the
database executes the join and only the joined results are transferred to the data
virtualization layer.
In the second case, the join operation has been pushed down to the source. In general, any
operation can be pushed down: relational operations (joins, unions, etc.), scalar functions
(abs, upper, etc.), aggregation functions (avg, max, min, analytic functions, etc.), and more.
In order for an operation to be pushed down to the data source, it must meet two
requirements:
1. The data source must have the capability to support the operation. This means that if
the data virtualization layer is to instruct a source to execute an operation over the
data, the data source must be able to carry it out. Data processing capabilities vary
across different types of data sources, for example:
• Traditional databases and data warehouses offer large sets of operations so in
general they will be able to support a high degree of delegation.
• Web services usually offer a much more limited set of operations. Most often, they
just support selecting data from them, along with some limited filtering
capabilities, but without the richness that relational databases allow through SQL.
• Flat files, such as XML or CSV files, don’t accept any delegation whatsoever.
2. The operation must be performed over a data set that is contained in a single source.
This means that all operations must be performed on data that is already residing on
that source, a source with visibility over the whole set of data being processed by the
pushed down operations (this rule has an exception, which will be explained in the
section that deals with data movement).
Generally speaking, delegating operations to data sources will improve query performance.
This is achieved through a variety of mechanisms:
• Source systems have more intimate knowledge of low-level details about the data,
which helps in performing the operations with better performance. Source systems
have a detailed understanding about the physical distribution of the data as well as
details about the performance of physical storage regarding I/O, local indexes of
data, status of memory buffers, etc. Each source system will also have its own query
engine and optimizer, which will incorporate those additional insights into the system
to further optimize each query sent by the data virtualization layer. So it makes sense,
from the performance side, to send as much work as possible to the source systems.
• Delegation also minimizes network traffic: instead of transferring entire source
data sets into the data virtualization layer, only the (usually much smaller) results
of the pushed-down operations need to travel over the network.
Let’s see a concrete example of the performance benefits we get by using an optimizer with a
“delegate first” orientation. Assuming the following data model:
In a scenario with no delegation, for example, with both tables stored in different physical
systems, the naive approach would be to:
1. Query the first source and transfer its 100 million rows to the data virtualization
layer.
2. Query the second source and transfer its 1 million rows to the data virtualization
layer.
3. Join both data sets in memory and compute the aggregation over the joined results.
4. Stream the resulting 20,000 rows to the client application.
With those steps, the data virtualization layer has transferred a total of 101,020,000 rows
(including the 20,000 sent to the client application), and has performed a 100 million x 1
million join, in memory, and a further aggregation.
If those two tables were stored in the same relational database, the situation would be very
different:
1. The data virtualization layer generates a single SQL query containing both the join
and the aggregation, and sends it to the database.
2. The database executes the whole query and returns only the 20,000 resulting rows to
the data virtualization layer.
3. The results of the query are streamed by the data virtualization layer to the client
application (20,000 rows with a single column).
Note that in a real-world scenario, the optimizer would try to push the aggregation
to the source, even if the join operation cannot be delegated, greatly improving
performance. This will be explained in the “Static optimization” section of this
book.
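As a quick sanity check, the transfer volumes in the two scenarios above can be reproduced with simple arithmetic:

```python
# Table sizes from the showcase: 100 million rows joined with 1 million rows,
# producing 20,000 aggregated result rows.
LEFT_ROWS = 100_000_000
RIGHT_ROWS = 1_000_000
RESULT_ROWS = 20_000

# No delegation: both full tables travel to the virtualization layer,
# plus the final result sent on to the client application.
no_delegation = LEFT_ROWS + RIGHT_ROWS + RESULT_ROWS   # 101,020,000 rows

# Full delegation: only the final result ever leaves the database.
full_delegation = RESULT_ROWS                          # 20,000 rows

# Delegating moves roughly 5,000 times fewer rows across the network.
reduction_factor = no_delegation // full_delegation
```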
The Denodo Platform is always looking for opportunities to delegate operations to source
systems; it is one of the most important parameters guiding query optimization decisions. It is
so important that it takes precedence over other considerations when choosing among the
generated query plans. This means that sometimes the optimizer makes choices that may
seem suboptimal at first but might make sense when taking into account the full picture. For
example, the query engine might ignore a cache of a view in very specific scenarios if it
results in a higher degree of delegation to the source.
When designing queries and view hierarchies, the data architect or developer should always
keep this in mind, and thus try to always group operations affecting each data source as
closely as possible to maximize opportunities for delegation to those sources.
• If the data source does not support the operations being executed, the query or
subquery cannot be delegated; likewise, if the data source is a passive system
(such as a flat file), it cannot execute any operation on its own.
• If a query or subquery references tables that exist in different data sources it cannot
be delegated, so it must be split into subqueries that are small enough so that they
only reference data in a single source system; the exceptions to this are data
movement and reordering of operations so that they can be delegated, and both of
these topics will be discussed later in this book.
• The cache is always shared for all the views of a given virtual database. This means
that the cache for a given virtual database is all stored in a single schema on the
cache system, which in turn means that the data in the cache can be used in
combined operations by the cache relational database.
These two features combine in a way that enables the data architect to cache views from
different data sources in a tactical way, so that operations that would not be delegable
using the naive approach (because they involve views over different data sources and/or
data sources with non-delegable capabilities) become delegable because:
1. All the views involved in the query are stored in the cache so they can be combined
together (circumventing the original non-locality of the data).
2. The cache relational database provides sufficient execution capabilities to run all the
operations involved in the query (circumventing the reduced capabilities of the
original data sources).
All three sources have been cached in full mode, and the query can be fully delegated to the cache.
This section of the Cookbook will cover the most important aspects of join operations and
how they relate to optimization scenarios in the Denodo Platform: the different join methods,
how they relate to the data sources, how to inspect and modify them, and the different types
of joins.
Join Methods
The most important aspect of join operations is the method used to execute the join. When a
user issues a query that joins two views, the query only specifies that the two datasets have
to be combined, without specifying exactly how. This is due to the declarative nature of SQL;
the user specifies what is needed, and it is the responsibility of the query engine to generate
a query plan with the specific steps for fulfilling the user’s request. There are several ways to
execute a join, so let’s review all the options provided by the Denodo Platform, and the
rationale behind each one.
Choosing the best join method automatically is one of the main tasks of the
optimizer. You should always gather the statistics of the views in your project and
enable the cost-based optimization in your server, to enable the optimizer to use
the best join method and join order for each query.
The join methods described below are not specific to the Denodo Platform; they
are the standard methods that traditional relational databases use to execute
joins, so each is covered by extensive literature, describing their advantages and
disadvantages and when they should be applied.
HASH JOINS
The hash join operates by building a hash table of the join attribute(s) of all the rows on the
right side of the join and then checking for matches in the table for each hash value on the
left side. The process works as follows:
1. The right side of the join is queried, and all of its rows are retrieved.
2. A hash table is built, using, as keys, the values of the join attributes for each of the
rows obtained from the right side.
3. Once the table is built, the left side of the join is queried.
4. For each row received from the left side, its join attributes are used to build a key
that is used to check for occurrences in the hash table built in step 2.
a) If the key does not exist in the hash table, the row is discarded.
b) If the key exists in the hash table, the join between the row on the left side and
all the rows from the right side that share the same key is performed, and those
new joined rows are added to the result set of the join.
5. The process continues until all the rows from the left side have been processed.
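The steps above can be sketched in a few lines of Python; this is an illustration of the general algorithm with hypothetical data, not Denodo's actual implementation:

```python
from collections import defaultdict

def hash_join(left_rows, right_rows, key):
    """Hash join sketch: build a hash table over the right side,
    then stream the left side against it."""
    # Steps 1-2: query the right side in full and build the hash table.
    # This part is blocking: nothing can be returned until it finishes,
    # and memory use is proportional to the size of the right side.
    table = defaultdict(list)
    for row in right_rows:
        table[row[key]].append(row)
    # Steps 3-5: stream the left side row by row (non-blocking).
    for row in left_rows:
        # 4a) no match in the table: the row is discarded.
        # 4b) match: join with every right-side row sharing the key.
        for match in table.get(row[key], []):
            yield {**row, **match}

# Hypothetical example data:
orders = [{"id": 1, "total": 10}, {"id": 1, "total": 20}, {"id": 3, "total": 5}]
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
result = list(hash_join(orders, customers, "id"))
```

Note how the two phases differ: the build phase must finish completely before the first result row can appear, while the probe phase returns rows as fast as the left side delivers them.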
The most important aspects of this process, from the standpoint of performance, are:
• The whole data set from the right side is queried and processed before the query on
the left side begins.
• Once the right side is ready, the rows from the left side are streamed from the source,
compared to the hash table, and then returned. This happens row by row, so as soon
as the right side is finished, the join is potentially ready to start returning
rows, and the number of rows per second returned will depend only on the speed at
which the rows are retrieved from the left side. However, the join will never return
any row before the right side has finished.
• The memory required for this type of join is directly proportional to the size of the
result set from the right side, because before the join can return the first row, it needs
to build a hash table of all the rows on the right side.
Those three factors determine when hash joins should be used and when they should be
avoided: in general, prefer a hash join when the data set on the hash-table side is small
enough to fit comfortably in memory, and avoid it when both data sets are extremely large.
One important aspect we mentioned in the previous description is the difference between
operations (or parts of operations) that can operate on individual rows versus operations that
need the full data set before continuing. In our example, the task of building the hash table is
of the second variety: it blocks the whole join operation until it has been completed. On the
other hand, the processing of the left branch belongs in the first category: once the hash
table is built, each row from the left side can be processed individually and returned as part
of the join results (if it finds a match in the hash table). We will review this concept in more
depth later, as it is very important when analyzing performance; keep your eyes open so you
can spot these two kinds of operations throughout our discussion.
A note about join ordering: The description of the hash join, as explained here,
applies to joins that use the natural order. That is, they execute the hash join by
building the hash table on the right branch of the join. If you run into a situation
where a hash join seems like the right choice but your small dataset is on the left
side instead, then you can still use a hash join effectively by configuring the join to
use reverse order. In that case, the join will build the hash table on the left branch
and then stream the rows from the right side.
MERGE JOINS
Merge joins work by doing a simultaneous linear scan of both branches of the join. The
process works as follows:
1. The merge join operation requests the data from both branches of the join to be
sorted by the join attributes.
2. Both branches start returning their rows in that sorted order.
3. The merge join takes all the rows from the left side that contain the first value for the
join attribute. This could be a single row or a set of rows that have that value. We will
consider the case of just a single row for this (imagine a join using a primary key
attribute), but the concept is easily extensible to multiple rows with the same join
attribute value.
4. The merge join takes all the rows from the right side that contain the first value for
the join attribute. Let’s again assume this is a single row.
5. The join attribute values of the current rows from both sides are compared. If they
match, the rows are joined and returned; if they don't, a new row is drawn from the
side with the smaller value.
6. Step 5 is repeated until we run out of rows in one of the branches of the join.
To understand why this operation works, remember that both branches are sorted by the join
attribute, so any time we draw a new row, we are increasing the value of that attribute. When
the values of the join attribute from both sides of the join don’t match, then we draw rows
from the side that has the smallest value, thus eventually increasing that value, until they both
match (or until the side that was smaller becomes bigger, in which case the merge join starts
drawing rows from the other side to try to balance the values on both branches and find a
match).
This explanation only considers situations with non-repeating join attributes: the process
described above draws rows one at a time. In the real world, we will often find a join between
two tables where there are many rows that share the value of the join attribute. The extension
of the algorithm for calculating the merge join in this case is simple: Instead of drawing a
single row, the merge join draws all the rows with the same value for the join attribute, and
when a match is found, instead of joining two rows, the merge join performs a cartesian
product between the partial sets pulled from both sides of the join (matching each row from
the left side with all the rows from the right side) and returns all those rows as partial results
of the join.
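Assuming both inputs arrive pre-sorted by the join key, the algorithm, including the duplicate-key extension, can be sketched in Python (an illustration with hypothetical data, not Denodo's implementation):

```python
from itertools import groupby, product
from operator import itemgetter

def merge_join(left_sorted, right_sorted, key):
    """Merge join sketch: simultaneous linear scan of two inputs that
    arrive already sorted by the join key. Groups of rows sharing a key
    value are combined with a cartesian product."""
    lgroups = groupby(left_sorted, key=itemgetter(key))
    rgroups = groupby(right_sorted, key=itemgetter(key))
    lkey, lrows = next(lgroups, (None, None))
    rkey, rrows = next(rgroups, (None, None))
    while lrows is not None and rrows is not None:
        if lkey == rkey:
            # Matching keys: cartesian product of the two partial sets.
            for lrow, rrow in product(list(lrows), list(rrows)):
                yield {**lrow, **rrow}
            lkey, lrows = next(lgroups, (None, None))
            rkey, rrows = next(rgroups, (None, None))
        elif lkey < rkey:
            # No match: advance the side with the smaller key value.
            lkey, lrows = next(lgroups, (None, None))
        else:
            rkey, rrows = next(rgroups, (None, None))

# Hypothetical example data, pre-sorted by "k":
left = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}, {"k": 2, "a": "z"}]
right = [{"k": 2, "b": "p"}, {"k": 3, "b": "q"}]
rows = list(merge_join(left, right, "k"))
```

Each input is scanned exactly once and only the current group of rows is held in memory, which is why this method scales well to two very large data sets.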
From a performance point of view, the behavior of this join has the following implications:
• The merge join requires both sides of the join to get their data sorted. This implies
that the source systems or subqueries on both sides of the join 1) need to be able to
sort the data and 2) will potentially incur a slight performance penalty due to this
sorting operation. The exact magnitude of this impact will depend on the source
system, but in general, relational systems that have indexes defined over the
appropriate columns will demonstrate high performance during this operation.
• Once the results start coming in from both sides, the merge join does a simple linear
scan of both tables, with very simple comparison operations at each step. This is a
very lightweight process that offers excellent performance.
• The merge join offers the best performance versus other types of joins, when there is
a need to join two extremely big data sets.
When to use a merge join:
• Sorting both branches of the join is an acceptable performance tradeoff.
• Both branches of the join have big data sets.
When to avoid using a merge join:
• Sorting both branches of the join would have too big an impact on performance.
• At least one of the data sets is small.
One last consideration: Merge joins are symmetrical, so they are not affected by the order
specified in the join (as was the case with the hash join).
NESTED JOINS
Nested joins work by querying one branch of the join first and then querying the other branch
based on the results of the first branch. In detail:
1. The left branch of the join is queried and the results are obtained.
2. For each row received from the left branch, a new query is issued to the right branch,
adding a WHERE clause to select exactly the rows that match the join condition
derived from the value coming from the left side. For example, if we receive the
id_right 7 from the left side (assuming the join condition is over the id_right field),
then the right branch will be queried with the condition id_right = 7.
3. The results of the query on step 2 are matched to the results of the left side.
4. The joined rows are added to the result set of the join.
5. The process continues until there are no more results from the left branch.
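A minimal Python sketch of this process, where query_right is a hypothetical stand-in for the parameterized right-side source (this illustrates the general algorithm, not Denodo's implementation):

```python
def nested_join(left_rows, query_right, key):
    """Nested join sketch: query the left branch first, then issue one
    query to the right branch per left-side row."""
    for lrow in left_rows:  # step 1: results from the left branch
        # Step 2: query the right branch with a condition on the key value
        # (the real engine batches these calls where possible).
        for rrow in query_right(lrow[key]):
            # Steps 3-4: combine the matching rows into the result set.
            yield {**lrow, **rrow}

# 'query_right' stands in for a parameterized source, e.g. a web service
# that only accepts an input value and returns the matching rows.
customer_db = {7: [{"name": "Ann"}], 8: [{"name": "Bob"}]}
orders = [{"cust_id": 7, "total": 10}, {"cust_id": 8, "total": 5}]
joined = list(nested_join(orders, lambda v: customer_db.get(v, []), "cust_id"))
```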
Nested joins are very common in scenarios where the right side imposes a restriction on the
types of queries that can be accepted; the most typical example is a web service that has an
input parameter. In that case, you can think of the nested join as first getting rows from the
left branch and then feeding those one at a time to the right branch to meet the query
capabilities of the right branch.
In the description of how the nested join works, the right side was queried once per row on
the left side. In reality the Denodo Platform optimizer tries to batch the calls to the right side
when possible to reduce the number of calls as much as is feasible. The exact details of how
this batching is done depend on the specific data source or view tree that is on the right
side—this happens transparently when querying relational databases and LDAP data sources.
Common techniques that are used by the optimizer to specify several values on a single
query are:
• conditions
• clauses
• Using inner joins with a subquery that selects the specific values
The size of the blocks on the right side when batching is controlled by the property
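As an illustration, batching left-side key values into IN conditions could look like the sketch below. The block size, table, and column names are illustrative, and the generated SQL is a simplification of what the Denodo Platform actually sends:

```python
def batched_in_queries(left_values, table, column, block_size=200):
    """Group left-branch key values into blocks and build one query
    per block using an IN condition, instead of one query per value."""
    queries = []
    for start in range(0, len(left_values), block_size):
        block = left_values[start:start + block_size]
        in_list = ", ".join(str(v) for v in block)
        queries.append(f"SELECT * FROM {table} WHERE {column} IN ({in_list})")
    return queries

# Five left-side values with a block size of 2 yield three queries
# instead of five.
qs = batched_in_queries([7, 8, 9, 10, 11], "product", "id_right", block_size=2)
```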
Nested joins are a streaming operation: As soon as the left branch of the join returns a row,
the join can query the right branch, and as soon as the right side rows for that first query are
returned, the join will begin to return results.
When to use a nested join:
• The right branch has source constraints and cannot answer free-form queries, instead forcing the use of input parameters.

When to avoid using a nested join:
• Any time a merge or a hash join can be used.
Nested joins are asymmetrical: The left and right branches of the join have different roles, so make sure that your data sets and data sources match the patterns described in the nested join requirements; if the roles in your data are switched, remember to swap the branch order in your join definition.
Nested joins offer an additional layer of performance optimization: When the join returns only
rows from the right branch, the optimizer identifies the nested join as a semi-join and further
optimizes the execution by not trying to match the results from the right side with the results
from the left side, instead just shipping the results from the right branch as results of the join.
When to use a nested parallel join:
• Use it over a regular nested join any time it is possible.

When to avoid using a nested parallel join:
• The source on the right side cannot cope with the load imposed by a large number of parallel iterations.
Join Methods and Data Source Query Capabilities
As you have seen in this description of the join methods, the optimizer has three main
options when deciding what type of join to use. These options are not equivalent, and as
explained in the previous section, each one makes different assumptions about the data on
each branch of the join.
A very important consideration is the querying capabilities of the underlying views on each branch of the join. When the optimization process decides which method to use, the available options are restricted by the capabilities of those views; the most typical case is a view with mandatory fields (such as a view representing a SOAP web service operation with a mandatory input field), which does not accept an open query (it requires a WHERE clause with a condition setting a value for the mandatory fields). In this situation the only available join method will be a nested join, because that type of source cannot retrieve the full underlying data set in a single query, and so it cannot be used in a merge or hash join. Moreover, because the view specifies input parameters, it will always appear on the right side of the nested join (the side that receives multiple queries), never on the left side (which requires a plain SELECT without a WHERE clause). A single restriction on the query capabilities of the underlying views thus reduces the optimizer's usual room for maneuvering, and the parameters of the join get automatically fixed. Other similar situations may include capabilities related to sorting data; these might come into play, for example, when the optimizer is presented with the option of executing merge joins.
Inspecting the Join Method Chosen by the Optimizer
Checking the join method chosen by the optimizer is easy; inspect either the query plan
(before the query execution) or the execution trace (during or after the query execution), and
click on the node of the execution plan that represents the join that you want to check. The
right panel will display the method used for that particular join execution, in addition to other
detailed information. Remember, this automatic selection is highly dependent on the
optimizer having access to the statistics of the views being queried!
Nested Joins in Query Traces
Notice that this will not happen when inspecting the query execution plan (pre-
execution), only in the query execution trace (post- and during execution). The
query plan does not know the results that will be obtained from the left branch of
the join before execution time, thus it cannot predict the exact number of queries
that will be issued to the right side (although it can estimate the number based on
view statistics as will be shown later). Remember, the query execution plan and the
query trace are related, but they do not convey the same information.
Manually Specifying the Join Method
Denodo Platform users can choose to let the optimizer do its job and select the best join
method for each query (or what the optimizer believes to be the best method). However, the
optimizer may not always choose the most appropriate method for every situation. If this
happens, the user can override the default option and provide an alternative to be used at
runtime. In that case, the join would not be further optimized, and it would be left as the user
specified. This setting is found in the “Join Conditions” tab of the join operation.
N-joins
N-joins are joins that are performed over more than two views. When we create an n-join, we
drag all the views that take part in the join and then connect them by dragging fields from
one view to another to specify a join condition between those two views. As the conditions
are defined exclusively between two views, we need several join conditions to connect all the
views involved in an n-join (at least N-1 conditions if we have N views). This results in a join between several views, but one with an internal structure: The join will always be interpreted by the Denodo Platform as a sequence of two-way joins.
Having multiple consecutive 2-way joins offers the optimizer an opportunity for reordering
them to increase performance. For example, if we have a 5-way join, in which four of the
views come from the same source database, the optimizer will probably reorder the join so
the first three pairs of views that are joined are the ones that come from the database, and
the last view joined will be the other view that comes from a different source; this would
enable the optimizer to delegate the join between the first four views to the source system,
increasing performance as we have seen before.
Join Types
INNER JOINS
Inner joins represent the most basic type of join, in which only the rows that match the join condition are returned from both sides of the join. This is the type of join that the optimizer selects when possible, as it is the most performant and it returns the fewest rows. The optimal join type is of course determined by the specific use case more than by performance considerations, as the semantics of an inner join and an outer join are different (their result sets are different). Under some circumstances, however, the two can become equivalent; in these cases the static optimizer will try to convert outer joins into inner joins.
OUTER JOINS
Outer joins return, in addition to the rows that match the join condition:
• In the case of a left outer join, all the non-matching rows from the left side of the join.
• In the case of a right outer join, all the non-matching rows from the right side of the
join.
• In the case of a full outer join, all the non-matching rows from both the left and right
sides of the join.
Outer joins are worse than inner joins from the performance standpoint, both because they
return more rows than inner joins and because of the mechanics of calculating the result sets.
The optimizer of the Denodo Platform server will always try to transform outer joins into inner
joins using static optimization techniques that will be discussed later.
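The equivalence that the optimizer exploits can be demonstrated with a toy model (illustrative data, not the Denodo implementation): a left outer join followed by a condition that discards the NULL-padded rows returns exactly the same rows as the cheaper inner join.

```python
def inner_join(left, right):
    """Return only the row pairs that match on the join key."""
    return [(l, r) for l in left for r in right if l[0] == r[0]]

def left_outer_join(left, right):
    """Return matching pairs plus non-matching left rows padded with None."""
    out = []
    for l in left:
        matches = [r for r in right if r[0] == l[0]]
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))   # non-matching left row padded with a NULL
    return out

L = [(1, "a"), (2, "b")]
R = [(1, "x")]

# A later condition that requires a right-side value discards the
# NULL-padded rows, making the outer join equivalent to the inner join.
outer_then_filter = [(l, r) for (l, r) in left_outer_join(L, R) if r is not None]
```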
CROSS JOINS
Cross joins, or “cartesian products of two views,” generate all the possible combinations of
rows from the rows coming from each branch of the join. If we have N rows on the left side
and M rows on the right side, the join will return NxM rows—it’s very important to always use
cross joins judiciously because that NxM resulting data set can grow very quickly even when
the source views have small data sets. For example, if we do a cross join between two
10,000-row views the result set will have 100,000,000 rows. We have moved from a situation
with two small views into a situation with a respectable data set; depending on the intended
usage of this hundred million rows, the performance of the query could be extremely poor.
Always use cross joins as the last resort!
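The row-count explosion is easy to demonstrate with small in-memory lists; two inputs of only 1,000 rows each already produce a million-row result:

```python
from itertools import product

left = [("l", i) for i in range(1000)]
right = [("r", j) for j in range(1000)]

# The cartesian product pairs every left row with every right row,
# so even two modest inputs explode to len(left) * len(right) rows.
cross = [a + b for a, b in product(left, right)]
```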
The Denodo Platform optimizer will also try to avoid cross joins in queries. For
example, reordering join operations for maximizing delegation to the data sources
may produce cross joins between views; in these cases the optimizer will avoid
reordering as this would probably perform worse than the original query.
Memory Usage
When faced with the question of “how much memory will this query need?” most novice
users try to answer by looking at the size of the source data sets. Intuitively a query that
processes 1 billion rows should need more memory than a query that processes 1 thousand
rows. Unfortunately this simple approach is wrong; the correct way to estimate how much
memory a query will need is to look at the query’s complexity.
From the standpoint of CPU load, the most important consideration is the computational cost
of the functions that are applied to each individual cell.
From the standpoint of memory consumption, the most important parameter is the type of
operations that we execute over the data, splitting them into operations that stream tuples vs
operations that hoard tuples in memory. Let’s take a look at each of those.
• First, operations can be applied over each row of the source data set individually, so
the result of the operation over a single row only needs that row to be calculated. For
example, if we are projecting a subset of the columns of a dataset, we can grab the
desired columns from each row individually. Another example is a selection based on
a condition over the columns of the dataset—when we grab a single row, we can
check whether that row verifies the condition or not; we don’t need other rows to
know that. This type of operation is called a streaming operation. These operations
ship rows individually to the next stage of execution.
• On the other hand, some operations cannot calculate their results based on
individual rows, so instead they need a collection of rows before they can return the
results. One example of this type of operation is sorting a data set, because if we
grab a single row we cannot determine where that row should be returned; the
process needs to grab all the rows in the source data set and then sort them in
memory. Another example is an aggregation of rows, such as when summing a
column of the source data set. We cannot return the result of the operation until we
have added all the values of all the rows in the source. These operations are called
blocking operations, and they may hoard rows in memory until they have enough to
produce one or more result rows. These operations can be further split in two
subtypes:
• The first subtype is exemplified by the sort: It is an operation that needs to
actually hold the source data in memory before it can return the results of the
query, since the results of the query are actually all or part of the rows of the
original data set. These operations are called linear memory operations.
• The second subtype is exemplified by the aggregation: Every source row is immediately aggregated into the partial results and can then be discarded, so the operation only needs memory for those partial results, regardless of the size of the source data set. These operations are called constant memory operations.
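The three memory profiles can be sketched side by side in Python (a toy model; rows are plain values for brevity):

```python
def stream_filter(rows, predicate):
    """Streaming: each row is examined and shipped on its own;
    memory use does not grow with the input."""
    for row in rows:
        if predicate(row):
            yield row

def blocking_sort(rows):
    """Blocking, linear memory: every row must be held in memory
    before the first result row can be returned."""
    return sorted(rows)

def constant_memory_sum(rows):
    """Blocking, constant memory: each row is folded into one
    running partial result and then discarded."""
    total = 0
    for row in rows:
        total += row
    return total

data = [5, 1, 4, 2, 3]
filtered = list(stream_filter(data, lambda r: r > 2))
```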
From a memory perspective, the obvious best choices are to always use streaming or
constant memory operations. If your query only uses those types of operations, then the
memory usage will be more or less constant independent of the size of the data set being
processed. This notion can be counterintuitive, but it should be one of the first criteria in
query design.
Obviously, sometimes we cannot avoid adding blocking operations, but always try to reduce
these operations as much as possible. A single blocking operation will have a huge impact on
row latency, as no result row can be returned before the whole data set is in memory,
although the total execution time may not be affected (vs. a streaming operation). If you are
chaining several operations, always evaluate whether it is to your advantage to execute the
blocking operations earlier or later in the chain. As a rule of thumb, earlier is better most of
the time, but every situation should be assessed individually. Of course, all of this discussion
assumes that it is possible to reorder operations without changing query semantics. If that is
not possible, there are no avenues for query optimization, and you should choose the
operation order that gives the required results.
Now that we’ve covered streaming vs blocking operations, we recommend that you review
the discussion above about the different join methods that the Denodo Platform offers,
paying attention to the behavior of each branch of each join, and identifying which areas are
streaming, constant memory, and linear memory, so you are familiar with the runtime behavior
of your queries.
Caching and Memory Usage
However, the stream of rows that are inserted into cache will store more and more rows in
memory, roughly at a linear pace, so by using the operation to populate a cache table, the
operation will be turned into a linear memory operation (independent of the operation’s
original behavior).
There is a way of avoiding this behavior: Tell the Denodo Platform to cache just the results of the query, without returning any rows as results of the operation, when the query is issued. This is done through a clause added to the query context. With this setting, the result rows are not returned as results of the query; instead, the Denodo Platform only instantiates the pipeline that feeds the cache system, which in turn acts as the moderator of the speed at which the operation is performed.
Java Heap Memory
The Denodo Platform runs as a Java process regardless of the operating system where it is
executed. This means that the memory that the server will have available at runtime will be
the amount that the operating system assigns to the Java process at startup. Inspecting the
memory consumed by the Java process that executes the Denodo Platform may provide an
incorrect view of the actual memory being consumed by the data virtualization operations.
The Java Virtual Machine (JVM) is assigned an initial heap size at startup, and the JVM can choose to grow or shrink the amount of memory it holds at any point. At any given moment the heap will be only partially used by the Denodo Platform running on that JVM—the virtual machine always keeps some free memory in reserve, so that if the Denodo Platform needs more memory, it can be provided right away without having to request more from the operating system.
The bottom line is that the memory consumption numbers obtained by examining the
Denodo Platform process at the operating system level will not provide accurate information
about the memory being consumed by the server at a given point in time, as those numbers
refer to both the memory consumed by the server and the free memory that the JVM is
holding.
Always use the Diagnostics and Monitoring tool or the logging facilities included in
the Denodo Platform to check the memory consumption of the server.
Swapping
Swapping refers to the practice of using hard drive space to temporarily store data that does
not fit in the main memory. Hard drive space is then used as an extension of the computer’s
RAM, albeit a much slower one. This technique is a tradeoff that allows a calculation to
complete successfully even when there is not enough physical memory present to complete
the calculation, although at a reduced performance level, as access to the hard drive can be
several orders of magnitude slower than access to main memory. This loss of performance is
an unfortunate side effect, but in most cases it is better than the alternative: the operation
failing due to lack of memory.
Swapping serves two main purposes in the Denodo Platform:
• It allows the system to execute queries that require more memory than what is
assigned to the Denodo Platform server process.
• It protects the Denodo Platform server from running out of memory when heavy
queries are received, keeping the service available for other queries.
The way to tell if a query is using swapping is to look at the query’s execution trace. Using the
query monitor, look at nodes in the execution plan that seem to be taking a long time, and
click on them to show the details; if the node is using the swapping subsystem, it will show
both the “Swapping YES” status and “Number of swapped tuples X”, where X stands for the
number of rows that overflowed the allotted memory size and were therefore sent to the hard
drive.
When you are designing queries, one of your goals should be to minimize the amount of swapping. Of course, sometimes swapping is unavoidable, given the size of the data sets being processed and the types of operations that are performed. Regardless, you should be aware of the memory profile of the operations that your queries execute: whether they are streaming, blocking with a constant memory profile, or blocking with a linear memory profile. If you find that some of your queries are being swapped, that could mean they include linear memory operations, and the most common cause is that the wrong join method is being applied.
For example, remember how hash joins work. They retrieve the full data set from the right
branch of the join, then they build a hash table with the values of that branch, and finally they
stream the rows from the left side, comparing each one with the contents of the hash table.
The optimal scenario is having a very small table on the right side and an arbitrarily large
table on the left side. If we reverse the situation, and place a huge table on the right branch
of the hash join, the hash table that is to be built will be very big, which means it will consume
a lot of memory, and this will probably force the query to use swapping. This situation would
be easily solved by changing the order of the join so the big table is on the left branch and
the small table is on the right branch of the join.
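The memory asymmetry of the hash join is visible in a toy model: only the right branch is hoarded into the hash table, while the left branch is streamed row by row, so the big table belongs on the left.

```python
def hash_join(left, right, key=lambda row: row[0]):
    """Build a hash table from the (ideally small) right branch,
    then stream the left branch against it: memory grows with the
    right side only."""
    table = {}
    for r in right:                      # blocking phase: hoard the right side
        table.setdefault(key(r), []).append(r)
    joined = []
    for l in left:                       # streaming phase: probe row by row
        for r in table.get(key(l), []):
            joined.append(l + r)
    return joined

# A large left branch probed against a tiny right branch: the hash
# table only ever holds two entries.
big_left = [(i, f"row{i}") for i in range(10_000)]
small_right = [(2, "x"), (9_999, "y")]
out = hash_join(big_left, small_right)
```

Swapping the arguments would build a 10,000-entry hash table instead, which is exactly the situation the text warns about.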
Always check the join methods that are executed in swapped queries and verify
that they are suitable for the data sets being processed.
Analyzing Memory Usage
The main tool for analyzing the memory usage of the Denodo Platform server is the
Diagnostic and Monitoring tool, included in the core distribution of the Denodo Platform. This
tool is a web application that needs to be started in the same way as the data virtualization
server. To do so, open your Denodo Platform Control Center and in the Virtual Data Port tab,
click the start button for the diagnostics tool; afterwards, you can click on the Launch button
to open a browser pointing to the running tool.
Check the official documentation of the Diagnostic and Monitoring tool for
complete setup and initial configuration instructions.
Always take your server settings regarding query memory limits and swapping into account when you try to visualize the memory consumption of your server (these are in the Administration > Server configuration > Memory menu of the administration tool). It may be more difficult to discern patterns in the graphs if the server is set to limit the memory consumption of queries and to enable swapping.
Memory consumption trends will only be recognizable in the diagnostics tool when
processing medium to high volumes of data. The graphs will not register enough
change for queries that are executed very quickly and which do not process
enough rows to make a difference in the overall memory state of the server.
Server Memory Settings
The Denodo Platform optimizer offers some controls over how the memory is managed when
executing queries:
• Setting the maximum amount of memory that any query will consume.
• Setting the maximum amount of memory that unprocessed, intermediate results can
use.
These settings can be found under the “Administration > Server configuration > Memory
usage” menu in the VDP Administration Tool.
Static Optimization
Static optimization is the first step that the optimizer takes when a new query is received. In
this step, the optimizer rewrites the original query into a new form that will be executed with
higher performance. The transformations that are applied are at the SQL level; they do not
define how to execute each operation (that will be the role of the dynamic optimizer), but
what operations to execute and in which order.
The static optimization process takes into account the original query (after being parsed into
a format that is amenable to analysis) and the metadata available about both the views that
are affected and the data sources that will be used in the query. The details about the
capabilities of each data source are of special importance during this process, as they will
constrain the number and type of operations that in the end will be delegated to the data
sources.
The static optimization process modifies the SQL of the query that was received,
but of course the resulting query will maintain the semantics of the original query:
The results of the optimized query will always be correct!
Enabling Static Optimization
The Denodo Platform server should have the static optimization enabled by default, but
should you want to check if your server has it enabled, or if you would like to disable it for
testing, open the Administration > Server configuration > Queries optimization menu in the
Virtual DataPort Administration tool.
View Definition vs. Query Conditions
A virtual data model is composed of a collection of views. These are split into base views, which represent physical tables or entities that exist in remote systems, and derived views, which are virtual views created as a combination of other views in the data virtualization layer. Every derived view that is created in the data virtualization layer is created from a list of relational operations and clauses (joins, aggregations, projections, WHERE clauses, ordering clauses, etc.). These operations are codified in the definition of the view and are immutable.
The Denodo Platform receives queries from client applications. These queries are dynamic, in
the sense that they don’t need to be defined within the data virtualization layer: The client
application can assemble any SQL on the fly and send it to the data virtualization server to be
executed. These queries can also define operations and values in the same way as the view
definitions.
When a query is received and executed, the final query plan that is calculated for the query
execution is generated by combining the parts defined in the dynamic query sent by the
client application with the parts defined in the views that the query uses. Two different
queries over the same virtual view can lead to very different query execution plans due to the
conditions and operations specified in the runtime part of the query.
Some of the opportunities for optimization stem from collisions between the conditions defined in the views and the conditions that appear in the query sent by the client application; any conflict between those two (for example, the conditions year = 2020 and year = 2021 are in conflict) will result in branches of the query plan that are guaranteed to return no rows. The optimizer can safely remove those branches from the query plan.
Always take into consideration the runtime behavior of the views under querying
conditions, as opposed to the design-time behavior.
The static optimizer will apply a variety of optimizations depending on the specific situation
that it finds; it applies an extensive catalog of optimizations that enables the optimizer to work
under many different scenarios. Some examples of these optimizations are showcased
below; this sampling of techniques should provide a general sense of how the static
optimizer works and convince the reader of the strong performance optimization features
present in the Denodo Platform, but it is by no means exhaustive. The Denodo Platform has
many more tricks up its sleeve.
BRANCH PRUNING
Query plans are a tree of nodes,
each executing an operation in the
overall plan that answers the
query sent by the client
application. These nodes
sometimes return zero rows, as
mentioned earlier, and in these
cases the optimizer can remove
these branches (or subtrees) of
the execution plan; this is known
as branch pruning, and it can
massively simplify the query A typical partitioned data scenario.
execution plan.
Queries received over the union view will by default hit all the partitions in the union, and
subqueries will be delegated to each underlying data source accordingly. But in certain
scenarios, some queries will only need data from specific partitions. These queries specify
conditions that match one or more partitions but conflict with the definition of the other
partitions. For such queries, the optimizer would simplify the query execution plan and
remove the partitions that are known to be useless for that specific query so that only the
needed data sources would be queried.
Another common scenario for branch pruning is join operations over several tables. Consider the example of a star schema, in which a sales fact table is joined with several dimension tables and then queried, retrieving fields from only some of the base tables. In many of these cases, the unused branches of the join are pruned and not executed.
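The partition-pruning variant of this technique can be modeled in a few lines. Partitions are represented here by hypothetical year-equality constraints; a real system would compare arbitrary condition ranges:

```python
partitions = {
    "sales_2020": lambda query_year: query_year == 2020,   # partition constraint
    "sales_2021": lambda query_year: query_year == 2021,
}

def prune(partition_defs, query_year):
    """Keep only the partitions whose definition can overlap the query
    condition; the conflicting branches are removed from the plan."""
    return [name for name, accepts in partition_defs.items()
            if accepts(query_year)]

# A query with the condition year = 2021 only needs one partition,
# so only one data source will actually be queried.
touched = prune(partitions, 2021)
```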
JOIN REORDERING
The optimizer can also apply optimizations to n-joins. As explained above, the Denodo
Platform executes n-joins in pairs, by splitting them into a series of 2-way joins. This means
that there is an order to the sequence of joins when executing the n-join, so the optimizer has
some degree of freedom to seek performance improvements. The optimizer’s main goal
when reordering joins is to delegate part of the n-join (at least one of the 2-way joins) to the
underlying data sources.
POST-PRUNING DELEGATION
Most data virtualization scenarios include data combinations across different physical systems, which prevents the delegation of operations if naive query plans are used. Some of these scenarios include pruning the branches of the query plan, which allows for further query optimization: A typical scenario results in the pruning of branches that are unused by the query, after which the remaining branches may all belong to the same data source, allowing operations that previously could not be delegated to be pushed down to that source.
This example shows that, many times during the optimization process, the application of one technique opens the door for further optimization tactics, chaining them in a process that takes a query from unacceptable performance to performance levels equivalent to those of physical systems.
AGGREGATION PUSHDOWN
Aggregation pushdown can be applied in many scenarios but it is very typical of logical data
warehouse situations. A large fact table in a physical system is joined with one or more
dimensions stored in separate physical systems, and aggregated over the primary key of one
or more dimensions. This represents a classical query in reporting scenarios and it is
probably the most common type of query that data warehouses will receive.
For a more detailed explanation of logical data warehouse scenarios, visit this entry in the Denodo Community Knowledge Base.
The optimizer rewrites this type of query in two steps:
1. Pre-aggregate the results using the primary/foreign keys in the fact table. This
operation is fully delegated to the data source.
2. Join the results of the first step with the dimensions and then perform the
aggregation over the attributes of the dimension.
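The equivalence of the two plans can be checked on a toy data set (names and values are illustrative; the "delegated" step is simulated locally):

```python
from collections import defaultdict

sales = [("p1", 10), ("p1", 5), ("p2", 7), ("p3", 1)]   # fact rows: (product_id, amount)
products = {"p1": "toys", "p2": "toys", "p3": "food"}    # dimension: product_id -> category

def naive_plan():
    """Join every fact row with its dimension, then aggregate."""
    totals = defaultdict(int)
    for pid, amount in sales:
        totals[products[pid]] += amount
    return dict(totals)

def pushdown_plan():
    """Step 1 (delegated): pre-aggregate in the fact source by the
    join key, shrinking the rows shipped to the virtual layer."""
    partial = defaultdict(int)
    for pid, amount in sales:
        partial[pid] += amount
    # Step 2: join the small partial result with the dimension and
    # finish the aggregation over the dimension attribute.
    totals = defaultdict(int)
    for pid, subtotal in partial.items():
        totals[products[pid]] += subtotal
    return dict(totals)
```

Both plans return the same totals, but the pushed-down plan ships one row per product instead of one row per sale across the network.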
JOIN PUSHDOWN
Another useful optimization technique is pushing down joins under partitioned unions so they
can be delegated to the data sources (in line with one of our main optimization themes). This
technique is applied in situations like this: A union of two views in two different data sources
(for example, a partitioned fact table) is joined with another view (for example, a dimension)
and the dimension is physically stored twice, once in each data source.
The optimizer can change the order of the join and the union and choose to execute the join
first (as the dimension table is present in both data sources). After this reordering, the joins
can be delegated to the data source.
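The reordering can be verified on a toy model: joining after the union yields exactly the same rows as unioning the per-source joins, which is what makes the push-down legal (data and names are illustrative).

```python
part1 = [(1, 100), (2, 200)]        # fact rows (product_id, amount) in data source 1
part2 = [(3, 300)]                  # same fact table, partition in data source 2
dim = [(1, "pen"), (2, "book"), (3, "lamp")]   # dimension replicated in both sources

def join(facts, dimension):
    return [(pid, amount, name)
            for pid, amount in facts
            for d_pid, name in dimension if pid == d_pid]

# Naive plan: union first, then join in the virtual layer.
union_then_join = join(part1 + part2, dim)

# Pushed-down plan: each source executes its own join, and only the
# union of the (already joined) results happens in the virtual layer.
join_then_union = join(part1, dim) + join(part2, dim)
```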
Figure: Base scenario for join pushdown; the products base view references two identical copies of the data in data sources 1 and 2.
DATA MOVEMENT WITH PARTITIONING
In such cases, the optimizer can apply automatic data movement to satisfy the requirements
of the join pushdown technique:
1. First, the dimension data is copied into all data sources involved in the union, before
the query execution.
2. Once the dimension is copied, the requirements for the join pushdown are satisfied,
and the optimizer applies that technique (pushing the join and possible aggregations
under the union, then delegating those operations to all the data sources).
The specifics of data movement will be explained later in this book, in the “Data movement”
section.
Dynamic Optimization
View Statistics
The optimizer is an extremely advanced system that can take a wide range of actions to
optimize the performance of queries received by the Denodo Platform. As mentioned earlier,
the optimization pipeline has two broad optimization phases: static and dynamic. Static
optimization works on the SQL of the received query on its own, while dynamic optimization
takes into consideration the context in which the query is executed.
The goal of the dynamic optimizer is to generate multiple execution plans for the given query
and then select the optimal one. In this process, the optimizer decides on the best way to
execute each operation in the plan. For example, for each join operation, it assesses data
about each one of the branches of the join and selects the best join method, type, and
ordering for that specific case.
This set of decisions is based on the knowledge the optimizer possesses about the data to
be queried. This knowledge is not available initially; the operators of the Denodo Platform
must feed it to the optimizer by gathering statistics on the views involved in the virtual data
model.
Gathering statistics is a crucial process that is absolutely required for the correct
functioning of the optimizer, and it should be one of the first steps taken by the
administrators of the data virtualization platform.
Statistics about the data gathered by the Denodo Platform are stored for each view, and include values such as:
• The total number of rows of the view.
• The minimum and maximum values of each field.
• The fields that are covered by an index in the data source.
The last item is especially important. Data sources often allow the definition of indexes over the data. These indexes usually cover a subset of the columns of the whole table, and queries that use them are much faster than queries that do not. Therefore, defining indexes on the data is of extreme importance during the traditional process of designing the database layer of any application. As the Denodo Platform is querying these sources, it becomes extremely important to know which columns have the performance advantage of being included in an index; the optimized queries at the Denodo Platform level will take advantage of that fact and always favor delegating queries that hit the available indexes over queries that use other columns.
Check the Virtual Data Port Administration Guide for details regarding indexes in
the sources and in the Denodo Platform. Some data sources (such as some parallel
databases) require extra configuration steps for declaring their indexes correctly in
the data virtualization layer.
• For the manual gathering of statistics, use the “View Statistics Manager” in the Virtual
DataPort Administration Tool (in the Tools > Manage statistics menu). This tool will
show a selector of the database you want to gather the statistics in, and it will display
all the views available in the selected database. You can select multiple virtual views
and then click on the “Gather” action. After confirmation, the Denodo Platform will
start gathering the data from the sources, by querying each one in turn.
• Programmatic gathering of statistics is done through a VQL query that uses one of the two statistics-related stored procedures available in the Denodo Platform. The difference between them is that one of them will use the source's system metadata when available, while the other will always issue a SELECT statement to the source systems to gather the data. These stored procedures can be called as part of a VQL query that can be scheduled to be triggered periodically, either through Denodo Scheduler or any other scheduling software that you may have available.
Gathering statistics is done by querying the data sources from the data virtualization layer
and asking them for summarized knowledge about the data; the Denodo Platform will always
try to use the most performance-aware method for issuing these queries, such as querying
the metadata tables directly when the source makes them available, as is the case with
relational databases. Even when the source supports this method, some of the statistics
(such as the minimum and maximum values) may have to be calculated using queries over
the data itself. This extra step is enabled by selecting the checkbox “Complete missing
statistics executing SELECT queries”; note that the minimum and maximum values are only
useful for range queries (for example, conditions like ), so if your queries do not use that
type of clause, you may not need to retrieve the additional statistics and the metadata-based
statistics will suffice.
Keep in mind that statistics gathering may take a long time (especially when non-relational,
slow sources that do not have accessible statistics-related metadata are involved in the data
combinations).
Statistics only need to be updated when there are changes in the contents of the data
sources that may be significant enough to sway the decision from one execution plan to
another. If some table contains one million rows, there is no need to gather statistics every
time a hundred rows are added, as that small number of new rows will have no impact on the
selection of the final query execution plan. This is compounded by the fact that updating the
statistics of a view, if the underlying data has not changed, will have no effect whatsoever, so
the frequency of gathering the statistics should be dependent on the frequency of changes in
the data sources.
In addition, the statistics of each view can be gathered independently of each other. Thus,
gathering statistics should not be seen as a monolithic process, but instead as a set of small
processes that must be triggered at the right time. The frequency of changes in the data for
each of the underlying tables in the data sources will drive the need for setting the periodicity
of gathering statistics for each independent view.
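As a toy illustration of this per-view reasoning, a refresh decision could be driven by the fraction of rows that changed since the last gathering. The function name and the 10% threshold are invented for the example; they are not a Denodo rule.

```python
def needs_refresh(rows_at_last_gather, rows_now, threshold=0.10):
    """Decide whether a view's statistics are stale enough to re-gather.
    The 10% change threshold is an illustrative policy, not a Denodo rule."""
    if rows_at_last_gather == 0:
        return rows_now > 0
    change = abs(rows_now - rows_at_last_gather) / rows_at_last_gather
    return change >= threshold

# A hundred new rows in a million-row table does not justify re-gathering,
# while 25% growth does.
```

A scheduler could run such a check per view, triggering the statistics-gathering procedure only for the views whose sources actually changed.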
Data Movement
Data Movement as an
Optimization Technique
Data movement is a performance optimization technique that is relevant when executing
queries that combine data from different sources. It is a technique that piggybacks on the
benefits of delegating queries to sources (as seen at the beginning of this book).
A data movement base case. The join+aggregation cannot be delegated.
The whole query can be delegated after the temporary data movement.
Imagine that the table in data source A contains 1,000 rows. It’s a small table with a small
number of columns. In data source B we have a table that contains 1,000,000,000 rows. We
then join these two tables together using the Denodo Platform, and let's assume the join
produces a relatively low number of results: only 2% of the billion rows match a row in the
thousand-row table.
This kind of situation is not uncommon for scenarios in which a database or a data
warehouse contains a set of dimension tables, and one or more of those are joined with a
very big fact table that is living on a big data store (for example, Hadoop).
Let’s do some quick calculations for the naive scenario. The Denodo Platform cannot
optimize the join so it will pull one thousand rows from the left branch, plus one billion rows
from the right side. The join is then done in memory (although if done properly with a hash
join it should not require a big amount of memory), and the resulting twenty million rows are
transferred to the client application. The total number of rows transferred in this scenario is
1,020,001,000.
If we use data movement, the numbers will be quite different: First we transfer the 1,000 rows
from data source A to the Denodo Platform and then to data source B, making it a transfer of
2,000 rows. (In some cases, it may be possible to transfer the rows from data source A to
data source B without doing a round trip through the data virtualization layer, but this is not
true for the general case). Then we delegate the join operation to data source B and we get
the resulting 20,000,000 rows and ship them to the client application. This makes a total of
40,002,000 rows transferred, so the data-shipped join transfers fewer than 4% of the rows
the naive join transfers, which represents a huge improvement in performance.
The gains are even more evident in a typical reporting scenario, where on top of the join
between the dimension and fact tables, we aggregate the join to calculate some metrics over
the whole history in our data systems. In this case, if the distributed join was preventing any
delegation, then by copying the dimension table to the second data source we can probably
delegate both the join and the aggregation, so the resulting data set that we pull from the
source will be much smaller.
Assuming that the aggregation reduces the number of rows in the result data set by 99%, the
numbers are 1,000,201,000 rows for the naive join and 402,000 rows for the data-shipped
join. The gains are staggering: the number of rows transferred by the data movement join is
around 0.04% of the rows transferred by the standard join.
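The row counts in both scenarios can be checked with a short script; all figures are the assumptions made in the example above.

```python
# Row counts assumed in the example scenario.
DIM_ROWS = 1_000            # dimension table in data source A
FACT_ROWS = 1_000_000_000   # fact table in data source B
MATCH_RATIO = 0.02          # 2% of fact rows match a dimension row
AGG_REDUCTION = 0.99        # aggregation keeps 1% of the join output

join_rows = int(FACT_ROWS * MATCH_RATIO)        # 20,000,000 result rows
# Naive plan: pull both branches into the DV layer, ship the join result.
naive = DIM_ROWS + FACT_ROWS + join_rows        # 1,020,001,000
# Data movement: ship the dimension A -> DV -> B, delegate the join,
# then pull the result and ship it to the client.
moved = DIM_ROWS * 2 + join_rows * 2            # 40,002,000

# With an aggregation on top of the join:
agg_rows = int(join_rows * (1 - AGG_REDUCTION)) # 200,000 result rows
naive_agg = DIM_ROWS + FACT_ROWS + agg_rows     # 1,000,201,000
moved_agg = DIM_ROWS * 2 + agg_rows * 2         # 402,000

print(naive, moved, naive_agg, moved_agg)
```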
Connection Pools
The Denodo Platform receives queries from data consumers; these queries are processed by
the query engine in the Denodo Platform server, and during the query plan generation they
are iteratively decomposed into a series of subqueries that are sent to each data source
involved in the original query. This means that a query received from a data consumer will
ultimately result in one or more queries sent to data sources; the Denodo Platform will then
open a connection to each of the data sources, send each subquery, retrieve the data and
close the connection. This process is usually optimized by using a connection pool.
A connection pool works by generating a set of connections to a given data source when the
data virtualization server is initialized. Those connections are then kept alive and reused for
all the queries that need to use the data source, so when a new query needs to access it
there is no need to create a new connection from scratch. If the pool has a connection
available (one that is unused by any other query) then it is assigned to the next query. Once
the query has finished, the connection is returned to the pool and made available for other
queries. If no connection is available, the pool will either create a new one for the query (so
the pool size will grow) or pause the query until a connection is available.
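The pooling behavior just described can be sketched as follows. This is a simplified toy model, not the Denodo Platform's actual implementation; the class name and sizes are illustrative.

```python
import queue
import threading

class ConnectionPool:
    """Toy pool: pre-creates `initial` connections, grows up to `maximum`,
    and makes callers wait once the maximum is reached."""
    def __init__(self, connect, initial=4, maximum=20):
        self._connect = connect
        self._max = maximum
        self._created = initial
        self._lock = threading.Lock()
        self._idle = queue.Queue()
        for _ in range(initial):            # created at server startup
            self._idle.put(connect())

    def acquire(self, timeout=None):
        try:
            return self._idle.get_nowait()  # reuse an idle connection
        except queue.Empty:
            pass
        with self._lock:
            if self._created < self._max:   # grow the pool on demand
                self._created += 1
                return self._connect()
        # Pool exhausted: the query waits until a connection is returned.
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        self._idle.put(conn)                # back to the pool for reuse
```

Note how an exhausted pool makes the caller wait; that waiting is precisely where an undersized pool turns into a bottleneck.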
This situation has two sides, when viewed from a performance-optimization perspective:
• On the positive side, reusing pooled connections removes the overhead of creating a
new connection to the data source for every subquery.
• On the negative side, the size of the pool places an upper limit on the number of
subqueries that can access a data source concurrently.
The second point is very important and is often overlooked by newcomers to data
virtualization. It is critical to review all the connection pools that are enabled in the Denodo
Platform server and make sure they are appropriately sized; otherwise, connection pools can
easily become performance bottlenecks in scenarios with a high volume of transactions
(even when the Denodo server is under light load, it will not process more queries if the
affected connection pools do not have available connections). The main tool for diagnosing
this is the set of server logs; use them to characterize your server load and predict the
average and maximum loads that each data source is likely to go through, and then perform
any corrective actions on the pool size configuration.
Remember that the connection pools in the Denodo Platform come in a default size
of 4 initial connections and 20 maximum connections. There are no “one-size-fits-all”
use cases, so make sure you size your connection pools according to your needs.
In summary, remember to check your connection pool sizes if your queries seem to be stuck
at the base view level in your performance testing. In these cases, the execution trace will
look like this:
If you find that situation in your logs, then increasing the size of the affected pool(s) should
solve the issue and make the queries run normally.
Finally, a word about multiple connection pools. The Denodo Platform will create an
independent connection pool for each data source that actively maintains connections to
other systems (for example all the data sources of type JDBC, ODBC, multidimensional DB,
etc.). This means that each pool is configurable separately, so the performance can be tuned
to exactly meet your needs, and queries waiting on the connection pool of a specific data
source will have no impact on the connection pools of other data sources.
Each data source in the Denodo Platform maintains its own connection pool to the source system.
One connection pool is usually more critical than the rest: the pool configured for the cache
system used by the Denodo Platform server. This pool is more critical because it will
potentially be involved in
answering many queries, as all queries in a database (and sometimes a whole server) use the
same cache data source. So, independently of what other data sources they need, if they hit
the cache they will all go through the same pipe. This makes it extremely important to size
the pool to the cache system appropriately, otherwise the cache may become a performance
bottleneck, which is the exact opposite of what the cache is for.
Pass-through Credentials
What is the problem, then, when connection pools are combined with pass-through
credentials? Remember how a connection pool works:
When the server starts, it creates a set of connections to the data source. This is done before
any query is received so these connections must be created with the same user credentials,
and these credentials will potentially be different from the credentials that the Denodo
Platform server wants to pass through to the data source when a query is received. So pre-
creating the connections does not work.
Another option would be to populate the pool with connections as the queries arrive. Under
this scheme, the pool would hold connections created with different credentials, depending
on which users sent the queries to the server; but a future query issued by user A may then
receive a connection that was created with the credentials of user B. This would obviously
not be acceptable, so this solution does not work either.
Multiple connection pools for a single data source when using pass-through credentials.
The solution is to keep a separate pool for each set of credentials. The implication of this
scheme from the performance angle is that the data source in question may end up with an
unbounded number of connection pools: each new user that issues a query related to that
data source will create a new connection pool to the source, and each pool
may have multiple connections to the data source. This could result in a data source
receiving more connection requests than it can process, so the performance could suffer. It is
therefore very important that you use a combination of pass-through credentials and a
connection pool only in the cases where you know that the number of Denodo Platform users
accessing the data source will be small, or when the data source has enough capacity (in
both maximum number of open connections received and processing power) to satisfy all the
requests from the data virtualization server.
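The pool-per-credentials behavior can be sketched as follows; the class and method names are invented for the example.

```python
class PassThroughPools:
    """One connection pool per distinct set of pass-through credentials.
    Illustrative only: names and structure are invented for the example."""
    def __init__(self, pool_factory):
        self._pool_factory = pool_factory   # creates a pool for given creds
        self._pools = {}                    # (user, password) -> pool

    def pool_for(self, user, password):
        key = (user, password)
        if key not in self._pools:
            # First query from this user: a brand-new pool is created, so
            # the number of pools grows with the number of distinct users.
            self._pools[key] = self._pool_factory(user, password)
        return self._pools[key]
```

With many distinct users, each holding its own set of open connections, the total connection count at the source grows quickly; this is the capacity concern described above.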
Base Views From SQL Query
Another optimization option that the Denodo Platform provides is the ability to create base
views from specific SQL queries. The most common way to create a new base view is to
graphically introspect a data source, select one of the tables or views that the data source
contains, and then click on “Create selected base views.” This, as you know, creates a base
view in the Denodo Platform which refers to a physical table (or a view) in the remote data
source.
Read this entry in the Denodo Community Knowledge Base for a more detailed
explanation of base views created from a SQL query.
There is another way to create a base view in data sources that accept SQL: We can create a
base view from a SQL query that we as users specify. This is done using the “Create from
query” button in the data source introspection screen:
With this approach, we can specify arbitrary SQL queries in the definition of the base view,
which offers two avenues for performance optimization:
Remember that the Denodo Platform uses vendor-specific SQL dialects by default,
so we always recommend checking what the standard behavior is in your use case
before going the base-view-from-query route.
Once a base view is created in this manner it becomes a SQL black box. If we execute a
simple query over that base view, the Denodo Platform will just issue the
underlying SQL query to the data source. But we are not just limited to these simple queries.
The base view is a first class citizen of our data virtualization layer, so we can use it in
combination with other views. If we end up combining this base view with other views that
live in the same data source, the Denodo Platform will still be able to delegate the data
combination to the source (as it would do with regular base views). In this situation, the base
view created from a SQL query is treated as a text blob, so in the resulting query that blob is
added as a named subquery in the clause and then used in the general query sent to
the data source, referencing it by name.
Note: delegating the definition of the base view as a subquery is the default
behavior starting in version 6.0 of the Denodo Platform. In previous versions, you
need to change the value of the field “Delegate SQL Sentence as Subquery” to
“Yes” by opening the base view in question, clicking “Advanced,” selecting the
“Search Methods” tab and clicking the “Wrapper Source Configuration” link.
For example, let’s suppose a base view myBaseView was created from the following query:
The naive way to execute the query would first execute the query defined by myBaseView,
retrieve all its rows into the data virtualization layer, and then execute the group by query.
Instead, the Denodo Platform optimizes the query by sending the following SQL to the data
source:
The takeaway is that this method of creating base views does not preclude delegation, so in
the Denodo Platform it is possible to leverage performance optimizations that use native
hints specific to individual data sources.
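As a hypothetical illustration of this wrapping (the table, column, and view names are invented), a small helper shows how the stored SQL text ends up as a named subquery inside the delegated statement:

```python
def delegate_as_subquery(view_name, view_sql, projection, group_by):
    """Embed the base view's stored SQL text as a named subquery."""
    return (f"SELECT {projection} "
            f"FROM ({view_sql}) AS {view_name} "
            f"GROUP BY {group_by}")

# Hypothetical base view created from a query:
my_base_view_sql = "SELECT region, amount FROM sales WHERE year >= 2020"
sql = delegate_as_subquery("myBaseView", my_base_view_sql,
                           "region, SUM(amount)", "region")
# The whole statement, black-box SQL included, runs in the data source.
```

The exact SQL the Denodo Platform generates depends on the source dialect; the point is only that the black-box query travels inside the outer statement instead of being executed separately.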
Note that creating base views from SQL queries should only be done when
necessary; if the use case can be solved by defining the views graphically using
the wizards in the VDP Administration Tool, that method should be preferred as it
will allow for greater maintainability of the views during the regular operation of
the system.
Denodo Platform Thread Model
The Denodo Platform can execute several operations concurrently, because the server is
implemented in a multithreaded environment. In this context, a thread represents a
lightweight version of an operating system process: it can execute operations on data
independently and, being lightweight, requires fewer resources to be created and destroyed
than a traditional process.
Threads require fewer resources than OS-level processes, but the Denodo Platform has to
maintain a balance between the number of operations that are executed concurrently and all
the resources that are available to be shared across all threads (such as CPU cores, RAM,
and network bandwidth). Finding this balance is key to achieving the right performance in
each possible scenario. This is the policy that the Denodo Platform implements for thread
creation:
• When a query plan needs to query a data source, the Denodo Platform creates a
thread for each specific access to that data source in the query.
• When data needs to be loaded in the cache, the Denodo Platform creates a new
thread for each independent load process.
• The Denodo Platform creates separate threads for the embedded Apache Tomcat
web server.
These threads are all assigned from a common thread pool that is created by the Denodo
Platform server at startup time. The size of the thread pool is configurable by the user by
tuning the values found in the Administration > Server configuration > Threads pool menu.
Administrators can set both the maximum number of concurrent requests in the server and
the maximum number of queries waiting to be executed.
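A toy model of those two settings, sketched with Python's standard thread pool; the class is illustrative and not how the Denodo server is implemented.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class BoundedExecutor:
    """Toy model of the server thread settings: a cap on concurrently
    running work plus a cap on work queued and waiting to execute."""
    def __init__(self, max_concurrent, max_queued):
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)
        self._slots = threading.BoundedSemaphore(max_concurrent + max_queued)

    def submit(self, fn, *args):
        # Reject the request when both running and waiting slots are full.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("server at capacity: request rejected")
        future = self._pool.submit(fn, *args)
        future.add_done_callback(lambda _: self._slots.release())
        return future
```

The design choice mirrored here is that an overloaded server fails fast instead of accepting unbounded work and degrading for everyone.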
VQL Considerations
Sorting
Sorting data is achieved in VQL/SQL using the clause. The section “Memory
usage” described different types of operations based on their memory footprint; in that
regard, sorting is always going to be a linear memory operation: it needs to process the full
data set before the results are returned. As such, you should avoid sorting in the data
virtualization layer unless the query actually requires ordered results.
From a pure performance standpoint, sorting is a relatively fast operation, with typical
average complexity of O(n log n), which is usually in the realm of acceptable performance
even for big data sets. This operation is usually pushed down to the sources, which may
include indexes that will assist in the sorting, making it even faster.
Finally, although sorting data incurs a performance penalty, the Denodo Platform optimizer
sometimes uses it for performance optimization. The classic example is to request the two branches of a
join to be sorted so the optimizer can use a merge join, which in most cases has the best
performance profile of all join methods. This again follows the theme of the optimizer
choosing suboptimal execution plans for the subnodes because that results in a better
execution plan for the combined query; the combination of sorting both branches and using a
merge join may be faster than retrieving both branches as they are and then using another
join strategy like a hash join.
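A merge join over two already-sorted inputs can be sketched as follows. This toy version assumes unique keys on the left side and streams its output without buffering either branch.

```python
def merge_join(left, right, key=lambda row: row[0]):
    """Merge join of two iterators already sorted by `key`.
    Toy version: assumes unique keys on the left side; streams results
    without buffering either branch in memory."""
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if key(l) < key(r):
            l = next(left, None)        # advance the branch that is behind
        elif key(l) > key(r):
            r = next(right, None)
        else:
            yield l + r                 # keys match: emit a joined row
            r = next(right, None)

rows = list(merge_join([(1, "a"), (2, "b")], [(1, "x"), (1, "y"), (3, "z")]))
# rows == [(1, "a", 1, "x"), (1, "a", 1, "y")]
```

Because each branch is consumed in order exactly once, the memory footprint stays constant, which is why the optimizer is willing to pay for the sorts.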
Aggregations
Aggregations are a basic operation in VQL (expressed using the clause) and one
of the most important operations in reporting scenarios, so it is always important to
understand the performance implications of executing aggregations.
• If the source data that is being aggregated is not sorted by the aggregation field(s),
then the will behave as a blocking operation. The specific subtype will
depend on the specific aggregation operation that is performed: Counting rows (for
example, with a query) will be a constant memory operation, whereas
nesting rows into an array will be a linear memory operation.
• If the source data is sorted by the aggregation field(s), then the operation will be
semi-blocking. This means that the query will start reading rows until the value of the
aggregation field changes; at this moment, there are more results pending, but those
pertain to different result rows (because the aggregation value is different, and the
data is sorted), so the results of the current group of rows can be calculated at that
point. In effect, the query will block for each one of the groups, but once all the rows
of a group are available the result row for that group can be calculated. This has a
much lower memory footprint than aggregating unsorted data; sometimes the memory
required is little more than what a single group of rows occupies.
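The semi-blocking behavior over sorted input can be sketched with a streaming aggregation; this is a toy example, not Denodo's implementation.

```python
from itertools import groupby

def streaming_sum(rows, key_idx=0, val_idx=1):
    """Aggregate rows that already arrive sorted by the grouping field.
    Only the current group is held in memory (semi-blocking behavior)."""
    for key, group in groupby(rows, key=lambda r: r[key_idx]):
        yield (key, sum(r[val_idx] for r in group))

rows = [("a", 1), ("a", 2), ("b", 5)]   # already sorted by the group field
result = list(streaming_sum(rows))
# result == [("a", 3), ("b", 5)]
```

If the input were not sorted by the grouping field, every group would have to stay in memory until the last row arrived, which is the blocking case described above.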
When we add an aggregation to a view tree, we are adding an extra operation that will add
extra CPU and memory requirements to our query. Counterintuitively, however, joining views
and aggregating the results may result in better performance than just joining the views:
• As we saw in the query rewriting section, the Denodo Platform will try to reorder the
operations to get maximum performance. This means that sometimes the
aggregations will be pushed under the joins (the data will be first aggregated, then
joined); this is very common when querying star schemas, where a fact table is joined
with one or more dimensions and then aggregated. This reordering in turn improves the
performance of the join itself in the general case, because once the source data is
aggregated the cardinality of the join is potentially much lower (there are fewer rows to
join with the other side).
• If the aggregation is delegated to the data source, the number of rows that will be
transferred through the network will be lower, achieving better performance.
Functions
The Denodo Platform will also help you understand when a function is not delegable through
the query trace; when it finds this kind of situation the function will be displayed in this way:
In summary, always be mindful of the capabilities of the underlying data sources that you are
querying and plan your views to use functions accordingly.
Delegable Custom Functions
By default, custom functions are not delegable, as they are arbitrary code that may or may
not have any mapping to the source capabilities; but sometimes a custom function may have
an equivalent in some data sources. The Denodo Platform enables users to specify a
delegation equivalent for custom functions against specific data sources (this is only
available for JDBC data sources). For example, we can notify the Denodo Platform that a
new custom function can be translated into an equivalent expression understood by a given
database.
The specific mechanism for creating these delegable functions is described in the Virtual
DataPort Developer Guide, in the section “Developing Custom Functions that Can Be
Delegated to a Database.” In a nutshell, for each data source that accepts this function, we
define a record with three fields:
• Database name.
• Database version.
• Pattern, which defines the syntax that will be used when delegating the function to
the source. This pattern is a string with placeholders for parameters specified with
the syntax , , , etc. For example, a pattern could be
“ ”.
With these three items the Denodo Platform will be able to delegate the function successfully
to that specific source.
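The placeholder substitution can be sketched as follows; the @1/@2 placeholder tokens and the registration record are shown for illustration only, so check the Developer Guide for the exact syntax.

```python
def apply_pattern(pattern, args):
    """Substitute @1, @2, ... placeholders with the SQL text of each
    argument (the placeholder syntax is assumed for illustration)."""
    for i, arg in enumerate(args, start=1):
        pattern = pattern.replace(f"@{i}", arg)
    return pattern

# Hypothetical registration record for one data source:
record = {"database_name": "oracle", "database_version": "12c",
          "pattern": "REGEXP_LIKE(@1, @2)"}
sql = apply_pattern(record["pattern"], ["name", "'^A.*'"])
# sql == "REGEXP_LIKE(name, '^A.*')"
```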
This mechanism can also be used to effectively “import” functions that are supported in a
specific source but not by the Denodo Platform. The procedure would be to create a new
custom function, marking it as a function with no implementation (using the annotation
) and configured to be delegated to the source through the
delegable custom function syntax seen above. Normally the Denodo Platform does not
accept a custom function that is not implemented, but this requirement is waived when the
delegation syntax is used. This enables the Denodo Platform to use the function in any query,
but with a caveat: if the query is such that the function is applied at a point that is not
delegated to the data source, the query will fail, because the custom function has no
implementation and thus must be delegated to the data source.
The clause of VQL is a very important tool when designing queries to be used in a
production environment where they are issued automatically from either a BI/reporting tool or
a custom-made application (web, mobile or desktop). This VQL clause is specified at the end
of the query string and enables the user to configure many of the options that have been
discussed throughout this book (and many others), in a programmatic way as opposed to
graphically. Some of the options that are available through the clause are:
The clause has a very rich set of options, so it is strongly recommended to check
the VQL reference found in the “Denodo Virtual DataPort Advanced VQL Guide” that lists all
the possible hints that can be specified in the clause and details the syntax and
semantics for each.
Other Considerations
This book has covered performance topics from different perspectives, with the intention of
pushing the readers to do their own testing in their Denodo Platform environments and find
optimal solutions for their data virtualization scenarios. Here are a few important
considerations when dealing with real world performance measurements.
The first refers to the methods used to reach your conclusions. The most common mistake
when evaluating query performance is to run a query a single time and draw conclusions
from that execution. This is very problematic because the execution time of a query is not a
fixed quantity that will be perfectly repeatable: it varies every time the query is run. If these
variations are not taken into account when drawing conclusions, the insights gained from the
analysis could be very flawed.
In general, you should take the same steps you would in any other type of statistical analysis.
At a minimum:
• Run each query a multitude of times and write down all the resulting execution times.
For extra points, write down not only the total execution time but also the execution
times of intermediate nodes that may be important (such as base views and key
intermediate calculations). An easy way to do this is actually saving the full execution
traces of each query as a text file and then gathering the data from the saved traces.
• An often-overlooked data point is the execution time and the latency of each of
the data sources involved in a query. Always pay attention to those values; users
are often surprised at the actual behavior of their data sources and how it relates
to the overall response time of their queries.
• Once the raw execution times are gathered, calculate average times as needed. But
don't stop there. For each measurement, calculate at least the mean, the median, and
the standard deviation. These are very important data points in understanding the
behavior of the query. For example:
• A big difference between the mean and the median could signal the presence of
outliers and/or a skewed distribution.
• The standard deviation of each value will indicate how much “spread” there is. A
low standard deviation means the values are tightly packed around the mean
value, whereas a large standard deviation shows the values are spread across a
bigger range.
• When you draw and report on your conclusions, always state the sample size that
was used (the number of runs for the query), in addition to the conditions under
which the test was executed.
Those are the most basic steps that should be done, at a minimum. As with any other
statistical analyses, it can be useful to analyze further to try to characterize your queries. Plot
the histogram of values, try to fit them to a distribution, and check for the presence of outliers,
as they can have a big impact in a real world scenario, even if they are not common events.
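The minimal analysis described above can be sketched with Python's statistics module; the run times are invented sample data.

```python
from statistics import mean, median, stdev

def summarize(times_ms):
    """Summarize repeated execution-time measurements of one query."""
    return {"n": len(times_ms), "mean": mean(times_ms),
            "median": median(times_ms), "stdev": stdev(times_ms)}

runs = [120, 118, 125, 119, 122, 480]   # six runs; one slow outlier
s = summarize(runs)
# A mean far above the median signals outliers or a skewed distribution.
skewed = s["mean"] > 1.2 * s["median"]
```

Reporting the sample size alongside these numbers (here, six runs) is what makes the conclusions reproducible.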
Second, realize that data virtualization scenarios, by their nature, have multiple moving parts
that will affect the performance of the running queries. Many factors may influence
performance, including:
• Data source behavior: the latency, average execution time, bandwidth, etc.
• Hard drive behavior, including maximum and average read and write speeds (both
sequential and random), seek time, traditional spinning media vs SSD performance
issues, etc.
• CPU and memory load in the server, including the other processes that are running
concurrently with the Denodo Platform. Are there any processes scheduled to run at
specific times in your production servers? Could they affect the performance of your
data virtualization solution?
The takeaway should not be surprising: Taking the proper time to fully understand the
behavior of your queries is critical in your path to success with data virtualization.
Summary
In this Cookbook, we have reviewed the basic components and techniques involved in query
optimization in data virtualization scenarios with the Denodo Platform. This basic introduction
to the topic covered both the features that the Denodo Platform offers for manual
intervention and the automatic capabilities that assist the user in obtaining the best
performance at all times. We recommend using this manual as the starting point for anybody
interested in designing and optimizing queries using the Denodo Platform, and we
encourage everybody to expand their knowledge of these important topics, starting with the
resources at the Denodo Community, including the Questions & Answers, the Knowledge
Base, and Virtual DataPort documentation such as the Virtual DataPort Administration Guide,
the Virtual DataPort Advanced VQL Guide and the Virtual DataPort Developer Guide.
Denodo EMEA
Portland House, 17th floor, Bressenden Place
London SW1E 5RS, United Kingdom
Phone (+44) (0) 20 7869 8053
Email: info.emea@denodo.com
Denodo DACH
3rd Floor, Maximilianstraße 13
Munich, 80539, Germany
Phone (+49) (0) 89 203 006 441
Email info.emea@denodo.com