Performance and Debugging
This is a legacy Apache Ignite documentation
The new documentation is hosted here: https://ignite.apache.org/docs/latest/
The following page covers these topics:
- Using EXPLAIN
- SQL Performance and Usability Considerations
- Result Set Lazy Loading
- Querying Collocated Data
- Increasing Index Inline Size
- Query Parallelism
- Index Hints
- Partition Pruning
- Skip Reducer on Update
- SQL On-heap Row Cache
- Using TIMESTAMP instead of DATE
To learn more on indexes tradeoffs refer CREATE INDEX documentation.
Using EXPLAIN Statement
Ignite supports "EXPLAIN ..." syntax for reading execution plans and query performance investigation purposes. Note that a plan cursor will contain multiple rows: the last one will contain a query for reducing node, others are for map nodes.
EXPLAIN SELECT name FROM Person WHERE age = 26;
The execution plan itself is generated by H2 as described here:
http://www.h2database.com/html/performance.html#explain_plan
SQL Performance and Usability Considerations
If the query contains an OR operator, then indexes may not be used as expected. For example, for the query select name from Person where sex='M' and (age = 20 or age = 30)
, index on field sex
will be used instead of index on field age
although the latter is a more selective index. As a workaround for this issue, you can rewrite the query with UNION ALL (notice that UNION without ALL will return DISTINCT rows, which will change the query semantics and introduce additional performance penalty). For Example:
select name from Person where sex='M' and age = 20
UNION ALL
select name from Person where sex='M' and age = 30
Avoid having too many columns in the result set of a SELECT query. Due to limitations of the H2 query parser, queries with 100+ columns may perform worse than expected.
Result Set Lazy Loading
By default, Ignite attempts to load the whole result set to memory and send it back to the query initiator that is usually a client application. This approach provides optimal performance for queries of small or medium result sets.
However, if result set is too big to fit in the available memory, then it can lead to prolonged GC pauses and even OutOfMemoryError
.
To minimize memory consumption, at the cost of a moderate performance hit, you can load and process the result sets lazily by passing the lazy
parameter to the JDBC and ODBC connection strings or use a similar method available for Java, .NET and C++ APIs:
SqlFieldsQuery query = new SqlFieldsQuery("SELECT * FROM Person WHERE id > 10");
// Result set will be loaded lazily.
query.setLazy(true);
jdbc:ignite:thin://192.168.0.15?lazy=true
Querying Collocated Data
Whenever Ignite executes a distributed query, it sends sub-queries to individual cluster members and groups the results on the reducer node. If you know in advance that the data you are querying is collocated by the GROUP BY condition, you can use SqlFieldsQuery.collocated = true
to reduce network traffic between the nodes and query execution time. When this flag is set to true
, the query is executed on individual nodes first and the results are sent to the reducer node for final calculation. Consider the following examples, in which we assume that data is collocated by department_id
.
Example 1.
SELECT SUM(salary) FROM Employee GROUP BY depatment_id
Because of the nature of the SUM operation, Ignite will sum up the salaries over the elements stored on individual nodes and, then, send these sums to the reducer node where the final result will be calculated. Enabling the collocated flag will only slightly improve performance.
Example 2.
SELECT AVG(salary) FROM Employee GROUP BY depatment_id
In this example, Ignite has to fetch all (salary, department_id)
pairs to the reducer node and calculate the results there. However, if employees are collocated by the department_id
field, i.e. employee data for the same department is stored on the same node, setting SqlFieldsQuery.collocated = true
will reduce query execution time, because Ignite will calculate the averages for each department on the individual nodes and send the results to the reducer node for final calculation.
Increasing Index Inline Size
Ignite partially includes indexed values in the index itself to optimize querying and data updates. Fixed-size data types (bool, byte, short, int, etc.) are included in full. For variable-size data (string, byte[]), only a fixed-length portion is included. The length of the included portion is called the inline size and by default equals the first 10 bytes of the value.
When values are not fully included in the index, comparing these values may require reading the corresponding data pages, which may have a negative impact on performance. When indexing variable length data, it is recommended that you estimate the length of your fields and set the inline size to the value that includes most or all values.
Use one of the following properties to set the inline size (in all cases the value is set in bytes):
QueryIndex.inlineSize
when configuring indexes viaorg.apache.ignite.cache.QueryEntity
objects. See Query Configuration using QueryEntity for more details.@QuerySqlField.inlineSize
when using annotations.INLINE_SIZE
property of the CREATE INDEX command.
There is constant overhead of 1 byte per constant-length column such as long, and of 2 bytes per
VARCHAR
column, which should be accounted for when specifying inline size. Also note that since Ignite encodes strings toUTF-8
, some characters use more than 1 byte.
Query Parallelism
By default, an SQL query is executed in a single thread on each participating Ignite node. This approach is optimal for queries returning small result sets involving index search. For example:
select * from Person where p.id = ?;
Certain queries might benefit from being executed in multiple threads. This relates to queries with table scans and aggregations, which is often the case for OLAP workloads. For example:
select SUM(salary) from Person;
You can control query parallelism through the CacheConfiguration.queryParallelism
property which defines the number of threads that will be used to execute a query on a single node. Use a preconfigured CacheConfiguration
template that adjusts a value of the parameter whenever CREATE TABLE generates the SQL schema and underlying caches.
If a query contains JOINs
, then all the participating caches must have the same degree of parallelism.
Use with care
Currently, this property affects all queries executed on the given cache. While providing speedup to heavy OLAP queries, this option may slowdown other simple queries. This behavior will be improved in further versions.
Index Hints
Index hints are useful in scenarios when it's known that one index is more suitable for certain queries than another. They are also needed to instruct the query optimizer to choose a more efficient execution plan. To do this optimization in Apache Ignite, you can use USE INDEX(indexA,...,indexN)
statement that tells Ignite to apply only one of the named indexes provided for query execution.
Below is an example that leverages from this capability:
SELECT * FROM Person USE INDEX(index_age)
WHERE salary > 150000 AND age < 35;
Partition Pruning
Partition pruning is a technique that optimizes queries that use affinity keys in the WHERE condition. When executing such a query, Ignite will scan only those partitions where the requested data is stored. This will reduce query time because the query will be sent only to the nodes that store the requested partitions. To learn more about partition distribution, see Partitioning and Replication.
In the following example, the employee objects are collocated by the id
field (if the affinity key is not specified, Ignite will use the primary key to collocate data).
CREATE TABLE employee (id BIGINT PRIMARY KEY, department_id INT, name VARCHAR)
/* This query is sent to the node where the requested key is stored */
SELECT * FROM employee WHERE id=10;
/* This query is sent to all nodes */
SELECT * FROM employee WHERE department_id=10;
In the next example, the affinity key is set explicitly and, therefore, will be used to collocate data.
CREATE TABLE employee (id BIGINT PRIMARY KEY, department_id INT, name VARCHAR) WITH "AFFINITY_KEY=department_id"
/* This query is sent to all nodes */
SELECT * FROM employee WHERE id=10;
/* This query is sent to the node where the requested key is stored */
SELECT * FROM employee WHERE department_id=10;
Skip Reducer on Update
When Ignite executes a DML operation, first, it fetches all the affected intermediate rows for analysis to the query initiator (also known as a reducer), and only then prepares batches of updated values that will be sent to remote nodes.
This approach might affect performance, and saturate network if a DML operation has to move many entries over it.
Use this flag as a hint for Ignite to do all intermediate rows analysis and updates "in-place" on corresponding remote data nodes. The hint is supported for both JDBC and ODBC:
jdbc:ignite:thin://192.168.0.15/skipReducerOnUpdate=true
SQL On-heap Row Cache
Ignite durable memory stores data and indexes outside of Java heap. This means that with every data access, a part of the data will be copied from an off-heap region to Java heap, potentially deserialized and kept in the heap as long as your application or server node references it.
The SQL on-heap row cache is intended to store hot rows (key-value objects) in the Java heap, minimizing resources spent for data copying and deserialization. Each cached row refers to an entry in the off-heap region and can be invalidated when one of the following happens:
- The master entry stored in the off-heap region is updated or removed.
- The data page that stores the master entry is evicted from RAM.
The on-heap row cache can be enabled for a specific cache (if CREATE TABLE is used for SQL tables and caches creation, then the parameter can be passed via a cache template):
<bean class="org.apache.ignite.configuration.CacheConfiguration">
<property name="name" value="person"/>
...
<property name="sqlOnheapCacheEnabled" value="true"/>
</bean>
If this row cache is enabled, you might get up to 2x performance increase for some SQL queries and use cases by allocating more RAM for rows caching purposes. Consider this as a tradeoff.
SQL On-Heap Row Cache Size
Presently, the cache is unlimited and can occupy as much RAM as allocated to off-heap data regions. Make sure to:
- Set the JVM max heap size equal to the total size of all the data regions that store Ignite caches for which this on-heap row cache is enabled.
- Tune JVM garbage collection accordingly.
Using TIMESTAMP instead of DATE
Use the TIMESTAMP type instead of DATE whenever possible. The DATE type is serialized/deserialized very inefficiently resulting in performance degradation.
Updated over 4 years ago