Elasticsearch Usage with Apinizer - Part:2

In the first part of the “Elasticsearch Usage” article series, I explained why we chose Elasticsearch (Stage 1) and which settings we applied after reviewing the literature on Elasticsearch (Stage 2). In this article, we run tests to answer the open questions about how index settings affect performance.

Elasticsearch Usage with Apinizer – Part:1 

Stage 1: Couchbase & InfluxDb & Elasticsearch, which one is the right decision for us?

Stage 2: What is the right Elasticsearch setting for us?

Elasticsearch Usage with Apinizer – Part:2 (this article)

Stage 3: Performance Tests

Elasticsearch Usage with Apinizer – Part:3

Stage 4: Shard and Disk Size

Stage 5: Crucial Checks


 

Stage 3: Performance Tests

We decided to test several cases to measure whether, and by how much, these settings affect performance, because the research and recommendations we found focused on modeling the document structure as cheaply as possible and on searching as few fields as possible. We narrowed the situations to be tested (the things we were curious about) down to the following:

Case 1: What effect do the _source and index mapping parameters have on disk space and query speed? What is the difference between an index with the default mapping and an index with these parameters?

Case 2: How do the results of Case 1 change with 10M documents, and how do the measured values differ from those of an index created with default settings?

Case 3: Is there any difference between an index with a custom template and an index created with default settings?

Case 4: Is Elasticsearch performance really good on loaded data?

📓 Note: The results of these cases should not be compared with one another, because each case measures different criteria and uses a different document structure.


 

Case 1:

Disk size and query speed are compared between indexes where the index mapping parameter is set to true and to false.

To test this, two indexes with different mappings, case1condition1 and case1condition2, were created. The mapping of these indexes is as follows:

A sample document for the case1condition1 and case1condition2 indexes;

# The text in the message1 and message2 fields is 500 characters
{
"message1":"But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again is there anyone who loves or.",
"message2":"But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again is there anyone who loves or.",
"age":20
}

The following queries were run on these indexes:

* Each query was run separately with the size parameter set to 0 and to 20.
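The queries themselves are not reproduced here. As a rough sketch only, a stats and a terms aggregation over the age field of the sample document could be built as plain request bodies like this. The aggregation names (age_stats, age_terms) and the helper functions are hypothetical, and the queries actually used in the test may have differed.

```python
import json


def stats_aggregation_body(field: str, size: int) -> dict:
    """Search body with a stats aggregation on one field.

    size is the number of hits returned next to the aggregation;
    size=0 returns the aggregation result only.
    """
    return {
        "size": size,
        "aggs": {f"{field}_stats": {"stats": {"field": field}}},
    }


def terms_aggregation_body(field: str, size: int) -> dict:
    """Search body with a terms aggregation on one field."""
    return {
        "size": size,
        "aggs": {f"{field}_terms": {"terms": {"field": field}}},
    }


if __name__ == "__main__":
    # One body per aggregation type and per size value (0 and 20).
    for size in (0, 20):
        print(json.dumps(stats_aggregation_body("age", size)))
        print(json.dumps(terms_aggregation_body("age", size)))
```

Each body can be sent to the _search endpoint of the index under test with any HTTP client.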

Query results and index sizes for 500 thousand documents:

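| Case 1 | Size | Execution | Condition1 | Condition2 |
|---|---|---|---|---|
| Index Size (MB) | | | 49 | 51 |
| Stats Aggregation | 20 | First | 33 ms | 39 ms |
| Stats Aggregation | 20 | Cached | 43 ms | 45 ms |
| Stats Aggregation | 0 | First | 36 ms | 32 ms |
| Stats Aggregation | 0 | Cached | 2 ms | 1 ms |
| Terms Aggregation | 20 | First | 84 ms | 86 ms |
| Terms Aggregation | 20 | Cached | 2 ms | 2 ms |
| Terms Aggregation | 0 | First | 44 ms | 29 ms |
| Terms Aggregation | 0 | Cached | 2 ms | 2 ms |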

In conclusion;

  • The _source property controls whether the original document is stored. With the excludes and includes parameters, it is possible to control which fields are kept in _source. Although this feature is used because it reduces disk space, note that when _source is disabled, no data is saved in the source part of the index: matching documents are still returned in the SearchHit, but without their data. It should therefore only be applied to fields that will never be fetched in queries.
  • The index property controls whether a field is indexed. Note that a field set to index:false cannot be searched. As long as it is kept in _source, it is still returned in the SearchHit object when a search is made over another field.
  • Considering the size of the indexes on disk, case1condition1 is smaller than case1condition2. The 2MB difference between these two indexes is due to the index parameter of the age field.
  • If no query will be made on a numeric field, index:false and a _source exclude can be considered to save space, because there was no difference in performance during aggregation.
  • With size:0, the aggregation over the age field was faster on case1condition2 than on case1condition1. In other words, the aggregation on the indexed field ran faster. However, since the queries in our application will run from the cache most of the time and the difference between the uncached queries is not large, whether this field is indexed can be ignored. (Although our resources are limited, we decided to keep the field indexed, since the disk size of these fields is not that big for us: in this example, 2MB for 500K documents.)
  • It was noted that the results are not cached when size:20. We concluded that this is because the “get + fetch” of the returned SearchHit objects may differ each time the query is answered, so the response time does not decrease when the query is called more than once. This matches the default behavior of the Elasticsearch shard request cache, which only caches requests with size:0. Therefore, size:0 should be preferred in aggregation queries. Bucket values matter more than SearchHit objects for aggregations anyway.
  • If a field will only be used in aggregation queries, it can be set to index:false, because the disk size was observed to shrink. However, we decided to keep the default value of index:true on our platform.
  • To run an aggregation query on a field with the text data type, fielddata:true must be set on that field.
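To make the two parameters discussed above concrete, here is a minimal sketch of a mapping that sets index on the age field and excludes fields from _source. It is not the mapping used in the test: the field types, the helper name and the choice of excluded field are assumptions based on the sample document, and the typeless mapping layout assumes Elasticsearch 7 or later.

```python
def case1_style_mapping(index_age: bool, excluded_fields: list) -> dict:
    """Build an index-creation body in the spirit of Case 1 (illustrative only).

    index_age        -- value of the "index" parameter on the age field;
                        False keeps the field aggregatable but not searchable.
    excluded_fields  -- field names left out of the stored _source.
    """
    return {
        "mappings": {
            "_source": {"excludes": list(excluded_fields)},
            "properties": {
                "message1": {"type": "text"},
                "message2": {"type": "text"},
                "age": {"type": "integer", "index": index_age},
            },
        }
    }


if __name__ == "__main__":
    # Hypothetical variant: age not indexed, message2 not kept in _source.
    print(case1_style_mapping(index_age=False, excluded_fields=["message2"]))
```

With this variant, a search on message2 still matches documents, but the returned hits do not contain the message2 value.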

 

Case 2:

In this section, the tests were repeated on indexes with 10M documents, both to measure the performance of the other most commonly used queries, term and match, and to answer the question: “How do the disk size and search performance of the case1condition1 and case1condition2 indexes compare with those of an index with default mapping as the number of documents grows?”

  • The mapping of case1condition1 from the section above was applied to the case2condition1 index.
  • The mapping of case1condition2 from the section above was applied to the case2condition2 index.
  • No mapping was applied to the case2condition3 index. It was indexed with the default mapping properties.

The following queries were run on the case2condition1, case2condition2, and case2condition3 indexes.

* Each query was run separately with the size parameter set to 0 and to 20.
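The term and match queries are not reproduced here either. As a hedged sketch, their request bodies could look like the following; the helper names and the example values are hypothetical, and only the term and match query shapes themselves come from the Elasticsearch query DSL.

```python
def term_query_body(field: str, value, size: int) -> dict:
    """Exact-value lookup; the value is not analyzed."""
    return {"size": size, "query": {"term": {field: {"value": value}}}}


def match_query_body(field: str, text: str, size: int) -> dict:
    """Full-text lookup; the text is analyzed before matching."""
    return {"size": size, "query": {"match": {field: {"query": text}}}}


if __name__ == "__main__":
    # Example values are assumptions taken from the sample document.
    for size in (0, 20):
        print(term_query_body("age", 20, size))
        print(match_query_body("message1", "pleasure", size))
```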

Query results and index sizes for 10 million documents:

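| Case 2 | Size | Execution | Condition1 | Condition2 | Condition3 |
|---|---|---|---|---|---|
| Index Size (MB) | | | 1017.33 | 1027.79 | 1254.84 |
| Stats Aggregation | 20 | First | 1223 ms | 619 ms | 1206 ms |
| Stats Aggregation | 20 | Cached | 1082 ms | 568 ms | 1175 ms |
| Stats Aggregation | 0 | First | 481 ms | 417 ms | 450 ms |
| Stats Aggregation | 0 | Cached | 2 ms | 3 ms | 2 ms |
| Terms Aggregation | 20 | First | 1116 ms | 907 ms | 1214 ms |
| Terms Aggregation | 20 | Cached | 1067 ms | 894 ms | 1170 ms |
| Terms Aggregation | 0 | First | 429 ms | 411 ms | 437 ms |
| Terms Aggregation | 0 | Cached | 2 ms | 3 ms | 2 ms |
| Term Query | 20 | First | * | 41 ms | 19 ms |
| Term Query | 20 | Cached | * | 12 ms | 15 ms |
| Term Query | 0 | First | * | 20 ms | 4 ms |
| Term Query | 0 | Cached | * | 2 ms | 2 ms |
| Match Query | 20 | First | 40 ms | 9 ms | 12 ms |
| Match Query | 20 | Cached | 5 ms | 3 ms | 10 ms |
| Match Query | 0 | First | 1 ms | 2 ms | 4 ms |
| Match Query | 0 | Cached | 1 ms | 1 ms | 2 ms |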

* The term query could not be run on the age field because the field was mapped with index:false.

In conclusion;

  • The index:false setting did not make a big difference in disk size compared with the other indexes.
  • In line with Case 1, index:false made no marked difference to the aggregations at 10M documents. The query on the indexed field ran faster, albeit slightly.
  • (Naturally) performance decreases as the size of the data set to be aggregated increases. Considering the numbers, we concluded that aggregating over a data set larger than 10M documents is not advisable, since the 1-second limit for aggregation is approached. This led us to update the filters in our application predictively and to set rules accordingly.
  • Note that Elasticsearch does not allow querying on fields with index:false.
  • Although the message2 field, a text field excluded from _source, can be searched in full text because it has index:true, its value was not returned in the SearchHit, since it is excluded. Because all fields must be returned in our queries, we decided not to exclude any field.

 

Case 3:

Comparison of an index with a custom template and an index created with default settings

Everything done so far consists of research and tests on which settings optimize disk size and make queries perform better. Based on this, we tried to measure what we gained or lost by comparing the case3condition1 index, which uses the mapping we created, with the case3condition2 index, which uses the default settings. Each of these indexes holds 1 million documents.

To test this case, a template named case3condition1template was created, containing the mapping and settings of case3condition1. The case3condition2 index needs no template, since Elasticsearch indexes it with the default mapping values and settings. The mappings of these indexes are:

Click here for mapping of the case3condition1 index

Click here for mapping of the case3condition2 Index

These indexes were tested with a more complex aggregation query and a full-text query that resemble real use. Unabbreviated field names were used in the queries on the case3condition2 index. These queries are:

Click here for Complex Query

Click here for Complex Full-Text Query

Query results and index sizes for 1 million documents:

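| Case 3 | Size | Execution | Condition1 | Condition2 |
|---|---|---|---|---|
| Index Size (GB) | | | 2 | 2.6 |
| Complex Query | 0 | First | 194 ms | 317 ms |
| Complex Query | 0 | Cached | 130 ms | 130 ms |
| Complex Full-Text Query | 20 | First | 180 ms | 260 ms |
| Complex Full-Text Query | 20 | Cached | 22 ms | 21 ms |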

In conclusion;

  • In the case3condition1 index, the text type was assigned to the fields used for full-text search, the keyword type to the fields used in aggregations and queries, and the numeric fields were given data types chosen according to the size of their values. This configuration produced a difference of 600MB on disk. The difference would be serious if the number of documents reached 100 million.
  • Although the queries ran faster on the case3condition1 index on their first run, the running times of the cached queries are close to each other on the two indexes.
  • As additional information, strings longer than the value set with the ignore_above parameter of the keyword data type are still kept in _source, but they are not indexed, so they cannot be searched.
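The type choices described above can be sketched as an index template body. This is an illustration under assumptions, not the case3condition1template itself: the field names, the shard and replica counts, the ignore_above value and the index pattern are all hypothetical, and the layout follows the composable index template format of recent Elasticsearch versions.

```python
def index_template_body(index_patterns: list, shards: int, replicas: int) -> dict:
    """Template body with explicit types instead of default dynamic mapping."""
    return {
        "index_patterns": list(index_patterns),
        "template": {
            "settings": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            },
            "mappings": {
                "properties": {
                    # Full-text search only: analyzed text, no keyword sub-field.
                    "description": {"type": "text"},
                    # Aggregations and exact queries: keyword; longer strings
                    # stay in _source but are not indexed.
                    "status": {"type": "keyword", "ignore_above": 256},
                    # Numeric types sized to the expected value range.
                    "status_code": {"type": "short"},
                    "response_time_ms": {"type": "integer"},
                }
            },
        },
    }


if __name__ == "__main__":
    print(index_template_body(["case3condition1*"], shards=1, replicas=1))
```

By contrast, default dynamic mapping would typically map each string as text with a keyword sub-field and each whole number as long, which is one plausible source of the extra disk usage.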

As a result of Case 1, 2, 3;

Our work seems to serve the purpose. 😸 The Elasticsearch cluster on our platform is operational as both a time-series database and a full-text search database.

  • With proper mapping, the disk size dropped drastically. All our fields are indexed and available in the source. In addition, the use of unnecessarily large types was cleaned up (for some fields we preferred keyword over text, and in places we used byte and integer for numeric values).
  • The numbers of indexes, shards, and replicas were adjusted.
  • Queries run across multiple indexes, while indexing goes only to the latest index. All indexes except the latest one are read-only. Thus, index lifecycle management became easier.
  • We think (and also hope 😺) that index lifecycle management has positive effects that we cannot measure (at least in terms of consistency).
  • We checked which in-app queries were cacheable and fixed the ones that were not. Our time-series queries, which used to be based on the exact current time, are now rounded to the minute so that they can be cached. Our aggregations are set to return no hits (size:0).
  • With the use of the cache, the results began to come noticeably faster.
  • Fields that were queried with match are now queried with term.
  • Query performance improved by fetching only the required fields with the includes keyword instead of returning the default query results.
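The minute rounding and the includes filter mentioned above can be sketched as follows. This is a minimal illustration, not our application code: the field name timestamp, the 15-minute window and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def round_down_to_minute(moment: datetime) -> datetime:
    """Drop seconds and microseconds so that calls within one minute agree."""
    return moment.replace(second=0, microsecond=0)


def cacheable_search_body(time_field: str, now: datetime, window_minutes: int) -> dict:
    """Time-range body whose bounds only change once per minute.

    Identical request bodies can be answered from the cache, and size=0
    means no hits are fetched.
    """
    end = round_down_to_minute(now)
    start = end - timedelta(minutes=window_minutes)
    return {
        "size": 0,
        "query": {
            "range": {
                time_field: {"gte": start.isoformat(), "lt": end.isoformat()}
            }
        },
    }


def hits_body_with_includes(query: dict, fields: list, size: int = 20) -> dict:
    """Return hits, but only the listed fields of each document."""
    return {"size": size, "_source": {"includes": list(fields)}, "query": query}


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(cacheable_search_body("timestamp", now, window_minutes=15))
    print(hits_body_with_includes({"match_all": {}}, ["age"]))
```

Two calls made a few seconds apart within the same minute produce the same body, which is what makes the second one cacheable.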

Case 4:

Is Elasticsearch performance really good on loaded data?

We tested the query speeds on the index with 550M documents.

To test this, an index named case4condition1 was created. The mapping and an example document of this index are as follows:

  • Click here for mapping of case4condition1 index
  • Click here for an example document of the index case4condition1

The following queries were run on the case4condition1 index;

* The field name and the value used in these queries differ according to the data type.
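The queries are not reproduced here. As a hedged sketch of the combined aggregation over the date1 and enum2 fields, a date histogram with a nested terms aggregation could be built like this. The aggregation names, the daily interval and the nesting order are assumptions, and calendar_interval assumes Elasticsearch 7.2 or later (older versions use interval).

```python
def date_histogram_terms_body(date_field: str, terms_field: str, interval: str = "1d") -> dict:
    """Bucket documents per time interval, then per distinct value.

    size=0 returns only the buckets, no hits.
    """
    return {
        "size": 0,
        "aggs": {
            "per_interval": {
                "date_histogram": {
                    "field": date_field,
                    "calendar_interval": interval,
                },
                "aggs": {
                    "per_value": {"terms": {"field": terms_field}},
                },
            }
        },
    }


if __name__ == "__main__":
    print(date_histogram_terms_body("date1", "enum2"))
```

Without a query clause narrowing the time range, such a request has to visit every document in the index.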

Results of the queries run on the case4condition1 index, which holds 550 million documents and is 96.3GB in size:

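| Query Type | Field Name | First Execution (ms) | Cached Execution (ms) |
|---|---|---|---|
| Stats Aggregation | number1 | 20384 | 2 |
| Terms Aggregation | number1 | 21858 | 2 |
| Date Histogram and Terms Aggregation | date1, enum2 | 28536 | |
| Match Query | metin1 | 143 | 5 |
| Match Query | metin2 | 210 | 6 |
| Match Query | enum1 | 67 | 4 |
| Match Query | enum2 | 64 | 3 |
| Match Query | number1 | 9798 | 30 |
| Match Query | number2 | 2341 | 52 |
| Match Query | number3 | 204 | 52 |
| Match Query | number4 | 96 | 12 |
| Term Query | metin1 | 122 | 5 |
| Term Query | metin2 | 164 | 11 |
| Term Query | enum1 | 6 | 4 |
| Term Query | enum2 | 3 | 2 |
| Term Query | number1 | 8568 | 28 |
| Term Query | number2 | 2271 | 54 |
| Term Query | number3 | 102 | 46 |
| Term Query | number4 | 59 | 5 |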

In conclusion;

  • A 3-word text is kept in the Text1 (metin1) field and a 15-word text in the Text2 (metin2) field. Comparing the match queries on these fields shows that the query slows down as the indexed text grows.
  • The Number1 field is in the range of 0 to 10, the Number2 field in the range of 0 to 100, the Number3 field in the range of 0 to 1000, and the Number4 field in the range of 0 to 10000. The response time of both the term query and the match query decreases as the number of distinct values (cardinality) in a numeric field increases. On such fields, term can be used instead of match.
  • When match and term queries were compared overall, the term query ran faster for every field type.
  • When an aggregation query without a meaningful filter was run, it took approximately 28 seconds, and when it was run more than once, a 504 (Gateway Time-out) error was received. Therefore, unfiltered aggregation on data sets this large does not seem practical.