Elasticsearch Usage with Apinizer - Part:2

In the first part of the “Elasticsearch Usage” series, I explained why we chose Elasticsearch (Stage 1) and which settings we applied after reviewing the literature (Stage 2). In this part, we run tests to answer the open questions about how the index settings affect performance.

Elasticsearch Usage with Apinizer – Part:1

Stage 1: Couchbase & InfluxDb & Elasticsearch, which one is the right decision for us?

Stage 2: What is the right Elasticsearch setting for us?

Elasticsearch Usage with Apinizer – Part:2 (this article)

Stage 3: Performance Tests

Elasticsearch Usage with Apinizer – Part:3

Stage 4: Shard and Disk Size

Stage 5: Crucial Checks

Stage 3: Performance Tests

We decided to test a few cases to see whether these settings affect performance and by how much, because the research and recommendations we found focus on modeling the document structure as cheaply as possible and on searching as few fields as possible. We narrowed the questions we were curious about down to the following cases:

Case 1: What effect do the _source and index mapping parameters have on disk space and query speed? How does an index with the default mapping differ from an index with these parameters?

Case 2: How do the Case 1 results change with 10M documents, and how do the measured values differ from an index created with default settings?

Case 3: Is there any difference between an index with a custom template and an index created with default settings?

Case 4: Is Elasticsearch performance really good on large volumes of data?

📓 Note: The results of these cases should not be compared with each other, because each case measures different criteria and uses a different document structure.

Case 1:

Disk size and query speed are compared between indexes whose index mapping parameter is true and false.

To test this, two indexes with different mappings, case1condition1 and case1condition2, were created. Their mappings are as follows:

Click here for mapping of case1condition1 Index.
Click here for mapping of case1condition2 Index.

A sample document for the case1condition1 and case1condition2 indexes:

# The text in the message1 and message2 fields is 500 characters
{
"message1":"But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again is there anyone who loves or.",
"message2":"But I must explain to you how all this mistaken idea of denouncing pleasure and praising pain was born and I will give you a complete account of the system, and expound the actual teachings of the great explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure rationally encounter consequences that are extremely painful. Nor again is there anyone who loves or.",
"age":20
}

The following queries were run on these indexes:

* Each query was run twice: once with the size parameter set to 0 and once with it set to 20.

Query results and index sizes for 500 thousand documents:

Case1	Size	Execution	Condition1	Condition2
Index Size (MB)			49	51
Stats Aggregation	Size:20	First Execution	33 ms	39 ms
Stats Aggregation	Size:20	Cached Execution	43 ms	45 ms
Stats Aggregation	Size:0	First Execution	36 ms	32 ms
Stats Aggregation	Size:0	Cached Execution	2 ms	1 ms
Terms Aggregation	Size:20	First Execution	84 ms	86 ms
Terms Aggregation	Size:20	Cached Execution	2 ms	2 ms
Terms Aggregation	Size:0	First Execution	44 ms	29 ms
Terms Aggregation	Size:0	Cached Execution	2 ms	2 ms

In conclusion:

The _source property controls whether the original document is stored. Its includes and excludes parameters control which fields are kept in _source. This feature is used because it reduces disk space, but note that when _source is disabled the document body is not saved in the index: a matching document still appears as a SearchHit, but its data is not returned. It should therefore only be applied to fields that will not be fetched in queries.
The index property controls whether a field is indexed. A field set to index:false cannot be searched, but as long as _source is enabled its value is still returned in the SearchHit when the document is found through a search on another field.
On disk, case1condition1 is smaller than case1condition2. The 2 MB difference between the two indexes comes from the index parameter on the age field.
If a numeric field will never be queried, index:false and a _source exclude are worth considering to save space, because they made no notable difference to aggregation performance.
With size:0, the first-run aggregation on the age field is faster on case1condition2 than on case1condition1. In other words, the aggregation on the indexed field ran faster. However, since most queries in our application are served from the cache and the difference between uncached queries is small, whether this field is indexed can be ignored. (Although our resources are limited, we decided to keep these fields indexed, since their disk cost is small for us: 2 MB per 500K documents in this example.)
We observed that results are not cached when size:20, so the response time does not drop when the query is called repeatedly. We concluded that this is because the “get + fetch” of the returned SearchHit objects may differ each time the query is answered. This is consistent with the shard request cache, which by default only caches requests with size:0. Aggregation queries should therefore use size:0; for an aggregation, the bucket values matter more than the SearchHits anyway.
If a field will only be used in aggregation queries, it can be set to index:false, since this reduces the disk size. On our platform, however, we decided to keep the default value of index:true.
To run an aggregation query on a field with the text data type, the field must be mapped with fielddata:true.
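
The mappings themselves are not reproduced in this article, so the following is only a minimal Python sketch of the settings discussed above. It assumes a mapping in which age is not indexed and message2 is excluded from _source. The field names come from the sample document; the integer type for age, the aggregation names, and the helper names are assumptions made for illustration. The request bodies are built as plain dictionaries and printed as JSON; sending them to a cluster is left out.

```python
import json

# Assumed mapping for illustration only; NOT the actual case1condition1 mapping.
ASSUMED_MAPPING = {
    "mappings": {
        # message2 is dropped from the stored document body
        "_source": {"excludes": ["message2"]},
        "properties": {
            "message1": {"type": "text"},
            "message2": {"type": "text"},
            # not searchable, but still usable in aggregations
            "age": {"type": "integer", "index": False},
        },
    }
}


def stats_aggregation(field, size=0):
    """Build a stats aggregation body; size=0 returns aggregation results only."""
    return {"size": size, "aggs": {field + "_stats": {"stats": {"field": field}}}}


def terms_aggregation(field, size=0):
    """Build a terms aggregation body; size=0 returns buckets only, no hits."""
    return {"size": size, "aggs": {field + "_terms": {"terms": {"field": field}}}}


print(json.dumps(ASSUMED_MAPPING, indent=2))
print(json.dumps(stats_aggregation("age")))
print(json.dumps(terms_aggregation("age", size=20)))
```

Defaulting size to 0 in the helpers reflects the conclusion above: the aggregation results are what matter, and size:0 requests are the ones that benefit from caching.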

Case 2:

In this section, the tests were repeated on indexes with 10M documents, both to measure the performance of two other frequently used queries, Term and Match, and to answer the question: “When the number of documents in Case 1 increases, how do the disk size and search performance of the case1condition1 and case1condition2 indexes compare with an index that uses the default mapping?”

The mapping described above as case1condition1 was applied to the case2condition1 index.
The mapping described above as case1condition2 was applied to the case2condition2 index.
No mapping was applied to the case2condition3 index. It was indexed with the default mapping properties.

The following queries were run on the case2condition1, case2condition2, and case2condition3 indexes.

* Each query was run twice: once with the size parameter set to 0 and once with it set to 20.

Query results and index sizes for 10 million documents:

Case2	Size	Execution	Condition1	Condition2	Condition3
Index Size (MB)			1017.33	1027.79	1254.84
Stats Aggregation	Size:20	First Execution	1223 ms	619 ms	1206 ms
Stats Aggregation	Size:20	Cached Execution	1082 ms	568 ms	1175 ms
Stats Aggregation	Size:0	First Execution	481 ms	417 ms	450 ms
Stats Aggregation	Size:0	Cached Execution	2 ms	3 ms	2 ms
Terms Aggregation	Size:20	First Execution	1116 ms	907 ms	1214 ms
Terms Aggregation	Size:20	Cached Execution	1067 ms	894 ms	1170 ms
Terms Aggregation	Size:0	First Execution	429 ms	411 ms	437 ms
Terms Aggregation	Size:0	Cached Execution	2 ms	3 ms	2 ms
Term Query	Size:20	First Execution	*	41 ms	19 ms
Term Query	Size:20	Cached Execution	*	12 ms	15 ms
Term Query	Size:0	First Execution	*	20 ms	4 ms
Term Query	Size:0	Cached Execution	*	2 ms	2 ms
Match Query	Size:20	First Execution	40 ms	9 ms	12 ms
Match Query	Size:20	Cached Execution	5 ms	3 ms	10 ms
Match Query	Size:0	First Execution	1 ms	2 ms	4 ms
Match Query	Size:0	Cached Execution	1 ms	1 ms	2 ms

* The term query could not be run on the age field because the field is mapped with index:false.

In conclusion:

Setting index:false made little difference to disk size: 1017.33 MB, versus 1027.79 MB with the field indexed.
In line with the Case 1 result, the 10M-document results show no notable difference in aggregation performance for index:false. The query ran faster on the indexed field, albeit slightly.
(Naturally) performance decreases as the data set to be aggregated grows. Looking at the numbers, we concluded that aggregations should not run over data sets larger than 10M documents, since they already approach the 1-second limit. This led us to constrain the filters in our application in advance and to set rules accordingly.
Note that Elasticsearch does not allow querying a field mapped with index:false.
The message2 field, which is of type text and excluded from _source, still allows full-text search because it has index:true, but its value is not returned in the SearchHit because it is excluded. Since our queries must return all fields, we decided not to exclude any field.
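
As a rough illustration of the term and match queries measured in this case, the sketch below builds both request bodies in Python and refuses to build a query on a field that is not indexed, mirroring the failure noted for the age field in Condition1. The helper name build_query and the two field sets are hypothetical; in practice Elasticsearch performs this check on the server side.

```python
import json

# Hypothetical view of which fields are searchable under each condition.
CONDITION1_INDEXED = {"message1", "message2"}          # age has index:false
CONDITION2_INDEXED = {"message1", "message2", "age"}   # age is indexed


def build_query(kind, field, value, indexed_fields, size=0):
    """Build a term or match query body for a single field."""
    if kind not in ("term", "match"):
        raise ValueError("kind must be 'term' or 'match'")
    if field not in indexed_fields:
        raise ValueError(f"field '{field}' is not indexed and cannot be queried")
    return {"size": size, "query": {kind: {field: value}}}


print(json.dumps(build_query("match", "message1", "pleasure", CONDITION2_INDEXED, size=20)))
print(json.dumps(build_query("term", "age", 20, CONDITION2_INDEXED)))

try:
    build_query("term", "age", 20, CONDITION1_INDEXED)
except ValueError as error:
    print("rejected:", error)
```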

Case 3:

Comparison of an index with a custom template and an index created with default settings

Everything so far consisted of research and tests on which settings optimize disk size and query performance. Here we measured what we gained or lost by comparing the case3condition1 index, which uses the mapping we designed, with the case3condition2 index, which uses the default settings. Each index holds 1 million documents.

To test this case, a template named case3condition1template was created, containing the mapping and settings of case3condition1. The case3condition2 index needs no template, because Elasticsearch indexes it with the default mapping values and settings. The mappings of these indexes are:

Click here for mapping of the case3condition1 index

Click here for mapping of the case3condition2 Index

Both indexes were tested with a more complex aggregation and a full-text query that resemble real use. The queries on the case3condition2 index used the full (unshortened) field names. These queries are:

Click here for Complex Query

Click here for Complex Full-Text Query

Query results and index sizes for 1 million documents:

Case3	Size	Execution	Condition1	Condition2
Index Size (GB)			2	2.6
Complex Query	Size:0	First Execution	194 ms	317 ms
Complex Query	Size:0	Cached Execution	130 ms	130 ms
Complex Full-Text Query	Size:20	First Execution	180 ms	260 ms
Complex Full-Text Query	Size:20	Cached Execution	22 ms	21 ms

In conclusion:

In the case3condition1 index, the text type was assigned to fields used for full-text search, the keyword type to fields used for aggregations and queries, and numeric fields were given data types that match the size of their values. This configuration produced a difference of 600 MB. The difference would be substantial if the number of documents reached 100 million.
Although queries ran faster on the case3condition1 index on their first run, the running times of the cached queries are close to each other on the two indexes.
As additional information, strings longer than the value of the ignore_above parameter of the keyword data type are not indexed, so they cannot be searched; they are still kept in _source.
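
The type choices described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names, the ignore_above value of 256, and the status_code and endpoint field names are hypothetical and not taken from the actual case3condition1 mapping. The value ranges are those of Elasticsearch's signed integer field types.

```python
# Value ranges of Elasticsearch's signed integer field types, smallest first.
INTEGER_TYPES = [
    ("byte", -2**7, 2**7 - 1),
    ("short", -2**15, 2**15 - 1),
    ("integer", -2**31, 2**31 - 1),
    ("long", -2**63, 2**63 - 1),
]


def smallest_integer_type(min_value, max_value):
    """Return the narrowest integer field type that can hold the given range."""
    for name, low, high in INTEGER_TYPES:
        if low <= min_value and max_value <= high:
            return name
    raise ValueError("range does not fit in a 64-bit integer field")


def keyword_field(ignore_above=256):
    """Keyword mapping; strings longer than ignore_above are not indexed."""
    return {"type": "keyword", "ignore_above": ignore_above}


# Hypothetical properties block; only age and message1 appear in the article.
properties = {
    "age": {"type": smallest_integer_type(0, 120)},
    "status_code": {"type": smallest_integer_type(100, 599)},
    "endpoint": keyword_field(),
    "message1": {"type": "text"},
}
print(properties)
```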

As a result of Cases 1, 2, and 3:

Our work seems to serve its purpose. 😸 The Elasticsearch cluster on our platform is operational as both a time-series database and a full-text search database.

With proper mapping, the disk size dropped drastically. All our fields are indexed and available in _source, but unnecessarily large types were cleaned up (for some fields we now use keyword instead of text, and for numeric values we use byte and integer where they fit).
The number of indexes, shards, and replicas was adjusted.
Queries run across multiple indexes, while indexing goes only to the latest index. All indexes except the latest one are read-only, which made index lifecycle management easier.
We think (and also hope 😺) that index lifecycle management has positive effects that we cannot measure (at least in terms of consistency).
We checked which queries in the application are cacheable and fixed the ones that were not. Our time-series queries, which used to be built on the exact current time, are now rounded to the minute so that they can be cached. Our aggregations are set to return no hits (size:0).
With the use of the cache, results began to come back noticeably faster.
Fields that were queried with match queries were switched to term queries.
Query performance improved by fetching only the required fields with the includes keyword instead of returning the default full query results.
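
The minute rounding and the includes filter mentioned above can be sketched in Python roughly as follows. The function names and the one-hour window are assumptions for illustration; date1 and enum2 are borrowed from the Case 4 table as example field names. The point is that two requests issued within the same minute produce byte-for-byte identical bodies, which is what makes them cacheable.

```python
import json
from datetime import datetime, timedelta, timezone


def round_down_to_minute(moment):
    """Drop seconds and microseconds so nearby calls share one timestamp."""
    return moment.replace(second=0, microsecond=0)


def last_hour_aggregation(now, time_field, terms_field):
    """Terms aggregation over the last hour, aligned to the minute, no hits."""
    end = round_down_to_minute(now)
    start = end - timedelta(hours=1)
    return {
        "size": 0,
        "query": {
            "range": {time_field: {"gte": start.isoformat(), "lt": end.isoformat()}}
        },
        "aggs": {"by_" + terms_field: {"terms": {"field": terms_field}}},
    }


def search_with_includes(query, fields, size=20):
    """Fetch only the listed fields from _source instead of whole documents."""
    return {"size": size, "_source": {"includes": list(fields)}, "query": query}


first = last_hour_aggregation(
    datetime(2021, 1, 1, 12, 30, 5, tzinfo=timezone.utc), "date1", "enum2"
)
second = last_hour_aggregation(
    datetime(2021, 1, 1, 12, 30, 59, tzinfo=timezone.utc), "date1", "enum2"
)
print(json.dumps(first))
print(first == second)
print(json.dumps(search_with_includes({"term": {"enum1": "A"}}, ["date1", "enum1"])))
```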

Case 4:

Is Elasticsearch performance really good on large volumes of data?

We tested query speeds on an index with 550M documents.

To test this, an index named case4condition1 was created with its own mapping. The mapping and an example document of this index are as follows:

Click here for mapping of case4condition1 index
Click here for an example document of the index case4condition1

The following queries were run on the case4condition1 index:

* The field name and value used in each query differ according to the data type.

Results of the queries run on the case4condition1 index, which holds 550 million documents and is 96.3 GB in size:

Query Type	Field Name	First Execution (ms)	Cached Execution (ms)
Stats Aggregation	number1	20384	2
Terms Aggregation	number1	21858	2
Date Histogram and Terms Aggregation	date1, enum2	28536	
Match Query	metin1	143	5
Match Query	metin2	210	6
Match Query	enum1	67	4
Match Query	enum2	64	3
Match Query	number1	979	830
Match Query	number2	234	152
Match Query	number3	204	52
Match Query	number4	96	12
Term Query	metin1	122	5
Term Query	metin2	164	11
Term Query	enum1	6	4
Term Query	enum2	3	2
Term Query	number1	856	828
Term Query	number2	227	154
Term Query	number3	102	46
Term Query	number4	59	5

In conclusion:

The metin1 field ("metin" means "text") holds a 3-word text and the metin2 field a 15-word text. Comparing the match queries on these fields shows that the query slows down as the indexed text gets longer.
The values of number1 are in the range 0–10, number2 in 0–100, number3 in 0–1000, and number4 in 0–10000. The response time of both the term query and the match query decreases as the number of distinct values (the cardinality) of a numeric field increases. On these fields, term can be used instead of match.
Comparing match and term queries overall, the term query ran faster on first execution for every field.
When the aggregation query (without a meaningful filter) was run, it took approximately 28 seconds, and when we tried to run it more than once we received a 504 (Gateway Time-out) error. Unfiltered aggregation on data sets of this size therefore does not seem practical.
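
For contrast with the unfiltered aggregation above, here is a minimal Python sketch of a date histogram with a nested terms aggregation that is restricted to a bounded time window. It is not the query used in the test, which is not shown in this article. The function name, the aggregation names, the example dates, and the use of calendar_interval (which assumes Elasticsearch 7.2 or later) are all assumptions; date1 and enum2 are the field names from the table.

```python
import json


def date_histogram_with_terms(date_field, terms_field, start, end, interval="1d"):
    """Date histogram with nested terms buckets, limited to [start, end)."""
    return {
        "size": 0,  # aggregation results only, no hits
        "query": {
            "bool": {
                # filter context: no scoring, restricts the documents aggregated
                "filter": [{"range": {date_field: {"gte": start, "lt": end}}}]
            }
        },
        "aggs": {
            "per_interval": {
                # calendar_interval assumes Elasticsearch 7.2+
                "date_histogram": {"field": date_field, "calendar_interval": interval},
                "aggs": {"top_values": {"terms": {"field": terms_field, "size": 10}}},
            }
        },
    }


body = date_histogram_with_terms("date1", "enum2", "2021-01-01", "2021-01-08")
print(json.dumps(body, indent=2))
```

How much such a filter helps depends on how many documents fall inside the window; the sketch only shows the shape of the request.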
