Elasticsearch Usage with Apinizer - Part:1

We researched whether there is a more suitable and usable log database instead of Elasticsearch (v6.3.1), which we have been using since Apinizer v.1 for the renovation process of our Apinizer – Full API LifecycleManagement & Integration Platform.

While doing these researches, these were our needs:

  1. It was necessary on our platform to record approximately 15 million messages daily, together with requests, responses, and metrics.
  2. It was necessary to signify these recorded data, to perform full-text searches on appropriate fields, and to create reports and graphs with metric data. For this reason, we started to search for suitable databases and compare them with each other.
  3. The most important point when making comparisons was that the hardware resources of the customer were always limited for some reason, but the queries were required to high-performance work. (¯\_(ツ)_/¯)
  4. I would also like to underline that data loss is not important to our customers.

To summarize the work we have done in this context, there are 3 posts which contains these stages:

Elasticsearch Usage with Apinizer – Part:1 

Stage 1: Couchbase & InfluxDb & Elasticsearch, which one is the right decision for us?

Stage 2: What is the right Elasticsearch setting for us?

Elasticsearch Usage with Apinizer – Part:2

Stage 3: Performance Tests

Elasticsearch Usage with Apinizer – Part:3

Stage 4: Shard and Disk Size

Stage 5: Crucial Checks


 

Stage 1: Couchbase & InfluxDb & Elasticsearch, which one is the right decision for us?

 

Couchbase (v6.0.0)

We tested the performance of most used SQL/queries in 1 million documents with default configuration without tuning on Couchbase and Elasticsearch.

SQL/Query NameResponse Times (ms)CouchbaseElasticsearch
Match QueryFirst execution9630
Cached execution78.229.79
Term QueryFirst execution2157
Cached execution2.8103
Prefix QueryFirst execution314
Cached execution2.712.1
Stats AggregationFirst execution67950
Cached execution1104.744

The upside of Couchbase;

  • Elasticsearch responded faster in match and aggregation queries by far and Couchbase, on the other hand, responded faster in term and prefix queries by far.

The downsides of Couchbase for us;

  • The documentation is not sufficient and descriptive,
  • Insufficient community support,
  • Couchbase outperforms Elasticsearch in full-text search and aggregation queries.

Conclusion;

Full-text search and aggregation queries are used more on our platform and our team does not have enough Couchbase experience. For these reasons, it was concluded that using Couchbase instead of Elasticsearch would not be an advantage for us, so we eliminated that option.

InfluxDb (v1.7)

The other analyzed database was InfluxDb. As a result of the researches, we were surprised that the write operations were significantly more performant compared to Elasticsearch in terms of compression and query performance on the disk, and we started using InfluxDb. (You can view a sample benchmark results by clicking this link.) The negative aspects of InfluxDb that we encountered during the review process were as follows;

  • The documentation is not sufficient and descriptive (or we did not understand 😄 ),
  • Insufficient community support,
  • Inability to perform full-text-search on the text,
  • Lack of cluster structure support in its open-source version.

Conclusion;

InfluxDb was eliminated because of this lack since it is an important criterion for us that all technologies we use in our medium and large-scale customers with high log density are expandable. Maybe a hybrid solution for full-text search could be considered if this was possible.

Elasticsearch (v7.9.2)

As a result, we once again decided to use the open-source Elasticsearch because it allows cluster structure, provides easy management, full-text search, the performance of queries, very detailed documentation, and effective community support (we also had some sympathy 😄). We also understood that it’s more trouble than it’s worth if we change it.

 

Stage 2: What is the right Elasticsearch setting for us?

At this stage, we reviewed the literature to find an answer to the question of “what actions should be taken to increase the performance of Elasticarch and use it efficiently?”. As a result of our research, we decided to implement the following subjects:

 

➡️ Automating index management with Index Lifecycle policies

Before the renovation process, indexes were created in Apinizer with the definition of “date math”. Thus, operations such as searching the data of the last 5 days, backing up the documents of the last month, maintaining the index were easier to manage.

An example index name created with Date math management;

# <static_name_of_the_index{date_math_description|optional_time_zone}

# <apiproxylogs{now/d{yyyMMdd}}

# If the current date is 26.12.2019, the index name where today's documents will be kept: apiproxylogs20191226

Instead of this approach, we decided to implement Elasticsearch because it allows us to automatically manage indexes with Index Lifecycle Management (ILM) policies based on the data retention period. Thanks to the ILM policy, operations such as rollover (creating a new index), merge (reducing the number of segments of the index), shrink (reducing the number of shards in the index), delete (deleting the index), etc. can be applied to indexes. Curator can be used to automate these processes. But since all components are embedded in our platform and we manage them all, we didn’t want to form a new front separately. 😊

# A sample request to manually create an ILM policy;

PUT _ilm/policy/apinizer-log-ilm-policy
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": {
"max_age": "1d",
"max_size": "30gb",
"max_docs": 15000000
}
}
},
"warm": {
"actions": {
"readonly": {},
"allocate": {
"number_of_replicas": 0
},
"shrink": {
"number_of_shards": 1
},
"forcemerge": {
"max_num_segments": 1
}
}
},
"cold": {
"min_age": "30d",
"actions": {}
}
}
}
}

# Calling information by policy name;
GET _ilm/policy/apinizer-log-ilm-policy

# Checking the policy processes applied to the index;
GET index_name/_ilm/explain

ILM manages indexes by applying a hot-warm-cold architecture to time-series or metric data. As indexes age, they pass through these phases, and actions in phases are performed. Generally, the policy is added to the index with a template. Thus, the new index is automatically applied to the policy. When a request is made that matches the index pattern (index_pattern) given in the template, the template and policy are applied to the index.

min_ageis the parameter used to enter the index into phase. The phase cannot be entered before this value is exceeded. We operate 3 phases in our index where log data is kept. That is to say;

  1. Hot phase:

In this phase, create/update/search operations are actively applied to indexes.

Rollover: This action initiates a write operation to a new index when the index reaches a certain size (max_size), a certain number of documents (max_docs), or a certain age (max_age) to improve disk usage. How the values here are assigned is included in the 3rd article series.

  1. Warm Phase:

Actions in this phase are applied only to indexes that are not updated, but only to which the query is run.

Readonly: Write operation to the index must be turned off to perform these other actions working with this phase. Because usage performance can get worse.

# Request to do this manually:
PUT index_name/_settings
{
  "index": {
   "blocks.read_only": true
  }
}

Allocate: It takes the replica shard number of the index that will be created in this phase, not from the template specified for the index, but from the value assigned in this action setting.

Force Merge: Segments, i.e. inverted indexes, are physical files on the disk. Queries are made on the inverted index and the results are sent to the shard. Having too many segments or large shards can increase the query time. Force merge API forces shards with one or more indexes to merge, and the number of segments decreases as a result of the merge operation. During this process, deleted documents are completely removed and free space is created on disk space. Because deleted documents are marked as deleted on the segment, they are not physically removed. In a study, the index with a size of 2.2GB decreased to about 510.9MB after merging. This means that serious disk space can be saved.

Normally, Elasticsearch does this automatically. But sometimes it may also need to be triggered manually. Because the merge operation both improves resource usage and increases the search speed as it provides a more effective structure on the merged indexes. If this process is to be done manually, it should be noted that the write operation is applied on finished indexes. If this process is to be managed manually, it should be done when the node is using less CPU and disk as it can consume a lot of resources during the process.

# Request to do this manually:

POST index_name/_forcemerge?max_num_segments=1

# To watch the segments in the merge process:

GET _cat/segments/indeks_adi?v

Shrink: To improve disk usage, a new index can be created by reducing the number of shards in an existing index. For example, an index with 8 shards can be reduced to 1 shard.

  1. Cold Phase:

Updates are not made on the index, and this is the stage where querying is done very rarely. So queries run slower. Here, the index can be moved to less performing and less costly hardware.

  1. Delete Phase:

The index that is not needed is permanently deleted. We applied this phase to the index where trace logs are kept. Because in these indexes, both more data were kept and the time needed for data was shorter. We enabled this phase to free up disk space.

# Delete phase part of a sample ILM policy;

# Here, the following point should be noted; it deletes the index 30 days after rollover. The min_age value is based on the rollover time, not the time the index was created.

"delete": {
"min_age": "30d",
"actions": {
"delete": {}
}
}

 

➡️ Keeping index sorted by time domain

Each shard corresponds to one Lucene Index. Each Lucene index also includes segments. It is configurable in what order the segments will be in the shard when a new index is created because Lucene does not perform any sorting by default. When a query request is sent based on the sorting criteria, it returns all documents matching the query and applies the desired sorting operation on it.

When the documents are sorted, Elasticsearch looks at the first 20 documents in each segment and returns the matching results, since the order is clear when, for example, ‘size:20’ is called in the query request.

Another example: The query will be made on a smaller dataset instead of the entire dataset in the cluster by listing it according to a certain date range in the date-range filter query, over the field of the date type (@timestamp field for us). This makes the query run faster. Since we made our queries according to the @timestamp field in our application, we predicted descending order according to this field.

# Manually index sort request:

PUT indexname
{
"settings" : {
"index" : {
"sort.field" : "@timestamp",
"sort.order" : "desc"
}
},
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
}
}
}
}

 

➡️ Abbreviation of domain names in the document

To save disk space, We decided to index domain names like ‘apiProxyName’ by shortening them like ‘apn’. The domain names of a document with a size of approximately 2.7KB have been shortened, and the size of the same document has been reduced to 2.1KB. When 1 million documents were added to the index with abbreviations in the domain names and the index without abbreviations in the domain names, there was a difference of 658MB between the two indexes.

 

➡️ Using the smaller numeric data type

For numeric type fields, the data type that takes the least amount of space will be used as it significantly affects the use of disk space. Details are given under the heading Stage 3.

 

➡️ Indexing a String type field

When the field that holds a string type other than mapping is indexed, it is kept as both text and keyword type. If a full-text search will be made on the field, the text type should be given to this field, since the analyzer process will be applied. The type should be a keyword if there are fields that do not need to be processed by the analyzer such as id, error type. This improves disk usage. If there are dynamically incoming fields, a default data type can be assigned to a string value by creating a dynamic template.

 

➡️ Using a Filter query instead of a Match query

If it is sufficient to answer yes/no when making match queries, the ‘filter’ query should be used. Also, the results of the filter query are cached. Otherwise, the query results by calculating the relevancy score value. This affects search performance.

# Node query cache can be checked with the following request
GET index_name/_stats?filter_path=indices.**.query_cache

 

➡️ Using rounded date math values to cache the query

For example, now the value is not cached because it changes dynamically. But when requesting data from an hour ago, the query can be run from the cache using a rounded date. Let’s assume that we are searching with the definition of ‘now-1h/m’ instead of ‘now-1h’ to search over an hour before the current time (16:31:29) on the range query. It matches all values in the time range of 15:31:00–16:31:59. When this call is made at the same minute, the ‘query cache’ makes the query run faster. We have recreated/edited our queries accordingly.

PUT index_name/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "created": {
            "gte": "now-1h/m",
            "lte": "now/m"
          }
        }
      } 
    }
  }
}'

 

➡️ Caching of Aggregation query

When the Aggregation query is run, caching results improve search performance. For this, the following should be noted; The value ‘size:0’ must be given in order not to return a hit value from the query. Otherwise, the query with a hit value will not be cached. The body of the request should not change. And if there is, the rounded date information should be used in the query.

# Shard query cache can be checked with the following
GET index_name/_stats?filter_path=indices.**.request_cache

 

➡️ Adjusting the number of Shard & Replica

Setting the number of replicas of the index is a critical step. Because it can have advantages as well as disadvantages. Because replicas are a copy of the shard, they ensure both data accessibility and better performance of queries. The disadvantage is that replicating the same indexed data reduces indexing speed and search performance. It is recommended to give an ideal value for this.

For example, let’s assume that in the cluster with 2 nodes, an index with 2 shards and 0 replicas and another index with 2 shards and 2 replicas are created. The index with less number of shards per node is processed faster. But data loss may occur when 1 node fails in the index with replica 0. (I would like to remind you again that data loss is not important to our customers.)

Here, the decisive factor was our customer and data growth rate. By default, we called the number of shards 1 and the number of replicas 0. In this way, we have highly optimized the performance of the shrinking process during phase changes in the index lifecycle.

# Request to do this manually:
PUT index_name
{
"settings": {
"index": {
"number_of_shards": 1,
"number_of_replicas": 0
}
}
}

 

➡️ Index refresh time interval

Index refresh time is set by adding the index.refresh_interval property and its default value is 1 second. Increasing this value reduces the cost of merge/create operations. While giving this value, it should not be forgotten that the documents will be searched after the refresh. Increasing this value helps to reduce the number of segments and reduce the I/O cost for searching. Caching will be invalid as data changes with each refresh. It can also make Elasticsearch use the cache more efficiently. Although we set the refresh interval as 15 seconds, we made it configurable as the user may need real-time data analysis.

# It can be checked how many segments there are and how much time is spent for merge/refresh from the request below.
GET index_name/_stats?filter_path=indices.**.refresh,indices.**.segments,indices.**.merges

 

✌ ⭐️ Additionally;

Under this heading, there are some situations we pay attention to and other settings that we find useful even if they don’t fit our terms and that we think we can use later.

  • Searching on parent-child-related documents and nested aggregation query can increase the query time. We avoided these situations. We modeled the design of our data structure accordingly.
  • If processing is performed on large documents and returning a few fields in the document is sufficient, the “stored_fields” mapping parameter can be used to optimize the search performance.
  • Stop words (a, the, and, etc.) can cause the query to explode. For example, when a search for “fox” is made, tens of hundreds of documents may come up. But when a search for “the fox” is made, it can slow down the whole system because all documents may contain the word “the”. Stop Token Filter can be used to stop searching on Stop words.
  • By making bulk requests, a request with more than one document provides higher performance instead of a request with a single document at the same time. Because it reduces the load per request. If it is desired to increase the indexing speed while performing loaded indexing in a single request, such as bulk, the refresh_interval property can be assigned a value of -1. But this should be temporary because it can lead to data loss.
  • After calling the Force Merge API, if there are replica shards, Synced Flush API is called to improve the cluster recovery speed. This API performs a synced flush on one or more indexes. Synced flush can also be performed in the ongoing indexing operation. Time-based indexes are recommended for clusters that are infrequently updated and have many indexes.
  • If queries are filtered over an enumerable data type (for example, the region field with a value of US, Euro, etc.), the index can be divided into multiple indexes according to the values of this field. Thus, the filter query is removed from the search request. If querying on more than one index is desired, wildcards can be used.
  • If the value of the filter query is not enumerable and the index has more than one shard, custom routing can be used. For example, queries need to be filtered through the userId value. It’s impossible to split the index for each recipient. When indexing with the same userId, routing can be used to place the document on the same shard. So the queries run in the shard that matches the routing key. Checks for indexes with more than one shard; 1- When indexing with Id or routing key, it should be ensured that the documents are placed equally on the shard. Otherwise, one shard larger than the other will cause slow read/write operations. 2- Pay attention to the routing_partition_size setting. It should be ensured that the shards distributed on different nodes are equally distributed. Otherwise, the node, which is more shard than other nodes, takes more load than other nodes.
  • If there are too many input-output processes, high-performance hardware should be preferred as resource consumption will increase. Like using an SSD instead of a spinning disk. (Click here to see the hardware that handles 10k requests per second.)
  • Some queries such as query_string and multi_match work slowly. Therefore, as few areas as possible can be searched.
  • The index:false feature can only be used if a histogram is to be made. To fetch this value when the query is answered, the field must be assigned to the includes array of the source parameter. This saves disk space.
  • Terms aggregation works faster than range aggregation. For example, let’s say that a range query is made on the ‘price’ field over the values of 0–2000, 2000–5000, 5000-~. Instead, we add a field called ‘price_range’ and keep the range of the price value, so that there is no need for a range query and terms aggregation can be run on the ‘price_range’ field.
  • If aggregation, sort, script will not be performed on the field and the queries will not be fetched when run, the doc_values:false parameter can be added to improve the disk space.
  • Compressing the fields contained _source with the codec:best_compression property can be used to optimize disk space. However, it should be used with caution as it will reduce the search speed.
  • It is recommended to avoid scripts as they can reduce the query speed. If needed, painless or expression script language can be used.