A recommended resilient and scalable Elasticsearch data indexing architecture — Part 2
In the first part of this 2-part series on an architecture for your Elasticsearch data indexing pipeline, I gave a short background and introduction. If you’ve not read part 1, you can do so here.
As a continuation, we’ll be discussing the proposed architecture in detail, the benefits and complication in part 2.
A recap of the outline of this 2-part series.
- What is Data Indexing? And why do I need to index?
- Open-source Search Engines — Elasticsearch
- Message queue/brokers — RabbitMQ
- Conventional data indexing flow
- Problems with conventional data indexing flow
- Introduction to Extract-Transform-Load (ETL) indexing architecture
- Benefits of proposed architecture
- Complication of horizontal scaling
Problems with conventional data indexing flow
In the first part of this series, we discussed a conventional flow of how a system can index data into Elasticsearch. A recap:
I like to call this the microservice “push” model, because the microservice is pushing data directly into both the primary data store (could be SQL, MongoDB, etc) and Elasticsearch. One problem in this model is that there is a larger transaction boundary.
A transaction is an atomic unit of work that must be completed as a whole. It usually consists of one or more operation.
Here’s a straightforward example: when you transfer money to your friend, the bank deducts money from your bank account and adds it to your friend’s account balance. This is a “2-operation” transaction that MUST be executed to the end together. Imagine if the bank deducts your money, but it fails to add to your friend’s account. In that case, the entire transaction must be cancelled, that is, to reverse the deduction from your account, otherwise, the money will be lost.
Therefore, a transaction boundary is where a transaction starts and ends (being committed or rolled-back).
Impact of a larger transaction boundary
In our Elasticsearch scenario, it means that if the index operation in Elasticsearch fails for whatever reason, we need to rollback changes to the primary data store. Otherwise, there would be data inconsistency, i.e. data is actually in primary store but cannot be searched.
Ultimately, it means that we need to write code to cater for the rollback logic.
A better way to implement data indexing to Elasticsearch could be as follows. One that uses a message queue and a separate indexer module.
Similar to the conventional flow, the application will push data changes to the primary data store first, then a copy of the data will be sent to the message queue (e.g. RabbitMQ). Next, an indexer module will subscribe to the queue and grab jobs from the queue to index into Elasticsearch.
Benefits of Proposed Architecture
This architecture can be easily scaled up to meet higher demand. For instance, the number of indexer instances can be increased if there are more documents to process. The size of the RabbitMQ message queue can also be increased easily to support more or larger document sizes.
Resiliency is also another benefit of this architecture. When data is sent to the queue, it will persist until it has been successfully processed and acknowledged by a subscriber. This basically guarantees eventual consistency; any changes made is guaranteed to be indexed into Elasticsearch without any data loss from transient failures.
Modular — Separation of Concerns (SoC)
SoC is one of the most important design principle, and it states that a computer program should be separated into distinct sections that serves one and only one function.
And one of the major benefits of this architecture is that the indexing operation is separated out into its individual module, creating modular components, namely the app and indexer. This also enable easier tech refresh by the individual components.
Complication of Horizontal Scaling
However, there’s a complication that arise from this architecture when it is scaled up, it is something that will need to be taken care of.
In a multi-indexer instance environment, if one of the instance fails after retrieving a message but didn’t finish processing it, the message may become out-of-order after retry by another instance. This may result in outdated versions of document being indexed.
This complication is illustrated in the following diagram:
In the example above, v1 is retrieved by indexer 1 and v2 is retrieved by indexer 2. Now let’s assume indexer 2 successfully acknowledged processing the document, but indexer 1 fails for some reason and didn’t successfully acknowledge. In that case, v1 will be retried and sent to another instance that is available, indexer 3. After which, it will be processed and acknowledged.
Now, since v1 is processed after v2, it will replace v2 in the Elasticsearch index, which is the oudated document.
To mitigate this issue, we need to have each indexer instance verify if the current document is oudated before indexing in Elasticsearch.
We do this by comparing the version of the document with the current one in ES.
- If the version in ES is before the current doc. version, we can proceed to index it.
- If the version in ES is after the current doc. version, we can safely discard it.
The benefits of this architecture is quite clear, and it can fit projects of various scale. However, if your project is simpler and more straightforward, it may be an overkill to implement such an architecture.