A recommended resilient and scalable Elasticsearch data indexing architecture — Part 1
In this 2-part story, I’ll be introducing an architecture for your Elasticsearch data indexing pipeline. This architecture will be especially useful if you need some kind of full-text search on your data and would like to leverage specialised full-text search engines like Elasticsearch.
This post is written with the assumption that you don’t have any background knowledge of the technology that we’re going to talk about, so I’ll be giving a short background and introduction in part 1. Some technical & coding knowledge is definitely required to understand the majority of it.
We’ll discuss the architecture in detail, along with its benefits and complications, in the 2nd part of this series.
- What is Data Indexing? And why do I need to index?
- Open-source Search Engines — Elasticsearch
- Message queue/brokers — RabbitMQ
- Conventional data indexing flow
- Problems with conventional data indexing flow
- Introduction to Extract-Transform-Load (ETL) indexing architecture
- Benefits of proposed architecture
- Complication of horizontal scaling
What is Data Indexing? And why do I need to index?
If you’re reading this with the knowledge of relational databases and their indexes, it’ll be very easy to grasp whatever I’m about to say, as it’s basically the same concept. We index data to allow for efficient search and retrieval of data from the database.
Imagine you have a list of 2,000,000 names, and you want to find a specific name on the list. How long would that take? You’d practically have to scan through the list until you find it. In the worst case, the name you’re looking for is the last item, and you’d have looked through all 2,000,000 names to find it.
Now imagine you have a mapping of the name keywords to their position on the list. If you want to find “John Doe”, you simply look up the position in the mapping and seek directly to the data instead of having to scan through the entire list.
Data indexing is basically the process of creating this “mapping” of keywords or data to some kind of “position” where the data is situated. This allows for very efficient search/retrieval of data given a particular search query.
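To make the idea concrete, here is a minimal sketch in Python that contrasts a full scan with a lookup through a prebuilt mapping. The names and the helper functions (`linear_search`, `build_index`) are made up for illustration, and a real database index is far more elaborate than a dictionary.

```python
# Hypothetical sample data; imagine 2,000,000 entries instead of 3.
names = ["Alice Tan", "John Doe", "Mary Lim"]


def linear_search(names, target):
    """Scan the list from the start until the target is found."""
    for position, name in enumerate(names):
        if name == target:
            return position
    return None


def build_index(names):
    """Build the 'mapping' once: name -> position of its first occurrence."""
    index = {}
    for position, name in enumerate(names):
        index.setdefault(name, position)
    return index


index = build_index(names)

# With the index, finding a name is a single dictionary lookup
# instead of a walk through the whole list.
position = index.get("John Doe")
print(position, names[position])
```

Building the mapping costs one pass over the data up front, which is the "indexing" step; every search after that avoids the scan.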
Open-source Search Engines — Elasticsearch
Now, you may be wondering, “How does data indexing apply to me and my projects?”
Most, if not all, applications require some sort of search capability, be it to pull out a customer’s order from your database, or to let customers search for a product on your online store. Conventionally, if you store your data in a relational database like MySQL, or even a NoSQL database like MongoDB, “searching” for data would involve crafting some kind of query to instruct the database engine to match whatever you’re looking for against fields in the DB, and return the results.
Example SQL query:
SELECT * FROM Product WHERE name LIKE '%badminton%'
OR description LIKE '%badminton%'
OR category LIKE '%badminton%'
If we want to search every single field available, you can imagine how long the query can get, not to mention the performance impact.
Furthermore, if we want to search normalized one-to-many or one-to-one data, the query becomes a nightmare to build. E.g. a product can have multiple tags, which will be represented by a one-to-many relationship.
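As a rough illustration of how the query grows once tags are involved, here is a small sketch using Python’s built-in `sqlite3` module. The `ProductTag` table, its columns, the sample rows and the `search_products` helper are all assumptions made up for this example; your schema will differ.

```python
import sqlite3

# In-memory database with a hypothetical schema: one product, many tags.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (id INTEGER PRIMARY KEY, name TEXT, description TEXT, category TEXT);
CREATE TABLE ProductTag (product_id INTEGER REFERENCES Product(id), tag TEXT);
INSERT INTO Product VALUES (1, 'Feather shuttlecock', 'Tube of 12', 'Sports');
INSERT INTO Product VALUES (2, 'Running shoes', 'Lightweight trainers', 'Footwear');
INSERT INTO ProductTag VALUES (1, 'badminton'), (1, 'indoor'), (2, 'running');
""")


def search_products(conn, keyword):
    """Match the keyword against every product field and every tag."""
    pattern = f"%{keyword}%"
    return conn.execute(
        """
        SELECT DISTINCT p.id, p.name
        FROM Product p
        LEFT JOIN ProductTag t ON t.product_id = p.id
        WHERE p.name LIKE ?
           OR p.description LIKE ?
           OR p.category LIKE ?
           OR t.tag LIKE ?
        ORDER BY p.id
        """,
        (pattern, pattern, pattern, pattern),
    ).fetchall()


# The shuttlecock only matches through its 'badminton' tag.
results = search_products(conn, "badminton")
print(results)
```

Every additional searchable field adds another `LIKE` clause, and every additional related table adds another join plus the `DISTINCT` needed to collapse duplicate rows.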
Ergo, Elasticsearch to the rescue!
Elasticsearch (ES) is an open-source full-text search engine that provides scalable, near real-time search. It is commonly paired with tools like Logstash and Kibana for data processing and visualisation (together more popularly known as the ELK stack).
The concept of ES is very simple: we save documents into an ES index (similar to a database in an RDBMS), and ES provides us with full-text search and other cool search functionalities right out of the box.
We’re not going to go into the detailed functionality and benefits of using Elasticsearch, as that’s way out of scope. If you’re interested in diving deeper into it, there are a lot of resources available.
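Here is a minimal sketch of what “saving a document” and “searching” could look like for the product example. It only builds the JSON bodies and does not talk to a cluster. The index name `products`, the field names and both helper functions are assumptions for illustration; `multi_match` is Elasticsearch’s query type for matching one piece of text against several fields.

```python
import json

INDEX_NAME = "products"  # hypothetical index name


def build_document(product):
    """Flatten a product, including its tags, into one JSON-like document."""
    return {
        "name": product["name"],
        "description": product["description"],
        "category": product["category"],
        "tags": list(product.get("tags", [])),
    }


def build_search_body(keyword):
    """Search body that matches the keyword against several fields at once."""
    return {
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": ["name", "description", "category", "tags"],
            }
        }
    }


doc = build_document({
    "name": "Feather shuttlecock",
    "description": "Tube of 12",
    "category": "Sports",
    "tags": ["badminton", "indoor"],
})

# These bodies would be sent to the index and search endpoints of the
# Elasticsearch REST API, e.g. PUT /products/_doc/1 and POST /products/_search.
print(json.dumps(doc, indent=2))
print(json.dumps(build_search_body("badminton"), indent=2))
```

Note how the one-to-many tags simply become a list inside the document, so the search no longer needs a join.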
“BUT… most RDBMS have their own full-text search functionality!”
Yes, that’s true. I’m not going to go into a whole debate of RDBMS full-text search vs specialised search engines like Elasticsearch. But specialised search engines come with a more advanced feature set and can scale much more easily.
To quote someone from Reddit,
MySQL does nothing well, but everything well enough. It’s a general, flexible relational database. It’s a Toyota Camry.
Elastic(search) is powerful but complex, and is well suited to its particular domain (search). It’s an F1 car. — htom3heb, Reddit
Of course, if you’re doing a simple title or name search, stick with your RDBMS query. You don’t need an F1 car for your daily drive to work. But if you want to build your system to scale, and possibly support more advanced search, you’ll need something more sophisticated.
Message queue/brokers — RabbitMQ
It may not be immediately obvious why we’re talking about a message broker and how it fits into the picture, but hear me out first. I’ll get to how we use a message broker in part 2 of this series.
RabbitMQ is a popular open-source message broker that implements the Advanced Message Queuing Protocol (AMQP). It enables inter-application communication in a reliable and interoperable manner.
The basic concept of a message queue/broker is simple.
Suppose 2 applications want to communicate with each other. They do so by sending messages through the message queue. One of the applications will enqueue the message, and the receiving party will dequeue it for processing.
And as the name suggests, messages are processed on a first-in-first-out basis (i.e. messages enqueued first will be dequeued and processed first).
RabbitMQ works on a publish-subscribe pattern. The application publishing messages to the queue is called the Producer, and the application(s) consuming the messages are aptly called the Consumer(s).
You can have as many consumers as you’d like subscribed to the messages. The message broker will handle distribution of the messages, as well as their acknowledgement after a consumer has successfully processed them.
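The enqueue, dequeue and acknowledge cycle described above can be sketched with a toy in-memory queue. This is not RabbitMQ code: `ToyQueue`, its methods and the alternating delivery between two consumers are all assumptions made for illustration.

```python
from collections import deque
from itertools import cycle


class ToyQueue:
    """In-memory stand-in for a broker queue (illustration only)."""

    def __init__(self):
        self._messages = deque()   # waiting messages, oldest on the left
        self._unacked = {}         # delivery tag -> message awaiting an ack
        self._next_tag = 1

    def publish(self, body):
        """Producer side: enqueue a message at the back."""
        self._messages.append(body)

    def deliver(self):
        """Hand the oldest message to a consumer, or None if the queue is empty."""
        if not self._messages:
            return None
        body = self._messages.popleft()
        tag = self._next_tag
        self._next_tag += 1
        self._unacked[tag] = body
        return tag, body

    def ack(self, tag):
        """Consumer side: confirm the message was processed successfully."""
        del self._unacked[tag]

    @property
    def unacked_count(self):
        return len(self._unacked)


queue = ToyQueue()
for body in ["msg-1", "msg-2", "msg-3", "msg-4"]:
    queue.publish(body)

# Two hypothetical consumers take turns; messages still leave in FIFO order.
processed = []
for consumer in cycle(["consumer-A", "consumer-B"]):
    delivery = queue.deliver()
    if delivery is None:
        break
    tag, body = delivery
    processed.append((consumer, body))
    queue.ack(tag)

print(processed)
```

Adding a third consumer to the `cycle` spreads the same messages over more workers without changing the producer at all.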
RabbitMQ Topic Exchange
How do producers and consumers know where to send/receive the messages to/from? Introducing the Topic Exchange, which forwards messages to queues according to the routing key provided.
For example, you can have 3 different routing keys bound to separate queues to represent the Create, Update and Delete operations of products in your system. Every time a product is created, updated or deleted, a message can be published into the respective queue using its corresponding routing key. These messages will then be consumed by the consumers who have subscribed to the queue(s) bound to that routing key.
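The routing step can be sketched as a small matcher. The binding keys and queue names below are hypothetical and chosen only for this example. The wildcard rules follow RabbitMQ’s topic exchange semantics, where keys are dot-separated words, `*` stands for exactly one word and `#` for zero or more words.

```python
def topic_matches(binding_key, routing_key):
    """Return True if a routing key satisfies a topic binding key."""
    pattern = binding_key.split(".")
    words = routing_key.split(".")

    def match(p, w):
        if p == len(pattern):
            return w == len(words)
        if pattern[p] == "#":
            # '#' may swallow zero or more of the remaining words.
            return any(match(p + 1, k) for k in range(w, len(words) + 1))
        if w == len(words):
            return False
        if pattern[p] == "*" or pattern[p] == words[w]:
            return match(p + 1, w + 1)
        return False

    return match(0, 0)


# Hypothetical bindings: binding key -> queue name.
BINDINGS = {
    "product.created": "product_created_queue",
    "product.updated": "product_updated_queue",
    "product.deleted": "product_deleted_queue",
    "product.#": "product_all_queue",  # e.g. a consumer that wants every event
}


def route(routing_key):
    """List the queues that would receive a message with this routing key."""
    return sorted(q for key, q in BINDINGS.items() if topic_matches(key, routing_key))


print(route("product.created"))
print(route("order.created"))
```

A message whose routing key matches no binding, like `order.created` here, is not delivered to any of these queues.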
Reliability of published messages
One of the main benefits of using a message broker like RabbitMQ is the reliable delivery of data. Once a message is published to a queue, it can be persisted on the server’s disk (provided the queue is declared durable and the message is published as persistent), and it is kept until the broker receives an acknowledgement that a consumer has successfully processed it. This means a message won’t be lost even in the event of a consumer error or a server restart.
In the event of an application error on the consumer side while processing the message, a negative acknowledgement (nack) can be sent to the broker to requeue the message so it can automatically be consumed again. This is essentially a retry mechanism.
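The ack/nack retry loop can be sketched as follows. This is a toy in-memory version, not RabbitMQ client code: `process_all`, the `max_attempts` cap and the choice to put a requeued message back at the head of the queue are assumptions of this sketch.

```python
from collections import deque


def process_all(messages, handler, max_attempts=3):
    """Consume messages in order, requeueing any that fail (a 'nack')."""
    queue = deque((body, 1) for body in messages)
    acked, dropped = [], []
    while queue:
        body, attempt = queue.popleft()
        try:
            handler(body)
        except Exception:
            if attempt < max_attempts:
                # nack with requeue: the message goes back for another try.
                queue.appendleft((body, attempt + 1))
            else:
                # Assumed cap so a permanently failing message cannot loop forever.
                dropped.append(body)
        else:
            # ack: processing succeeded, the message can be forgotten.
            acked.append(body)
    return acked, dropped


# Hypothetical handler that fails once for one particular message.
failures_left = {"update:2": 1}


def flaky_handler(body):
    if failures_left.get(body, 0) > 0:
        failures_left[body] -= 1
        raise RuntimeError("temporary failure")


acked, dropped = process_all(["create:1", "update:2", "delete:3"], flaky_handler)
print(acked, dropped)
```

In a real system you would usually want some limit or dead-letter handling like the cap above, since a message that always fails would otherwise be retried indefinitely.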
Conventional data indexing flow
The diagram above shows how a system conventionally integrates with Elasticsearch to provide search functionality. Whenever a C_UD (Create/Update/Delete) operation happens, the data will first be saved/modified in the primary data store (e.g. MySQL), then the same piece of data is pushed into Elasticsearch to be indexed for search purposes. With this flow, every creation or modification is reflected almost immediately when users search for it.
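In code, this conventional flow boils down to two writes in a row inside the same request. The sketch below uses in-memory stand-ins for both stores; `InMemoryStore`, `save_product` and `delete_product` are hypothetical names, and a real system would call MySQL and Elasticsearch clients instead.

```python
class InMemoryStore:
    """Stand-in for either the primary database or the search index."""

    def __init__(self):
        self.rows = {}

    def upsert(self, key, doc):
        self.rows[key] = dict(doc)

    def delete(self, key):
        self.rows.pop(key, None)


def save_product(primary_db, search_index, product):
    """Create/Update: write to the primary store, then index the same data."""
    primary_db.upsert(product["id"], product)
    search_index.upsert(product["id"], product)


def delete_product(primary_db, search_index, product_id):
    """Delete: remove from the primary store, then from the search index."""
    primary_db.delete(product_id)
    search_index.delete(product_id)


primary_db = InMemoryStore()
search_index = InMemoryStore()

save_product(primary_db, search_index, {"id": 1, "name": "Feather shuttlecock"})
save_product(primary_db, search_index, {"id": 1, "name": "Feather shuttlecock (12 pcs)"})
print(search_index.rows)

delete_product(primary_db, search_index, 1)
print(search_index.rows)
```

The application code is responsible for keeping both stores in step on every write path.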
However, there are a few problems with this model, which we’ll discuss in part 2 of this series. I’ll also be proposing a more resilient and scalable architecture to index data for search purposes.
Continue reading on part 2 — to be updated