Working at a startup brings a unique set of challenges. With limited resources, you’re constantly tackling issues related to scalability, performance, and product development. As our leader often reminds us, “You cannot always throw cash and solve a problem!”
Embracing frugal engineering principles, we focused on innovative, cost-effective solutions to build a scalable and efficient reporting system, demonstrating that creativity and resourcefulness can drive success even with constraints.
Here’s how we addressed one such challenge in our organization—building a robust reporting architecture to meet client demands.
Context
We had a business requirement for a scalable system to serve reports of up to 150K rows with near real-time data. For reports this heavy, the master DB connection should not be used: pulling that much data through the application increases its heap usage, and the heavy reads can slow down the RDS instance, affecting other processes. We needed a secondary data store to read from and generate reports.
Sample single-tenant DB architecture

Potential Solutions
1. Read Replicas: With a master-slave DB architecture, large reports can be generated by reading from the slave DB, which is kept in near real-time sync with the master DB. All major cloud providers offer an option to create read replicas for existing DB instances.
- Pro: Easy to implement, more DB connections for various use cases.
- Con: Huge increase in infrastructure cost for a single-tenant DB architecture.
2. Data Warehouse: A data warehouse is an enterprise system used for the analysis and reporting of structured and semi-structured data from multiple sources. There are multiple approaches to building a data warehouse, some of which are:
a) AWS Redshift: Use AWS Redshift together with other cloud-managed services, connecting to the data source with Glue and exposing the data in a queryable format through Athena. This approach is easy to implement, AWS removes the hassle of day-to-day maintenance, and data can be synced in near real time. For details on how to implement this, refer here.
- Pro: Easy to implement, connects to a variety of structured and unstructured data sources, near real-time data availability, data retrieval in various formats.
- Con: Running Redshift is very costly.
b) Change Data Capture (CDC): Change data capture is a process that tracks and identifies changes to data in a database and transmits those changes to a downstream system in real time; it is one method of ETL (Extract, Transform, Load). The architecture can be implemented using Debezium connectors and Kafka topics to capture the database binlog changes and dump them into an S3 bucket. Apache Hudi can then aggregate the delta changes so they can be queried through AWS Athena, and the desired data sync frequency can be achieved with various ETL pipelines. For detailed information on the implementation, refer here.
- Pro: Self-hosted solution, desired data sync, customization flexibility, multiple source and sink connectors
- Con: High maintenance architecture, observability challenges, handling DDL operation on the source DB
3. Elasticsearch: Elasticsearch is an open-source distributed search and analytics engine designed for handling large volumes of data. It is built on top of Apache Lucene and maintained by Elastic. Elasticsearch stores, searches, and analyzes structured and unstructured data in near real time, which makes it a strong fit for reporting and data analysis architectures.
In our case we wanted a more frugal architecture, which ruled out the Redshift and read-replica approaches. As for CDC, even though we have a central schema across all our instances, a data warehouse on top of a single-tenant DB architecture is complex to maintain, and handling DDL operations for schema evolution is difficult.
Why We Chose Elasticsearch
Elasticsearch stands out as a powerful, distributed search and analytics engine, primarily due to its ability to deliver real-time results at scale. What makes an Elasticsearch reporting architecture a good fit for our needs is how it manages both in-memory data processing and secondary storage to achieve fast responses and powerful aggregations.
- Efficient Use of RAM: frequently accessed index structures are kept in memory (heap and filesystem cache), so common queries are served quickly.
- Secondary Storage Optimization: data lives on disk in compressed segments with doc values, so the full dataset does not have to fit in RAM.
- Powerful Aggregations: grouping, metrics, and nested aggregations run inside the cluster, close to the data (see the sketch after this list).
- Scalability and Distribution: indices are split into shards and replicated across nodes, so capacity grows horizontally.
- Cost-Effective: a small self-hosted cluster on EC2 covers our reporting workload at a fraction of the cost of the alternatives above.
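To make the aggregations point concrete, here is a minimal sketch (not our production code) of the kind of query this enables, assuming the student-grade index described later in this post and the Java high-level REST client used in the implementation section; the index name and client wiring are illustrative.

import java.io.IOException;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.elasticsearch.search.aggregations.metrics.Avg;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class GradeAggregationSketch {
    // Average grade per course, computed inside the cluster instead of in application memory.
    public void printAverageGradePerCourse(RestHighLevelClient client, String index) throws IOException {
        SearchSourceBuilder source = new SearchSourceBuilder()
                .size(0) // we only need the aggregation buckets, not the individual hits
                .aggregation(AggregationBuilders.terms("by_course").field("course_id")
                        .subAggregation(AggregationBuilders.avg("avg_grade").field("grade")));
        SearchResponse response = client.search(new SearchRequest(index).source(source), RequestOptions.DEFAULT);
        Terms byCourse = response.getAggregations().get("by_course");
        for (Terms.Bucket bucket : byCourse.getBuckets()) {
            Avg avgGrade = bucket.getAggregations().get("avg_grade");
            System.out.println(bucket.getKeyAsString() + " -> " + avgGrade.getValue());
        }
    }
}

The grouping and averaging happen inside the cluster, so the application receives a handful of buckets instead of every matching row.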
Implementation
Reporting with ElasticSearch offers an intriguing engineering solution. Our guide on implementing ElasticSearch in a production environment covers the essential steps and best practices for startup engineering, helping you achieve effective reporting and data management.
ES Infrastructure Setup
We self-hosted Elasticsearch and Kibana on AWS EC2, using a three-node cluster for better performance.
a) Install Java
sudo apt update
sudo apt install openjdk-11-jdk -y
b) Install Elasticsearch on each instance
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
sudo sh -c 'echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" > /etc/apt/sources.list.d/elastic-7.x.list'
sudo apt update
sudo apt install elasticsearch -y
c) Open the Elasticsearch configuration file /etc/elasticsearch/elasticsearch.yml on each instance and configure it
i. Node 1 (Master Node):
cluster.name: my-cluster
node.name: node-1
node.master: true
node.data: true
network.host: 0.0.0.0
discovery.seed_hosts: ["", "", ""]
cluster.initial_master_nodes: ["node-1"]
ii. Node 2 (Data Node):
cluster.name: my-cluster
node.name: node-2
node.master: false
node.data: true
network.host: 0.0.0.0
discovery.seed_hosts: ["", "", ""]
cluster.initial_master_nodes: ["node-1"]
iii. Node 3 (Data Node):
cluster.name: my-cluster
node.name: node-3
node.master: false
node.data: true
network.host: 0.0.0.0
discovery.seed_hosts: ["", "", ""]
cluster.initial_master_nodes: ["node-1"]
d) Enable and start Elasticsearch on each node
sudo systemctl enable elasticsearch
sudo systemctl start elasticsearch
e) Check cluster health from any node
curl -X GET "localhost:9200/_cluster/health?pretty"
f) Install Kibana on one of the instances, typically the one you will use to access the cluster
sudo apt install kibana -y
g) Open the configuration file /etc/kibana/kibana.yml and modify it
server.port: 5601
server.host: "0.0.0.0"
elasticsearch.hosts: ["http://:9200", "http://:9200", "http://:9200"]
h) Enable and start Kibana
sudo systemctl enable kibana
sudo systemctl start kibana
i) Access Kibana
Open a web browser and navigate to http://<EC2-public-ip-of-Kibana-instance>:5601
j) Additionally, for security, we can generate certificates and update the config
sudo /usr/share/elasticsearch/bin/elasticsearch-certutil ca
sudo /usr/share/elasticsearch/bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12
# copy certs
sudo mkdir /etc/elasticsearch/certs
sudo cp /path/to/elastic-certificates.p12 /etc/elasticsearch/certs
sudo chown -R elasticsearch:elasticsearch /etc/elasticsearch/certs
# add the following to /etc/elasticsearch/elasticsearch.yml
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch/certs/elastic-certificates.p12
Sample Use Case Journey
Consider the task of generating student performance reports. We need to retrieve real-time data for approximately 100K students from our production application.
1) Application changes are needed to support ElasticSearch configuration and service calls
a) Add Maven Dependencies
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>7.9.3</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>7.9.3</version>
</dependency>
b) Configuration file
@Configuration
@Component
public class ElasticReportConfiguration {

    private ElasticServiceConfig config;

    @Value("${esReport.userName}")
    private String ES_REPORT_USERNAME;

    @Value("${esReport.password}")
    private String ES_REPORT_PASSWORD;

    @Value("${esReport.url}")
    private String ES_REPORT_URL;

    @PostConstruct
    public void init() {
        ElasticServiceConfig elasticServiceConfig = new ElasticServiceConfig();
        elasticServiceConfig.setUsername(ES_REPORT_USERNAME);
        elasticServiceConfig.setPassword(ES_REPORT_PASSWORD);
        elasticServiceConfig.setUrl(ES_REPORT_URL);
        config = elasticServiceConfig;
    }

    public ElasticServiceConfig getElasticSearchReportConfig(ElasticReportTypeEnum index) {
        if (Objects.isNull(index)) return null;
        config.setIndex(ThreadLocalStorage.getTenantName().toLowerCase() + "_" + index.toString().toLowerCase());
        return config;
    }
}
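The configuration class above only carries connection details; a RestHighLevelClient still has to be built from them. Below is a minimal sketch of how that could look, assuming esReport.url holds just the host name, the cluster listens on port 9200 over HTTP, and basic authentication is enabled; the factory class and the getters on ElasticServiceConfig are illustrative.

import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class ElasticClientFactory {
    // Builds a high-level REST client from the connection details held in ElasticServiceConfig.
    public static RestHighLevelClient buildClient(ElasticServiceConfig config) {
        BasicCredentialsProvider credentials = new BasicCredentialsProvider();
        credentials.setCredentials(AuthScope.ANY,
                new UsernamePasswordCredentials(config.getUsername(), config.getPassword()));
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost(config.getUrl(), 9200, "http")) // assumed host/port/scheme
                        .setHttpClientConfigCallback(http -> http.setDefaultCredentialsProvider(credentials)));
    }
}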
c) Service structure
public interface IElasticReportService {

    /**
     * @implNote Bulk insert documents into the elastic report index
     * @param elasticServiceConfig : config details from ElasticReportConfiguration
     * @param data : data in ReportDocument format
     */
    void insertBulkDocument(ElasticServiceConfig elasticServiceConfig, List<ReportDocument> data);

    /**
     * @implNote Check whether a particular index exists in the ES report instance or not
     * @param elasticServiceConfig : config details, including the tenant-level index name
     * @return true if the index exists
     * @throws IOException
     */
    boolean checkIndexExists(ElasticServiceConfig elasticServiceConfig) throws IOException;

    /**
     * Create an elastic report index
     * @param elasticServiceConfig : config details from ElasticReportConfiguration
     * @param index : report type used to build the tenant-level index name
     * @throws IOException
     */
    void createIndex(ElasticServiceConfig elasticServiceConfig, ElasticReportTypeEnum index) throws IOException;

    /**
     * @implNote Get documents
     * @param elasticServiceConfig : config details, including the tenant-level index name
     * @param query : query to filter the documents
     * @param sort : sort order for the results
     * @param aggregationBuilder : optional aggregation to run on the matching documents
     * @return the raw SearchResponse
     * @throws ParseException
     * @throws IOException
     */
    SearchResponse searchDocument(ElasticServiceConfig elasticServiceConfig, QueryBuilder query,
                                  FieldSortBuilder sort, TermsAggregationBuilder aggregationBuilder) throws ParseException, IOException;

    /**
     * @implNote Delete docs from the ES index
     * @param elasticServiceConfig : config details, including the tenant-level index name
     * @param ids : ids of documents to be deleted
     */
    void deleteElasticDocs(ElasticServiceConfig elasticServiceConfig, List<String> ids);
}
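For reference, here is a hedged sketch of how insertBulkDocument could be implemented with the high-level REST client. It assumes ReportDocument exposes a stable document id and a JSON representation (getDocumentId() and toJson() are illustrative helpers), uses the client factory sketched earlier, and assumes a getIndex() getter on ElasticServiceConfig.

import java.io.IOException;
import java.util.List;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class ElasticReportBulkWriter {
    // Possible shape of insertBulkDocument: one IndexRequest per ReportDocument, sent as a single bulk call.
    public void insertBulkDocument(ElasticServiceConfig elasticServiceConfig, List<ReportDocument> data) {
        // In practice the client would be created once and reused rather than built per call.
        RestHighLevelClient client = ElasticClientFactory.buildClient(elasticServiceConfig);
        BulkRequest bulkRequest = new BulkRequest();
        for (ReportDocument document : data) {
            // Re-using the same document id turns later syncs into updates instead of duplicate inserts.
            bulkRequest.add(new IndexRequest(elasticServiceConfig.getIndex())
                    .id(document.getDocumentId())                    // illustrative helper
                    .source(document.toJson(), XContentType.JSON));  // illustrative helper
        }
        try {
            BulkResponse response = client.bulk(bulkRequest, RequestOptions.DEFAULT);
            if (response.hasFailures()) {
                // Surface partial failures so the sync job can retry them.
                throw new IllegalStateException(response.buildFailureMessage());
            }
        } catch (IOException e) {
            throw new RuntimeException("Bulk insert into " + elasticServiceConfig.getIndex() + " failed", e);
        }
    }
}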
2) Create a holistic data schema mapping file that supports all data points for reporting while avoiding unnecessary data bloat in the index
{
  "mappings": {
    "properties": {
      "student_id": {
        "type": "keyword"
      },
      "student_name": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      },
      "course_id": {
        "type": "keyword"
      },
      "course_name": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      },
      "module_id": {
        "type": "keyword"
      },
      "module_name": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      },
      "session_id": {
        "type": "keyword"
      },
      "session_name": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword"
          }
        }
      },
      "grade": {
        "type": "float"
      },
      "created_at": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      },
      "updated_at": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ssZ"
      }
    }
  }
}
3) Create a tenant-level ES index for data isolation, built dynamically by the application from that schema mapping file,
e.g. {tenant1}_{reportType}. As seen in the config file above, a function can be created to fetch the configuration based on the tenant being referenced.
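A minimal sketch of that per-tenant index creation, assuming the mapping JSON above is bundled as a classpath resource (the resource path and class name are illustrative):

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.client.indices.GetIndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

public class ReportIndexCreator {
    // Creates e.g. "tenant1_student_performance" from the shared mapping file if it does not exist yet.
    public void createIndexIfMissing(RestHighLevelClient client, String tenantIndexName) throws IOException {
        if (client.indices().exists(new GetIndexRequest(tenantIndexName), RequestOptions.DEFAULT)) {
            return; // index already provisioned for this tenant
        }
        // Illustrative classpath location for the mapping JSON shown above.
        try (InputStream mapping = getClass().getResourceAsStream("/elastic/report-mapping.json")) {
            String mappingJson = new String(mapping.readAllBytes(), StandardCharsets.UTF_8);
            CreateIndexRequest request = new CreateIndexRequest(tenantIndexName)
                    .source(mappingJson, XContentType.JSON); // the file already contains the "mappings" key
            client.indices().create(request, RequestOptions.DEFAULT);
        }
    }
}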
4) Once the service structure is in place, the goal is to keep the production data and the ES data in the same state. This requires constantly making asynchronous updates to the ES data whenever a data point referenced by the mapping file changes. ES also supports updating existing documents based on the document_id used during the insert operation (a minimal upsert sketch follows the examples below).
ex:
- if a new student record is added, a new student record with relevant data points is to be sent to ES as a document.
- If a student_name is updated in the application, since it is referenced in the mapping file, we upsert the affected documents in ES.
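For the update path, here is a hedged sketch of a single-document upsert with docAsUpsert, assuming the document id is the student id and only the changed field is sent (the class and method names are illustrative):

import java.io.IOException;
import java.util.Collections;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class ReportUpsertSketch {
    // Applies a partial change (e.g. a renamed student) to an existing document;
    // docAsUpsert creates the document if it does not exist yet.
    public void upsertStudentName(RestHighLevelClient client, String index,
                                  String studentId, String newName) throws IOException {
        UpdateRequest request = new UpdateRequest(index, studentId) // document id assumed to be the student id
                .doc(Collections.<String, Object>singletonMap("student_name", newName))
                .docAsUpsert(true);
        client.update(request, RequestOptions.DEFAULT);
    }
}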
Additionally, a small Spring application can be created to fetch the data from the ES instance based on the user's selection, generate an Excel report, upload it to AWS S3, and email the user a link to the generated report.
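As a sketch of how that report application might pull a large result set, the example below pages through the index with search_after, since a single search request cannot return more than the default 10,000-document result window; the filter on course_id, the sort field, and the class name are illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortOrder;

public class ReportExportSketch {
    // Pages through all matching documents with search_after, so even 150K-row reports
    // never hit the default 10,000-document result window.
    public List<Map<String, Object>> fetchCourseRows(RestHighLevelClient client, String index,
                                                     String courseId) throws IOException {
        List<Map<String, Object>> rows = new ArrayList<>();
        Object[] searchAfter = null;
        while (true) {
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.termQuery("course_id", courseId))
                    .sort("student_id", SortOrder.ASC) // keyword field gives a stable sort key
                    .size(1000);
            if (searchAfter != null) {
                source.searchAfter(searchAfter);
            }
            SearchResponse response = client.search(new SearchRequest(index).source(source), RequestOptions.DEFAULT);
            SearchHit[] hits = response.getHits().getHits();
            if (hits.length == 0) {
                break; // no more pages
            }
            for (SearchHit hit : hits) {
                rows.add(hit.getSourceAsMap());
            }
            searchAfter = hits[hits.length - 1].getSortValues();
        }
        return rows;
    }
}

From these rows the Excel file can be built (for example with Apache POI), uploaded to S3, and the link emailed to the user as described above.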
Results
Previously, our report generation was limited to less than 50K records, with a 24-hour data delay and a generation time of 15 minutes. After implementing ElasticSearch, we achieved:
- Generation of 1.8 million rows in under 5 minutes.
- Data upsert time of around 50–100ms and retrieval times of less than 250ms.
This significant performance improvement underscores the effectiveness of ElasticSearch in our reporting architecture, especially within the constraints of a startup environment.
