Elasticsearch: Distributed Search Engine, Vector Search, and the OpenSearch Fork

Elasticsearch is a distributed search and analytics engine built on Apache Lucene. First released as open source by Shay Banon in 2010, it has become one of the most widely adopted search technologies, powering search for organizations ranging from small startups to large enterprises, including Wikipedia, GitHub, Netflix, and Uber. It is developed and maintained by Elastic, the company behind the broader Elastic Stack (formerly the ELK Stack), which also includes Kibana for visualization, Logstash for data processing, and Beats for data collection.

At its core, Elasticsearch stores data as JSON documents and indexes them in inverted indexes, the structure that makes full-text search fast. When a document is indexed, Elasticsearch analyzes its text fields: a tokenizer splits the text into tokens, and filters normalize them (for example, by lowercasing). The resulting terms are stored in an inverted index that maps each term to the documents containing it, which enables sub-second search across millions or even billions of documents.
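
The analyze-then-index flow described above can be sketched in a few lines of Python. This is a toy model, not how Lucene is implemented: the `analyze` function, the sample documents, and the AND-only search are illustrative assumptions, and real analyzers add steps such as stemming and stop-word handling.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase, then split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search_all_terms(index, query):
    """Return IDs of documents containing every query term (AND semantics)."""
    terms = analyze(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "Elasticsearch is built on Apache Lucene",
    2: "Lucene stores terms in an inverted index",
    3: "Kibana visualizes data stored in Elasticsearch",
}
index = build_inverted_index(docs)
print(index["lucene"])                                   # IDs of docs containing "lucene"
print(search_all_terms(index, "Elasticsearch Lucene"))   # docs containing both terms
```

The lookup cost depends on the number of query terms and the length of their posting lists rather than on the total number of documents, which is the property that keeps search fast as the collection grows.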

Elasticsearch is distributed by design. Each index is divided into primary shards, which are spread across the nodes in a cluster, and each primary shard can have one or more replica copies for redundancy and read throughput. This architecture allows Elasticsearch to scale horizontally -- when more capacity is needed, additional nodes can be added to the cluster, and shards are automatically rebalanced across them. This makes it suitable for workloads ranging from a few gigabytes to petabytes of data.
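
To illustrate how a document ends up on a particular shard, here is a minimal sketch of hash-based routing. It follows the simplified formula shard = hash(routing key) mod number of primary shards. The CRC32 hash is a stand-in chosen only because it ships with Python (Elasticsearch uses a Murmur3 hash and additional routing factors), and the document IDs and shard count are made up for the example.

```python
import zlib
from collections import Counter

def route_to_shard(routing_key, num_primary_shards):
    """Pick a primary shard for a routing key (by default, the document ID).

    Assumption: CRC32 is a stand-in hash for illustration only; it is not
    the hash function Elasticsearch uses.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

# Hypothetical document IDs spread over an index with 5 primary shards.
doc_ids = [f"doc-{i}" for i in range(1000)]
placement = Counter(route_to_shard(doc_id, 5) for doc_id in doc_ids)

for shard, count in sorted(placement.items()):
    print(f"shard {shard}: {count} documents")
```

Because the shard number depends on the modulus, changing the number of primary shards would send existing IDs to different shards. This is why the primary shard count is fixed when an index is created, while the number of replicas can be changed at any time.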

The Query DSL (Domain-Specific Language) is one of Elasticsearch's most powerful features. It is a JSON-based query language that supports full-text search, structured queries, fuzzy matching, wildcard searches, range queries, and boolean combinations of these. Queries can be combined and nested to express complex search logic. Relevance scoring, based on BM25 by default, measures how well each document matches a query and ranks results accordingly.
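
As a small illustration of how queries nest, the sketch below builds a `bool` query as a Python dictionary and serializes it to JSON. The `match`, `term`, `range`, and `bool` query types are part of the Query DSL; the field names (`title`, `status`, `published`) and the helper function are hypothetical and would need to match a real index mapping.

```python
import json

def build_search_body(text, status, published_since, size=10):
    """Build a search request body combining scored and filter clauses.

    Field names are assumptions for the example, not a real schema.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [
                    {"match": {"title": {"query": text, "fuzziness": "AUTO"}}}
                ],
                # "filter" clauses only include or exclude documents;
                # they do not affect scoring.
                "filter": [
                    {"term": {"status": status}},
                    {"range": {"published": {"gte": published_since}}},
                ],
            }
        },
    }

# The misspelling is deliberate: fuzziness lets the match tolerate it.
body = build_search_body("distributed serach", "published", "2020-01-01")
print(json.dumps(body, indent=2))
```

Putting exact-match conditions in `filter` rather than `must` is a common choice because those clauses narrow the result set without changing the ranking.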

Beyond basic search, Elasticsearch excels at aggregations -- analytical operations that summarize and group data in real time. Aggregations can compute metrics like averages, sums, and percentiles, create histograms and date ranges, perform geospatial analysis, and build complex nested groupings. This capability makes Elasticsearch valuable not just for search but also for analytics dashboards, log analysis, and business intelligence applications.
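
The sketch below shows what a bucket-plus-metric aggregation looks like as a request body, together with a pure-Python function that computes the same kind of result over a handful of in-memory records. The field names (`service`, `latency_ms`) and the sample data are assumptions, `service` is assumed to be a keyword-mapped field, and the local function only mimics the shape of the result rather than how Elasticsearch computes it across shards.

```python
from collections import defaultdict

# Hypothetical request body: group documents by "service" and compute the
# average "latency_ms" in each group. "size": 0 omits the matching hits.
agg_body = {
    "size": 0,
    "aggs": {
        "by_service": {
            "terms": {"field": "service"},
            "aggs": {"avg_latency": {"avg": {"field": "latency_ms"}}},
        }
    },
}

def terms_with_avg(docs, bucket_field, metric_field):
    """Local stand-in for a terms aggregation with an avg sub-aggregation."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[bucket_field]].append(doc[metric_field])
    buckets = [
        {"key": key, "doc_count": len(vals), "avg": sum(vals) / len(vals)}
        for key, vals in groups.items()
    ]
    # Largest buckets first; ties broken by key (an assumption of this sketch).
    return sorted(buckets, key=lambda b: (-b["doc_count"], b["key"]))

# Made-up log records for illustration.
logs = [
    {"service": "api", "latency_ms": 120},
    {"service": "api", "latency_ms": 80},
    {"service": "web", "latency_ms": 200},
]
for bucket in terms_with_avg(logs, "service", "latency_ms"):
    print(bucket)
```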

The REST API is the primary interface for interacting with Elasticsearch. All operations -- indexing documents, executing searches, managing cluster settings, and monitoring health -- are performed through HTTP requests with JSON payloads. This makes Elasticsearch accessible from virtually any programming language or platform. Official client libraries are available for Java, Python, JavaScript, Go, Ruby, PHP, and several other languages.
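
To make the HTTP-plus-JSON model concrete, here is a minimal sketch that builds requests with Python's standard library instead of an official client. The base URL, the index name `articles`, and the helper functions are assumptions: the example presumes an unsecured node on `localhost:9200`, whereas recent default installations require HTTPS and authentication. The `send` helper is defined but not called, so the snippet runs without a cluster.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node without security

def build_request(method, path, body=None):
    """Create an HTTP request with an optional JSON payload."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )

def send(request):
    """Send the request and decode the JSON response (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Index a document, search for it, and check cluster health.
index_req = build_request("PUT", "/articles/_doc/1", {"title": "Hello"})
search_req = build_request(
    "POST", "/articles/_search", {"query": {"match": {"title": "hello"}}}
)
health_req = build_request("GET", "/_cluster/health")

for req in (index_req, search_req, health_req):
    print(req.get_method(), req.full_url)
```

In practice the official client libraries wrap these same endpoints and add connection pooling, retries, and authentication handling.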

One of the most common use cases for Elasticsearch is log and event data analysis. The Elastic Stack provides a complete pipeline for collecting logs from applications and infrastructure, processing and enriching them, storing them in Elasticsearch, and visualizing patterns and anomalies in Kibana. This has made Elasticsearch a standard tool for observability, security analytics, and operational monitoring across the industry.

Elasticsearch also supports vector search and approximate nearest neighbor (ANN) algorithms, which are essential for modern AI-powered applications. By storing dense vector embeddings alongside traditional text, Elasticsearch can perform semantic search that understands meaning rather than just matching keywords. This capability is increasingly important for building retrieval-augmented generation (RAG) systems, recommendation engines, and other machine learning applications.
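
The idea behind vector search can be shown with an exact, brute-force nearest neighbor search using cosine similarity. This is a sketch of the problem that ANN algorithms approximate, not of Elasticsearch's implementation, which relies on a graph-based index (HNSW) to avoid comparing the query against every vector. The three-dimensional embeddings and document names are made up; real embeddings typically have hundreds of dimensions and come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def exact_knn(query_vector, vectors, k):
    """Return the IDs of the k vectors most similar to the query.

    Compares against every stored vector, so the cost grows linearly with
    the collection size. ANN indexes trade some accuracy to avoid this.
    """
    scored = [
        (cosine_similarity(query_vector, vec), doc_id)
        for doc_id, vec in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical embeddings keyed by document ID.
embeddings = {
    "doc-cats": [0.9, 0.1, 0.0],
    "doc-dogs": [0.8, 0.3, 0.1],
    "doc-tax": [0.0, 0.2, 0.9],
}
nearest = exact_knn([1.0, 0.0, 0.0], embeddings, k=2)
print(nearest)
```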

In 2021, Elastic changed the license for Elasticsearch from the permissive Apache 2.0 license to a dual-license model under the more restrictive Server Side Public License (SSPL) and the Elastic License, in response to cloud providers offering managed Elasticsearch services. This led Amazon to fork the project under the Apache 2.0 license as OpenSearch. As a result, organizations now have a choice between the Elastic-maintained Elasticsearch and the community-maintained OpenSearch fork, which continue to evolve independently. The fork illustrates a broader pattern: when a single company tightens control over widely adopted open-source infrastructure, the community can and does reassert its independence. Despite this split, Elasticsearch remains one of the most capable and widely deployed search engines available, with a rich ecosystem of tools, integrations, and community support.

Elasticsearch, SaaS, Search

2020-03-17