When we index a document in Elasticsearch, its text values pass through an analysis process. In this article we'll cover what happens in this process and how Elasticsearch's standard analyzer works.
Introduction to analysis
The main objective of analysis is to store documents in a way that makes them efficient to search. It happens at the moment we index a document in Elasticsearch, and it uses three mechanisms to do so:
- Character filter
- Tokenizer
- Token filter
Character filters
The first step consists of receiving the full text and adding, removing or changing characters. For example, we can remove HTML tags:
Input: <p>I <strong>REALLY</strong> love to go hiking!</p>
Result: I REALLY love to go hiking!
An analyzer may contain zero or more character filters, and the result of the operation is passed to the tokenizer.
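To see a character filter in action on its own, we can use the analyze API (covered in more detail later in this article) with the built-in html_strip character filter. A minimal sketch follows; the keyword tokenizer is used here only so that the filtered text comes back as a single token:
POST /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "keyword",
  "text": "<p>I <strong>REALLY</strong> love to go hiking!</p>"
}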
Tokenizer
Unlike character filters, an analyzer must contain exactly one tokenizer, whose responsibility is to split a string into tokens. In this process, some characters, such as punctuation, may be removed from the text. An example of this would be:
Input: I REALLY love to go hiking!
Result: "I", "REALLY", "love", "to", "go", "hiking"
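We can also try the tokenizer in isolation with the analyze API (introduced further down) by specifying only a tokenizer and no filters; the text comes back split into tokens but otherwise untouched, so "REALLY" keeps its uppercase letters:
POST /_analyze
{
  "tokenizer": "standard",
  "text": "I REALLY love to go hiking!"
}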
Token filters
The token filters receive the tokens produced by the tokenizer and operate on them, changing, adding or removing tokens. A simple example is the lowercase filter:
Input: "I", "REALLY", "love", "to", "go", "hiking"
Result: "i", "really", "love", "to", "go", "hiking"
An analyzer may also have zero or more token filters.
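Token filters can also be chained. As a small sketch using the analyze API (introduced below), we can combine the lowercase filter with the built-in stop filter, which removes common stop words such as "to":
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "I REALLY love to go hiking!"
}
With the default English stop word list, the result should be "i", "really", "love", "go", "hiking".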
For more examples of built-in character filters, tokenizers and token filters, we can check the official documentation.
Elasticsearch's standard analyzer consists of:
- No character filters
- The standard tokenizer
- The lowercase token filter and an optional stop words token filter, which is disabled by default (see the sketch below).
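If we want the stop words filter enabled, one option is to configure the standard analyzer in the index settings. A minimal sketch, where the index name my_index and the analyzer name std_english are just placeholders:
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}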
Using the analyze API
Elasticsearch provides us with a way of visualizing how a string gets analyzed: the analyze API. To use it, we just need to send a POST request to the /_analyze endpoint with a "text" parameter. Let's try it out!
POST /_analyze
{
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. :)"
}
In the response, we can see the generated tokens:
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
In this API, we can also specify the character filters, the tokenizer and the token filters that we want to use. We would have gotten the same result if we had made the request like this:
POST /_analyze
{
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. :)",
  "char_filter": [],
  "tokenizer": "standard",
  "filter": ["lowercase"]
}
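Since these are exactly the components of the standard analyzer, we can also simply reference it by name:
POST /_analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. :)"
}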