Unraveling Elasticsearch queries
How to create an "intelligent" search
Who am I?
@guilhermeguitte
Leroy Merlin Brasil.
Co-organizer of the Laravel Meetup in São Paulo.
Software Developer.
Scrum Master.
http://www.guitte.org
Before digging into Elasticsearch...
What is "Elasticsearch"?
Real-Time Data
Real-Time Advanced Analytics
Massively Distributed
High Availability
Multitenancy
Full-Text Search
Document-Oriented
Schema-Free
Developer-Friendly, RESTful API
Per-Operation Persistence
Built on top of Apache Lucene™
What is an "index"?
It's like a database in a traditional relational database system.
GET http://localhost:9200/web/orders/_search
index: "web"
What is a "type"?
GET http://localhost:9200/web/orders/_search
type: "orders"
What is an "inverted index"?
It's like...
What have you learned?
The basic jargon of Elasticsearch.
What an index is.
What a type is.
What Elasticsearch is.
Now, with the basic jargon of Elasticsearch...
Be ready!
Queries
Basic structure
{ "query": {}}
GET http://localhost:9200/web/orders/_search
Structured search
"Finding documents that exactly match the query"
For each document, the answer is "YES" or "NO".
SELECT *
FROM orders
WHERE status = 'received'
{ "query": { "term" : { "status": "received" } }}
GET http://localhost:9200/web/orders/_search
{ "took": 24, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3041, "max_score": 2.4266458, "hits": [ ... ] }}
{ "query": { "constant_score" : { "filter": { "term" : { "status": "received" } } } }}
GET http://localhost:9200/web/orders/_search
{ "took": 7, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3041, "max_score": 1, "hits": [ ... ] }}
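The two request bodies above differ only in the wrapper around the term query. A minimal Python sketch of building both is shown here; the helper names `term_query` and `constant_score` are hypothetical and not part of any Elasticsearch client library.

```python
import json

def term_query(field, value):
    """Build a search body with a plain (scoring) term query."""
    return {"query": {"term": {field: value}}}

def constant_score(body):
    """Wrap the query of an existing body in constant_score, so it runs as a filter."""
    return {"query": {"constant_score": {"filter": body["query"]}}}

scored = term_query("status", "received")
filtered = constant_score(scored)

# Either body could be sent to the _search endpoint shown on the slides.
print(json.dumps(scored))
print(json.dumps(filtered))
```

In the responses above, the wrapped version reports a "max_score" of 1 for the same 3041 hits.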
That was a simple start, but the real world is not that simple.
Get used to "bool" queries
{ "query": { "bool" : { "must" : [], "should" : [], "must_not" : [], "filter": [] } }}
GET http://localhost:9200/web/orders/_search
{ "query": { "bool" : { "must": { "term": { "status": "received" } } } }}
GET http://localhost:9200/web/orders/_search
{ "took": 24, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3041, "max_score": 2.4266458, "hits": [ ... ] }}
{ "query": { "constant_score": { "filter": { "bool" : { "must": { "term": { "status": "received" } } } } } }}
GET http://localhost:9200/web/orders/_search
{ "took": 12, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3041, "max_score": 1, "hits": [ ... ] }}
"Bool" structure is very flexible
{ "query": { "constant_score": { "filter": { "bool" : { "should": [ { "term": { "status": "received" } }, { "bool": { "must": { "term": { "customer": "Prof. Shaylee Greenholt" } } } } ] } } } }}
GET http://localhost:9200/web/orders/_search
{ "took": 12, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 3041, "max_score": 1, "hits": [ ... ] }}
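To make the "bool" semantics concrete, here is a minimal in-memory sketch in Python. It is a simplified model, not how Elasticsearch evaluates queries internally: it only supports "term" and "bool", treats every field as a single exact value, and ignores scoring. The function names and the three sample documents are assumptions made for illustration.

```python
def as_list(clauses):
    """A bool clause may hold a single query or a list of queries."""
    return clauses if isinstance(clauses, list) else [clauses]

def matches(query, doc):
    """Return True if doc matches a (simplified) term or bool query."""
    kind, body = next(iter(query.items()))
    if kind == "term":
        field, value = next(iter(body.items()))
        return doc.get(field) == value
    if kind == "bool":
        required = as_list(body.get("must", [])) + as_list(body.get("filter", []))
        optional = as_list(body.get("should", []))
        excluded = as_list(body.get("must_not", []))
        if not all(matches(q, doc) for q in required):
            return False
        if any(matches(q, doc) for q in excluded):
            return False
        # Without must/filter clauses, at least one should clause has to match.
        if optional and not required:
            return any(matches(q, doc) for q in optional)
        return True
    raise ValueError(f"unsupported query type: {kind}")

# The bool part of the nested query above (inside constant_score/filter).
query = {"bool": {"should": [
    {"term": {"status": "received"}},
    {"bool": {"must": {"term": {"customer": "Prof. Shaylee Greenholt"}}}},
]}}

docs = [
    {"status": "received", "customer": "John Upton"},
    {"status": "delivered", "customer": "Prof. Shaylee Greenholt"},
    {"status": "delivered", "customer": "John Borer"},
]
results = [matches(query, d) for d in docs]
print(results)
```

The first document matches through the status clause, the second through the nested bool, and the third matches neither "should" clause.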
What types of queries does Elasticsearch have?
Term
Terms
Range
Exists
Missing
Prefix
Wildcard
Regexp
Fuzzy
Type
Ids
{ "term": { "status": "received" }}
{ "terms": { "status": ["received", "delivering"] }}
{ "range": { "total": {"gte": 100.5, "lte": 140.5 }}}
{ "exists": { "field": "region" }}
{ "missing": { "field": "region" }}
{ "prefix": { "customer": "Dolly" }}
{ "wildcard": { "customer": "Doll*" }}
{ "regexp": { "customer": { "value": "Doll.*" }}}
{ "fuzzy": { "customer": { "value": "Dolly", "fuzziness": 2 }}}
{ "type": { "value": "orders"}}
{ "ids": { "type": "orders", "values": ["1", "2"]}}
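What several of these queries check can be sketched in plain Python. This is only an approximation under stated assumptions: each field holds one exact, untokenized value, wildcards support just "*" and "?", and the sample document is made up. `leaf_matches` and `wildcard_to_regex` are hypothetical helper names.

```python
import operator
import re

RANGE_OPS = {"gt": operator.gt, "gte": operator.ge, "lt": operator.lt, "lte": operator.le}

def wildcard_to_regex(pattern):
    """Translate * (any sequence) and ? (any single character) into a regex."""
    return "".join(".*" if c == "*" else "." if c == "?" else re.escape(c) for c in pattern)

def leaf_matches(query, doc):
    """Return True if doc satisfies one structured query (simplified)."""
    kind, body = next(iter(query.items()))
    if kind == "exists":
        return doc.get(body["field"]) is not None
    if kind == "missing":
        return doc.get(body["field"]) is None
    field, arg = next(iter(body.items()))
    value = doc.get(field)
    if value is None:
        return False
    if kind == "term":
        return value == arg
    if kind == "terms":
        return value in arg
    if kind == "range":
        return all(RANGE_OPS[op](value, bound) for op, bound in arg.items())
    if kind == "prefix":
        return value.startswith(arg)
    if kind == "wildcard":
        return re.fullmatch(wildcard_to_regex(arg), value) is not None
    raise ValueError(f"unsupported query type: {kind}")

# Hypothetical order document, used only for this example.
doc = {"status": "received", "total": 120.0, "customer": "Dolly Borer"}

print(leaf_matches({"range": {"total": {"gte": 100.5, "lte": 140.5}}}, doc))
print(leaf_matches({"wildcard": {"customer": "Doll*"}}, doc))
```

Because the document has no "region" field, the "exists" check fails and the "missing" check succeeds.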
Any of these queries can contribute to scoring, but when they are all wrapped in "constant_score", no relevance score is calculated and every hit gets the same constant score.
Structured queries are good for:
Filtering documents before running the queries that score them.
They are fast because Elasticsearch can cache them and reuse them over time.
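The idea of caching a filter and reusing it can be illustrated with a toy model in Python. This is only a sketch of the concept: Elasticsearch's real cache works differently internally, and the `FilterCache` class and its sample documents are assumptions made for this example.

```python
class FilterCache:
    """Toy cache: remembers which document ids matched a term filter."""

    def __init__(self, docs):
        self.docs = docs    # doc id -> document
        self.cache = {}     # (field, value) -> frozenset of matching doc ids
        self.scans = 0      # number of times all documents had to be scanned

    def term_filter(self, field, value):
        key = (field, value)
        if key not in self.cache:
            # First use of this filter: scan every document once and remember the result.
            self.scans += 1
            self.cache[key] = frozenset(
                doc_id for doc_id, doc in self.docs.items() if doc.get(field) == value
            )
        return self.cache[key]

orders = FilterCache({
    1: {"status": "received"},
    2: {"status": "delivered"},
    3: {"status": "received"},
})

first = orders.term_filter("status", "received")
second = orders.term_filter("status", "received")  # answered from the cache
print(sorted(first), orders.scans)
```

The second call returns the remembered set of ids without looking at the documents again.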
What have we learned?
Structured search
A document either matches or it doesn't.
Use bool queries.
Filter documents before running a full-text search.
Full-text search
Two different aspects of full-text search
Relevance
"How well each document matches the query"
TF/IDF
(Term Frequency/Inverse Document Frequency)
Proximity to a geolocation
Fuzzy similarity
...
_score and max_score
The simplest query
{ "query": { "match": { "customer": "John" } }}
{ "_score": 4.6189003, "customer": "John Upton" }, { "_score": 4.6189003, "customer": "John Borer" }, { "_score": 4.6189003, "customer": "John Emard" }, { "_score": 4.06103, "customer": "John Runolfsdottir IV" }, { "_score": 3.8275056, "customer": "Mr. John Cartwright III" }, { "_score": 3.8275056, "customer": "John Hodkiewicz DDS" }
GET http://localhost:9200/web/orders/_search
{ "query": { "match": { "customer": "Joh" } }}
{ "took": 15, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 0, "max_score": null, "hits": [] }}
GET http://localhost:9200/web/orders/_search
Elasticsearch persists data differently from what you are accustomed to.
To understand full-text search in Elasticsearch, you first need to understand how Elasticsearch persists your data.
It's called "Analysis"
It is a pipeline with these steps:
Create the mapping ("schema") for the web index, if it does not exist yet.
Receive the document from the Index API.
Iterate over the fields and check whether each field is analyzed.
If it is, run the analyzer for that field.
Persist the data.
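The "run the analyzer" step can be sketched in a few lines of Python. This assumes the analyzer used on the following slides (char filter html_strip, tokenizer whitespace, token filter lowercase) and applies it to the customer value of the example document. The function names are hypothetical, and `html_strip` here is a crude stand-in for the real char filter.

```python
import html
import re

def html_strip(text):
    """Char filter: crude approximation that drops tags and decodes entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))

def whitespace_tokenizer(text):
    """Tokenizer: split the string on whitespace only, keeping punctuation."""
    return text.split()

def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]

def analyze(text):
    """Run char filter, then tokenizer, then token filter."""
    return lowercase_filter(whitespace_tokenizer(html_strip(text)))

tokens = analyze("Dr. Emiliano, the Mitchell Sr.")
print(tokens)  # ['dr.', 'emiliano,', 'the', 'mitchell', 'sr.']
```

Note that the whitespace tokenizer keeps punctuation, so the stored token is "emiliano," with the comma attached.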
The document
{ "customer": "Dr. Emiliano, the Mitchell Sr.", "items": [ { "id": 8510629, "value": 769 } ], "total": 2874.81, "created_at": "15/11/2015 17:00:57", "status": "delivered", "region": "sao_paulo", "shipping_fees": 719}
"customer": "Dr. Emiliano, the Mitchell Sr."
Analyzer:
Tokenizer: whitespace
Token filter: lowercase
Char Filter: html_strip
"Dr. Emiliano, the Mitchell Sr."
Char filter (html_strip)
string in: "Dr. Emiliano, the Mitchell Sr."
string out: "Dr. Emiliano, the Mitchell Sr."
Tokenizer (whitespace)
tokens: "Dr." "Emiliano," "the" "Mitchell" "Sr."
Token filter (lowercase)
"Dr." -> "dr."
"Emiliano," -> "emiliano,"
"the" -> "the"
"Mitchell" -> "mitchell"
"Sr." -> "sr."
Persist it
Token          Doc ID
"dr."          1
"emiliano,"    1
"the"          1
"mitchell"     1
"sr."          1
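The token-to-document table is an inverted index. A minimal Python sketch of building one follows; it assumes a simplified analyzer (whitespace split plus lowercase), and document id 2 with the customer "John Upton" is added only to show a second entry. `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

docs = {
    1: "Dr. Emiliano, the Mitchell Sr.",
    2: "John Upton",  # hypothetical second document
}
index = build_inverted_index(docs)

for token, doc_ids in index.items():
    print(token, doc_ids)
```

Looking a token up is then a single dictionary access instead of a scan over every document.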
TF/IDF (Term Frequency/Inverse Document Frequency) is what generates the score.
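As a rough illustration of the idea, here is the textbook form of TF/IDF in Python: term frequency multiplied by the logarithm of (number of documents / documents containing the term). This is not the exact formula Lucene uses, which adds normalization and other factors, and the three tokenized documents are assumptions made for the example.

```python
import math

def tf_idf(term, doc_tokens, all_docs_tokens):
    """Textbook TF/IDF: raw term count times log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in all_docs_tokens if term in tokens)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs_tokens) / df)

docs = [
    ["john", "upton"],
    ["john", "borer"],
    ["dr.", "emiliano,", "the", "mitchell", "sr."],
]

rare = tf_idf("mitchell", docs[2], docs)   # appears in 1 of 3 documents
common = tf_idf("john", docs[0], docs)     # appears in 2 of 3 documents
print(rare, common)
```

A term that appears in fewer documents gets a higher weight, which is why "mitchell" scores above "john" here.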
{ "query": { "match": { "customer": "Mitchell" } }}
GET http://localhost:9200/web/orders/_search
The term
"Mitchell"
The same analyzer runs on the search term (char filter: html_strip, tokenizer: whitespace, token filter: lowercase):
"Mitchell" -> "mitchell"
The analyzed term "mitchell" is looked up among the persisted tokens:
Token          Doc ID
"dr."          1
"emiliano,"    1
"the"          1
"mitchell"     1    <- "mitchell"
"sr."          1
What types of queries does Elasticsearch have?
Match
multi_match
...
{ "match": { "name": "John" }}
{ "multi_match": { "query": "John", "fields": ["name.raw", "name.autocomplete"]}}
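How a match query finds documents can be sketched by combining analysis with an inverted index lookup. This Python sketch assumes a simplified analyzer (whitespace split plus lowercase, no html_strip) and a single searched field. The function names and document id 2 are hypothetical.

```python
def analyze(text):
    """Simplified analyzer: whitespace tokenizer plus lowercase token filter."""
    return text.lower().split()

def build_index(docs):
    """Build token -> set of doc ids from the analyzed field values."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def match(index, query_text):
    """Analyze the query text and return ids of documents containing any of its tokens."""
    hits = set()
    for token in analyze(query_text):
        hits |= index.get(token, set())
    return sorted(hits)

docs = {1: "Dr. Emiliano, the Mitchell Sr.", 2: "John Upton"}
index = build_index(docs)

print(match(index, "Mitchell"))   # [1]
print(match(index, "Joh"))        # []
print(match(index, "Emiliano"))   # []
```

"Joh" finds nothing because only whole tokens are stored, and "Emiliano" finds nothing because the whitespace tokenizer stored "emiliano," with the comma.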
What have we learned?
Understanding the analysis process is a must for understanding how to search.
Analyzers are composed of char filters, a tokenizer and token filters.
You need to understand how users will search in order to write the right query for Elasticsearch.
References
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
https://www.youtube.com/playlist?list=PLZ4puV97Zwm2fEmTLrPsP7QgLsjnnQggX
Official PHP elasticsearch package: https://github.com/elastic/elasticsearch-php
https://github.com/sleimanx2/plastic
Thanks!
@guilhermeguitte
http://www.guitte.org