Unraveling queries in Elasticsearch - v2

Unraveling queries in Elasticsearch: how to build a smarter search

Who am I? @guilhermeguitte
• Leroy Merlin Brasil.
• Co-organizer of the Laravel Meetup in São Paulo.
• Software Developer.
• Scrum Master.
http://www.guitte.org

Before going into the details of Elasticsearch...

What is "Elasticsearch"?

• Real-Time Data
• Real-Time Advanced Analytics
• Massively Distributed
• High Availability
• Multitenancy
• Full-Text Search
• Document-Oriented
• Schema-Free
• Developer-Friendly, RESTful API
• Per-Operation Persistence
• Built on top of Apache Lucene™

What is an "index"?

It is roughly the equivalent of a "database" in the relational database world.

GET http://localhost:9200/web/orders/_search
In this URL, "web" is the index.

What is a "type"?

GET http://localhost:9200/web/orders/_search
In this URL, "orders" is the type.

What is an "inverted index"?

It is like...

What is "a document"?

The document:
{ "customer": "Dr. Emiliano, the Mitchell Sr.", "items": [ { "id": 8510629, "value": 769 } ], "total": 2874.81, "created_at": "15/11/2015 17:00:57", "status": "delivered", "region": "sao_paulo", "shipping_fees": 719 }

What have we learned?
• A basic Elasticsearch vocabulary.
• What an "index" is.
• What a "type" is.
• What "Elasticsearch" is.
• What a document is.

Now, with some basic knowledge of Elasticsearch...

Be prepared!

Queries

Basic structure

GET http://localhost:9200/web/orders/_search
{ "query": {} }

Structured search

"Search for the documents that exactly match the query criteria"

For each document, the answer is either "YES" or "NO".

SELECT * FROM orders WHERE status = "received"

GET http://localhost:9200/web/orders/_search
{ "query": { "term" : { "status": "received" } } }
Response: { "took": 24, ... "hits": { "total": 3041, "max_score": 2.4266458, "hits": [ ... ] } }

GET http://localhost:9200/web/orders/_search
{ "query": { "constant_score" : { "filter": { "term" : { "status": "received" } } } } }
Response: { "took": 7, "timed_out": false, ... "hits": { "total": 3041, "max_score": 1, "hits": [ ... ] } }

That is a simple starting point, but real-world queries are not that simple.

Get used to "bool" queries

GET http://localhost:9200/web/orders/_search
{ "query": { "bool" : { "must" : [], "should" : [], "must_not" : [], "filter": [] } } }

GET http://localhost:9200/web/orders/_search
{ "query": { "bool" : { "must": { "term": { "status": "received" } } } } }
Response: { "took": 24, ... "hits": { "total": 3041, "max_score": 2.4266458, "hits": [ ... ] } }

GET http://localhost:9200/web/orders/_search
{ "query": { "constant_score": { "filter": { "bool" : { "must": { "term": { "status": "received" } } } } } } }
Response: { "took": 24, ... "hits": { "total": 3041, "max_score": 1, "hits": [ ... ] } }

The "bool" structure is very flexible
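To make the semantics of "must", "should", "must_not" and "filter" concrete, here is a minimal Python sketch that evaluates a tiny subset of the query DSL against in-memory dictionaries. It is only an illustration, not how Elasticsearch executes queries: the function name `matches` and the sample orders are hypothetical, every field is treated as not analyzed, and scoring is ignored.

```python
def matches(query, doc):
    """Return True if doc satisfies a toy subset of the query DSL (term + bool)."""
    (kind, body), = query.items()  # each query object has exactly one key
    if kind == "term":
        (field, value), = body.items()
        return doc.get(field) == value
    if kind == "bool":
        def clauses(name):
            # A clause may be written as a single object or as a list of objects.
            value = body.get(name, [])
            return value if isinstance(value, list) else [value]

        required = clauses("must") + clauses("filter")
        optional = clauses("should")
        excluded = clauses("must_not")
        if not all(matches(q, doc) for q in required):
            return False
        if any(matches(q, doc) for q in excluded):
            return False
        # With no must/filter clauses, at least one should clause has to match.
        if optional and not required:
            return any(matches(q, doc) for q in optional)
        return True
    raise ValueError(f"unsupported query type: {kind}")


# Hypothetical sample data, not taken from the real "web" index.
orders = [
    {"status": "received", "customer": "Prof. Shaylee Greenholt"},
    {"status": "delivered", "customer": "Prof. Shaylee Greenholt"},
    {"status": "delivered", "customer": "John Upton"},
]

# status = "received" OR customer = "Prof. Shaylee Greenholt"
query = {"bool": {"should": [
    {"term": {"status": "received"}},
    {"bool": {"must": {"term": {"customer": "Prof. Shaylee Greenholt"}}}},
]}}

hits = [doc for doc in orders if matches(query, doc)]
print(len(hits))
```

Because bool clauses can contain other bool queries, arbitrary AND/OR/NOT combinations can be expressed by nesting; here the first two sample orders match and the third does not.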

GET http://localhost:9200/web/orders/_search
{ "query": { "constant_score": { "filter": { "bool" : { "should": [ { "term": { "status": "received" } }, { "bool": { "must": { "term": { "customer": "Prof. Shaylee Greenholt" } …
Response: { "took": 12, ... "hits": { "total": 3041, "max_score": 1, "hits": [ ... ] } }

Which query types does Elasticsearch have?

• Term: { "term": { "status": "received" }}
• Terms: { "terms": { "status": ["received", "delivering"] }}
• Range: { "range": { "total": {"gte": 100.5, "lte": 140.5 }}}
• Exists: { "exists": { "field": "region" }}
• Missing: { "missing": { "field": "region" }}
• Prefix: { "prefix": { "customer": "Dolly" }}
• Wildcard: { "wildcard": { "customer": "Doll*" }}
• Regexp: { "regexp": { "customer": { "value": "Doll*" }}}
• Fuzzy: { "fuzzy": { "customer": { "value": "Doll*", "fuzziness": 2 }}}
• Type: { "type": { "value": "orders"}}
• Ids: { "ids": { "type": "orders", "values": ["1", "2"]}}
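As a rough mental model of what some of these structured queries check, the sketch below mimics term, terms, range, exists, missing, prefix and wildcard against a plain Python dictionary. The function `leaf_matches` and the sample order are hypothetical, fields are treated as not analyzed, and `fnmatchcase` is only an approximation of the wildcard query (it also accepts `[...]` character sets, which the real query does not).

```python
from fnmatch import fnmatchcase


def leaf_matches(query, doc):
    """Evaluate one toy structured query against a single document (a dict)."""
    (kind, body), = query.items()
    if kind in ("exists", "missing"):
        present = doc.get(body["field"]) is not None
        return present if kind == "exists" else not present

    (field, spec), = body.items()
    value = doc.get(field)
    if value is None:
        return False
    if kind == "term":
        return value == spec
    if kind == "terms":
        return value in spec
    if kind == "range":
        checks = {
            "gte": lambda v, bound: v >= bound,
            "gt": lambda v, bound: v > bound,
            "lte": lambda v, bound: v <= bound,
            "lt": lambda v, bound: v < bound,
        }
        return all(checks[op](value, bound) for op, bound in spec.items())
    if kind == "prefix":
        return value.startswith(spec)
    if kind == "wildcard":
        return fnmatchcase(value, spec)
    raise ValueError(f"unsupported query type: {kind}")


# Hypothetical order with no "region" field.
order = {"customer": "Dolly Kunde", "status": "received", "total": 120.0}

print(leaf_matches({"range": {"total": {"gte": 100.5, "lte": 140.5}}}, order))
print(leaf_matches({"missing": {"field": "region"}}, order))
```

Each of these checks answers only yes or no for a document, which is what makes structured queries suitable as filters.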

Elasticsearch calculates a "_score" for every query it runs. If you do not need the score, wrap the query in "constant_score".

Structured queries are good for:
• Filtering documents before running the queries whose "_score" you care about.
• Speed, because Elasticsearch can cache the filter results and reuse them in later queries.

What have we learned?
• Structured search.
• Use bool queries.
• Filter documents before running full-text queries.

Full-text search

Two things are different when we talk about full-text search.

Relevance

"How well the documents matched the query criteria"

TF/IDF (Term Frequency / Inverse Document Frequency)

Geolocation proximity, fuzzy similarity, ...

_score max_score

A simple query

GET http://localhost:9200/web/orders/_search
{ "query": { "match": { "customer": "John" } } }
Hits:
{ "_score": 4.6189003, "customer": "John Upton" },
{ "_score": 4.6189003, "customer": "John Borer" },
{ "_score": 4.6189003, "customer": "John Emard" },
{ "_score": 4.06103, "customer": "John Runolfsdottir IV" },
{ "_score": 3.8275056, "customer": "Mr. John Cartwright III" },
{ "_score": 3.8275056, "customer": "John Hodkiewicz DDS" }

GET http://localhost:9200/web/orders/_search
{ "query": { "match": { "customer": "Joh" } } }
Response: { "took": 15, ... "hits": { "total": 0, "max_score": null, "hits": [] } }

Elasticsearch (Lucene) stores data differently from what we are used to.

To understand full-text search, you first need to understand how Elasticsearch stores your documents.

That process is called "analysis".

The pipeline goes like this:
• Create a mapping ("schema") for the index.
• Receive the data via the Index API.
• Iterate over each field and check whether the field is analyzed.
• Run the analyzer on the field.
• Persist the data.

The document:
{ "customer": "Dr. Emiliano, the Mitchell Sr.", "items": [ { "id": 8510629, "value": 769 } ], "total": 2874.81, "created_at": "15/11/2015 17:00:57", "status": "delivered", "region": "sao_paulo", "shipping_fees": 719 }

The field being analyzed: "customer": "Dr. Emiliano, the Mitchell Sr."

Analyzer: tokenizer: whitespace; token filter: lowercase; char filter: html_strip

Step 1, char filter (html_strip):
"Dr. Emiliano, the Mitchell Sr." becomes the string "Dr. Emiliano, the Mitchell Sr." (there is no HTML to strip, so it is unchanged).

Step 2, tokenizer (whitespace):
"Dr. Emiliano, the Mitchell Sr." becomes the tokens "Dr." "Emiliano," "the" "Mitchell" "Sr."

Step 3, token filter (lowercase):
"Dr." "Emiliano," "the" "Mitchell" "Sr." becomes "dr." "emiliano," "the" "mitchell" "sr."
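The three analyzer stages can be sketched in a few lines of Python. This is an approximation for illustration only: the function names are hypothetical, and the regex-based `html_strip` is a crude stand-in for the real char filter, which handles much more (HTML entities, for example).

```python
import re


def html_strip(text):
    """Char filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    """Tokenizer: split on runs of whitespace, keeping punctuation attached."""
    return text.split()


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text):
    """Run the char filter, then the tokenizer, then the token filter."""
    return lowercase(whitespace_tokenizer(html_strip(text)))


tokens = analyze("Dr. Emiliano, the Mitchell Sr.")
print(tokens)  # ['dr.', 'emiliano,', 'the', 'mitchell', 'sr.']
```

Note that the comma stays attached to "emiliano," because the whitespace tokenizer only splits on whitespace.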

Persisting

Token        Doc ID
"dr."        1
"emiliano,"  1
"the"        1
"mitchell"   1
"sr."        1

TF/IDF (Term Frequency / Inverse Document Frequency) will calculate the score of the document.

GET http://localhost:9200/web/orders/_search
{ "query": { "match": { "customer": "Mitchell" } } }

The term searched by the user: "Mitchell"
Analyzer: tokenizer: whitespace; token filter: lowercase; char filter: html_strip
The term after analysis: "mitchell"

The analyzed term "mitchell" is then looked up in the Token / Doc ID table above.

Which kinds of full-text queries does Elasticsearch have?

• Match: { "match": { "name": "John" }}
• multi_match: { "multi_match": { "query": "John", "fields": ["name.raw", "name.autocomplete"]}}
• ...

What have we learned?
• Understanding the "analysis" process is essential to understanding how full-text search works.
• An analyzer is made of tokenizers, token filters and char filters.
• You have to understand how the user will search in order to pick the most appropriate query.
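The index-time and search-time halves of analysis can be tied together with a toy inverted index. The sketch below is a simplified model, not Lucene's implementation: it assumes a whitespace tokenizer plus lowercase filter on both sides, and the names `build_index` and `match`, along with the sample customers, are hypothetical.

```python
from collections import defaultdict


def analyze(text):
    """Assumed analyzer: whitespace tokenizer followed by a lowercase filter."""
    return text.lower().split()


def build_index(docs):
    """Map each analyzed token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def match(index, text):
    """Toy match query: analyze the input, then look each token up in the index."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits


customers = {
    1: "Dr. Emiliano, the Mitchell Sr.",
    2: "John Upton",
    3: "Mr. John Cartwright III",
}
index = build_index(customers)

print(match(index, "Mitchell"))  # {1}
print(match(index, "Joh"))       # set()
```

"Mitchell" is lowercased to "mitchell" and found, while "Joh" finds nothing because the index only holds whole tokens such as "john". Under this analyzer a search for "Emiliano" would also miss, since the stored token is "emiliano," with the comma attached.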

We use a "technique" called Concept Search.

First of all, in Information Retrieval a search is categorized as either "semantic" or "syntactic".

Semantic
• The big dog (a dog weighing at least 5 kg).
• Children's products (products with a colorful print).

Syntactic
• The big dog.
• Children's products.

In the case of Elasticsearch, which uses Lucene...

Lucene is based on 2 main concepts.

• Vector space model
• Boolean retrieval

Boolean retrieval

Vector space model
• How some documents get ranked above others.
• One of the algorithms used is TF/IDF (Term Frequency / Inverse Document Frequency).
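To give a feel for how TF/IDF ranks documents, here is a minimal sketch of the textbook formula (term frequency multiplied by the log of the inverse document frequency). It is an assumption-laden simplification: Lucene's practical scoring function uses different normalizations, and the function name `tf_idf` and the tiny corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Textbook TF-IDF of one term for one document within a corpus of token lists."""
    if not doc_tokens:
        return 0.0
    # Term frequency: share of the document's tokens that are the term.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    docs_with_term = sum(1 for tokens in corpus if term in tokens)
    if docs_with_term == 0:
        return 0.0
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf


# Hypothetical, already-analyzed customer names.
corpus = [
    ["john", "upton"],
    ["john", "borer"],
    ["mr.", "john", "cartwright", "iii"],
    ["dolly", "kunde"],
]

john_short = tf_idf("john", corpus[0], corpus)
john_long = tf_idf("john", corpus[2], corpus)
dolly = tf_idf("dolly", corpus[3], corpus)
print(john_short > john_long, dolly > john_short)
```

A term scores higher in a shorter field (the same term makes up a larger share of it) and a rarer term scores higher than a common one. That is the general intuition behind these two factors.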

They are not optimized for queries focused on e-commerce.

What have we learned?
• Syntactic vs. semantic search.
• Elasticsearch is focused on syntactic search.
• Boolean retrieval and the vector space model are the foundation of this search engine model.

Precision and Recall

[Diagram labels: Actual, Expected, OK]

Precision
How many of the documents returned by the query are relevant. "How useful the search was."

Recall
How many of the available relevant documents were returned by the query. "How complete the result is."
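Both measures can be computed from two sets: the documents a query returned and the documents that are actually relevant. The sketch below uses the standard Information Retrieval definitions (precision = relevant returned / all returned, recall = relevant returned / all relevant); the function name and the document IDs are hypothetical.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned and a set of relevant doc IDs."""
    returned, relevant = set(returned), set(relevant)
    ok = returned & relevant  # results that were both expected and returned
    precision = len(ok) / len(returned) if returned else 0.0
    recall = len(ok) / len(relevant) if relevant else 0.0
    return precision, recall


# A strict query: everything returned is relevant, but half of the relevant docs are missed.
strict = precision_recall(returned=[1, 2], relevant=[1, 2, 3, 4])

# A broad query: every relevant doc is found, but half of the results are junk.
broad = precision_recall(returned=[1, 2, 3, 4, 5, 6, 7, 8], relevant=[1, 2, 3, 4])

print(strict)  # (1.0, 0.5)
print(broad)   # (0.5, 1.0)
```

The two numbers pull against each other: loosening a query tends to raise recall and lower precision, and tightening it does the opposite.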

[Diagram labels: Actual, Expected, OK, Junk]

Meaning of "OK": a result that you want.

[Diagram labels: Actual, Expected, OK, More junk]

NAME = John AND SURNAME = Silva # => 100
NAME = John OR SURNAME = Silva # => 1000
"John Silva" # => no results

Pipeline

Synonyms

Shortcut

Content search

What have we learned?
• Precision and recall.
• Staged search.
• Go from stricter searches to broader searches.
• Build your own pipeline.
• Know the kinds of searches your users run.
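One way to read "staged search" is as a list of queries ordered from strict to broad, where the first stage that returns anything wins. The sketch below follows that reading, which is an assumption: the slides do not spell out the fallback policy, and the function `staged_search`, the stage names and the sample people are all hypothetical.

```python
def staged_search(stages, docs):
    """Run stages in order and return (stage name, hits) for the first non-empty result."""
    for name, predicate in stages:
        hits = [doc for doc in docs if predicate(doc)]
        if hits:
            return name, hits
    return None, []


# Hypothetical data: nobody is called exactly "John Silva".
people = [
    {"name": "John", "surname": "Souza"},
    {"name": "Maria", "surname": "Silva"},
]

# Ordered from the strictest (high precision) to the broadest (high recall).
stages = [
    ("strict: name AND surname",
     lambda d: d["name"] == "John" and d["surname"] == "Silva"),
    ("broad: name OR surname",
     lambda d: d["name"] == "John" or d["surname"] == "Silva"),
]

stage, hits = staged_search(stages, people)
print(stage, len(hits))
```

The strict stage protects precision when an exact match exists, and the broad stage keeps the user from seeing an empty result page when it does not. In a real pipeline each predicate would be an Elasticsearch query rather than a Python lambda.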

References
• https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
• https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
• https://www.youtube.com/playlist?list=PLZ4puV97Zwm2fEmTLrPsP7QgLsjnnQggX
• Official PHP elasticsearch package: https://github.com/elastic/elasticsearch-php
• https://github.com/sleimanx2/plastic
• https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

We are hiring! Back-end devs, front-end devs, UX [email protected]

Thank you! @guilhermeguitte http://www.guitte.org
