Stuff a Search Engine Can Do

1 Stuff a search engine can do 🙂
João Duarte, Log Whisperer @elastic

6 Apache Lucene Core
Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download.
http://lucene.apache.org/core/

14 Elasticsearch Cluster

17 As a law student, I went on a few job interviews. At one, the interviewer’s first comment was “It’s so unusual that I see a résumé without any typos.” “Are you serious?” I said. She said, “Yes, probably 90% of the résumés I get have typos. And that includes the ones we get from the top schools.” I got the job. Probably there were better-qualified candidates, but they damaged their chances with sloppy résumés. The irony is that those people, who most needed to hear the interviewer’s feedback, weren’t in the room. Because they never got an interview. …

19 Agenda
1 Document Analysis
2 Indexing
3 Searching and Ranking
4 Suggestions, More Like This, etc.
5 Would you like to know more?

20-21 Document Analysis
The sample passage from slide 17 is the input to the Analyzer.

22 Document Analysis: Anatomy of the Analyzer
Elasticsearch comes with pre-built analyzers; you can also create your own.
https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html
1 Character Filter
2 Tokenizer
3 Token Filter
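
The three analyzer stages can be sketched as a small pipeline. This is only a toy illustration in Python, not Elasticsearch code: the function names (`strip_html`, `word_tokenizer`, `lowercase`, `remove_stopwords`, `analyze`) and the tiny stopword set are assumptions made up for the example.

```python
import re
from typing import Callable, List

# Hypothetical toy components; none of these are Elasticsearch APIs.

def strip_html(text: str) -> str:
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def word_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the character stream into word tokens."""
    return re.findall(r"\w+", text)

def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: normalise every token to lower case."""
    return [t.lower() for t in tokens]

STOPWORDS = {"a", "the", "if"}  # assumed, deliberately tiny stopword list

def remove_stopwords(tokens: List[str]) -> List[str]:
    """Token filter: drop very common words."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run the stages in order: character filters, tokenizer, token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<p>A mad man sees what he sees.</p>",
    [strip_html],
    word_tokenizer,
    [lowercase, remove_stopwords],
)
print(tokens)  # ['mad', 'man', 'sees', 'what', 'he', 'sees']
```

Keeping each stage as a plain function makes the order explicit: character filters see raw text, the tokenizer produces tokens, and token filters only ever see tokens.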

25 Indexing
Elasticsearch terms:
• An Index: a data structure that houses documents (think RDBMS "table")
• Index a document: insert a document into an Index
• Document: a JSON object (hash map)

$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -H 'Content-Type: application/json' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

26 Indexing
# document id 1
{"text": "He who controls the spice, controls the universe."}

token       document_id  frequency
He          1            1
who         1            1
controls    1            1
the         1            1
spice       1            1
universe    1            1

27 Indexing
# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}

token       document_id  frequency
He          1            1
who         1            1
controls    1            1
the         1            1
spice       1            1
universe    1            1
A           2            1
mad         2            1
man         2            1
sees        2            1
what        2            1
he          2            1

28 Indexing
# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}
# document id 3
{"text": "What if a mad man controlled the universe?"}

token       document_id  frequency
He          1            1
who         1            1
controls    1            1
the         1,3          2
spice       1            1
universe    1,3          2
A           2            1
mad         2,3          2
man         2,3          2
sees        2            1
what        2            1
he          2            1
What        3            1
if          3            1
a           3            1
controlled  3            1

29 Indexing
Lower case token filter

# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}
# document id 3
{"text": "What if a mad man controlled the universe?"}

token       document_id  frequency
he          1,2          2
who         1            1
controls    1            1
the         1,3          2
spice       1            1
universe    1,3          2
a           2,3          2
mad         2,3          2
man         2,3          2
sees        2            1
what        2,3          2
if          3            1
controlled  3            1

30 Indexing
+ Stemmer

# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}
# document id 3
{"text": "What if a mad man controlled the universe?"}

token       document_id  frequency
he          1,2          2
who         1            1
control     1,3          2
the         1,3          2
spice       1            1
univers     1,3          2
a           2,3          2
mad         2,3          2
man         2,3          2
see         2            1
what        2,3          2
if          3            1

31-32 Indexing
- Stopwords

# document id 1
{"text": "He who controls the spice, controls the universe."}
# document id 2
{"text": "A mad man sees what he sees."}
# document id 3
{"text": "What if a mad man controlled the universe?"}

token       document_id  frequency
he          1,2          2
who         1            1
control     1,3          2
spice       1            1
univers     1,3          2
mad         2,3          2
man         2,3          2
see         2            1
what        2,3          2
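
The index-building steps on these slides (lower case, stemming, stopword removal) can be reproduced with a short Python sketch. It is a toy model, not how Lucene stores its index: the `STEMS` lookup table is a hand-written stand-in for a real stemming algorithm, and `STOPWORDS`, `analyze` and `build_index` are hypothetical names chosen for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

DOCS = {
    1: "He who controls the spice, controls the universe.",
    2: "A mad man sees what he sees.",
    3: "What if a mad man controlled the universe?",
}

# Assumption: only the stopwords that the slides remove.
STOPWORDS = {"the", "a", "if"}

# Assumption: a lookup table standing in for a real stemmer,
# covering only the words that change in the slides.
STEMS = {
    "controls": "control",
    "controlled": "control",
    "universe": "univers",
    "sees": "see",
}

def analyze(text: str) -> List[str]:
    """Lower-case and tokenize, then stem, then drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    tokens = [STEMS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each token to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index

index = build_index(DOCS)
for token, doc_ids in index.items():
    # Columns: token, document ids, number of documents containing the token
    print(f"{token:<8} {sorted(doc_ids)} {len(doc_ids)}")
```

The "frequency" column here is the number of documents that contain the token, which is why `control` counts 2 even though it occurs twice in document 1.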

34 Searching and Ranking
Structured / Full-text / Others
• Similar to SQL
• Find exact values
• Ranges
• Group by
• Match
• Match Phrase
• Relevancy and boosting
• More Like This
• Multifield Search
• Pipeline Aggregations
• Geolocation
• Proximity Matching

36 Searching and Ranking
GET my_index/_search
{
    "query": {
        "match" : {
            "text" : {
                "query" : "control spice"
            }
        }
    }
}

token       document_id  frequency
he          1,2          2
who         1            1
control     1,3          2
spice       1            1
univers     1,3          2
mad         2,3          2
man         2,3          2
see         2            1
what        2,3          2

37 Searching and Ranking
Query tokens for "control spice":

token
control
spice
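
How the query tokens are looked up in the index can be sketched in a few lines of Python. The `INDEX` dictionary copies the token table from the slides; the `match` function is a hypothetical helper, and ordering hits by the number of matched tokens is only a crude stand-in for real relevance scoring. A match query combines its terms with OR by default, so a document matching either token is a hit.

```python
from typing import Dict, List, Set

# Inverted index copied from the slides: token -> document ids.
INDEX: Dict[str, Set[int]] = {
    "he": {1, 2},
    "who": {1},
    "control": {1, 3},
    "spice": {1},
    "univers": {1, 3},
    "mad": {2, 3},
    "man": {2, 3},
    "see": {2},
    "what": {2, 3},
}

def match(index: Dict[str, Set[int]], query_tokens: List[str]) -> Dict[int, Set[str]]:
    """Return, for every hit, the set of query tokens it contains (OR semantics)."""
    hits: Dict[int, Set[str]] = {}
    for token in query_tokens:
        for doc_id in index.get(token, set()):
            hits.setdefault(doc_id, set()).add(token)
    return hits

hits = match(INDEX, ["control", "spice"])

# Assumption: more matched tokens first, ties broken by document id.
for doc_id in sorted(hits, key=lambda d: (-len(hits[d]), d)):
    print(doc_id, sorted(hits[doc_id]))
# 1 ['control', 'spice']
# 3 ['control']
```

Document 2 never appears because neither token points to it; the lookup touches only the two index entries, not the documents themselves.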

40 Searching and Ranking
There are three main factors of a document’s score:
• TF (term frequency): the more often a token appears in a document, the more important it is
• IDF (inverse document frequency): the more documents contain the term, the less important it is
• Field length: a match in a shorter field is more likely to be relevant than a match in a longer one
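
A rough sketch of how the three factors might combine, using the three example documents after analysis. The formulas (square root of the term count, a logarithmic IDF, and one over the square root of the field length) are loosely modeled on the classic TF/IDF similarity and heavily simplified; actual Elasticsearch scoring, including BM25, differs in detail. The `score` function and its parameters are hypothetical.

```python
import math
from typing import Dict, List

# Tokens of each document after lower-casing, stemming and stopword removal.
DOCS: Dict[int, List[str]] = {
    1: ["he", "who", "control", "spice", "control", "univers"],
    2: ["mad", "man", "see", "what", "he", "see"],
    3: ["what", "mad", "man", "control", "univers"],
}

def score(query_tokens: List[str], doc_tokens: List[str], docs: Dict[int, List[str]]) -> float:
    """Sum tf * idf * norm over the query tokens found in the document."""
    total = 0.0
    for term in query_tokens:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue  # term not in this document, contributes nothing
        doc_freq = sum(1 for tokens in docs.values() if term in tokens)
        tf = math.sqrt(freq)                               # more occurrences, higher score
        idf = 1.0 + math.log(len(docs) / (doc_freq + 1))   # rarer term, higher score
        norm = 1.0 / math.sqrt(len(doc_tokens))            # shorter field, higher score
        total += tf * idf * norm
    return total

query = ["control", "spice"]
scores = {doc_id: score(query, tokens, DOCS) for doc_id, tokens in DOCS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [1, 3, 2]: document 1 matches both tokens, document 2 matches none
```

Document 1 wins because it contains both tokens, one of them twice, and `spice` is the rarer term; document 3 still scores above zero thanks to its single `control`.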

44 Searching and Ranking
"BM25 Demystified" by Britta Weber
https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25

48 Would you like to know more?
Code - https://github.com/elastic/
Documentation - https://www.elastic.co/guide/index.html
Elasticsearch: The Definitive Guide - https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html
Discuss Forum - https://discuss.elastic.co/
Private or Public Training - https://training.elastic.co/
Subscriptions - https://www.elastic.co/subscriptions

49 The End. Thank you!
