
Language support and linguistics in Lucene, Solr and ElasticSearch, and the eco-system

This is our Berlin Buzzwords 2013 overview talk on search and natural language processing.

See https://github.com/atilika/berlin-buzzwords-2013 for code examples, etc.

Transcript

  1. About me

    • MSc in computer science, University of Oslo, Norway • Worked with search at FAST (now Microsoft) for 10 years • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway • 5 years in Services doing solution delivery, technical sales, etc. in Tokyo, Japan • Founded アティリカ株式会社 (Atilika Inc.) in October, 2009 • We help companies innovate using new technologies and good ideas • We do information retrieval, natural language processing and big data • We are based in Tokyo, but we have clients everywhere • We are a small company, but our customers are typically very big companies • Newbie Lucene & Solr Committer • Mostly been working on Japanese language support (Kuromoji) so far • Working on Korean support from a code donation (LUCENE-4956) • Please write to me at [email protected] or [email protected]
  2. About this talk

    • Basic searching and matching • Challenges with natural language • Basic measurements for search quality • Linguistics in Apache Lucene • Linguistics in ElasticSearch (quick intro) • Linguistics in Apache Solr • Linguistics in the NLP eco-system • Summary and practical advice
  3. Hands-on demos

    • Hands-on 1: Working with Apache Lucene analyzers • Hands-on 2: Multi-lingual search using ElasticSearch • Hands-on 3: Multi-lingual search with Apache Solr • Hands-on 4: Other text processing using OpenNLP
  4. Documents

    Two documents (1 & 2) with English text • 1: Sushi is very tasty in Japan • 2: Visiting the Tsukiji fish market is very fun
  5. Text segmentation

    Step 1: Two documents (1 & 2) with English text • 1: Sushi is very tasty in Japan • 2: Visiting the Tsukiji fish market is very fun • Step 2: Documents are turned into searchable terms (tokenization) • 1: Sushi | is | very | tasty | in | Japan • 2: Visiting | the | Tsukiji | fish | market | is | very | fun
  6. Text segmentation

    Steps 1 and 2 as on the previous slide • Step 3: Terms/tokens are converted to lowercase form (normalization) • 1: sushi | is | very | tasty | in | japan • 2: visiting | the | tsukiji | fish | market | is | very | fun
  7. Document indexing

    Tokenized documents with normalized tokens • 1: sushi | is | very | tasty | in | japan • 2: visiting | the | tsukiji | fish | market | is | very | fun
  8. Document indexing

    Tokenized documents with normalized tokens • 1: sushi | is | very | tasty | in | japan • 2: visiting | the | tsukiji | fish | market | is | very | fun • Inverted index: tokens are mapped to the document ids that contain them • sushi → 1 • is → 1, 2 • very → 1, 2 • tasty → 1 • in → 1 • japan → 1 • visiting → 2 • the → 2 • tsukiji → 2 • fish → 2 • market → 2 • fun → 2
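The tokenize, normalize and index steps on these slides can be sketched in a few lines of Python. This is only an illustration under simple assumptions (whitespace tokenization, lowercasing as the only normalization); the function names are made up for the example and are not Lucene APIs.

```python
from collections import defaultdict


def tokenize(text):
    """Split on whitespace and lowercase each token (toy analysis)."""
    return [token.lower() for token in text.split()]


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: "Sushi is very tasty in Japan",
    2: "Visiting the Tsukiji fish market is very fun",
}
index = build_inverted_index(docs)
print(index["very"])   # [1, 2]
print(index["sushi"])  # [1]
```

A real engine stores more than document ids (positions, frequencies), but the term-to-documents mapping is the core idea.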
  9. Searching

    Inverted index: sushi → 1 • is → 1, 2 • very → 1, 2 • tasty → 1 • in → 1 • japan → 1 • visiting → 2 • the → 2 • tsukiji → 2 • fish → 2 • market → 2 • fun → 2
  10. Searching

    Query: very tasty sushi
  11. Searching

    Parsed query: very AND tasty AND sushi
  15. Searching

    Parsed query: very AND tasty AND sushi • 1 hit
  17. Searching

    Query: visit fun market
  18. Searching

    Parsed query: visit AND fun AND market
  20. Searching

    Parsed query: visit AND fun AND market • visit ≠ visiting
  21. Searching

    Parsed query: visit AND fun AND market • No hits (all terms need to match)
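The two searches walked through above can be reproduced with a small sketch of AND matching against the inverted index. It assumes the query is split on whitespace and lowercased, with no stemming, which is exactly why visit fails to match visiting. The function name `search_and` is illustrative only.

```python
def search_and(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


index = {
    "sushi": [1], "is": [1, 2], "very": [1, 2], "tasty": [1], "in": [1],
    "japan": [1], "visiting": [2], "the": [2], "tsukiji": [2], "fish": [2],
    "market": [2], "fun": [2],
}
print(search_and(index, "very tasty sushi"))  # [1]
print(search_and(index, "visit fun market"))  # []
```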
  22. What’s the problem?

    Search engines are not magical answering machines • They match terms in queries against terms in documents, and order matches by rank
  23. Key takeaways

    Text processing affects search quality in a big way because it affects matching • The “magic” of a search engine is often provided by high-quality text processing • Garbage in ⇒ garbage out
  24. English

    Pale ale is a beer made through warm fermentation using pale malt and is one of the world's major beer styles.
  25. English

    Pale ale is a beer made through warm fermentation using pale malt and is one of the world's major beer styles. • How do we want to index world's?
  26. English

    Pale ale is a beer made through warm fermentation using pale malt and is one of the world's major beer styles. • How do we want to index world's? • Should a search for style match styles? And should ferment match fermentation?
  27. German

    Das Oktoberfest ist das größte Volksfest der Welt und es findet in der bayerischen Landeshauptstadt München.
  28. German

    Das Oktoberfest ist das größte Volksfest der Welt und es findet in der bayerischen Landeshauptstadt München. • The Oktoberfest is the world’s largest festival and it takes place in the Bavarian capital Munich.
  29. German

    Das Oktoberfest ist das größte Volksfest der Welt und es findet in der bayerischen Landeshauptstadt München. • The Oktoberfest is the world’s largest festival and it takes place in the Bavarian capital Munich. • How do we want to search ü, ö and ß?
  30. German

    Das Oktoberfest ist das größte Volksfest der Welt und es findet in der bayerischen Landeshauptstadt München. • The Oktoberfest is the world’s largest festival and it takes place in the Bavarian capital Munich. • How do we want to search ü, ö and ß? • Do we want a search for hauptstadt to match Landeshauptstadt?
  31. French

    Le champagne est un vin pétillant français protégé par une appellation d'origine contrôlée. • Champagne is a French sparkling wine with a protected designation of origin.
  32. French

    Le champagne est un vin pétillant français protégé par une appellation d'origine contrôlée. • Champagne is a French sparkling wine with a protected designation of origin. • How do we want to search é, ç and ô?
  33. French

    Le champagne est un vin pétillant français protégé par une appellation d'origine contrôlée. • Champagne is a French sparkling wine with a protected designation of origin. • How do we want to search é, ç and ô? • Do we want a search for aoc to match appellation d'origine contrôlée?
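One possible answer to the accent question is to fold accented characters to their base letters on both the index and query side. The sketch below uses Unicode decomposition from the Python standard library. It is an illustration of the idea, not what any particular Lucene filter does, and `fold_accents` is a made-up name.

```python
import unicodedata


def fold_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("pétillant français protégé contrôlée"))
# petillant francais protege controlee
print(fold_accents("München"))
# Munchen
```

Note that this leaves ß untouched, since it has no decomposition; whether München should match Munchen or Muenchen is a language-specific decision of the kind raised on the German slides.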
  34. Arabic

    تُعْتَبَر القَهْوَة العَرَبِيَّة الأصِـــــــيْلَة رَمْزًا مِن رُمُوْز الكَرَم عِنْد العَرَبْ فِي العَالَم العَرَبِي.
  35. Arabic

    تُعْتَبَر القَهْوَة العَرَبِيَّة الأصِـــــــيْلَة رَمْزًا مِن رُمُوْز الكَرَم عِنْد العَرَبْ فِي العَالَم العَرَبِي. • Reads from right to left
  36. Arabic

    تُعْتَبَر القَهْوَة العَرَبِيَّة الأصِـــــــيْلَة رَمْزًا مِن رُمُوْز الكَرَم عِنْد العَرَبْ فِي العَالَم العَرَبِي. • Original Arabian coffee is considered a symbol of generosity among the Arabs in the Arab world.
  37. Arabic

    تُعْتَبَر القَهْوَة العَرَبِيَّة الأصِـــــــيْلَة رَمْزًا مِن رُمُوْز الكَرَم عِنْد العَرَبْ فِي العَالَم العَرَبِي. • Original Arabian coffee is considered a symbol of generosity among the Arabs in the Arab world. • How do we want to search الأصِـــــــيْلَة?
  38. Arabic

    تُعْتَبَر القَهْوَة العَرَبِيَّة الأصِـــــــيْلَة رَمْزًا مِن رُمُوْز الكَرَم عِنْد العَرَبْ فِي العَالَم العَرَبِي. • Original Arabian coffee is considered a symbol of generosity among the Arabs in the Arab world. • How do we want to search الأصِـــــــيْلَة? • Do we want to normalize diacritics?
  39. Arabic

    تعتبر القهوة العربية الأصـــــــيلة رمزا من رموز الكرم عند العرب في العالم العربي. • Diacritics normalized (removed) • Original Arabian coffee is considered a symbol of generosity among the Arabs in the Arab world. • How do we want to search الأصِـــــــيْلَة? • Do we want to normalize diacritics?
  40. Arabic

    تُعْتَبَر القَهْوَة العَرَبِيَّة الأصِـــــــيْلَة رَمْزًا مِن رُمُوْز الكَرَم عِنْد العَرَبْ فِي العَالَم العَرَبِي. • Original Arabian coffee is considered a symbol of generosity among the Arabs in the Arab world. • How do we want to search الأصِـــــــيْلَة? • Do we want to normalize diacritics? • Do we want to correct the common spelling mistake for فِي and ه?
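Diacritic normalization of the kind asked about here can be sketched as a simple character filter. The example below removes the Arabic short-vowel and related marks (U+064B to U+0652) and the tatweel elongation character (U+0640). It is a minimal illustration that assumes these two classes are all we want to strip; it does not attempt any letter-level normalization, and the sample word is chosen for the example.

```python
import re

# Arabic diacritics (fathatan .. sukun) plus tatweel, the elongation character.
DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0640]")


def normalize_arabic(text):
    """Remove diacritics and tatweel so that variants of a word compare equal."""
    return DIACRITICS_AND_TATWEEL.sub("", text)


with_marks = "العَرَبِيَّة"
without_marks = "العربية"
print(normalize_arabic(with_marks) == without_marks)
```

Applying the same function to both documents and queries makes a search match regardless of whether the user typed the diacritics.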
  41. Japanese

    JR新宿駅の近くにビールを飲みに行こうか？ • Shall we go for a beer near JR Shinjuku station?
  42. Japanese

    JR新宿駅の近くにビールを飲みに行こうか？ • Shall we go for a beer near JR Shinjuku station? • What are the words in this sentence? Which tokens do we index?
  43. Japanese

    JR新宿駅の近くにビールを飲みに行こうか？ • Shall we go for a beer near JR Shinjuku station? • What are the words in this sentence? Which tokens do we index? • Words are implicit in Japanese - there is no white space that separates them
  44. Japanese

    JR新宿駅の近くにビールを飲みに行こうか？ • Shall we go for a beer near JR Shinjuku station? • What are the words in this sentence? Which tokens do we index? • Words are implicit in Japanese - there is no white space that separates them • But how do we find the tokens?
  46. Japanese

    JR新宿駅の近くにビールを飲みに行こうか？ • Shall we go for a beer near JR Shinjuku station? • Do we want 飲む (to drink) to match 飲み?
  47. Japanese

    JR新宿駅の近くにビールを飲みに行こうか？ • Shall we go for a beer near JR Shinjuku station? • Do we want 飲む (to drink) to match 飲み? • Do we want ﾋﾞｰﾙ to match ビール? Does half-width match full-width?
  48. Japanese

    JR新宿駅の近くにビールを飲みに行こうか？ • Shall we go for a beer near JR Shinjuku station? • Do we want 飲む (to drink) to match 飲み? • Do we want ﾋﾞｰﾙ to match ビール? Does half-width match full-width? • Do we want (emoji) to match?
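The half-width versus full-width question can be illustrated with Unicode compatibility normalization. The sketch below assumes NFKC normalization is applied to both indexed text and queries; it uses only the Python standard library and the word for beer written in half-width and full-width katakana.

```python
import unicodedata

half_width = "ﾋﾞｰﾙ"   # "beer" in half-width katakana
full_width = "ビール"  # "beer" in full-width katakana

# Without normalization the two spellings are different strings.
print(half_width == full_width)                                 # False
# NFKC maps half-width forms to full-width and composes the voiced mark.
print(unicodedata.normalize("NFKC", half_width) == full_width)  # True
```

Width normalization handles character variants only; matching 飲む against 飲み is a morphological problem and needs a language-aware analyzer.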
  49. Common traits

    • Segmenting source text into tokens • Dealing with non-space separated languages • Handling punctuation in space separated languages • Segmenting compounds into their parts • Applying relevant linguistic normalizations • Character normalization • Morphological (or grammatical) normalizations • Spelling variations • Synonyms and stopwords
  50. Key take-aways

    • Natural language is very complex • Each language is different with its own set of complexities • We have had a high-level look at languages: Japanese, English, German, French, Arabic • But there is also: Greek, Hebrew, Chinese, Korean, Russian, Thai, Spanish and many more ... • Search needs per-language processing • Many considerations to be made (often application-specific)
  51. Precision

    Fraction of retrieved documents that are relevant: $\text{precision} = \frac{|\{\text{relevant docs}\} \cap \{\text{retrieved docs}\}|}{|\{\text{retrieved docs}\}|}$
  52. Recall

    Fraction of relevant documents that are retrieved: $\text{recall} = \frac{|\{\text{relevant docs}\} \cap \{\text{retrieved docs}\}|}{|\{\text{relevant docs}\}|}$
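Both measurements are easy to compute once the relevant and retrieved documents are represented as sets of ids. The sketch below is a direct transcription of the two definitions; the document ids are invented for the example, and returning 0.0 for an empty denominator is a convention chosen here, not something the slides specify.

```python
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(relevant & retrieved) / len(retrieved)


def recall(relevant, retrieved):
    """Fraction of relevant documents that are retrieved."""
    if not relevant:
        return 0.0
    return len(relevant & retrieved) / len(relevant)


relevant = {1, 2, 3, 4}   # hypothetical judged-relevant document ids
retrieved = {3, 4, 5}     # hypothetical search result
print(precision(relevant, retrieved))  # 2 of 3 retrieved are relevant
print(recall(relevant, retrieved))     # 0.5
```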
  53. Precision vs. Recall

    Should I optimize for precision or recall? • That depends on your application • A lot of tuning work is in practice often about improving recall without hurting precision
  54. Lucene analysis chain / Analyzer

    Simplified architecture: document or query, Lucene analysis chain / Analyzer, index • 1. Analyzes queries or documents in a pipelined fashion before indexing or searching • 2. Analysis itself is done by an analyzer on a per-field basis • 3. Key plug-in point for linguistics in Lucene
  55. Analyzers

    What does an Analyzer do? • An Analyzer takes text as its input and turns it into a stream of tokens
  56. Analyzers

    What does an Analyzer do? • An Analyzer takes text as its input and turns it into a stream of tokens • Tokens are produced by a Tokenizer
  57. Analyzers

    What does an Analyzer do? • An Analyzer takes text as its input and turns it into a stream of tokens • Tokens are produced by a Tokenizer • Tokens can be processed further by a chain of TokenFilters downstream
  58. Analyzer high-level concepts

    Reader → Tokenizer → TokenFilter → TokenFilter → TokenFilter • Reader • Stream to be analyzed is provided by a Reader (from java.io) • Can have a chain of associated CharFilters (not discussed) • Tokenizer • Segments text provided by the reader into tokens • Most interesting things happen in the incrementToken() method • TokenFilter • Updates, mutates or enriches tokens • Most interesting things happen in the incrementToken() method
  59. FrenchAnalyzer

    Input: Le champagne est protégé par une appellation d'origine contrôlée. • StandardTokenizer → Le champagne est protégé par une appellation d'origine contrôlée
  60. FrenchAnalyzer

    StandardTokenizer → Le champagne est protégé par une appellation d'origine contrôlée • Next: ElisionFilter
  61. FrenchAnalyzer

    StandardTokenizer → Le champagne est protégé par une appellation d'origine contrôlée • ElisionFilter → Le champagne est protégé par une appellation origine contrôlée
  62. FrenchAnalyzer

    ElisionFilter → Le champagne est protégé par une appellation origine contrôlée • Next: LowerCaseFilter
  63. FrenchAnalyzer

    LowerCaseFilter → le champagne est protégé par une appellation origine contrôlée • StopFilter → champagne protégé appellation origine contrôlée
  64. FrenchAnalyzer

    StopFilter → champagne protégé appellation origine contrôlée • Next: FrenchLightStemFilter
  65. FrenchAnalyzer

    StopFilter → champagne protégé appellation origine contrôlée • FrenchLightStemFilter → champagn proteg apel origin control
  66. FrenchAnalyzer

    Le champagne est protégé par une appellation d'origine contrôlée. → StandardTokenizer → ElisionFilter → LowerCaseFilter → StopFilter → FrenchLightStemFilter → champagn proteg apel origin control
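The chain of tokenizer and filters can be mimicked with Python generators, each stage consuming the token stream of the previous one. This is a toy sketch, not Lucene code: the regular expression, the elision article list and the stopword list are small assumptions made for this one sentence, and the stemming stage is left out because a faithful light stemmer is beyond a short example.

```python
import re

ELISION_ARTICLES = {"l", "d", "m", "t", "qu", "n", "s", "j"}  # assumed subset
STOPWORDS = {"le", "est", "par", "une"}                        # assumed subset


def standard_tokenizer(text):
    """Yield word tokens, keeping apostrophes inside words (d'origine)."""
    for match in re.finditer(r"\w+(?:'\w+)*", text):
        yield match.group()


def elision_filter(tokens):
    """Strip a leading elided article such as d' or l'."""
    for token in tokens:
        head, sep, tail = token.partition("'")
        yield tail if sep and head.lower() in ELISION_ARTICLES else token


def lowercase_filter(tokens):
    for token in tokens:
        yield token.lower()


def stop_filter(tokens):
    for token in tokens:
        if token not in STOPWORDS:
            yield token


def analyze(text):
    """Tokenizer followed by the filter chain, in the order shown on the slides."""
    stream = standard_tokenizer(text)
    for token_filter in (elision_filter, lowercase_filter, stop_filter):
        stream = token_filter(stream)
    return list(stream)


print(analyze("Le champagne est protégé par une appellation d'origine contrôlée."))
# ['champagne', 'protégé', 'appellation', 'origine', 'contrôlée']
```

The output matches the tokens after the StopFilter step; the order of the stages matters, since the stopword list here is lowercase and therefore has to run after lowercasing.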
  67. Analyzer processing model

    • Analyzers provide a TokenStream • Retrieve it by calling tokenStream(field, reader) • tokenStream() bundles together tokenizers and any additional filters necessary for analysis • Input is advanced by incrementToken() • Information about the token itself is provided by so-called TokenAttributes attached to the stream • Attributes for term text, offset, token type, etc. • TokenAttributes are updated on incrementToken()
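The processing model described here (advance the stream, then read the attributes of the current token) can be imitated in Python. The class below is a conceptual stand-in only: its name, its whitespace tokenization and its attribute fields are assumptions for illustration and do not mirror the real Lucene classes, which are Java.

```python
import re


class WhitespaceTokenStream:
    """Toy token stream: one shared set of attributes, updated per token."""

    def __init__(self, text):
        self._matches = re.finditer(r"\S+", text)
        # "Attributes" describing the current token.
        self.term = None
        self.start_offset = None
        self.end_offset = None

    def increment_token(self):
        """Advance to the next token; return False when input is exhausted."""
        match = next(self._matches, None)
        if match is None:
            return False
        self.term = match.group()
        self.start_offset = match.start()
        self.end_offset = match.end()
        return True


stream = WhitespaceTokenStream("Le champagne est protégé")
tokens = []
while stream.increment_token():
    tokens.append((stream.term, stream.start_offset, stream.end_offset))
print(tokens)
# [('Le', 0, 2), ('champagne', 3, 12), ('est', 13, 16), ('protégé', 17, 24)]
```

The consumer never receives token objects; it reads the same attribute fields after each successful call, which is the pattern the slide describes.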
  68. Hands-on: Working with analyzers in code

    See demo code on http://github.com/atilika/berlin-buzzwords-2013
  69. Synonyms

    • Synonyms are flexible and easy to use • Very powerful tools for improving recall • Two types of synonyms • One way/mapping: “sparkling wine => champagne” • Two way/equivalence: “aoc, appellation d'origine contrôlée” • Can be applied at index time or query time • Apply synonyms on one side, not both • Best practice is to apply synonyms query-side • Allows for updating synonyms without reindexing • Allows for turning synonyms on and off easily
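The difference between the two synonym types can be shown with a small query-side expansion sketch. It assumes whole query phrases are looked up as-is (no tokenization) and uses the two rules from the slide; the data structures and the function name `expand` are made up for the example.

```python
# One-way mapping: the left-hand phrase is replaced by the right-hand side.
ONE_WAY = {"sparkling wine": ["champagne"]}
# Two-way equivalence: any member expands to all members of its group.
EQUIVALENT = [{"aoc", "appellation d'origine contrôlée"}]


def expand(phrase):
    """Return the phrases to search for, given a query phrase."""
    phrase = phrase.lower()
    if phrase in ONE_WAY:
        return list(ONE_WAY[phrase])
    for group in EQUIVALENT:
        if phrase in group:
            return sorted(group)
    return [phrase]


print(expand("sparkling wine"))  # ['champagne']
print(expand("champagne"))       # ['champagne']
print(expand("AOC"))             # ['aoc', "appellation d'origine contrôlée"]
```

Because the expansion happens when the query is parsed, the rule tables can be edited or disabled without touching the index, which is the advantage of the query-side approach.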
  70. ElasticSearch linguistics highlights

    • Uses Lucene analyzers, tokenizers & filters • Analyzers are made available through a provider interface • Some analyzers are available through plugins, e.g. kuromoji, smartcn, icu, etc. • Analyzers can be set up in your mapping • Analyzers can also be chosen based on a field in your document, e.g. a lang field
  71. Linguistics in Solr

    • Uses Lucene analyzers, tokenizers & filters • Linguistic processing is defined by field types in schema.xml • Different processing can be applied on the indexing and querying side if desired • A rich set of pre-defined and ready-to-use per-language field types is available • Defaults can be used as they are or as starting points for further configuration
  72. French in schema.xml

    <!-- French -->
    <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- removes l', etc -->
        <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" enablePositionIncrements="true"/>
        <filter class="solr.FrenchLightStemFilterFactory"/>
        <!-- less aggressive: <filter class="solr.FrenchMinimalStemFilterFactory"/> -->
        <!-- more aggressive: <filter class="solr.SnowballPorterFilterFactory" language="French"/> -->
      </analyzer>
    </fieldType>

    <!-- French -->
    <field name="title" type="text_fr" indexed="true" stored="true"/>
    <field name="body" type="text_fr" indexed="true" stored="true"/>
    <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
  73. Arabic in schema.xml

    <!-- Arabic -->
    <fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- for any non-arabic -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" enablePositionIncrements="true"/>
        <!-- normalizes alef maksura to yeh, etc -->
        <filter class="solr.ArabicNormalizationFilterFactory"/>
        <filter class="solr.ArabicStemFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- Arabic -->
    <field name="title" type="text_ar" indexed="true" stored="true"/>
    <field name="body" type="text_ar" indexed="true" stored="true"/>
    <dynamicField name="*_ar" type="text_ar" indexed="true" stored="true"/>
  74. Field types in schema.xml

    • text_ar Arabic • text_bg Bulgarian • text_ca Catalan • text_cjk CJK • text_cz Czech • text_da Danish • text_de German • text_el Greek • text_es Spanish • text_eu Basque • text_fa Farsi • text_fi Finnish • text_fr French • text_ga Irish • text_gl Galician • text_hi Hindi • text_hu Hungarian • text_hy Armenian • text_id Indonesian • text_it Italian • text_lv Latvian • text_nl Dutch • text_no Norwegian • text_pt Portuguese • text_ro Romanian • text_ru Russian • text_sv Swedish • text_th Thai • text_tr Turkish
  75. Field types in schema.xml

    • text_ar Arabic • text_bg Bulgarian • text_ca Catalan • text_cjk CJK • text_cz Czech • text_da Danish • text_de German • text_el Greek • text_es Spanish • text_eu Basque • text_fa Farsi • text_fi Finnish • text_fr French • text_ga Irish • text_gl Galician • text_hi Hindi • text_hu Hungarian • text_hy Armenian • text_id Indonesian • text_it Italian • text_lv Latvian • text_nl Dutch • text_no Norwegian • text_pt Portuguese • text_ro Romanian • text_ru Russian • text_sv Swedish • text_th Thai • text_tr Turkish • text_ko Korean (coming soon! LUCENE-4956)
  76. Adding document details

    Document fields: id, title, body • Request body: <add> <doc> <field> ∙∙∙ </field> </doc> </add> • UpdateRequestHandler handles request • 1. Receives a document via HTTP in XML (or JSON, CSV, ...) • 2. Converts document to a SolrInputDocument • 3. Activates the update chain
  78. Adding document details

    Document fields: id, title, body • Update chain of UpdateRequestProcessors • 1. Processes a document at a time with operation (add) • 2. Plugin logic can mutate SolrInputDocument, e.g. add fields or do other processing as desired
  82. Adding document details

    Document fields: id, title, body, lang • Update chain of UpdateRequestProcessors • 1. Update processor added a lang field by analyzing body • 2. Finish by calling RunUpdateProcessor (usually)
  85. Adding document details

    Document fields: id, title, body, lang • Lucene analyzer chain • 1. Fields are analyzed individually
  86. Adding document details

    Lucene analyzer chain • 1. No analysis on id
  87. Adding document details

    Lucene analyzer chain • 1. Field title being processed
  91. Adding document details

    Lucene analyzer chain • 1. Field body being processed
  95. Adding document details

    Lucene analyzer chain • 1. Field lang being processed • 2. Uses a different analyzer chain
  97. Adding document details

    Lucene analyzer chain • 1. All fields analyzed • Fields id, title, body and lang are now in the index
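The document-level part of this walkthrough, where an update processor adds a lang field before per-field analysis, can be sketched conceptually in Python. Everything here is an assumption made for illustration: the document is a plain dict, the processors are plain functions, and the detector is a toy that only looks at Unicode script ranges. It is not Solr's API and not a real language identifier.

```python
def detect_language(text):
    """Toy detector based on Unicode script ranges (illustration only)."""
    for ch in text:
        if "\u3040" <= ch <= "\u30ff":   # hiragana or katakana
            return "ja"
        if "\u0600" <= ch <= "\u06ff":   # Arabic block
            return "ar"
    return "en"                          # arbitrary fallback for this sketch


def language_processor(doc):
    """Update processor: add a lang field by analyzing the body field."""
    doc["lang"] = detect_language(doc.get("body", ""))
    return doc


def run_update_chain(doc, processors):
    """Pass one document through each processor in order."""
    for processor in processors:
        doc = processor(doc)
    return doc


doc = {"id": "1", "title": "Beer", "body": "ビールを飲みに行こう"}
print(run_update_chain(doc, [language_processor])["lang"])  # ja
```

The point of the design is the split in responsibilities: processors see the whole document and can add or change fields, while analyzers afterwards see one field at a time.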
  98. Multi-language challenges

    • How do we detect language accurately? • Indexing side is feasible (accuracy > 99.1%), but query side is hard because of ambiguity • How to deal with language on the query side? • Supply language to use in the application (best if possible) • Search all relevant language variants (OR query) • Search a fallback field using n-gramming • Boost important language or content • Not knowing the query term language will most likely have a negative impact on overall rank
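The "search all relevant language variants" option amounts to building an OR query over per-language fields. The sketch below assumes fields named with a language suffix such as body_fr, in the spirit of the `*_fr` dynamic fields shown earlier; the field naming, the function name and the lack of query escaping are all simplifications for illustration.

```python
def multilingual_query(user_query, languages, field="body"):
    """Build an OR query over per-language fields such as body_fr, body_de."""
    clauses = ["{}_{}:({})".format(field, lang, user_query) for lang in languages]
    return " OR ".join(clauses)


print(multilingual_query("champagne", ["en", "fr", "de"]))
# body_en:(champagne) OR body_fr:(champagne) OR body_de:(champagne)
```

Each clause is analyzed with the analyzer of its own field type, so the same user input is stemmed and normalized differently per language.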
  99. Basis Technology

    • High-end provider of text analytics software • Rosette Linguistics Platform (RLP) highlights • Language and encoding identification (55 languages and 45 encodings) • Segmentation for Chinese, Japanese and Korean • De-compounding for German, Dutch, Korean, etc. • Lemmatization for a range of languages • Part-of-speech tagging for a range of languages • Sentence boundary detection • Named entity extraction • Name indexing, transliteration and matching • Integrates well with Lucene/Solr
  100. Apache OpenNLP

    • Machine learning toolkit for NLP • Implements a range of common and best-practice algorithms • Very easy-to-use tools and APIs targeted towards NLP • Features and applications • Tokenization • Sentence segmentation • Part-of-speech tagging • Named entity recognition • Chunking • Licensing terms • Code itself has an Apache License 2.0 • Some models are available, but licensing terms and F-scores are unclear... • See LUCENE-2899 for OpenNLP as a Lucene Analyzer (work-in-progress)
  101. Hands-on: Basic text processing with OpenNLP

    See demo code on http://github.com/atilika/berlin-buzzwords-2013
  102. Summary

    • Getting languages right is a hard problem • Linguistics helps improve search quality • Linguistics in Lucene, ElasticSearch and Solr • A wide range of languages are supported out-of-the-box • Considerations to be made on indexing and query side • Lucene Analyzers work on a per-field level • Solr UpdateRequestProcessors work on the document level • Solr has functionality for automatically detecting language (available in ElasticSearch as a plugin) • Linguistics options are also available in the eco-system
  103. Practical advice

    • Understand your content and your users’ needs • Understand your language and its issues • Understand what users want from search • Do you have issues with recall? • Consider synonyms, stemming • Consider compound-segmentation for European languages • Consider WordDelimiterFilter, phonetic matching • Do you have issues with precision? • Consider using ANDs instead of ORs for terms • Consider improving content quality? Search fewer fields? • Is some content more important than other content? • Consider boosting content with a boost query
  104. Thank you

    • Jan Høydahl (www.cominvent.com): Thanks for some slide material • Bushra Zawaydeh: Thanks for fun Arabic language lessons • Gaute Lambertsen: Thanks for helping with talk preparations
  105. Example code

    • Example code is available on GitHub • https://github.com/atilika/berlin-buzzwords-2013 • Get started using • git clone git://github.com/atilika/berlin-buzzwords-2013.git • less berlin-buzzwords-2013/README.md • Contact us if you have any questions • [email protected]