Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Поиск? Sphinx!

Поиск? Sphinx!

Sphinx считается одним из самых быстрых и гибких поисковых движков на рынке, но не является "коробочным" решением, чем отпугивает многих разработчиков. Я расскажу как быстро поднять полнотекстовый поиск для своего проекта на базе Sphinx, почему он крут и какие существуют интеграционные решения для Python.

Avatar for Roman Zaiev

Roman Zaiev

June 08, 2013
Tweet

More Decks by Roman Zaiev

Other Decks in Programming

Transcript

  1. полнотекстовый поиск фильтрация сортировка группировка сниппеты работа с существующими документами

    широкий набор атрибутов быстрая индексация гибкий язык запросов морфологический анализатор управление релевантностью масштабируемость скорость качество стоимость Обычные требования к поиску
  2. ПОИСК до 250 запросов в секунду на каждое ядро с

    1 000 000 документов * * зависит от размера индекса
  3. до 256 полей для индексации на один индекс до 32-х

    атрибутов различных типов * хватит на все случаи жизни
  4. $ apt-get install sphinx $ brew install sphinx Ubuntu OS

    X $ ./configure --with-pgsql $ make $ make install DIY
  5. sphinx.conf | connection source common { type = pgsql sql_host

    = localhost sql_user = sphinx sql_pass = sphinx sql_db = megaportal }
  6. sphinx.conf | source description source company: common { sql_query =\

    SELECT company.id, \ company.name, \ company.date_created \ FROM company sql_field_string = name sql_attr_timestamp = date_created }
  7. sphinx.conf | source description source company: common { sql_query =\

    SELECT company.id, \ company.name, \ company.date_created \ FROM company sql_field_string = name sql_attr_timestamp = date_created } выборка атрибуты
  8. sphinx.conf | index description index common { type = plain

    morphology = stem_en, stem_ru min_word_len = 2 charset_type = utf-8 html_strip = 1 html_remove_elements = script html_index_attrs = img=alt,title; a=title; min_stemming_len = 3 min_infix_len = 3 enable_star = 1 ... }
  9. sphinx.conf | index description index common { type = plain

    morphology = stem_en, stem_ru min_word_len = 2 charset_type = utf-8 html_strip = 1 html_remove_elements = script html_index_attrs = img=alt,title; a=title; min_stemming_len = 3 min_infix_len = 3 enable_star = 1 ... } «джентельменский набор» десятки других опций
  10. sphinx.conf | index description index company: common { source =

    company path = /path/to/index_files/megaportal }
  11. sphinx.conf | indexer tuning searchd { listen = 9306:mysql41 log

    = /path/to/logs/searchd.log query_log = /path/to/logs/query.log pid_file = /path/to/pid/searchd.pid }
  12. sphinx.conf | fatality source common { type = pgsql sql_host

    = localhost sql_user = sphinx sql_pass = sphinx sql_db = megaportal } source company: common { sql_query =\ SELECT company.id, \ company.name, \ company.date_created \ FROM company sql_field_string = name sql_attr_timestamp = date_created } index common { type = plain morphology = stem_en, stem_ru min_word_len = 2 charset_type = utf-8 html_strip = 1 html_remove_elements = script html_index_attrs = img=alt,title; a=title; min_stemming_len = 3 min_infix_len = 3 enable_star = 1 } index company: common { source = company path = /path/to/index_files/megaportal } indexer { mem_limit = 512M } searchd { listen = 9306:mysql41 log = /path/to/logs/searchd.log query_log = /path/to/logs/query.log pid_file = /path/to/pid/searchd.pid } описание документов описание индексов indexer + searchd
  13. Sphinx 2.0.6-release (r3473) Copyright (c) 2001-2012, Andrew Aksyonoff Copyright (c)

    2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com) using config file '/path/to/sphinx.conf'... indexing index 'common'... ERROR: index 'common': no valid sources configured; skipping. indexing index 'company'... collected 1066244 docs, 28.7 MB sorted 2.3 Mhits, 100.0% done total 1066244 docs, 28735925 bytes total 11.452 sec, 2509034 bytes/sec, 93097.50 docs/sec total 60 reads, 0.019 sec, 495.0 kb/call avg, 0.3 msec/call avg total 127 writes, 0.053 sec, 471.7 kb/call avg, 0.4 msec/call avg rotating indices: successfully sent SIGHUP to searchd (pid=56606).
  14. $ mysql -h 0 -P 9306 Welcome to the MySQL

    monitor. Commands end with ; or \g. Your MySQL connection id is 1 Server version: 2.0.6-release (r3473) Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql>
  15. mysql> select * from company where match('тнк'); +--------+--------+--------------+----------+ | id

    | status | date_created | owner_id | +--------+--------+--------------+----------+ | 5015 | 6 | 2008 | 5019 | | 25502 | 3 | 2009 | 25507 | | 39771 | 6 | 2009 | 39776 | | 152307 | 1 | 2010 | 152380 | | 183905 | 3 | 2010 | 184097 | | 194302 | 6 | 2010 | 194517 | | 218982 | 1 | 2011 | 219439 | | 235881 | 3 | 2011 | 236408 | | 287319 | 3 | 2011 | 288131 | | 338476 | 3 | 2011 | 339574 | | 340073 | 6 | 2011 | 341177 | | 471410 | 2 | 2012 | 473498 | | 513023 | 0 | 2012 | 515276 | | 768093 | 1 | 2012 | 770983 | | 823359 | 6 | 2012 | 826706 | | 915374 | 3 | 2013 | 919765 | +--------+--------+--------------+----------+ 16 rows in set (0.00 sec)
  16. mysql> select * from company where match('тнк') and date_created between

    2010 and 2012; +--------+--------+--------------+----------+ | id | status | date_created | owner_id | +--------+--------+--------------+----------+ | 152307 | 1 | 2010 | 152380 | | 183905 | 3 | 2010 | 184097 | | 194302 | 6 | 2010 | 194517 | | 218982 | 1 | 2011 | 219439 | | 235881 | 3 | 2011 | 236408 | | 287319 | 3 | 2011 | 288131 | | 338476 | 3 | 2011 | 339574 | | 340073 | 6 | 2011 | 341177 | | 471410 | 2 | 2012 | 473498 | | 513023 | 0 | 2012 | 515276 | | 768093 | 1 | 2012 | 770983 | | 823359 | 6 | 2012 | 826706 | +--------+--------+--------------+----------+ 12 rows in set (0.01 sec)
  17. mysql> SHOW META; +---------------+--------+ | Variable_name | Value | +---------------+--------+

    | total | 6 | | total_found | 6 | | time | 0.000 | | keyword[0] | тнк | | docs[0] | 16 | | hits[0] | 16 | +---------------+--------+ 6 rows in set (0.00 sec) mysql> DESC company; +--------------+-----------+ | Field | Type | +--------------+-----------+ | id | integer | | name | field | | status | uint | | date_created | timestamp | | owner_id | uint | +--------------+-----------+ 5 rows in set (0.00 sec)
  18. «Порционный» запрос source company: common { sql_query_range =\ SELECT MIN(id),

    MAX(id) from company sql_range_step = 10000 sql_query =\ SELECT company.id, \ company.name, \ company.date_created \ FROM company \ WHERE company.id BETWEEN $start AND $end sql_field_string = name sql_attr_timestamp = date_created }
  19. Расширяем «охват» индекса source company: common { ... sql_query =\

    SELECT company.id, \ company.name, \ company.date_created, \ "user".email as owner_email \ FROM company \ LEFT JOIN "user" \ ON company.owner_id = "user".id \ WHERE company.id BETWEEN $start AND $end sql_field_string = name sql_field_string = owner_email sql_attr_timestamp = date_created }
  20. One-to-many? M2M? MVA! source company: common { ... sql_field_string =

    name sql_field_string = owner_email sql_attr_timestamp = date_created sql_attr_multi =\ uint products from ranged-query; \ SELECT company_id, id FROM product \ WHERE id >= $start AND id <= $end; \ SELECT MIN(id), MAX(id) FROM product }
  21. mysql> select name, products from company where match('тнк') and products

    in (109503, 1123362); +------------------------------+----------+ | name | products | +------------------------------+----------+ | ТНК-Транс | 109503 | | ООО «ТНК-Транс» | 1123362 | +------------------------------+----------+ 2 rows in set (0.00 sec)
  22. source company: common { sql_query_pre = SET @maxts:=(SELECT NOW()) sql_query

    = SELECT company.id, company.name FROM company \ WHERE company.created_at < @maxts ... sql_query_post =\ REPLACE INTO search_deltacounters \ VALUES (@id, 'company_tmp', @maxts) sql_query_post_index =\ DELETE FROM search_deltacounters \ WHERE tablename='company' sql_query_post_index =\ UPDATE search_deltacounters \ SET tablename='company' WHERE tablename='company_tmp' } Модифицируем основной индекс
  23. source company_delta: company { sql_query_pre =\ SET @maxts=(SELECT maxts FROM

    search_deltacounters \ WHERE tablename='company') sql_query = SELECT company.id, company.name FROM company \ WHERE company.created_at >= @maxts ... sql_query_post = sql_query_post_index = } Создаём дельту index company_delta: common { source = company_delta path = /path/to/indexes/megaportal }
  24. mysql> select * from company, company_delta where match('тнк'); +--------+--------+--------------+----------+ |

    id | status | date_created | owner_id | +--------+--------+--------------+----------+ | 5015 | 6 | 2008 | 5019 | | 25502 | 3 | 2009 | 25507 | | 39771 | 6 | 2009 | 39776 | | 152307 | 1 | 2010 | 152380 | | 183905 | 3 | 2010 | 184097 | | 194302 | 6 | 2010 | 194517 | | 218982 | 1 | 2011 | 219439 | | 235881 | 3 | 2011 | 236408 | | 287319 | 3 | 2011 | 288131 | | 338476 | 3 | 2011 | 339574 | | 340073 | 6 | 2011 | 341177 | | 471410 | 2 | 2012 | 473498 | | 513023 | 0 | 2012 | 515276 | | 768093 | 1 | 2012 | 770983 | | 823359 | 6 | 2012 | 826706 | | 915374 | 3 | 2013 | 919765 | +--------+--------+--------------+----------+ 16 rows in set (0.00 sec)
  25. :(

  26. sphinxit | config class SearchConfig(object): DEBUG = True WITH_META =

    True WITH_STATUS = True SEARCHD_CONNECTION = { 'host': '127.0.0.1', 'port': 9306, }
  27. sphinxit | usage from sphinxit.core.processor import Search, Snippet company_search =

    ( Search( indexes=['company'], config=SearchConfig ) .match('ТНК') ) SELECT * FROM company WHERE MATCH('ТНК')
  28. sphinxit | usage search = Search(['company'], config=SearchConfig) search = (

    search .match('ТНК') .select('id', 'name') .options( ranker='proximity', max_matches=100, ) .order_by('name', 'desc') ) SELECT id, name FROM company WHERE MATCH('ТНК') ORDER BY name DESC OPTION max_matches=100, ranker=proximity
  29. sphinxit | usage search = Search(['company'], config=SearchConfig) search = (

    search .match('ТНК') .filter(date_created__lte=datetime.date.today()) ) results = search.ask() SELECT * FROM company WHERE MATCH('ТНК') AND date_created<=1370552400
  30. sphinxit | result { u'result': [ { 'date_created': 2008L, 'owner_email':

    u'[email protected]', 'products': u'2,5', 'id': 5015L, 'name': u'\u0422\u041d\u041a', }, { 'date_created': 2009L, 'owner_email': u'[email protected]', 'products': u'2,5', 'id': 25502L, 'name': u'\u0422\u041d\u041a \u0418\u043d\u0442\u0435\u0440\u043', } ], ... u'meta': { u'total': u'16', u'total_found': u'16', u'docs[0]': u'16', u'time': u'0.000', u'hits[0]': u'16', u'keyword[0]': u'\u0442\u043d\u043a' } }
  31. sphinxit | update syntax search = Search(['company'], config=SearchConfig) search =

    ( search .match('ТНК') .update(products=(5,2)) .filter(id__gt=1) ) UPDATE company SET products=(5,2) WHERE MATCH('ТНК') AND id>1
  32. sphinxit | snippets syntax snippets = ( Snippet(index='company', config=SearchConfig) .for_query("Me

    amore") .from_data("amore", "amore mia") ) CALL SNIPPETS ( ('amore', 'me amore'), 'company', 'Me amore' ); { u'result': [ {'snippet': u'<b>amore</b>'}, {'snippet': u'<b>amore</b> mia'} ] }
  33. sphinxit | snippets syntax snippets = ( Snippet(index='company', config=SearchConfig) .for_query("Me

    amore") .from_data("amore mia") .options( before_match='<strong>', after_match='</strong>', ) ) CALL SNIPPETS ( 'amore mia', 'company', 'Me amore', '<strong>' AS before_match, '</strong>' AS after_match )