• Author of Haystack (& Tastypie) * Haystack is a semi-popular search library for Django * This talk isn’t directly about Django, but everyone writing Python code in some capacity should benefit
• Frequency of the searches • Intended use • Amount of data • Space to store the data • CPU needed to run the search * Frequency - tens of times a day? hundreds? thousands? * Intended - just another feature? frequent? core/indispensable? * Amount - few records or many * Space - tiny, single field documents? or huge docs with lots of fields & content? * CPU - how intensive are the searches?
for the_file in our_files:
    the_data = open(the_file).read()
    if query in the_data:
        # ...

Why is this wrong? * Python is SLOW (compared to other things) * So very RAM/CPU inefficient * I/O wait * Worst way to look for the actual text
Why is this wrong? * Still having to read all the data into RAM * Grep is smart enough to stream chunks of the data off the disk rather than consuming it all at once * Shelling out hurts
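A minimal sketch of that shelling-out approach, assuming GNU grep is on the PATH (grep_search is my name for it, not the talk's):

    import subprocess

    def grep_search(query, filenames):
        # grep -l prints just the names of matching files;
        # grep exits non-zero when nothing matches, hence the try/except
        try:
            out = subprocess.check_output(["grep", "-l", query] + list(filenames))
        except subprocess.CalledProcessError:
            return []
        return out.decode("utf-8").splitlines()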
• Reading everything into RAM • Manually looking for a substring • Happens for every query * Reading everything into RAM is bad because that’s a lot of unused data * Substrings suck because you have to search the whole thing, perhaps character by character * Running through everything on every query is tons of slow I/O
= { "hello": ['doc1'], "world": ['doc1', 'doc2'], "travel": ['doc2'], "welcome": ['doc2'], # ... } * Think a Python dictionary * Split your documents on whitespace to tokenize up the content * Talk about stemming, stop words, etc. * Keys in the dictionary are the (now-unique) words * Values in the dictionary are document ids
index = {}

def index_file(the_file):
    words = set(open(the_file).read().split())
    for word in words:
        index.setdefault(word, set())
        index[word].add(the_file)

# Later...
def search(query):
    return index.get(query, set())

Why is this inefficient? * RAM-hungry * Ugly global state * Lose the whole index between restarts * Only exact word matching * No complex queries
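To make the "only exact word matching" problem concrete (hypothetical file contents):

    index_file("doc1.txt")  # suppose it contains "Hello world"

    search("world")    # -> set(['doc1.txt'])
    search("hello")    # -> set(): the token was stored as "Hello", case and all
    search("worlds")   # -> set(): no stemming, no partial matches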
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT

schema = Schema(content=TEXT)
# "indexdir" must already exist on disk
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(content=u"This is the first document we've added!")
writer.add_document(content=u"The second one is even more interesting!")
writer.commit()
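And the read side, using Whoosh's standard query API (a sketch continuing the example above):

    from whoosh.qparser import QueryParser

    with ix.searcher() as searcher:
        # Parse a free-text query against the "content" field
        query = QueryParser("content", ix.schema).parse(u"first")
        for hit in searcher.search(query):
            print(hit)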
Why? • Good query support Why Not? • Requires filesystem access • Expects a structured query object * Filesystem access means distributed setups hurt * Replication of the index between servers * Things falling out of sync
Why? • Good query support • Awesome ops story • NO XML! Why Not? • JVM noms the RAMs • Many servers are best * I’d be remiss not to mention this also has a Haystack backend
• Good search needs good content * Garbage in, garbage out * Clean data makes for a happy engine * Strip out HTML, unnecessary data, meaningless numbers, etc.
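A sketch of that cleanup step using only the stdlib (strip_html/TextExtractor are my names; swap in lxml or BeautifulSoup if you already use them):

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        # Collects only the text nodes, dropping tags and attributes
        def __init__(self):
            HTMLParser.__init__(self)
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    def strip_html(markup):
        extractor = TextExtractor()
        extractor.feed(markup)
        return " ".join(" ".join(extractor.chunks).split())

    strip_html("<p>Hello <b>world</b>!</p>")  # -> "Hello world !"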
• Good search needs good content • Feed the beast • Update documents out of process * Updates should happen in a different thread/process/queue * Especially in a web environment (don’t block the response waiting on the engine) * No real reason to make the user wait, especially on data we already have/can rebuild
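A minimal sketch of pushing updates out of the request path with a stdlib worker thread (a real deployment would more likely use a proper queue like Celery; update_engine is a stand-in, not a real API):

    import queue
    import threading

    updates = queue.Queue()

    def update_engine(doc):
        # Stand-in for the real call: solr.add([doc]), writer.add_document(...), etc.
        print(doc)

    def index_worker():
        # Drains the queue forever in the background
        while True:
            update_engine(updates.get())
            updates.task_done()

    threading.Thread(target=index_worker, daemon=True).start()

    # In the web request: enqueue and return immediately
    updates.put({"id": "doc1", "title": "Hello"})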
solr.search(u"world") for result in results: print '<a href="/docs/{0}">{1}</a>'.format( result['id'], result['title'] ) * Just the results is pretty easy * We could be denorming/storing other data for display here
# Typical Solr highlighting params (dotted names, hence the dict)
kwargs = {
    'hl': 'true',
    'hl.fragsize': 10,
}
results = solr.search(u"world", **kwargs)

for result in results:
    print results.highlighting[result['id']]

* With highlights, things get a little more interesting * Most engines can create some (non-standard) HTML * Kinda yuck, but can be liveable
2012-10-31])") * Shown here is the Lucene-style syntax (Whoosh, Solr, ES) * Xapian has similar faculities, but you go about it a very different way * Can express many & varied queries here
# Typical Solr faceting params
kwargs = {
    'facet': 'on',
    'facet.field': 'author',
}
results = solr.search(u"world", **kwargs)
# ...
print results.facets['facet_fields']['author']

Caveats: * You need to be storing additional fields * You need to be storing exact (non-stemmed/post-processed) data in those fields * This kills searching in those fields, so you may need to duplicate
• Geospatial search • Specialization • Contextual search • Real-time • New ways to present search results * Contextual search is something the engines don’t specifically have but you can easily build * Gives the user the ability to search within a given “silo” * Even better, start to narrow it based on where they already are * Real-time: the fresher the results, the better * Be careful not to swamp yourself * Sometimes you can fake it (within 1 minute, within 5 minutes, etc) - see the sketch below * New presentation: everyone is used to Google-style results (though they’re getting richer) * Provide more context * Make them appear in new places
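A sketch of "faking it": batch the writes and only flush every so often, so results stay near-fresh without swamping the engine (the interval and names are mine; assumes the pysolr solr object from earlier):

    import time

    pending = []
    last_commit = time.time()
    COMMIT_EVERY = 60  # seconds; "within 1 minute" freshness

    def enqueue(doc):
        global last_commit
        pending.append(doc)
        if time.time() - last_commit >= COMMIT_EVERY:
            solr.add(pending)  # one batched request, committed by pysolr
            del pending[:]
            last_commit = time.time()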