
The first step self made full text search

ryokato
September 17, 2019


Transcript

  1. Things this talk does not cover
     • Search applications built on full-text search engines such as Elasticsearch / Solr, and tips for using them from Python
       • If you want that, see the PyCon JP talk (*) or come to a search-technology meetup
       • Or let's talk about it whenever you have free time
     • Crawling and scraping
       • If you are interested, read a book or see the blog posts by @shinyorke (*) and @vaaaaanquish (*)
     • * https://slideship.com/users/@iktakahiro/presentations/…
     • * https://shinyorke.hatenablog.com/entry/…
     • * https://vaaaaaanquish.hatenablog.com/entry/…
  2. A Python search engine you can build in a few minutes

     def add_text(text: str):
         with open("db.txt", "a") as f:
             f.write(text + "\n")

     def search(keyword: str):
         with open("db.txt", "r") as f:
             return [l.strip() for l in f if keyword in l]

     if __name__ == "__main__":
         texts = [
             "Beautiful is better than ugly.",
             "Explicit is better than implicit.",
             "Simple is better than complex."
         ]
         for text in texts:
             add_text(text)
         results = search("Simple")
         for result in results:
             print(result)
  3. About full-text search
     • Grep type (*)
       • Performs a linear scan over the text
       • One view is that on modern computers this is already sufficient for simple queries over a corpus the size of the complete works of Shakespeare (*)
     • Index type (*)
       • Scan the target documents ahead of time and build index data
     • Vector type
       • Build feature vectors and compute distances between them
     • * https://ja.wikipedia.org/wiki/全文検索
     • * "Introduction to Information Retrieval" (情報検索の基礎, Kyoritsu Shuppan)
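     To make the contrast concrete, here is a minimal sketch (my own illustration, not from the slides) of a grep-style linear scan versus an index-style lookup over a toy corpus:

     docs = ["python search engine", "inverted index basics", "python tips"]

     # Grep type: scan every document for every query.
     def grep_search(keyword):
         return [i for i, d in enumerate(docs) if keyword in d]

     # Index type: scan once up front, then answer queries from the index.
     index = {}
     for i, d in enumerate(docs):
         for word in d.split():
             index.setdefault(word, []).append(i)

     def index_search(keyword):
         return index.get(keyword, [])

     print(grep_search("python"))   # [0, 2]
     print(index_search("python"))  # [0, 2]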
  4. Inverted index (also called a reverse index)
     • An index-style data structure that keeps a mapping from each word to the documents that contain it (*)
     • It consists of a dictionary and postings
     • (Diagram: the dictionary maps terms such as "Python" and "Elasticsearch" to their postings lists, i.e. chains of postings)
     • * Wikipedia
     • * Figure from "Introduction to Information Retrieval" (情報検索の基礎, Kyoritsu Shuppan)
  5. Index granularity of an inverted index
     • Record-level (document-level) inverted index
       • For each word, keep a list of the documents (document ids) that contain it
       • e.g. Python → [list of document ids]
       • Pros: simple and easy to implement; small on disk
       • Cons: limited functionality
     • Word-level inverted index
       • For each word, keep the containing documents (document ids) plus extra information, e.g. the positions where the word occurs
       • e.g. Python → [(document id, positions), ...]
       • Pros: richer functionality; phrase search becomes possible, for example
       • Cons: larger on disk
     • * "Introduction to Information Retrieval" (情報検索の基礎, Kyoritsu Shuppan)
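     A minimal data-structure sketch of the two granularities (the values are illustrative, not from the slides):

     # Record-level: token -> list of document ids.
     record_level = {
         "python": [1, 3],
         "search": [1, 2, 3],
     }

     # Word-level: token -> list of (document id, positions of the token in that document).
     word_level = {
         "python": [(1, [0]), (3, [2, 7])],
         "search": [(1, [1]), (2, [0]), (3, [3])],
     }

     # Phrase search "python search" needs positions: look for adjacent occurrences.
     def phrase_match(doc_id, first, second):
         pos1 = dict(word_level[first]).get(doc_id, [])
         pos2 = dict(word_level[second]).get(doc_id, [])
         return any(p + 1 in pos2 for p in pos1)

     print(phrase_match(1, "python", "search"))  # True: positions 0 and 1 are adjacent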
  6. Morphological analysis (whitespace) vs. N-gram
     • Tokenizing 「東京都知事」 (Governor of Tokyo):
       • Morphological analysis: 「東京」「都知事」
       • N-gram (bigram): 「東京」「京都」「都知」「知事」

                           Morphological analysis   N-gram
       Number of tokens    fewer                    more
       Index size          smaller                  larger
       Missed matches      more                     fewer
       New-word support    ✕                        ○
       Noise               less                     more
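     As a quick illustration (not part of the deck's implementation), a character-bigram tokenizer of the kind compared above fits in a few lines:

     def bigram_tokenize(text: str):
         # Character 2-gram: every pair of adjacent characters becomes a token.
         text = text.replace(" ", "")
         return [text[i:i + 2] for i in range(len(text) - 1)]

     print(bigram_tokenize("東京都知事"))  # ['東京', '京都', '都知', '知事']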
  7. Index generation steps
     • Apply processing to the whole text (char filter)
       • e.g. strip HTML tags, convert to lower / upper case
     • Split the given text into pieces (tokenizer)
     • Apply processing to each token (token filter)
       • e.g. normalize tokens, remove stop words
     • Together, these steps are called an Analyzer
  8. "OBMZ[FSॲཧ $IBSpMUFS 5PLFOJ[FS 5PLFOpMUFS lI4JNQMFJTCFUUFSUIBODPNQMFYIz lTJNQMFJTCFUUFSUIBODPNQMFYz <lTJNQMFz lJTz lCFUUFSz lUIBOz

    lDPNQMFYz> <lTJNQMFz lJTz lCFUUFSHPPEz lUIBOz lDPNQMFYz> <lTJNQMFz lHPPEz lUIBOz lDPNQMFYz> 5FYU 5PLFOT "OBMZ[FS
  9. Building a simple search engine
     • Requirements
       • Can search the PyCon JP talks
       • Title and description
       • Supports boolean search (AND / OR / NOT)
       • Results are returned in score (TF-IDF) order
       • No phrase search
       • No multi-field search
       • No document updates
       • No adding documents after indexing
  10. "OBMZ[FS࣮૷ w ӳޠͱ೔ຊޠͦΕͧΕͷ"OBMZ[FSΛ࡞੒ w ڞ௨ w IUNMλάআ֎ w ετοϓϫʔυআ֎ w

    ΞϧϑΝϕοτ͸͢΂ͯখจࣈʹม׵ w TUFBNJOHॲཧ w JHEPHTˠEPH w ೔ຊޠ w ܗଶૉղੳ෼ׂ w ॿࢺɾ෭ࢺɾه߸আ֎ w ӳޠ w 8IJUFTQBDF෼ׂ w ॿࢺ౳͸ετοϓϫʔυͰରԠ
 11. CharFilter
     • Create a common interface
     • Strip HTML tags with a regular expression
     • Convert alphabetic characters to lower case

     import re

     class CharacterFilter:
         @classmethod
         def filter(cls, text: str):
             raise NotImplementedError

     class HtmlStripFilter(CharacterFilter):
         @classmethod
         def filter(cls, text: str):
             html_pattern = re.compile(r"<[^>]*?>")
             return html_pattern.sub("", text)

     class LowercaseFilter(CharacterFilter):
         @classmethod
         def filter(cls, text: str):
             return text.lower()
 12. Tokenizer
     • Prepare a common interface
     • Janome is used for morphological analysis
     • Note: constructing Janome's Tokenizer object is expensive, so reuse a single instance (*)

     import re
     from janome.tokenizer import Tokenizer

     class BaseTokenizer:
         @classmethod
         def tokenize(cls, text):
             raise NotImplementedError

     class JanomeTokenizer(BaseTokenizer):
         # Reuse one Tokenizer instance; initialization is costly.
         tokenizer = Tokenizer()

         @classmethod
         def tokenize(cls, text):
             return (t for t in cls.tokenizer.tokenize(text))

     class WhitespaceTokenizer(BaseTokenizer):
         @classmethod
         def tokenize(cls, text):
             return (t[0] for t in re.finditer(r"[^ \t\r\n]+", text))

     • * https://github.com/mocobeta/janome
     • * https://mocobeta.github.io/janome/ (Tokenizer docs)
 13. TokenFilter
     • Prepare a common interface
     • Because of the POSFilter, the token argument can be either a str or a janome Token object

     from janome.tokenizer import Token

     STOPWORDS = ("is", "was", "to", "the")

     def is_token_instance(token):
         return isinstance(token, Token)

     class TokenFilter:
         @classmethod
         def filter(cls, token):
             """
             in: string or janome.tokenizer.Token
             """
             raise NotImplementedError

     class StopWordFilter(TokenFilter):
         @classmethod
         def filter(cls, token):
             if isinstance(token, Token):
                 if token.surface in STOPWORDS:
                     return None
             if token in STOPWORDS:
                 return None
             return token
 14. TokenFilter
     • Stemming
       • Extracts the stem of a word
       • e.g. "dogs" → "dog"
       • Uses the nltk.stem package (*)

     from nltk.stem.porter import PorterStemmer

     ps = PorterStemmer()

     class Stemmer(TokenFilter):
         @classmethod
         def filter(cls, token: str):
             if token:
                 return ps.stem(token)

     • * http://www.nltk.org/api/nltk.stem.html
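     A quick check of the stemmer on its own (illustrative):

     from nltk.stem.porter import PorterStemmer

     ps = PorterStemmer()
     print(ps.stem("dogs"))    # dog
     print(ps.stem("better"))  # better  (Porter stemming is not lemmatization)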
 15. TokenFilter
     • POSFilter
       • Removes tokens with particular parts of speech
       • Decided from the part_of_speech attribute of the janome Token

     class POSFilter(TokenFilter):
         """Filter that removes Japanese particles, adverbs and symbols."""

         @classmethod
         def filter(cls, token):
             """
             in: janome Token
             """
             stop_pos_list = ("助詞", "副詞", "記号")
             if any([token.part_of_speech.startswith(pos) for pos in stop_pos_list]):
                 return None
             return token
  16. "OBMZ[FS w #BTFDMBTT w Ϋϥεม਺ͰࢦఆͰ͖ΔΑ͏ʹ͢Δ w UPLFOJ[FS  w DIBSpMFS

     w UPLFO@pMUFS w pMUFS͸ෳ਺ࢦఆ͢ΔͨΊ഑ྻ w લ͔Βॱ൪ʹॲཧ͍ͯ͘͠ class Analyzer: tokenizer = None char_filters = [] token_filters = [] @classmethod def analyze(cls, text: str): text = cls._char_filter(text) tokens = cls.tokenizer.tokenize(text) filtered_token = (cls._token_filter(token) for t oken in tokens) return [parse_token(t) for t in filtered_token i f t] @classmethod def _char_filter(cls, text): for char_filter in cls.char_filters: text = char_filter.filter(text) return text @classmethod def _token_filter(cls, token): for token_filter in cls.token_filters: token = token_filter.filter(token) return token
  17. "OBMZ[FS w ݸผͷ"OBMZ[FS w ೔ຊޠ༻ͷ+BQBOFTF5PLFOJ[FS w ӳޠ༻ͷ&OHMJTI5PLFOJ[FS w ͔ΜͨΜʹผͷ"OBMZ[FS΋࡞੒Մೳ class

    JapaneseAnalyzer(Analyzer): tokenizer = JanomeTokenizer char_filters = [HtmlStripFilter, LowercaseFilter] token_filters = [StopWordFilter, POSFilter, Stemmer] class EnglishAnalyzer(Analyzer): tokenizer = WhitespaceTokenizer char_filters = [HtmlStripFilter, LowercaseFilter] token_filters = [StopWordFilter, Stemmer]
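     A usage sketch, assuming the classes above are in scope and that parse_token (not shown in the deck) simply returns a token's surface string:

     def parse_token(token):
         # Hypothetical helper: surface string of a janome Token, or the str itself.
         return token.surface if hasattr(token, "surface") else token

     print(EnglishAnalyzer.analyze("<h1>Simple is better than complex</h1>"))
     # e.g. ['simpl', 'better', 'than', 'complex']  ("is" is dropped, the rest are stemmed)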
 18. Indexer
     • Receives documents and builds the inverted index
     • A temporary inverted index is kept in memory as a dictionary
     • Once a certain number of entries have accumulated in memory, they are saved to storage

     class InvertedIndex:
         def __init__(self, token_id: int, token: str, postings_list=None, docs_count=0) -> None:
             self.token_id = token_id
             self.token = token
             self.postings_list = postings_list or []
             self.__hash_handle = {}
             self.docs_count = docs_count

     def add_document(doc: str):
         """Add a document to the database and build the inverted index."""
         if not doc:
             return
         # Build a mini inverted index from the document id and its text.
         text_to_postings_lists(doc)
         # Once enough documents have accumulated in the mini index, merge it into storage.
         # (TEMP_INVERT_INDEX is a module-level dict, LIMIT the flush threshold.)
         if len(TEMP_INVERT_INDEX) >= LIMIT:
             for inverted_index in TEMP_INVERT_INDEX.values():
                 save_index(inverted_index)
 19. Indexer
     • Building the postings lists
     • Tokens are produced by the Analyzer
     • The total number of tokens in each document is computed up front, for the score calculation
     • The postings store (document id, occurrence count) pairs rather than word positions
       • Positions are unnecessary because phrase search is not supported
       • The per-document count of each token is still kept in the index for scoring
       • e.g. Python → [doc_id: count, ...]

     from collections import Counter

     def text_to_postings_lists(text) -> list:
         """Build the postings for one document."""
         tokens = JapaneseAnalyzer.analyze(text)
         token_count = len(tokens)
         document_id = save_document(text, token_count)
         cnt = Counter(tokens)
         for token, c in cnt.most_common():
             token_to_posting_list(token, document_id, c)

     def token_to_posting_list(token: str, document_id: int, token_count: int):
         """Build a posting from a token and add it to the temporary index."""
         token_id = get_token_id(token)
         index = TEMP_INVERT_INDEX.get(token_id)
         if not index:
             index = InvertedIndex(token_id, token)
         posting = "{}: {}".format(str(document_id), str(token_count))
         index.add_posting(posting)
         TEMP_INVERT_INDEX[token_id] = index
 20. Indexer: a small example
     • Documents
       • doc1: 「私は私です。」 ("I am me.")
       • doc2: 「私とpython。」 ("Me and python.")
     • Tokens
       • token1: 私
       • token2: python
     • Token counts
       • doc1: …
       • doc2: …
     • Postings lists
       • python → [doc2: 1]
       • 私 → [doc1: 2, doc2: 1]
 21. Storage
     • Database schema
     • The text is stored in the Documents table
     • The text of the searchable fields is also kept together in the text column

     from sqlalchemy import Column, Integer, String
     from sqlalchemy.ext.declarative import declarative_base

     Base = declarative_base()

     class Documents(Base):
         __tablename__ = "documents"
         id = Column(Integer, primary_key=True)
         text = Column(String)
         token_count = Column(Integer)
         date = Column(String)
         time = Column(String)
         room = Column(String)
         title = Column(String)
         abstract = Column(String)
         speaker = Column(String)
         self_intro = Column(String)
         detail = Column(String)
         session_type = Column(String)

     class Tokens(Base):
         __tablename__ = "tokens"
         id = Column(Integer, primary_key=True)
         token = Column(String)

     class InvertedIndexDB(Base):
         __tablename__ = "index"
         id = Column(Integer, primary_key=True)
         token = Column(String)
         postings_list = Column(String)
         docs_count = Column(Integer)
         token_count = Column(Integer)
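     The utility functions on the next slide call a get_session() helper that the deck does not show; a minimal SQLAlchemy sketch (the SQLite file name is my assumption) could look like this:

     from sqlalchemy import create_engine
     from sqlalchemy.orm import sessionmaker

     # Assumed setup: a local SQLite file; the real project may configure this differently.
     engine = create_engine("sqlite:///search.db")
     Base.metadata.create_all(engine)   # create the tables defined above
     Session = sessionmaker(bind=engine)

     def get_session():
         # Hand out a fresh session per call, as the utility functions expect.
         return Session()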
 22. Storage
     • Implement utility functions for the database operations
       • Add / fetch tokens
       • Add / fetch documents
       • Add / update / fetch inverted index entries

     def add_token(token: str) -> int:
         SESSION = get_session()
         token = Tokens(token=token)
         SESSION.add(token)
         SESSION.commit()
         token_id = token.id
         SESSION.close()
         return token_id

     def fetch_doc(doc_id):
         SESSION = get_session()
         doc = SESSION.query(Documents).filter(Documents.id == doc_id).first()
         SESSION.close()
         if doc:
             return doc
         else:
             return None
 23. Searcher
     • Takes a query and
       • parses the query (Parser / Analyzer)
       • obtains the matching doc_ids (Merger)
       • fetches the documents from storage (Fetcher)
       • sorts them by score (Sorter)

     def search_by_query(query):
         if not query:
             return []
         # parse
         parsed_query = tokenize(query)
         parsed_query = analyzed_query(parsed_query)
         rpn_tokens = parse_rpn(parsed_query)
         # merge
         doc_ids, query_postings = merge(rpn_tokens)
         # fetch
         docs = [fetch_doc(doc_id) for doc_id in doc_ids]
         # sort
         sorted_docs = sort(docs, query_postings)
         return [_parse_doc(doc) for doc, _ in sorted_docs]
 24. Reverse Polish notation (postfix notation)
     • A notation that places the operator after its operands (*)
     • e.g.
       • 「A + B」 → 「A B +」
       • 「(A + B) × (C + D)」 → 「A B + C D + ×」
       • 「python AND 検索」 → 「python 検索 AND」
     • Advantages
       • Evaluating (computing) an expression becomes simple
       • A single pass from the front is enough
     • * Wikipedia: https://ja.wikipedia.org/wiki/逆ポーランド記法
 25. Parser
     • Implement the infix-to-postfix conversion algorithm
       • Since it only converts infix notation to postfix, it has to handle just strings, the operators and parentheses
     • Split on whitespace and parentheses with a regular expression
       • e.g. ["A", "AND", "(", "B", "OR", "C", ")"]
     • Specify the operators to use, their precedence and their associativity
       • Precedence: AND > OR > NOT (as given in the OPERATER table)
       • Associativity: AND and OR are left-associative, NOT is right-associative
     • 「A AND (B OR C)」 → [A, B, C, OR, AND]

     import re

     REGEX_PATTERN = r"\s*(\d+|\w+|.)"
     SPLITTER = re.compile(REGEX_PATTERN)
     LEFT = True
     RIGHT = False
     OPERATER = {"AND": (3, LEFT), "OR": (2, LEFT), "NOT": (1, RIGHT)}

     def tokenize(text):
         return SPLITTER.findall(text)

     def parse_rpn(tokens: list):
         # conversion algorithm goes here
         ...
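     The deck leaves the conversion body out; a minimal shunting-yard-style sketch that matches the OPERATER table above (only one possible implementation) might look like this:

     def parse_rpn(tokens: list):
         # Convert infix boolean-query tokens to postfix (RPN).
         output, stack = [], []
         for token in tokens:
             if token in OPERATER:
                 prec, left_assoc = OPERATER[token]
                 while stack and stack[-1] in OPERATER:
                     top_prec, _ = OPERATER[stack[-1]]
                     if top_prec > prec or (left_assoc and top_prec == prec):
                         output.append(stack.pop())
                     else:
                         break
                 stack.append(token)
             elif token == "(":
                 stack.append(token)
             elif token == ")":
                 while stack and stack[-1] != "(":
                     output.append(stack.pop())
                 stack.pop()  # drop the "("
             else:
                 output.append(token)  # operand
         while stack:
             output.append(stack.pop())
         return output

     print(parse_rpn(["A", "AND", "(", "B", "OR", "C", ")"]))
     # ['A', 'B', 'C', 'OR', 'AND']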
  26. "OBMZ[FS w సஔΠϯσοΫε࡞੒࣌ʹࣙॻʹ࢖ͬ ͨUPLFOͰݕࡧ͢Δඞཁ͕͋Δ w *OEFY࣌ʹ࢖ͬͨ"OBMZ[FSΛͦͷ··࢖ͬ ͯ΋0, w +BQBOFTF5PLFOJ[FSͱ&OHMJTI5PLFOJ[FS ʹରԠ͢Δ

    w ݕࡧ࣌ʹ͸ӳ୯ޠͰݕࡧ w 8IJUFTQBDFUPLFOJ[FS͸ߟྀ͠ͳ͍ w ࠓճ͸෼ׂ͞ΕͨτʔΫϯΛ03ݕࡧ ͱͯ͠ѻ͏ w ʮػցֶश"/%QZUIPOʯ w ˠʮػց03ֶश"/%QZUIPOʯ w ٯϙʔϥϯυه๏ม׵લʹ"OBMZ[F͢ Δ def analyzed_query(parsed_query): return_val = [] for q in parsed_query: if q in OPRS: return_val.append(q) else: analyzed_q = JapaneseAnalyzer.analyze(q) if analyzed_q: tmp = " OR ".join(analyzed_q) return_val += tmp.split(" ") return return_val
 27. Evaluating reverse Polish notation
     • Evaluate the RPN expression that was produced
     • Procedure, e.g. for 「3 4 + 2 1 - ×」 (illustrative numbers):
       1. Operand: push 3 onto the stack                 stack [3]
       2. Operand: push 4 onto the stack                 stack [3, 4]
       3. Operator: pop the top two values               stack []
          compute 3 + 4 and push the result              stack [7]
       4. Operand: push 2                                stack [7, 2]
       5. Operand: push 1                                stack [7, 2, 1]
       6. Operator: pop the top two values               stack [7]
          compute 2 - 1 and push the result              stack [7, 1]
       7. Operator: pop the top two values               stack []
          compute 7 × 1 and push the result              stack [7]
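     A small sketch of this stack evaluation for arithmetic (my own illustration, not the deck's code; the search engine applies the same idea to postings lists on the next slide):

     def eval_rpn(tokens):
         # Operand -> push; operator -> pop two, compute, push the result.
         ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
         stack = []
         for token in tokens:
             if token in ops:
                 b = stack.pop()
                 a = stack.pop()
                 stack.append(ops[token](a, b))
             else:
                 stack.append(int(token))
         return stack[0]

     print(eval_rpn(["3", "4", "+", "2", "1", "-", "*"]))  # (3 + 4) * (2 - 1) = 7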
 28. Merge
     • Implemented exactly like the RPN evaluation procedure
     • If the token is an operand, fetch its postings list and push it onto the stack
     • If the token is an operator, pop the top two entries from the stack, merge them, and push the result
     • The postings lists are also kept in a dictionary for the score calculation

     def merge(tokens: list):
         target_posting = {}
         stack = []
         for token in tokens:
             if token not in OPRS:
                 token_id = get_token_id(token)
                 postings_list = fetch_postings_list(token_id)
                 # keep for the score calculation
                 target_posting[token] = postings_list
                 # push only the doc_ids onto the stack
                 doc_ids = set([p[0] for p in postings_list])
                 stack.append(doc_ids)
             # the token is an operator
             else:
                 if not stack:
                     raise ValueError("invalid query")
                 if len(stack) == 1:
                     # only NOT may be evaluated with a single operand
                     if token == "NOT":
                         # NOT handling (see the next slide)
                         return not_doc_ids, {}
                     else:
                         raise ValueError("invalid query")
                 doc_ids1 = stack.pop()
                 doc_ids2 = stack.pop()
                 stack.append(merge_posting(token, doc_ids1, doc_ids2))
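     merge_posting itself is not shown in the deck; a minimal sketch using set operations on the doc_id sets (AND = intersection, OR = union) could be:

     def merge_posting(operator: str, doc_ids1: set, doc_ids2: set) -> set:
         # Combine two sets of document ids according to the boolean operator.
         # (Unary NOT is handled separately, as shown on the next slide.)
         if operator == "AND":
             return doc_ids1 & doc_ids2
         if operator == "OR":
             return doc_ids1 | doc_ids2
         raise ValueError("unknown operator: {}".format(operator))

     print(merge_posting("AND", {1, 2, 3}, {2, 3, 4}))  # {2, 3}
     print(merge_posting("OR", {1, 2}, {2, 4}))         # {1, 2, 4}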
 29. Merge
     • Supporting 「NOT hoge」
     • NOT is the only operator allowed to be evaluated with something other than two operands
     • When the token is an operator, decide based on the stack size and the token itself

     # support for "NOT hoge"
     if len(stack) == 1:
         # only NOT is allowed here
         if token == "NOT":
             # NOT handling: every document id that is not in doc_ids
             doc_ids = stack.pop()
             not_doc_ids = fetch_not_docs_id(doc_ids)
             return not_doc_ids, {}
         else:
             raise ValueError("invalid query")
 30. Merge
     • Handling queries where the user writes no operator
       • 「python 機械学習」
         → 「python OR 機械学習」 (option 1)
         → 「python AND 機械学習」 (option 2)
       • OR is adopted here
     • When no operator is written, the number of operands and evaluations no longer match
       • Even after evaluating to the end, two or more entries remain on the stack
       • After all tokens have been evaluated, keep OR-merging until the stack has a single entry

     for token in tokens:
         # evaluation as before
         ...
     while len(stack) != 1:
         doc_ids1 = stack.pop()
         doc_ids2 = stack.pop()
         stack.append(merge_posting("OR", doc_ids1, doc_ids2))
 31. Scoring
     • TF-IDF
     • TF (term frequency)
       • The proportion of the word within a document
       • TF = (occurrences of word t in the document) / (number of words T in the document)
     • IDF (inverse document frequency)
       • How rare the word is across all documents
       • IDF = log( (total number of documents A) / (number of documents containing the word D) )
     • A word that appears often in a document but rarely in other documents scores high
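     A tiny worked example (the numbers are mine, following the formula used in the Sorter code later: idf = log10(A / D) + 1):

     import math

     # Assume 100 documents in total (A); the term appears in 10 of them (D)
     # and occurs 3 times in a document of 60 tokens (t = 3, T = 60).
     tf = 3 / 60                        # 0.05
     idf = math.log10(100 / 10) + 1     # 1 + 1 = 2.0
     print(tf * idf)                    # 0.1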
 32. Scoring
     • IDF in this search engine
       • If you search with a single keyword, the IDF is the same for every matching document
         → TF alone would seem to be enough
       • However, the picture changes once you search with AND / OR
       • Think of IDF as a weighting that lets you compare documents against multiple query terms
     • The TF-IDF value of document α for the query 「A AND B」
       • TF-IDF of document α = TF-IDF for query term A + TF-IDF for query term B
 33. Sorter
     • The values used for the score calculation are mostly ones already collected at index time
       • Occurrences of the search token in the document (t): kept in the posting
       • Number of tokens in the document (T): saved together with the document (doc.token_count)
       • Number of documents containing the search token (D): the length of t's postings list
       • Total number of documents (A): a count over the Documents table

     import math

     def sort(doc_ids, query_postings):
         docs = []
         all_docs = count_all_docs()
         for doc_id in doc_ids:
             doc = fetch_doc(doc_id)
             doc_tfidf = 0
             for token, postings_list in query_postings.items():
                 idf = math.log10(all_docs / len(postings_list)) + 1
                 posting = [p for p in postings_list if p[0] == doc.id]
                 if posting:
                     tf = round(posting[0][1] / doc.token_count, 2)
                 else:
                     tf = 0
                 token_tfidf = tf * idf
                 doc_tfidf += token_tfidf
             docs.append((doc, doc_tfidf))
         return sorted(docs, key=lambda x: x[1], reverse=True)
 34. Deployment tips
     • Give the process plenty of memory
       • Performance improves
       • Janome needs several hundred MB of memory when pip builds it (*)
     • Heroku is tough
       • free / hobby dynos: ✕
       • Standard is probably ✕ as well
     • EC2 micro instances: ✕
     • EC2 small instances: ○
     • * https://mocobeta.github.io/janome/
 35. Possible improvements
     • Faster indexing and search
       • Compress the postings lists
       • Use more efficient algorithms and data structures
     • Dynamic index construction
     • Query expansion
     • Phrase search
     • Multi-field support
     • More operators
     • Implement the storage layer (file system)
     • Vector search
     • Distribution across machines
     • etc.
 36. References for further improvement
     • Books
       • 「検索エンジン自作入門」 (Gijutsu-Hyohron-sha)
       • 「情報検索の基礎」 (Introduction to Information Retrieval, Kyoritsu Shuppan)
       • 「高速文字列解析の世界」 (Iwanami Shoten)
       • "Deep Learning for Search" (Manning)
     • OSS
       • Whoosh: https://bitbucket.org/mchaput/whoosh/src/default/docs/source/intro.rst
       • Elasticsearch: https://github.com/elastic/elasticsearch
       • Apache Lucene: http://lucene.apache.org/core/documentation.html