Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Hybrid Keyword Extraction Automated by Python W...

tetsuya0617
September 15, 2019

Hybrid Keyword Extraction Automated by Python With Cloud Speech To Text API and Video API

This is my poster in Pycon Japan 2019 🇯🇵 (https://pycon.jp/2019/sessions?category=poster)

tetsuya0617

September 15, 2019
Tweet

More Decks by tetsuya0617

Other Decks in Technology

Transcript

  1. Hybrid Keyword Extraction Automated by Python With Cloud Speech To

    Text API and Video API Tetsuya Hirata Background Search indexes included only title names of each video content Zero hit keywords are 240 (29.2%) Hit keywords are 583 (70.8%) Extracted automatically keywords from speech and texts in the videos with Cloud Speech To Text API and Video API Reading comprehension (Classical Japanese/Chinese Classics/Contemporary Japanese) Being able to search the name of character, works, grammar ex)ʮປ૲ࢠʯʮ࢙هʯʮେೲݴʯʮٯઆʯ Mathematics Being able to search mathematical symbols ex) ʮΞϧϑΝʯʮύΠʯʮαΠϯίαΠϯʯ ʮྦྷ৐ʯ English Being able to search the name of grammar and english vocabularies ex) ʮҙࢤະདྷʯʮinʯʮalsoʯ The code is automized by python and dockerized Needed to write the extra code to access to Google Cloud Storage(GCS) in order to load movie and audio files, and upload the results for backup log on GCS. → Abstracts Bigquery and Google Cloud Storage client functionality for easy access Solution to improve search indexes Evaluation Data Set: Movie Files(mp4) about mathematics, reading comprehension (Classical Japanese, Chinese Classics, and Contemporary Japanese), and English Targeted Keywords: Zero hit keywords searched by more than five unique users from 2018/04/01 to 2019/06/14. How to evaluate the keywordsɿ Looked at whether the number of hits improved before and after the new keywords extracted from speech and text in the movies, and checked if the keywords are related to each subject or not. Results Future Works Procedures for extracting keywords from movie contents tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. (Rajaraman, A.; Ullman, J.D. (2011). "Data Mining”. Mining of Massive Datasets. pp. 1–17.) The number of a term / a document A total sum of number of terms / a document A total number of documents The number of document frequencies which a term occurs Term frequency is a term that occurs in a document. Inverse document frequency is an inverse function of the number of documents in which it occurs. Architecture with Cloud Speech Text Video API The N of keys which hit movie contents: 247 keywords The N of keywords which improved hits: 162 / 247 keywords Of the keywords, 130/162 (80%) results in related movie contents . Outlier value: 32 keywords The N of keywords which improved zero hits: 66 / 162 keywords Of the keywords, 52/66 (79%) results in related movie contents. Outlier value: 14 keywords Qualitative results Quantitive results Improve code related to Input / Output Improve stop-words and dictionary - Cut noise words and automize the cycles to add stop- words - Exploratory search keywords not related to movie contents and add them to dictionary - Improve keyword list registered in database by hearing from creators of the movies or experts in pedagogical domain (t: term d: document f: frequency d: document) MeCab TF-IDF Video API Cloud Speech To Text API GCS Excel Elastic Search .csv.gzip “… ҰਓҰਓͷऑ఺ʹ߹Θͤͯಈըͱ໰୊Λ͓קΊ͢Δͷ͕Ϋϥογʔֶशಈըੜె͸ࣗ਎ͷֶश ঢ়گʹ߹Θͤͯͭ·͍͍ͣͯΔ෼໺·ͰḪͬͯͷֶश΍ϨϕϧΞοϓΛ͢Δ͜ͱ͕Ͱ͖·͢ϕωο ηͷςετͷडݧ݁Ռ͔Βࠃޠ਺ֶӳޠʹ͓͚Δੜె͕ͨͪࠓऔΓ૊Ή΂ֶ͖शίϯςϯπΛ͓ ͢͢Ί͠·͢·ֶͨशϚοϓͰ͸ੜెҰਓͻͱΓͷաڈʹड͚ͨςετ݁ՌΛ౿·֤͑ڭՊͷ୯ ݩ͝ͱʹֶश౸ୡκʔϯΛදࣔۤखॱʹฒΜͰ͍·͢ͷͰੜెͷۤख෼໺ΛҰ໨Ͱ೺Ѳ͢Δ͜ͱ͕ Ͱ͖·͢…” (https://www.youtube.com/watch?v=XjrBkhaV5Aw) 1.6016 ͩ͜ͱʹͳΔͩΖ͏ʯ<ܦݧ>Λද͢ະདྷ 1.6016 am reading 0.906650007 1.6016 ʮ΋͠΋͏Ұ౓ಡΜͩΒ͜ͷຊΛ3ճಡΜ 1 1.6016 ׬ྃͰ ʮ(ͦͷ࣌·Ͱʹ)…ͨ͜͠ͱʹͳ 1.6016 ಈࢺͷ੍࣌<Ϩϕϧ6> 0.95717603 1.7017 will have read 0.993844032 (https://www.youtube.com/watch?v=XjrBkhaV5Aw) 1. Input audio files (.flac) and movie files into Cloud Speech To Text API and Video API 2. Pre-process output data with MeCab and calculate TF-IDF 3. Extract keywords and upload them on GCS 4. Add keywords list to excel file which have master info about movies 5. The excel file is stored in S3 and the keywords are registered in elastic search. Steps to extract and register keywords Tokenizers (MeCab) """ All subjects """ CONTENT_WORD_POS = ('໊ࢺ') EXCLUDE_WORD_POS = ('ඇཱࣗ', '୅໊ࢺ') EXCLUDE_WORD = ('͊', '͌', '͎', '͐', '͒', 'Ό', 'Ύ', 'ΐ', ‘Η’) """ Contemporary Japanese """ INCLUDE_WORD_POS = ('ݻ༗໊ࢺ') Remove noise with regular expression All subjects: ‘[Ớʀ”ʨʩʔʊˏɾʁʛʹʆˇʻʼʮʯɺ]' English: ‘[0-9̌-̕_!-/:-@¥[-`{-~]' Classical Japanese/Chinese Classics: '[Ν-ϰa-zA-Z0-9̌-̕Α-ωА-яᴷ-ᵹⅠ-Ⅹᶃ-ᶖ_!-/:-@¥[-`{-~]'
 Others : ’[a-zA-Z0-9̌-̕Α-ωА-яᴷ-ᵹⅠ-Ⅹᶃ-ᶖ_!-/:-@¥[-`{-~]' PyPI: https://pypi.org/project/gcp-accessor/0.0.1/