
Machine Learning and Natural Language Processing on Treasure CDP

PLAZMA TD Internal Day: TD Tech Talk 2018: https://techplay.jp/event/650390

Video: https://youtu.be/RzQT_9jcrx8?t=2h4m17s

Takuya Kitazawa

February 19, 2018

Transcript

  1. Machine Learning and Natural Language Processing on Treasure CDP

    Takuya Kitazawa (@takuti), Data Science Engineer at Treasure Data, Inc. and Apache Hivemall Committer
  2. Word-based customer tagging and categorization (2017)

    Store customers' browsing log from TD JavaScript SDK
    STEP 1: Extract keywords from each article
    STEP 2: Aggregate customers' visits as td_interest_words and td_affinity_categories
    Create audience

    Example browsing log record:
      td_client_id    XXX-YYY-ZZZZZ
      td_title        Today's news
      td_description  The Olympic game has been started …
      td_host         www.td-news.com
      td_path         /2017/10/01/olympic

    Resulting customer attributes:
      td_client_id            XXX-YYY-ZZZZZ
      td_interest_words       Olympic, baseball, game
      td_affinity_categories  Sports, Entertainment

    Keywords shown in the diagram: Olympic, game, medal, president, citizen, rule, law, data, cloud, CDP, politics, law, US, nation, equation, math, curry, rice, history
    Categories shown in the diagram: Society; Science; Food, Culture
  3. ML-related capability (1/2)

    Classification: Soft Confidence-Weighted, Random Forest, Logistic Regression, …
      ‣ Binary: "Likely to buy our product?" "Is this email spam?"
      ‣ Multi-class: "Will it be sunny, cloudy, or rainy?" "Which group does this user belong to?"
    Regression: Random Forest, AdaDelta, Factorization Machines, …
      ‣ "Tomorrow's temperature" "Estimated product sales in next month" "This user's annual income"
    Recommendation: Matrix Factorization, Factorization Machines, …
      ‣ "Customers who bought this also bought …"
    Anomaly Detection: Local Outlier Detection, ChangeFinder, …
      ‣ "Suddenly increased # of visitors on our web site"
  4. ML-related capability (2/2)

    Natural Language Processing: sentence tokenization, finding the singular form of an English word, …
      ‣ select tokenize('Hello, world!') returns ["Hello", "world"]
      ‣ select singularize('apples') returns apple
    Clustering: Latent Dirichlet Allocation, Probabilistic Latent Semantic Analysis
      ‣ "Which articles are similar to this one?"
    Geospatial Functions
      ‣ "I'd like to see a map around a specific pair of latitude and longitude"
  5. Use case: ML-based customer segmentation at OISIX

    1. Predict probability of churn
    2. Aggressively reach out to "likely to churn" customers
    https://www.slideshare.net/TaisukeFukawa/hivemall-meetup-vol2-oisix

    OISIX's data (diagram labels): Web, Mobile, Customer attr., Behavior on web, Complaint log, Source, Signed-up services, Actions (direct), Actions (indirect), Point, Call, Guide to success, UI
  6. Real-world ML workflow

    Problem: what you want to "predict"
    Hypothesis & Proposal
    Historical data
    Cleanse data: Extract, Filter, Interpolate, Normalize, … (Which columns should we use?)
    Build machine learning model: Train data → Get features → Train → …
    Evaluate: Test data → Get features → Predict → … → Accuracy
    Sufficient accuracy? → Ship to production
    Each step in the diagram is labeled "Query".
  7. Digdag…!

    Cleanse data: Extract, Filter, Interpolate, Normalize, …
    Build machine learning model: Train data → Get features → Train → …
    Evaluate: Test data → Get features → Predict → … → Accuracy
    Each step in the diagram is labeled "Query".
  8.

    +preprocess:
      _parallel: true

      +train:
        td>: ../queries/preprocess_train.sql
        create_table: train

      +test:
        td>: ../queries/preprocess_test.sql
        create_table: test

    +logress_train:
      td>: queries/logress_train.sql
      create_table: logress_model

    +compute_downsampling_rate:
      td>: queries/downsampling_rate.sql
      engine: presto
      store_last_results: true

    +logress_predict:
      td>: queries/logress_predict.sql
      create_table: prediction

    +evaluate:
      td>: queries/evaluate.sql
      store_last_results: true

    +show_accuracy:
      echo>: "Logloss (smaller is better): ${td.last_results.logloss}"
  9. A Customer Data Platform is a marketer-controlled integrated customer database that can support coordinated programs across multiple channels.

    Treasure CDP (diagram labels):
      Data Collection
      Workflow, Query, Reporting, Data Warehouse, Machine Learning
      ID Unification, Segmentation, Syndication
      Campaign Execution
  10. "customer" = attributes + behaviors on CDP application

    cdp_customer_id: "aaa-bbb-cccc"

    Attributes:
      Age      24
      Sex      Man
      Email    [email protected]
      Address  Nakano, Tokyo, Japan
      …        …

    Behaviors:
      Time        Host       Path    Browser  …
      1514899923  takuti.me  /about  Chrome   …
      1517305451  takuti.me  /       Safari   …
      1518765966  takuti.me  /note   Chrome   …
      …           …          …       …        …

      Time        Item ID  Referrer      OS       …
      1513080070  XXX      twitter.com   macOS    …
      1515488949  YYY      google.com    iOS      …
      1518766618  ZZZ      facebook.com  Android  …
      …           …        …             …        …
  11. 1. Word-based customer tagging and categorization for Japanese and English

    Store customers' browsing log from TD JavaScript SDK
    STEP 1: Extract keywords from each article
    STEP 2: Aggregate customers' visits as td_interest_words and td_affinity_categories
    Create audience

    Example browsing log record:
      td_client_id    XXX-YYY-ZZZZZ
      td_title        Today's news
      td_description  The Olympic game has been started …
      td_host         www.td-news.com
      td_path         /2017/10/01/olympic

    Resulting customer attributes:
      td_client_id            XXX-YYY-ZZZZZ
      td_interest_words       Olympic, baseball, game
      td_affinity_categories  Sports, Entertainment

    Keywords shown in the diagram: Olympic, game, medal, president, citizen, rule, law, data, cloud, CDP, politics, law, US, nation, equation, math, curry, rice, history
    Categories shown in the diagram: Society; Science; Food, Culture
  12. Challenges

    Short input texts and wide-ranging content types depending on data
    Unsupervised customer categorization with fewer false positives
    Tokenizing new words
    Non-ML (!), deterministic customer profiling based on Wikipedia mining and TF-IDF weighting
  13. Digdag workflow built by API: Preprocess

    SELECT
      ${join_column_name},
      concat(td_host, td_path) AS article_id,
      concat(
        -- remove site name which commonly occurs at the foot of page title
        regexp_replace(
          -- "(xxx)" is generally meaningless, accessory part of page title
          regexp_replace(td_title, '[(（].+?[)）]', ''),
          '[|-] .+$', ''
        ),
        ' ',
        coalesce(td_description, '')
      ) AS content
    FROM
      ${behavior}
    WHERE
      td_title IS NOT NULL
      AND TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-90d'))
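The title cleanup in the query above can be tried outside of SQL. The following is a minimal Python sketch, not the production code: the function name `build_content` is hypothetical, and the full-width parentheses in the first pattern are an assumption about what the slide's character class contained.

```python
import re


def build_content(td_title, td_description=None):
    """Mimic the `content` column: cleaned title + ' ' + description."""
    # Drop parenthesised fragments such as "(xxx)". Full-width parentheses
    # are included as an assumption about the original character class.
    title = re.sub(r"[(（].+?[)）]", "", td_title)
    # Drop a trailing site name that follows "| " or "- ".
    title = re.sub(r"[|-] .+$", "", title)
    # Equivalent of coalesce(td_description, '').
    return title + " " + (td_description or "")


content = build_content("Today's news (video) | TD News",
                        "The Olympic game has been started")
```

Leftover whitespace is not trimmed here, mirroring the query; a later tokenization step would typically ignore it.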
  14. Digdag workflow built by API: Tokenize (Japanese)

    SELECT
      article_id,
      word
    FROM
      article t1
    LATERAL VIEW explode(
      tokenize_ja(
        normalize_unicode(content, 'NFKC'),
        "normal",
        array("a","about","above","across","after","again",...),
        array("副詞","助詞","動詞","記号","名詞-数","副詞-一般","助詞-特殊","動詞-接尾",...),
        "https://s3.amazonaws.com/td-cdp-tagging/stable/kuromoji-user-dict-neologd.csv.gz"
      )
    ) t2 AS word
    WHERE
      length(word) >= 2
      AND word RLIKE '^[͊-ΜʔΝ-ϲʔҰ-ᴱa-zA-Z̰-͉̖-̯ɾʂʁ]+$' -- acceptable characters
      -- even if word consists of acceptable characters, reject "len-2 non-kanji word" and "len-3 hiragana-only word"
      AND word NOT RLIKE '^([^Ұ-ᴱ]{1,2}|[͊-Μʔ]{1,3})$'
  15. Digdag workflow built by API: TF-IDF weighting and keyword extraction

    takuti.me/note/tf-idf

    article_keyword AS (
      SELECT
        tf.article_id,
        tf.word,
        tfidf(tf.freq, df.cnt, ${td.last_results.n_article}) AS tfidf
      FROM
        tf
      JOIN
        df ON tf.word = df.word
      WHERE
        df.cnt >= 2
        AND df.cnt <= ${Math.max(100000, td.last_results.n_article / 2)} -- ignore too common words
    )
    SELECT
      each_top_k(20, article_id, tfidf, article_id, word) AS (rank, score, article_id, word)
    FROM (
      SELECT article_id, word, tfidf
      FROM article_keyword
      CLUSTER BY article_id
    ) t
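To make the keyword extraction step concrete, here is a minimal Python sketch of TF-IDF weighting followed by a per-article top-k. It is an illustration under assumptions, not the query's exact behavior: `top_keywords` is a hypothetical name, the formula `tf * (log10(N / df) + 1)` is one common TF-IDF variant that may differ from the `tfidf` function used above, and the upper bound on document frequency is omitted.

```python
import math
from collections import Counter


def top_keywords(articles, k=20, min_df=2):
    """articles: dict of article_id -> list of tokens.

    Returns article_id -> list of (word, tfidf), best first.
    """
    n_articles = len(articles)
    # Document frequency: number of articles containing each word.
    df = Counter(w for tokens in articles.values() for w in set(tokens))
    result = {}
    for article_id, tokens in articles.items():
        scored = []
        for word, cnt in Counter(tokens).items():
            if df[word] < min_df:
                continue  # mirrors `df.cnt >= 2` in the query
            tf = cnt / len(tokens)  # relative term frequency
            idf = math.log10(n_articles / max(1, df[word])) + 1.0
            scored.append((word, tf * idf))
        # Highest score first; ties broken alphabetically for determinism.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        result[article_id] = scored[:k]
    return result


keywords = top_keywords({
    "a": ["olympic", "game", "game", "news", "medal"],
    "b": ["game", "law", "news"],
    "c": ["law", "news", "olympic"],
})
```

In this toy corpus "medal" appears in only one article and is dropped, while "game" ranks first for article "a" because it is both frequent there and absent from article "c".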
  16. Aggregate over customers' behaviors

    STEP 1, STEP 2
    Pipeline: JOIN → sum() → l1_normalize() → each_top_k() → td_interest_words
    Next: Map words into categories → td_affinity_categories

    Keywords shown in the diagram: Olympic, game, medal, president, citizen, rule, law, data, cloud, CDP, politics, law, US, nation, equation, math, curry, rice, history
    Categories shown in the diagram: Society; Science; Food, Culture
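The sum, L1-normalize, top-k sequence named on this slide can be sketched in plain Python. This is a hedged illustration: `interest_words` and its input shapes are hypothetical, and it only imitates what `sum()`, `l1_normalize()` and `each_top_k()` are assumed to do in the real workflow.

```python
from collections import defaultdict


def interest_words(visits, article_keywords, k=3):
    """visits: list of (customer_id, article_id).
    article_keywords: article_id -> {word: score}.

    Returns customer_id -> list of (word, normalized score), best first.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for customer_id, article_id in visits:
        for word, score in article_keywords.get(article_id, {}).items():
            totals[customer_id][word] += score  # sum() per customer and word
    result = {}
    for customer_id, scores in totals.items():
        norm = sum(abs(s) for s in scores.values()) or 1.0
        normalized = {w: s / norm for w, s in scores.items()}  # L1 normalization
        ranked = sorted(normalized.items(), key=lambda p: (-p[1], p[0]))
        result[customer_id] = ranked[:k]  # top-k per customer
    return result


tags = interest_words(
    visits=[("c1", "x"), ("c1", "y"), ("c2", "y")],
    article_keywords={
        "x": {"olympic": 0.6, "game": 0.4},
        "y": {"game": 0.5, "law": 0.5},
    },
    k=2,
)
```

Normalizing per customer keeps heavy and light visitors comparable: each customer's word scores sum to 1 before the top-k cut.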
  17. Map words into IAB categories in relational schema

    support.aerserv.com/hc/en-us/articles/207148516-List-of-IAB-Categories

    td_interest_words:
      cdp_customer_id  word      score (TF-IDF)
      aaa-bbb-cccc     politics  0.3
      aaa-bbb-cccc     law       0.2
      …                …         …
      ddd-eee-ffff     math      0.7
      …                …         …
      xxx-yyy-zzzz     history   0.4

    JOIN

    Mapping table:
      word      category                     probability
      anime     IAB1 Arts & Entertainment    0.4
      anime     IAB5 Education               0.1
      anime     IAB9 Hobbies & Interests     0.5
      politics  IAB11 Law, Gov't & Politics  0.8
      …         …                            …
      coffee    IAB8 Food & Drink            0.9
  18. Join "inverted" mapping table

    td_interest_words:
      cdp_customer_id  word      score (TF-IDF)
      aaa-bbb-cccc     politics  0.3
      aaa-bbb-cccc     law       0.2
      …                …         …
      ddd-eee-ffff     math      0.7
      …                …         …
      xxx-yyy-zzzz     history   0.4

    Mapping table:
      word      category:probability
      anime     [ IAB1:0.4, IAB5:0.1, IAB9:0.5 ]
      politics  [ IAB11:0.8, … ]
      …         …
      coffee    [ IAB8:0.9, … ]

    SELECT sum(score * probability) GROUP BY cdp_customer_id, category
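The final aggregation, `sum(score * probability)` grouped by customer and category, can be illustrated with a small Python sketch. The function name `affinity_categories` and the in-memory data shapes are hypothetical; the "anime" score of 0.2 is made up for the example, while the mapping probabilities are the ones shown on the slide.

```python
from collections import defaultdict


def affinity_categories(interest, mapping):
    """interest: list of (customer_id, word, score).
    mapping: word -> {category: probability} (the "inverted" table).

    Returns {(customer_id, category): sum(score * probability)}.
    """
    totals = defaultdict(float)
    for customer_id, word, score in interest:
        # Words missing from the mapping table contribute nothing,
        # like rows dropped by an inner join.
        for category, probability in mapping.get(word, {}).items():
            totals[(customer_id, category)] += score * probability
    return dict(totals)


scores = affinity_categories(
    interest=[
        ("aaa-bbb-cccc", "politics", 0.3),
        ("aaa-bbb-cccc", "anime", 0.2),  # hypothetical score for illustration
    ],
    mapping={
        "anime": {"IAB1": 0.4, "IAB5": 0.1, "IAB9": 0.5},
        "politics": {"IAB11": 0.8},
    },
)
```

Here the customer's strongest category is IAB11 (0.3 × 0.8 = 0.24), followed by IAB9 (0.2 × 0.5 = 0.10).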
  19. Create mapping table from Wikipedia dump

    Corpus: <word, score> pairs of articles

    Mapping table:
      word      category                     probability
      anime     IAB1 Arts & Entertainment    0.4
      anime     IAB5 Education               0.1
      anime     IAB9 Hobbies & Interests     0.5
      politics  IAB11 Law, Gov't & Politics  0.8
      …         …                            …
      coffee    IAB8 Food & Drink            0.9
  20. Create mapping table from Wikipedia dump

    Mapping table:
      word      category                     probability
      anime     IAB1 Arts & Entertainment    0.4
      anime     IAB5 Education               0.1
      anime     IAB9 Hobbies & Interests     0.5
      politics  IAB11 Law, Gov't & Politics  0.8
      …         …                            …
      coffee    IAB8 Food & Drink            0.9

    IAB category to Wikipedia category:
      IAB category                   Wikipedia category (English)  Wikipedia category (Japanese)
      IAB1 Arts & Entertainment      Entertainment                 娯楽
      IAB2 Automotive                Automobilities                自動車
      …                              …                             …
      IAB23 Religion & Spirituality  Religion                      宗教

    Find related articles from root category (for example, Entertainment → … → …): github.com/takuti/fastcat
  21. Create mapping table from Wikipedia dump

    Mapping table:
      word      category                     probability
      anime     IAB1 Arts & Entertainment    0.4
      anime     IAB5 Education               0.1
      anime     IAB9 Hobbies & Interests     0.5
      politics  IAB11 Law, Gov't & Politics  0.8
      …         …                            …
      coffee    IAB8 Food & Drink            0.9

    IAB category to Wikipedia category:
      IAB category                   Wikipedia category (English)  Wikipedia category (Japanese)
      IAB1 Arts & Entertainment      Entertainment                 娯楽
      IAB2 Automotive                Automobilities                自動車
      …                              …                             …
      IAB23 Religion & Spirituality  Religion                      宗教

    Corpus: <word, score>
    1) Aggregate word scores per category
    2) Normalize them per word
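The two steps above (aggregate word scores per category, then normalize per word) can be sketched as follows. This is an assumption-laden illustration rather than the actual pipeline: `build_mapping` is a hypothetical name, and the corpus is assumed to be already grouped by IAB category with one `{word: score}` dictionary per Wikipedia article.

```python
from collections import defaultdict


def build_mapping(category_corpus):
    """category_corpus: category -> list of {word: score} dicts (one per article).

    Returns word -> {category: probability}, where the probabilities of
    each word sum to 1 across categories.
    """
    per_word = defaultdict(lambda: defaultdict(float))
    # 1) Aggregate word scores per category.
    for category, articles in category_corpus.items():
        for keywords in articles:
            for word, score in keywords.items():
                per_word[word][category] += score
    # 2) Normalize them per word.
    mapping = {}
    for word, by_category in per_word.items():
        total = sum(by_category.values())
        if total > 0:
            mapping[word] = {c: s / total for c, s in by_category.items()}
    return mapping


mapping = build_mapping({
    "IAB1": [{"anime": 0.4, "film": 0.6}],
    "IAB5": [{"anime": 0.1}],
    "IAB9": [{"anime": 0.5}],
})
```

A word that occurs under a single category, such as "film" here, maps to that category with probability 1.0, while "anime" is spread over three categories.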
  22. Put sub categories in parallel, and filter out unconfident ones

      cdp_customer_id  td_affinity_main_categories  td_affinity_sub_categories
      aaa-bbb-cccc     [ IAB11, IAB23 ]             [ IAB2-4, IAB11-1, IAB12-3 ]
      ddd-eee-ffff     [ IAB9, IAB15 ]              [ IAB8-3, IAB20-8 ]
      …                …                            …
      xxx-yyy-zzzz     [ IAB14 ]                    [ IAB14-1, IAB14-3, IAB19-7 ]
  23. Challenges

    Guessing feature representation along with detecting "categorical" and "quantitative" columns to apply min-max normalization
    Calibrating number of positive/negative samples for differently sized data
    Providing enough information to refine features and prevent "leakage" even for non-ML experts
  24. Guess feature representation API

    For sampled values:
      ‣ Column name, type
      ‣ Cardinality
      ‣ Mean, variance, percentile
      ‣ Regular expression
      ‣ …
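As a rough illustration of guessing a column's representation from sampled values, the sketch below uses only type and cardinality, then applies min-max normalization to a quantitative column. Everything here is hypothetical: the function names, the `max_categories` threshold, and the decision rule are assumptions, and the actual API also looks at statistics and regular expressions as listed above.

```python
def guess_representation(values, max_categories=20):
    """Guess "quantitative" or "categorical" from sampled column values."""
    non_null = [v for v in values if v is not None]
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in non_null
    )
    cardinality = len(set(non_null))
    # Numeric columns with many distinct values are treated as quantitative;
    # everything else (strings, low-cardinality codes) as categorical.
    if numeric and cardinality > max_categories:
        return "quantitative"
    return "categorical"


def min_max_normalize(values):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


kind_numeric = guess_representation(list(range(100)))
kind_browser = guess_representation(["Chrome", "Safari", "Chrome", None])
scaled = min_max_normalize([1.0, 2.0, 3.0])
```

A cardinality threshold is a blunt heuristic: an integer-coded category with many levels would be misread as quantitative, which is one reason richer signals such as column names are useful.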
  25. Calibrating # of samples: Over-sample minor class

    takuti.me/note/adjusting-for-oversampling-and-undersampling

    WITH label2cnt AS (
      SELECT map_agg(label, cnt) AS kv
      FROM (
        SELECT label, CAST(COUNT(1) AS double) AS cnt
        FROM cdp_tmp_${model_table_name}_samples_${scope}
        GROUP BY label
      ) t
    )
    SELECT
      -- If % of minor samples is very small (less than 0.1%),
      -- amplify them so that at least 1% of samples are occupied by the minors.
      IF(kv[1] / kv[0] < 0.001, -- % of positive samples is less than 0.1%
         cast(floor(0.01 / (kv[1] / kv[0])) AS integer), 1) AS pos_oversample_rate,
      IF(kv[0] / kv[1] < 0.001, -- % of negative samples is less than 0.1%
         cast(floor(0.01 / (kv[0] / kv[1])) AS integer), 1) AS neg_oversample_rate,
      -- Amplify very small data regardless of its label, because tiny dataset
      -- possibly shows poor accuracy.
      IF(${td.last_results.num_samples} > 100000, 1, 10) AS all_oversample_rate
    FROM
      label2cnt

    Diagram labels: Negative samples, Positive samples
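The rate calculation in the query above can be checked with concrete numbers. This Python sketch mirrors the same thresholds (0.1%, 1%, 100,000 rows); the function name `oversample_rates` is hypothetical, and it assumes label 0 means negative, label 1 means positive, and both classes are non-empty.

```python
import math


def oversample_rates(n_neg, n_pos, num_samples):
    """Return (pos_oversample_rate, neg_oversample_rate, all_oversample_rate)."""

    def rate(minor, major):
        # Like the query, this compares the ratio between the two classes,
        # not the share of the whole dataset.
        ratio = minor / major
        if ratio < 0.001:  # minority below 0.1% of the other class
            return int(math.floor(0.01 / ratio))  # amplify toward roughly 1%
        return 1

    pos_rate = rate(n_pos, n_neg)
    neg_rate = rate(n_neg, n_pos)
    # Amplify very small datasets regardless of label.
    all_rate = 1 if num_samples > 100000 else 10
    return pos_rate, neg_rate, all_rate


imbalanced = oversample_rates(n_neg=1_000_000, n_pos=300, num_samples=1_000_300)
tiny = oversample_rates(n_neg=50, n_pos=50, num_samples=100)
```

With 300 positives against 1,000,000 negatives the ratio is 0.03%, so positives are repeated 33 times, which brings them to about 1% of the negatives.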
  26. To refine predictive model and prevent leakage: Show evaluation results and feature importance

    Audience Segment: split 80% / 20% into Train / Test
    Train → Model for validation → Predict on Test → Accuracy (AUC, LogLoss)
    Model for production
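A minimal Python sketch of the validation loop's two ingredients follows: an 80/20 split and the LogLoss metric. The helper names `split_80_20` and `logloss` are hypothetical, the split is a simple seeded shuffle (the real system may sample differently), and AUC is left out for brevity.

```python
import math
import random


def split_80_20(rows, seed=0):
    """Shuffle deterministically and split into (train, test) at 80%."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


def logloss(labels, probabilities, eps=1e-15):
    """Mean negative log-likelihood of binary labels (smaller is better)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


train, test = split_80_20(range(10))
confident = logloss([1, 0], [0.9, 0.1])
hesitant = logloss([1, 0], [0.6, 0.4])
```

A model that always predicts 0.5 scores ln 2 ≈ 0.693, which is a handy baseline when reading the reported LogLoss.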
  27. Overview: How predictive customer scoring works

    Example customer record:
      td_client_id            XXX-YYY-ZZZZZ
      td_ip                   192.168.0.1
      td_referrer             http://google.com/…
      spend_time              1.5
      …                       …
      td_interest_words       Olympic, baseball, game
      td_affinity_categories  Sports, Entertainment

    GUESS: Automatically select and transform customer attributes (guess how to cleanse data; example values: Japan, google.com, 1.5)
    Audience Segment of already "converted" customers → Build predictive model → Evaluation (accuracy) → Sufficient?
    1ST PASS: Treasure CDP does everything for you
    FROM 2ND PASS: You can make your predictive model better with ML experts
    SCORE CUSTOMERS: Audience scored as Unlikely, Marginally, Possibly, Likely (example scores: 12 20 3 34 40 72 58 82 93 99 78)
    SYNDICATE
  28. How an enterprise-grade ML/NLP solution should be

    Scalable: Digdag, Hivemall, Presto, Hadoop, Embulk, …
    Accurate, with no crucial mistakes and trivial false positives
    Interpretable in terms of both algorithm and UI design for all users
  29. MVP = classic algorithms and heuristics because there is no free lunch

    ajustchicago.org/2016/01/aint-no-free-lunch
  30. Machine Learning and Natural Language Processing on Treasure CDP

    Takuya Kitazawa (@takuti), Data Science Engineer at Treasure Data, Inc. and Apache Hivemall Committer