Double helix multi-stage text classification model to enhance chat user experience in e-commerce website

The study focuses on text-based classification in the chat feature to detect out-of-stock products from the seller’s text messages.

This research was presented at the conference of the International Federation of Classification Societies (IFCS) in August 2019 in Thessaloniki, Greece.

Co-authors: Abdullah Ghifari, Rya Meyvriska

Fiqry Revadiansyah

August 29, 2019

Transcript

  1. Bukalapak IFCS – 2019 - Thessaloniki The 2019 conference of

    the International Federation of Classification Societies Thessaloniki, Greece 2019 Fiqry Revadiansyah | Data Scientist | PT. Bukalapak.com Double helix multi-stage text classification model to enhance chat user experience in e-commerce website
  2. 01 About Bukalapak and Beyond

    02 Problem Statement | Users Problem
    03 Research Idea and Methodology
    04 Data-driven Insight | Data-driven Decision
  3. Inside Bukalapak

    Bukalapak is one of the largest tech unicorn companies in Southeast Asia. Founded in 2010, it now has more than 50 million active users and more than half a million transactions per day across various products, including small kiosks and the e-commerce platform. With its 2,600 employees, Bukalapak is also available internationally through BukaGlobal, which widens its market to ASEAN countries to contribute more to the ASEAN digital playground.
  4. Our Business Size Today: Empowering individuals and SMEs in Indonesia

    Active access/sec: 100K+
    Sellers: 4 million+
    Mitra Bukalapak: 500K+
    Age 18-35: 70%
    *Data as of January 2019. © Employer Branding Bukalapak 2019
  5. Our Team Size Today: Empowering individuals and SMEs in Indonesia

    Business Squad: 80+
    BI Specialists, Data Scientists, & Data Engineers: 100+
    Big Data stored (Structured + Unstructured): 4 PB+
    *Data as of January 2019. © Employer Branding Bukalapak 2019
  6. Currently we have ~4 PB of data stored in our system

    Data lake & data warehousing
    Experiment & track everything
    Real-time information
    Distributed query
  7. Open Marketplace – Open Discussion

    Buyer A: “Hi, is it ready (stock available)?”
    Seller B: “Unfortunately, no. Just sold out yesterday. I forgot to update the product stock. Sorry.”
    Buyer A: “What an unfortunate fate. Ok, thanks, gonna look for another one.”
    € 200. Seller B forgot to update their product stock.
  8. Open Marketplace – Open Discussion

    Hundreds of our prospective buyers: “Hi, is it ready (stock available)?”
    Seller B: “@#$Y#@$@$(@”
    € 200. Seller B forgot to update their product stock.
  9. Open Marketplace – Open Shop Across Any Other E-commerce

    Seller B
  10. How could we help them find a solution?

    Our prospective buyers -> Chat -> Text data
    Our sellers (Seller B) -> Chat -> Text data
    Provide an automatic solution
  11. How could we help them find a solution?

    Our prospective buyers -> Chat -> Text data -> 01 Intent
    Our sellers (Seller B) -> Chat -> Text data -> 02 Empty / Not Empty
    Provide an automatic solution
  12. Inspired by the architecture of our DNA helix

    Our prospective buyers -> Chat -> Text data -> Buyer text -> Intent A
    Our sellers -> Chat -> Text data -> Seller text -> Intent A
  13. Inspired by the architecture of our DNA helix

    Our prospective buyers -> Chat -> Text data -> Buyer text -> Intent A
    Our sellers -> Chat -> Text data -> Seller text -> Intent A

    Buyer messages: “Is it ready?” “Any blue color?” “Free shipping?” “Hi, good morning!” “I’ve transferred the money to xxx” “Available for two products?”
    Seller messages: “Yes, it’s ready” “No, how about red?” “Please use voucher” “Hi, happy shopping” “Okay, I will proceed” “Unfortunately, no. Just one. Take it?”
  14. Inspired by the architecture of our DNA helix

    Our prospective buyers -> Chat -> Text data -> Buyer text -> Intent A
    Our sellers -> Chat -> Text data -> Seller text -> Intent A

    STAGE I – Buyer Text: (1) Text Preprocessing, (2) Classify the intent: Asking Product Stock / Other
    STAGE II – Seller Text: (1) Text Preprocessing, (2) Classify the intent: Empty / Not Empty (Available)
    The two stages run independently.
  15. General Methodology

    Start
    1. Data Retrieval
    2. Data Preprocessing
    3. Feature Extraction & Engineering: TF-IDF Normalization, FastText, Word2Vec
    4. Train Test Split: train data and test data for each of TF-IDF, FastText, and Word2Vec
    5. Data Training: Model 1 – TF-IDF, Model 2 – FastText, Model 3 – Word2Vec
    6. Model Evaluation and Selection
    7. Hyperparameter Optimization
    End
  16. Double-helix Multi Stage Methodology

    First Helix: Buyer Text - Start -> 1. General Methodology -> Ask the Product Stock / Ask Other
    Second Helix: Seller Text - Start -> 1. General Methodology -> Product is Empty / Other Class
    2. Match the Result -> End
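
The matching step can be sketched as below. This is only an illustration: it assumes a conversation is flagged when the buyer model predicts "asking product stock" and the seller model predicts "empty", which the slides do not spell out. The two `classify_*` functions are hypothetical keyword stand-ins for the trained models, and all names are made up for this example.

```python
# Hypothetical stand-ins for the two trained classifiers (keyword rules only).
def classify_buyer(text):
    """First helix: does the buyer ask about product stock?"""
    keywords = ("ready", "available", "stock")
    return "ask_stock" if any(k in text.lower() for k in keywords) else "other"


def classify_seller(text):
    """Second helix: does the seller say the product is empty?"""
    keywords = ("sold out", "empty", "unfortunately, no")
    return "empty" if any(k in text.lower() for k in keywords) else "other"


def match_result(buyer_text, seller_text):
    """Assumed matching rule: flag only when both helices agree."""
    buyer_intent = classify_buyer(buyer_text)     # stage I, run independently
    seller_intent = classify_seller(seller_text)  # stage II, run independently
    return buyer_intent == "ask_stock" and seller_intent == "empty"


flagged = match_result("Hi, is it ready (stock available)?",
                       "Unfortunately, no. Just sold out yesterday.")
not_flagged = match_result("Hi, good morning!", "Hi, happy shopping")
print(flagged, not_flagged)
```

Keeping the two classifiers separate means either model could be retrained or replaced without touching the other; only the small matching rule connects them.
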
  17. 1. Data Retrieval

    Buyer and Seller Text Conversation. We took conversation data from the first quarter of 2019 that showed the `empty product stock` symptom, such as messages containing “stock is empty”, “sold out”, etc. (plus more slang words in Bahasa Indonesia). We captured chats in which the buyer was also the conversation initiator.
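
A filter of this kind might look like the sketch below. The two phrases come from the slide; the record layout (`initiator`, `messages`) and the sample conversations are hypothetical, and the real keyword list also covered Bahasa Indonesia slang that is not reproduced here.

```python
# Phrases taken from the slide; the real list also had Bahasa Indonesia slang.
EMPTY_STOCK_PHRASES = ("stock is empty", "sold out")

# Hypothetical conversation records: who started the chat and the messages.
conversations = [
    {"initiator": "buyer", "messages": ["Is it ready?", "Sorry, sold out"]},
    {"initiator": "seller", "messages": ["Promo today!", "Stock is empty soon"]},
    {"initiator": "buyer", "messages": ["Free shipping?", "Please use voucher"]},
]


def mentions_empty_stock(messages):
    """True if any message contains one of the out-of-stock phrases."""
    return any(p in m.lower() for m in messages for p in EMPTY_STOCK_PHRASES)


# Keep only buyer-initiated conversations that mention empty stock.
selected = [
    c for c in conversations
    if c["initiator"] == "buyer" and mentions_empty_stock(c["messages"])
]
print(len(selected))
```
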
  18. 2. Data Preprocessing

    Data preprocessing was applied to text in Bahasa Indonesia. Preprocessing is one of the most difficult tasks in NLP, because we have to understand the intent of each message and transform it into a structure better suited to machine learning. We performed six (6) main steps, from punctuation removal to stopword removal. For example, here is a complete chat message from a buyer to a seller: “Hi! could I ask, that this camera is available or not? Ty!”

    1. Remove Punctuation: Hi could I ask that this camera is available or not Ty
    2. Tokenization: “Hi” “could” “I” “ask” “that” “this” “camera” “is” “available” “or” “not” “ty”
    3. Slang Words Transform: “Hi” “could” “I” “ask” “that” “this” “camera” “is” “available” “or” “not” “thank”
    4. Stemming: “Hi” “can” “I” “ask” “that” “this” “camera” “is” “available” “or” “not” “thank”
    5. Spell Checker: “Hi” “can” “I” “ask” “that” “this” “camera” “is” “available” “or” “not” “thank”
    6. Stopwords Removal: “Hi” “I” “ask” “camera” “available” “thank”
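
The six steps can be sketched in a few lines of Python. This is a toy version under clear assumptions: the lookup tables are tiny English stand-ins chosen to reproduce the slide's example, the stemmer is a dictionary lookup, and the spell checker is a placeholder that changes nothing. The actual pipeline used Bahasa Indonesia resources that the slides do not show.

```python
import string

# Toy English lookup tables; the real pipeline used Bahasa Indonesia resources.
SLANG = {"ty": "thank"}
STEMS = {"could": "can"}  # stand-in for a real stemmer
STOPWORDS = {"can", "could", "that", "this", "is", "or", "not"}


def correct_spelling(token):
    """Placeholder spell checker: returns the token unchanged."""
    return token


def preprocess(message):
    # 1. Remove punctuation
    text = message.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenization (whitespace split)
    tokens = text.split()
    # 3. Slang words transform
    tokens = [SLANG.get(t.lower(), t) for t in tokens]
    # 4. Stemming
    tokens = [STEMS.get(t.lower(), t) for t in tokens]
    # 5. Spell checker
    tokens = [correct_spelling(t) for t in tokens]
    # 6. Stopwords removal
    return [t for t in tokens if t.lower() not in STOPWORDS]


tokens = preprocess("Hi! could I ask, that this camera is available or not? Ty!")
print(tokens)  # ['Hi', 'I', 'ask', 'camera', 'available', 'thank']
```
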
  19. 3. Feature Extraction and Engineering

    Given a set of text data, we have to transform it into numeric values so that the computer can understand and learn from the data. We used three distinct feature extraction methods for NLP: Normalized TF-IDF (Term Frequency – Inverse Document Frequency), FastText, and Word2Vec.

    TF-IDF (1): the frequency of each word in the sentence versus the whole document (TF formula, IDF formula, normalization)
    FastText (2): pre-trained DNN model from Facebook Research
    Word2Vec (3): pre-trained DNN model using skip-grams

    Related Academic Papers
    1. TF-IDF: Allahyari, et al. 2017. A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. In Proceedings of KDD Bigdas, Halifax, Canada, August 2017, 13 pages.
    2. FastText: Joulin, et al. 2017. FastText.zip: Compressing Text Classification Models. 5th International Conference on Learning Representations Proceedings.
    3. Word2Vec: Mikolov, et al. 2013. Distributed Representations of Words and Phrases and their Compositionality.
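
As a rough illustration of normalized TF-IDF, the sketch below uses one common textbook definition: term frequency as count divided by document length, IDF as log(N / document frequency), and L2 normalization of each document vector. The slide's exact formulas did not survive in the transcript, so this variant is an assumption, and the sample documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return the vocabulary and one L2-normalized TF-IDF vector per document."""
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}

    vectors = []
    for d in docs:
        counts = Counter(d)
        raw = [counts[w] / len(d) * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vocab, vectors


# Hypothetical, already-preprocessed token lists.
docs = [
    ["hi", "ask", "camera", "available"],
    ["camera", "sold", "out"],
    ["ask", "free", "shipping"],
]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print([round(x, 3) for x in vectors[1]])
```

Words that appear in fewer documents (such as "sold") receive a larger weight than words shared across documents (such as "camera"), which is the property that makes TF-IDF useful for intent classification.
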
  20. 4. Train Test Split

    We randomly split the data into training and testing datasets in a proportion of 70%:30% for each feature.

    Normalized TF-IDF: 70% train data, 30% test data
    FastText: 70% train data, 30% test data
    Word2Vec: 70% train data, 30% test data
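
A random 70%:30% split can be sketched with the standard library as follows. The function name, the seed, and the 100 placeholder rows are assumptions for illustration; the slides do not say which tool performed the split.

```python
import random


def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 feature vectors
train, test = train_test_split(rows)
print(len(train), len(test))  # 70 30
```

Reusing the same seed for the TF-IDF, FastText, and Word2Vec matrices would keep the same conversations in the train and test parts of each feature set, which makes the model comparison fairer.
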
  21. 5. Data Training

    We trained models on both the buyer and the seller data in order to get predictions. We used several machine learning algorithms and compared the accuracy of each model. The machine learning algorithms that we used are:

    1. Logistic Regression (Baseline Model)
    2. K-Nearest Neighbor
    3. Naïve Bayes
    4. Decision Tree
    5. Random Forest
    6. Gradient Boosting Classifier
    7. Extreme Gradient Boosting
  22. 5. Data Training

    We trained our models using K-Fold Cross Validation, which evaluates model performance by splitting the data into K folds; each fold serves once as the validation set while the remaining folds are used for training.
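
The fold construction can be sketched as below. The value K = 5 and the 10 samples are assumptions chosen for readability; the slides do not state which K was used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print(val_idx, len(train_idx))
```

A model would be fitted on `train_idx` and scored on `val_idx` in each round, and the K scores averaged to estimate performance.
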
  23. 6. Model Evaluation and Selection

    To evaluate model performance, we used several metrics, such as accuracy score, negative recall, and AUC score. However, we focused on pushing down the false positives by aiming for a high recall value. A higher accuracy score indicates balanced performance in predicting both the 0 and 1 classes. A higher recall indicates very good performance at avoiding false positive cases. A higher AUC score indicates that the model performs reliably on an imbalanced dataset.

    $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
    $$\text{Precision} = \frac{TP}{TP + FP}$$
    $$\text{Recall (Positive)} = \frac{TP}{TP + FN}$$
    $$\text{Recall (Negative)} = \frac{TN}{TN + FP}$$
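
For a concrete feel of these metrics, the sketch below computes accuracy and per-class recall from confusion-matrix counts using the standard definitions. The counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy and per-class recall from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "positive_recall": tp / (tp + fn),
        "negative_recall": tn / (tn + fp),
    }


# Hypothetical counts, not taken from the study.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

Negative recall rises only when false positives fall, which is why it suits a goal of avoiding false positive cases.
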
  24. 7. Hyperparameter Optimization

    The hyperparameters of the chosen models are tuned in order to reach the best possible accuracy on the training dataset. We used Randomized Search and Bayesian Search to find the best parameters in the hyperparameter space of the best selected models.

    Related Academic Papers
    1. Random Search: Zabinsky, Zelda B. 2009. Random Search Algorithms. University of Washington, USA.
    2. Bayesian Search: Shahriari, Bobak, et al. 2016. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE (Volume: 104, Issue 1).
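
The idea behind random search can be sketched in plain Python. Everything specific here is an assumption for illustration: the two hyperparameters, their ranges, and `toy_score`, which stands in for a cross-validated model score. Bayesian search differs in that it uses earlier trials to pick the next candidate and is not shown.

```python
import random


def toy_score(max_depth, learning_rate):
    """Hypothetical objective; a real run would return a cross-validated score."""
    return 1.0 - 0.01 * abs(max_depth - 6) - abs(learning_rate - 0.1)


def random_search(n_trials=50, seed=0):
    """Sample hyperparameters at random and keep the best-scoring set."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "max_depth": rng.randint(2, 10),
            "learning_rate": rng.uniform(0.01, 0.3),
        }
        score = toy_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search()
print(best_params, round(best_score, 3))
```
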
  25. Result and Discussion

    04 Data-driven Insight | Data-driven Decision
  26. Results

    Seller Model

    | Model | TF-IDF Accuracy | TF-IDF Neg Recall | TF-IDF Log Loss | Word2Vec Accuracy | Word2Vec Neg Recall | Word2Vec Log Loss | FastText Accuracy | FastText Neg Recall | FastText Log Loss |
    |---|---|---|---|---|---|---|---|---|---|
    | Logistic Regression | 85.37% | 56.20% | 0.37 | 88.52% | 63.28% | 0.33 | 73.65% | 0.00% | 0.58 |
    | K-Nearest Neighbor | 84.18% | 58.39% | 1.80 | 87.90% | 70.33% | 1.56 | 68.89% | 10.00% | 0.63 |
    | Naïve Bayes | 85.52% | 59.29% | 0.39 | 84.95% | 66.88% | 1.04 | 73.65% | 0.00% | 0.58 |
    | Decision Tree | 84.04% | 62.57% | 4.10 | 84.28% | 72.50% | 5.43 | 73.65% | 0.00% | 0.58 |
    | Random Forest | 84.65% | 60.20% | 1.16 | 88.33% | 71.77% | 0.95 | 73.65% | 0.00% | 0.58 |
    | Gradient Boosting Classifier | 84.52% | 52.25% | 0.38 | 88.09% | 69.96% | 0.31 | 73.65% | 0.00% | 0.58 |
    | Extreme Gradient Boosting | 83.23% | 49.54% | 0.40 | 88.42% | 69.96% | 0.30 | 73.65% | 0.00% | 0.58 |

    Buyer Model

    | Model | TF-IDF Accuracy | TF-IDF Neg Recall | TF-IDF Log Loss | Word2Vec Accuracy | Word2Vec Neg Recall | Word2Vec Log Loss | FastText Accuracy | FastText Neg Recall | FastText Log Loss |
    |---|---|---|---|---|---|---|---|---|---|
    | Logistic Regression | 85.84% | 85.66% | 0.37 | 84.46% | 79.74% | 0.43 | 53.96% | 100.00% | 0.69 |
    | K-Nearest Neighbor | 81.40% | 80.09% | 0.97 | 84.17% | 84.28% | 1.59 | 47.77% | 20.14% | 0.72 |
    | Naïve Bayes | 86.55% | 82.75% | 0.35 | 83.45% | 84.81% | 0.74 | 53.96% | 100.00% | 0.69 |
    | Decision Tree | 85.69% | 85.13% | 3.16 | 76.83% | 77.59% | 8.00 | 53.96% | 100.00% | 0.69 |
    | Random Forest | 86.55% | 88.04% | 0.55 | 83.02% | 86.95% | 1.05 | 53.96% | 100.00% | 0.69 |
    | Gradient Boosting Classifier | 86.12% | 87.79% | 0.33 | 84.17% | 84.00% | 0.38 | 53.96% | 100.00% | 0.69 |
    | Extreme Gradient Boosting | 84.12% | 87.26% | 0.35 | 83.45% | 84.27% | 0.37 | 53.96% | 100.00% | 0.69 |
  27. Hyperparameter Tuning Results

    The XGBoost (Extreme Gradient Boosting) model outperforms the other models in seller text classification, while the Random Forest model is the best fit for buyer text classification.

    Buyer Model (First Helix): Random Forest tuned by Random Search -> AUC: 88.28%
    Seller Model (Second Helix): XGBoost tuned by Bayesian Hyperopt -> AUC: 92.31%
  28. What is Next?

    Buyer Model (First Helix): Random Forest tuned by Random Search -> AUC: 88.28%
    Seller Model (Second Helix): XGBoost tuned by Bayesian Hyperopt -> AUC: 92.31%

    Store Weights (Model parameters) -> Integrate with the Microservices -> Deploy
  29. Thank you

    Fiqry Revadiansyah, Data Scientist, @fiqryr
    Abdullah Ghifari, Data Scientist, @abdullahghifari
    Rya Meyvriska, Data Scientist, @rya_mey