Text Preprocessing Pipeline for Bahasa using Python: Concept, Steps, Tools, and Examples

Text is unstructured data, so it must be converted from human language into a machine-readable format before it can be processed further.

Python Conference Indonesia 2020
14-15 November, 2020 [Online]

https://pycon.id

Kuncahyo Setyo Nugroho

November 15, 2020

Transcript

  1. Text Preprocessing Pipeline for Bahasa using Python: Concept, Steps, Tools, and Examples
    Kuncahyo Setyo Nugroho, PyCon ID - 14 November 2020
  2. Hello, call me Cahyo :)
    ▪ Software Engineer, ITC and Data Center, Universitas Widyagama, Malang - Indonesia
    ▪ Master Student, Faculty of Computer Science, Universitas Brawijaya, Malang - Indonesia
      Intelligent Systems Laboratory, Affective Computing Research Interest Group (ACRIG)
  3. Text Preprocessing Pipeline for Bahasa using Python: Concept, Steps, Tools, and Examples
    Kuncahyo Setyo Nugroho | Present in PyCon ID 2020
  4. Text → Unstructured Data
    Today, more than 80% of data is unstructured. To gain better insights or build better algorithms, you need to ‘play’ around with the data and clean it.
  5. How would a machine read this?
    Sources: twitter.com, play.google.com, tokopedia.com, shopee.co.id, suratwarga.malangkab.go.id
  6. Garbage ‘in’, garbage ‘out’
    Garbage Data → Garbage Result
  7. Talk Outline
    1. Introduction
      ▪ Why this topic?
      ▪ What’s text preprocessing?
      ▪ Text preprocessing pipeline
    2. Text Preprocessing
      ▪ Python library
      ▪ Text preprocessing techniques using Python
      ▪ Rule of thumb. Do you need all techniques?
  8. Why this Topic?
    ▪ Why Text Preprocessing? Text preprocessing is a severely overlooked topic, and the wrong preprocessing techniques are sometimes applied.
    ▪ Why Bahasa (Indonesia)? Bahasa is one of the top ten languages spoken throughout the world. However, few studies on it have been published in international journals or proceedings.
  9. What’s Text Preprocessing?
    An approach for cleaning and preparing text data for use in a specific task. A task is a combination of approach and domain: Task = approach + domain.
    It reduces the indexing (or data) file size of the text and improves efficiency and effectiveness.
  10. Text Preprocessing Pipeline
    ‘Pipeline’ refers to a finite sequence of steps that takes ‘raw’ text as input and returns properly preprocessed ‘clean’ text as output. The text preprocessing pipeline may vary depending on the task.
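As a minimal sketch of that definition (not code from the talk), a pipeline can be modeled as an ordered list of functions that each take a string and return a string. The name `run_pipeline` is hypothetical, and two built-in string methods stand in for real preprocessing steps.

```python
from typing import Callable, Iterable

# A step is any function that maps text to text.
Step = Callable[[str], str]


def run_pipeline(raw_text: str, steps: Iterable[Step]) -> str:
    """Apply each step in order, feeding the output of one into the next."""
    text = raw_text
    for step in steps:
        text = step(text)
    return text


# Placeholder steps: trim the ends, then lowercase.
clean = run_pipeline("  Selamat Datang di PyCon ID 2020  ", [str.strip, str.lower])
print(clean)  # selamat datang di pycon id 2020
```

Because the steps are just a list, a task-specific pipeline is a matter of adding, dropping, or reordering entries.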
  11. Python Library for Text Preprocessing
    1. NLTK (Natural Language Toolkit); 2. Sastrawi; 3. spaCy; 4. Flair; 5. Pandas; 6. Scikit-learn; 7. etc.
  12. Text Preprocessing Techniques: Lower Case & Remove Whitespace
    The simplest and most effective form of text preprocessing. To a machine, Indonesia ≠ INDONESIA ≠ indonesia.
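A minimal sketch of this step using only built-in string methods; the function name `lowercase_and_strip` is an illustrative choice, not taken from the slides.

```python
def lowercase_and_strip(text: str) -> str:
    """Lowercase the text and collapse every run of whitespace to one space."""
    # str.split() with no arguments splits on any whitespace and ignores
    # leading/trailing blanks, so re-joining with " " normalizes the spacing.
    return " ".join(text.lower().split())


print(lowercase_and_strip("  INDONESIA   Raya \n"))  # indonesia raya
```

After this step, "Indonesia", "INDONESIA", and "indonesia" all map to the same string.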
  13. Text Preprocessing Techniques: Remove HTML Tags
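The slide gives no code in this transcript, so the following is only a rough sketch. It assumes reasonably well-formed markup and strips anything that looks like a tag with a regular expression, then decodes HTML entities with the standard-library `html.unescape`. For messy real-world pages, a proper HTML parser is usually the safer choice.

```python
import html
import re

# Anything between "<" and the next ">" is treated as a tag (an assumption
# that breaks on malformed markup or on text containing a literal "<").
TAG_PATTERN = re.compile(r"<[^>]+>")


def remove_html_tags(text: str) -> str:
    """Replace tags with spaces, decode entities, and tidy the whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    unescaped = html.unescape(without_tags)
    return " ".join(unescaped.split())


print(remove_html_tags("<p>Selamat <b>datang</b> &amp; salam</p>"))
# Selamat datang & salam
```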
  14. Text Preprocessing Techniques: Remove URLs & Email
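One possible sketch of this step with two deliberately loose regular expressions. The patterns are assumptions for illustration: they catch common forms such as `https://...`, `www....`, and `name@domain.tld`, and they also swallow any punctuation attached to the match. The address `halo@example.com` is a made-up example.

```python
import re

URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_PATTERN = re.compile(r"\S+@\S+\.\S+")


def remove_urls_and_emails(text: str) -> str:
    """Drop URL-like and email-like tokens, then tidy the whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = EMAIL_PATTERN.sub(" ", text)
    return " ".join(text.split())


print(remove_urls_and_emails("Info: https://pycon.id atau email halo@example.com ya"))
# Info: atau email ya
```

This step should run before punctuation removal; otherwise the `://`, `.`, and `@` markers the patterns rely on are already gone.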
  15. Text Preprocessing Techniques: Remove Numbers & Punctuation
    Removes numbers and punctuation from the text, such as ‘123....?!’, as well as symbols such as ‘@#$’.
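A small sketch of this step with the standard library. It assumes ASCII punctuation only (`string.punctuation`), so typographic quotes and similar characters would need extra handling.

```python
import re
import string

# Map every ASCII punctuation character to a space so that words glued
# together by punctuation ("murah,banget") do not merge.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def remove_numbers_and_punctuation(text: str) -> str:
    """Remove digit runs and ASCII punctuation, then tidy the whitespace."""
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TO_SPACE)
    return " ".join(text.split())


print(remove_numbers_and_punctuation("Harga 123....?! murah @#$ banget"))
# Harga murah banget
```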
  16. Text Preprocessing Techniques: Remove Emoji
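A hedged sketch of emoji removal based on a handful of Unicode ranges. The ranges below are an assumption that covers many common emoji but not all of them (modifiers and variation selectors, for instance, are left alone), so treat the list as a starting point.

```python
import re

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticon-style faces
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)


def remove_emoji(text: str) -> str:
    """Replace characters in the ranges above with a space and tidy up."""
    return " ".join(EMOJI_PATTERN.sub(" ", text).split())


# Escapes are used so the example does not depend on the file encoding:
# U+1F600 is a grinning face, U+1F44D a thumbs-up sign.
print(remove_emoji("Mantap \U0001F600\U0001F44D sekali"))  # Mantap sekali
```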
  17. Text Preprocessing Techniques: Remove Emoticon
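Emoticons are built from ordinary punctuation, so a lookup list is one simple way to remove them. The list below is a tiny hand-picked sample for illustration, not a complete inventory.

```python
import re

# Illustrative sample only; a real list would be much longer.
EMOTICONS = [":-)", ":)", ":-(", ":(", ":D", ";)", ":P"]

# Longer emoticons are tried first so a short one never matches inside them.
EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)


def remove_emoticons(text: str) -> str:
    """Replace known emoticons with a space and tidy the whitespace."""
    return " ".join(EMOTICON_PATTERN.sub(" ", text).split())


print(remove_emoticons("Terima kasih :) sampai jumpa :-("))
# Terima kasih sampai jumpa
```

If emoticons matter for the task, this step has to run before punctuation removal, which would otherwise destroy them.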
  18. Text Preprocessing Techniques: Emoji Conversion
    https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
    For example, 😀 → grinning_face
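The slide points to the emoji dictionary in the `emot` repository. As a dependency-free stand-in, this sketch derives a label from the official Unicode character name via `unicodedata.name`; the resulting labels may differ from those in the `emot` dictionary, and the code-point ranges are an assumption covering only part of the emoji set.

```python
import re
import unicodedata

# Partial emoji coverage, chosen for illustration.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001F6FF\U0001F900-\U0001F9FF]")


def emoji_to_text(text: str) -> str:
    """Replace each matched emoji with its Unicode name in snake_case."""

    def replace(match: re.Match) -> str:
        # The second argument is the fallback for unnamed code points.
        name = unicodedata.name(match.group(0), "")
        return " " + name.lower().replace(" ", "_") + " " if name else " "

    return " ".join(EMOJI_PATTERN.sub(replace, text).split())


# U+1F600 is named GRINNING FACE in the Unicode database.
print(emoji_to_text("Keren \U0001F600"))  # Keren grinning_face
```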
  19. Text Preprocessing Techniques: Emoticon Conversion
    https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
    For example, :-) → happy_face_smiley
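Emoticon conversion can be sketched as a dictionary lookup. Only the `:-)` entry comes from the slide; the other entries and their labels are assumptions added for illustration, and a fuller mapping would come from a resource such as the one linked above.

```python
import re

EMOTICON_LABELS = {
    ":-)": "happy_face_smiley",  # example given on the slide
    ":)": "happy_face_smiley",   # assumed label
    ":-(": "sad_face",           # assumed label
    ":(": "sad_face",            # assumed label
}

# Longer emoticons are tried first so a short one never matches inside them.
EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_LABELS, key=len, reverse=True))
)


def emoticon_to_text(text: str) -> str:
    """Replace each known emoticon with its word label."""
    converted = EMOTICON_PATTERN.sub(
        lambda m: " " + EMOTICON_LABELS[m.group(0)] + " ", text
    )
    return " ".join(converted.split())


print(emoticon_to_text("Sampai jumpa :-)"))  # Sampai jumpa happy_face_smiley
```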
  20. Text Preprocessing Techniques: Remove Non-ASCII Characters
    https://www.ascii-code.com
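One common way to do this with the standard library is to decompose accented letters first (so "é" keeps its base letter "e") and then drop whatever still falls outside ASCII. The normalization step is a design choice of this sketch, not something stated on the slide. Note that this removes all non-Latin text as well, so it only suits tasks where that loss is acceptable.

```python
import unicodedata


def remove_non_ascii(text: str) -> str:
    """Fold accented letters to ASCII where possible, drop everything else."""
    # NFKD splits characters such as U+00E9 into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" silently discards any remaining non-ASCII character.
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


# U+00E9 is e-acute and U+2764 is a heart symbol.
print(remove_non_ascii("Caf\u00e9 enak\u2764"))  # Cafe enak
```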
  21. Text Preprocessing Techniques: Slang Word Normalization
    Maps slang words to their standard forms. For example, gmn → bagaimana, jwb → jawab, gueh → saya
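Slang normalization is typically a lexicon lookup. The sketch below uses only the three pairs from the slide; a real lexicon would hold many more entries. It assumes the text is already lowercased and that words are separated by whitespace.

```python
# The three entries are the examples from the slide.
SLANG_TO_STANDARD = {
    "gmn": "bagaimana",
    "jwb": "jawab",
    "gueh": "saya",
}


def normalize_slang(text: str, lexicon: dict = SLANG_TO_STANDARD) -> str:
    """Replace each whitespace-separated word found in the lexicon."""
    return " ".join(lexicon.get(word, word) for word in text.split())


print(normalize_slang("gmn cara jwb soal ini"))
# bagaimana cara jawab soal ini
```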
  22. Text Preprocessing Techniques: Stemming
    The process of reducing inflected words to their root form. For example, the words ‘mendengarkan’, ‘dengarkan’, and ‘didengarkan’ are all transformed into the word ‘dengar’.
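To make the idea concrete, here is a toy affix stripper that handles the three words from the slide. It is emphatically not a real Indonesian stemmer: the affix lists are a tiny assumed subset, and the rules will mis-stem many ordinary words. A dedicated library such as Sastrawi, listed earlier in the deck, is the practical choice.

```python
# Tiny illustrative subsets; Indonesian has many more affixes.
PREFIXES = ("men", "di")
SUFFIXES = ("kan",)
MIN_STEM_LENGTH = 3  # refuse to strip if too little of the word would remain


def naive_stem(word: str) -> str:
    """Strip at most one known suffix and then one known prefix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[: -len(suffix)]
            break
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    return word


for word in ("mendengarkan", "dengarkan", "didengarkan"):
    print(word, "->", naive_stem(word))  # each line ends in "dengar"
```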
  23. Text Preprocessing Techniques: Word & Sentence Tokenize
    The process of splitting text into pieces called tokens, at either the word or the sentence level. Words, numbers, symbols, punctuation marks, and other important entities can all be considered tokens.
    ‘Selamat datang di Pycon ID 2020.’ → ‘Selamat’ ‘datang’ ‘di’ ‘Pycon’ ‘ID’ ‘2020’ ‘.’
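A regex-based sketch that reproduces the slide's example without any external library. The function names are hypothetical, and the sentence splitter naively assumes that ".", "!" or "?" followed by whitespace ends a sentence, which fails on abbreviations.

```python
import re


def simple_word_tokenize(text: str) -> list:
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def simple_sentence_tokenize(text: str) -> list:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(simple_word_tokenize("Selamat datang di Pycon ID 2020."))
# ['Selamat', 'datang', 'di', 'Pycon', 'ID', '2020', '.']
print(simple_sentence_tokenize("Selamat datang. Apa kabar?"))
# ['Selamat datang.', 'Apa kabar?']
```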
  24. Text Preprocessing Techniques: Stopwords Removal (NLTK)
    Removes low-information (noise) words from the text. Examples of stopwords in Indonesian are ‘yang’, ‘dan’, ‘di’, ‘dari’, etc.
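The slide uses NLTK for this step; to stay dependency-free, the sketch below hard-codes only the four stopwords named on the slide. In practice the set would be replaced by a full list from a library or a curated file.

```python
# Only the four examples from the slide; a real list is far longer.
STOPWORDS = {"yang", "dan", "di", "dari"}


def remove_stopwords(tokens: list, stopwords: set = STOPWORDS) -> list:
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = ["buku", "yang", "dibeli", "dari", "toko", "di", "Malang"]
print(remove_stopwords(tokens))
# ['buku', 'dibeli', 'toko', 'Malang']
```

Matching on whole tokens matters here: "dibeli" survives even though it starts with the stopword "di".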
  25. Text Preprocessing Techniques: Stopwords Removal (Sastrawi)
  26. Combine into Text Preprocessing Pipeline
    https://github.com/ksnugroho/pycon-id-2020
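The speaker's actual pipeline lives in the repository linked above. The following is an independent, simplified sketch of how several of the earlier steps might be chained; the choice and order of steps, the tiny lexicons, and the function name `preprocess` are all assumptions made for illustration.

```python
import re
import string

SLANG = {"gmn": "bagaimana", "jwb": "jawab", "gueh": "saya"}  # slide examples
STOPWORDS = {"yang", "dan", "di", "dari"}                      # slide examples
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text: str) -> list:
    """Lowercase, strip URLs, digits and punctuation, normalize, filter."""
    text = text.lower()
    # URLs go first, while "://" and "." are still present to match on.
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TO_SPACE)
    tokens = [SLANG.get(token, token) for token in text.split()]
    return [token for token in tokens if token not in STOPWORDS]


print(preprocess("Gmn cara daftar di https://pycon.id ?? Jwb yang cepat ya 123!"))
# ['bagaimana', 'cara', 'daftar', 'jawab', 'cepat', 'ya']
```

The order is significant: lowercasing must precede the slang and stopword lookups, and URL removal must precede punctuation removal.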
  27. Rule of thumb. Do you need all techniques?
    There are no definite rules for the steps in text preprocessing. Not all tasks need the same level of text preprocessing. For some tasks, you can get away with the minimum effort.
    Must Do:
      ▪ Noise removal
      ▪ Lowercasing (can be task dependent in some cases)
    Should Do:
      ▪ Simple normalization
    Task Dependent:
      ▪ Advanced normalization
      ▪ Stop-word removal
      ▪ Stemming
      ▪ Text enrichment / augmentation
  28. Thank you! See you at PyCon ID 2021 :)
    Would love to connect, feel free to reach out. Discussion, any questions?
    https://www.linkedin.com/in/ksnugroho
    https://github.com/ksnugroho