Text Preprocessing Pipeline for Bahasa using Python: Concept, Steps, Tools, and Examples

Text Preprocessing Pipeline for Bahasa using Python: Concept, Steps, Tools,
and Examples Kuncahyo Setyo Nugroho PyCon ID - 14 November 2020

Hello, call me Cahyo :)
▪ Software Engineer, ITC and Data Center, Universitas Widyagama, Malang - Indonesia
▪ Master's Student, Faculty of Computer Science, Universitas Brawijaya, Malang - Indonesia
  Intelligent Systems Laboratory, Affective Computing Research Interest Group (ACRIG)


Text → Unstructured Data
Today, more than 80% of data is unstructured. To gain better insights or build better algorithms, it is necessary to ‘play’ around with the data and clean it first.

How would a machine read this?
Sources: twitter.com, play.google.com, tokopedia.com, shopee.co.id, suratwarga.malangkab.go.id

Garbage ‘in’, garbage ‘out’
Garbage Data → Garbage Result

Talk Outline
1. Introduction
▪ Why this topic?
▪ What’s text preprocessing?
▪ Text preprocessing pipeline
2. Text Preprocessing
▪ Python library
▪ Text preprocessing techniques using Python
▪ Rule of thumb: do you need all techniques?

Why this Topic?
▪ Why Text Preprocessing? Text preprocessing is a severely overlooked topic, and sometimes the wrong preprocessing techniques are applied.
▪ Why Bahasa (Indonesia)? Bahasa is one of the ten most widely spoken languages in the world. However, few studies on it have been published in international journals or proceedings.

What’s Text Preprocessing?
An approach for cleaning and preparing text data for use in a specific task. A task is a combination of approach and domain:
Task = approach + domain
The goals are to reduce the size of the index (or data) files built from the text and to improve efficiency and effectiveness.

Text Preprocessing Pipeline
‘Pipeline’ refers to a finite sequence of steps that takes ‘raw’ text as input and returns properly preprocessed ‘clean’ text as output. The text preprocessing pipeline may vary depending on the task.

Python Library for Text Preprocessing
1. NLTK (Natural Language Toolkit)
2. Sastrawi
3. spaCy
4. Flair
5. Pandas
6. Scikit-learn
7. etc.

Text Preprocessing Techniques

Lower Case & Remove Whitespace
The simplest and most effective form of text preprocessing. Without it, the same word is treated as several different ones: Indonesia ≠ INDONESIA ≠ indonesia
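Lowercasing and whitespace cleanup can be sketched with the standard library alone. The function name `lowercase_and_strip` is an assumption made for this illustration, not a name taken from the talk.

```python
import re


def lowercase_and_strip(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace into single spaces."""
    text = text.lower()
    # \s+ matches spaces, tabs and newlines; strip() drops leading/trailing blanks.
    return re.sub(r"\s+", " ", text).strip()


print(lowercase_and_strip("  Selamat   Datang di\tINDONESIA \n"))
```

After this step, 'Indonesia', 'INDONESIA' and 'indonesia' all map to the same token.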

Remove HTML Tags
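One possible way to strip HTML tags is a simple regular expression plus entity unescaping, shown here as a minimal sketch. The pattern and the name `remove_html` are assumptions for illustration; badly formed markup may need a real HTML parser instead.

```python
import html
import re

# Naive tag pattern: anything between '<' and the next '>'.
TAG_RE = re.compile(r"<[^>]+>")


def remove_html(text: str) -> str:
    """Drop HTML tags, decode entities such as &amp;, and tidy whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


print(remove_html("<p>Selamat <b>datang</b> &amp; salam</p>"))
```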

Remove URLs & Email
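URLs and email addresses can be removed with two regular expressions. The patterns below are deliberately loose assumptions for this sketch, and the addresses in the example are placeholders.

```python
import re

# Loose patterns: good enough for noisy social-media text, not full validators.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")


def remove_urls_and_emails(text: str) -> str:
    """Replace URLs and email addresses with spaces, then tidy whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return " ".join(text.split())


print(remove_urls_and_emails(
    "Hubungi admin@example.com atau buka https://example.com/help sekarang"
))
```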

Remove Numbers & Punctuation
Removes numbers and punctuation from the text, such as ‘123....?!’, as well as symbols such as ‘@#$’.
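A minimal sketch of this step, assuming that only ASCII punctuation (Python's `string.punctuation`) needs to be removed; the function name is hypothetical.

```python
import re
import string

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def remove_numbers_and_punctuation(text: str) -> str:
    """Strip digits, ASCII punctuation and symbols, then tidy whitespace."""
    text = re.sub(r"\d+", " ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())


print(remove_numbers_and_punctuation("Harga 123....?! naik @#$ lagi"))
```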

Remove Emoji
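Emoji can be removed by matching the Unicode blocks where most of them live. The ranges below are an assumption for this sketch and do not cover every emoji in the Unicode standard.

```python
import re

EMOJI_RE = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons (faces)
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]+"
)


def remove_emoji(text: str) -> str:
    """Replace emoji with spaces, then tidy whitespace."""
    return " ".join(EMOJI_RE.sub(" ", text).split())


print(remove_emoji("Mantap \U0001F600\U0001F680 sekali"))
```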

Remove Emoticon
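Emoticons are plain ASCII sequences, so one way to remove them is to match against an explicit list. The short `EMOTICONS` list here is a hypothetical sample for illustration; a real pipeline would use a much larger one.

```python
import re

# Hypothetical sample list; extend as needed.
EMOTICONS = [":-)", ":)", ":-(", ":(", ":D", ";)", ":P"]

# Longest entries first so that ':-)' is tried before shorter alternatives.
EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)


def remove_emoticons(text: str) -> str:
    """Replace listed emoticons with spaces, then tidy whitespace."""
    return " ".join(EMOTICON_RE.sub(" ", text).split())


print(remove_emoticons("Terima kasih :-) sampai jumpa :("))
```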

Emoji Conversion
https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
For example, 😀 → grinning_face
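Instead of deleting emoji, they can be converted to words so their sentiment is kept. The slide points to the emot dictionary; the sketch below swaps in Python's built-in `unicodedata` names as a stand-in, so the labels it produces are Unicode character names and may differ from emot's. The emoji ranges and the function name are assumptions.

```python
import re
import unicodedata

# Rough emoji ranges; not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001F6FF\U0001F900-\U0001F9FF\u2600-\u27BF]")


def emoji_to_text(text: str) -> str:
    """Replace each emoji with its Unicode name, e.g. 'grinning_face'."""

    def replace(match: re.Match) -> str:
        name = unicodedata.name(match.group(0), "")
        if not name:
            return " "  # unnamed code point: drop it
        return " " + name.lower().replace(" ", "_") + " "

    return " ".join(EMOJI_RE.sub(replace, text).split())


print(emoji_to_text("Senang sekali \U0001F600"))
```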

Emoticon Conversion
https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
For example, :-) → happy_face_smiley
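Emoticon conversion can be sketched as a dictionary lookup. Only the `:-)` → `happy_face_smiley` pair comes from the slide; the other entries and the function name are hypothetical, added to make the example less trivial.

```python
import re

EMOTICON_NAMES = {
    ":-)": "happy_face_smiley",  # from the slide
    ":)": "happy_face_smiley",   # hypothetical entry
    ":-(": "sad_face",           # hypothetical entry
    ":(": "sad_face",            # hypothetical entry
}

EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_NAMES, key=len, reverse=True))
)


def convert_emoticons(text: str) -> str:
    """Replace each known emoticon with its descriptive label."""
    converted = EMOTICON_RE.sub(
        lambda m: " " + EMOTICON_NAMES[m.group(0)] + " ", text
    )
    return " ".join(converted.split())


print(convert_emoticons("Terima kasih :-)"))
```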

Remove Non-ASCII Characters
https://www.ascii-code.com
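A minimal sketch of non-ASCII removal, assuming that accented letters should be reduced to their base letter before everything else outside ASCII is dropped. The NFKD normalization step is a design choice of this example, not something stated in the talk.

```python
import unicodedata


def remove_non_ascii(text: str) -> str:
    """Keep only ASCII characters, reducing accented letters to base letters."""
    # NFKD splits 'é' into 'e' plus a combining accent; the accent is then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(remove_non_ascii("caf\u00e9 Malang"))
```

Note that this step also deletes emoji, so any emoji conversion has to run before it.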

Slang Word Normalization
Replaces slang words with their standard forms. For example, gmn → bagaimana, jwb → jawab, gueh → saya
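Slang normalization is usually a dictionary lookup per word. The sketch below uses only the three pairs from the slide; a real slang dictionary would be far larger, and the function name is an assumption.

```python
# Slang pairs taken from the slide.
SLANG = {
    "gmn": "bagaimana",
    "jwb": "jawab",
    "gueh": "saya",
}


def normalize_slang(text: str, slang: dict = SLANG) -> str:
    """Replace each whitespace-separated word that appears in the slang map."""
    return " ".join(slang.get(word, word) for word in text.split())


print(normalize_slang("gmn cara jwb soal ini"))
```

Because the lookup is case sensitive and whole-word, this step works best after lowercasing and punctuation removal.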

Stemming
The process of reducing inflected words to their root form. For example, the words ‘mendengarkan’, ‘dengarkan’, and ‘didengarkan’ are all transformed into the word ‘dengar’.
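For Indonesian, stemming is commonly done with the Sastrawi library listed earlier in the talk. The sketch below assumes Sastrawi is installed (`pip install Sastrawi`); the import is guarded and the wrapper `stem_text` is a hypothetical helper so the snippet still loads without it.

```python
try:
    from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
except ImportError:  # Sastrawi is optional for this sketch
    StemmerFactory = None


def stem_text(text: str, stemmer=None) -> str:
    """Stem every word in the text with the given (or a new Sastrawi) stemmer."""
    if stemmer is None:
        if StemmerFactory is None:
            raise RuntimeError("Sastrawi is required: pip install Sastrawi")
        stemmer = StemmerFactory().create_stemmer()
    return stemmer.stem(text)


if StemmerFactory is not None:
    print(stem_text("mendengarkan dengarkan didengarkan"))
```

Creating the stemmer is relatively slow, so in a loop it is better to build it once and pass it in through the `stemmer` argument.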

Word & Sentence Tokenization
The process of splitting text into pieces called tokens, either words or sentences. Words, numbers, symbols, punctuation marks, and other important entities can all be considered tokens.
‘Selamat datang di Pycon ID 2020.’ → ‘Selamat’ ‘datang’ ‘di’ ‘Pycon’ ‘ID’ ‘2020’ ‘.’
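The tokenization in the example above can be approximated with two regular expressions. This is a standard-library sketch with hypothetical function names; NLTK's `word_tokenize` and `sent_tokenize` handle abbreviations and other edge cases more carefully.

```python
import re


def tokenize_words(text: str) -> list:
    """Split into word/number tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_sentences(text: str) -> list:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(tokenize_words("Selamat datang di Pycon ID 2020."))
print(tokenize_sentences("Selamat datang. Apa kabar?"))
```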

Stopwords Removal (NLTK)
Removes low-information (noise) words from the text. Examples of stopwords in Indonesian are ‘yang’, ‘dan’, ‘di’, ‘dari’, etc.
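Stopword removal is a filter over tokens. To keep this sketch runnable without downloads, it uses only the four stopwords named on the slide as a stand-in list; the function name is an assumption.

```python
# Tiny stand-in list from the slide; real lists contain hundreds of words.
STOPWORDS = {"yang", "dan", "di", "dari"}


def remove_stopwords(tokens: list, stopwords: set = STOPWORDS) -> list:
    """Keep only tokens that are not in the stopword set."""
    return [token for token in tokens if token.lower() not in stopwords]


print(remove_stopwords(["buku", "yang", "dibeli", "dari", "toko"]))
```

With NLTK installed and its stopwords corpus downloaded via `nltk.download("stopwords")`, the set could instead be built from `nltk.corpus.stopwords.words("indonesian")`.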

Stopwords Removal (Sastrawi)

Combine into Text Preprocessing Pipeline
https://github.com/ksnugroho/pycon-id-2020
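One way to combine the individual techniques is an ordered list of functions applied one after another. The sketch below is an illustration under that assumption, with hypothetical step names; it is not the code from the repository above.

```python
import re
import string


def lowercase(text: str) -> str:
    return text.lower()


def remove_urls(text: str) -> str:
    return re.sub(r"(?:https?://|www\.)\S+", " ", text)


def remove_numbers(text: str) -> str:
    return re.sub(r"\d+", " ", text)


def remove_punctuation(text: str) -> str:
    return text.translate(str.maketrans("", "", string.punctuation))


def normalize_whitespace(text: str) -> str:
    return " ".join(text.split())


# Order matters: URLs must go before punctuation, or 'https://' gets mangled.
PIPELINE = [lowercase, remove_urls, remove_numbers,
            remove_punctuation, normalize_whitespace]


def preprocess(text: str, steps=PIPELINE) -> str:
    """Run the raw text through every step in order and return clean text."""
    for step in steps:
        text = step(text)
    return text


print(preprocess("Selamat Datang di PyCon ID 2020! Info: https://example.com"))
```

Keeping the steps in a list makes it easy to add, drop, or reorder techniques per task.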

Rule of thumb: do you need all techniques?
There are no definite rules for the steps in text preprocessing. Not all tasks need the same level of text preprocessing. For some tasks, you can get away with the minimum effort.
Must Do:
▪ Noise removal
▪ Lowercasing (can be task dependent in some cases)
Should Do:
▪ Simple normalization
Task Dependent:
▪ Advanced normalization
▪ Stop-word removal
▪ Stemming
▪ Text enrichment / augmentation

Thank you! See you at PyCon ID 2021 :)
Would love to connect, feel free to reach out. Discussion, any questions?
https://www.linkedin.com/in/ksnugroho
https://github.com/ksnugroho
