Text Preprocessing Pipeline for Bahasa using Python: Concept, Steps, Tools, and Examples
Text is unstructured data, so it must be converted from human language into a machine-readable format before further processing.
Python Conference Indonesia 2020
14-15 November, 2020 [Online]
Data Center Universitas Widyagama, Malang - Indonesia ▪ Master Student Faculty of Computer Science Universitas Brawijaya, Malang - Indonesia Intelligent Systems Laboratory Affective Computing Research Interest Group (ACRIG)
data is unstructured. To achieve better insights or build better algorithms, it is necessary to ‘play’ around with the data to make it clean.
Text Preprocessing Pipeline for Bahasa using Python: Concept, Steps, Tools, and Examples Kuncahyo Setyo Nugroho | Present in PyCon ID 2020
suratwarga.malangkab.go.id
text preprocessing?
▪ Text preprocessing pipeline
2. Text Preprocessing
▪ Python library
▪ Text preprocessing techniques using Python
▪ Rule of thumb. Do you need all techniques?
Why this Topic?
topic. Sometimes, wrong techniques of text preprocessing.
▪ Why Bahasa (Indonesia)? Bahasa is one of the ten most widely spoken languages in the world. However, few studies have been published in international journals or proceedings.
What’s Text Preprocessing?
in a specific task. A task is a combination of approach and domain.
Task = approach + domain
To reduce the indexing (or data) file size of the text.
To improve efficiency and effectiveness.
Text Preprocessing Pipeline
as input and returning properly preprocessed ‘clean’ text as output. The text preprocessing pipeline may also vary, depending on the task.
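The idea of a pipeline that takes raw text as input and returns clean text as output can be sketched as a chain of small functions. This is a minimal illustration, not the speaker's implementation: the names `build_pipeline` and `minimal_pipeline` are hypothetical, and the two steps shown are placeholders for whatever steps a given task needs.

```python
from functools import reduce
from typing import Callable

# A preprocessing step is any function that maps text to text.
Step = Callable[[str], str]


def build_pipeline(*steps: Step) -> Step:
    """Return a function that applies each step to the text, in order."""
    def run(text: str) -> str:
        return reduce(lambda current, step: step(current), steps, text)
    return run


# Hypothetical minimal pipeline: lowercase, then trim surrounding whitespace.
minimal_pipeline = build_pipeline(str.lower, str.strip)
print(minimal_pipeline("  Selamat Datang di PyCon ID 2020  "))
# selamat datang di pycon id 2020
```

Because each step has the same signature, steps can be added, removed, or reordered per task without changing the rest of the code.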
Python Library for Text Preprocessing
Flair; 5. Pandas; 6. Scikit-learn; 7. etc.
Lower Case & Remove Whitespace
One of the simplest and most effective text preprocessing steps. To a machine, differently cased forms of a word are different strings:
Indonesia ≠ INDONESIA ≠ indonesia
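Lowercasing and whitespace removal need only built-in string methods. The sketch below is one possible way to do it; the function name `lowercase_and_strip` is an assumption made for illustration.

```python
def lowercase_and_strip(text: str) -> str:
    """Lowercase the text and collapse all whitespace runs to single spaces."""
    # str.split() with no arguments splits on any run of whitespace and
    # ignores leading and trailing whitespace.
    return " ".join(text.lower().split())


variants = ["Indonesia", "INDONESIA", "  indonesia  "]
print({lowercase_and_strip(v) for v in variants})
# {'indonesia'}
```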
Remove Numbers & Punctuation
Removes numbers and punctuation from the text, such as ‘123....?!’, as well as symbols such as ‘@#$’.
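A possible standard-library sketch of this step follows. The function name is hypothetical, and `string.punctuation` covers ASCII punctuation and symbols only, so other Unicode punctuation would need extra handling.

```python
import re
import string

# Translation table that deletes every ASCII punctuation/symbol character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_numbers_and_punctuation(text: str) -> str:
    """Delete digits and ASCII punctuation, then tidy leftover whitespace."""
    text = re.sub(r"\d+", "", text)       # drop runs of digits
    text = text.translate(_PUNCT_TABLE)   # drop characters such as . ? ! @ # $
    return " ".join(text.split())         # collapse the gaps left behind


print(remove_numbers_and_punctuation("Halo 123....?! @#$ dunia"))
# Halo dunia
```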
Emoji Conversion
https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
For example, 😀 → grinning_face
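The slide points to the emot mapping file. As a rough stand-in that needs no third-party package, the sketch below derives a label from each character's official Unicode name using the standard `unicodedata` module. This is a swapped-in technique, not the emot mapping, so labels may differ from emot's. The function name `convert_emoji` is hypothetical. Multi-codepoint emoji sequences are not handled.

```python
import unicodedata


def convert_emoji(text: str) -> str:
    """Replace symbol characters (Unicode category 'So') with their names."""
    pieces = []
    for ch in text:
        name = unicodedata.name(ch, "")
        # Category 'So' (Symbol, other) covers most emoji, but also
        # non-emoji symbols such as the copyright sign.
        if unicodedata.category(ch) == "So" and name:
            label = name.lower().replace(" ", "_").replace("-", "_")
            pieces.append(f" {label} ")
        else:
            pieces.append(ch)
    # Joining on split() also collapses any existing runs of whitespace.
    return " ".join("".join(pieces).split())


print(convert_emoji("Halo \U0001F600"))
# Halo grinning_face
```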
Emoticon Conversion
https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
For example, :-) → happy_face_smiley
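Emoticon conversion can be sketched as a dictionary lookup driven by a regular expression. The `EMOTICONS` table below is a tiny hand-written assumption for illustration. Only the `:-)` entry comes from the slide. A real mapping, such as the one in the emot file linked above, is far larger.

```python
import re

# Hypothetical mini-mapping; only ':-)' is taken from the slide.
EMOTICONS = {
    ":-)": "happy_face_smiley",
    ":)": "happy_face_smiley",
    ":-(": "sad_face",
    ":(": "sad_face",
}

# Longest emoticons first, so ':-)' is tried before any shorter overlap.
_EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)


def convert_emoticons(text: str) -> str:
    """Replace each known emoticon with its text label."""
    return _EMOTICON_RE.sub(lambda m: EMOTICONS[m.group(0)], text)


print(convert_emoticons("Terima kasih :-)"))
# Terima kasih happy_face_smiley
```

Since emoticons are made of punctuation, this step would have to run before punctuation removal in a pipeline.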
Slang Word Normalization
Replaces slang words with their standard forms. For example, gmn → bagaimana, jwb → jawab, gueh → saya
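A minimal sketch of slang normalization is a per-word dictionary lookup. The `SLANG_MAP` below holds only the three examples from the slide. A usable dictionary would need to be much larger and is assumed to come from elsewhere. The sketch also assumes the text is already lowercased and that words are separated by whitespace.

```python
# Only the three slide examples; a real slang dictionary is much larger.
SLANG_MAP = {"gmn": "bagaimana", "jwb": "jawab", "gueh": "saya"}


def normalize_slang(text: str, slang_map: dict = SLANG_MAP) -> str:
    """Replace each whitespace-separated word found in slang_map."""
    return " ".join(slang_map.get(word, word) for word in text.split())


print(normalize_slang("gmn jwb nya"))
# bagaimana jawab nya
```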
Stemming
The process of reducing inflected words to their root form. For example, the words ‘mendengarkan’, ‘dengarkan’, and ‘didengarkan’ are all reduced to ‘dengar’.
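To make the idea concrete, here is a deliberately naive toy stemmer. It strips a few common Indonesian affixes and accepts a candidate only if it appears in a root-word list. The affix lists, the one-word `ROOT_WORDS` set, and the name `toy_stem` are all assumptions for illustration. It ignores the morphological rules a real Indonesian stemmer handles, so a dedicated library such as Sastrawi would likely be used in practice.

```python
# Toy data: a real stemmer uses a full root-word dictionary and many rules.
PREFIXES = ("meng", "meny", "men", "mem", "me", "di", "ke", "se", "ber", "ter", "pe")
SUFFIXES = ("kan", "an", "i")
ROOT_WORDS = {"dengar"}


def toy_stem(word: str, roots: set = ROOT_WORDS) -> str:
    """Return a known root reachable by stripping one suffix and/or one prefix."""
    word = word.lower()
    candidates = [word]
    # Try removing one suffix, keeping at least three letters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            candidates.append(word[: -len(suffix)])
    # Try removing one prefix from the word and from each suffix-stripped form.
    for candidate in list(candidates):
        for prefix in PREFIXES:
            if candidate.startswith(prefix) and len(candidate) - len(prefix) >= 3:
                candidates.append(candidate[len(prefix):])
    for candidate in candidates:
        if candidate in roots:
            return candidate
    return word  # no known root found: leave the word unchanged


print([toy_stem(w) for w in ["mendengarkan", "dengarkan", "didengarkan"]])
# ['dengar', 'dengar', 'dengar']
```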
Word & Sentence Tokenization
The process of splitting text into pieces called tokens, either words or sentences. Words, numbers, symbols, punctuation marks, and other important entities can all be considered tokens.
‘Selamat datang di Pycon ID 2020.’ → ‘Selamat’ ‘datang’ ‘di’ ‘Pycon’ ‘ID’ ‘2020’ ‘.’
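The tokenization above can be approximated with regular expressions. This is a naive sketch with hypothetical function names, not the tokenizer of any particular library. The sentence splitter, for instance, would wrongly break after abbreviations that end in a period.

```python
import re


def word_tokenize(text: str) -> list:
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def sentence_tokenize(text: str) -> list:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


print(word_tokenize("Selamat datang di Pycon ID 2020."))
# ['Selamat', 'datang', 'di', 'Pycon', 'ID', '2020', '.']
print(sentence_tokenize("Selamat datang di Pycon ID 2020. Semoga bermanfaat!"))
# ['Selamat datang di Pycon ID 2020.', 'Semoga bermanfaat!']
```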
Stopwords Removal (NLTK)
Removes low-information (noise) words from the text. Examples of stopwords in Indonesian are ‘yang’, ‘dan’, ‘di’, ‘dari’, etc.
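Stopword removal is a filter over tokens. To stay self-contained, the sketch below uses only the four stopwords named on the slide. With NLTK, the set would presumably be loaded from its stopwords corpus instead (for example `stopwords.words('indonesian')`, after downloading the corpus). The function name `remove_stopwords` is hypothetical.

```python
# Only the four slide examples; a full list contains many more words.
STOPWORDS = {"yang", "dan", "di", "dari"}


def remove_stopwords(tokens: list, stopwords: set = STOPWORDS) -> list:
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]


print(remove_stopwords(["buku", "yang", "dibeli", "dari", "toko"]))
# ['buku', 'dibeli', 'toko']
```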
There are no definite rules for the steps in text preprocessing. Not all tasks need the same level of text preprocessing. For some tasks, you can get away with minimal effort.
Must Do:
▪ Noise removal
▪ Lowercasing (can be task dependent in some cases)
Should Do:
▪ Simple normalization
Task Dependent:
▪ Advanced normalization
▪ Stop-word removal
▪ Stemming
▪ Text enrichment / augmentation