
Data Intelligence
June 28, 2017

Building a Gigaword Corpus: Data Ingestion, Management, and Processing for NLP

Rebecca Bilbro (Bytecubed & District Data Labs)
Audience level: Intermediate
Topic area: Modeling

Description

As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. In this talk we walk through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.


Transcript

  1. • Me and my motivation
     • Why make a custom corpus?
     • Things likely to go wrong
       ◦ Ingestion
       ◦ Management
       ◦ Loading
       ◦ Preprocessing
       ◦ Analysis
     • Lessons we learned
     • Open source tools we made
  2. The Natural Language Toolkit

     import nltk

     # Explore word usage in Moby Dick with NLTK's Text wrapper
     moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
     print(moby.similar("ahab"))
     print(moby.common_contexts(["ahab", "starbuck"]))
     print(moby.concordance("monstrous", 55, lines=10))
  3. Gensim + Wikipedia

     import bz2
     import gensim

     # Load id-to-word dictionary
     id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt')

     # Instantiate iterator for corpus (which is ~24.14 GB on disk after compression!)
     mm = gensim.corpora.MmCorpus(bz2.BZ2File('wikipedia_tfidf.mm.bz2'))

     # Do Latent Semantic Analysis and find 10 prominent topics
     lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
     lsa.print_topics(10)
  4. Partisan Discourse: Architecture
     Components: Initial Model, Debate Transcripts, Submit URL, Preprocessing,
     Feature Extraction, Fit Model, Evaluate Model, Model Selection, Model Storage,
     Model Monitoring, Corpus Storage, Corpus Monitoring, Classification, Feedback
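     A minimal sketch of the fit/classify/feedback loop that architecture implies,
     using scikit-learn. The toy data, the model choice, and the way feedback is
     folded back into the corpus are illustrative assumptions, not the Partisan
     Discourse implementation.

     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.linear_model import LogisticRegression
     from sklearn.pipeline import Pipeline

     # Fit an initial model on labeled debate transcripts (toy examples here).
     transcripts = ["we must cut taxes now", "we must expand public healthcare"]
     labels = ["red", "blue"]

     model = Pipeline([
         ("tfidf", TfidfVectorizer()),
         ("clf", LogisticRegression()),
     ])
     model.fit(transcripts, labels)

     # Classify a newly submitted document, then fold user feedback back into
     # the stored corpus so the model can be refit and re-evaluated later.
     submitted = "text extracted from a submitted URL"
     print(model.predict([submitted]))

     corpus, corpus_labels = list(transcripts), list(labels)
     corpus.append(submitted)
     corpus_labels.append("blue")      # user feedback on the prediction
     model.fit(corpus, corpus_labels)  # refit with the expanded corpus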
  5. RSS

     import os
     import requests
     import feedparser

     feed = "http://feeds.washingtonpost.com/rss/national"

     for entry in feedparser.parse(feed)['entries']:
         r = requests.get(entry['link'])
         path = entry['title'].lower().replace(" ", "-") + ".html"
         with open(path, 'wb') as f:
             f.write(r.content)
  6. Ingestion
     • Scheduling
     • Adding new feeds
     • Synchronizing feeds, finding duplicates (sketched below)
     • Parsing different feeds/entries into a standard form (sketched below)
     • Monitoring

     Storage
     • Database choice
     • Data representation, indexing, fetching
     • Connection and configuration
     • Error tracking and handling
     • Exporting
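     A minimal sketch of two of those ingestion concerns, deduplication and
     parsing entries into a standard form, assuming feedparser entries. The
     field names and the content-hash scheme are illustrative assumptions.

     import hashlib
     import feedparser

     def wrangle(entry):
         """Parse a feedparser entry into a standard post dict."""
         return {
             "title": entry.get("title", ""),
             "url": entry.get("link", ""),
             "content": entry.get("summary", ""),
         }

     def signature(post):
         """Hash the content so duplicate posts across feeds can be skipped."""
         return hashlib.sha256(post["content"].encode("utf-8")).hexdigest()

     seen = set()
     for entry in feedparser.parse("http://feeds.washingtonpost.com/rss/national")["entries"]:
         post = wrangle(entry)
         sig = signature(post)
         if sig in seen:
             continue      # duplicate, already ingested
         seen.add(sig)
         # ... store the post and its signature in the database here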
  7. And as the corpus began to grow … new questions arose about costs
     (storage, time) and surprising results (videos?).
  8. Production-grade ingestion: Baleen (class diagram, sketched below)
     • Post: title, url, content, hash(), htmlize()
     • Feed: title, link, active
     • OPML Reader: categories(), counts(), __iter__(), __len__(), ingest()
     • Configuration: logging, database, flags
     • Exporter: export(), readme()
     • Admin: ingest_feeds(), ingest_opml(), summary(), run(), export()
     • Utilities: timez
     • Logging: logger, mongolog
     • Ingest: feeds(), started(), finished(), process(), ingest()
     • Feed Sync: parse(), sync(), entries()
     • Post Wrangler: wrangle(), fetch(), connect()
     • <OPML />
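     A minimal sketch of the OPML Reader piece of that diagram: iterating the
     feed URLs in an OPML file, grouped by category outline. This illustrates
     the idea rather than Baleen's actual implementation, and it assumes a
     standard OPML layout with one level of category outlines.

     from xml.etree import ElementTree

     class OPMLReader(object):
         """Iterate over the feed URLs in an OPML file, grouped by category."""

         def __init__(self, path):
             self.path = path

         def categories(self):
             tree = ElementTree.parse(self.path)
             for outline in tree.find("body"):
                 yield outline.get("title") or outline.get("text")

         def __iter__(self):
             tree = ElementTree.parse(self.path)
             for category in tree.find("body"):
                 for feed in category.findall("outline"):
                     yield category.get("title"), feed.get("xmlUrl")

         def __len__(self):
             return sum(1 for _ in self)

     # Usage: for category, url in OPMLReader("feeds.opml"): ...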
  9. From each doc, extract the HTML, identify paras/sents/words, and tag with
     part-of-speech: Raw Corpus → HTML → Paras → Sents → Tokens → Tags

     corpus = [('How', 'WRB'), ('long', 'RB'), ('will', 'MD'), ('this', 'DT'),
               ('go', 'VB'), ('on', 'IN'), ('?', '.'), ... ]
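     A minimal sketch of that segment/tokenize/tag step with NLTK, assuming the
     text has already been extracted from the HTML; splitting paragraphs on
     blank lines is a simplifying assumption.

     import nltk

     def tokenize(text):
         """Segment text into paragraphs of sentences of (token, tag) pairs."""
         for para in text.split("\n\n"):
             yield [
                 nltk.pos_tag(nltk.wordpunct_tokenize(sent))
                 for sent in nltk.sent_tokenize(para)
             ]

     for para in tokenize("How long will this go on?"):
         print(para)  # [[('How', 'WRB'), ('long', 'RB'), ('will', 'MD'), ...]]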
  10. Streaming Corpus Preprocessing
     Raw Corpus (HTML) → Paras → Sents → Tokens → Tags → Tokenized Corpus
     A CorpusReader for streaming access, preprocessing, and saving the
     tokenized version.
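     A minimal sketch of a preprocessor that streams documents from a corpus
     reader, tokenizes and tags them, and pickles each result to an
     intermediate store. The reader interface (fileids(), raw()) and the
     pickle-per-document layout are assumptions for illustration.

     import os
     import pickle
     import nltk

     class Preprocessor(object):
         """Tokenize and tag each raw document, saving one pickle per doc."""

         def __init__(self, corpus, target):
             self.corpus = corpus   # exposes fileids() and raw(fileid)
             self.target = target   # directory for the tokenized corpus

         def process(self, fileid):
             document = [
                 [
                     nltk.pos_tag(nltk.wordpunct_tokenize(sent))
                     for sent in nltk.sent_tokenize(para)
                 ]
                 for para in self.corpus.raw(fileid).split("\n\n")
             ]
             outpath = os.path.join(self.target, fileid + ".pickle")
             os.makedirs(os.path.dirname(outpath), exist_ok=True)
             with open(outpath, "wb") as f:
                 pickle.dump(document, f)
             return outpath

         def transform(self):
             # Stream: only one document is held in memory at a time.
             for fileid in self.corpus.fileids():
                 yield self.process(fileid)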
  11. Data Loader → Text Normalization → Text Vectorization →
     Feature Transformation → Estimator
     Data Loader → Feature Union Pipeline → Estimator, where the feature union
     combines: Text Normalization, Document Features (Text Extraction, Summary
     Vectorization, Article Vectorization), Concept Features, and Metadata
     Features (Dict Vectorizer). (Minke)
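     A minimal sketch of a scikit-learn Pipeline with a FeatureUnion along those
     lines. The particular transformers (TfidfVectorizer for text, a
     DictVectorizer over simple metadata) are illustrative assumptions, not
     Minke's implementation.

     from sklearn.base import BaseEstimator, TransformerMixin
     from sklearn.feature_extraction import DictVectorizer
     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.linear_model import SGDClassifier
     from sklearn.pipeline import FeatureUnion, Pipeline

     class MetadataFeatures(BaseEstimator, TransformerMixin):
         """Extract simple per-document metadata for a DictVectorizer."""

         def fit(self, documents, y=None):
             return self

         def transform(self, documents):
             return [
                 {"n_chars": len(doc), "n_words": len(doc.split())}
                 for doc in documents
             ]

     model = Pipeline([
         ("features", FeatureUnion([
             ("text", TfidfVectorizer()),
             ("meta", Pipeline([
                 ("extract", MetadataFeatures()),
                 ("vect", DictVectorizer()),
             ])),
         ])),
         ("clf", SGDClassifier()),
     ])

     model.fit(["How long will this go on?", "We built a gigaword corpus."], [0, 1])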
  12. 1.5 M documents, 7,500 jobs, 524 GB (uncompressed)
     Keyphrase Graph:
     - 2.7 M nodes
     - 47 M edges
     - Average degree of 35
  13. Meaningfully literate data products rely on… a data management layer for
     flexibility and iteration during modeling: Feature Analysis, Algorithm
     Selection, Hyperparameter Tuning.
  14. Meaningfully literate data products rely on… a custom CorpusReader for
     streaming, and also intermediate storage (sketched below):
     Raw CorpusReader → Preprocessing Transformer → Tokenized Corpus →
     Post-processed CorpusReader

     corpus
     ├── citation.bib
     ├── feeds.json
     ├── LICENSE.md
     ├── manifest.json
     ├── README.md
     ├── books
     │   ├── 56d629e7c1808113ffb87eaf.html
     │   ├── 56d629e7c1808113ffb87eb3.html
     │   └── 56d629ebc1808113ffb87ed0.html
     ├── business
     │   ├── 56d625d5c1808113ffb87730.html
     │   ├── 56d625d6c1808113ffb87736.html
     │   └── 56d625ddc1808113ffb87752.html
     ├── cinema
     │   ├── 56d629b5c1808113ffb87d8f.html
     │   ├── 56d629b5c1808113ffb87d93.html
     │   └── 56d629b6c1808113ffb87d9a.html
     └── cooking
         ├── 56d62af2c1808113ffb880ec.html
         ├── 56d62af2c1808113ffb880ee.html
         └── 56d62af2c1808113ffb880fa.html
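     A minimal sketch of a streaming reader over a corpus laid out like that,
     built on NLTK's CorpusReader; the fileid pattern and the docs() generator
     are assumptions for illustration, not the exact reader from the talk.

     import codecs
     from nltk.corpus.reader.api import CorpusReader

     DOC_PATTERN = r'[a-z_\s]+/[a-f0-9]+\.html'

     class HTMLCorpusReader(CorpusReader):
         """Stream raw HTML documents from a category/fileid.html corpus on disk."""

         def __init__(self, root, fileids=DOC_PATTERN, encoding='utf8'):
             CorpusReader.__init__(self, root, fileids, encoding)

         def docs(self, fileids=None):
             # Read one document into memory at a time, never the whole corpus.
             for path, encoding in self.abspaths(fileids, include_encoding=True):
                 with codecs.open(path, 'r', encoding=encoding) as f:
                     yield f.read()

     # Usage: iterate lazily over every HTML document under the corpus root.
     # reader = HTMLCorpusReader('corpus/')
     # for doc in reader.docs():
     #     ...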