Practical transfer learning for NLP with spaCy and Prodigy

Transfer learning has been called "NLP's ImageNet moment". Recent work has shown that models can be initialized with detailed, contextualised linguistic knowledge, drawn from huge samples of data. In this talk, I'll explain spaCy's new support for efficient and easy transfer learning, and show you how it can kickstart new NLP projects with our annotation tool, Prodigy.

Ines Montani

January 28, 2019

Transcript

  1. Language is more than just words
     NLP has always struggled to get beyond a “bag of words”. Word2Vec (and GloVe, FastText etc.) let us pretrain word meanings. How do we learn the meanings of words in context? Or whole sentences?

  2. Language model pretraining
     ULMFiT, ELMo: predict the next word based on the previous words. BERT: predict a word given the surrounding context.

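     As a quick illustration of the two objectives (made-up tokens, not any particular library’s API):

     # Causal language modelling (ULMFiT, ELMo): predict the next word
     # from the words that came before it.
     context = ["The", "cat", "sat", "on", "the"]
     target = "mat"        # predicted from the left context only

     # Masked language modelling (BERT): predict a word from the words
     # on both sides of it.
     sentence = ["The", "cat", "[MASK]", "on", "the", "mat"]
     target = "sat"        # predicted from the surrounding context
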
  3. Bringing language modelling into production
     Take what’s proven to work in research, provide fast, production-ready implementations. Performance target: 10,000 words per second. Production models need to be cheap to run (and not require powerful GPUs).

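     To make that throughput target concrete, here is a rough sketch of how you could time a pipeline with nlp.pipe; the model name and texts are placeholders, and real numbers depend heavily on hardware and document length:

     import time
     import spacy

     nlp = spacy.load("en_core_web_sm")  # placeholder model
     texts = ["This is a sentence about natural language processing."] * 1000

     start = time.time()
     n_words = 0
     for doc in nlp.pipe(texts, batch_size=64):
         n_words += len(doc)
     elapsed = time.time() - start
     print(f"{n_words / elapsed:.0f} words per second")
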
  4. Language Modelling with Approximate Outputs
     We train the CNN to predict the vector of each word based on its context. Instead of predicting the exact word, we predict the rough meaning – much easier! Meaning representations learned with Word2Vec, GloVe or FastText.
     Kumar, Sachin, and Yulia Tsvetkov. “Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs.” arXiv preprint arXiv:1812.04616 (2019).

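     A minimal sketch of the idea behind this objective (not spaCy’s actual implementation): the encoder’s output is scored against the word’s pretrained vector, here with a cosine-distance loss and random stand-in vectors:

     import numpy as np

     def cosine_loss(predicted, target_vector):
         # 1 - cosine similarity: small when the prediction points in the
         # same direction as the pretrained vector, larger as they diverge.
         cos = predicted @ target_vector / (
             np.linalg.norm(predicted) * np.linalg.norm(target_vector) + 1e-8
         )
         return 1.0 - float(cos)

     pretrained = {"mat": np.random.randn(300)}  # stand-in for GloVe/FastText vectors
     predicted = np.random.randn(300)            # stand-in for the CNN's output
     print(cosine_loss(predicted, pretrained["mat"]))
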
  5. Pretraining with spaCy
     $ pip install spacy-nightly
     $ spacy download en_vectors_web_lg
     $ spacy pretrain ./reddit-100k.jsonl en_vectors_web_lg ./output_dir

  6. Pretraining with spaCy
     $ pip install spacy-nightly
     $ spacy download en_vectors_web_lg
     $ spacy pretrain ./reddit-100k.jsonl en_vectors_web_lg ./output_dir
     reddit-100k.jsonl: the raw text used for pretraining

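     The pretraining input is just raw text in newline-delimited JSON, one object per line. An invented sample of what reddit-100k.jsonl could look like (the "text" key is what spacy pretrain reads for raw text, but treat the exact schema as an assumption and check the docs for your version):

     {"text": "I went to the movies last night and the popcorn was stale."}
     {"text": "Has anyone tried the new ramen place downtown?"}
     {"text": "My cat keeps knocking things off the table."}
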
  7. Pretraining with spaCy
     $ pip install spacy-nightly
     $ spacy download en_vectors_web_lg
     $ spacy pretrain ./reddit-100k.jsonl en_vectors_web_lg ./output_dir
     $ spacy train en ./model_out ./data/train ./data/dev --pipeline tagger,parser --init-tok2vec ./output_dir/model-best.t2v
     ✓ Saved best model to ./model_out/model-best

     application.py:
     import spacy

     nlp = spacy.load("./model_out/model-best")
     doc = nlp("This is a sentence.")
     for token in doc:
         print(token.text, token.pos_, token.dep_)

  8. Pretraining with spaCy
     GloVe   LMAO    LAS
     ❌       ❌       79.1
     ✅       ❌       81.0
     ❌       ✅       81.0
     ✅       ✅       82.4
     Labelled attachment score (dependency parsing) on Universal Dependencies data (English-EWT)
     $ pip install spacy-nightly
     $ spacy download en_vectors_web_lg
     $ spacy pretrain ./reddit-100k.jsonl en_vectors_web_lg ./output_dir
     $ spacy train en ./model_out ./data/train ./data/dev --pipeline tagger,parser --init-tok2vec ./output_dir/model-best.t2v
     ✓ Saved best model to ./model_out/model-best

  9. Pretraining with spaCy
     GloVe   LMAO    LAS
     ❌       ❌       79.1
     ✅       ❌       81.0
     ❌       ✅       81.0
     ✅       ✅       82.4
     Labelled attachment score (dependency parsing) on Universal Dependencies data (English-EWT)
     Stanford ’17: 82.3
     Stanford ’18: 83.9
     3MB
     $ pip install spacy-nightly
     $ spacy download en_vectors_web_lg
     $ spacy pretrain ./reddit-100k.jsonl en_vectors_web_lg ./output_dir
     $ spacy train en ./model_out ./data/train ./data/dev --pipeline tagger,parser --init-tok2vec ./output_dir/model-best.t2v
     ✓ Saved best model to ./model_out/model-best

  10. Move fast and train things
      1. Pre-train models with general knowledge about the language using raw text.
      2. Annotate a small amount of data specific to your application.
      3. Train a model and try it in your application.
      4. Iterate on your code and data.

  11. Move fast and train things
      1. Pre-train models with general knowledge about the language using raw text.
      2. Annotate a small amount of data specific to your application.
      3. Train a model and try it in your application.
      4. Iterate on your code and data.

  12. Prodigy
      https://prodi.gy
      - scriptable annotation tool
      - full data privacy: runs on your own hardware
      - active learning for better example selection
      - optimized for efficiency and fast iteration
      $ prodigy ner.teach product_ner en_core_web_sm /data.jsonl --label PRODUCT
      $ prodigy db-out product_ner > annotations.jsonl

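      The db-out export is also newline-delimited JSON. A sketch of what one accepted annotation might look like (the field names are typical of Prodigy’s NER recipes, but treat the exact schema as an assumption):

      {"text": "I just bought the new AeroPress and I love it.", "spans": [{"start": 22, "end": 31, "label": "PRODUCT"}], "answer": "accept"}
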
  13. Iterate on your code and your data
      Try out more ideas quickly. Most ideas don’t work – but some succeed wildly. Figure out what works before trying to scale it up. Build entirely custom solutions so nobody can lock you in.