Practical Tips for Bootstrapping Information Extraction Pipelines

In this presentation, I will build on Ines Montani's keynote, "Applied NLP in the Age of Generative AI", by demonstrating how to create an information extraction pipeline. The talk will focus on using the spaCy NLP library and the Prodigy annotation tool, although the principles discussed will also apply to other frameworks.

Matthew Honnibal

August 09, 2024

Resources

spaCy: Industrial-Strength NLP

https://spacy.io

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text.

Prodigy: Radically efficient machine teaching

https://prodi.gy

Prodigy is a modern annotation tool for creating training data for machine learning models. It’s so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration.

spacy-llm: Integrating LLMs into structured NLP pipelines

https://github.com/explosion/spacy-llm

spacy-llm features a modular system for fast prototyping and prompting, and turning unstructured responses into robust outputs for various NLP tasks, no training data required.

A practical guide to human-in-the-loop distillation

https://explosion.ai/blog/human-in-the-loop-distillation

This blog post presents practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.

How S&P Global is making markets more transparent with NLP, spaCy and Prodigy

https://explosion.ai/blog/sp-global-commodities

A case study on S&P Global’s efficient information extraction pipelines for real-time commodities trading insights in a high-security environment using human-in-the-loop distillation.

Transcript

  1. PRODIGY (prodigy.ai)

    Modern scriptable annotation tool for machine learning developers. 900+ companies, 10k+ users.
    Illustration labels: Alex Smith, Developer; Kim Miller, Analyst; GPT-4 API.
  3. BACK TO OUR ROOTS

    We’re back to running Explosion as a smaller, independent-minded and self-sufficient company.
    Consulting, open source, developer tools.
    explosion.ai/blog/back-to-our-roots
  6. WHAT I MEAN BY INFORMATION EXTRACTION

    📝 Turn text into data. Make a database from earnings reports, or skills in job postings, or product feedback in social media, and many more.
    🗂 Lots of subtasks. Text classification, named entity recognition, entity linking and relation extraction can all be part of an information extraction pipeline.
    🎯 Mostly static schema. Most people are solving one problem at a time, so that’s what I’ll focus on.
  10. “Hooli raises $5m to revolutionize search, led by ACME Ventures”

    Named entity recognition: COMPANY, COMPANY
    Currency normalization: MONEY
    Custom database lookup (entity disambiguation): 5923214, 1681056
    Entity relation extraction: INVESTOR
    Output: Database
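
The steps on this slide can be sketched in a few lines of plain Python. This is only a toy illustration, not the spaCy pipeline from the talk: the lookup table, the regular expression and the "led by" rule are stand-ins for trained components, and the assignment of each ID to a company is an assumption made for the example.

```python
import re

# Hypothetical lookup table standing in for the custom database.
# Which ID belongs to which company is an assumption made for illustration.
COMPANY_IDS = {"Hooli": 5923214, "ACME Ventures": 1681056}
MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def find_companies(text):
    """Stand-in for named entity recognition: match known company names."""
    return [name for name in COMPANY_IDS if name in text]


def normalize_money(text):
    """Currency normalization: turn a string like "$5m" into an amount in USD."""
    match = re.search(r"\$(\d+(?:\.\d+)?)([kmb])\b", text)
    if match is None:
        return None
    number, suffix = match.groups()
    return {"amount": round(float(number) * MULTIPLIERS[suffix]), "currency": "USD"}


def extract_funding_event(text):
    """Combine the steps into one database-ready record."""
    companies = find_companies(text)
    # Stand-in for relation extraction: the company after "led by" is the investor.
    investors = [name for name in companies if f"led by {name}" in text]
    recipients = [name for name in companies if name not in investors]
    return {
        "company_ids": [COMPANY_IDS[name] for name in recipients],
        "investor_ids": [COMPANY_IDS[name] for name in investors],
        "money": normalize_money(text),
    }


headline = "Hooli raises $5m to revolutionize search, led by ACME Ventures"
record = extract_funding_event(headline)
print(record)
```

Each function corresponds to one subtask, so any of them could be swapped for a rule, a trained component or an LLM call without touching the others.
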
  12. RIE: RETRIEVAL VIA INFORMATION EXTRACTION

    Diagram labels: 📖 texts, 📦 NLP pipeline, data, 💬 question, ⚙ text-to-SQL, query.

    RAG: RETRIEVAL-AUGMENTED GENERATION

    Diagram labels: 📖 snippets, ⚙ vectorizer, 📚 vector DB, 💬 question, ⚙ vectorizer, query, answers.
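
A minimal sketch of the RIE side, assuming the NLP pipeline has already written its output to a relational table. The table name, column names and question are hypothetical, and the SQL is written by hand here, where the slide has a text-to-SQL step producing it from the question.

```python
import sqlite3

# In-memory database holding one record extracted by the NLP pipeline.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE funding_rounds (company TEXT, investor TEXT, amount_usd INTEGER)"
)
conn.execute(
    "INSERT INTO funding_rounds VALUES (?, ?, ?)",
    ("Hooli", "ACME Ventures", 5_000_000),
)

question = "How much funding has ACME Ventures led?"
# Hand-written query; a text-to-SQL component would generate this from the question.
sql = "SELECT SUM(amount_usd) FROM funding_rounds WHERE investor = ?"
(total,) = conn.execute(sql, ("ACME Ventures",)).fetchone()
print(question, "->", total)
```

Because the answer comes from structured data, it can be aggregated, filtered and checked like any other database query.
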
  14. SUPERVISED LEARNING IS STILL VERY STRONG

    Example data is super powerful. Example data can do things that instructions can’t.
    In-context learning can’t use examples scalably.
  18. KNOW YOUR ENEMIES: What makes supervised learning hard?

    Chicken-and-egg problem:
    product vision 👁
    accuracy estimate 📈
    training & evaluation 🔮
    labelled data 📚
    annotation scheme 🏷
  21. RESULTS ARE HARD TO INTERPRET

    😬 Model doesn’t train at all. Is the data messed up somehow?
    🤨 Model learns barely better than chance. Could be data, hyper-parameters, modelling…
    🥹 Results are decent! But can it be better? How do I know if I’m missing out?
    🤔 Results are too good to be true. Probably messed up the data…
  26. HYPOTHESIS: This is the bit that’s broken.

    QUESTION: If this bit is broken, what should I expect to see?
    TEST: Is that what actually happens?

    “I can’t connect to this site.”
    SOLUTION MINDSET: “Maybe it’ll work if I reconnect to the wi-fi or if I restart my router.”
    SCIENTIFIC MINDSET: “If the problem is between me and the site, other sites won’t load either. If the problem is between me and the router, I won’t be able to ping it.”
  30. EXAMPLES OF DEBUGGING TRAINING

    📉 What happens if I train on a tiny amount of data? Does the model converge?
    🔀 What happens if I randomize the training labels? Does the model still learn?
    🪄 Are my model weights changing at all during training?
    🧮 What’s the mean and variance of my gradients?
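
Three of these checks can be sketched on a toy model. The following is a plain-Python logistic regression, not spaCy's training loop; the four data points, the learning rate and the epoch count are made up for illustration. It trains on a tiny, clearly separable dataset, then checks whether the model fits it, whether the weights moved, and what the gradients looked like.

```python
import math
from statistics import mean, pvariance


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, labels, epochs=50, learning_rate=0.5):
    """Full-batch gradient descent for a two-feature logistic regression."""
    weights, bias = [0.0, 0.0], 0.0
    gradient_stats = []
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for (x1, x2), y in zip(examples, labels):
            error = sigmoid(weights[0] * x1 + weights[1] * x2 + bias) - y
            grad_w[0] += error * x1 / len(examples)
            grad_w[1] += error * x2 / len(examples)
            grad_b += error / len(examples)
        # Record mean and variance of this step's gradient entries.
        gradients = grad_w + [grad_b]
        gradient_stats.append((mean(gradients), pvariance(gradients)))
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias, gradient_stats


def accuracy(weights, bias, examples, labels):
    correct = 0
    for (x1, x2), y in zip(examples, labels):
        predicted = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
        correct += predicted == y
    return correct / len(examples)


# Tiny dataset: if the model cannot fit this, something is broken.
tiny_examples = [(-2.0, -1.0), (-1.0, -2.0), (1.0, 2.0), (2.0, 1.0)]
tiny_labels = [0, 0, 1, 1]

weights, bias, gradient_stats = train(tiny_examples, tiny_labels)
train_accuracy = accuracy(weights, bias, tiny_examples, tiny_labels)
weights_changed = weights != [0.0, 0.0]
print(train_accuracy, weights_changed, gradient_stats[0], gradient_stats[-1])
```

The label-randomization check follows the same pattern: shuffle `tiny_labels` before training and compare accuracy on held-out data. A model that still scores well there is probably seeing the answers through a leak.
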
  33. 📈 Better needs to look better. You need it to not be like this:

    📦 Larger models are often less practical.
    🤏 You need it to work with small samples.
    🌪 Large models are less stable with small batch sizes.
  39. PROTOTYPE (github.com/explosion/spacy-llm)

    💬 prompt + 📖 text → 🔮 GPT-4 API → task-specific output
    Prompt model & transform output to structured data.

    PRODUCTION

    📖 text → 📦 📦 📦 distilled task-specific components → task-specific output
    modular, small & fast, data-private
  42. config.cfg (spacy.io/usage/large-language-models)

    component ⚙
    ⏺ task definition and labels: Named Entity Recognition, Text Classification, Relation Extraction, …
    ⏺ label definitions to use in prompt
    ⏺ model and provider
    example from case study: explosion.ai/blog/sp-global-commodities
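
The slide's config itself is not in the transcript, so the following is only a sketch of what such a file might contain. It assumes the `spacy.NER.v3` task from spacy-llm, reuses the COMPANY and MONEY labels from the earlier example, and makes up the label definitions. The model entry is the one shown later in the talk. The standard-library parser is used here just to show the three parts the slide points at; in real use spaCy loads the file itself.

```python
import configparser
import json

# Hypothetical config.cfg contents; label definitions are invented for illustration.
CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v3"
labels = ["COMPANY", "MONEY"]

[components.llm.task.label_definitions]
COMPANY = "A named business, such as a startup or an investment firm."
MONEY = "A monetary amount, including its currency."

[components.llm.model]
@llm_models = "spacy.GPT-4.v2"
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)

# Values in spaCy configs are JSON-formatted, so json.loads decodes them.
task = json.loads(parser["components.llm.task"]["@llm_tasks"])
labels = json.loads(parser["components.llm.task"]["labels"])
model = json.loads(parser["components.llm.model"]["@llm_models"])
print(task, labels, model)
```

Keeping the task, the label definitions and the model in separate blocks means the model and provider can be changed without rewriting the task definition.
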
  47. TRAINING: How much data do you need? (Prodigy)

    =============== Train curve diagnostic ===============
    Training 4 times with 25%, 50%, 75%, 100% of the data

    %      Score     ner
    ----   ------    ------
      0%   0.00      0.00
     25%   0.31 ▲    0.31 ▲
     50%   0.44 ▲    0.44 ▲
     75%   0.43 ▼    0.43 ▼
    100%   0.56 ▲    0.56 ▲

    [Plot: Accuracy (0 to 100) against % of examples (25 to 150), with a projection beyond 100%.]

    EVALUATION

    ⚠ You need enough data to avoid reporting meaningless precision.
    📊 Ten samples per significant figure is a good rule of thumb.
    1,000 samples is pretty good – enough for 94% vs. 95%.
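
One way to get a feel for these numbers is the binomial standard error of an accuracy estimate. This is a standard approximation that assumes independent evaluation examples; it is not a calculation taken from the talk.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Standard error of an accuracy measured on n independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


for n_samples in (10, 100, 1000):
    error = accuracy_standard_error(0.95, n_samples)
    print(f"{n_samples:>5} samples: 95% accuracy +/- {100 * error:.1f} points")
```

Under this approximation the uncertainty at 95% accuracy is roughly 7 points with 10 samples, 2 points with 100 and 0.7 points with 1,000, which is about where a one-point difference starts to become visible.
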
  50. KEEP TASKS SMALL: Humans have a cache, too!

    GOOD ✅
        for i in range(rows):
            access_data(array[i])

    BAD ❌
        for j in range(columns):
            access_data(array[:, j])

    DO THIS ✅
        for annotation_type in annotation_types:
            for example in examples:
                annotate(example, annotation_type)

    NOT THIS ❌
        for example in examples:
            for annotation_type in annotation_types:
                annotate(example, annotation_type)
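
The difference between the two loop orders can be made concrete by counting how often the annotator has to switch to a different kind of decision. The example names, annotation types and the `count_task_switches` helper are hypothetical, chosen only to illustrate the point.

```python
def count_task_switches(schedule):
    """Count how often consecutive items ask for a different annotation type."""
    switches = 0
    for (_, previous_type), (_, current_type) in zip(schedule, schedule[1:]):
        if previous_type != current_type:
            switches += 1
    return switches


examples = ["doc1", "doc2", "doc3"]
annotation_types = ["ner", "textcat"]

# DO THIS: one annotation type at a time, across all examples.
one_type_at_a_time = [(ex, t) for t in annotation_types for ex in examples]
# NOT THIS: every annotation type for each example in turn.
all_types_per_example = [(ex, t) for ex in examples for t in annotation_types]

print(count_task_switches(one_type_at_a_time))     # 1
print(count_task_switches(all_types_per_example))  # 5
```

With one type at a time, the number of switches depends only on the number of annotation types. With every type per example, it grows with the number of examples.
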
  53. USE MODEL ASSISTANCE

    🔮 Suggest annotations however you can. Rule-based, initial trained model, an LLM – or a combination of all.
    🔥 Suggestions improve efficiency. Common cases are common, so getting them preset speeds up annotation a lot.
    📈 Suggestions improve accuracy. You need the common cases to be annotated consistently. Humans suck at this.
  57. prodigy.ai/docs/large-language-models

    $ prodigy ner.llm.correct todo_eval ./config.cfg ./examples.jsonl

    ner.llm.correct: recipe function with workflow
    todo_eval: dataset to save annotations to
    ./config.cfg ⚙: [components.llm.model] @llm_models = "spacy.GPT-4.v2"
    ./examples.jsonl: raw data

    ✨ Starting the web server at localhost:8080 ...
    Open the app and start annotating!

    Diagram labels: GPT-4 API; 🤠 You, Developer.
  62. 📒 🔮 Form and falsify hypotheses.

    ⚗ Prioritize robustness.
    Scale down and iterate.
    Imagine you’re the model.
    Finish the pipeline to production.
    Be agile and annotate yourself.
    Keep tasks small.
    Use model assistance.