10 Years of Open Source: Navigating the Next AI Revolution

Ines Montani Explosion

spaCy
Open-source library for industrial-strength natural language processing
spacy.io · 255m+ downloads
ChatGPT can write spaCy code!

Prodigy
Modern scriptable annotation tool for machine learning developers
prodigy.ai · 900+ companies · 10k+ users
[Illustration: Alex Smith (Developer), Kim Miller (Analyst), GPT-4 API]

first commit to spaCy
spaCy is first released (spacy.io)

OUR DEVELOPMENT PHILOSOPHY
“Let Them Write Code” (spacy.fyi/ltwc)
Good tools help people do their work. You don’t have to do their work for them.
You can reinvent the wheel, but don’t try to reinvent the road.
[Illustration: ["go", "swim"], spaCy]

everyone gets excited about chat bots

The Window-Knocking Machine Test (ines.io/blog/window-knocking-machine-test)
“knocker-uppers”
Are you designing a window-knocking machine or an alarm clock?

Hello, I’m Toni’s virtual assistant and I help schedule appointments. Do you have time at 1pm on Monday?
No, but Tuesday would work for me.
Okay, please confirm: Tuesday at 1pm?
1pm is unideal but 3pm would work.
Toni doesn’t have availability at 3pm but I could offer a slot at 4pm or 5:30pm.
Which time zone is this by the way? I’m in CET.

“window-knocking machine”: the scheduling chat above. “alarm clock”: Calendly.

deep learning is widely adopted

Software 1.0: 📄 code → compiler → 💾 program
Software 2.0: 📊 data → algorithm → 🔮 model
✅ tests (Software 1.0) correspond to 📈 evaluation (Software 2.0)
refactoring (Software 1.0) corresponds to iteration (Software 2.0)
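One way to read the "tests become evaluation" analogy: instead of asserting exact outputs, a model is scored against annotated examples. The sketch below is only an illustration of that idea. The span format, the sample data and the function name are hypothetical and do not come from the talk or from any library.

```python
from typing import Set, Tuple

# Hypothetical span format for this sketch: (start_char, end_char, label)
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Score predicted spans against gold annotations using exact matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations: two of four predictions match the three gold spans.
gold = {(0, 4, "ORG"), (10, 16, "PERSON"), (20, 26, "GPE")}
predicted = {(0, 4, "ORG"), (10, 16, "ORG"), (20, 26, "GPE"), (30, 35, "GPE")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```

Re-running a score like this after every change to the data or the model plays the role that a test suite plays during refactoring.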

Prodigy is first released (prodigy.ai)
language model pre-training works
few-shot in-context learning works

spaCy v3 is first released

Use cases in industry
generative tasks: 📖 single/multi-doc summarization, 🧮 reasoning, ✅ problem solving, ✍ paraphrasing, 🖼 style transfer, ⁉ question answering
predictive tasks: 🔖 entity recognition, 🔗 relation extraction, 👫 coreference resolution, 🧬 grammar & morphology, 🎯 semantic parsing, 💬 discourse structure, 📚 text classification
structured data
many industry problems have remained the same, they just changed in scale

in-context learning gains traction

AI products are more than just a model
human-facing systems (ChatGPT): most important differentiation is product, not just technology. UI / UX, marketing, customization.
machine-facing models (GPT-4): swappable components based on research, impacts are quantifiable. Speed, accuracy, latency, cost.
But what about the data?
User data is an advantage for product, not the foundation for machine-facing tasks.
You don’t need specific data to gain general knowledge.

spacy-llm is first released (github.com/explosion/spacy-llm)

spacy-llm (spacy.io/usage/large-language-models)
💬 prompt + 📖 text → LLM → task-specific output
prompt model & transform output to structured data
config.cfg: Text → LLM → Structured Data {}
unified, model-agnostic API
entity recognition, entity linking, text classification, relation extraction and more…
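The "prompt model & transform output to structured data" step can be sketched in plain Python. This is not the spacy-llm API, which is driven by config.cfg. The prompt wording, the label set, the response format and the `fake_llm` stand-in are all assumptions made so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical label set, for illustration only.
LABELS = ["PERSON", "ORG"]


def build_prompt(text: str, labels: List[str]) -> str:
    """Combine task instructions and the input text into one prompt."""
    return (
        f"Extract entities with the labels {', '.join(labels)}.\n"
        "Answer with one 'LABEL: entity' pair per line.\n\n"
        f"Text: {text}"
    )


def parse_response(response: str, text: str, labels: List[str]) -> List[Dict[str, object]]:
    """Turn the free-form model answer into character-offset spans."""
    spans: List[Dict[str, object]] = []
    for line in response.splitlines():
        label, sep, entity = line.partition(":")
        label, entity = label.strip(), entity.strip()
        start = text.find(entity) if entity else -1
        # Skip malformed lines, unknown labels and entities not found in the text.
        if not sep or label not in labels or start == -1:
            continue
        spans.append({"start": start, "end": start + len(entity), "label": label, "text": entity})
    return spans


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned answer with some noise."""
    return "PERSON: Ines Montani\nORG: Explosion\nLOCATION: Berlin\nnot a valid line"


def extract(text: str, llm: Callable[[str], str]) -> List[Dict[str, object]]:
    """Prompt the model, then keep only output that maps onto the text."""
    return parse_response(llm(build_prompt(text, LABELS)), text, LABELS)


text = "Ines Montani works at Explosion."
entities = extract(text, fake_llm)
for entity in entities:
    print(entity)
```

Because `extract` only depends on a callable that maps a prompt to a string, the model behind it can be swapped without touching the parsing code, which is one way to picture a model-agnostic interface.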

LLMs and Generative AI fully hit the mainstream (ChatGPT)

Economies of scale
[Chart: output vs. costs]
OpenAI, Google
• access to talent, compute etc.
• API request batching [illustration: requests 💧 filling a batch under high traffic vs. low traffic]
you 🤠
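One possible reading of the batching illustration: a provider with high traffic can fill every batch, while a small deployment runs mostly empty ones. The sketch below shows that effect with made-up numbers. The batch size, the request counts and the function names are hypothetical.

```python
from typing import List, Sequence


def make_batches(requests: Sequence[str], batch_size: int) -> List[List[str]]:
    """Group the requests that arrived in one time window into fixed-size batches."""
    return [list(requests[i:i + batch_size]) for i in range(0, len(requests), batch_size)]


def utilization(batches: List[List[str]], batch_size: int) -> float:
    """Fraction of available batch slots that carry a real request."""
    if not batches:
        return 0.0
    return sum(len(batch) for batch in batches) / (len(batches) * batch_size)


BATCH_SIZE = 8  # hypothetical number of slots per batch
high_traffic = [f"request-{i}" for i in range(16)]  # many requests in the window
low_traffic = [f"request-{i}" for i in range(3)]    # few requests in the window

high = utilization(make_batches(high_traffic, BATCH_SIZE), BATCH_SIZE)
low = utilization(make_batches(low_traffic, BATCH_SIZE), BATCH_SIZE)
print(f"high traffic: {high:.0%} of slots used, low traffic: {low:.0%} of slots used")
```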

human-in-the-loop distillation is promising prodigy.fyi/distillation

Human in the loop (explosion.ai/blog/human-in-the-loop-distillation)
[Diagram labels: LLM, continuous evaluation, baseline]
[Diagram labels: COMPONENT, LLM prompting, transfer learning, distilled model]

Case Study: S&P Global (explosion.ai/blog/sp-global-commodities)
• real-time commodities trading insights by extracting structured attributes
• high-security environment
• used LLM during annotation
• 10× data development speedup with humans and model in the loop
• 8 market pipelines in production
99% F-score · 6mb model size · 16k+ words/second

everyone is excited about chat bots again

? (ines.io/blog/window-knocking-machine-test)

What’s the total services revenue from 2023?
$2,923,531
How many clients is that in total?
29
⏺ ⏺ ⏺

Retrieval-Augmented Generation: 🔮 LLM, 📚 database, 🤖 agents, ⚙ query

Year: 2023 · Type: Services
Clients (28)    Revenue
ACME Inc.       432,032
FooBar GmbH     82,000
NLPCorp         1,500
XKCD Ltd.       193,000
Python AG       91,320
                $ 2,625,032

AI still needs product decisions!
Kim Miller, Analyst
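The slide sets a chat answer ($2,923,531, 29 clients) next to a table that reads $ 2,625,032 for 28 clients. A plain query over structured rows gives the same answer every time and is easy to check, which may be part of the point. The sketch below uses only the five rows visible on the slide. Pairing names with amounts by their order is an assumption, and the function names are hypothetical.

```python
from typing import Dict

# The five client rows visible on the slide. Pairing names with amounts by
# their order on the slide is an assumption. The slide lists 28 clients in
# total, so this partial sum is not expected to match the slide's total.
services_revenue_2023: Dict[str, int] = {
    "ACME Inc.": 432_032,
    "FooBar GmbH": 82_000,
    "NLPCorp": 1_500,
    "XKCD Ltd.": 193_000,
    "Python AG": 91_320,
}


def total_revenue(rows: Dict[str, int]) -> int:
    """Deterministic aggregate: the same rows always give the same total."""
    return sum(rows.values())


def client_count(rows: Dict[str, int]) -> int:
    """Number of distinct clients in the selected rows."""
    return len(rows)


visible_total = total_revenue(services_revenue_2023)
visible_clients = client_count(services_revenue_2023)
print(f"{visible_clients} visible clients, revenue {visible_total:,}")
```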

Explosion goes back to being independent-minded and self-sufficient (explosion.ai/blog/back-to-our-roots)
What’s next?

Adoption Cycle
rules and conditional logic → applied workflow
linear models → applied workflow
deep learning, chat bots → applied workflow
transfer learning, transformers → applied workflow
in-context learning, LLMs and GenAI → applied workflow
combine new techniques with established workflows

Summary: NAVIGATING AI & NLP
Think beyond chat bots or human-shaped tasks. You don’t want to build a “window-knocking machine”.
Focus on your application. Consider what it really needs and let your data guide you.
Stay ambitious. Don’t compromise on best practices, efficiency and privacy.
Keep filling up your toolbox. Know the techniques you have available and apply the best ones to get the job done.

Explosion: explosion.ai · spaCy: spacy.io · Prodigy: prodigy.ai
Twitter: @_inesmontani · Mastodon: @[email protected] · Bluesky: @inesmontani.bsky.social · LinkedIn


Resources

spaCy: Industrial-Strength NLP

Prodigy: Radically efficient machine teaching

spacy-llm: Integrating LLMs into structured NLP pipelines

The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs

A practical guide to human-in-the-loop distillation

How S&P Global is making markets more transparent with NLP, spaCy and Prodigy

Let Them Write Code

The Window-Knocking Machine Test
