Beyond the Cloud: On-premise Orchestration for Open-source LLMs

Serving Ollama models on-premise from any machine in your company's datacenter, or even from the PC you use for playing video games, and orchestrating inference as easily as using Redis, the official ollama-python/ollama-js packages, and oshepherd in your web application or Jupyter notebook.
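
In practice the application code keeps talking to the familiar Ollama client API and only the host changes. A minimal sketch with the official ollama-python package, assuming a hypothetical oshepherd API endpoint and model name (both placeholders for your own setup):

    # Minimal sketch: the official ollama-python client pointed at an oshepherd
    # API server instead of a local Ollama instance (URL and model are placeholders).
    from ollama import Client

    client = Client(host="http://oshepherd.internal.example:5001")  # hypothetical endpoint

    response = client.chat(
        model="llama3",  # placeholder: a model your ollama workers have pulled
        messages=[{"role": "user", "content": "Summarize what oshepherd does."}],
    )
    print(response["message"]["content"])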

PyCon Austria 2025: https://pycon.pyug.at/talks/beyond-the-cloud-on-premise-orchestration-for-open-source-llms/

Raul Pino

April 06, 2025

Transcript

  1. Orchestrating Ollama from home for teaching AI :) (+ rundown on local LLMs), by Raul Pino, PyCon Austria 2025
  2. Agenda
     • Intro
     • Motivation & Experimentation
       ◦ Codepeques + mnemonica.ai
     • Third-party API
     • Custom model deployment options
     • Enter Local Large Language Models (LLMs)
     • The idea: Ollama + oshepherd
       ◦ Architecture
       ◦ Usage
       ◦ Demos
     • When & Where to Use What?
     • Takeaways & beyond
  3. About me
     • Born in Venezuela.
     • 10+ years of experience as a Software Engineer & AI enthusiast (ML Engineer recently).
     • Living in Chile.
       ◦ Halborn, Distro (YC S24), Elementus, uBiome, Groupon.
     • <3 AI, Coffee, Scuba Diving, …
  4. Intro
     • All tech companies are jumping on the Large Language Models (LLMs)/AI bandwagon!
     • Specific requirements? Concerns? Use cases?
       ◦ Privacy?
         ▪ Data Exposure Risks
         ▪ Lack of Transparency
         ▪ Regulatory and Legal Compliance
       ◦ Cost $$$?
       ◦ Customization?
       ◦ Latency?
       ◦ Scalability?
       ◦ Model variety?
     • OpenAI Data Processing Addendum
  5. Intro: Get ready for your Gen AI experience!
     Platforms:
     • https://chatgpt.com/
     • https://platform.openai.com/
     Frameworks:
     • https://python.langchain.com/docs/introduction/
  6. Third-party APIs
     • OpenAI - https://platform.openai.com/
     • Anthropic - https://www.anthropic.com/api
     • Gemini - https://gemini.google.com/
     • Groq - https://groq.com/
     • Cerebras - https://www.cerebras.ai/inference
     …but the problem was access! (US sanctions => forbidden in Venezuela)
     https://help.openai.com/en/articles/5347006-openai-api-supported-countries-and-territories
  7. Custom Model Deployment Options
     • AWS SageMaker - https://aws.amazon.com/sagemaker/
       ◦ ml.g5.xlarge: $1.21 × 24 × 30 ≈ $870/month
     • AWS Bedrock - https://aws.amazon.com/bedrock/
       ◦ Input: $0.00075 per 1,000 tokens - Output: $0.001 per 1,000 tokens
     • HuggingFace - https://huggingface.co/pricing#endpoints
       ◦ $0.50 × 24 × 30 ≈ $360/month
     …but the problem was cost $$$
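
     A quick back-of-the-envelope check of the always-on hourly rates on the slide above (hourly prices come from the slide; running 24 hours a day for a 30-day month is the assumption):

        # Rough monthly cost of an always-on dedicated endpoint:
        # hourly rate (USD) x 24 hours x 30 days.
        def monthly_cost(hourly_rate_usd: float, hours_per_day: int = 24, days: int = 30) -> float:
            return hourly_rate_usd * hours_per_day * days

        print(monthly_cost(1.21))  # AWS SageMaker ml.g5.xlarge -> 871.2 (~$870/month)
        print(monthly_cost(0.50))  # HuggingFace endpoint       -> 360.0 (~$360/month)
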
  8. Enter Local OpenSource LLMs: Landscape
     • NVIDIA ChatRTX: https://www.nvidia.com/en-us/ai-on-rtx/chatrtx/
     • Jellybox: https://jellybox.com/
     • Nomic GPT4All: https://www.nomic.ai/gpt4all
     • LocalAI: https://localai.io/
     • WebLLM (WebGPU): https://webllm.mlc.ai/
     • Open WebUI: https://openwebui.com/
     • LM Studio: https://lmstudio.ai/
     • Ollama: https://ollama.com/
  9. Enter Local OpenSource LLMs: Issues
     • NVIDIA ChatRTX: https://www.nvidia.com/en-us/ai-on-rtx/chatrtx/
     • Jellybox: https://jellybox.com/
     • Nomic GPT4All: https://www.nomic.ai/gpt4all
     • LocalAI: https://localai.io/
     • WebLLM (WebGPU): https://webllm.mlc.ai/
     • Open WebUI: https://openwebui.com/
     • LM Studio: https://lmstudio.ai/
     • Ollama: https://ollama.com/
  10. Oshepherd Usage & Demos
      https://github.com/mnemonica-ai/oshepherd
      1. Deploy a Redis server.
      2. Install ollama and oshepherd in your workers.
      3. Deploy your oshepherd API server.
      pip install oshepherd
      oshepherd start-api --env-file .api.env
      oshepherd start-worker --env-file .worker.env
      Demos at last!
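
      Once the API server and workers from the steps above are running, a client can request inference the same way it would against a plain Ollama server, since the talk's premise is that the official Ollama clients work against the oshepherd API. A minimal sketch over plain HTTP, assuming an Ollama-style /api/generate endpoint; host, port and model name are placeholders for your own deployment:

         # Minimal sketch: request inference from the oshepherd API server over HTTP,
         # using the Ollama-style generate endpoint (address and model are placeholders).
         import requests

         OSHEPHERD_API = "http://your-oshepherd-api-host:5001"  # hypothetical address

         resp = requests.post(
             f"{OSHEPHERD_API}/api/generate",
             json={
                 "model": "llama3",               # any model pulled by your ollama workers
                 "prompt": "Why is the sky blue?",
                 "stream": False,                 # single JSON response instead of a stream
             },
             timeout=300,
         )
         resp.raise_for_status()
         print(resp.json()["response"])
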
  11. When & Where to Use What?
      • Specific requirements? Concerns? Use-cases?
        ◦ Privacy?
          ▪ Data Exposure Risks
          ▪ Lack of Transparency
          ▪ Regulatory and Legal Compliance
        ◦ Cost $$$?
        ◦ Customization?
        ◦ Latency?
        ◦ Scalability?
        ◦ Model variety?
        ◦ *** ACCESS!
  12. Future work?
      • Support for streams!
      • Support for Exo: https://github.com/exo-explore/exo
      https://www.youtube.com/watch?v=GBR6pHZ68Ho
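
      Stream support in oshepherd is listed above as future work; against a plain local Ollama server, the official ollama-python client already exposes streaming, so the target developer experience would look roughly like this (model name and prompt are placeholders):

         # Streaming tokens from a local Ollama server with the official ollama-python
         # client; equivalent support through oshepherd is future work, per the slide above.
         from ollama import Client

         client = Client(host="http://localhost:11434")  # default local Ollama address

         for chunk in client.chat(
             model="llama3",  # placeholder: any model you have pulled locally
             messages=[{"role": "user", "content": "Explain Redis in one sentence."}],
             stream=True,
         ):
             print(chunk["message"]["content"], end="", flush=True)
         print()
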
  13. Takeaways & beyond
      Result: We were able to help host classes and the Codepeques Day (May 2024) and release a Python package!!!
      https://pypi.org/project/oshepherd/
  14. Takeaways & beyond
      Remember:
      • Embrace AI, but not blindly
        ◦ Check your specific requirements!
      • Contribute to OpenSource
      • Help your community along the way :)
  15. Resources
      • Codepeques: https://www.instagram.com/codepeques/
      • mnemonica.ai: https://mnemonica.ai/
      • Oshepherd: https://github.com/mnemonica-ai/oshepherd
      • Redis: https://redis.io/
      • Celery: https://docs.celeryq.dev/en/stable/
      • Celery Flower: https://flower.readthedocs.io/en/latest/
      • NVIDIA ChatRTX: https://www.nvidia.com/en-us/ai-on-rtx/chatrtx/
      • Jellybox: https://jellybox.com/
      • Nomic GPT4All: https://www.nomic.ai/gpt4all
      • LocalAI: https://localai.io/
      • WebLLM (WebGPU): https://webllm.mlc.ai/
      • Open WebUI: https://openwebui.com/
      • LM Studio: https://lmstudio.ai/
      • Ollama: https://ollama.com/
      • Exo: https://github.com/exo-explore/exo
      • Local LLMs with LangChain: https://python.langchain.com/v0.1/docs/guides/development/local_llms/
      • https://generativeai.pub/the-ai-buzz-why-every-company-is-jumping-on-the-bandwagon-07e261707b64
      • https://thefr.com/news/as-brands-jump-on-ai-bandwagon-regulators-respond