on-premise)
→ Need to tune the latency to make the model faster
⚙ Customization & fine-tuning
→ No lock-in to a particular model
🔒 Security compliance & data residency / privacy
Semantic Kernel is an SDK that integrates Large Language Models (LLMs) from providers like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C#, Python, and Java. https://github.com/microsoft/semantic-kernel
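A minimal Python sketch of the typical Semantic Kernel flow: register a chat-completion service on a kernel, then invoke a prompt template. The SDK's API has changed across versions, so treat the exact class and method names below as assumptions based on the 1.x Python package; the model name is a placeholder.

```python
# Minimal sketch, assuming the 1.x-style semantic-kernel Python SDK
# (pip install semantic-kernel) and OPENAI_API_KEY set in the environment.
# Exact names may differ in other SDK versions.
import asyncio

from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
from semantic_kernel.functions import KernelArguments


async def main() -> None:
    kernel = Kernel()
    # Register an OpenAI chat model as the kernel's AI service.
    kernel.add_service(OpenAIChatCompletion(ai_model_id="gpt-4o-mini"))

    # Invoke a prompt template; {{$input}} is filled from KernelArguments.
    result = await kernel.invoke_prompt(
        "Summarize in one sentence: {{$input}}",
        arguments=KernelArguments(input="Semantic Kernel connects LLMs to code."),
    )
    print(result)


asyncio.run(main())
```

The same pattern works with the Azure OpenAI or Hugging Face connectors: only the registered service changes, not the calling code.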
vLLM: LLM serving for everyone. vLLM is fast with:
✅ State-of-the-art serving throughput
✅ Efficient management of attention key and value memory with PagedAttention
✅ Continuous batching of incoming requests
✅ Fast model execution with CUDA/HIP graphs
✅ Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
✅ Optimized CUDA kernels
https://github.com/vllm-project/vllm
[Figure: throughput comparison (higher is better)]
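To give a feel for the API, here is a minimal offline-inference sketch following vLLM's quickstart pattern. The model name facebook/opt-125m is just a small placeholder; swap in any model vLLM supports.

```python
# Minimal vLLM offline-inference sketch (pip install vllm).
# Runs best on a CUDA- or ROCm-capable GPU.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "PagedAttention improves LLM serving by",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# The engine batches these prompts internally (continuous batching) and
# manages the attention KV cache with PagedAttention.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For serving over HTTP instead of offline batches, the same engine is exposed via `vllm serve <model>`, which provides an OpenAI-compatible API endpoint.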