• Load dataset: load the data from Hugging Face using a Ray Dataset.
• Preprocess dataset: tokenize the data with Ray Data preprocessing functions.
• Fine-tune model: use Ray Train together with the Hugging Face training function to fine-tune the foundation model.
• Tune model: Ray provides a tuning function (Ray Tune) for hyperparameter tuning of the model.
https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
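A hedged sketch of those four steps using Ray 2.x APIs; the dataset, checkpoint names, worker counts, and hyperparameter ranges below are illustrative placeholders, not taken from the blog post:

```python
import numpy as np
import ray
from ray import tune
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from datasets import load_dataset
from transformers import AutoTokenizer

# 1. Load dataset: read a Hugging Face dataset into a Ray Dataset
hf_ds = load_dataset("imdb", split="train")                     # placeholder dataset
ray_ds = ray.data.from_huggingface(hf_ds)

# 2. Preprocess dataset: tokenize with a Ray Data batch transformation
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

def tokenize(batch):
    enc = tokenizer(list(batch["text"]), truncation=True, padding="max_length")
    return {k: np.array(v) for k, v in enc.items()}

tokenized_ds = ray_ds.map_batches(tokenize)

# 3. Fine-tune model: wrap a Hugging Face training loop in a Ray TorchTrainer
def train_loop_per_worker(config):
    # Placeholder: build the model and run the Hugging Face Trainer here
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
    datasets={"train": tokenized_ds},
)

# 4. Tune model: hyperparameter search over the trainer with Ray Tune
tuner = tune.Tuner(
    trainer,
    param_space={"train_loop_config": {"lr": tune.loguniform(1e-5, 1e-3)}},
)
results = tuner.fit()
```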
PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting large pretrained models to various downstream applications without fine-tuning all of a model's parameters because it is prohibitively costly. PEFT methods only fine-tune a small number of (extra) model parameters - significantly decreasing computational and storage costs - while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware. PEFT is integrated with the Transformers, Diffusers, and Accelerate libraries to provide a faster and easier way to load, train, and use large models for inference. (Text copied from source) https://huggingface.co/docs/peft/en/index
LoRA is a low-rank decomposition method to reduce the number of trainable parameters, which speeds up finetuning large models and uses less memory. In PEFT, using LoRA is as easy as setting up a LoraConfig and wrapping it with get_peft_model() to create a trainable PeftModel. (Text copied from source) https://huggingface.co/docs/peft/en/task_guides/lora_based_methods
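A minimal sketch of the LoraConfig + get_peft_model() setup described above; the checkpoint name and LoRA hyperparameters are illustrative, not from the source:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model to adapt (checkpoint name is an illustrative placeholder)
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# LoRA hyperparameters: rank of the update matrices, scaling, dropout,
# and which modules receive the low-rank adapters
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Wrap the base model; only the small LoRA matrices are trainable
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # shows how few parameters are actually trained
```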
Quantization represents data with fewer bits, making it a useful technique for reducing memory usage and accelerating inference, especially when it comes to large language models (LLMs). There are several ways to quantize a model, including:
• optimizing which model weights are quantized with the AWQ algorithm
• independently quantizing each row of a weight matrix with the GPTQ algorithm
• quantizing to 8-bit and 4-bit precision with the bitsandbytes library
• quantizing to as low as 2-bit precision with the AQLM algorithm
However, after a model is quantized it isn't typically further trained for downstream tasks because training can be unstable due to the lower precision of the weights and activations. But since PEFT methods only add extra trainable parameters, this allows you to train a quantized model with a PEFT adapter on top! Combining quantization with PEFT can be a good strategy for training even the largest models on a single GPU. For example, QLoRA is a method that quantizes a model to 4-bits and then trains it with LoRA. This method allows you to finetune a 65B parameter model on a single 48GB GPU! (Text copied from source) https://huggingface.co/docs/peft/main/en/developer_guides/quantization
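A hedged sketch of the QLoRA-style recipe (4-bit quantization via bitsandbytes plus a LoRA adapter on top); the checkpoint name and hyperparameters are placeholders, not taken from the source:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the base model in 4-bit (checkpoint name is an illustrative placeholder)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for k-bit training, then add a LoRA adapter;
# the frozen 4-bit weights stay fixed and only the adapter weights are trained
model = prepare_model_for_kbit_training(model)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
model.print_trainable_parameters()
```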
choosing wisely)
➢ Need to tune the latency to make the model faster
2. Customization & fine-tuning
➢ No lock-in to a particular model
3. Security compliance & data residency / privacy
LangChain is a framework (open-source library) for developing applications powered by large language models (LLMs). The main values of LangChain are:
✓ Components: abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy to use, whether you are using the rest of the LangChain framework or not.
✓ Off-the-shelf chains: a structured assembly of components for accomplishing specific higher-level tasks.
https://www.langchain.com/langchain
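A small, hedged illustration of composing LangChain components into a chain; the prompt, model name, and the langchain-openai integration below are an assumed setup, not taken from the source:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# Components: a prompt template, a chat model, and an output parser
prompt = ChatPromptTemplate.from_template(
    "Summarize the following text in one sentence:\n{text}"
)
llm = ChatOpenAI(model="gpt-4o-mini")  # model name is an illustrative placeholder
parser = StrOutputParser()

# Chain: components composed with the | operator (LangChain Expression Language)
chain = prompt | llm | parser

print(chain.invoke({"text": "LangChain offers modular components and off-the-shelf chains."}))
```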
Semantic Kernel is an SDK that integrates Large Language Models (LLMs) like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C#, Python, and Java. Semantic Kernel achieves this by allowing you to define plugins that can be chained together in just a few lines of code. What makes Semantic Kernel special, however, is its ability to automatically orchestrate plugins with AI. With Semantic Kernel planners, you can ask an LLM to generate a plan that achieves a user's unique goal. Afterwards, Semantic Kernel will execute the plan for the user. https://github.com/microsoft/semantic-kernel
LLMOps covers a broad range of activities, including:
• Model deployment and maintenance: deploying and managing LLMs on cloud platforms or on-premises infrastructure
• Data management: curating and preparing training data, as well as monitoring and maintaining data quality
• Model training and fine-tuning: training and refining LLMs to improve their performance on specific tasks
• Monitoring and evaluation: tracking LLM performance, identifying errors, and optimizing models
• Security and compliance: ensuring the security and regulatory compliance of LLM operations
https://cloud.google.com/discover/what-is-llmops
One of the platforms used by companies such as OpenAI, Uber, and Cohere to help deploy LLM models is Anyscale. This platform, which allows developers to serve LLM models, was developed by the Ray team; their tool Ray-LLM solves the technical problems of serving LLMs, which can be divided into three issues. (Text copied from blog post) https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
Continuous batching: instead of native batching of fixed sets of word sequences, this method concatenates new input sequence tokens onto the end of the token batch to fill the batch, which increases the throughput of the system. (Text copied from blog post) https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
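A toy, framework-free sketch of the idea (my own illustration, not code from the blog): as soon as one sequence in the batch finishes, a waiting request takes its slot instead of the batch draining completely before new work is admitted.

```python
from collections import deque

def continuous_batching(requests, batch_size, step_fn):
    """Toy scheduler: keeps the batch full by admitting waiting requests as soon
    as running sequences finish, instead of draining the whole batch first."""
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Fill any free batch slots with waiting requests
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decoding step for every sequence currently in the batch
        for req in running:
            step_fn(req)
        # Retire completed sequences, freeing their slots for the next iteration
        finished += [r for r in running if r["remaining"] == 0]
        running = [r for r in running if r["remaining"] > 0]
    return finished

# Requests with different output lengths share batch slots without waiting for each other
reqs = [{"id": i, "remaining": n} for i, n in enumerate([3, 10, 2, 7, 4])]
done = continuous_batching(reqs, batch_size=2, step_fn=lambda r: r.update(remaining=r["remaining"] - 1))
print([r["id"] for r in done])
```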
Speculative decoding: a small model speculates K tokens ahead and a large model then verifies them; if a drafted token isn't correct, the large model emits the correct token in its place. This allows faster forward passes per token, which reduces latency, since the large model only verifies the draft and can do so in parallel. (Text copied from blog post) https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
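A hedged sketch of this draft-and-verify pattern using Hugging Face Transformers' assisted generation; the two checkpoint names are illustrative placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Large "verifier" model and a much smaller "draft" model from the same family
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
draft_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")

# The draft model proposes several tokens ahead; the large model verifies them
# in a single forward pass and corrects the first token that does not match
outputs = model.generate(**inputs, assistant_model=draft_model, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```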
A classifier is used to classify the queries that feed into the model and then select a suitable model before passing them to the LLM. This method helps when building agents based on LLMs, since each LLM has its own strengths for each task, so it can be better to let the model with more relevant context answer. (Text copied from blog post) https://medium.com/@RTae/leveraging-ray-and-vertex-ai-for-llmops-6b169f65ff3a
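A toy sketch of this classify-then-route idea (my own illustration; the classifier, labels, and model registry below are hypothetical):

```python
# Hypothetical registry mapping a query class to a specialized model / endpoint
MODEL_REGISTRY = {
    "sql": "sql-specialist-llm",
    "code": "code-specialist-llm",
    "general": "general-purpose-llm",
}

def classify_query(query: str) -> str:
    """Stand-in classifier; in practice this could be a small model or an embedding-based router."""
    if "select" in query.lower() and "from" in query.lower():
        return "sql"
    if "def " in query or "function" in query.lower():
        return "code"
    return "general"

def route(query: str) -> str:
    """Pick the model best suited to the query before calling the LLM."""
    return MODEL_REGISTRY[classify_query(query)]

print(route("Write a function that reverses a list"))   # -> code-specialist-llm
print(route("SELECT name FROM users WHERE age > 30"))   # -> sql-specialist-llm
```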
vLLM: easy, fast, and cheap LLM serving for everyone. vLLM is fast with:
✅ State-of-the-art serving throughput
✅ Efficient management of attention key and value memory with PagedAttention
✅ Continuous batching of incoming requests
✅ Fast model execution with CUDA/HIP graph
✅ Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache
✅ Optimized CUDA kernels
https://github.com/vllm-project/vllm
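A minimal offline-inference sketch with vLLM's Python API; the checkpoint name and sampling settings are illustrative placeholders:

```python
from vllm import LLM, SamplingParams

# The engine handles PagedAttention and continuous batching internally
llm = LLM(model="facebook/opt-125m")  # checkpoint name is an illustrative placeholder

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
prompts = [
    "The key idea behind PagedAttention is",
    "Continuous batching improves serving throughput because",
]

# Prompts are batched and scheduled by the engine; each result carries its generated text
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```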