Upgrade to Pro — share decks privately, control downloads, hide ads and more …

To tune or not to tune by Anamitra Dutta Majumd...

To tune or not to tune by Anamitra Dutta Majumdar (Intuit)

To tune or not to tune : Benefits and security pitfalls of fine-tuning
Anamitra Dutta Majumdar, Principal Engineer at Intuit

apidays New York 2025
API Management for Surfing the Next Innovation Waves: GenAI and Open Banking
Convene 360 Madison, New York
May 14 & 15, 2025

------

Check out our conferences at https://www.apidays.global/

Do you want to sponsor or talk at one of our conferences?
https://apidays.typeform.com/to/ILJeAaV8

Learn more on APIscene, the global media made by the community for the community:
https://www.apiscene.io

Explore the API ecosystem with the API Landscape:
https://apilandscape.apiscene.io/

Avatar for apidays

apidays

May 24, 2025
Tweet

More Decks by apidays

Other Decks in Programming

Transcript

  1. SPEAKER WELCOME PACKET May 14 & 15 2025 To tune

    or not to tune: Benefits and security pitfalls of model fine -tuning Anamitra D Majumdar, Principal Engineer, Intuit
  2. Intuit Confidential and Proprietary 2 Intuit is the global financial

    technology platform that powers prosperity for 100 million consumer and small business customers worldwide.
  3. SPEAKER WELCOME PACKET About Me As a Principal Architect at

    Intuit, I lead the security architecture and engineering technical initiatives for the Global Business Solutions Group, which includes the QuickBooks and Mailchimp product line. While I help build secure AI workflows in products I also spearhead initiatives to use AI for shifting security and abuse controls to the left.
  4. SPEAKER WELCOME PACKET Agenda Why fine tuning: Key industry Trends

    Model Alignment for safety and how fine tuning impacts it Attacks on LLM Safety Alignment A case study from financial domain Risks of model misalignment and mitigations
  5. SPEAKER WELCOME PACKET Fine-tuning Industry Trends The dynamic fine-tuning landscape

    is rapidly evolving with three emerging patterns Tuning as a Platform Service Greenfield Fine -tuning Fine-tuned Models for Inference
  6. SPEAKER WELCOME PACKET Model Alignment Achieving alignment in an LL

    means ensuring its output aligns perfectly with human values, choices, and goals in a positive, ethical, and trustworthy manner. A misaligned model pursues unintended objectives
  7. SPEAKER WELCOME PACKET LLM Alignment Process Human Values Establish ethical

    guidelines and principles Training Process Implement alignment techniques during model development Évaluation & Testing Verify alignment through rigorous testing Continuous Improvement Iterative refinement of alignment methods
  8. SPEAKER WELCOME PACKET Pre Trained LLM Pre-trained Base LLM that

    can predict next token Alignment of LLM Aligned for safety and helpfulness Fine- tuned LLM Effectively perturb the weights within the model to bias certain paths that represent new knowledge.
  9. SPEAKER WELCOME PACKET Safety Alignment for LLMs Harmful Content Prevention

    Techniques to prevent generation of dangerous, illegal, or unethical content Bias Mitigation Methods to identify and reduce unfair biases in model outputs Robustness to Attacks Defenses against attempts to circumvent safety measures
  10. SPEAKER WELCOME PACKET Attacks on Safety Alignment for LLMs Prompt

    Injection Crafting inputs that bypass safety filters Jailbreaking Techniques to circumvent built-in restrictions Adversarial Examples Inputs designed to cause model failures Model Extraction Attempts to steal model capabilities or training data
  11. SPEAKER WELCOME PACKET Adversarial techniques on Model Safety Alignment Three

    main approaches to undermine safety guardrails in language models Discrete input manipulation Crafting prompts that bypass safety filters through careful word choice or linguistic tricks • Token-level manipulations • Character substitutions • Context confusion techniques Embedding Space Attack Directly manipulating the model's internal vector representations to bypass safety mechanisms at a deeper level • Gradient-based optimization • Latent space navigation • Représentation poisoning Fine-tuning Exploiting the training process to introduce vulnerabilities or backdoors that compromise safety alignment • Adversarial data poisoning • Alignment evasion • Transfer learning attacks Each technique operates at a different level of the model architecture, requiring distinct defense strategies
  12. SPEAKER WELCOME PACKET A Case Study from financial domain Applying

    Safe prompt engineering practices to fine -tune models
  13. SPEAKER WELCOME PACKET Prompt Optimization for Security in Fine -tuned

    Models Safe Prompt Engineering ➢ Is an important measure in protecting Generative AI based workflows Goal ➢ Understand if safe prompt engineering techniques could improve robustness without impacting accuracy
  14. SPEAKER WELCOME PACKET Model Fine -tuning in financial domain to

    improve customer empathy Fine-tuning Approach Foundational open source model fine-tuned with financial data Primary Function Facilitate categorization of transactions Customer Benefit Improved productivity with auto categorization feature
  15. SPEAKER WELCOME PACKET Security Optimization of System Prompt Prompt was

    optimized using the following strategies Specialized Role Definition Enhanced role definition with clear ethical boundaries and professional guidelines Detailed Guidelines Added specific instructions for accurate transaction categorization Historical Data Utilization Instructions to analyze past transaction data for consistent categorization patterns Refusal Mechanism Implemented explicit refusal mechanism for inappropriate or unethical requests
  16. SPEAKER WELCOME PACKET Risk Modalities Risk Modality 1 fine-tuning with

    explicitly harmful datasets safety alignment of both models is largely removed upon fine -tuning with such a few harmful examples. Risk Modality 2 fine-tuning with implicitly harmful datasets model fine-tuned on implicitly harmful datasets are generally jailbroken and willing to fulfill almost any (unseen) harmful instruction. Risk Modality 3 fine-tuning with benign datasets benign use cases reveaL that even when end -users have no malicious intent, merely fine-tuning with some benign (and purely utility -oriented) datasets can compromise LLMs' safety alignment
  17. SPEAKER WELCOME PACKET Pre fine-tuning Risks: • Training Data poisoning

    • Training data Leakage Mitigations: • Implement Data Sanitization • Remove PII from training data • Mix utility data and safety data during training
  18. SPEAKER WELCOME PACKET During fine -tuning Risks: Catastrophic forgetting Choose

    the right fine-tuning technique Supervised Fine-tuning Good for classification use cases Reinforcement Learning Good for complex domain-specific tasks that require advanced reasoning Direct Preference Optimization Good for summarizing text and generating chat messages
  19. SPEAKER WELCOME PACKET Post fine-tuning Risks: • Model theft •

    Membership inference • Training data leakage Mitigations: • Implement tight access controls to protect the fine -tuned model at inference • Test the fine-tuned model for safety and utility before deploying to production • Restrict output while still maintaining utility • Rate limit access to the fine-tuned model at inference • Audit Model behaviour to measure drift
  20. SPEAKER WELCOME PACKET Key Takeaways 1 Explore alternatives to fine-tuning

    like RAG that may offer comparable boost on accuracy 2 Implement PII data scrubbing before fine - tuning 3 Apply strict access control for inference 4 Develop detectors for data leakage and data poisoning 5 Implement training data sanitization before fine tuning 6 Perform sufficient testing for safety and accuracy 7 Mix utility data and safety data during training
  21. SPEAKER WELCOME PACKET References • Harmful Fine-tuning Attacks and Defenses

    for Large Language Model • Fine-tuning Aligned Language Models Compromises Safety with benign datasets • MedEmbed : Medical focussed embedding models • Fine-tuning support in OpenAI