Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Security and auditing tools in Large Language M...

Security and auditing tools in Large Language Models (LLM)

LLMs (Large Language Models) are a class of artificial intelligence (AI) models that have revolutionized the way machines interpret and generate human language. Security and auditing are critical issues when dealing with applications based on large language models.

jmortegac

March 22, 2025
Tweet

More Decks by jmortegac

Other Decks in Technology

Transcript

  1. Agenda • Introduction to LLM • Introduction to OWASP LLM

    Top 10 • Auditing tools • Use case with the textattack tool
  2. Introduction to LLM • Transformers • Attention is All You

    Need" by Vaswani et al. in 2017 • Self-attention mechanism • Encoder-Decoder Architecture
  3. Adversarial Attacks • Small Perturbations: Adversarial attacks typically involve adding

    small, carefully crafted perturbations to the input data that are often imperceptible to humans. These subtle changes can trick the AI system into making wrong predictions or classifications. • Model Vulnerabilities: These attacks exploit specific weaknesses in the machine learning model, such as its inability to generalize well to new, unseen data or the sensitivity of the model to certain types of input. • Impact on Critical Systems: Adversarial attacks can have severe consequences when applied to AI systems in critical domains such as autonomous vehicles, facial recognition systems, medical diagnostics, and security systems.
  4. Adversarial Attacks • 1. Prompt Injection • 2. Evasion Attacks

    • 3. Poisoning Attacks • 4. Model Inversion Attacks • 5. Model Stealing Attacks • 6. Membership Inference Attacks
  5. Tools/frameworks to evaluate model robustness • PromptInject Framework • https://github.com/agencyenterprise/PromptInject

    • PAIR - Prompt Automatic Iterative Refinement • https://github.com/patrickrchao/JailbreakingLLMs • TAP - Tree of Attacks with Pruning • https://github.com/RICommunity/TAP
  6. Auditing tools • Prompt Guard refers to a set of

    strategies, tools, or techniques designed to safeguard the behavior of large language models (LLMs) from malicious or unintended input manipulations. • Prompt Guard uses an 86M parameter classifier model that has been trained on a large dataset of attacks and prompts found on the web. Prompt Guard can categorize a prompt into three different categories: "Jailbreak", "Injection" or "Benign".
  7. Auditing tools • Llama Guard 3 refers to a security

    tool or strategy designed for guarding large language models like Meta’s LLaMA against potential vulnerabilities and adversarial attacks. • Llama Guard 3 offers a robust and adaptable solution to protect LLMs against Prompt Injection and Jailbreak attacks. By combining advanced filtering, normalization, and monitoring techniques.
  8. Auditing tools • Dynamic Input Filtering • Prompt Normalization and

    Contextualization • Secure Response Policy • Active Monitoring and Automatic Response
  9. Auditing tools • S1: Violent Crimes • S2: Non-Violent Crimes

    • S3: Sex-Related Crimes • S4: Child Sexual Exploitation • +S5: Defamation (New) • S6: Specialized Advice • S7: Privacy • S8: Intellectual Property • S9: Indiscriminate Weapons • S10: Hate • S11: Suicide & Self-Harm • S12: Sexual Content • S13: Elections • S14: Code Interpreter Abuse Introducing v0.5 of the AI Safety Benchmark from MLCommons
  10. Text attack from textattack.models.wrappers import HuggingFaceModelWrapper from transformers import AutoModelForSequenceClassification,

    AutoTokenizer # Load pre-trained sentiment analysis model from Hugging Face model = AutoModelForSequenceClassification.from_pretrained("textattack/bert -base-uncased-imdb") tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-imdb") # Wrap the model for TextAttack model_wrapper = HuggingFaceModelWrapper(model, tokenizer) https://github.com/QData/TextAttack
  11. Text attack https://github.com/QData/TextAttack from textattack.attack_recipes import TextFoolerJin2019 # Initialize the

    attack with the TextFooler recipe attack = TextFoolerJin2019.build(model_wrapper)
  12. Text attack https://github.com/QData/TextAttack # Example text for sentiment analysis (a

    positive review) text = "I absolutely loved this movie! The plot was thrilling, and the acting was top-notch." # Apply the attack adversarial_examples = attack.attack([text]) print(adversarial_examples)
  13. Text attack https://github.com/QData/TextAttack Original Text: "I absolutely loved this movie!

    The plot was thrilling, and the acting was top-notch." Adversarial Text: "I completely liked this film! The storyline was gripping, and the performance was outstanding."
  14. Text attack https://github.com/QData/TextAttack from textattack.augmentation import WordNetAugmenter # Use WordNet-based

    augmentation to create adversarial examples augmenter = WordNetAugmenter() # Augment the training data with adversarial examples augmented_texts = augmenter.augment(text) print(augmented_texts)
  15. Resources • github.com/greshake/llm-security • github.com/corca-ai/awesome-llm-security • github.com/facebookresearch/PurpleLlama • github.com/protectai/llm-guard •

    github.com/cckuailong/awesome-gpt-security • github.com/jedi4ever/learning-llms-and-genai-for-dev-sec-ops • github.com/Hannibal046/Awesome-LLM