Security and auditing tools in Large Language Models (LLM)

Security and auditing tools in Large Language Models (LLM) José
Manuel Ortega [email protected]

Agenda • Introduction to LLM • Introduction to OWASP LLM
Top 10 • Auditing tools • Use case with the textattack tool

Introduction to LLM • Transformers • Attention is All You
Need" by Vaswani et al. in 2017 • Self-attention mechanism • Encoder-Decoder Architecture

Introduction to LLM

Introduction to LLM Pre-training + fine-tuning

Introduction to OWASP LLM Top 10

ChiperChat https://arxiv.org/pdf/2308.06 463

Jailbreak prompts • https://jailbreak-llm s.xinyueshen.me/

Adversarial Attacks

Adversarial Attacks • Small Perturbations: Adversarial attacks typically involve adding
small, carefully crafted perturbations to the input data that are often imperceptible to humans. These subtle changes can trick the AI system into making wrong predictions or classifications. • Model Vulnerabilities: These attacks exploit specific weaknesses in the machine learning model, such as its inability to generalize well to new, unseen data or the sensitivity of the model to certain types of input. • Impact on Critical Systems: Adversarial attacks can have severe consequences when applied to AI systems in critical domains such as autonomous vehicles, facial recognition systems, medical diagnostics, and security systems.

Adversarial Attacks

Adversarial Attacks • 1. Prompt Injection • 2. Evasion Attacks
• 3. Poisoning Attacks • 4. Model Inversion Attacks • 5. Model Stealing Attacks • 6. Membership Inference Attacks

Adversarial Attacks

Tools/frameworks to evaluate model robustness • PromptInject Framework • https://github.com/agencyenterprise/PromptInject
• PAIR - Prompt Automatic Iterative Refinement • https://github.com/patrickrchao/JailbreakingLLMs • TAP - Tree of Attacks with Pruning • https://github.com/RICommunity/TAP

Auditing tools • https://github.com/tensorflow/fairness-indicators

Auditing tools • Prompt Guard refers to a set of
strategies, tools, or techniques designed to safeguard the behavior of large language models (LLMs) from malicious or unintended input manipulations. • Prompt Guard uses an 86M parameter classifier model that has been trained on a large dataset of attacks and prompts found on the web. Prompt Guard can categorize a prompt into three different categories: "Jailbreak", "Injection" or "Benign".

Auditing tools • https://huggingface.co/meta-llama/Prompt-Guard-86M

Auditing tools • Llama Guard 3 refers to a security
tool or strategy designed for guarding large language models like Meta’s LLaMA against potential vulnerabilities and adversarial attacks. • Llama Guard 3 offers a robust and adaptable solution to protect LLMs against Prompt Injection and Jailbreak attacks. By combining advanced filtering, normalization, and monitoring techniques.

Auditing tools • Dynamic Input Filtering • Prompt Normalization and
Contextualization • Secure Response Policy • Active Monitoring and Automatic Response

Auditing tools • https://huggingface.co/spaces/schroneko/meta-llama- Llama-Guard-3-8B-INT8

Auditing tools • S1: Violent Crimes • S2: Non-Violent Crimes
• S3: Sex-Related Crimes • S4: Child Sexual Exploitation • +S5: Defamation (New) • S6: Specialized Advice • S7: Privacy • S8: Intellectual Property • S9: Indiscriminate Weapons • S10: Hate • S11: Suicide & Self-Harm • S12: Sexual Content • S13: Elections • S14: Code Interpreter Abuse Introducing v0.5 of the AI Safety Benchmark from MLCommons

Text attack https://arxiv.org/pdf/2005.05909

Text attack from textattack.models.wrappers import HuggingFaceModelWrapper from transformers import AutoModelForSequenceClassification,
AutoTokenizer # Load pre-trained sentiment analysis model from Hugging Face model = AutoModelForSequenceClassification.from_pretrained("textattack/bert -base-uncased-imdb") tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-imdb") # Wrap the model for TextAttack model_wrapper = HuggingFaceModelWrapper(model, tokenizer) https://github.com/QData/TextAttack

Text attack https://github.com/QData/TextAttack from textattack.attack_recipes import TextFoolerJin2019 # Initialize the
attack with the TextFooler recipe attack = TextFoolerJin2019.build(model_wrapper)

Text attack https://github.com/QData/TextAttack # Example text for sentiment analysis (a
positive review) text = "I absolutely loved this movie! The plot was thrilling, and the acting was top-notch." # Apply the attack adversarial_examples = attack.attack([text]) print(adversarial_examples)

Text attack https://github.com/QData/TextAttack Original Text: "I absolutely loved this movie!
The plot was thrilling, and the acting was top-notch." Adversarial Text: "I completely liked this film! The storyline was gripping, and the performance was outstanding."

Text attack https://github.com/QData/TextAttack from textattack.augmentation import WordNetAugmenter # Use WordNet-based
augmentation to create adversarial examples augmenter = WordNetAugmenter() # Augment the training data with adversarial examples augmented_texts = augmenter.augment(text) print(augmented_texts)

Resources

Resources • github.com/greshake/llm-security • github.com/corca-ai/awesome-llm-security • github.com/facebookresearch/PurpleLlama • github.com/protectai/llm-guard •
github.com/cckuailong/awesome-gpt-security • github.com/jedi4ever/learning-llms-and-genai-for-dev-sec-ops • github.com/Hannibal046/Awesome-LLM

Resources • https://cloudsecurityalliance.org/artifacts/security-implications-of -chatgpt • https://www.nist.gov/itl/ai-risk-management-framework • https://blog.google/technology/safety-security/introducing-googl es-secure-ai-framework •
https://owasp.org/www-project-top-10-for-large-language-model -applications/

Security and auditing tools in Large Language M...

Security and auditing tools in Large Language Models (LLM)

More Decks by jmortegac

Other Decks in Technology

Featured

Transcript