Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Red Teaming Latent Spaces & Protecting LLM apps

Red Teaming Latent Spaces & Protecting LLM apps

Exploring security challenges in Large Language Models (LLMs) and AI engineering, referencing papers like the “HackAPrompt” (Sander Schulhoff et al.). Learn about attack vectors and exploitation methods, followed by security measures, and services in the Python ecosystem to counter these threats.

PyCon Austria 2025: https://pycon.pyug.at/talks/red-teaming-latent-spaces-protecting-llm-apps/

Raul Pino

April 07, 2025
Tweet

More Decks by Raul Pino

Other Decks in Technology

Transcript

  1. Agenda • Intro: ◦ LLM Apps ◦ Red Teaming •

    HackAPrompt Paper ◦ Ontology and generic concepts. • Attack Vectors or Vulnerabilities • Demos • Security and countermeasures • Takeaways & beyond
  2. About Me • Born in Venezuela. • +10 years of

    exp as Software Engineer & AI enthusiast (ML Eng recently). • Living in Chile. ◦ Halborn, Distro (YC S24), Elementus, uBiome, Groupon. • <3 AI, Coffee, Scuba Diving, …
  3. *** Predicts the next token. https://bbycroft.net/llm https://poloclub.github.io/transformer-explainer/ What is an

    LLM? Large Language Models (LLM): ChatGPT, Claude, Mistral, DeepSeek, …
  4. LLM App: Cursor, Github Copilot, … What is an LLM?

    LLM App? … but the most basic might be:
  5. What is an LLM App? RAG? LLM App: Cursor, Github

    Copilot, … *** Helpful assistant.
  6. • War games and military strategy exercises. • 1990s–2000s: expanded

    into cybersecurity, simulating cyber attacks. What is Red Teaming? A “red team” played the role of the enemy to test the defenses and strategy of the “blue team”.
  7. Ignore This Title and HackAPrompt! • A global prompt hacking

    competition (2023). • 2800 people from 50+ countries. • 600K+ adversarial prompts against three state-of-the art LLMs. Paper: https://arxiv.org/abs/2311.16119 Dataset: https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset Podcast: https://www.latent.space/p/learn-prompting
  8. Prompt Hacking =(Prompt Injection, Jailbreaking) Exploiting: 1. Predicts the next

    token. 2. Helpful assistant. Demos incoming, to the notebooks!
  9. Demos: Attack Vectors or Vulnerabilities • Instruction Manipulation Attacks: Directly

    Changing the Model’s Behavior • Contextual Exploits: Manipulating How the Model Understands the Input • Obfuscation & Encoding Attacks: Hiding Malicious Intent • Resource Exploitation Attacks: Abusing System Limitations …To the notebooks!
  10. • Documented 29 separate prompt hacking techniques. • Attackers iterated

    and refined their prompts over time. ◦ Lengthier attacks were initially successful but later optimized for brevity. • Models with higher verbosity (like ChatGPT) were harder to hack but still failed. HackAPrompt: Results and Key Insights “LLM security is in early stages, and just like human social engineering may not be 100% solvable, so too could prompt hacking prove to be an impossible problem; you can patch a software bug, but perhaps not a (neural) brain.” LLM Security is here to stay, we have a job!
  11. Demos: Red Teaming • Manual/standard prompts list • LLMs as

    adversarial prompts generator • LLMs as generator and evaluator • Using OS Library + Service …To the notebooks!
  12. Other Ontologies and Organizations • OWASP Top 10 for LLM

    Applications 2025 - https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025 • AI Incident Database - https://incidentdatabase.ai • AI Vulnerability Database - https://avidml.org
  13. Security & Countermeasures: Guardrails • Guardrails AI: https://www.guardrailsai.com/ ◦ Langchain

    helper: https://python.langchain.com/v0.1/docs/templates/guardrails-output-parser/ • Databricks Guardrails: https://www.databricks.com/blog/implementing-llm-guardrails-safe-and-respon sible-generative-ai-deployment-databricks • AWS Bedrock Guardrails: https://aws.amazon.com/bedrock/guardrails/
  14. • Multimodal challenges ◦ Old adversarial image attacks Takeaways &

    beyond https://www.youtube.com/watch?v=Klepca1Ny3c
  15. • Multimodal challenges ◦ The most creative I’ve found :’)

    Takeaways & beyond https://x.com/me_irl/status/1901497992865071428?s=46
  16. Takeaways & beyond • HackAPrompt (1.0) paper still relevant! •

    HackAPrompt 2.0 it’s coming https://www.hackaprompt.com/
  17. Resources • https://www.hackaprompt.com/ • Coursera - https://learn.deeplearning.ai/courses/red-teaming-llm-applications • https://arxiv.org/abs/2311.16119 •

    https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset • https://www.latent.space/p/learn-prompting • https://bbycroft.net/llm • https://poloclub.github.io/transformer-explainer/ • OWASP Top 10 for LLM Applications 2025 - https://genai.owasp.org/resource/owasp-top-10-for-llm-applications-2025 • AI Incident Database - https://incidentdatabase.ai • AI Vulnerability Database - https://avidml.org • https://www.giskard.ai/knowledge/how-to-implement-llm-as-a-judge-to-test-ai-agents-part-1 • https://www.giskard.ai/knowledge/how-to-implement-llm-as-a-judge-to-test-ai-agents-part-2 • https://arxiv.org/pdf/2410.08338 • https://tntattacks.github.io/ • https://github.com/p1nox/red-teaming-latent-spaces