Jailbreaking the Ghost: Anatomy of a Secure Agent

Date:

Resources: [Slides]

Abstract: The workshop treats LLM-generated code as untrusted by default. Instead of relying on prompt scanning, it assumes the agent can and will fail, then uses infrastructure controls to contain the blast radius.

The threat is Indirect Prompt Injection, not a user typing an obvious jailbreak prompt.

  • The user asks a benign question such as “Can you summarize the news on this website?”
  • The agent legitimately fetches external content from api.agentcon.local/news.
  • The fetched content hides malicious instructions that tell the agent to read the Kubernetes service account token and exfiltrate it.
  • The agent complies, generates Python, and executes it with exec().
  • We do not try to sanitize the internet. We let the agent fail and watch the platform controls catch the fallout.