What is Red Teaming?
TL;DR
Adversarial testing where experts attack AI models to surface vulnerabilities. OpenAI, Anthropic, and Google run internal teams plus external bounties; mandatory under the EU AI Act in 2026.
Red Teaming: Definition & Explanation
AI Red Teaming, borrowed from military and cyber-security tradition, is the practice of attacking an AI system from the adversary's perspective to discover failure modes. It became a central LLM discipline from 2023 onward and is now mandated for High-Risk AI under the EU AI Act, US NIST AI RMF, and the UK AISI process in 2026. Attack categories: (1) Jailbreak — bypass safety constraints (DAN, Many-Shot, Crescendo), (2) Prompt Injection — hijack via hidden instructions, (3) Data Extraction — surface memorized training data (membership inference), (4) Bias / Toxicity — coax discriminatory or harmful output, (5) Indirect Injection — attack via files, URLs, or images, (6) Multi-Modal Attacks — instructions hidden in image or audio, (7) Tool / Agent Abuse — getting agents to misuse tools, (8) Capability Discovery — pulling out undisclosed abilities. How it's done: (a) internal teams — OpenAI Red Team, Anthropic Frontier Red Team, DeepMind Safety, (b) external bounties — HackerOne, Bugcrowd AI bounties, (c) crowdsourced — DEF CON AI Village, Apollo Research evals, (d) automated red teaming — Garak (NVIDIA OSS), PyRIT (Microsoft), AI Red Teamer (Anthropic). Notable artifacts: GPT-4 Red Team Card (six months pre-launch), Claude Constitutional AI (red team-driven), Llama Guard (Meta), Gemini Red Team Report (Google DeepMind). 2026 trends: EU AI Act forces red-team reports for major providers, US Executive Order 14110 / NIST AI RMF alignment, automated red teaming with adversarial AI, capability evaluations for biological / chemical / cyber risks, and the Frontier AI Safety Commitments (OpenAI, Anthropic, Google, Microsoft, etc.).