Red Team the Chatbot
Focus: Adversarial thinking and identifying "unknown unknown" vulnerabilities. This lab is a simplified version of adversarial testing.
Setup: Students will be given access to a simple, pre-built customer service chatbot for a fictional airline. The chatbot is programmed with basic information about flights, baggage policies, and refunds. It has also been given standard "guardrails" to prevent it from discussing inappropriate topics.
Tasks:
- Brainstorm Attack Vectors: In groups, students will brainstorm ways to "break" the chatbot. The goal is not just to get a wrong answer, but to make the AI fail in an interesting, damaging, or unexpected way. They should think like a frustrated customer, a bad actor, or just a curious user trying to find the limits.
- Execute the "Jailbreak": Students will interact with the chatbot, trying the prompts and strategies they brainstormed. They should document their attempts and the chatbot's responses, capturing screenshots of any notable failures.
- Analyze the Failure: For each successful "jailbreak" or significant failure, the group must analyze the risk it exposes.
- What is the potential harm (e.g., reputational damage, legal liability, providing dangerously incorrect information)?
- Does this failure represent an "unknown unknown" the developers likely missed?
- How could a human-in-the-loop (HITL) system have prevented this specific failure from reaching the customer? 17
- Propose a "Patch": For their most significant finding, each group will propose a specific change to the chatbot's design or guardrails to prevent that type of failure in the future.
Learning Outcomes:
- Develop an adversarial mindset for testing AI systems.
- Understand the limitations of pre-programmed safety guardrails.
- Appreciate the role of human-in-the-loop interventions as a critical risk management tool18.