AI Data Leakage & Privacy

ClaimCheck AI cleanup · spot the leaks · ship the redaction pipeline

Beginner · Lab 06 · ~45 min · Defence

How this lab works

You're the new privacy engineer at ClaimCheck AI, an insurance claims chatbot. It has 47k daily users, and for months it has been quietly bleeding customer PII into three sinks: the LLM context window, the application log, and the vector database. Below you'll see exactly what's leaking and where. Your job: enable the right redaction rules until all three channels read 0 items leaked, then run the privacy audit to capture the flag. Real teams ship this same pattern in production, often with Microsoft Presidio doing the detection.

1

Raw customer input

A user types into the ClaimCheck chat: "my SSN is …, card is …, please reissue".

2

Three sinks, no filter

That text flows straight into the LLM context, the log file, and the vector DB — verbatim.

3

Insert redactor

You build a rule-based redactor that runs before any sink. Each rule you enable rewrites matched items in place.

4

Run the audit

All three sinks show 0. Validator passes. Flag drops.
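
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than ClaimCheck's actual code: the rule names, the regex patterns, the placeholder tokens, and the `handle_message` helper are all assumptions made for the example, and the SSN and card values are obviously fake test data. The point it shows is the choke point: redaction runs once, before the text fans out to any sink.

```python
import re

# Hypothetical rule set. The patterns are illustrative, not exhaustive.
RULES = {
    "ssn": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "card": (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[CARD]"),
}


def redact(text: str, enabled: set[str]) -> str:
    """Rewrite every match of each enabled rule with its placeholder."""
    for name, (pattern, placeholder) in RULES.items():
        if name in enabled:
            text = pattern.sub(placeholder, text)
    return text


def handle_message(text: str, sinks: dict[str, list[str]], enabled: set[str]) -> None:
    """Redact once, then write the same safe string to every sink."""
    safe = redact(text, enabled)
    for records in sinks.values():
        records.append(safe)


# Fake sample input (invalid SSN, well-known test card number).
raw = "my SSN is 000-00-0000, card is 4111 1111 1111 1111, please reissue"

# Step 2: no rules enabled, so all three sinks receive the text verbatim.
leaky = {"llm_context": [], "app_log": [], "vector_db": []}
handle_message(raw, leaky, enabled=set())

# Step 3: both rules enabled, so every sink receives the redacted text.
clean = {"llm_context": [], "app_log": [], "vector_db": []}
handle_message(raw, clean, enabled={"ssn", "card"})

print(clean["app_log"][0])
# my SSN is [SSN], card is [CARD], please reissue
```

Keeping the redactor in front of the fan-out, rather than inside each sink's writer, means a newly added sink cannot bypass it by accident.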

LLM context window

RAG · system_prompt.txt — exposed

app.log (Datadog-shipped)

/var/log/claimcheck/app.log — exposed

Vector DB (Pinecone)

claimcheck-prod · 4 chunks — exposed
~/claimcheck $ privacy-audit
locked · all 3 channels must read 0
# Validator stays locked until every channel reads 0 PII items.
# Enable redaction rules above — counters drop in real time.
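
The locked/unlocked gate described in the terminal output could work roughly as follows. This is a sketch under stated assumptions: the real `privacy-audit` command's internals are not shown in the lab, so the `DETECTORS` patterns, the `count_pii` and `privacy_audit` functions, and the sample sink contents are all hypothetical.

```python
import re

# Hypothetical detectors; a real audit would cover more PII types.
DETECTORS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def count_pii(records: list[str]) -> int:
    """Count every detector match across all records of one sink."""
    return sum(len(p.findall(r)) for r in records for p in DETECTORS.values())


def privacy_audit(sinks: dict[str, list[str]]) -> tuple[bool, dict[str, int]]:
    """Return (unlocked, per-sink counts); unlocked only if every count is 0."""
    counts = {name: count_pii(records) for name, records in sinks.items()}
    return all(n == 0 for n in counts.values()), counts


# Fake sample data: two sinks are clean, the log still holds one raw SSN.
sinks = {
    "llm_context": ["my SSN is [SSN], please reissue"],
    "app_log": ["INFO user_message=my SSN is 000-00-0000"],
    "vector_db": ["card is [CARD]"],
}

unlocked, counts = privacy_audit(sinks)
print(unlocked, counts)
# False {'llm_context': 0, 'app_log': 1, 'vector_db': 0}
```

A single non-zero channel keeps the validator locked, which matches the "all 3 channels must read 0" rule in the prompt above.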
Flag captured

Paste the flag into the ILFAL platform to complete the lab. You've just shipped the same defence pattern that prevents 80% of LLM02 (Sensitive Information Disclosure in the OWASP Top 10 for LLM Applications) disclosures in production. Real teams pair this with Microsoft Presidio for NER-grade detection on top of regex.
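
For orientation, here is roughly what the Presidio pairing mentioned above might look like. It is a hedged sketch, not the lab's solution: the function name `redact_with_presidio` and the choice of entities are assumptions, and running it requires the `presidio-analyzer` and `presidio-anonymizer` packages plus a spaCy English model, none of which the lab installs for you.

```python
def redact_with_presidio(text: str) -> str:
    """Detect PII with Presidio's analyzer and replace it via the anonymizer."""
    # Imported lazily so this module still loads where Presidio is not installed.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    # The default analyzer combines pattern recognizers with spaCy-based NER.
    analyzer = AnalyzerEngine()
    findings = analyzer.analyze(
        text=text,
        entities=["US_SSN", "CREDIT_CARD", "PERSON"],  # assumed selection
        language="en",
    )

    # With no operators given, each finding is replaced by its entity type tag.
    anonymizer = AnonymizerEngine()
    return anonymizer.anonymize(text=text, analyzer_results=findings).text
```

In a long-running service you would typically build the two engines once at startup rather than per call, since loading the NLP model is slow.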