5 min read · pii-detection · machine-learning · regex · security

PII Detection: Why Regex Alone Isn't Enough

Regular expressions catch the obvious patterns, but real-world PII comes in formats that regex can't handle. Here's why ML-powered detection matters.

Regular expressions are the workhorse of pattern matching. For well-defined formats like credit card numbers (typically 16 digits, validated with a Luhn checksum) or US Social Security numbers (XXX-XX-XXXX), regex works well, although the checksum itself has to be verified in code after the match. But PII in the real world is messy.

Where Regex Works

Regex excels at structured patterns with predictable formats:

  • SSNs: \d{3}-\d{2}-\d{4}
  • Credit cards (16-digit): \d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}
  • AWS access keys: AKIA[0-9A-Z]{16}
  • Email addresses: Standard RFC 5322 pattern
  • Phone numbers: Various formats with country codes

For these patterns, regex is fast, deterministic, and reliable. AxSentinel uses regex as its first-pass scanner — it runs in microseconds and catches the most common patterns with near-perfect accuracy.
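A first-pass scanner of this kind can be sketched in a few lines of Python. This is a minimal illustration, not AxSentinel's implementation: the names PATTERNS, luhn_valid, and regex_scan are assumptions made for the example, and the patterns are the ones listed above with word boundaries added. It also shows that the Luhn checksum is a separate step in code, since a regex can only check the shape of a number.

```python
import re

# Patterns taken from the list above; dictionary keys are illustrative labels.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0


def regex_scan(text: str) -> list[tuple[str, str]]:
    """Return (label, matched_text) pairs for every pattern hit."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            # The regex only checks the shape; the checksum needs code.
            if label == "credit_card" and not luhn_valid(value):
                continue
            findings.append((label, value))
    return findings
```

Filtering card-shaped matches through the checksum removes many random 16-digit strings (order IDs, timestamps) that would otherwise be reported.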

Where Regex Fails

The problem is that much of the sensitive data developers handle doesn't follow a standard format:

Custom API Keys

Every SaaS company invents its own key format:

  • sk_live_51H7... (Stripe)
  • xoxb-... (Slack bots)
  • ghp_... (GitHub)
  • rk_live_... (Stripe restricted keys)
  • company_prod_ak_8f3j2k... (internal APIs)

You can't write a regex for every possible API key format. New services launch daily with new formats.
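To make the limitation concrete, here is a small sketch of a hand-maintained prefix list. The names KNOWN_KEY_PREFIXES and find_known_keys are hypothetical, and the internal prefix is deliberately left out of the list to stand in for a format nobody has written a rule for yet.

```python
import re

# Hand-maintained list of publicly documented prefixes from the examples above.
# The internal "company_prod_ak_" format is intentionally absent.
KNOWN_KEY_PREFIXES = re.compile(
    r"\b(?:sk_live_|xoxb-|ghp_|rk_live_)[A-Za-z0-9_-]+"
)


def find_known_keys(text: str) -> list[str]:
    """Return every token that starts with a prefix on the list."""
    return [m.group() for m in KNOWN_KEY_PREFIXES.finditer(text)]


known = find_known_keys('token = "ghp_abc123"')
unknown = find_known_keys('key = "company_prod_ak_8f3j2k"')
```

Here known contains the GitHub-style token, while unknown is empty: the internal key passes through until someone adds a rule for it.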

Unstructured PII

Personal data in code often appears in unstructured contexts:

# This is fine
user_name = "test_user_123"

# This contains real PII
user_name = "John Smith"  # Customer from ticket #4521
db_query = "SELECT * FROM users WHERE name = 'Sarah Johnson'"
error_msg = f"Payment failed for customer {customer.full_name}"

Regex can't distinguish between a test string and a real person's name without understanding context.
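A quick experiment illustrates the problem. The pattern below is a naive "two capitalized words" matcher, chosen here only as an example of a typical first attempt; NAME_LIKE and the sample strings are made up for the illustration.

```python
import re

# Naive name matcher: two capitalized words separated by a space.
NAME_LIKE = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

samples = [
    'user_name = "John Smith"',     # a real customer name
    'title = "Payment Failed"',     # a UI string, not a person
    'user_name = "test_user_123"',  # a placeholder
]

matches = [NAME_LIKE.findall(s) for s in samples]
```

The pattern reports both "John Smith" and "Payment Failed" and has no way to tell them apart; deciding which one is a person requires the surrounding context.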

Encoded and Obfuscated Secrets

Developers sometimes encode credentials:

{
  "auth": "Basic am9obkBhY21lLmNvbTpQQHNzdzByZDEyMw=="
}

That Base64 string decodes to john@acme.com:P@ssw0rd123. A regex written to spot plain-text emails or passwords won't flag it, but an ML model trained on this pattern can.
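Decoding the value shows what it hides. The sketch below uses Python's standard base64 module; the BASIC_AUTH pattern and the decode_basic_auth helper are hypothetical names for this example, and it only handles the HTTP Basic scheme shown above, not encoded secrets in general.

```python
import base64
import binascii
import re

# Matches the "Basic <base64>" form used by HTTP Basic authentication.
BASIC_AUTH = re.compile(r"Basic\s+([A-Za-z0-9+/]+={0,2})")


def decode_basic_auth(text: str) -> list[str]:
    """Decode every `Basic <base64>` value found in `text`."""
    decoded = []
    for match in BASIC_AUTH.finditer(text):
        try:
            raw = base64.b64decode(match.group(1), validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64 text, skip it
    return decoded


config = '{"auth": "Basic am9obkBhY21lLmNvbTpQQHNzdzByZDEyMw=="}'
credentials = decode_basic_auth(config)
```

After this runs, credentials holds the single string john@acme.com:P@ssw0rd123. A rule like this only works when the encoding scheme is known in advance, which is the same limitation as any other hand-written pattern.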

How ML Detection Works

Machine learning models for PII detection work differently from regex. Instead of matching exact patterns, they learn to recognize the context around sensitive data:

  1. Token classification: the model processes text token by token and labels each token as O (outside any sensitive span), B-SECRET (beginning of a secret), I-SECRET (inside a secret), B-PII, or I-PII
  2. Context awareness: the model learns that tokens following words like "password", "key", or "ssn" are likely to be sensitive
  3. Pattern generalization: after seeing thousands of API key formats, the model can often recognize formats it has never seen before
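The labels in step 1 follow the common BIO tagging scheme, and a scanner has to turn the per-token labels back into spans it can report. The function below is a minimal sketch of that post-processing step. The name bio_to_spans and the example tokens and labels are illustrative assumptions; a real model would produce the labels, and its tokenizer would split text differently.

```python
def bio_to_spans(tokens: list[str], labels: list[str]) -> list[tuple[str, str]]:
    """Group BIO-labelled tokens into (entity_type, text) spans."""
    spans = []
    current_type = None
    current_tokens = []

    def flush():
        # Close the span that is currently open, if any.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_type = label[2:]
            current_tokens = [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            flush()
    flush()
    return spans


tokens = ["password", "=", "hunter2", "for", "Sarah", "Johnson"]
labels = ["O", "O", "B-SECRET", "O", "B-PII", "I-PII"]
spans = bio_to_spans(tokens, labels)
```

For this hand-labelled input, spans contains one SECRET span ("hunter2") and one PII span ("Sarah Johnson").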

AxSentinel's ML model is a 6.9M-parameter transformer trained on synthetic security data covering 30+ secret types and 14 PII categories. It runs locally in a quantized GGUF format and uses just 10 MB of memory.

The Best Approach: Both

AxSentinel uses a two-tier scanning strategy:

  1. Regex first: instant, deterministic, and well suited to known patterns (SSNs, credit cards, AWS keys, emails, phone numbers)
  2. ML second: catches much of what regex misses (custom tokens, unstructured PII, encoded secrets, context-dependent detection)

The regex pass runs in microseconds. The ML pass adds a few milliseconds. Together, they catch significantly more than either approach alone.
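The control flow of a two-tier scan can be sketched as follows. This is an illustration under stated assumptions, not AxSentinel's code: Finding, regex_pass, overlaps, and scan are hypothetical names, and fake_ml_pass is a hard-coded stand-in for a model so the example runs without one. Calling scan without an ML pass loosely mirrors a regex-only mode.

```python
import re
from typing import Callable, Optional

Finding = tuple[int, int, str]  # (start, end, label)

REGEX_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "AWS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def regex_pass(text: str) -> list[Finding]:
    """Tier 1: deterministic matches for known formats."""
    return [
        (m.start(), m.end(), label)
        for label, pattern in REGEX_PATTERNS.items()
        for m in pattern.finditer(text)
    ]


def overlaps(a: Finding, b: Finding) -> bool:
    """True if the two character ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def scan(text: str,
         ml_pass: Optional[Callable[[str], list[Finding]]] = None) -> list[Finding]:
    """Run the regex tier, then add ML findings that regex did not cover."""
    findings = regex_pass(text)
    if ml_pass is not None:
        for candidate in ml_pass(text):
            if not any(overlaps(candidate, f) for f in findings):
                findings.append(candidate)
    return sorted(findings)


def fake_ml_pass(text: str) -> list[Finding]:
    """Hypothetical stand-in for a model: flags one hard-coded internal key."""
    token = "company_prod_ak_8f3j2k"
    start = text.find(token)
    return [(start, start + len(token), "SECRET")] if start != -1 else []


text = 'ssn = "123-45-6789"; key = "company_prod_ak_8f3j2k"'
regex_only = scan(text)
combined = scan(text, fake_ml_pass)
```

In this sketch regex findings take priority on overlap, so a span already matched by a deterministic pattern keeps its regex label and the second tier only contributes what the first one missed.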

Performance

Method        Speed      Coverage
Regex only    ~0.1 ms    Known patterns only
ML only       ~5 ms      Broad but slower
Regex + ML    ~5 ms      Maximum coverage

The --fast flag gives you regex-only scanning for CI/CD pipelines where speed is critical. For interactive use (proxy mode), the full regex+ML pipeline runs in single-digit milliseconds — unnoticeable to the developer.
