AI Credential Stuffing: How It Works & How to Stop It

Picture a burglar at your front door. The old-school version tries every key on a giant ring — clink, clink, clink — until one turns. He’s loud, he’s slow, and your dog hears him coming. For about fifteen years, that was credential stuffing: dumb, noisy, automated, and easy to swat away with a rate limit and a blocklist of bad IP addresses.

Now picture a different burglar. This one read your diary first. He knows your dog’s name, your anniversary, the year you graduated, and the fact that you always put a ! at the end of your passwords because a website once forced you to. He doesn’t try a million keys; he tries the forty keys most likely to work, arriving from your neighbour’s Wi-Fi so the dog stays quiet. That’s the burglar we’re dealing with. The diary is a 16-billion-credential data breach, the brain is a locally run large language model, and the front door is your company’s API.

I’m Amita, and I’ve spent a good chunk of my career living at the intersection of deep learning and the messy real world. What follows is the clearest map I can draw of how generative AI rewired an old attack — and, more importantly, how we close the door again. Let me walk you through it the way I’d explain it to a sharp friend over chai: from first principles, no hand-waving.

The Ground Floor: Why the Old Defenses Stopped Working

Let’s define our terms before we build anything. Credential stuffing is when an attacker takes username-password pairs leaked from one site and tries them on another, betting that people reuse passwords. (They do. You do. I have, and I’m ashamed of it.) It works because of a brutal set of statistics: stolen credentials are the initial way in for 49% of all data breaches and show up in 86% of web-application compromises, and Akamai’s telemetry has clocked over 26 billion stuffing attempts per month. This is not a niche threat; it is the weather.

For years, defenders treated this as a volume problem. Too many login attempts from one IP address? Throttle it. IP on a known-bad list? Block it. The mental model was a fire hose — just kink the hose. The trouble is that the modern attacker no longer uses one fat hose. They use ten thousand garden taps, each one a real residential internet connection (a “residential proxy”), each one whispering a handful of highly probable guesses. There’s no spike to detect, no single IP to ban. The fire-hose defense is fighting a war that ended.

So here’s the question that organizes everything that follows, and the one I want you to hold onto: if the attacker’s guesses are syntactically valid, low-volume, and arrive from clean IP addresses indistinguishable from your real customers — what is actually left to detect? Hold that thought. The answer turns out to be wonderfully sneaky, and we’ll get there. First, we have to understand the brain.

Building the Intuition: The Three-Act Structure of an AI Attack

The attack breaks naturally into three acts. I find it helps to think of them as read the diary, learn the handwriting, try the door.

Act One — Reading the Diary (Agentic Ingestion)

A modern breach dump is not a tidy spreadsheet. It’s a multi-terabyte landfill of .sql files, random PDFs, HR folders, and financial ledgers, often dribbled out over the slow, anonymising Tor network. A human analyst would drown in it.

So the attacker doesn’t use a human. They point an agentic LLM — a language model wired up to run commands and make its own plan — at the landfill. (Think of it as giving the model hands, not just a mouth.) The agent walks the directory tree, flags the high-value folders, and writes itself a map — a canonical index file, a TREE.md of sorts. Then it only digs where the treasure is, extracting emails, passwords, national ID numbers, and passport details with targeted pattern-matching.

Why does this matter? Because of a constraint you’ve probably bumped into yourself if you’ve used ChatGPT on a long document: the context window, the finite amount of text a model can “see” at once. You can’t paste a terabyte into a prompt. By having the agent scope the data first — map it, then sip from it folder by folder — the attacker sidesteps that limit entirely. It’s the difference between trying to memorise a library and learning to read the card catalogue.

Act Two — Learning the Handwriting (The Password-Modeling Brain)

This is the heart of the matter, and it’s where my deep-learning self gets excited, so bear with me.

Old password-cracking tools used rules: take a dictionary word, capitalise the first letter, swap o for 0, slap a year on the end. Rigid. It can’t imagine a password it wasn’t told to imagine. It has no sense that Mumbai@2019 is far more likely than Mumbai@1067.

The new approach treats a password the way a language model treats a sentence. The flagship example is PassGPT (arXiv:2306.01545), built on the GPT-2 architecture. Here’s the one idea you need: it’s autoregressive, which is a fancy word for “it predicts the next character given all the characters so far.” Exactly like the autocomplete on your phone, but for passwords instead of text messages.

The maths looks intimidating and is actually friendly. If a password W is made of characters c₁, c₂, …, cₙ, the model assigns it a probability:

Pr(W) = Pr(c₁) × Pr(c₂ | c₁) × Pr(c₃ | c₁, c₂) × … × Pr(cₙ | c₁, …, cₙ₋₁)

In plain English: the probability of the whole password is the probability of each character, given everything that came before it, all multiplied together. This is just the chain rule of probability — the same rule that lets GPT write an essay one word at a time. Nothing exotic; it’s the bedrock of every language model on Earth.

Here’s a tiny, deliberately harmless illustration of how scoring a sequence works — the identical idea any language model uses, applied to a toy three-letter alphabet:

# Imagine a toy model that, after seeing some characters,
# tells us the probability of each possible NEXT character.
# (A real model learns these from data; here we just hard-code them.)
next_char_prob = {
    "":    {"c": 0.6, "a": 0.3, "t": 0.1},  # at the start
    "c":   {"a": 0.7, "c": 0.2, "t": 0.1},  # after "c"
    "ca":  {"t": 0.8, "a": 0.1, "c": 0.1},  # after "ca"
    "cc":  {"a": 0.5, "c": 0.3, "t": 0.2},  # after "cc" (needed to score "cca")
}

def score(word):
    """Probability of a word = product of each next-char probability."""
    p, context = 1.0, ""
    for ch in word:
        p *= next_char_prob[context][ch]   # chain rule, one factor at a time
        context += ch
    return p

print(score("cat"))   # 0.6 * 0.7 * 0.8, about 0.336 -> very likely
print(score("cca"))   # 0.6 * 0.2 * 0.5, about 0.06  -> far less likely
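
One practical wrinkle: multiplying many numbers below 1 shrinks toward zero fast, so implementations usually add log-probabilities instead of multiplying raw ones. Here is a minimal sketch of that variant on the same toy table. The name log_score is mine, purely for illustration, and the "treat a missing entry as impossible" rule is an assumption of this toy, not something a real model does:

```python
import math

# Same hard-coded toy table; a real model would learn these values from data.
next_char_prob = {
    "":   {"c": 0.6, "a": 0.3, "t": 0.1},
    "c":  {"a": 0.7, "c": 0.2, "t": 0.1},
    "ca": {"t": 0.8, "a": 0.1, "c": 0.1},
}

def log_score(word):
    """Sum of log-probabilities, which equals log(Pr(word)) by the chain rule."""
    total, context = 0.0, ""
    for ch in word:
        table = next_char_prob.get(context)
        if table is None or ch not in table:
            return float("-inf")   # toy assumption: no entry means "impossible"
        total += math.log(table[ch])
        context += ch
    return total

print(log_score("cat"))            # a small negative number
print(math.exp(log_score("cat")))  # exponentiate to recover roughly 0.336
```

Sums of logs stay well-behaved even for long sequences, which is why you will rarely see the raw product in real code.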

The model doesn’t store a list of passwords. It stores a sense of plausibility. That’s the leap. And once you have a probability for every possible string, you can do something clever: generate candidates by sampling from the model, or search for the most probable strings first and work your way down the ranking. You’re no longer guessing randomly; you’re guessing in order of likelihood.

From there, a small zoo of refinements followed, each one a single good idea layered on top:

PagPassGPT feeds the model a structural template as context — something like L4N3S1, meaning “four letters, three numbers, one special character.” Now the model isn’t guessing in the dark; it’s filling in a known shape. Conditioning on that pattern P changes the maths only slightly: Pr(t₁,…,tₙ | P) = ∏ Pr(tᵢ | t₁,…,tᵢ₋₁, P) — same chain rule, now whispering “here’s the shape I’m looking for” at every step. Paired with a duplicate-avoiding generator, its authors report a 27.5% higher hit rate than vanilla PassGPT.
SOPGesGPT forces the model to emit candidates in strict descending probability order, so it never wastes a guess on a duplicate. Its authors report a 35% cover rate, which they put at an eye-watering 421% higher than an older GAN-based approach.
PassLLM is the personalised one, and the scariest. It fine-tunes a small (sub-7-billion-parameter) open model — a Mistral or a Qwen — on a single target’s personal data, using LoRA (Low-Rank Adaptation), a technique that lets you adapt a giant model by training only a tiny sliver of new weights. That’s why it runs on a gaming laptop rather than a data centre. The clever bit: the loss function is masked so the model is graded only on getting the password right, given the person’s name and birthday — it doesn’t waste capacity learning anything else. The published result: cracking 12.5%–31.6% of typical targets within just 100 guesses.

Let that sink in. A hundred guesses. The fire-hose detector never even wakes up.
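
To see why LoRA fits on modest hardware, it helps to count weights. For a single weight matrix of shape d × k, full fine-tuning updates d·k values, while a rank-r LoRA adapter trains two thin matrices totalling r·(d + k). The sizes below (a 4096 × 4096 matrix, rank 8) are illustrative assumptions of mine, not figures from the PassLLM paper:

```python
def full_params(d: int, k: int) -> int:
    """Trainable values when the whole d x k matrix is updated."""
    return d * k

def lora_params(d: int, k: int, r: int) -> int:
    """Trainable values for a rank-r adapter: a (d x r) and an (r x k) matrix."""
    return r * (d + k)

d, k, r = 4096, 4096, 8          # hypothetical layer size and rank
full = full_params(d, k)         # 16,777,216
lora = lora_params(d, k, r)      # 65,536
print(full, lora, f"{lora / full:.4%}")   # the adapter is about 0.39% of the matrix
```

Under those assumed sizes the adapter is a fraction of a percent of the original matrix, which is the "tiny sliver" in concrete terms.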

(If the LoRA / autoregressive vocabulary is new to you and you want the proper foundation rather than my coffee-shop version, Deep Learning with Python by François Chollet is still the gentlest on-ramp I know — I recommend it to my own students.)

Act Three — Trying the Door (API Exploitation)

Armed with a short, sorted list of likely passwords, the attacker loads it into tooling and routes each attempt through a different residential proxy, with behavioural mimicry layered on top: reinforcement-learning models that fake human typing rhythm, mouse jitter, and phone-orientation wobble. The bot looks human because it learned to look human.

And the favourite target is the API, the machine-to-machine doorway behind your mobile app and your integration partners. Why? Because the classic anti-bot defenses live in the browser: invisible CAPTCHAs, JavaScript challenges, cookies. API clients such as mobile apps and partner integrations generally don’t execute JavaScript or keep a browser-style cookie jar, so most of the browser-side defensive toolkit simply cannot be deployed there. The API is a door with no peephole.

The Deep Dive: The Attacker Lives in a Glass House

Here’s the twist I genuinely enjoyed. To keep their pipeline private, attackers run these open-weight models locally, on serving runtimes like Ollama and vLLM. And those runtimes have their own gaping holes — which means defenders can hunt the hunters.

The marquee example is CVE-2026-7482, nicknamed Bleeding Llama, a critical (9.1 CVSS) flaw in Ollama. The setup: developers often run Ollama with OLLAMA_HOST=0.0.0.0, which is a quiet way of saying “listen on every network interface, including any that face the open internet.” Unless a firewall or authenticating proxy sits in front, port 11434 is now exposed to the world with no authentication.
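
If you want to check your own configuration, here is a minimal sketch of a helper that flags wildcard bind addresses. The function name is hypothetical, and the value shapes it accepts (bare address, host:port, an optional scheme, bracketed IPv6) are my assumptions about what such a setting might contain, not a statement of Ollama’s parsing rules:

```python
import ipaddress

def binds_all_interfaces(host_value: str) -> bool:
    """True if a bind-address setting names a wildcard address (0.0.0.0 or ::)."""
    value = host_value.strip()
    if "://" in value:                 # drop an optional scheme such as http://
        value = value.split("://", 1)[1]
    value = value.split("/", 1)[0]     # drop any trailing path
    if value.startswith("["):          # bracketed IPv6, e.g. [::]:11434
        host = value[1:].split("]", 1)[0]
    elif value.count(":") == 1:        # host:port
        host = value.rsplit(":", 1)[0]
    else:                              # bare IPv4, bare IPv6, or a hostname
        host = value
    try:
        return ipaddress.ip_address(host).is_unspecified
    except ValueError:
        return False                   # a hostname or empty value, not a wildcard literal

print(binds_all_interfaces("0.0.0.0:11434"))    # True
print(binds_all_interfaces("127.0.0.1:11434"))  # False
```

You could feed it os.environ.get("OLLAMA_HOST", "") in a startup check. A False result only means the literal is not a wildcard; it says nothing about firewalls or port forwarding.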

The vulnerability itself is a heap out-of-bounds read — a class of bug worth understanding because it’s everywhere. Think of your computer’s memory as a row of numbered lockers, and a program is supposed to only open the lockers it was assigned. A malicious model file (in the GGUF format) declares “my tensor is this big” while actually providing far less data. The server, trusting the label, keeps reading past its own locker into the neighbours’ — and whatever happened to be sitting in those neighbouring lockers (API keys, environment variables, other users’ prompts) gets scooped up and handed back. The attacker reads memory they were never allowed to touch. It’s the digital equivalent of asking for a one-page printout and the office printer accidentally feeding it the last person’s confidential document too.
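
The defensive habit that prevents this class of bug is simple to state: never trust a declared length until you have checked it against the bytes you actually hold. Below is a minimal sketch using a toy record layout I made up for illustration (a 4-byte little-endian length followed by the payload). It is not the GGUF format and not Ollama’s code:

```python
import struct

def read_tensor(blob: bytes) -> bytes:
    """Parse a toy record: 4-byte little-endian length, then that many payload bytes."""
    if len(blob) < 4:
        raise ValueError("truncated header")
    (declared,) = struct.unpack_from("<I", blob, 0)
    available = len(blob) - 4
    if declared > available:
        # The label claims more data than the file contains: refuse to read on.
        raise ValueError(f"declared {declared} bytes, only {available} present")
    return blob[4:4 + declared]

honest = struct.pack("<I", 3) + b"abc"
lying = struct.pack("<I", 1000) + b"abc"
print(read_tensor(honest))   # b'abc'
```

Python slices are memory-safe, so this toy cannot itself leak a neighbour’s locker; the explicit check mirrors what a parser working with raw buffers has to do by hand.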

There’s also CVE-2025-59425 in vLLM: a timing attack. The server checked API keys with an ordinary string comparison, which — and this is the subtle, beautiful part — bails out early at the first wrong character. So a key starting with the correct character takes microscopically longer to reject than one that’s wrong from the start. Measure those microseconds enough times and you can reconstruct the real key one character at a time. The fix is one of my favourite lessons in all of security, and I’ll show you the code in a moment because it’s so clean.
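
To make the leak visible without a stopwatch, this sketch counts character comparisons instead of measuring microseconds. It imitates an early-exit equality check; the key string is a made-up placeholder, and the step count is only a stand-in for elapsed time:

```python
def naive_equal_steps(provided: str, real: str) -> int:
    """How many character comparisons an early-exit equality check performs."""
    steps = 0
    for a, b in zip(provided, real):
        steps += 1
        if a != b:
            break          # the early exit is the source of the leak
    return steps

real = "sk-7f3a"           # hypothetical key, for illustration only
print(naive_equal_steps("xx-0000", real))   # 1: wrong from the first character
print(naive_equal_steps("sk-0000", real))   # 4: three correct characters, then a miss
```

The amount of work grows with the length of the correct prefix, and that dependency is exactly what a constant-time comparison removes.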

There’s even a third, nastier one — CVE-2026-22778, a heap overflow in vLLM’s video-decoding path that escalates all the way to remote code execution on the GPU. The lesson across all three: the moment you run AI infrastructure, that infrastructure is your attack surface. Treat your Ollama and vLLM servers with the same paranoia as your database.

How We Fight Back: Reading the Body, Not the Costume

Now, finally, the answer to the question I asked you to hold onto. If the guesses are valid, low-volume, and arrive from clean IPs, what’s left to detect?

Answer: the body underneath the costume. A bot can fake its User-Agent header — the costume it claims to wear. But it cannot easily fake the deep, low-level signature of how its actual operating system and network library talk. That’s the insight behind JA4 fingerprinting.

JA4 and JA4T — Catching the Impossible Mismatch

When any client connects over HTTPS, it sends a ClientHello packet: the opening message of the TLS handshake, which lists, among other things, which encryption ciphers and extensions it supports. JA4 reads that handshake and distils it into a compact, human-readable fingerprint. Crucially, it sorts the cipher and extension lists before hashing them, so the fingerprint stays stable even when a browser randomises the order in which it sends its extensions (the change that undermined JA4’s predecessor, JA3, which hashed values in the order they arrived). The payoff: JA4 can tell whether the request truly came from Chrome, or from a Python script, a Go program, or curl wearing a Chrome mask, because those libraries have distinct, hard-to-fake handshakes.
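
The sort-before-hash idea is easy to demonstrate. This is a minimal sketch of the principle only: the function names and the list of values are mine, and the real JA4 string has a more detailed layout than a single truncated hash:

```python
import hashlib

def order_sensitive_fp(values: list[str]) -> str:
    """Hash the values in the order received (sensitive to shuffling)."""
    return hashlib.sha256(",".join(values).encode()).hexdigest()[:12]

def sorted_fp(values: list[str]) -> str:
    """Sort first, then hash (stable under any reordering)."""
    return hashlib.sha256(",".join(sorted(values)).encode()).hexdigest()[:12]

# Hypothetical extension codes; the second list is the same set in another order.
extensions = ["0000", "0017", "ff01", "000a", "000b", "0023"]
reordered = list(reversed(extensions))

print(order_sensitive_fp(extensions) == order_sensitive_fp(reordered))  # False
print(sorted_fp(extensions) == sorted_fp(reordered))                    # True
```

Shuffling changes the first fingerprint and leaves the second untouched. To change the sorted fingerprint, a client has to change what it offers, not merely the order it offers it in.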

Its sibling JA4T does the same trick one layer deeper, on the raw TCP packet, reading parameters baked into the operating system’s networking kernel — window size, options ordering, maximum segment size. A normal Wi-Fi link shows an MSS of 1460; tunnel it through a VPN and the encapsulation overhead drops it to 1380 or lower. The kernel can’t easily lie about this.
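
The MSS numbers fall out of simple arithmetic: take the link’s MTU and subtract the IP and TCP headers, plus whatever a tunnel adds. A minimal sketch, assuming IPv4 with no header options; the 80-byte tunnel overhead is an illustrative value I picked to reproduce the 1380 figure, since real overhead varies by VPN protocol:

```python
IPV4_HEADER = 20   # bytes, without options
TCP_HEADER = 20    # bytes, without options

def expected_mss(mtu: int, tunnel_overhead: int = 0) -> int:
    """Largest TCP payload per packet for a given MTU and encapsulation overhead."""
    return mtu - tunnel_overhead - IPV4_HEADER - TCP_HEADER

print(expected_mss(1500))        # 1460: plain Ethernet or Wi-Fi
print(expected_mss(1500, 80))    # 1380: same link with an assumed 80-byte tunnel
```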

Put the two together and you can spot what defenders call an “impossible mismatch.” Here’s the logic, in defensive pseudocode you could actually adapt:

def is_impossible_client(claimed_user_agent, ja4_lib, ja4t_os):
    """Flag requests whose costume contradicts their body."""
    claims_iphone_safari = "iPhone" in claimed_user_agent

    # A real iPhone runs a Safari/Apple TLS stack on a Darwin kernel.
    # If the body says "Linux server kernel + Python library,"
    # the costume is a lie.
    # (ja4_lib and ja4t_os are labels looked up from the raw fingerprints.)
    if claims_iphone_safari and ja4t_os == "linux":
        return True
    if claims_iphone_safari and ja4_lib in {"python-requests", "go-http", "curl"}:
        return True
    return False

# A request claiming to be an iPhone, whose TCP fingerprint screams
# "Linux datacenter box running Python" -> dropped at the edge router.

The phone says it’s an iPhone in your hand; its network DNA says it’s a Linux box in a data centre. You don’t need to know the password it’s trying. You just close the connection. This is the defense that doesn’t care about clean IPs or low volume — it reads the body, not the behaviour.

Risk-Adaptive, Multi-Step Authentication

The second pillar is architectural: stop letting your API be a free password-validity oracle. A naïve login endpoint answers “is this password right?” instantly, every time. That’s a gift to an attacker — it turns your server into a high-speed cracking rig working on their behalf.

The fix is to decouple the steps: collect the identifier, then score the risk of the connection (using the JA4 signature, device stability, and “geo-velocity” — is this user in Delhi now and London ninety seconds ago?), then decide on a challenge. High risk triggers a step-up: a WebAuthn passkey, a hardware security key, a second factor. (This is exactly where a phishing-resistant hardware key like a YubiKey earns its keep — it’s the single highest-leverage thing an individual can buy to make their accounts un-stuffable, and I keep two on my own keyring.)
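
Geo-velocity is the easiest of those signals to sketch. The idea: compute the great-circle distance between two consecutive logins and ask whether any traveller could have covered it in the time elapsed. The function names and the 1,000 km/h ceiling (roughly airliner speed) are assumptions of mine, and real systems also have to allow for imprecise IP geolocation:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def implausible_travel(prev, curr, max_kmh=1000.0):
    """prev and curr are (lat, lon, unix_seconds) tuples for consecutive logins."""
    dist = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = max(curr[2] - prev[2], 1) / 3600   # clamp to avoid dividing by zero
    return dist / hours > max_kmh

london = (51.5074, -0.1278, 0)
delhi = (28.6139, 77.2090, 90)                 # ninety seconds later
print(implausible_travel(london, delhi))       # True
```

A True here would not block the login on its own; in the flow described above it would raise the risk score and trigger the step-up challenge.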

There’s also a quieter, elegant defense here: response-time padding to prevent user enumeration. If your server replies faster for a non-existent username than for a real-but-wrong-password one, an attacker can map out which accounts even exist — for free. The fix is to make every path take the same time. And this is the same family as the timing-attack fix from the vLLM CVE, so let me finally show you that gorgeous one-liner:

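import hashlib, hmac, secrets

def check_key_safely(provided_key: str, real_key: str) -> bool:
    # WRONG:  provided_key == real_key
    #   -> bails out at the first wrong character; leaks timing.
    #
    # RIGHT: compare_digest takes the SAME time regardless of
    #        where (or whether) the strings differ. No timing leak.
    # (Encoding to bytes avoids a TypeError on non-ASCII input.)
    return secrets.compare_digest(provided_key.encode(), real_key.encode())

def hash_pw(password: str) -> str:
    # Stand-in so the example runs; a real system would use a salted,
    # deliberately slow hash such as scrypt, bcrypt, or Argon2.
    return hashlib.sha256(password.encode()).hexdigest()

# Same principle for normalizing login latency: do the same amount
# of work whether or not the account exists, so the clock tells
# the attacker nothing.
DUMMY = hash_pw("decoy")  # always compare against *something*

def constant_time_login(username, password, user_db):
    stored = user_db.get(username)                # None if the user is unknown
    candidate = hash_pw(password)
    matches = hmac.compare_digest(candidate, DUMMY if stored is None else stored)
    # A match against the decoy must never count as a successful login.
    return matches and stored is not None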

secrets.compare_digest is the whole lesson of CVE-2025-59425 in a single function call: never let how long your code takes reveal what your code knows.
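
Response-time padding itself can be sketched in a few lines: run the handler, then sleep until a fixed minimum has elapsed, so fast and slow paths look alike from outside. The helper name and the floor value are hypothetical, and the floor must exceed your slowest legitimate path for the padding to hide anything:

```python
import time

def with_min_duration(func, floor_seconds, *args, **kwargs):
    """Call func, then wait so the whole call takes at least floor_seconds."""
    start = time.monotonic()
    try:
        return func(*args, **kwargs)
    finally:
        remaining = floor_seconds - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)   # pad the fast paths up to the floor

result = with_min_duration(lambda: "invalid credentials", 0.05)
print(result)
```

In an async server you would swap the blocking sleep for the framework’s non-blocking equivalent, so padded requests do not tie up a worker.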

WAAP, Degradation Mode, and Tokenization

Two more layers round out the architecture. A WAAP (Web Application and API Protection) platform sits in front of your APIs, learns your OpenAPI schema, and enforces strict request shapes — and under heavy attack it can flip into a degradation mode: tighten the bot-scoring thresholds, queue non-essential traffic, force resets on suspicious successful logins, and keep the lights on for legitimate users while the storm passes. Think of it as a lifeboat protocol for your login system.
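
“Enforcing strict request shapes” boils down to rejecting anything the schema did not ask for. Here is a minimal hand-rolled sketch for a login payload. The schema and function name are hypothetical, and a real WAAP would derive the rules from your OpenAPI document rather than from a hard-coded dict:

```python
LOGIN_SCHEMA = {"username": str, "password": str}   # hypothetical endpoint shape

def validate_shape(payload: dict, schema: dict = LOGIN_SCHEMA) -> list[str]:
    """Return a list of problems; an empty list means the payload matches."""
    errors = []
    for field in payload:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    for field, expected in schema.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            errors.append(f"wrong type for {field}")
    return errors

print(validate_shape({"username": "asha", "password": "hunter2"}))                    # []
print(validate_shape({"username": "asha", "password": "hunter2", "is_admin": True}))  # one error
```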

Finally, format-preserving tokenization attacks the problem at its root — the fuel. Remember, all of this starts with breached data being parseable enough to train a model. If your data lakes store not real emails and national IDs but structured, region-scoped surrogates — fake-but-realistically-shaped stand-ins — then even a breach yields data that can’t be stitched together across systems to build those personalised PassLLM attacks. You’re not just locking the door; you’re making sure the diary, if stolen, is written in a cipher.
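
To make “region-scoped surrogate” concrete, here is a minimal sketch that replaces an email with a realistically shaped stand-in derived from a per-region secret key. To be clear about what it is not: this is keyed pseudonymisation using HMAC, not true format-preserving encryption (standardised schemes such as NIST’s FF1 exist for that), and the function name, key values, and output shape are my own illustrative choices:

```python
import hashlib
import hmac

def surrogate_email(email: str, region_key: bytes) -> str:
    """Deterministic, email-shaped stand-in that differs per region key."""
    normalised = email.strip().lower().encode()
    digest = hmac.new(region_key, normalised, hashlib.sha256).hexdigest()
    return f"user-{digest[:16]}@example.invalid"

eu = surrogate_email("Asha@Example.com", b"eu-region-key")   # hypothetical keys
india = surrogate_email("asha@example.com", b"in-region-key")
print(eu)
print(india)   # same person, different surrogate: the two cannot be joined
```

Within one region the mapping is stable, so joins and analytics keep working; across regions the surrogates differ, which is what stops a stolen data lake from being stitched into a single personal profile.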

Real-World Grounding

If this feels abstract, it isn’t. Every time your bank texts you a code only when you log in from a new phone, that’s risk-adaptive step-up authentication. Every time a service makes you wait the same beat whether you typed a wrong password or a username that doesn’t exist, someone implemented timing normalisation. And the next time a headline announces “X billion passwords leaked” — now you know that the real danger isn’t the leak itself. It’s that somewhere, a small open-weight model is quietly reading that leak like a diary, learning the handwriting of human passwords, and getting ready to try just forty very good keys at your door.

The encouraging half of the story is that the defense doesn’t require matching the attacker’s AI with bigger AI. It requires reading what can’t be faked — the network DNA — and refusing to be a fast, honest oracle for guesses. That’s a cheaper game to win than the attacker’s, and that asymmetry is our friend.

Reflection Questions — Test Yourself

If you’ve followed me this far, see whether these make you pause (the good ones should):

The chain rule is everything. Re-derive in your own words why Pr(W) = ∏ Pr(cᵢ | c₁…cᵢ₋₁) means a password model can generate candidates in order of likelihood — and why that single property makes 100-guess attacks possible.
Why does JA4 survive what JA3 didn’t? If browsers can randomise the order of the values they send in the ClientHello, why does sorting before hashing defeat that, and what does the attacker have to do to beat the sorted version?
The oracle problem. Explain to a colleague why an API that instantly answers “wrong password” is more dangerous than one that’s merely slow — and connect that to both the login design and the vLLM timing CVE. They’re the same bug wearing two hats.
Follow the fuel. If tokenization makes breached data unusable for training, why isn’t it a silver bullet? What attacks does it not stop? (Hint: re-read Act One.)
The glass house. Attackers run local LLM servers with real vulnerabilities. If you were a defender, how would you turn the attacker’s own infrastructure into a detection signal?

Practical Next Steps

Run the safe code above. Type out the autoregressive score() snippet and watch how probability collapses as a sequence gets longer — it’ll make the chain rule muscle-memory.
Fingerprint yourself. Look up what your own browser’s JA4 string is, then compare it to one generated by a curl request. The difference is the defense.
Audit one thing today. Check whether any service you run binds Ollama or a database to 0.0.0.0. If it does, that’s your weekend.
Skim one primary source. The PassGPT paper (arXiv:2306.01545) is surprisingly readable, and if API defense is your world, Neil Madden’s API Security in Action is the most practical book on the shelf — I keep a copy within arm’s reach.

A note on recommendations: the books and hardware I mention above (Chollet’s Deep Learning with Python, Madden’s API Security in Action, and hardware security keys such as YubiKey) are things I genuinely use and recommend; some links in the published version of this article are Amazon affiliate links, meaning I may earn a small commission at no extra cost to you. It never changes what I recommend — only ever buy what’s useful to you.

About the Author

Dr. Amita Kapoor is an AI researcher, educator, and author who has spent over two decades teaching machines to learn and teaching humans to understand them. She is the author of several widely used books on deep learning and reinforcement learning, and a frequent voice on the practical, ethical, and security dimensions of artificial intelligence. Her work focuses on making complex AI accessible without sacrificing rigor — the same spirit you’ll find in this article.

She is the founder of NePeur, where she builds and advises on applied AI systems, and a co-founder of Retured, working at the intersection of sustainability and intelligent technology. When she isn’t untangling neural networks, she’s probably explaining one over a cup of chai — convinced, as ever, that anything worth knowing can be made clear enough for a curious twelve-year-old to grasp.
