I’ve been sitting in on many of the guest lectures for CS 294, “Understanding LLMs.” To my excitement, yesterday’s speaker was Boaz Barak, a Harvard professor who studies theoretical CS and machine learning foundations and whose work and lectures I often enjoy. He spoke about LLM watermarking schemes, and in particular a scheme for breaking more or less arbitrary watermarks. In this post, I’ll give a brief overview of the problem and their solution and offer a synthesis that, to me, gives a succinct way to think about what’s going on here.

AI “watermarking”

There are many settings in which we might like to ensure that some artifact is not generated by an AI model — for example, ensuring that photos came from cameras and not diffusion models or that a homework assignment was written by a student and not an LLM. As generative models get better, it’s going to continue to get harder to make this distinction. The problem statement here is basically: “how can we modify generative AI systems so that essentially all generated content can later be identified as AI-generated, in a way that (a) doesn’t make the model’s outputs significantly worse and (b) can’t be removed by a simple transformation of the output?”

To get a sense for what the landscape looks like here, it’s helpful to imagine some solutions. One naive solution for watermarking LLMs in particular might be: have the LLM only output paragraphs whose characters hash to strings ending in, say, 0. This could be achieved by just regenerating each paragraph until it satisfies this condition. A page of text generated by this model is going to be easily identifiable as such — you just compute the hash of each paragraph and check that every hash ends in 0, which would be unlikely for any text that wasn’t generated under that constraint — but the scheme is also very fragile: to break the condition, it’s enough to change one word in each paragraph.
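
To make this concrete, here’s roughly what that naive scheme looks like in code (a minimal sketch; `generate_paragraph` is a hypothetical stand-in for an LLM call):

```python
import hashlib

def generate_paragraph(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call that returns one paragraph."""
    raise NotImplementedError

def hash_ends_in_zero(text: str) -> bool:
    # Hash the paragraph's characters and check the final hex digit.
    return hashlib.sha256(text.encode()).hexdigest().endswith("0")

def watermarked_paragraph(prompt: str, max_tries: int = 200) -> str:
    # Regenerate until the condition holds (about 1-in-16 odds per try here).
    for _ in range(max_tries):
        paragraph = generate_paragraph(prompt)
        if hash_ends_in_zero(paragraph):
            return paragraph
    raise RuntimeError("gave up finding a satisfying paragraph")

def looks_ai_generated(paragraphs: list[str]) -> bool:
    # Detection: the chance that n independent human-written paragraphs
    # all pass the check is roughly (1/16)^n.
    return all(hash_ends_in_zero(p) for p in paragraphs)
```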

Fortunately, there are better ideas: a handful of papers, including one by a team from UMD, propose that an LLM might instead write with a statistical bias that applies at every word, but which is randomized enough to be undetectable to a human — for example, using the most recent token to seed an RNG that splits the dictionary into a “good” and a “bad” half, and weakly biasing the model to sample the next token from the “good” half. This results in human-readable text which can nonetheless be identified as AI-generated with high statistical power (just re-run the “good”-half construction for each token and count how often the text lands on the “good” side), and this “watermark” can’t be removed without changing a lot of the tokens.
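
Here’s a toy version of that kind of scheme (in the spirit of the UMD proposal, not their exact algorithm): `sample_next` takes the model’s next-token distribution as a dict of probabilities and nudges it toward the “good” half, and `watermark_zscore` measures how often a text lands on the “good” side. The vocabulary and all the parameters here are made up for illustration.

```python
import math
import random

VOCAB = [f"tok{i}" for i in range(50_000)]  # stand-in vocabulary

def good_half(prev_token: str) -> set[str]:
    # Seed an RNG with the previous token and split the vocabulary in half.
    rng = random.Random(prev_token)
    shuffled = VOCAB[:]
    rng.shuffle(shuffled)
    return set(shuffled[: len(shuffled) // 2])

def sample_next(prev_token: str, model_probs: dict[str, float], bias: float = 2.0) -> str:
    # Weakly upweight tokens in the "good" half, then sample.
    good = good_half(prev_token)
    weights = {t: p * (bias if t in good else 1.0) for t, p in model_probs.items()}
    total = sum(weights.values())
    r, acc = random.random() * total, 0.0
    for token, w in weights.items():
        acc += w
        if acc >= r:
            return token
    return token  # numerical edge case: fall back to the last token

def watermark_zscore(tokens: list[str]) -> float:
    # Unwatermarked text lands in the "good" half ~50% of the time, so a
    # large positive z-score is strong evidence of the watermark.
    n = len(tokens) - 1
    hits = sum(tok in good_half(prev) for prev, tok in zip(tokens, tokens[1:]))
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)
```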

A generic black-box attack on AI watermarks

A watermark on AI text is only good if it is hard to remove. In particular, there shouldn’t exist an “attack” process which (1) is easy to compute and (2) transforms “watermarked” text $T$ into “unwatermarked” text $T’$, where $T$ and $T’$ are equally good answers to the original prompt.

Remarkably, Boaz + coauthors propose a generic attack satisfying both (1) and (2). I’ll quickly describe it before offering a synthesis.

Their strategy is pretty simple: just make random small changes to the text, and eventually you’ll have changed enough that any watermark will be destroyed. For example, if you replace every other word with a random synonym, that would be enough to scramble a watermark planted by the UMD team.

…of course, swapping in random synonyms would lead to terrible text, so the authors assume the attacker also has a “quality oracle” that can tell you whether a swap has unintentionally made things worse. And since swapping out single words is a bit limited (sometimes you want to rephrase bigger chunks of text), the authors assume that instead of a thesaurus, you’re using a “perturbation oracle” that proposes random small modifications. In practice, both of these are language models: a powerful model to tell you if your text is worse, and a weak model to propose local modifications.

Their main result is that, given both a quality oracle and a perturbation oracle, you can remove the watermark by rejection sampling: at every step, use the perturbation oracle to generate a random modification, and then accept the change only if the quality oracle confirms that the change hasn’t made the overall text worse. This provably results in a random walk around the space of high-quality responses, and since only a small fraction of these high-quality responses are “watermarked,” after you follow this random walk for a while, you’re going to reach the promised land of watermark-free essays for your English class. Here’s a beautiful figure from their paper illustrating this scheme:

The authors test their attack against several proposed watermarking schemes and find that it consistently works with little degradation in text quality.
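
In code, the core loop is tiny. Here’s a minimal sketch, with both oracles left as stubs (in practice they’d be language models); the acceptance rule here, comparing each candidate against the current text, is just one reasonable reading of “hasn’t made the text worse”:

```python
def perturbation_oracle(text: str) -> str:
    """Stub: a weak LM that proposes a small local rewrite of the text."""
    raise NotImplementedError

def quality_oracle(prompt: str, text: str) -> float:
    """Stub: a stronger LM that scores how good `text` is as a response to `prompt`."""
    raise NotImplementedError

def dewatermark(prompt: str, watermarked_text: str, steps: int = 2_000) -> str:
    # Random walk with rejection sampling: accept a proposed edit only if
    # the quality oracle says it hasn't made the text worse.
    text = watermarked_text
    score = quality_oracle(prompt, text)
    for _ in range(steps):
        candidate = perturbation_oracle(text)
        candidate_score = quality_oracle(prompt, candidate)
        if candidate_score >= score:
            text, score = candidate, candidate_score
    return text
```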

I really like this attack. The fact that such a naive search algorithm — a random walk with rejection sampling — works for this is beautiful (and appeals to me as a trained physicist!). The scheme does a great job of reminding you that watermarked samples are only a small fraction of all samples — indeed, they have to be for the watermark to be detectable — and this very fact makes them fragile! Rarity and nonrobustness are two edges of the same sword.

The one weakness is that it requires a lot of calls to a quality oracle, which in practice will be a fairly powerful language model, so this isn’t a very cheap attack. But then, if you imagine a near-future world in which AI calls are easily available but all AIs are watermarked, this attack works perfectly well!

In all honesty, it doesn’t seem to me like breaking watermarking schemes would be particularly hard. For example, it seems like you could fool the UMD watermarking scheme by just, say, asking the model to write the essay in Spanish and running it through Google Translate, or by generating the essay with a filler word between every content word. I really like this attack, though, not because it does something I thought was impossible, but because it works very generally and shines light on the landscape of attack and defense here.

Synthesis: bootstrapping a discriminator into a generator

To me, the deep idea behind this attack is that a “quality oracle” can be hooked up to a random process and used to generate high-quality samples — or, in the language of GANs, a “discriminator” can be bootstrapped into a “generator.” A system that merely scores samples is in some sense almost as powerful as a system that generates them! An AI organization might think it’s safe giving you only watermarked text and a powerful discriminator, but in fact that discriminator is enough to effectively regenerate the text.

When can a discriminator be cheaply bootstrapped into a generator? In some sense, this is what every local optimization algorithm is doing: iterated queries to a loss function (the oracle) are converted into a point with a low loss value (the generated sample)! This “iterated bootstrapping of discriminators” is a motif we should be on the lookout for in optimization and learning generally.1
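
As a toy illustration of that motif, here’s a tiny routine that turns a score-only “discriminator” (here just a loss function on vectors) into a “generator” of low-loss points, using nothing but random perturbations and rejection, which is the same move as the attack above:

```python
import random

def bootstrap_generator(loss, dim: int, steps: int = 20_000, step_size: float = 0.1) -> list[float]:
    # Turn a score-only "discriminator" (the loss function) into a generator of
    # low-loss points: propose a random nudge, keep it only if the loss doesn't rise.
    x = [random.gauss(0.0, 1.0) for _ in range(dim)]
    best = loss(x)
    for _ in range(steps):
        candidate = [xi + random.gauss(0.0, step_size) for xi in x]
        candidate_loss = loss(candidate)
        if candidate_loss <= best:
            x, best = candidate, candidate_loss
    return x

# Example: "generate" a point near the minimum of a quadratic bowl
# using only loss queries, never gradients.
target = [3.0, -1.0, 0.5]
quadratic_loss = lambda v: sum((vi - ti) ** 2 for vi, ti in zip(v, target))
print(bootstrap_generator(quadratic_loss, dim=3))
```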

Thought experiment: how far can this bootstrapping idea go?

This de-watermarking attack effectively works on the principle that a discriminator plus a source of random perturbations can be used as a generator. I wonder how far we can take this idea!

Here’s a thought experiment which this principle suggests, and which is sort of the de-watermarking attack taken to the extreme:

  • Choose a prompt from which we’d like to generate some high-quality text.
  • Start with totally random text — just random tokens.
  • Make random perturbations and run rejection sampling as in this attack for a long time (sketched below).
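
In terms of the sketches above, the whole experiment is just the same `dewatermark()` loop run from a random initialization (the vocabulary and prompt here are made-up examples):

```python
import random

vocab = [f"tok{i}" for i in range(50_000)]  # stand-in vocabulary, as before
random_start = " ".join(random.choice(vocab) for _ in range(500))

# Reuse the hypothetical oracles and dewatermark() loop sketched earlier,
# but start from random tokens and run for far longer.
essay = dewatermark(prompt="Write an essay about photosynthesis.",
                    watermarked_text=random_start,
                    steps=1_000_000)
```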

Question: do you reach high-quality text after many iterations?

In some idealized mathematical setting, the answer has to be yes: quality never decreases according to the oracle, so eventually the text will either reach some local maximum or cross an arbitrary quality threshold. In practice, though, this is a really interesting question, because every component is flawed! Is the oracle reliable enough? Are the perturbations accepted at a good enough rate? Do we reach the promised land of high-quality text in a reasonable amount of time?

I don’t know the answer here, but it seems like it’d be a really cool result if it worked! In particular, I’m curious how dumb the source of perturbations can be in order for it to work. It’d also be really cool to see how the text changes as the process runs — what does it look like, say, 10% of the way through? 25%? 75%? Does it first form good English sentences, then gradually get on topic, then finally become high quality? If it gets stuck, at what point in this progression?2

I’m pretty curious about this! It actually seems like an experiment that could be run by some entity with a lot of compute.


  1. In a meta sense, this is also fundamentally how reinforcement learning works: the learning algorithm converts task feedback scores (the oracle) into a system that can generate new good actions (generated samples). (This process is more expensive than schemes that generate only one sample, but once you’re done you have a system for cheaply generating lots of samples.) It’s not totally clear to me how to relate this meta-bootstrapping — aka learning — to the base-level bootstrapping that generates only one sample, as in this attack and optimization algorithms, but they sure feel suspiciously related to me. 

  2. The main reason to doubt that this thought experiment would work is the lack of a gradient here — we’re taking random steps, most of which will be neutral or a little bit harmful. I’m guessing this is fine for the original attack scheme because we already start at a good point, and because the perturbation oracle is designed to generate samples that tend to neither help nor hurt quality. (Incidentally, this is similar to some physics-inspired sampling schemes, like Hamiltonian Monte Carlo, which use sampling strategies biased towards regions of similar loss to get faster exploration.) Can you improve text in this way instead of just paraphrasing it? I don’t know, but it’d sure be interesting if you could!