Differential Privacy 101
Questions people ask at parties.
Contrary to popular opinion, I do actually get asked about differential privacy at parties. (I go to some pretty nerdy parties, but the point stands.) I’m also asked about it in coffee chats, in office hours by my curious students, and once—memorably—on a date, where I think the phrase “epsilon-ball around a database” ended things prematurely.
In any case, it’s come up enough that I thought it would be helpful to write up my favorite explanation of what differential privacy is, why it matters, and how it differs from other so-called privacy-preserving techniques. We’ll start with a story and a party game, and then for those who are interested in the actual implementation, I’ll share my take on the mathematical definitions. They’re famously difficult to parse, but the right intuition helps a lot, and my goal with the story is to help you build that intuition.
Question #1: "Rita, can you please explain your research like I’m 5?"
Differential privacy, at its core, is a mathematical way to guarantee that an individual piece of information stays safe inside some kind of model. That model could be as simple as a median or as complex as an LLM or a diffusion model—the main thing that matters is that you can’t tell what went in from anything that comes out.
This is obviously relevant in fields where we want to apply machine learning models to inherently sensitive data. I work with HIPAA-protected patient data for healthcare applications, where differential privacy is huge; other adopters include big tech companies and the US Census Bureau. In practice, you make something differentially private by injecting noise somewhere in the model (more on that later!) – that’s more or less the basics.
Question #2: "I’ve heard of things being 'privacy-preserving': how is differential privacy different?"
Here’s the example I like to give to my friends.
Say the United Nations creates a new disarmament initiative. Every country will secretly submit the true number of nuclear weapons they possess to a secure, incorruptible black-box machine. The machine then computes the arithmetic mean and reports it publicly. This seems fair, right? The mean here is a privacy-preserving method, meaning that it generally obscures private information.
This is a toy example, but the literature is full of ‘privacy-preserving’ methods: for example, two of the most popular are federated learning (which aggregates gradient updates instead of raw data) and privacy-preserving data synthesis (which generates synthetic data instead of using the real thing). These more complex methods serve the same purpose as our humbler mean.
Let’s continue our example. In the disarmament initiative, every country contributes its number, and we learn that the mean is 42. But after a while, Russia decides to pull out of the agreement. They withdraw their data, and suddenly everyone can see that the mean has dropped to 33.
Yikes. You see the problem: even though the mean was anonymized, everyone now knows exactly how many weapons Russia must have had. This is a classic differencing attack (a close cousin of the membership inference attacks you may have read about), and it is the problem with relying solely on “privacy-preserving methods.” These methods tend to protect data in aggregate, but they are often vulnerable to attacks that exploit subtle changes—like adding or removing one person (or country). This is why we need differential privacy: it explicitly guards against this kind of inference. In fact, it formally ensures that the presence or absence of any single participant does not substantially affect the outcome.
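To make the leak concrete, here is the arithmetic behind the attack as a tiny sketch. The published means are the ones from the story; the group size of ten countries is made up for illustration.

```python
# A toy differencing attack: two published averages are enough to
# recover Russia's exact submission, no hacking required.

n_before = 10          # assumed number of participating countries (illustrative)
mean_before = 42       # published mean with Russia included
mean_after = 33        # published mean after Russia withdraws

total_before = mean_before * n_before        # 420
total_after = mean_after * (n_before - 1)    # 297
russia_count = total_before - total_after    # 123

print(f"Russia must have reported {russia_count} weapons.")
```

Nothing was decrypted and nothing was leaked directly: the two aggregate statistics alone give the answer away.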
Question #3: "This is where you start talking about math, yes?"
Yes!! Formally, a randomized algorithm A is said to be ε-differentially private if for any two neighboring datasets D and D' that differ by exactly one individual record (like the countries with and without Russia), and for any set of possible outputs S, we have:
Pr[A(D) ∈ S] ≤ e^ε ⋅ Pr[A(D′) ∈ S]
This means, broadly, that the probability that a particular result happens when your data is included is nearly the same as when it isn't. So whether or not you’re in the dataset, the output won’t noticeably change.
The parameter ε is often called the privacy budget—the lower it is, the more privacy you’re guaranteed. Values like ε = 0.1 are very private, while ε = 5 means your data might have a significant effect on the output.
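And here is where the noise I promised earlier comes in. One standard tool is the Laplace mechanism: clip each contribution to a known range, then add Laplace noise scaled to sensitivity / ε before releasing the result. Below is a minimal sketch applied to the UN story; the weapon counts, the clip bound, and the choice of ε are all made-up illustration values, not anyone’s real numbers.

```python
import numpy as np

def dp_sum(values, epsilon, clip_max):
    """Release a sum with epsilon-differential privacy via the Laplace mechanism.

    Each contribution is clipped to [0, clip_max], so adding or removing one
    record changes the true sum by at most clip_max. That bound is the
    sensitivity, and the Laplace noise scale is sensitivity / epsilon.
    """
    clipped = np.clip(values, 0, clip_max)
    noise = np.random.laplace(loc=0.0, scale=clip_max / epsilon)
    return clipped.sum() + noise

# Hypothetical weapon counts and an assumed per-country cap of 500.
counts = [0, 12, 90, 123, 160, 40, 0, 5, 0, 0]
noisy_total = dp_sum(counts, epsilon=1.0, clip_max=500)
print(noisy_total / len(counts))  # noisy mean; the number of countries is public
```

Notice the honest trade-off: with only ten contributors, noise with scale 500 swamps the true sum. That is exactly why differential privacy shines on large datasets, where the same amount of noise barely dents the answer.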
Question #4: "Is there a fun differentially private game I can play at home?"
I’m so glad you asked. Here’s one to try with a friend!
The Coin Toss Game
You will need:
- A friend
- A coin
- A secret
Have your friend think of a secret (with a yes/no answer)—ideally something mildly scandalous. Did they eat your leftovers? Do they actually like your taste in music?
Then have them follow these instructions:
- Flip a coin.
- If heads, answer truthfully.
- If tails, flip again:
  - If the second coin is heads, answer “Yes.”
  - If the second coin is tails, answer “No.”
No matter what they say, you can’t be sure whether they’re telling the truth. The randomness adds plausible deniability, and adding noise to obscure their datapoint is what differential privacy is all about.
This game, in fact, turns out to be ε-differentially private for ε = ln(3). Here's why:
- If the true answer is “Yes,” they’ll respond “Yes” with probability 0.5 + 0.25 = 0.75
- If the true answer is “No,” they’ll respond “Yes” with probability 0.25
So:
0.75 / 0.25 = 3 ⇒ ε = ln(3)
This bounded ratio is exactly what differential privacy asks for: no output is “too much more likely” depending on a single bit of private data.
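If you’d rather trust a simulation than my arithmetic, here is a quick sketch of the game in code (hypothetical helper names, nothing from a library) that estimates both conditional probabilities:

```python
import random

def randomized_response(truth: bool) -> bool:
    """One round of the coin toss game."""
    if random.random() < 0.5:       # first flip: heads -> answer truthfully
        return truth
    return random.random() < 0.5    # second flip: heads -> "Yes", tails -> "No"

trials = 100_000
p_yes_given_yes = sum(randomized_response(True) for _ in range(trials)) / trials
p_yes_given_no = sum(randomized_response(False) for _ in range(trials)) / trials

print(p_yes_given_yes)                   # ~0.75
print(p_yes_given_no)                    # ~0.25
print(p_yes_given_yes / p_yes_given_no)  # ~3, i.e. e^ε with ε = ln(3)
```

Run it a few times and the ratio hovers around 3, which is exactly the e^ε bound from the definition above.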
Final thoughts
I know I’m biased, but differential privacy isn’t just a research buzzword. It’s a deeply useful and beautiful idea, which formalizes something we all intuitively want: to be included in data science without being exposed by it. If I can help explain that at a party, in a blog post, or through a coin toss, I’ll consider my work worthwhile.
Thanks for reading. Feel free to reach out if you’d like to play the coin game sometime—I have plenty of secrets and a very reliable quarter.
Until next time,
- Rita