Why Floating-Point Numbers are Leaking Your Data

Breaking differential privacy guarantees with CPU arithmetic.

Imagine this: you’re a bank, and you’ve just finished implementing a beautiful differentially private algorithm. It’s elegant, it’s efficient, and you’ve even shown it’s provably \((\varepsilon, \delta)\)-DP. You’ve run your unit tests, and everything passes. Pat yourself on the back! Time to secure some sensitive financial data (say, all your customers’ transactions) before a differentially private information release.

However, bad news for you and good news for me: despite your efforts, I’ve found your customers’ top-spending ZIP codes (and from there, with a little more work, your top-spending customers). You didn’t make a mistake in the proof, nor was your code buggy, so how did I still end up with this private information? It has to do with how your computer, like every computer, works under the hood.

Welcome to the underworld of floating-point attacks.


Floating points? I haven’t thought about those since CS 101!

Floating-point numbers are how we represent the real line on a computer, and while that seems straightforward, it isn’t. A float or double can only take on finitely many values, so it can never be mathematically equivalent to the real number line. Computers have to round up or down somewhere, since they can’t store infinite information just to record a number like \(1/3\): instead, they use IEEE-754 floating-point approximations.
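
If it’s been a while since CS 101, here’s a quick reminder of what that rounding looks like in practice (plain Python, nothing DP-specific yet):

```python
# None of these numbers are stored exactly; the nearest IEEE-754 double is used instead.
print(f"{1/3:.20f}")       # 0.33333333333333331483, not 0.333...
print(0.1 + 0.2 == 0.3)    # False: each side rounds to a different double
print((0.1).hex())         # 0x1.999999999999ap-4, the bits actually stored for 0.1
```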

Most modern DP mechanisms—Laplace, Gaussian, Exponential, and even some fancier samplers like those in Rényi DP accounting—rely on sampling from continuous distributions, and the proofs that guarantee privacy for these mechanisms assume operations on real numbers.

But your computer doesn’t do real numbers. And if you’re not careful, your carefully-tuned privacy guarantees can be undone by nothing more than your CPU. Let’s walk through it!


A Quick DP Recap

Recall that differential privacy protects individuals in a dataset by ensuring that the distribution of outputs doesn’t change too much when any one individual’s data is changed (for a more detailed explanation, read Differential Privacy 101). A mechanism \(M\) is \((\varepsilon, \delta)\)-differentially private if, for all datasets \(D, D'\) differing in one entry and all events \(S\):

\[ \mathbb{P}[M(D) \in S] \leq e^\varepsilon \cdot \mathbb{P}[M(D') \in S] + \delta \]

This is a distributional guarantee: if the exact outputs your implementation produces let someone distinguish \(D\) from \(D'\) better than this bound allows, your mechanism is no longer DP.

This is where floating-point exploits come in.


What Goes Wrong?

Let’s say you want to add Laplace noise to a query output. If you’re going with the textbook approach, you’ll probably use inverse transform sampling like so:

\[ \text{Sample } u \sim \text{Uniform}(0, 1), \quad \text{return: } \mu - b \cdot \text{sign}(u - 0.5) \cdot \ln(1 - 2|u - 0.5|) \]
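
In code, a direct translation looks something like this (a NumPy sketch of the textbook sampler, not production code; the function name is mine):

```python
import numpy as np

def naive_laplace(mu: float, b: float) -> float:
    """Textbook inverse-transform Laplace sampler: exactly the formula above, floats and all."""
    u = np.random.uniform(0.0, 1.0)   # only finitely many doubles can ever come out of this
    return mu - b * np.sign(u - 0.5) * np.log(1.0 - 2.0 * abs(u - 0.5))
```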

In theory, great. But in practice, np.random.uniform() and its counterparts can only generate a finite number of possible values. An attacker can simply precompute the finite set of all outputs your DP mechanism can produce and map them back to inputs or input ranges accordingly (the seminal and very cool paper on this exploit is by Ilya Mironov). For example, a typical uniform sampler only ever returns multiples of \(2^{-52}\) or \(2^{-53}\): a few quadrillion evenly spaced points rather than a continuum. Since your mechanism maps each such \(u\) to one specific noisy output, an attacker may be able to invert it.
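
Here’s a rough sketch of the attacker’s precomputation step, under the assumption that the uniform draw \(u\) comes from the \(k/2^{53}\) grid that common generators use (the helper name and the slice of the grid are mine, purely for illustration):

```python
import numpy as np

def reachable_outputs(mu: float, b: float, ks: np.ndarray) -> set:
    """Every double the naive sampler can emit for this mu, over the given slice of the u grid."""
    u = ks / 2.0**53                  # the k / 2**53 grid a typical uniform generator draws from
    noise = -b * np.sign(u - 0.5) * np.log(1.0 - 2.0 * np.abs(u - 0.5))
    return set((mu + noise).tolist())

# Walking all 2**53 grid points is a huge but one-time, offline job for the attacker;
# a small slice is enough to show the effect.
ks = np.arange(1, 500_001, dtype=np.float64)
reach_0 = reachable_outputs(0.0, 1.0, ks)   # outputs possible if the true answer is 0
reach_1 = reachable_outputs(1.0, 1.0, ks)   # outputs possible if the true answer is 1
overlap = len(reach_0 & reach_1) / len(reach_0)
print(f"fraction of outputs consistent with both worlds: {overlap:.4f}")  # typically close to zero
```

In other words, most of the doubles the mechanism can emit are consistent with only one of the two neighboring databases, so a single observed output can already tell the attacker which world it came from.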

Even worse, floating-point math is full of rounding errors, cancellation issues, and machine-dependent edge cases. Overflow is exploitable too: in certain edge cases, one database will cause a float to overflow at some point in the mechanism, while a neighboring database with just one observation changed will not. This is an immediate route to what’s called membership inference, where an attacker can determine whether a specific individual is in the dataset based on the output of the mechanism.
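
To make the overflow point concrete, here’s a deliberately exaggerated toy (no clipping, no sensitivity analysis, and values no sane bank would store, but the arithmetic is real):

```python
import numpy as np

# Two neighboring databases that differ in a single (absurdly large) record.
with_alice    = np.array([1.0e308, 1.0e308])   # sum exceeds float64's max of roughly 1.8e308
without_alice = np.array([1.0e308, 1.0])       # sum stays finite

noisy_with    = with_alice.sum()    + np.random.laplace(0.0, 10.0)
noisy_without = without_alice.sum() + np.random.laplace(0.0, 10.0)
print(noisy_with, noisy_without)   # inf vs. a finite number: no amount of noise hides an inf
```

Real implementations clamp their inputs precisely to rule this kind of thing out, but the broader point stands: any data-dependent arithmetic edge case is a potential distinguisher.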


It Also Gets Worse

There’s more, too! While this isn’t strictly speaking a float problem, timing attacks are another example of how theoretical DP can fail under realistic conditions. It’s perfectly natural for an algorithm to take longer on, say, larger numbers or more complex data, but that means the time your mechanism takes to run can leak information just like float math can. For instance, if you’re using rejection sampling to implement the Exponential mechanism, the number of iterations it takes to find a valid sample can vary with the input scores. A clever attacker can use this to make inferences about your input data with nothing more than a clock!
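
Here’s a toy version of that idea (one of several ways to implement the Exponential mechanism with rejection sampling; the function and the score vectors are mine, purely for illustration):

```python
import numpy as np

def exponential_mech_rejection(scores, epsilon, sensitivity=1.0, rng=np.random):
    """Toy exponential mechanism via rejection sampling; returns (chosen index, iterations used)."""
    weights = np.exp(epsilon * np.asarray(scores, dtype=float) / (2.0 * sensitivity))
    bound = weights.max()
    iterations = 0
    while True:
        iterations += 1
        i = rng.randint(len(weights))              # propose a candidate uniformly at random
        if rng.uniform() < weights[i] / bound:     # accept proportionally to its weight
            return i, iterations

# The mechanism is equally private for both inputs, but its running time is not input-independent.
flat_scores   = [1.0, 1.0, 1.0, 1.0]   # candidates all equally good: accepts almost immediately
skewed_scores = [9.0, 0.0, 0.0, 0.0]   # one dominant candidate: many rejections on average
for name, s in [("flat", flat_scores), ("skewed", skewed_scores)]:
    iters = [exponential_mech_rejection(s, epsilon=1.0)[1] for _ in range(10_000)]
    print(name, np.mean(iters))        # an attacker with a stopwatch sees the same gap in time
```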

Timing attacks are nothing new: for years, hackers and security researchers have used timing to recover private RSA keys, break into HTTP web servers, and get around WiFi authentication. DP researchers are increasingly finding such vulnerabilities too, including in the Exponential mechanism discussed above (Ilvento & Geumlek, 2020).


What Can We Do?

Fortunately, this issue is solvable in many cases. Researchers (now including me!) have been slowly but surely building up a toolkit of techniques to mitigate floating-point attacks. Here are my general, practical suggestions:

- Work over a discrete output space whenever you can: the geometric mechanism (a discrete Laplace) and the discrete Gaussian give you the same kind of guarantee with no continuous sampler to exploit (there’s a small sketch of this below).
- If you really need continuous noise, use a construction designed with floats in mind, like Mironov’s snapping mechanism: clamp the query answer, add the noise, round to a fixed grid, and clamp again.
- Don’t roll your own sampler for anything you ship. Vetted open-source libraries (OpenDP, Google’s differential privacy library, IBM’s diffprivlib) have already absorbed a lot of this pain for you.
- Treat side channels as part of the threat model: make your mechanism’s running time independent of the sensitive data where you can, or at least pad it so a stopwatch tells an attacker as little as possible.
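
As a concrete taste of moving to a discrete output space, here’s a minimal sketch of the geometric mechanism for a counting query: the noise is a difference of two geometric draws, so every output is an integer and there is no floating-point grid to fingerprint. (This sketch still computes \(1 - e^{-\varepsilon}\) in floating point; a production implementation, like the exact discrete samplers of Canonne, Kamath & Steinke, does even that step with integer arithmetic.)

```python
import numpy as np

def geometric_mechanism_count(true_count: int, epsilon: float, rng=np.random) -> int:
    """epsilon-DP counting query (sensitivity 1) with two-sided geometric (discrete Laplace) noise."""
    p = 1.0 - np.exp(-epsilon)                            # success probability of each geometric draw
    noise = int(rng.geometric(p)) - int(rng.geometric(p))  # symmetric; P(noise = k) decays like e^(-epsilon*|k|)
    return true_count + noise

print(geometric_mechanism_count(42, epsilon=0.5))   # always an integer, for any database
```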

Final Thoughts

Floating-point attacks are the Achilles heel of real-world differential privacy. They don’t contradict the theory, but they make a mockery of practice if you’re not careful. It’s not enough for your theory paper to guarantee \((\varepsilon, \delta)\)-privacy: IEEE-754 is a clever, useful standard, but it’s also a perfect tool for subtle attacks on your carefully tuned mechanisms. As much as you love it, it might not love you back.

In any case, if you’re still curious, check out some further reading and the math appendix here. If you’d like to see actual implementations of these attacks, let me know at rionides@umich.edu: I’m happy to write up a follow-up if there’s demand. It just so happens that attack demos are exactly my idea of good wholesome fun.

Until next time,
Rita


