Large-scale online deanonymization with LLMs

LLMs Enable Automated Internet Deanonymization.
A team of academics from Anthropic, ETH Zurich, and MATS Research has developed large language models (LLMs) that can deanonymize internet users based on past comments or other digital clues they leave behind.

The research paper was published here:

Key quotes:

“Our results show that the practical obscurity protecting pseudonymous users online no longer holds and that threat models for online privacy need to be reconsidered,” the researchers said.

“The average online user has long operated under an implicit threat model where they have assumed pseudonymity provides adequate protection because targeted deanonymization would require extensive effort. LLMs invalidate this assumption.”

Scary, isn’t it?

8 Likes

It was only a matter of time.

3 Likes

I know…
But I thought we had a little more time. This is going too fast.

3 Likes

Maybe we need to throw in some false clues every now and then. But I’m just a bit busy at the moment preparing for my move to Romania. :wink:

2 Likes

You won’t like Romania: it’s a poor country and very corrupt!

I’ve been there lots of times. I love it.

1 Like

We’re all gonna need a “comment obfuscator” extension that either f—'s up one’s correct grammar, punctuation, and spelling, or corrects those, in multiple languages.

2 Likes

Be sure to see the film “Strigoi,” if you haven’t already. :slight_smile:

1 Like

I love that film. :wink:


The research paper does at least give some mitigations e.g.

Enforcing a rate limit for API access to user data, detecting automated scraping, and restricting bulk data exports may reduce the severity of these attacks.
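For what it's worth, the per-client rate limiting they mention is simple to sketch. Here's a rough token-bucket version (my own illustration, not from the paper — the class and parameter names are made up):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: each client may burst up to `capacity`
    requests, refilled continuously at `rate` tokens per second."""

    def __init__(self, capacity=10, rate=1.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = defaultdict(lambda: capacity)   # start each client full
        self.last = defaultdict(time.monotonic)       # last request time

    def allow(self, client_id):
        now = time.monotonic()
        elapsed = now - self.last[client_id]
        self.last[client_id] = now
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens[client_id] = min(
            self.capacity, self.tokens[client_id] + elapsed * self.rate
        )
        if self.tokens[client_id] >= 1:
            self.tokens[client_id] -= 1
            return True
        return False
```

A scraper that hammers the user-data API burns through its bucket and gets refused, while a human browsing at normal speed never notices the limit.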

So we may see more of Anubis.

Potentially, yes. I think I need a Firefox Extension for that.

I haven’t read the actual paper in full, but I had the impression that it is mostly focused on facts (“personal attributes”). However, “writing style” is also mentioned.

Note that there are (at least) two attack scenarios:

  • connecting an online identity to a real-world identity,
  • associating two online identities with each other.

In the case of English (and some other languages) there is also the possibility of mutating one variant into another (e.g. British spelling into American) rather than just correcting or anti-correcting – and of the deanonymization software using that against you if you don’t.
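That kind of variant mutation is trivial to sketch. A toy version (my own; a real tool would need a proper lexicon, not this hypothetical six-word map):

```python
import re

# Hypothetical word map — a real obfuscator would need a full lexicon.
UK_TO_US = {
    "colour": "color", "favourite": "favorite", "analyse": "analyze",
    "organise": "organize", "centre": "center", "travelling": "traveling",
}

def mutate_variant(text, mapping=UK_TO_US):
    """Rewrite British spellings as American (or vice versa with an
    inverted mapping), preserving simple leading capitalisation."""
    def swap(match):
        word = match.group(0)
        repl = mapping.get(word.lower())
        if repl is None:
            return word  # not in the map: leave untouched
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"[A-Za-z]+", swap, text)
```

Of course, the same trick run in the other direction is exactly what a stylometric attacker would use to normalise your text before fingerprinting it.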

2 Likes

An obfuscator to mess with writing content and style, but also other attributes:

  • how you write and move the mouse (the rhythm and movements of your inputs)
  • how you hate/like something and their combinations (preference variance)
  • what you know about some things and not others (how smart or stupid you are / seem to be) :nerd_face:

And so on. Every specific personal detail needs to be pushed closer to the mean or filtered out. Which is why AI is the likely solution: Semantic ablation: Why AI writing is boring and dangerous • The Register

We can measure semantic ablation through entropy decay. By running a text through successive AI “refinement” loops, the vocabulary diversity (type-token ratio) collapses. The process performs a systematic lobotomy across three distinct stages:

Stage 1: Metaphoric cleansing. The AI identifies unconventional metaphors or visceral imagery as “noise” because they deviate from the training set’s mean. It replaces them with dead, safe clichés, stripping the text of its emotional and sensory “friction.”

Stage 2: Lexical flattening. Domain-specific jargon and high-precision technical terms are sacrificed for “accessibility.” The model performs a statistical substitution, replacing a 1-of-10,000 token with a 1-of-100 synonym, effectively diluting the semantic density and specific gravity of the argument.

Stage 3: Structural collapse. The logical flow – originally built on complex, non-linear reasoning – is forced into a predictable, low-perplexity template. Subtext and nuance are ablated to ensure the output satisfies a “standardized” readability score, leaving behind a syntactically perfect but intellectually void shell.

The result is a “JPEG of thought” – visually coherent but stripped of its original data density through semantic ablation.

So, AI to randomise all output to force conformity? Make it “onionized”, so that no single AI randomiser in the (at least) three-level filter knows which layer it is or what changes happen next? Or change everyone’s output (or a whole group’s) into haiku poems – the resulting non-specific output, which needs a bit of interpretation, would probably be on par :smiley:

4 Likes

Just sad and stupid, I guess.

1 Like

That’s a bit different, though, because this information – as far as I know – can only be collected by the web site itself; it cannot be obtained retrospectively by scraping a site. So if you avoid untrustworthy sites, you should be able to mitigate this aspect – granted that, with common tools often pulling in content from a multitude of sites, it is difficult to be sure which sites are untrustworthy.

I guess it is possible always to draft a post ‘offline’ and then paste it in with its final content (with the downside that Discourse will then potentially flag you as a spammer).

I would like to see web browsers clamp down on this aspect i.e. a web site cannot track mouse events unless specifically allowed and cannot track key events unless specifically allowed. Some functionality will stop working. You can decide whether that is sufficiently negative to allow the web site to track events or whether this is actually a positive.

Not necessarily. If you had two online identities and you were aiming to prevent an attacker from associating those identities, misinformation would be better, i.e. throw in a few distinct, preferably false, details that set you apart from the mean and set the identities apart from each other.

An exception to that might be if you were the ‘only’ one in the world doing that, and everyone else is using the Blandifier Extension.

1 Like

Financial and medical services and, presumably, government, sites do this sort of monitoring, often farmed out to third parties “to aid in optimizing site usability”.

All of those may be at least a bit dodgy, but they are kind of hard to avoid.

1 Like

Just an expected part of the path to dominance of the AI overlords you keep forecasting. The AI overlords will strongly prefer “Unique IDs” … although it’s clear they are equipped to deal with “Probabilistically Unique IDs”. :wink:

1 Like