Testing anubis to prevent LLM bots from taking down source.puri.sm

I have shared our plan here: Testing anubis to prevent LLM bots from taking down source.puri.sm - PureOS-project - Purism Mailing Lists

More testing, feedback and other ideas welcome.

3 Likes

To elaborate on what this is about and to summarize the issue: bots that crawl servers - mostly LLM/AI related these days - and consume scarce, valuable resources (which don’t have a big budget behind them) need to be restricted, and that is done using Anubis. Users may see it as an anime girl, but it is actually a multithreaded proof-of-work workload used to block some of the creeps. The linked text has a good link to a blog post that explains the problem in more depth: Please stop externalizing your costs directly into my face
[Edit to add: This isn’t just an annoyance and an expense; it can also be intentional - malicious, as in DDoS-type attacks, but also criminal use of other people’s resources]

3 Likes
  • Looking at that domain’s robots.txt, is there some reason why you are not just outright rejecting all robots? (A blanket-deny robots.txt is shown after this list.) I understand of course that any robot is completely free to ignore robots.txt anyway.
  • Thinking about this question in more general terms, would it be better to remove most or all of that domain from the public internet? However, I understand that in the past there have been difficulties for legitimate users to get accounts to access that domain (that would need fixing!) - and also this would entail separating out the material that you do want to be public from the material that you don’t need to be public.
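For reference, the blanket rejection I mean in the first point is only two lines of robots.txt (which, as noted, any robot is free to ignore):

```
User-agent: *
Disallow: /
```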

I assume that a consequence of this (Anubis) is that the site will simply stop working for anyone who refuses to allow JavaScript??

Just for fun … here’s what I do for the scenario that a robot publishes its source IP addresses: I present an empty domain to those IP addresses e.g. maybe just a content-free / and a /robots.txt that tells them to bugger off. So even if the robot ignores robots.txt, it isn’t going to take much time at all for it to “traverse” the web site. However the linked article makes clear that LLM robots deliberately use a mass of anonymous IP addresses.
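Here is a minimal sketch of that setup (the CIDR ranges, backend address and port are placeholders, not my actual configuration): requests from the published crawler ranges get an empty front page and a deny-all robots.txt, and everything else is proxied to the real site.

```go
// Sketch only: serve an empty site to requests from published crawler IP
// ranges, pass everything else through to the real backend.
package main

import (
	"log"
	"net"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Hypothetical CIDR ranges a crawler operator has published.
var crawlerNets []*net.IPNet

func init() {
	for _, cidr := range []string{"192.0.2.0/24", "198.51.100.0/24"} { // documentation ranges only
		if _, n, err := net.ParseCIDR(cidr); err == nil {
			crawlerNets = append(crawlerNets, n)
		}
	}
}

// fromCrawler reports whether the request's source IP is in a published crawler range.
func fromCrawler(r *http.Request) bool {
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		host = r.RemoteAddr
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return false
	}
	for _, n := range crawlerNets {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	backend, _ := url.Parse("http://127.0.0.1:8080") // the real site sits here
	proxy := httputil.NewSingleHostReverseProxy(backend)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if fromCrawler(r) {
			if r.URL.Path == "/robots.txt" {
				w.Write([]byte("User-agent: *\nDisallow: /\n")) // "bugger off"
				return
			}
			// Content-free front page: nothing to traverse, crawl over in seconds.
			return
		}
		proxy.ServeHTTP(w, r) // everyone else gets the real site
	})
	log.Fatal(http.ListenAndServe(":80", nil))
}
```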

The other thing that I do is to avoid traversability, i.e. so that if a robot starts at / then it won’t find much, but actual people are directed to specific URLs that don’t generally go anywhere else. Of course this is then a compromise between resistance to robots and usability. That probably isn’t an option anyway if you are using GitLab or similar.

The test site seems quite slow - perhaps as slow as the original site that you are trying to protect, i.e. the problem you are trying to solve.

It looks as if you collected a droplet from the ocean :wink: but perhaps that was just for testing.

The test site did also croak at one point.

Also received at one point: The page could not be displayed because it timed out.

Also: An error occurred while fetching commit data.

Otherwise, yes, proof of work was OK, login was OK, and some navigation (on limited testing) was OK.

Not totally convinced that this is ready for prime-time.

(my emphasis)

… or make it computationally expensive for a robot to traverse the web site. I assume that if a robot is prepared to do the proof of work then it will be allowed to traverse the web site i.e. is not blocked.
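For anyone curious how this works in general terms, here is a rough sketch (my assumptions about the scheme, not Anubis’s actual code): the server hands the browser a challenge string, the browser has to find a nonce such that SHA-256(challenge + nonce) starts with a number of zero hex digits, and the server only needs one hash to verify the answer. Solving is cheap for one human visit but adds up fast for a crawler hitting millions of URLs.

```go
// Sketch of a generic SHA-256 proof-of-work challenge (assumed scheme, not
// Anubis's actual implementation). Challenge string and difficulty are
// made-up values for illustration.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

func hashOf(challenge string, nonce uint64) string {
	sum := sha256.Sum256([]byte(challenge + strconv.FormatUint(nonce, 10)))
	return hex.EncodeToString(sum[:])
}

// solve is the expensive part the client does (in the real thing, JavaScript
// running in the browser).
func solve(challenge string, difficulty int) uint64 {
	prefix := strings.Repeat("0", difficulty)
	for nonce := uint64(0); ; nonce++ {
		if strings.HasPrefix(hashOf(challenge, nonce), prefix) {
			return nonce
		}
	}
}

// verify is the cheap server-side check: one hash per submitted answer.
func verify(challenge string, nonce uint64, difficulty int) bool {
	return strings.HasPrefix(hashOf(challenge, nonce), strings.Repeat("0", difficulty))
}

func main() {
	const challenge = "random-per-visitor-string" // would be unique per visitor
	const difficulty = 4                          // each extra zero digit is ~16x more work
	nonce := solve(challenge, difficulty)
	fmt.Println("nonce:", nonce, "valid:", verify(challenge, nonce, difficulty))
}
```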

1 Like

The idea seems fine for now.

The robots that are going to respect robots.txt are usually the ones that would actually be welcome and not cause much trouble - like Wayback Machine crawlers etc.

It is the original site. You can just access it in two ways for now - with or without Anubis. It won’t get faster until the direct path gets disabled. (edit: nope, I got confused there)

I don’t understand the point about “material that doesn’t need to be public”. This is FLOSS, everything that’s public there is public for a reason.

2 Likes

The test instance is a lower-spec machine than the original one (2 shared vCPUs and 4 GB RAM vs 8 dedicated vCPUs and 32 GB RAM). The idea was mainly to test Anubis, not GitLab itself. Edit: Also, Anubis runs only once if you have its cookie, so further visits don’t have to show the proof of work. The cookie is valid for a week. So once you have passed Anubis, further testing of GitLab is no longer testing Anubis.
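A rough sketch of how that once-per-week behaviour can work (my own illustration with a made-up cookie name, signing key and port, not Anubis’s actual cookie format): after a solved challenge the proxy sets a signed cookie with an expiry, and any request that still carries a valid cookie is passed straight through without a new challenge.

```go
// Illustration of a signed "challenge passed" cookie valid for a week
// (assumed scheme; cookie name, signing key and port are placeholders).
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"log"
	"net/http"
	"strconv"
	"time"
)

var signingKey = []byte("server-side-secret") // placeholder key

func sign(expiry int64) string {
	mac := hmac.New(sha256.New, signingKey)
	fmt.Fprintf(mac, "%d", expiry)
	return hex.EncodeToString(mac.Sum(nil))
}

// setPassCookie is called once, right after the visitor solves the challenge.
func setPassCookie(w http.ResponseWriter) {
	expiry := time.Now().Add(7 * 24 * time.Hour).Unix() // valid for a week
	http.SetCookie(w, &http.Cookie{
		Name:  "challenge-pass",
		Value: strconv.FormatInt(expiry, 10) + "." + sign(expiry),
		Path:  "/",
	})
}

// hasValidPass decides whether a request may skip the challenge entirely.
func hasValidPass(r *http.Request) bool {
	c, err := r.Cookie("challenge-pass")
	if err != nil {
		return false
	}
	var expiry int64
	var sig string
	if _, err := fmt.Sscanf(c.Value, "%d.%s", &expiry, &sig); err != nil {
		return false
	}
	return time.Now().Unix() < expiry && hmac.Equal([]byte(sig), []byte(sign(expiry)))
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if hasValidPass(r) {
			fmt.Fprintln(w, "no challenge - request would be proxied to GitLab here")
			return
		}
		setPassCookie(w) // in reality only after the proof of work is verified
		fmt.Fprintln(w, "challenge page would be served here")
	})
	log.Fatal(http.ListenAndServe(":8080", nil)) // placeholder port
}
```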

Update: Anyway, I have doubled the CPU count and RAM on the test instance so we have a better idea. Please test again.

1 Like

It is a lower spec clone of the original.

2 Likes

This is going to be a cat and mouse game. By the time the bots catch up, we will have to figure out other ways as well.

2 Likes

By way of example, from the GitLab interface on that domain, I can fork the source of a document, make edits, commit my changes, and make a merge request, but … ultimately it is only the actual final document that needs to be public. Without an account, you can’t do any of the editing process. Requiring an account was a hypothesized way of deterring robots.

So I was proposing that source.puri.sm disappear from the public internet unless you have an account, and that the public outputs from it, where applicable, appear somewhere else, e.g. docs.puri.sm (as they do in some cases), a new hypothetical subdomain cwiki.puri.sm (for the community wiki material), a read-only mirror in a category of the forum, or something else.

I take your implied point that “abuse” is creating tension between openness and reality. This would be an extension of:

Due to an influx of spam, we have had to impose restrictions on new accounts. Please see this page for instructions on how to get full permissions. Sorry for the inconvenience.

OK, you tricked me. Because they are different IP addresses, I assumed that someone had just cloned the server for test purposes (and I did do a whois on the new IP address). Probably a good thing that I didn’t test anything destructive then … :open_mouth: Edit: Yeah, OK, never mind.

1 Like

This would involve quite a lot of engineering work; Anubis seems like a simple solution right now. Maybe if the bots catch up, we will need to think of more ways like you are suggesting. For bot owners, it is extra work to add a defense against Anubis - if only a few git forges use Anubis, it may not be cost-effective for them to do so.

2 Likes

I will try to schedule that for some hours hence.

Bear in mind that I am at the far end of a very long piece of electric string. However I am broadly familiar with the responsiveness that I get from the real site and there was a noticeable difference. The real site is usually perfectly acceptable.

I think it would be good though for someone in the same continent as the server also to test it.

1 Like

Ideally you only need to test whether you see Anubis more than once. We don’t need to extensively test all the GitLab features or responsiveness - just enough to see whether Anubis appears again or not. If we confirm Anubis was seen only once, that is enough, I think.

1 Like

OK, strictly from memory, I think I saw it twice, once initially and once after I logged out. Is that expected?

1 Like

Saw it once, when first entering the site (no login, just browsing).

Btw, just as a happenstance, today’s comic is related :wink:

1 Like

Not sure; maybe the cookie gets refreshed when you log in or log out.

1 Like

Just confirming … retested … performance was much better and fine (I understand that it’s not really important) and I had no timeouts or 5xx errors or other weird errors … and was able to reproduce that Anubis does appear immediately after signing out (in addition to appearing initially).

Consider my soul adequately weighed.

1 Like

Thanks a lot for testing and confirming. I’m getting too many alerts for CPU usage over 90% and server load average above 1.5 on the production instance, though no downtime today yet. Since the initial testing and feedback seem positive, I hope to deploy it soon (in the next few days or a week).

3 Likes

I know this is already answered, but didn’t you read the linked rant?

If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.

Yes, this is just one sentence that explains everything in detail. :smile:

1 Like

Which one of the two sentences are you referring to? :wink:

1 Like

I thought it was clear: the explanation one. The first one was just the intro. :wink:

1 Like