How To Fight AI Abusing You

Can PureOS stop AI from scraping our data?

It’s more than apparent that AI has become Stalkers’ favourite tool to get to know you and use that against you, whether you like it or not.
A report from Euronews shows why and how we need to armour ourselves even more against the Stalkers that monitor our every move, including mouse/drag positions, to inject code into our devices in order to record and control us.

Not only is the article informative, but it is also a road map of the why and the how.
~s

2 Likes

No, and neither can any other operating system. Web scraping in general can be mitigated by rate-limiting, CAPTCHAs and authentication pages, but that is not enough to stop the activity entirely.
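To make the rate-limiting part concrete, here is a minimal sketch of a per-client token bucket in Python. It is only an illustration under assumptions: the class and function names, the burst size of 5 and the refill rate of 1 request per second are all hypothetical, and a real site would usually do this in its web server or reverse proxy rather than in application code.

```python
class TokenBucket:
    """Per-client bucket: each request spends one token, tokens refill over time."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float):
        self.capacity = capacity              # maximum burst size (assumed value)
        self.refill_per_sec = refill_per_sec  # sustained request rate (assumed value)
        self.tokens = capacity                # start with a full bucket
        self.last = now                       # time of the previous request

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the last request.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client identifier (for example an IP address or session ID).
buckets: dict[str, TokenBucket] = {}


def allow_request(client_id: str, now: float,
                  capacity: float = 5, refill_per_sec: float = 1.0) -> bool:
    """Return True if this client may proceed, False if it should be throttled."""
    bucket = buckets.get(client_id)
    if bucket is None:
        bucket = TokenBucket(capacity, refill_per_sec, now)
        buckets[client_id] = bucket
    return bucket.allow(now)
```

A scraper that rotates through many IP addresses gets a fresh bucket for each one, which is one reason this only slows scraping down rather than stopping it.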

2 Likes

This kind of news has popped up frequently during the last two years. It is about gathering data - or hoarding it, more like - for AI training. The key feature is the amount. In this case they are not interested in individuals, but want/need massive quantities of any data made by humans (because there are limits on how, and how much, synthetic data can be used to train AI models). The data is used to create large language model (LLM) type AIs in an effort to make them less dumb and hallucinate less. There is a privacy issue in that AIs may unintentionally include private info (if it was available online when scraped) in their hallucinations/outputs, but the difference is that private info is not being targeted for scraping - just anything easily and openly available (YouTube was scraped by Nvidia, it seems, OpenAI & Microsoft scraped the NY Times etc.).

The only defense or limitation seems to have been the robots.txt instruction file, which carries a request of “please don’t scrape here”, but that’s not much of a block if the site is not behind a proper login/paywall. For example, I’m pretty sure this forum’s open threads are scraped but the hidden Round Table area is not (unless it’s hacked, which is much more serious than just scraping) - and neither is our user metainfo (IP, logs etc.). Then again, it might be a good thing that this forum is/would be scraped, as that would add some balance and varied views to the AI “brain” (and good that our Round Table discussions were not). The discussion to have (as is being had elsewhere too) might be: should that happen, should it be for free or for a fee, and who’d get the payment/settlement (Purism? Users per word or per post or per like or per solution or per year or per tier or per something else? Or donate?). The TOS is from 2013 and doesn’t specifically cover this point but there’s a lot there to apply. Btw. take a look at the forum robots.txt: https://forums.puri.sm/robots.txt
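To show why robots.txt is only a request, here is a small sketch of what a well-behaved crawler does with it, using Python's standard `urllib.robotparser`. The sample rules and the example.org URLs are made up for illustration and are not the contents of the forum's actual robots.txt. `may_fetch` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, NOT the real file from any site.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the rules without any network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    thread = "https://example.org/t/some-thread"
    print(may_fetch(SAMPLE_ROBOTS_TXT, "GPTBot", thread))
    print(may_fetch(SAMPLE_ROBOTS_TXT, "SomeBrowser", thread))
```

The check runs entirely on the crawler's side. A scraper that never calls anything like `may_fetch`, or ignores the answer, is not blocked by the file at all, which is why a login or paywall is the stronger barrier.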

[edit to add, example: I just tested with one AI and it answered about the L5 based on a Tuxphone article and about MiMi based only on Purism marketing materials. Would the forum add to that info? Would it be right?]

Just saying, this particular problem is limited to a specific area of unethical behaviour that’s focused on large caches of data, like websites, not on individual devices, like phones (although, technically, if you run a webserver from your L5, it’s potentially available for scraping). At the moment it seems it’s more feasible to scrape/copy/steal data, as it’s valuable and the risk is seen as negligible, but I’m expecting big litigation which may change that sentiment. [edit to add: Yes, I think Linux is part of the solution due to it being mostly secure and preventing random scrapers from getting into your computer, if they are not already stopped before that at the home router or such.]

(Btw. what doesn’t help is that CAPTCHAs, which are intended to stop bots, are pretty much useless against AI-bots - and Google has been using humans to make a profit with them.)

1 Like

Probably not but a privacy-respecting ecosystem can.

There are many ways in which data can be collected:

  • voluntarily if you publish your data, on social media or otherwise
  • involuntarily if you use a spyphone or own and operate other spying technology
  • involuntarily or semi-voluntarily in the many interactions with government or business that we may have daily

and there’s no one answer that will address all of that.

2 Likes

Boycott, bug out and/or bunker down.

Well, yes, OK - you can exit society completely and hence by definition exit the internet.

That will fully cease the collection process, although that leaves the legacy of all the data collected about you so far.

1 Like

That goes full circle:

1 Like

Fair points.

That might lead to a question: how can one, in practical terms, accelerate the process of data becoming stale? Governments / companies will keep the data anyway though, is my guess.

One answer might, paradoxically, be: rather than starve them of your personal data, instead feed them false personal data (particularly in respect of companies where it could be legal to do so).

Another interesting angle, given the presence of “AI” in the topic title … if your data is used to train AI, that definitely persists beyond even the deletion of the original data, and perhaps the rate at which it goes stale is lower. However maybe the OP’s original concern was only with the actual original data that directly relates to an individual.

1 Like

Let me take that down a theoretical path for a bit: Let’s say AI is inevitable like Thanos. If all those who feel they are on the margins, somehow at risk and with something to lose, withheld their data, that data denial could snap half of all data out of existence - or out of AI reach at least. The other half doesn’t have to be the majority or evil, just ignorant, complacent etc. But then the result would be that all the data that’s left and used - and therefore AI - would be skewed towards… something (hard to say how positive, negative or weird it would be, but that’s the multiverse for you). At least the effect would be that AI would not be able to take into consideration minorities and smart people, since it wouldn’t have the data and precedent to base its calculations on (that one in fifteen million or so thing). So, although the data assassination plan has its appeal (just go for the head), in the long run a massive show of force by all the different heroes may be better to overrun the generic CGI-data-army. Perhaps we should start generating more of our kind of data to feed this phase of the AI universe, thus making sure it’s not the endless same old comic content in the future. :star_struck:
Not something that happens just by snapping fingers, I’m afraid. :sparkles:

1 Like

Making deliberate and conscious decisions to avoid public participation for an extended and/or indefinite length of time. An immediate and personal example was the time I lurked on the Purism community forums from 2018 to 2023. A more recent example is my ongoing efforts to become unbanked. The pandemic has taught me how to rationalize every decision to interact with the public, even within digital mediums.

1 Like

IMO:
It’s 2024 and no site needs to play Google’s puzzle games (CAPTCHAs, reCAPTCHA, or anything requiring the visitor to click to prove they are capable of clicking on demand). But I like:
“Please prove you are a human - enter your credit card and CVV numbers”

Back in the day (2023) we removed the tattlers and used a method that didn’t require notifying Google or playing click-the-bus garbage. The one method we used was to challenge the visitor’s computer.
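The post doesn't say how the visitor's computer was challenged, so the following is only one possible approach, not a description of the method actually used: a hashcash-style proof-of-work sketch in Python. The server hands out a random challenge, the visitor's machine has to burn some CPU finding a matching nonce, and the server verifies it with a single hash. All function names and the difficulty of 12 bits are assumptions for illustration.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: issue a random, single-use challenge string."""
    return secrets.token_hex(16)


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def is_valid(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash is enough to verify the visitor's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits


def solve(challenge: str, difficulty_bits: int) -> int:
    """Visitor side: brute-force a nonce, about 2**difficulty_bits hashes on average."""
    nonce = 0
    while not is_valid(challenge, nonce, difficulty_bits):
        nonce += 1
    return nonce


if __name__ == "__main__":
    challenge = make_challenge()
    nonce = solve(challenge, difficulty_bits=12)   # assumed, deliberately low difficulty
    print(is_valid(challenge, nonce, difficulty_bits=12))
```

A single visitor barely notices the cost, while a bot hitting thousands of pages pays it every time. It raises the price of scraping without phoning home to a third party, though it does not prove anyone is human.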

IMO
If Artificial Idiots are being trained to scrape, the PondScum abusing AI for personal gain might need another AI just to unravel it.

1 Like

I wonder when the opposite of those CAPTCHA Turing tests will appear - systems testing humans and preventing them from getting in (“Find Waldo from DB of 10 million faces” or “Which of these 100 buses has the median hue of #DAA520”), so we don’t mess things up :wink:

[Really, some admins would love having that feature to prevent idiots and knowitalls from doing stuff, and I’m not referring to using one’s own personal systems]

[Edit to add: just funny: xkcd: Machine Learning Captcha]

1 Like

When I see any kind of captcha, I know the people that built the site don’t know enough about the evils of Google and just how many trackers, SMIRC’ers and stalkers they let ride in the template they used, or plug-in bells and whistles they dress up the site with.
Any captcha where the visitor has to prove to Google they are real is part of the assimilation.

IMO, anything Google provides is like getting into bed with the mafia. You’ll never get out. We become part of the collective.

1 Like