How To Fight AI Abusing You

Are you able to tell us which AI you used?
TIA
~s

1 Like

In addition, for publicly accessible pages, “scraping” is a normal part of web search indexing. (So if you successfully prevent scraping, by any means, then you are preventing anyone from finding your site’s pages via a web search. For a company that aims to derive revenue, that could be bad.)

Any web site can create a resource /robots.txt that asks specific web crawlers (or all web crawlers) to index, or not index, parts of the web site - but that is a request that a web crawler is free to ignore. A reputable web crawler will comply.
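For reference, a minimal /robots.txt might look like this (“ExampleBot” is a made-up crawler name; as noted, any crawler is free to ignore the file):

```
# Ask one named crawler to stay out entirely
User-agent: ExampleBot
Disallow: /

# Ask all other crawlers to skip one directory
User-agent: *
Disallow: /private/
```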

Just for fun … on my own web sites, in addition to telling web crawlers to f… off, any web crawler that publishes the list of IP addresses it uses to crawl will see a “virtual web site” that differs from the real web site. That is, the only resource such a crawler sees is /robots.txt, and there is no other content to index even if it chooses to ignore /robots.txt.
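A rough sketch of that “virtual web site” logic, assuming the crawler’s published ranges are known (the CIDR blocks below are documentation examples, not any real crawler’s addresses):

```python
import ipaddress

# Hypothetical published crawler ranges - documentation CIDRs only.
CRAWLER_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

def is_known_crawler(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in CRAWLER_RANGES)

def real_site_content(path: str) -> str:
    # Stand-in for serving the real site to normal visitors.
    return f"<html>real page at {path}</html>"

def respond(client_ip: str, path: str) -> tuple:
    """Known crawlers see the 'virtual' site: /robots.txt exists and
    everything else is 404, so there is nothing to index."""
    if is_known_crawler(client_ip):
        if path == "/robots.txt":
            return (200, ROBOTS_TXT)
        return (404, "")
    return (200, real_site_content(path))
```

In a real deployment this check would sit in the web server or a front-end proxy rather than application code, but the decision is the same: match the source address, then serve the empty site.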

A moment’s thought though will reveal “gaps” in that approach.

I do what I can.

4 Likes

Sorry, it’s been a while. Probably one of the bigger ones.

2 Likes

IMO, that depends on the business. We could “drive revenue” without Google or real search engines.

Back to the scrape. IMO - We don’t need Google. What we need is a search engine.

A fellow with a local swimming pool care service wants his business to be the number 1 result on Google. Google will push local businesses first, so one does not need to do the $EO thing: Google will build a list local to the info seeker any way it can - provided, of course, one isn’t looking for a vacation in St. Moritz! But one does need to skip past the paid-to-Google pages about watches and St. Moritz martinis first.

I wouldn’t let Yahoo’s Slurp in to scrape, though.

I believe sites will be indexed (scraped?) anyway, with or without Google’s blessing: if someone types the business name into the search, it comes up at the top.

BTW, both Google and Bing are capable of “scraping” web sites for illegal child abuse images - but won’t do it. I can back that up.

Having to fight off, and oftentimes pay to hide from, the hacking and slashing of our rights to privacy - it shouldn’t be like this.

just my opinions of course,
~s

1 Like

Part 2.

Another way that I counter scraping on my own web sites is by having pages that are unreachable from the home page.

In other words, a typical scrape / crawl starts at the home page and looks for the URLs of any pages referenced by the home page (and on the same web site). It then applies that process iteratively (or recursively) to any pages so found, until it has found (and downloaded and indexed) all pages on the web site that can be reached from the home page by a finite number of navigations.

But that set of pages can be much less than the total set of pages on the site.
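The iterative process above can be sketched as a breadth-first walk over an in-memory link graph (the page names are made up for illustration):

```python
from collections import deque

def reachable_pages(links, start):
    """Breadth-first crawl of an in-memory link graph: start at the home
    page and follow links iteratively, as a typical crawler would."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for url in links.get(page, []):
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return seen

# A toy site: /hidden exists on the server, but no page links to it.
site = {
    "/": ["/about", "/posts"],
    "/about": ["/"],
    "/posts": ["/posts/1"],
    "/posts/1": [],
    "/hidden": [],
}
print(sorted(reachable_pages(site, "/")))  # → ['/', '/about', '/posts', '/posts/1']
```

The crawl never discovers /hidden, which is exactly the “unreachable from the home page” counter described above.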

Indeed.

1 Like

A few other technical counters include authentication and using different overlay networks:

See also:

and

I get the feeling that we pay more and more for what we don’t want (anti-virus, anti-malware, anti-tracking, ads…), and protection is getting very expensive and complicated - and in the end, a lot of sites block us if we don’t comply.

I’m debating with myself if I even need the Internet.

Thanks guys for the helpful tips too.
~s

1 Like

I don’t know whether we are paying $$ for those things but we certainly pay with our time and attention.

2 Likes

I pay for anti-virus and anti-malware. One really has to, with Windows still around. As a consumer, I think I help pay for the ads I don’t want - it’s built into the cost of the products and services we buy.
Indirectly, Canadians helped pay to elect the President, and to send US bombs to Israel.
Here in British Columbia, our electricity supplier charges extra to cover the cost of those hydro users who don’t, won’t, or can’t pay their electric bills.

AI is already telling us how to behave, limiting our choices, changing our outlook (or else), … even making medical diagnoses and providing the prescription.

I still think the I in AI is I dio t. :rofl:
~s

1 Like

And for the crawlies who ignore robots.txt: Tar pits! (Arstechnica.com)
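The tar-pit idea, roughly: every garbage page links to more dynamically generated garbage pages, so a crawl of the pit never terminates. A toy sketch, with an invented /pit/&lt;n&gt; URL scheme and word list:

```python
import random

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]

def tarpit_page(seed: int, word_count: int = 200) -> str:
    """Deterministically generate a page of garbage text that links to
    two more generated pages; the seed in each /pit/<n> URL picks the
    content of the next page, so the pit is effectively bottomless."""
    rng = random.Random(seed)
    words = " ".join(rng.choice(WORDS) for _ in range(word_count))
    child1 = rng.getrandbits(32)
    child2 = rng.getrandbits(32)
    return (f"<html><body><p>{words}</p>"
            f'<a href="/pit/{child1}">more</a> '
            f'<a href="/pit/{child2}">more</a></body></html>')
```

Seeding from the URL means no state is stored server-side: each page is regenerated identically on demand, which keeps the trap cheap to serve.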

3 Likes

The problem that I have with that approach (tar pits), and indeed the problem is acknowledged in the article itself, is that if your primary concern is the network load (or the energy cost at either end arising from it) of constant crawling then … trapping the crawler in infinite pages of garbage makes the problem worse, not better.

The approach really only makes sense when the crawler is specifically trawling for content to feed an AI model and you actually want to devote network bandwidth and energy cost to trashing the AI model i.e. you hate AI more than you hate e.g. your web site being sluggish.

A more pleasant approach might be … detect any source IP address that is imposing a high network load on your web site (regardless of whether it’s a crawler, and regardless of whether it is specifically for AI) and slow down responses to that IP address, thus automatically adjusting downwards the network load that is being imposed. (However that’s not something that I am trialling. I still prefer the empty virtual web site approach.)
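A sketch of that per-IP slow-down, using a sliding window of request timestamps (the window size and thresholds are arbitrary placeholders, not recommendations):

```python
import time
from collections import defaultdict

class PerIpThrottle:
    """Count requests per source IP in a sliding window and return an
    artificial delay that grows with that IP's request rate."""

    def __init__(self, window=60.0, free_requests=30,
                 delay_step=0.25, max_delay=5.0):
        self.window = window                # seconds of history to keep
        self.free_requests = free_requests  # requests served at full speed
        self.delay_step = delay_step        # extra delay per excess request
        self.max_delay = max_delay          # cap on the added delay
        self.hits = defaultdict(list)

    def delay_for(self, ip, now=None):
        """Record a request from `ip` and return the delay (in seconds)
        to apply before responding; 0.0 while the IP is under the limit."""
        now = time.monotonic() if now is None else now
        hits = [t for t in self.hits[ip] if now - t < self.window]
        hits.append(now)
        self.hits[ip] = hits
        excess = len(hits) - self.free_requests
        if excess <= 0:
            return 0.0
        return min(excess * self.delay_step, self.max_delay)
```

The delay ramps up gradually, so a heavy crawler throttles itself downwards automatically while ordinary visitors never notice.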

I don’t personally see much crawling. Most of the garbage that I see is probing for security weaknesses.

3 Likes

What is the alternative, silver bars?

1 Like

Pure unadulterated labour.

In re-reading some of the posts, I see I was remiss in not asking a more pointed question. I think (memory evades me) I meant scraping off our phones, even our desktops.

I pointed to a series of events that exposed both Google’s search engine results and M$’s Bing, where both were returning results that included white supremacist sites and child abuse porn.

The NY Times did an investigation and found that both algorithms were sloppy, but the biggest find was that both Google’s and Micro$oft’s engines were, and are, able to scan for, find, and report both types of sites.

I was curious whether any web site with the proper tools could index and scrape whatever it wants from our L5’s running PureOS.

I think people auto-thought I was talking about our own websites. I just wanted to be clear. :neutral_face:
Being it’s Sunday, I was cleaning some things up.
~s

1 Like

Maybe AI-powered cloud antivirus SaaS.

I wonder if it is possible for a person to register their name with the feds (or whomever) as a Registered Trade name, claim absolute rights to any of their own data online, and require that any info scraped, by whatever means, obliges the scraper to pay royalty fees.

Yeah - a bookkeeping nightmare, but that would be the scraper’s problem to sort out. First things first: stop scraping until they can pay the fee. They have the tools to steal our data; they can figure out a way to pay for it.

Just pipe dreaming… :thinking:

~s

1 Like

The appropriate authority in Canada would be the Canadian Intellectual Property Office:

The issue is that trademarks are intended for brands with geographical region recognition, not individuals with limited public exposure:

Key words “in Canada.”

And why have we knuckled under, having to Opt-Out of being abused by corporate greed while our rights are ignored?
Corporations should have to ask us to Opt-In - for a fee.

It’s Sunday and I have nothing pressing to do.
~s

1 Like

I am more concerned with solving issues than discussing them.

1 Like