Ragnar
A Master From Germany
Which actually meant creating a robots.txt, since to my surprise I found XenForo apparently doesn't come with one by default.
I've always relied on banning Google Analytics etc. to protect privacy; but as the Internet has changed in the last 5 years, I decided to add some repellents to AI crap:
Code:
# The Common Crawl dataset. Original source for GPT and others.
User-agent: CCBot
Disallow: /
# The example for img2dataset, although the default is *None*
User-agent: img2dataset
Disallow: /
# GPTBot is OpenAI's web crawler
User-agent: GPTBot
Disallow: /
# ChatGPT-User takes direct actions on behalf of ChatGPT users
User-agent: ChatGPT-User
Disallow: /
# Google's Bard and Vertex AI generative APIs
User-agent: Google-Extended
Disallow: /
# Speculative blocks for Anthropic
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
# webz.io - they sell data for training LLMs.
User-agent: Omgilibot
Disallow: /
User-agent: Omgili
Disallow: /
# Meta's bot that crawls public web pages to improve language models
User-agent: FacebookBot
Disallow: /
# ByteDance's bot used to gather data for their LLMs, including Doubao.
User-agent: Bytespider
Disallow: /
# Brandwatch - "AI to discover new trends"
User-agent: magpie-crawler
Disallow: /
# Baidu and Yandex search crawlers: blocked entirely
User-agent: Baiduspider
Disallow: /
User-agent: Baiduspider-video
Disallow: /
User-agent: Baiduspider-image
Disallow: /
User-agent: Yandex
Disallow: /
# Everyone else: the two "User-agent: *" groups merged into one, exact duplicates dropped
User-agent: *
Disallow: /find-new/
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /account*
Disallow: /help*
Disallow: /misc/style*
Disallow: /misc/quick-navigation-menu*
Disallow: /login*
Disallow: /logout*
Disallow: /lost-password*
Disallow: /register*
Disallow: /reports*
Disallow: /search*
Disallow: /conversations*
Disallow: /css.php
Disallow: /cron.php
Disallow: /admin.php
Disallow: /js
Disallow: /styles
Disallow: /members/*
Disallow: /profile-posts/*
Disallow: /online/*
Disallow: /recent-activity/*
Allow: /
The AI stuff is at the top [ from here: GitHub Training Data ]
GitHub is of course owned by vile old Microsoft, one of the biggest sinners --- in AI development as well.
The rest is standard stuff cobbled by Elves.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
It was FTP'd to what I believe was the absolute root of the installation, and not one of the other roots like public_html, so mistakes may have been made.
It will probably need to be updated at least annually, since advertisers etc. are tricky little scum and the Tech Titans will always cheat.
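For anyone wanting to sanity-check rules like these before (or after) uploading, here is a minimal sketch using Python's standard `urllib.robotparser`. The trimmed rule set, the `is_allowed` helper and the example thread path are made up for illustration and are not part of the file above. Python's parser does plain prefix matching, so it does not exercise the trailing `*` patterns, and a real crawler may read the file differently.

```python
from urllib.robotparser import RobotFileParser

# Trimmed, hypothetical excerpt of the rules above (assumption: two groups only).
SAMPLE_RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /admin.php
"""


def is_allowed(rules_text: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "/threads/example.1/" is an invented path standing in for a forum thread.
    checks = [
        ("GPTBot", "/threads/example.1/"),
        ("Mozilla", "/threads/example.1/"),
        ("Mozilla", "/account/"),
    ]
    for agent, path in checks:
        verdict = "allowed" if is_allowed(SAMPLE_RULES, agent, path) else "blocked"
        print(f"{agent:10} {path:22} {verdict}")
```

To test the real thing, paste the full file contents into `SAMPLE_RULES`. Crawlers only ask for `/robots.txt` at the top level of the site, so it is also worth opening that URL in a browser to confirm the upload landed in the right root.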