Comment bloquer les Robots qui aspirent le contenu de votre site pour entraîner des modèles LLM ?
#via conf nginx :
if ($http_user_agent ~* (AI2Bot|Ai2Bot-Dolma|Amazonbot|Applebot|Applebot-Extended|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|Diffbot|FacebookBot|FriendlyCrawler|GPTBot|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|ICC-Crawler|ImagesiftBot|Kangaroo Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|PerplexityBot|PetalBot|Scrapy|Sidetrade indexer bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot|anthropic-ai|cohere-ai|facebookexternalhit|iaskspider/2.0|img2dataset|omgili|omgilibot)) {
return 403;
}
Mon Oct 7 15:56:02 2024 - permalink -
-
https://www.geeek.org/comment-bloquer-robots-aspirent-contenu-pour-llm/