This solution is far from perfect and could be improved, automated, or extended in many ways. It took me only ten minutes to implement, though, and cut my web traffic by roughly 80%.
I simply use the ngx_http_map_module, which lets the value of one variable
depend on the value of another.
map $http_user_agent $blocked_user_agent {
    default 0;
    ~*amazonbot 1;
    ~*openai 1;
    ~*chatgpt 1;
    ~*gptbot 1;
    ~*claudebot 1;
}
The nginx expression ~*term matches case-insensitively for an occurrence of
the string at any position.
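The ~* patterns behave like case-insensitive substring matches. A rough shell analogue using grep -i (the user-agent string here is made up for illustration):

```shell
# Hypothetical user agent; real crawler UAs vary.
ua='Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)'

# Check each pattern from the map the way nginx's ~* would:
# a case-insensitive match anywhere in the string.
for pat in amazonbot openai chatgpt gptbot claudebot; do
    if printf '%s' "$ua" | grep -qi "$pat"; then
        echo "matched: $pat"
    fi
done
```

Here both openai and gptbot match, since GPTBot's user agent contains both substrings regardless of case.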
This map goes somewhere in the http block of my /etc/nginx/nginx.conf.
I simply checked my nginx access logs to find the most common crawlers by user agent.
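One quick way to tally user agents is to split each log line on double quotes, since the default "combined" log format puts the user agent in the last quoted field. This sketch uses fabricated sample lines and a temporary file instead of the real /var/log/nginx/access.log:

```shell
# Fabricated sample lines in nginx's default "combined" log format.
cat > /tmp/access.log <<'EOF'
1.2.3.4 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"
5.6.7.8 - - [01/Jan/2025:00:00:01 +0000] "GET /a HTTP/1.1" 200 128 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"
9.9.9.9 - - [01/Jan/2025:00:00:02 +0000] "GET /b HTTP/1.1" 200 256 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"
EOF

# The user agent is the 6th "-delimited field in the combined format;
# count occurrences and sort by frequency, most common first.
awk -F'"' '{print $6}' /tmp/access.log | sort | uniq -c | sort -rn
```

On a real server, point the awk command at your actual access log; the busiest crawlers show up at the top of the list.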
To apply the filter I run a simple if check before serving any files or proxying in the server block. The error code I selected is 450, but really any code can be used.
For example:
server {
    ....
    location / {
        if ($blocked_user_agent) {
            return 450; # Blocked by Windows Parental Controls
        }
        try_files ...;
    }
}