Sometimes you might want to stop Googlebot from crawling certain pages, and for that you would normally use the robots.txt file to disallow them.
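For reference, a minimal robots.txt entry that keeps Googlebot away from a couple of paths looks like the following (the paths here are placeholders for your own):

User-agent: Googlebot
Disallow: /staging/
Disallow: /internal/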
But at times, during a migration or while testing newer changes, you allocate a small share of traffic to new endpoints to verify whether things are working. The newer pages might not yet have certain components that Googlebot relies on from an SEO perspective.
Also, serving a small allocation of traffic from the new site can cause the bot to see the same page in two versions and mark it as duplicate content, which can affect search results.
So you can also prevent and control Googlebot from crawling pages at the nginx web server level itself.
First, two important things to keep in mind:
1. Google has multiple bots. No one knows the complete list, though Google gives some idea about its bots. One thing is common: they all have the word google in their user agent (a few examples are shown after this list).
2. This is not a replacement for robots.txt. We are implementing it because of the partitioning/allocation of a small share of traffic to the new site, which gradually increases over time. We don't want both sites to be visible to the bot at the same time, and the rule should be removed once the migration is complete.
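For example, a few of the user agent strings Google documents for its crawlers look like the ones below (illustrative, the exact strings change over time, but each one contains google):

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot-Image/1.0
Mediapartners-Google
AdsBot-Google (+http://www.google.com/adsbot.html)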
You can detect Googlebot with the help of the $http_user_agent variable which nginx provides, by looking for the string google in it. If the user agent contains google, you can be reasonably certain that it is a Google bot.
Based on the above, we can control Googlebot via the user agent in nginx and restrict or proxy particular site pages using this approach.
So in the location directive you can send a 420 error to Googlebot, and you can reuse this error condition in all your if statements wherever required:
location = / {
    # Route the custom 420 code to the named location defined below
    error_page 420 = @google_bot;
    # Check whether the user agent contains "google" (case-insensitive)
    if ($http_user_agent ~* (google)) {
        return 420;
    }
}
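If you need the same check in many locations, one alternative sketch is to compute a flag once with a map block (placed in the http context) and test that flag wherever needed. This keeps the regex in a single place; the variable name $is_google_bot is just chosen here for illustration:

map $http_user_agent $is_google_bot {
    default     0;
    "~*google"  1;
}

location = / {
    error_page 420 = @google_bot;
    # Same 420 trick, but driven by the precomputed flag
    if ($is_google_bot) {
        return 420;
    }
}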
You can also proxy_pass from the named location so that Googlebot always lands on the old page:
location @google_bot {
    # Send bot traffic to the old site instead of the new allocation
    proxy_pass $scheme://unixcloudfusion;
}
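One thing to keep in mind: because proxy_pass is written with a variable here, nginx looks up the name unixcloudfusion at run time, so it either has to match an upstream group or be resolvable through a resolver directive. A minimal sketch, assuming the old site runs on backends you list yourself (the addresses below are placeholders):

upstream unixcloudfusion {
    server 10.0.0.10:80;
    server 10.0.0.11:80;
}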