If you have a website or have just started studying SEO in digital marketing, chances are you have heard of robots.txt or even worked on the file. Many students find robots.txt hard to comprehend, and although at first glance it may seem like a harmless text file, a tiny error can wreak havoc on a site's ranking.
What is a robots.txt?
Robots.txt is a simple text file with the extension .txt. It contains instructions meant for search engine bots, also called web crawlers. Bots from Google, Bing, Yahoo and others read the instructions in the file and act accordingly. Append '/robots.txt' to any domain name to view that website's robots.txt file. Essentially, the file tells bots not to crawl specific pages and files on a website, which is useful when businesses do not want certain pages displayed in search results. A robots.txt file can only tell bots what to do; good bots (such as those from search engines or reputable websites) will obey, but bad bots (such as spam bots and scraper bots) will ignore it.
A robots.txt file may seem intimidating at first, but once its directives are understood, it is simple to read.
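For illustration, here is a minimal robots.txt file. The paths and domain are hypothetical; a real file would list the pages the site owner wants to keep out of search results:

```
# Rules for all crawlers
User-agent: *
Disallow: /login/
Disallow: /thank-you/

# Location of the sitemap
Sitemap: https://www.example.com/sitemap.xml
```

The `User-agent` line names which bot the rules apply to (`*` means all bots), and each `Disallow` line names a path the bot should not crawl.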
How robots.txt works in a nutshell
The robots.txt file is a simple text file with the extension .txt. It does not contain any HTML or XML tags, and, just like the other files of a website or application, it is hosted on the web server. Regular visitors rarely see this file because it is not linked from any page on the website, but crawlers check it first when they visit the site. A robots.txt file is like a notice: it can tell a crawler what to do, but whether the crawler actually complies is up to the crawler's own code. If the file contains conflicting instructions, a crawler will typically follow the most specific rule that matches the URL.
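The "check first, then crawl" behaviour of a well-behaved bot can be sketched with Python's standard-library robots.txt parser. The rules and URLs below are hypothetical, not taken from any real site:

```python
from urllib.robotparser import RobotFileParser

# A well-behaved crawler fetches and parses robots.txt before
# requesting any page. Here we parse an in-memory example instead
# of downloading one from a live site.
rules = """
User-agent: *
Disallow: /login/
Disallow: /thank-you/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(user_agent, url) answers: may this bot crawl this URL?
print(parser.can_fetch("*", "https://example.com/blog/post-1"))  # True
print(parser.can_fetch("*", "https://example.com/login/"))       # False
```

A bad bot simply never calls this check, which is why robots.txt cannot be relied on as a security measure.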
Note: Sub-domains, such as those used for blogs, need their own robots.txt files.
Importance of robots.txt file
Not every site needs a robots.txt file. Some basic websites omit it because major search engines can usually crawl and present their pages without surfacing duplicate or unwanted ones. So why and when should a robots.txt file be used?
To block specific pages – Which pages of a website does the general audience not need to see, especially in search results? Typical answers are the login page, the thank-you page and so on. These pages offer little value to a search engine audience, so they should be blocked using the robots.txt file.
To optimize search engine crawl budget – What is a crawl budget? It is the number of pages of a website that a search engine crawls in a day, and it varies from day to day. It is therefore practical to optimize this budget: no one wants bots spending it on unwanted pages. The robots.txt file can instruct bots to skip low-value pages so the budget is spent on the pages worth indexing.
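As a sketch, a site might steer crawlers away from low-value URL spaces like internal search results or tag archives (the paths here are hypothetical). Note that `Crawl-delay` is a nonstandard directive: some crawlers such as Bing honour it, but Google ignores it:

```
User-agent: *
# Keep bots out of near-infinite or low-value URL spaces
Disallow: /search/
Disallow: /tag/
Disallow: /print/

# Nonstandard: ask compliant bots to wait 10 seconds between requests
Crawl-delay: 10
```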
To block bots from non-HTML files – Meta robots tags or directives (such as noindex) can also keep pages out of search results, but meta tags cannot be placed inside PDFs, images, audio, video and other multimedia files. Robots.txt handles these resources easily.
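For example, a hypothetical site could block crawlers from individual media files or whole media directories like so:

```
User-agent: *
# Block a single file and whole media directories
Disallow: /downloads/brochure.pdf
Disallow: /images/
Disallow: /videos/
```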
These are the basics of robots.txt. If you are interested in learning more about robots.txt or SEO, feel free to check out our course on Advanced Digital Marketing.