
Robots.txt

11.02.2022

The robots.txt file, together with the XML sitemap, conveys perhaps the most important information about a site: it shows search engine robots how to “read” the site, which pages are important, and which should be skipped. Robots.txt is also one of the first places to check if traffic suddenly drops on the site.

What Is a Robots.txt File?

A robots.txt file is an implementation of the Robots Exclusion Protocol. The protocol delineates the guidelines that every legitimate robot must follow, including Google’s bots. Some illegitimate robots, such as malware, spyware, and the like, by definition operate outside these rules.

You can take a peek behind the curtain of any website by typing in its domain and adding /robots.txt at the end.

For example, here’s POD Digital’s version:
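The live file may change over time, but a minimal sketch of a WordPress-style robots.txt like this one might look as follows (the Allow line and the sitemap URL are illustrative assumptions, not a copy of the real file):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.poddigital.co.uk/sitemap.xml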

User-Agent Directives

In a robots.txt file with multiple user-agent directives, each disallow or allow rule only applies to the user-agent(s) specified in that particular line-break-separated group. If the file contains a rule that applies to more than one user-agent, a crawler will only pay attention to (and follow the directives in) the most specific group of instructions.

 Here’s an example:

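A file with this structure might look something like the sketch below (the disallowed paths are hypothetical):

User-agent: msnbot
Disallow: /private/

User-agent: discobot
Disallow: /search/

User-agent: Slurp
Disallow: /

User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/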

Msnbot, discobot, and Slurp are all called out specifically, so those user-agents will only pay attention to the directives in their sections of the robots.txt file. All other user-agents will follow the directives in the user-agent: * group.

Example robots.txt:

Here are a few examples of robots.txt in action for a www.example.com site:

Robots.txt file URL: www.example.com/robots.txt
Blocking all web crawlers from all content

User-agent: *
Disallow: /

Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages on www.example.com, including the homepage.

Allowing all web crawlers access to all content

User-agent: *
Disallow:

Using this syntax in a robots.txt file tells web crawlers to crawl all pages on www.example.com, including the homepage.

Blocking a specific web crawler from a specific folder

User-agent: Googlebot
Disallow: /example-subfolder/

This syntax tells only Google’s crawler (user-agent name Googlebot) not to crawl any pages that contain the URL string www.example.com/example-subfolder/.

Blocking a specific web crawler from a specific web page

User-agent: Bingbot
Disallow: /example-subfolder/blocked-page.html

This syntax tells only Bing’s crawler (user-agent name Bingbot) to avoid crawling the specific page at www.example.com/example-subfolder/blocked-page.html.

Where to Locate the Robots.txt File

Your robots.txt file is stored in the root directory of your site. To locate it, open your site in cPanel’s File Manager (or connect via FTP), and you’ll find the file in your public_html directory.

[Image: the cPanel File Manager showing the public_html directory]

There is not much to these files, so they won’t be hefty – probably only a few hundred bytes, if that.

Once you open the file in your text editor, you will be greeted with something that looks a little like this:

[Image: a basic robots.txt file opened in Notepad]

If you aren’t able to find a file in your site’s inner workings, then you will have to create your own.

How to Put Together a Robots.txt File

Robots.txt is a super basic text file, so it is actually straightforward to create. All you will need is a simple text editor like Notepad. Open a new document and save the empty file as ‘robots.txt’.


Robots.txt Syntax 

A robots.txt file is made up of multiple sections of ‘directives’, each beginning with a specified user-agent. The user agent is the name of the specific crawl bot that the code is speaking to.

There are two options available:

  1. You can use a wildcard to address all search engines at once.
  2. You can address specific search engines individually.

The directives you will use most often are:

  • User-Agent Directive – names the crawler that the rules which follow apply to.
  • Host Directive – historically used by Yandex to indicate the preferred domain (with or without www); most other search engines ignore it.
  • Disallow Directive – tells the named user-agent not to crawl the specified path.
  • Sitemap Directive (XML Sitemaps) – points crawlers to the location of your XML sitemap.
  • Crawl-Delay Directive – asks crawlers to wait a set number of seconds between requests; Bing and Yandex respect it, but Google ignores it.
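To see how these fit together, here is a minimal sketch of a robots.txt file that uses several of them at once (the domain and paths are placeholders):

User-agent: *
Disallow: /cgi-bin/
Crawl-delay: 10

Host: www.example.com

Sitemap: https://www.example.com/sitemap.xml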

Why Use Robots.txt

Now that you know about the basics and how to use a few directives, you can put together your file. However, this next step will come down to the kind of content on your site.

Robots.txt is not an essential element of a successful website; in fact, your site can still function correctly and rank well without one.

However, there are several key benefits you must be aware of before you dismiss it:

  • Point Bots Away From Private Folders: Preventing bots from checking out your private folders will make them much harder to find and index.
  • Keep Resources Under Control: Each time a bot crawls your site, it sucks up bandwidth and other server resources. Sites with tons of content and lots of pages – e-commerce sites, for example, can have thousands of pages – can see these resources drained really quickly. You can use robots.txt to make it difficult for bots to access individual scripts and images; this will retain valuable resources for real visitors.
  • Specify Location Of Your Sitemap: This is quite an important point – you want to let crawlers know where your sitemap is located so they can scan through it.
  • Keep Duplicated Content Away From SERPs: By adding a rule to your robots.txt file, you can prevent crawlers from indexing pages that contain duplicated content.

You will naturally want search engines to find their way to the most important pages on your website. By politely cordoning off specific pages, you can control which pages are put in front of searchers (be sure to never completely block search engines from seeing certain pages, though).


For example, if we look back at the POD Digital robots file, we see that this URL:

poddigital.co.uk/wp-admin has been disallowed.

Noindex

In July 2019, Google announced that it would stop supporting the noindex directive in robots.txt, along with many previously unsupported and unpublished rules that many of us had relied on.

Many of us decided to look for alternative ways to apply the noindex directive, and below you can see a few options you might decide to go for instead:

  • Noindex Tag / Noindex HTTP Response Header: This directive can be implemented in two ways: as an HTTP response header using X-Robots-Tag, or as a <meta> tag placed within the <head> section of the page.

Your <meta> tag should look like the example below:

<meta name="robots" content="noindex">
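If you go the header route instead, the server adds the same instruction to the HTTP response:

X-Robots-Tag: noindex

As a minimal sketch (assuming an Apache server with mod_headers enabled), you could apply it to every PDF on the site like this:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>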

TIP: Bear in mind that if a page has been blocked by your robots.txt file, the crawler will never see your noindex tag, so there is still a chance that the page will appear in SERPs.

  • Password Protection: Google states that in most cases, if you hide a page behind a login, it should be removed from Google’s index. The only exception is if you use schema markup to indicate that the page contains subscription or paywalled content.
  • 404 & 410 HTTP Status Codes: 404 and 410 status codes represent pages that no longer exist. Once a page with a 404/410 status is crawled and fully processed, it should be dropped from Google’s index automatically.

You should crawl your website regularly to reduce the risk of 404 and 410 error pages and, where needed, use 301 redirects to send traffic to an existing page.

  • Disallow rule in robots.txt: By adding a page-specific disallow rule to your robots.txt file, you will prevent search engines from crawling the page. In most cases, your page and its content won’t be indexed. You should, however, keep in mind that search engines are still able to index the page based on information and links from other pages.
  • Search Console Remove URL Tool: This alternative route does not solve the indexing issue in full, as the Search Console Remove URL tool only removes the page from SERPs for a limited time.

However, this might give you enough time to prepare further robots rules and tags to remove pages in full from SERPs.

You can find the Remove URL Tool on the left-hand side of the main navigation on Google Search Console.

[Image: the Removals tool in Google Search Console]

Noindex vs. Disallow

Many of you probably wonder whether it is better to use the noindex tag or the disallow rule in your robots.txt file. We have already covered in the previous section why the noindex rule is no longer supported in robots.txt, along with the different alternatives.

If you want to ensure that one of your pages is not indexed by search engines, you should definitely look at the noindex meta tag. It allows the bots to access the page, but the tag will let robots know that this page should not be indexed and should not appear in the SERPs.

The disallow rule is generally not as effective as the noindex tag. Of course, by adding it to robots.txt you are blocking bots from crawling your page, but if that page is linked to from other pages via internal and external links, bots might still index it based on the information provided by those pages and websites.

You should remember that if you disallow a page and add the noindex tag, robots will never see the noindex tag, so the page can still appear in the SERPs.

Using Regular Expressions & Wildcards

Ok, so now we know what a robots.txt file is and how to use it, but you might think, “I have a big eCommerce website, and I would like to disallow all the pages which contain question marks (?) in their URLs.”

This is where we would like to introduce you to wildcards, which can be implemented within robots.txt. Currently, you have two types of wildcards to choose from.

  • * Wildcard – the * character matches any sequence of characters you wish. This type of wildcard is a great solution for URLs that follow the same pattern. For example, you might wish to disallow crawling of all filter pages that include a question mark (?) in their URLs – see the first sketch after this list.
  • $ Wildcard – the $ character matches the end of your URL. For example, if you want to ensure that your robots file disallows bots from accessing all PDF files, you can add a rule like the second sketch after this list.
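Both cases might look something like the sketches below (the paths are illustrative):

Blocking every URL that contains a question mark:

User-agent: *
Disallow: /*?

Blocking every URL that ends in .pdf:

User-agent: *
Disallow: /*.pdf$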

Let’s quickly break down the second example. This robots.txt allows any user-agent to crawl your website but disallows access to all pages whose URLs end in .pdf.

Mistakes to Avoid

We have talked a little bit about the things you could do and the different ways you can operate your robots.txt. We are going to delve a little deeper into each point in this section and explain how each may turn into an SEO disaster if not utilized properly.

Do Not Block Good Content

It is important not to block any good content that you want to be publicly visible, whether with the robots.txt file or a noindex tag. In the past, we have seen many mistakes like this hurt SEO results, so you should thoroughly check your pages for stray noindex tags and disallow rules.

Overusing Crawl-Delay

We have already explained what the crawl-delay directive does, but you should avoid using it too often as you are limiting the pages crawled by the bots. This may be perfect for some websites, but if you have got a huge website, you could be shooting yourself in the foot and preventing good rankings and solid traffic.

Case Sensitivity

The robots.txt file is case sensitive, so you have to remember to create it in the right way. The file itself must be named ‘robots.txt’, all in lower case. Otherwise, it won’t work!
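The same applies to the paths inside the file. For example, a rule like the one below will not stop crawlers from fetching /photo/ or /PHOTO/, because path matching is case sensitive as well (the folder name is just an example):

User-agent: *
Disallow: /Photo/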

Using Robots.txt to Prevent Content Indexing

We have covered this a little already. Disallowing a page is the most direct way to try to prevent bots from crawling it.

But it won’t work in the following circumstances:

  • If the page has been linked from an external source, the bots will still flow through and index the page.
  • Illegitimate bots will still crawl and index the content.
Using Robots.txt to Shield Private Content

Robots.txt is not a security mechanism. If you have private or sensitive content, shield it behind password protection rather than relying on a disallow rule, since disallowed URLs can still be discovered and indexed via links from other pages.

Using Robots.txt to Hide Malicious Duplicate Content

Duplicate content is sometimes a necessary evil — think printer-friendly pages, for example. 

However, Google and the other search engines are smart enough to know when you are trying to hide something. In fact, doing this may actually draw more attention to it, because Google recognizes the difference between a printer-friendly page and someone trying to pull the wool over its eyes.

Here are three ways to deal with this kind of content:

  1. Rewrite the Content – Creating exciting and useful content will encourage the search engines to view your website as a trusted source. This suggestion is especially relevant if the content is a copy and paste job.
  2. 301 Redirect – 301 redirects inform search engines that a page has transferred to another location. Add a 301 to a page with duplicate content and divert visitors to the original content on the site.
  3. Rel="canonical" – This is a tag that informs Google of the original location of duplicated content; it is especially important for e-commerce websites, where the CMS often generates duplicate versions of the same URL.
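For example, a duplicate product page can point back to the original with a tag like this in its <head> (the URL is just a placeholder):

<link rel="canonical" href="https://www.example.com/original-product-page/">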

Why do you need robots.txt?

Robots.txt files control crawler access to certain areas of your site. While this can be very dangerous if you accidentally disallow Googlebot from crawling your entire site (!!), there are some situations in which a robots.txt file can be very handy.

Some common use cases include:

  • Preventing duplicate content from appearing in SERPs (note that meta robots is often a better choice for this)
  • Keeping entire sections of a website private (for instance, your engineering team’s staging site)
  • Keeping internal search results pages from showing up on a public SERP
  • Specifying the location of sitemap(s)
  • Preventing search engines from indexing certain files on your website (images, PDFs, etc.)
  • Specifying a crawl delay in order to prevent your servers from being overloaded when crawlers load multiple pieces of content at once
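As a quick hypothetical sketch, a single file can cover several of these cases at once:

User-agent: *
Disallow: /staging/
Disallow: /search/
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml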

If there are no areas on your site to which you want to control user-agent access, you may not need a robots.txt file at all.

Checking if you have a robots.txt file

Not sure if you have a robots.txt file? Simply type in your root domain, then add /robots.txt to the end of the URL. For instance, Moz’s robots file is located at moz.com/robots.txt.

If no .txt page appears, you do not currently have a (live) robots.txt page.

How to create a robots.txt file

If you found you didn’t have a robots.txt file or want to alter yours, creating one is a simple process. Google’s documentation walks through the robots.txt file creation process, and its robots.txt testing tool lets you check whether your file is set up correctly.

SEO best practices

  • Make sure you’re not blocking any content or sections of your website you want crawled.
  • Links on pages blocked by robots.txt will not be followed. This means: 1) unless they are also linked from other search-engine-accessible pages (i.e. pages not blocked via robots.txt, meta robots, or otherwise), the linked resources will not be crawled and may not be indexed; 2) no link equity can be passed from the blocked page to the link destination. If you have pages to which you want equity to be passed, use a blocking mechanism other than robots.txt.
  • Do not use robots.txt to prevent sensitive data (like private user information) from appearing in SERP results. Because other pages may link directly to the page containing private information (thus bypassing the robots.txt directives on your root domain or homepage), it may still get indexed. If you want to block your page from search results, use a different method like password protection or the noindex meta directive.
  • Some search engines have multiple user-agents. For instance, Google uses Googlebot for organic search and Googlebot-Image for image search. Most user agents from the same search engine follow the same rules so there’s no need to specify directives for each of a search engine’s multiple crawlers, but having the ability to do so does allow you to fine-tune how your site content is crawled.
  • A search engine will cache the robots.txt contents but usually updates the cached contents at least once a day. If you change the file and want to update it more quickly, you can submit your robots.txt URL to Google.

Robots.txt vs meta robots vs x-robots

So many robots! What’s the difference between these three types of robot instructions? First off, robots.txt is an actual text file, whereas meta and x-robots are meta directives. Beyond what they actually are, the three serve different functions: robots.txt dictates site- or directory-wide crawl behavior, whereas meta and x-robots can dictate indexation behavior at the individual page (or page element) level.
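To make the contrast concrete, here is how the same intent looks in each place (the values are illustrative):

In robots.txt (controls crawling):

User-agent: *
Disallow: /private/

As a meta robots tag in a page’s <head> (controls indexing of that page):

<meta name="robots" content="noindex">

As an X-Robots-Tag HTTP response header (controls indexing of any file type):

X-Robots-Tag: noindex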