When my washing machine stopped working about a month ago, I tried everything to fix it.
I ran an extra wash cycle, cleaned the drum, and repeated the process.
But when I went to empty the filter with the help of my aunt, we only succeeded after we found a €2 coin.
With it, we opened the filter door and emptied the filter for good.
That little coin saved me, and my washing machine.
The robots.txt file impacts SEO just like that little coin had a huge impact on making my washing machine function again.
It’s a pity that sometimes web hosts don’t create a robots.txt file by default.
But, you can still create (and optimize) one yourself.
That’s exactly what this guide is for—from the basics of robots.txt SEO to the technical details made easy to understand.
Let’s dive in!
Robots.txt: The Tiny Website File That Can Make or Break Your SEO
Introducing the Robots.txt File
In short, a robots.txt file is simply a small text file located at the root of your website.
It states, on three lines:
- the User-agent you want to work with
- a Disallow field to tell search engine bots and crawlers what not to crawl
- an Allow field to let them know what to crawl instead
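Putting those three lines together, a minimal robots.txt might look something like this (the folder and file names here are just placeholders):

```txt
User-agent: *
Disallow: /private/
Allow: /private/public-notes.html
```
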
The syntax is simple:
After each field name (or directive), you add a colon followed by the value that you want the robots to consider.
While field names are case-insensitive, values are not. So, for example, if your folder is named “/My-Work/,” you can’t put “/my-work/” in your robots.txt file.
It won’t work correctly.
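For example, sticking with that “/My-Work/” folder:

```txt
Disallow: /My-Work/   # matches the folder, so it gets blocked
Disallow: /my-work/   # does NOT match "/My-Work/", so nothing is blocked
```
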
Let me explain robots.txt fields and values more in detail below.
The User-agent field declares which user agent we want to work with. In the language of robots.txt, a user agent is a spider bot or crawler.
The syntax is:

User-agent: [value]

For example, if I wanted to make the subsequent rules (values) apply to all user agents, I would enter the following:

User-agent: *

And if I wanted to say that the rules apply to a specific agent, it would look like this (replacing “AgentName” with the name of the user agent you want to work with):

User-agent: AgentName
Examples of commonly used user agents for search engines and social networks are:
- Slurp (Yahoo!’s web crawler)
- Facebot (Facebook’s crawler)
- Twitterbot (Twitter’s crawler)
- ia_archiver (Alexa’s crawler)
Disallow is the blacklisting directive in the robots.txt language.
The basic usage is as follows:
User-agent: *
Disallow:
Simply writing “Disallow:” with no value after it means you’re not blocking anything—you want all robots to crawl your entire website.
On the other hand, if you don’t want robots to crawl your website at all (not even a small portion of it), you’d enter:
User-agent: *
Disallow: /
You can also use the Disallow field when you want bots to crawl your entire website minus one or more specific files or areas.
User-agent: Googlebot
Allow: /public.jpg
Disallow: /private.jpg
Allow is the robots.txt whitelisting directive!
This is a good way to tell robots that you want one or more specific files to be crawled when they’re located inside an area of your site that you’ve previously disallowed with another rule.
For example, you may want to have Googlebot crawl only one image in a private area of your site, but not the rest of the private area.
To honor this intention, you can use this syntax:
User-agent: Googlebot
Disallow: /private/
Allow: /private/the-only-image-you-can-see.jpg
Robots.txt File Comments and Length
To add a comment to your robots.txt file, simply place a hash symbol (#) before the line you’re writing.
# This rule blocks Bingbot from crawling my blog directory
User-agent: Bingbot
Disallow: /blog/
A robots.txt file can be almost any length—there’s no set maximum, although Google, for one, only processes the first 500 KiB of the file.
Wanna take a look at Google’s? You’ll find it at google.com/robots.txt.
(You might have to scroll a little bit there.)
How Robots.txt Can Serve Your SEO Efforts
As I previously mentioned, the robots.txt file can hugely impact SEO. Particularly, it affects page indexing and the indexing of other content types (such as media and images).
Here’s how you can use the robots.txt file to better your SEO outcomes.
Using “User-agent” for SEO
As you’ve seen, when you write the User-agent field, you have the option to apply certain rules to all search engines and crawlers (with the asterisk *), or to single robots.
Or both, when you want to handle a mix of different behaviors.
Take a look at this example from one of my websites:
Here I wanted to exclude Google Images from indexing my images after I found out that some of my artwork from this and a similar website was scraped years ago. I also wanted to prevent Alexa’s web crawler from scanning my site.
I applied this SEO and reputation management decision to the robots.txt file by simply writing down Google Images’ and Alexa’s user agents and applying a Disallow rule to both of them, one per line.
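In practice, those two rules look something like this (Googlebot-Image is Google Images’ crawler; ia_archiver is Alexa’s, as listed earlier):

```txt
User-agent: Googlebot-Image
Disallow: /

User-agent: ia_archiver
Disallow: /
```
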
As an SEO, you know what search engines (or parts of search engines) you want to appear in, for whatever reason.
Robots.txt lets you tell web services what you allow and what you don’t, de facto determining the way your site appears (or doesn’t appear) on each platform.
Another common application of this field is when you don’t want the Wayback Machine (Archive.org) to save snapshots of your website.
By adding these two lines to your robots.txt file:
User-agent: archive.org_bot
Disallow: /
you can exclude the Internet Archive from crawling and snapshotting your website.
Using “Disallow” and “Allow” Directives for SEO
The Disallow and Allow directives are powerful tools to tell search engines and web mining tools exactly what to crawl and index.
So far, you’ve seen how to use them to exclude (or include) files and folders from being scanned and indexed. If you use these directives properly, you can optimize your crawl budget to leave out duplicate pages and service pages that you don’t want to rank in the SERPs (for example, thank you pages and transactional pages).
Here’s how I’d do that for a thank you page:
User-agent: Googlebot
Disallow: /thank-you-for-buying-heres-your-guide/
(Heck, can you imagine how many sales you could lose if a page like that gets indexed?)
The Dangers of Not Taking Care of Your Robots.txt
In a case study for Search Engine Land, Glenn Gabe reports how a company’s badly written robots.txt file led to URL leaks and index drop outs.
The kind of bad things you definitely don’t want to happen!
The company in question ran into a case sensitivity issue when disallowing category folders (“/CATEGORY/” instead of “/Category/”), and had blocked their entire website by using “Disallow: /” where they meant “Disallow:” (no trailing slash).
Because blocked URLs don’t drop out of the index all at once but through a slow leak, the company watched their rankings decline over a period of time.
Gabe also wrote a longer article about what happened when another company mistakenly disallowed their entire site.
It follows that a regular audit (and good maintenance) of your robots.txt file for SEO is critical to preventing such disastrous issues.
Robots.txt Hacks for SEO and File Security
In addition to basic robots.txt usage, you can implement a few more hacks to help support and boost your SEO strategy.
Add a Sitemap Rule to Your Robots.txt File
You can add a sitemap to your robots.txt file—even more than one, actually!
The screenshot below shows how I did this for my business website:
I added three sitemaps, one for my main site and two from subsites (blogs) that I want counted as part of the main site.
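In the file itself, that setup boils down to three Sitemap lines with absolute URLs (example.com and the blog paths here stand in for my real domains):

```txt
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/blog-one/sitemap.xml
Sitemap: https://www.example.com/blog-two/sitemap.xml
```

The Sitemap directive is independent of any User-agent group, so it can go anywhere in the file.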
While adding a sitemap to your robots.txt file is no guarantee of better site indexing, it’s worked for some webmasters, so it’s worth giving it a try!
Hide Files That You Don’t Want Search Engines or Users to See
It could be that .PDF e-book you’re selling on your blog for your most loyal readers only.
Or it might be a subscriber-only page that you don’t want common mortals to get their hands on.
Or a legacy version of a file that you no longer want findable except through private exchange.
Whatever the reason for not wanting a file to be available to the public, you have to remember this common sense rule:
Even though search engines will ignore a page or file stated in your robots.txt file, human users will not.
As long as they’re able to load the robots.txt file in their browser, they can read your blocked URLs, copy and paste them into their browser, and get full access to them.
So when it comes to robots.txt, SEO and common usage aren’t enough. You also have to ensure that human users keep their hands off the confidential material that you’ve entrusted robots.txt to keep out of search engines!
Now the question is: How do you do it?
I’m happy to tell you it only takes three steps:
1. Create a specific folder for your secret files
2. Add index protection to that folder (so nobody browsing it can see its contents)
3. Add a Disallow rule to that folder (not to the files under it because they’ll inherit the rule)
Let’s get to putting that into practice.
Step #1: Create a specific folder for your secret files
First, log in to your website administration panel and open the file manager that comes with it (e.g. File Manager in cPanel). Alternatively, you can use a desktop FTP client such as FileZilla.
This is how I created the folder “/secret-folder/” in my website using cPanel’s File Manager:
Step #2: Add index protection to the folder
Secondly, you have to add protection for the index of that folder.
If you use WordPress, you can protect all folders by default by downloading and activating the Protect Uploads free plugin from the repository.
In all other cases, including if you want to protect only this one folder, you can use one of two methods (continuing with my example above):
A. .htaccess 403 Error Method
Create a new .htaccess file under “/secret-folder/” and add this line to it (the standard Apache directive for switching off directory listings):

Options -Indexes

This directive tells the web server to stop generating a directory listing for that folder—visitors who try to browse it will get a 403 error instead.

If that doesn’t work on your web server, use:

Deny from all

Note that this blocks all direct access to the folder, not just the listing (and on Apache 2.4 and later, the equivalent is “Require all denied”).
B. Index.html File Method
Create an index.html (or default.html) under “/secret-folder/.”
This file should be empty or contain a small string of text to remind users who are browsing that this directory is inaccessible (e.g. “Shoo away. Private stuff here!”).
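As a sketch, that index.html could be as simple as this (the wording is just an example):

```html
<!-- index.html placed inside /secret-folder/ — served instead of a directory listing -->
<!DOCTYPE html>
<html>
  <body>Shoo away. Private stuff here!</body>
</html>
```
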
Step #3: Add a Disallow rule to the folder
As the third and last action, go back to your robots.txt file at the root of your website and Disallow the entire folder.
User-agent: *
Disallow: /secret-folder/
And you’re done!
As you can see, doing robots.txt SEO is not a waste of time on some minor SEO factor.
Your robots.txt file might seem as small and insignificant as the coin I used to “fix” my washing machine, but it can be just as powerful and critical to the good standing of your website in search engines.
So take good care of it!