As part of the services we offer, I am oftentimes asked to do SEO or site health audits. For me, this is like a barista making the next caramel macchiato. Marketers and business owners, however, often do not fully use the basic, free services available to walk Google through their site and help their brand get found for the products and services they offer. This blog should help clear up the confusion and give you the tools to make sure you are satisfying Google and other search engines.
One of the first things I do is take a quick look at the Google index to see how many pages of the site are listed and how they appear in the Google SERP (Search Engine Results Page). This example came from a recent conversation with an acquaintance over a beer. He was complaining about his marketing agency and said to me, “I am just not getting the results I expect for what I am paying them.”
As soon as I evaluated his site, I immediately knew why he was faced with an uphill battle. After reviewing the robots.txt file I noticed that the entire site was “disallowed”, or blocked from search engine bots. This is not a good scenario for anyone looking for organic web traffic.
Let’s dig into this important file and how to avoid this mess for your brand.
The robots.txt file is a simple text file placed on your web server that tells web crawlers like Googlebot whether or not they should access a file.
Why should you learn about robots.txt?
- Improper usage of the robots.txt file can hurt your ranking
- The robots.txt file controls how search engine spiders see and interact with your webpages
- This file is mentioned in several of the Google guidelines
- This file, and the bots it interacts with, are fundamental parts of how search engines work
Tip: To see if your robots.txt is blocking any important files used by Google, use the Google guidelines tool.
Search engine spiders
The first thing a search engine spider like Googlebot looks at when it visits a site is the robots.txt file.
It does this because it wants to know if it has permission to access a page or file. If the robots.txt file says it can enter, the search engine spider then continues on to the page files.
If you have instructions for a search engine robot, you must give it those instructions. The way you do so is the robots.txt file.
Basic robots.txt examples
Here are some common robots.txt setups (we will explain in more detail later).
Allow full access
Block all access
Block one folder
Block one file
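The four setups named above can be sketched and checked with Python's standard-library `urllib.robotparser`. This is a minimal sketch, not a definitive reference: the folder name `/folder/`, the file name `/file.html`, the `example.com` URLs and the helper `is_allowed` are all hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Four common setups. The folder and file names are hypothetical.
SETUPS = {
    "allow_full_access": "User-agent: *\nDisallow:",
    "block_all_access": "User-agent: *\nDisallow: /",
    "block_one_folder": "User-agent: *\nDisallow: /folder/",
    "block_one_file": "User-agent: *\nDisallow: /file.html",
}


def is_allowed(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may crawl `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Show how each setup treats a page inside the hypothetical folder.
for name, text in SETUPS.items():
    print(name, is_allowed(text, "https://example.com/folder/page.html"))
```

Each value in `SETUPS` is the complete text you would save as robots.txt for that setup.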
Do You Need a Robots.txt?
I guess that is as easy to determine as asking yourself this simple question: “When someone goes to a search engine and searches for my brand or the services I offer, would I like them to find my site?”
There are three important things that any marketer should do when it comes to the robots.txt file.
- Determine if you have a robots.txt file
- If you have one, make sure it is not harming your ranking or blocking content you don't want blocked
- Determine if you need a robots.txt file
1) Determine if you have a robots.txt
The robots.txt file is always located in the same place on any website, so it is easy to determine if a site has one. Just add "/robots.txt" to the end of a domain name as shown below.
If you have a file there, it is your robots.txt file. You will either find a file with words in it, find a file with no words in it, or not find a file at all.
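As a rough sketch of that check in Python: the helper names `robots_url` and `fetch_robots` are hypothetical, `example.com` is a placeholder domain, and defaulting to HTTPS is an assumption. `fetch_robots` maps onto the three outcomes above: text with rules, an empty string, or `None` when no file exists.

```python
from typing import Optional
from urllib.error import HTTPError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(site: str) -> str:
    """Build the robots.txt URL for a site given as a bare domain or a full URL."""
    if "://" not in site:
        site = "https://" + site  # assumption: default to HTTPS
    parts = urlsplit(site)
    # robots.txt always lives at the root of the host, whatever page we started from.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(site: str) -> Optional[str]:
    """Return the file's text ('' if it is empty), or None if the server answers 404."""
    try:
        with urlopen(robots_url(site), timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        if err.code == 404:
            return None
        raise


print(robots_url("example.com"))
```

Calling `fetch_robots("example.com")` performs a real network request, so it is left out of the snippet's top-level code.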
2) Determine if your robots.txt is blocking important files
You can use the Google guidelines tool, which will warn you if you are blocking certain page resources that Google needs to understand your pages.
If you have access and permission, you can use Google Search Console (formerly Google Webmaster Tools) to test your robots.txt file. Instructions to do so are found here (the tool is not public and requires a Search Console login).
To be sure your robots.txt file is not blocking anything you want crawled, you will need to understand what it is saying. We cover that below.
3) Determine if you need a robots.txt file
You may not even need to have a robots.txt file on your site. In fact, it is often the case that you do not need one.
Reasons you may want to have a robots.txt file:
- You have content you want blocked from search engines
- You are using paid links or advertisements that need special instructions for robots
- You want to fine tune access to your site from reputable robots
- You are developing a site that is live, but you do not want search engines to index it yet
- It helps you follow Google guidelines in certain situations
- You need some or all of the above, but do not have full access to your webserver and how it is configured
Reasons you may not want to have a robots.txt file:
- Going without one is simple and error free
- You do not have any files you want or need to be blocked from search engines
- You do not find yourself in any of the situations listed in the above reasons to have a robots.txt file
It is okay to not have a robots.txt file.
When you do not have a robots.txt file the search engine robots like Googlebot will have full access to your site. This is a normal and simple method that is very common.
How to make a robots.txt file
Making a robots.txt file is easy and only takes a minute or so.
The file is just a text file, which means that you can use notepad or any other plain text editor to make one. You can also make them in a code editor. You can even "copy and paste" them.
Instead of thinking "I am making a robots.txt file", just think "I am writing a note". They are pretty much the same process.
What should the robots.txt say?
That depends on what you want it to do.
All robots.txt instructions result in one of the following three outcomes:
- Full allow: All content may be crawled.
- Full disallow: No content may be crawled.
- Conditional allow: The directives in the robots.txt determine the ability to crawl certain content.
Let's explain each one.
Full allow - all content may be crawled
Most people want robots to visit everything in their website. If this is the case with you, and you want the robot to index all parts of your site, there are three options to let the robots know that they are welcome.
1) Do not have a robots.txt file
If your website does not have a robots.txt file then this is what happens...
A robot like Googlebot comes to visit. It looks for the robots.txt file. It does not find it because it isn't there. The robot then feels free to visit all your web pages and content because this is what it is programmed to do in this situation.
2) Make an empty file and call it robots.txt
If your website has a robots.txt file that has nothing in it then this is what happens...
A robot like Googlebot comes to visit. It looks for the robots.txt file. It finds the file and reads it. There is nothing to read, so the robot then feels free to visit all your web pages and content because this is what it is programmed to do in this situation.
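The empty-file case can be illustrated with Python's `urllib.robotparser`. This is only a sketch of how one well-known parser behaves, not a statement about Googlebot's internals, and the URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([])  # an empty robots.txt: no lines, so no rules

# With no rules to apply, every URL is treated as crawlable.
print(parser.can_fetch("Googlebot", "https://example.com/any/page.html"))
```

The same parser also treats a missing file as full access: when its `read()` method gets a 404 response, it allows everything.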
3) Make a file called robots.txt and write the following two lines in it...
User-agent: *
Disallow:
If your website has a robots.txt with these instructions in it then this is what happens...
A robot like Googlebot comes to visit. It looks for the robots.txt file. It finds the file and reads it. It reads the first line. Then it reads the second line. The robot then feels free to visit all your web pages and content because this is what you told it to do (I explain this below).
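A minimal sketch of option 3, assuming the two lines are the conventional allow-all pair (`User-agent: *` followed by an empty `Disallow:`). The constant name `ALLOW_ALL` and the URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed allow-all file: applies to every robot, and disallows nothing.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ALLOW_ALL.splitlines())

# An empty Disallow value blocks nothing, so this URL may be crawled.
print(parser.can_fetch("Googlebot", "https://example.com/products/"))
```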
Full disallow - no content may be crawled
Warning: This means that Google and other search engines will not index or display your webpages.
To block all reputable search engine spiders from your site, you would have these instructions in your robots.txt:
User-agent: *
Disallow: /
It is not recommended to do this as it will result in none of your web pages being indexed.
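To see how sweeping a full disallow is, here is a short sketch using Python's `urllib.robotparser`. The constant name `BLOCK_ALL` and the URLs are hypothetical. A single `/` after `Disallow:` matches every path on the site, including the home page.

```python
from urllib.robotparser import RobotFileParser

# Full disallow: applies to every robot, and "/" is a prefix of every path.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BLOCK_ALL.splitlines())

for url in ("https://example.com/", "https://example.com/services/seo.html"):
    print(url, parser.can_fetch("Googlebot", url))
```

This one-character difference from the allow-all file is exactly the kind of mistake that blocked my acquaintance's entire site.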
The robots.txt instructions and their meanings
Here is an explanation of what the different words mean in a robots.txt file.
The "User-agent" part is there to specify directions to a specific robot if needed. There are two ways to use this in your file.
If you want to tell all robots the same thing, you put a "*" after "User-agent:". It would look like this...
User-agent: *
The above line is saying "these directions apply to all robots".
If you want to tell a specific robot something (in this example Googlebot), it would look like this...
User-agent: Googlebot
The above line is saying "these directions apply to just Googlebot".
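Both forms of "User-agent" can appear in the same file, each heading its own group of directions. The sketch below assumes a hypothetical `/drafts/` folder that only Googlebot is told to skip. The folder name, the `ROBOTS_TXT` constant and the URL are illustrative, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/drafts/post.html"
print("Googlebot:", parser.can_fetch("Googlebot", url))  # follows its own group
print("OtherBot:", parser.can_fetch("OtherBot", url))    # falls back to the * group
```

A robot that finds a group naming it follows that group. Every other robot uses the `*` group.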
The "Disallow" part is there to tell the robots what folders they should not look at. This means that if, for example, you do not want search engines to index the photos on your site, you can place those photos into one folder and exclude it.
Let's say that you have put all these photos into a folder called "photos". Now you want to tell search engines not to index that folder.
Here is what your robots.txt file should look like in that scenario:
User-agent: *
Disallow: /photos
The above two lines of text in your robots.txt file would keep robots from visiting your photos folder. The "User-agent: *" part is saying "this applies to all robots". The "Disallow: /photos" part is saying "don't visit or index my photos folder".
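The photos example can be checked with Python's `urllib.robotparser`, as a sketch. The file names and the `/photos-archive/` folder are hypothetical. Note that a "Disallow" value is matched as a path prefix, so `/photos` without a trailing slash also covers any path that merely begins with those characters.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /photos
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for path in ("/photos/beach.jpg", "/photos-archive/old.jpg", "/about.html"):
    print(path, parser.can_fetch("Googlebot", "https://example.com" + path))
```

Writing `Disallow: /photos/` with a trailing slash would limit the rule to the folder itself and leave `/photos-archive/` crawlable.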
Googlebot specific instructions
The robot that Google uses to build its search index is called Googlebot. It understands a few more instructions than other robots.
In addition to "User-agent" and "Disallow", Googlebot also uses the "Allow" instruction.
The "Allow:" instruction lets you tell a robot that it is okay to see a file in a folder that has been "Disallowed" by other instructions. To illustrate this, let's take the above example of telling the robot not to visit or index your photos. We put all the photos into one folder called "photos" and we made a robots.txt file that looked like this...
User-agent: *
Disallow: /photos
Now let's say there was a photo called myphoto.jpg in that folder that you want Googlebot to index. With the Allow: instruction, we can tell Googlebot to do so. It would look like this...
User-agent: *
Disallow: /photos
Allow: /photos/myphoto.jpg
This would tell Googlebot that it can visit "myphoto.jpg" in the "photos" folder, even though the "photos" folder is otherwise excluded.
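Here is a sketch of the "Allow" exception, again with Python's `urllib.robotparser`. One caveat shapes the example: Google documents that it applies the most specific (longest) matching rule, so the order of the lines does not matter to Googlebot. Python's parser applies the first rule that matches, so the "Allow" line is placed first here to get the same result from both. The file `other.jpg` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Allow is listed first so that first-match parsers (like Python's) and
# longest-match parsers (like Googlebot) agree on the outcome.
ROBOTS_TXT = """\
User-agent: *
Allow: /photos/myphoto.jpg
Disallow: /photos
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for path in ("/photos/myphoto.jpg", "/photos/other.jpg"):
    print(path, parser.can_fetch("Googlebot", "https://example.com" + path))
```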
Testing your robots.txt file
To find out if an individual page is blocked by robots.txt you can use this technical SEO tool which will tell you if files important to Google are being blocked and also display the content of the robots.txt file.
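If you prefer a quick local check, the idea can be sketched in a few lines of Python. The helper `blocked_urls`, the sample file and the URLs are hypothetical. Treat the result as a rough first pass: Python's `urllib.robotparser` does plain prefix matching, does not implement the `*` and `$` path wildcards that Google supports, and resolves rule conflicts by first match.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: List[str], agent: str = "Googlebot") -> List[str]:
    """Return the subset of `urls` that `agent` may not crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Hypothetical file and pages: the CSS folder is blocked, everything else is open.
sample = "User-agent: *\nDisallow: /css/"
pages = [
    "https://example.com/",
    "https://example.com/css/site.css",
    "https://example.com/services.html",
]
print(blocked_urls(sample, pages))
```

Feed it the text of your own robots.txt and a list of the pages and resources you care about. Anything it returns deserves a closer look.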
A robots.txt file is a very important part of a solid SEO foundation. Not having one is OK and lets search engines right through your front door. But a bad or malicious "disallow" could ruin all of your organic search efforts, so it is important to understand how these files work.
In the end, remember these four critical points:
- If you use a robots.txt file, make sure it is being used properly
- An incorrect robots.txt file can block Googlebot from indexing your page
- Ensure you are not blocking pages that Google needs to rank your pages
- A "Disallow" in your robots.txt will not protect your site the way password-protected pages do. If you have critical sections or content that should not be seen by anyone other than verified users, use another service to protect that data.