Robots.txt - Telling the search engines what they can and cannot index

When your site is indexed by the search engines, it is "crawled" by the search engine spiders - GoogleBot, Yahoo Slurp, Bingbot - in order to find all the content on your site, so that other people can find it.

But what if you've got sections of your website that you don't want indexed? The bots dumbly index whatever they can find - they don't know that, for example, those photos on the hidden part of your site are strictly friends and family only, or that there are certain pages in your website that you'd really rather not have popping up in the search engine listings or being archived by that pesky internet archive bot — like your long-expired special offers. In this lesson we look at robots.txt - telling the search engines what they can and cannot index.

What is the robots.txt file?

Robots.txt is a small text file that lives in the root of your website and tells the "robots" visiting your website which pages they can and cannot access. When a well-behaved robot visits your site, the first thing it does is look for the robots.txt file. It follows your requests and won't visit pages that you've disallowed.

How do you make a robots.txt file?

Decide which areas of your website you want the spiders to index, and which ones you don't want them crawling through. And decide if there are any bots you would rather not have crawling through your site.

Open up your plaintext editor of choice, create a new, blank text file and save it as robots.txt, then add whichever of the following sets of rules matches what you decided:

To block all spiders from your entire website:

User-agent: *
Disallow: /

To let all spiders see all content on your site:

User-agent: *
Disallow:

To block certain directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /personal/
Disallow: /photos/staffchristmasparty/
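
If you'd like to check how rules like these are interpreted before uploading them, here is a minimal sketch using Python's standard urllib.robotparser module. It is only one parser's reading of the file, not the code any search engine actually runs; the "ExampleBot" user agent, the www.example.com address and the helper name allowed are placeholders made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The directory-blocking rules from the example above.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /personal/
Disallow: /photos/staffchristmasparty/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` (hypothetical helper)."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


# Each Disallow value is matched as a prefix of the requested path.
print(allowed("/personal/diary.html"))             # inside a blocked directory
print(allowed("/photos/staffchristmasparty/1.jpg"))  # inside a blocked directory
print(allowed("/photos/holiday.jpg"))              # not covered by any rule
```

Because matching is by prefix, blocking /photos/staffchristmasparty/ leaves the rest of /photos/ open to crawling.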

To block a certain spider:

User-agent: Googlebot
Disallow: /

To allow a certain spider, while blocking others:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
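
The way the two groups in this last example interact can be sketched with Python's standard urllib.robotparser module. This is an illustration under assumptions: "SomeOtherBot" and the www.example.com address are invented placeholders, and real crawlers may use their own parsers.

```python
from urllib.robotparser import RobotFileParser

# Googlebot gets its own group; every other robot falls back to the * group.
RULES = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

page = "http://www.example.com/index.html"

# A robot uses the group that names it, and only falls back to * otherwise.
googlebot_ok = parser.can_fetch("Googlebot", page)
other_ok = parser.can_fetch("SomeOtherBot", page)

print("Googlebot allowed:", googlebot_ok)
print("SomeOtherBot allowed:", other_ok)
```

An empty Disallow: line means "nothing is disallowed", which is why the Googlebot group opens the whole site to that spider.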

Tips:

You must use a new line for each instruction.
Blank lines are used to show separate groups of instructions (as in the last example).
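The asterisk in the User-agent line has a special meaning in robots.txt ("all robots"), but the original robots.txt standard doesn't treat it as a wildcard in paths; if you wanted to disallow all GIF images on your website, you can't just write Disallow: *.gif and expect every robot to understand it. Some major crawlers, including Googlebot and Bingbot, do support * and $ pattern matching in paths as an extension (for example, Disallow: /*.gif$), but not every robot does.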
Your file must be called robots.txt, all in lower-case.
Your file must be located in the root directory of your website: www.yoursite.com/robots.txt. That's where the spiders look when they visit your site, and they won't find it if you put it anywhere else.
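
To illustrate the root-directory rule, here is a small sketch in Python of how a robot might work out where to look for robots.txt from any page address on your site. The function name robots_url and the sample address are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt address for the site that hosts `page_url`."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.yoursite.com/blog/2010/post.html?page=2"))
```

Whatever page the robot starts from, the directory part of the address is thrown away, which is why a robots.txt saved in a subfolder is never found.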

Now simply save your file and upload it to your website.

Robots.txt and your XML sitemap

If you've seen our lesson on creating XML sitemaps, you'll know that your robots.txt file is a really handy place to let the search engines know where your sitemap is.

All you have to do is leave a blank line after the last command in your robots.txt file, and then paste this little line:

Sitemap: http://www.example.com/sitemap.xml

If you've got more than one sitemap, you can enter more than one line.

Sitemap: http://www.example.com/sitemap1.xml
Sitemap: http://www.example.com/sitemap2.xml
Sitemap: http://www.example.com/sitemap3.xml
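
As a rough check that your Sitemap lines can be read, here is a sketch using Python's standard urllib.robotparser module (its site_maps() method needs Python 3.8 or later). The /tmp/ rule and the example.com addresses are placeholders, and this shows only how one parser reads the file.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt with one rule group followed by two Sitemap lines.
RULES = """\
User-agent: *
Disallow: /tmp/

Sitemap: http://www.example.com/sitemap1.xml
Sitemap: http://www.example.com/sitemap2.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```

The Sitemap lines are not tied to any User-agent group, so they apply to every robot that reads the file.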

This way you don't need to specifically tell each and every search engine where they can find your sitemap. They'll see it as soon as they look for your robots.txt file, which every polite bot will do when they visit your site anyway.

Things you need to know

Not all spiders honor robots.txt

"Polite" spiders, such as those belonging to the major search engines, won't crawl items you've disallowed in your robots.txt file. However, not all robots are polite (for example, some from smaller search engines, or general data-scraping bots), and those will collect any and all content anyway.
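
To make the difference concrete, here is a minimal sketch of the check a polite crawler performs before fetching anything, written with Python's standard urllib.robotparser module. The function name crawlable, the "ExampleBot" user agent and the example.com addresses are hypothetical; nothing forces a robot to run a check like this.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /personal/
"""


def crawlable(urls, robots_txt, agent):
    """Return only the URLs that robots.txt permits for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


queue = [
    "http://www.example.com/index.html",
    "http://www.example.com/personal/photos.html",
]

# A polite bot trims its queue first; an impolite bot skips this step
# and simply fetches everything in `queue`.
polite_queue = crawlable(queue, ROBOTS_TXT, "ExampleBot")
print(polite_queue)
```

The check happens entirely on the robot's side, which is why robots.txt is a request rather than a lock.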

Your robots.txt is publicly accessible!

Don't try to use your robots.txt file to hide content on your site. Anybody can view the file simply by typing www.yoursite.com/robots.txt into their browser, so anybody can see the things you've said you don't want indexed!
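
To show how little effort this takes, here is a short Python sketch that pulls every disallowed path out of the text of a robots.txt file. The function name disallowed_paths and the sample rules are made up for illustration; a curious visitor could do the same thing by eye.

```python
def disallowed_paths(robots_txt):
    """List every non-empty Disallow value found in a robots.txt text."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


SAMPLE = """\
User-agent: *
Disallow: /personal/
Disallow: /photos/staffchristmasparty/
"""

# Everything you asked robots to stay away from is listed in plain sight.
print(disallowed_paths(SAMPLE))
```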

If there's content on your website that you really, really don't want anybody else seeing, your best bet is to password-protect that directory. There will usually be a tool to help you do this in your hosting control panel (cPanel or similar). Note that password-protecting your content (if done right) will also prevent the impolite bots from accessing it.

Lesson Summary

In this lesson we've looked at robots.txt - what it is, what it's used for, and how to create one. We've looked at certain things you can do with robots.txt including:

Blocking your entire site from indexing
Blocking certain directories
Blocking certain bots
Identifying the location of the sitemap

Questions & Comments

Manuel Fraustro

When I type www.mywebsite.com/robots.txt this is what I got:

User-agent: *
Crawl-delay: 1
Disallow: /wp-content/plugins/
Disallow: /wp-admin/

Is that OK? Or is that preventing my website from appearing in the search engines?

Thanks in advance

Melissa Johnson

Hi, Manuel!

I reached out to one of our support team members, Mike, to get an answer. Your code looks fine, but this is the formatting he uses:

User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content
Disallow: /author
Disallow: /wget/
Disallow: /httpd/
Disallow: /i/
Disallow: /f/
Disallow: /t/
Disallow: /c/
Disallow: /j/
Disallow: /page/
Disallow: /comments/
Sitemap: http://www.yourdomain.com/sitemap.xml

User-agent: Mediapartners-Google
Allow: /

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Image
Allow: /

User-agent: ia_archiver-web.archive.org
Disallow: /

If you need more help with your Robots.txt file, send an email to [email protected] and address it to Mike.

Hope that helps!

Kate Walls

I would suggest you use the following code:

User-agent: *
Disallow: /cgi-bin
Disallow: /wp-admin
Disallow: /wp-includes
Disallow: /wp-content

User-agent: Googlebot
Allow: /

User-agent: Slurp
Allow: /

User-agent: msnbot
Disallow:

This will allow Googlebot, Bingbot and Yahoo's bot to crawl your site.
