The easiest way to create a robots.txt file is to use the Generate robots.txt tool in Webmaster Tools. Once you’ve created the file, you can use the Analyze robots.txt tool to make sure that it’s behaving as you expect.
Once you’ve created your robots.txt file, save it to the root of your domain with the name robots.txt. This is where robots will check for your file. If it’s saved elsewhere, they won’t find it.
You can also create the robots.txt file manually, using any text editor. It should be an ASCII-encoded text file, not an HTML file. The filename should be lowercase.
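Because robots only look for the file at the root of the domain, the location they check can be derived from any page URL. The following is a minimal Python sketch of that lookup; the function name robots_txt_url is hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/folder1/page.html?x=1"))
```

A file saved anywhere other than the printed location, for example under /folder1/, would not be found.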
Syntax
The simplest robots.txt file uses two rules:
User-agent: the robot the following rule applies to
Disallow: the URL you want to block
These two lines are considered a single entry in the file. You can include as many entries as you want. You can include multiple Disallow lines and multiple user-agents in one entry.
What should be listed on the User-agent line?
A user-agent is a specific search engine robot. The Web Robots Database lists many common bots. You can set an entry to apply to a specific bot (by listing the name) or you can set it to apply to all bots (by listing an asterisk). An entry that applies to all bots looks like this:
User-agent: *
Google uses several different bots (user-agents). The bot we use for our web search is Googlebot. Our other bots, such as Googlebot-Mobile and Googlebot-Image, follow the rules you set up for Googlebot, but you can also set up rules specifically for them.
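The fallback described above can be sketched as a small lookup: a bot uses the entry that names it, a Google bot otherwise uses the Googlebot entry, and any bot finally falls back to the asterisk entry. This is an illustrative Python sketch, not Google's implementation; the function select_entry and the "googlebot-" prefix test are assumptions made for the example.

```python
def select_entry(bot_name, entry_names):
    """Return the User-agent name whose entry bot_name would follow, or None."""
    lowered = {name.lower(): name for name in entry_names}
    candidates = [bot_name.lower()]
    # Assumption based on the text: other Google bots follow Googlebot rules
    # when no entry names them specifically.
    if bot_name.lower().startswith("googlebot-"):
        candidates.append("googlebot")
    # The asterisk entry applies to every bot as a last resort.
    candidates.append("*")
    for candidate in candidates:
        if candidate in lowered:
            return lowered[candidate]
    return None


print(select_entry("Googlebot-Image", ["*", "Googlebot"]))
print(select_entry("Googlebot-Image", ["Googlebot", "Googlebot-Image"]))
```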
What should be listed on the Disallow line?
The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).
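To block the entire site, use a forward slash:
Disallow: /
To block a directory and everything in it, follow the directory name with a forward slash:
Disallow: /junk-directory/
To block a single page, list the page:
Disallow: /private_file.html
To block a specific image from Google Images:
User-agent: Googlebot-Image
Disallow: /images/dogs.jpg
To block all images on your site from Google Images:
User-agent: Googlebot-Image
Disallow: /
To block files of a specific type (for example, .gif):
User-agent: Googlebot
Disallow: /*.gif$
To block a directory for all bots except Mediapartners-Google:
User-agent: *
Disallow: /folder1/

User-agent: Mediapartners-Google
Allow: /folder1/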
Note that the URL paths in directives are case-sensitive. For instance, Disallow: /junk_file.asp would block http://www.example.com/junk_file.asp, but would allow http://www.example.com/Junk_file.asp.
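The case-sensitivity rule can be checked locally with Python's standard urllib.robotparser module. This is only a sketch: that parser does plain prefix matching and is not Googlebot, so it illustrates the rule rather than reproducing Google's behavior. The helper build_parser is a hypothetical name.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /junk_file.asp
"""


def build_parser(text):
    """Parse robots.txt content held in a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
lower_ok = parser.can_fetch("Googlebot", "http://www.example.com/junk_file.asp")
upper_ok = parser.can_fetch("Googlebot", "http://www.example.com/Junk_file.asp")
print(lower_ok, upper_ok)  # False True
```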
Pattern matching
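Googlebot (but not all search engines) respects some pattern matching.
To match a sequence of characters, use an asterisk (*). For instance, to block access to all subdirectories that begin with private:
User-agent: Googlebot
Disallow: /private*/
To block access to all URLs that include a question mark (?):
User-agent: Googlebot
Disallow: /*?
To specify matching the end of a URL, use $. For instance, to block any URLs that end with .xls:
User-agent: Googlebot
Disallow: /*.xls$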
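One way to reason about these patterns is to translate them into regular expressions. The Python sketch below assumes that an asterisk matches any run of characters, that a trailing $ anchors the end of the URL, and that a pattern otherwise matches as a prefix of the path plus query string. The helper names pattern_to_regex and matches are hypothetical, and this is not Googlebot's actual matcher.

```python
import re


def pattern_to_regex(pattern):
    """Compile a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    # Escape every literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in core.split("*"))
    # Without a trailing $, the pattern only has to match a prefix.
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """Return True if path (including any query string) matches pattern."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/private*/", "/private-files/report.html"))  # True
print(matches("/*?", "/page?sessionid=1"))                  # True
print(matches("/*.xls$", "/reports/q1.xlsx"))               # False
```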
You can use this pattern matching in combination with the Allow directive. For instance, if a ? indicates a session ID, you may want to exclude all URLs that contain them to ensure Googlebot doesn’t crawl duplicate pages. But URLs that end with a ? may be the version of the page that you do want included. For this situation, you can set your robots.txt file as follows:
User-agent: *
Allow: /*?$
Disallow: /*?
The Disallow: /*? directive will block any URL that includes a ? (more specifically, it will block any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
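The combined effect of the two directives can be sketched in Python. The text does not say how a conflict between Allow and Disallow is resolved, so this sketch assumes the longest matching pattern wins, with Allow winning a tie. For this particular entry, taking the first matching line gives the same results because Allow is listed first. The names compile_pattern, is_allowed, and RULES are hypothetical.

```python
import re


def compile_pattern(pattern):
    """Translate * (any characters) and a trailing $ (end of URL) to a regex."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules, path):
    """rules is a list of (directive, pattern) pairs; directive is 'allow' or 'disallow'."""
    best = None
    for directive, pattern in rules:
        if compile_pattern(pattern).match(path):
            # Assumed precedence: longer pattern first, then Allow over Disallow.
            rank = (len(pattern), directive == "allow")
            if best is None or rank > best[0]:
                best = (rank, directive)
    # A path that no rule matches is allowed.
    return best is None or best[1] == "allow"


RULES = [("allow", "/*?$"), ("disallow", "/*?")]

print(is_allowed(RULES, "/page?"))               # True: ends in ?
print(is_allowed(RULES, "/page?sessionid=123"))  # False: characters follow the ?
print(is_allowed(RULES, "/page"))                # True: no ? at all
```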