6.1.6 Managing site indexing using robots.txt
robots.txt is a text file placed in the root of your website (e.g. https://example.com/robots.txt). It is used to control which pages and sections of your website can or cannot be indexed by search engines, making it a tool for SEO and for limited protection of data from indexing.
Why do I need robots.txt?
- Indexing control. You can limit robots’ access to pages that are not intended for general viewing. This helps prevent indexing of service pages, duplicate content, or test sections.
- Resource optimization. Search robots have a limited time to crawl your website (the crawl budget); robots.txt lets you redirect their attention to the more important pages.
- Pointing to sitemap.xml. The file can direct robots to the sitemap, which speeds up indexing.
Requirements
- The file must be located in the root of the site, for example at the relative path ~/www/example.com/robots.txt or at the full path /var/www/exampleuser/data/www/example.com/robots.txt.
- The file must be plain text in UTF-8 encoding.
- Make sure the file is available at the specified path. You can check this by opening https://example.com/robots.txt in a browser.
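If you prefer to check availability from a script rather than a browser, here is a minimal sketch in Python using only the standard library (example.com is a placeholder for your own domain):

import urllib.error
import urllib.request

# example.com is a placeholder; substitute your own domain.
url = "https://example.com/robots.txt"

try:
    with urllib.request.urlopen(url, timeout=10) as response:
        print("Status:", response.status)  # 200 means the file is reachable
        print("Content-Type:", response.headers.get("Content-Type"))  # expect a text type
        print(response.read().decode("utf-8"))  # the file contents
except urllib.error.HTTPError as err:
    print("robots.txt is not available, HTTP status:", err.code)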
Basic directives in robots.txt
Important: robots.txt works on the principle of prohibition (Disallow). If the file is empty or does not exist, search engines and other web crawlers interpret this as permission to index all of the site’s content.
The robots.txt file allows you to flexibly control the behavior of search robots on the site using a small list of special rules (directives). Let’s look at an example file with all the main directives and analyze them:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml
Explanation of the directives used
1. User-agent. Specifies which search robots the rule applies to. The * symbol means that the rule applies to all robots.
User-agent: Googlebot
With this value, the rules apply only to Googlebot.
2. Disallow. Denies access to the specified sections or pages of the site. Here we specify the /admin/ and /private/ folders that should not be indexed.
Disallow: /private-data.html
The directive denies access to a specific file.
3. Allow. Explicitly allows access, even if there are denying rules. In the example, access to the /public/ folder is allowed, despite possible other restrictions.
Allow: /private/images/
This rule allows access to the folder with images at the specified path, even if access to the parent folder is blocked.
4. Sitemap. Specifies the path to the sitemap, which helps robots find important pages faster.
Sitemap: https://example.com/sitemap.xml
The directive specifies the location of the sitemap.
Setup examples
As you can see, precise configuration requires knowing which bots (User-agents) you want to target.
List of relevant User-Agents
Search engines:
- Googlebot (main Google bot)
- Googlebot-Image (bot for images)
- Googlebot-News (bot for news)
- Googlebot-Video (bot for video)
- AdsBot-Google (checks pages for Google advertising)
- Mediapartners-Google (Google AdSense bot)
- Bingbot
- AdIdxBot (bot for Bing advertising)
- Yandex:
  - YandexBot
  - YandexImages (bot for images)
  - YandexNews (bot for news)
  - YandexVideo (bot for video)
  - YandexMetrika (user behavior analysis)
- Baidu:
  - Baiduspider
  - Baiduspider-image (bot for images)
- DuckDuckGo:
  - DuckDuckBot
- Yahoo:
  - Slurp
- Seznam:
  - SeznamBot
- Ecosia:
  - ecosia-bot
Social networks:
- facebookexternalhit
- Facebot
- Twitterbot
- LinkedInBot
- Pinterestbot
- InstagramBot
Specialized services:
- ia_archiver (bot from Internet Archive)
- Google-AMPHTML (Mobile page acceleration)
- UptimeRobot
- PingdomBot
SEO and analysis:
- AhrefsBot
- SemrushBot
- Moz
- Majestic-12
Screenshots and preview:
- ScreenshotMachine
- PagePeeker
Other known user agents:
- Applebot
- Amazonbot
- CloudflareBot
- Yeti
- Sogou Spider
- TelegramBot
- msnbot
Block all robots on the entire site
User-agent: *
Disallow: /
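Allow all robots full access
The opposite of the previous example: an empty Disallow value (like an empty or missing robots.txt file) permits indexing of the entire site.
User-agent: *
Disallow: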
Prohibit indexing of certain directories
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow for a specific search robot
User-agent: Googlebot
Allow: /
Prohibit for everyone except Googlebot
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
Exclude individual files
User-agent: *
Disallow: /private-data.html
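Block individual SEO crawlers
If, for example, third-party SEO crawlers from the list above (such as AhrefsBot and SemrushBot) create unwanted load, they can be addressed by name:
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /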
Practical advice
- robots.txt does not protect data! Disallow directives only ask robots not to visit certain pages; the pages themselves remain directly accessible. To protect information, use passwords or server-side access restrictions.
- Some robots ignore robots.txt. Well-behaved search engine crawlers such as Googlebot respect its directives, but malicious or lesser-known bots may not.
- If you have duplicate pages, close them from indexing via robots.txt or the noindex meta tag.
- For large sites (online stores, portals), it is important to distribute the crawl budget properly. Close less important pages, such as product filters or pagination pages.
- If the site has a lot of dynamic URLs, for example with parameters, configure robots.txt to exclude them, for example:
Disallow: /search/s=*
- You can check how robots.txt works using the robots.txt testing tool in Google Search Console or various online checkers; a small script for such a check is also sketched below.
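In addition to these tools, you can verify your rules locally with a short script. Below is a minimal sketch using Python’s standard urllib.robotparser module; example.com and the test paths are placeholders based on the examples above:

from urllib.robotparser import RobotFileParser

# example.com and the paths below are placeholders based on the examples above.
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # download and parse the live robots.txt

for path in ["/public/page.html", "/admin/settings.html", "/private-data.html"]:
    url = "https://example.com" + path
    allowed = parser.can_fetch("*", url)  # "*" checks the rules that apply to all robots
    print(path, "->", "allowed" if allowed else "disallowed")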