6.1.6 Managing site indexing using robots.txt

robots.txt is a text file placed in the root of your website (e.g. https://example.com/robots.txt).

It is used to control which pages and sections of your website can or cannot be indexed by search engines, making it a tool for SEO optimization and for limited protection of data from indexing.

Why do I need robots.txt?
  • Indexing control. You can limit robots’ access to pages that are not intended for public viewing. This helps prevent indexing of service pages, duplicate content, or test sections.

  • Resource optimization. Search robots have a limited time to crawl your website (crawl budget). robots.txt allows you to redirect their attention to more important pages.

  • Pointing to sitemap.xml. The file allows you to direct robots to the sitemap, which speeds up indexing.

Requirements

  • The file must be located in the root directory of the site, for example at the relative path ~/www/example.com/robots.txt or the full path /var/www/exampleuser/data/www/example.com/robots.txt
  • The file must be plain text in UTF-8 encoding.
  • Make sure the file is available at the specified path. You can check this by going to https://example.com/robots.txt.
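
You can verify these requirements yourself before involving search engines. Below is a minimal Python sketch, assuming the placeholder URL https://example.com/robots.txt stands in for your own domain, that checks the file is reachable and valid UTF-8:

from urllib import request, error

ROBOTS_URL = "https://example.com/robots.txt"  # placeholder; use your own domain

try:
    with request.urlopen(ROBOTS_URL, timeout=10) as resp:
        body = resp.read()
        print("HTTP status:", resp.status)                        # expect 200
        print("Content-Type:", resp.headers.get("Content-Type"))  # usually text/plain
        body.decode("utf-8")  # raises UnicodeDecodeError if the file is not UTF-8
        print("robots.txt is reachable and valid UTF-8")
except error.HTTPError as e:
    print("Server returned an error:", e.code)
except error.URLError as e:
    print("Could not reach the site:", e.reason)
except UnicodeDecodeError:
    print("The file is not valid UTF-8")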

Basic directives in robots.txt

Important: robots.txt works on the principle of prohibition (disallow). If the file is empty or does not exist, search engines and other web crawlers interpret this as permission to index all of the site’s content.

The robots.txt file allows you to flexibly control the behavior of search robots on the site using a small list of special rules (directives). Let’s look at an example file with all the main directives and analyze them:

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml

Explanation of the directives used

1. User-agent. Specifies which search robots the rule applies to. The * symbol means that the rule applies to all robots.

Example in robots.txt
User-agent: Googlebot

The rules that follow will apply only to Googlebot.

2. Disallow. Denies access to the specified sections or pages of the site. Here we specify the /admin/ and /private/ folders that should not be indexed.

Example in robots.txt
Disallow: /private-data.html

The directive denies access to a specific file.

3. Allow. Explicitly allows access, even if there are denying rules. In the example, access to the /public/ folder is allowed, despite possible other restrictions.

Example in robots.txt
Allow: /private/images/

This rule allows access to the folder with images at the specified path, even if access to the parent folder is blocked.

4. Sitemap. Specifies the path to the sitemap, which helps robots find important pages faster.

Example in robots.txt
Sitemap: https://example.com/sitemap.xml

The directive specifies the location of the sitemap.
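
To see how these directives are interpreted in practice, you can run the example file through Python’s standard urllib.robotparser module. This is only an illustrative sketch: Python’s parser resolves Allow/Disallow conflicts in order of appearance, which can differ slightly from how individual search engines resolve them.

from urllib import robotparser

SAMPLE = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml
"""

rp = robotparser.RobotFileParser()
rp.parse(SAMPLE.splitlines())

# Paths under Disallow are blocked for any robot covered by "User-agent: *"
print(rp.can_fetch("*", "https://example.com/admin/settings.html"))  # False
print(rp.can_fetch("*", "https://example.com/private/report.html"))  # False

# /public/ is explicitly allowed; anything not mentioned is allowed by default
print(rp.can_fetch("*", "https://example.com/public/index.html"))    # True
print(rp.can_fetch("*", "https://example.com/about.html"))           # True

# The Sitemap directive is exposed separately (Python 3.8+)
print(rp.site_maps())  # ['https://example.com/sitemap.xml']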

Setup examples

As you can see, precise configuration requires knowing which bots (User-Agents) are relevant to your site.

List of relevant User-Agents

Search engines:

  • Googlebot (main Google bot)
  • Googlebot-Image (bot for images)
  • Googlebot-News (bot for news)
  • Googlebot-Video (bot for video)
  • AdsBot-Google (checks pages for Google advertising)
  • Mediapartners-Google (Google AdSense bot)
  • Bingbot
  • AdIdxBot (bot for Bing advertising)
  • Yandex
  • YandexBot
  • YandexImages (bot for images)
  • YandexNews (bot for news)
  • YandexVideo (bot for video)
  • YandexMetrika (user behavior analysis)
  • Baidu
  • Baiduspider
  • Baiduspider-image (bot for images)
  • DuckDuckGo
  • DuckDuckBot
  • Yahoo
  • Slurp
  • Seznam
  • SeznamBot
  • Ecosia
  • ecosia-bot

Social networks:

  • facebookexternalhit
  • Facebot
  • Twitterbot
  • LinkedInBot
  • Pinterestbot
  • InstagramBot

Specialized services:

  • ia_archiver (bot from Internet Archive)
  • Google-AMPHTML (Mobile page acceleration)
  • UptimeRobot
  • PingdomBot

SEO and analysis:

  • AhrefsBot
  • SemrushBot
  • Moz
  • Majestic-12

Screenshots and preview:

  • ScreenshotMachine
  • PagePeeker

Other known user agents:

  • Applebot
  • Amazonbot
  • CloudflareBot
  • Yeti
  • Sogou Spider
  • TelegramBot
  • msnbot

Block all robots on the entire site
User-agent: *
Disallow: /

Prohibit indexing of certain directories
User-agent: *
Disallow: /admin/
Disallow: /private/

Allow for a specific search robot
User-agent: Googlebot
Allow: /

Prohibit for everyone except Googlebot
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /

Exclude individual files
User-agent: *
Disallow: /private-data.html
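
As a quick sanity check of the “prohibit for everyone except Googlebot” configuration, here is a small sketch using Python’s standard urllib.robotparser module; the bot names and URL are only illustrative:

from urllib import robotparser

RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/page.html"))  # True: matches its own group
print(rp.can_fetch("Bingbot", "https://example.com/page.html"))    # False: falls back to the * group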

Practical advice

  • robots.txt does not protect data! Disallow directives only ask robots not to visit certain pages; the pages themselves remain directly accessible to anyone who knows the URL. To protect information, use passwords or server-level access restrictions.

  • Not all robots respect robots.txt. Major search engine crawlers such as Googlebot follow its directives, but malicious or poorly behaved bots may ignore the file entirely.

  • If you have duplicate pages, exclude them from indexing via robots.txt or the noindex meta tag.

  • For large sites (online stores, portals), it is important to distribute the crawl budget properly. Exclude less important pages, such as product filters or pagination pages.

  • If the site has many dynamic URLs, for example URLs with query parameters, configure robots.txt to exclude them, for example:

Disallow: /search/s=*

  • You can check how your robots.txt behaves using the robots.txt testing tool in Google Search Console or various online checkers.
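
In addition to these tools, you can test a live robots.txt locally with the same urllib.robotparser module. A brief sketch follows, with a placeholder domain, bot names, and path; note that this parser does not understand wildcard patterns such as the /search/s=* rule above, so wildcard rules are best verified with the search engines’ own checkers:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder; use your own domain
rp.read()  # download and parse the live file

# Check whether specific robots may crawl a given URL under the current rules
for agent in ("Googlebot", "Bingbot", "AhrefsBot"):
    allowed = rp.can_fetch(agent, "https://example.com/private/report.html")
    print(agent, "allowed" if allowed else "blocked")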