Robots.txt

Robots.txt is a text file placed at the root of a website (at the address domain.com/robots.txt) that tells crawlers, such as Googlebot, which parts of the site they are allowed or not allowed to access. It relies on the Robots Exclusion Protocol and uses simple directives: User-agent specifies the targeted crawler, Disallow forbids the crawling of a path, and Allow permits it. Robots.txt controls crawling, meaning access to content, but does not guarantee de-indexing: a blocked URL can still appear in search results if it receives links. It is an essential tool for managing crawl budget, preventing crawlers from wasting their resources on pages with no SEO value, such as admin pages, shopping carts, or faceted filters. A misconfigured robots.txt can unintentionally block strategic pages and seriously harm a site's visibility in search.

Robots.txt is one of the first files a crawler checks when it visits a site. Although tiny, it has a direct impact on how your content is discovered and explored by search engines and, increasingly, by AI crawlers.

How it works

The file follows the Robots Exclusion Protocol. It groups directives into blocks, each targeting one or more crawlers via the User-agent directive. The Disallow and Allow directives then define forbidden or permitted paths. Here is a simple example:

User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /

Sitemap: https://domain.com/sitemap.xml

The Sitemap line points to the XML sitemap, helping crawlers discover all the important URLs. Well-behaved crawlers, such as Googlebot, read this file before exploring the site.
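As a rough illustration of how a compliant crawler might interpret the example above, the following sketch feeds the same rules to Python's standard urllib.robotparser module. The two test URLs are hypothetical, and this parser applies rules in file order, which may differ from how a given search engine resolves conflicting rules.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /

Sitemap: https://domain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs, chosen only to exercise each rule.
admin_ok = parser.can_fetch("Googlebot", "https://domain.com/admin/settings")
blog_ok = parser.can_fetch("Googlebot", "https://domain.com/blog/post")
sitemaps = parser.site_maps()  # available in Python 3.8+

print(admin_ok, blog_ok, sitemaps)
```

A real crawler would download the file first (RobotFileParser also offers set_url and read for that); parsing an inline string keeps the example self-contained.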

Why it matters

The main benefit of robots.txt is controlling your crawl budget. On a large site, preventing crawlers from wasting resources on low-value pages (filters, URL parameters, private areas) lets them focus on strategic content. Conversely, a single overly broad rule (a stray Disallow: / for instance) can block entire sections from being crawled and severely reduce a site's visibility in search results.
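Filter and parameter URLs are usually targeted with wildcard patterns. The sketch below is a simplified take on the matching rules described in RFC 9309: * matches any run of characters, a trailing $ anchors the end of the URL, the longest matching pattern wins, and Allow wins a tie. The rule set, the paths, and the helper names pattern_to_regex and is_allowed are hypothetical; the sketch compares pattern lengths in characters and ignores percent-encoding, so treat it as an aid to understanding rather than a reference implementation.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each '*' match any characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """rules: list of (directive, pattern), directive being 'allow' or 'disallow'."""
    best = None  # (pattern length, is_allow); tuple order makes Allow win ties
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the path may be crawled.
    return True if best is None else best[1]


# Hypothetical rules for a shop with faceted navigation.
RULES = [
    ("disallow", "/*?color="),
    ("disallow", "/cart/"),
    ("allow", "/"),
]

print(is_allowed(RULES, "/shoes"))            # plain category page
print(is_allowed(RULES, "/shoes?color=red"))  # faceted filter URL
print(is_allowed(RULES, "/cart/checkout"))    # cart page
```

Under these assumptions the category page stays crawlable while the filter and cart URLs are excluded, which is the crawl-budget effect described above.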

Key takeaway

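Robots.txt controls access, not presence in the index. To de-index a page, use the noindex tag and leave the page crawlable; a Disallow does not remove the page from the index, and it prevents crawlers from ever seeing the noindex tag.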

Robots.txt and generative AI

In 2026, robots.txt plays a new role: it lets you allow or block the user-agent tokens associated with AI products (GPTBot, ClaudeBot, PerplexityBot, Google-Extended). Blocking these tokens signals that you do not want your content crawled or used for purposes such as model training, although compliance is voluntary, as with any robots.txt rule. It can also reduce your chances of being cited in AI assistant answers. This strategic trade-off is now an integral part of a modern visibility approach, to be coordinated with your llms.txt file and your overall strategy.
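One possible configuration, sketched here with Python's standard urllib.robotparser, blocks two of the AI agents named above while leaving other crawlers unaffected. The choice of which agents to block and the /blog/ test URL are assumptions made for illustration; the exact token each vendor honors should be checked against that vendor's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: refuse two AI crawlers, keep normal rules for everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

article = "https://domain.com/blog/"  # hypothetical URL
for agent in ("GPTBot", "ClaudeBot", "Googlebot"):
    print(agent, parser.can_fetch(agent, article))
```

Grouping several User-agent lines above a shared set of rules keeps the file short; a crawler that matches a named group follows that group and ignores the * block.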

FAQ

Frequently asked questions

Does robots.txt prevent a page from being indexed?

No. Robots.txt blocks crawling, not indexing. A blocked URL can still appear in search results if other pages link to it. To prevent indexing, use the meta robots noindex tag on a page that remains accessible to crawlers.
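The interaction between the two mechanisms can be checked with a short script. This is a minimal sketch using only the Python standard library; the class RobotsMetaFinder, the function noindex_is_visible, the sample rules, and the URLs are all hypothetical names and values chosen for the example, and it looks only at the meta tag, not at HTTP headers.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether the page carries a meta robots tag containing noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        is_robots = attr.get("name", "").lower() == "robots"
        if is_robots and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def noindex_is_visible(robots_txt, url, html, agent="Googlebot"):
    """True only if the page has a noindex tag AND the crawler may fetch it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex and rules.can_fetch(agent, url)


# Hypothetical inputs: one blocked section, one page carrying noindex.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
PAGE = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'

print(noindex_is_visible(ROBOTS_TXT, "https://domain.com/private/page", PAGE))
print(noindex_is_visible(ROBOTS_TXT, "https://domain.com/old-page", PAGE))
```

The first call reports that the tag cannot be seen because the URL is disallowed; the second reports that it can, which is the setup needed for de-indexing.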

Where should the robots.txt file be placed?

The file must be placed at the root of the domain, accessible at domain.com/robots.txt. Placed anywhere else, it will be ignored by crawlers. Each subdomain requires its own robots.txt file.
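To make the location rule concrete, here is a small sketch that derives the robots.txt address governing any given page, using Python's urllib.parse. The function name robots_txt_url and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://domain.com/blog/post?ref=home"))
print(robots_txt_url("https://shop.domain.com/cart/item"))
```

The two calls return different addresses because the subdomain is a separate host, which is why each subdomain needs its own file.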

Go further

Related terms & resources

Crawl budget: The resources crawlers allocate to exploring a site, partly optimized through robots.txt.

Crawl: The process by which crawlers explore pages, directly governed by robots.txt directives.

Technical SEO service: The audit and optimization of your site's technical infrastructure, including robots.txt configuration.
