The robots.txt file, placed in the root directory of a website, tells search engine crawlers which URLs they may and may not crawl. By keeping crawlers away from certain parts of a website, we gain more control over which content is visible in search results, which can improve the site's SEO score.

Understanding the Purpose of robots.txt

The primary purpose of the robots.txt file is to keep web crawlers away from specific parts of a website that are not meant to appear in search results. By excluding such routes, we can also prevent crawlers from including irrelevant or repetitive content in search engine results. This selective approach ensures that only the valuable content is crawled, thereby boosting the overall SEO score.

For example, we may want to prevent search engines from crawling personalized pages, admin pages, or content that is still under development. By using the robots.txt file, we can pass these instructions to web crawlers.

However, it's important to note that while the robots.txt file can keep certain pages from being crawled, it does not guarantee that search engines won't find them. The file is advisory: well-behaved crawlers follow its instructions, but it cannot actively block access to a page.

Importance of robots.txt

Configuring the robots.txt file correctly can have several benefits. Let's explore some of the key advantages:

1. Enhancing Crawl Budget Efficiency

A properly configured robots.txt file helps conserve the crawl budget, which is the number of pages a search engine bot is willing to crawl on a website during a given time period. By conserving the crawl budget, we can ensure that crawlers spend it on the most relevant and valuable content, improving overall crawl efficiency.

2. Preventing Duplicate Content Issues

Duplicate content can harm a website's search engine rankings. By disallowing search engine bots from crawling repetitive or similar content, we can prevent confusion and maintain the quality and credibility of the content.

3. Securing Sensitive Information

Website security and user privacy are crucial, especially for sites with user accounts or confidential information. The robots.txt file lets us keep sensitive or private sections of a website out of crawls by disallowing search engine bots from visiting them. It is not an access control mechanism, though: the file itself is publicly readable, and URLs that are disallowed in robots.txt may still be indexed in some situations (for example, when other pages link to them), even if they haven't been crawled. Truly confidential pages still need authentication.

4. Providing a Clear Sitemap Reference

Another feature of robots.txt is referencing a website's XML sitemap. The XML sitemap helps search engine bots discover and follow the website's structure, leading to a more efficient and thorough crawling process. By including a reference to the sitemap in the robots.txt file, we can ensure that search engine bots can easily find and navigate the sitemap.

5. Directing Crawler Behavior for Multilingual Websites

For websites with multilingual content, a robots.txt file can help steer search engine bots toward the language or regional versions that should be crawled, for example by disallowing redundant or auto-generated variants. This can support geo-targeting and relevance in search results, ultimately enhancing the overall user experience.

Syntax Used in robots.txt File

1. User-agent

The "User-agent" directive identifies the specific bot or crawler to which the rules that follow apply. For example, User-agent: Googlebot would target Google's web crawler. To target all crawlers, the rule can be specified like this: User-agent: *.
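To illustrate how a crawler might decide which set of rules applies to it, here is a minimal sketch. The RobotsGroup shape and the selectGroup function are hypothetical names made up for this example, and the logic is simplified: it assumes a group that names the crawler takes priority over the catch-all "*" group and that names are compared case-insensitively.

```typescript
// Hypothetical shape for one parsed robots.txt group (not from any library).
interface RobotsGroup {
  userAgents: string[]; // values of the User-agent lines in this group
  disallow: string[];
  allow: string[];
}

// Pick the group for a crawler: a group that names the crawler wins,
// otherwise fall back to the catch-all "*" group (if there is one).
function selectGroup(groups: RobotsGroup[], crawlerName: string): RobotsGroup | undefined {
  const name = crawlerName.toLowerCase();
  const specific = groups.find((group) =>
    group.userAgents.some((ua) => ua.toLowerCase() === name),
  );
  if (specific) return specific;
  return groups.find((group) => group.userAgents.includes('*'));
}

const groups: RobotsGroup[] = [
  { userAgents: ['*'], disallow: ['/admin/'], allow: [] },
  { userAgents: ['Googlebot'], disallow: ['/drafts/'], allow: [] },
];

console.log(selectGroup(groups, 'googlebot')?.disallow); // rules of the Googlebot group
console.log(selectGroup(groups, 'Bingbot')?.disallow); // rules of the "*" group
```

Real crawlers may handle edge cases differently, for instance by merging several groups that name the same crawler.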

2. Disallow

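The "Disallow" directive tells bots not to crawl specific pages or sections of a website. For example, Disallow: /settings/ would block crawlers from accessing the routes in the "settings" folder. A path must start with the "/" character, and if it refers to a folder, it should end with "/" as well. Rules match by prefix, so without the trailing slash, Disallow: /settings would also match other paths that begin with the same characters, such as /settings.html.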

3. Allow

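The "Allow" directive grants bots permission to crawl specific pages or sections of a website, even if they fall under a broader Disallow rule. For example, Allow: /settings/public-page.html would allow bots to access the "public-page.html" file even though the "settings" folder is disallowed. When both an Allow and a Disallow rule match a URL, crawlers that follow the standard apply the most specific (longest) matching rule, regardless of the order in which the rules appear.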
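The interplay between Allow and Disallow can be sketched in a few lines. This is a simplified illustration, not any crawler's actual implementation: the Rule type and the isAllowed function are hypothetical, only plain prefix matching is handled (no wildcards), and it assumes the longest matching path wins, with Allow winning a tie.

```typescript
// Hypothetical rule shape for this sketch.
type Rule = { type: 'allow' | 'disallow'; path: string };

// Longest matching path wins; on equal length, allow wins.
// A URL that matches no rule is allowed.
function isAllowed(urlPath: string, rules: Rule[]): boolean {
  let best: Rule | undefined;
  for (const rule of rules) {
    // An empty path matches nothing; otherwise match by prefix.
    if (rule.path === '' || !urlPath.startsWith(rule.path)) continue;
    const longer = !best || rule.path.length > best.path.length;
    const tieGoesToAllow =
      best !== undefined && rule.path.length === best.path.length && rule.type === 'allow';
    if (longer || tieGoesToAllow) best = rule;
  }
  return best ? best.type === 'allow' : true;
}

const rules: Rule[] = [
  { type: 'disallow', path: '/settings/' },
  { type: 'allow', path: '/settings/public-page.html' },
];

console.log(isAllowed('/settings/public-page.html', rules)); // true
console.log(isAllowed('/settings/profile', rules)); // false
console.log(isAllowed('/blog/post', rules)); // true
```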

4. Sitemap

The "Sitemap" directive provides the location of a website's XML sitemap, helping search engine bots find pages more efficiently. Including the sitemap in the robots.txt file is considered one of the best practices for SEO. For example, Sitemap: https://www.example.com/sitemap.xml directs crawlers to the website's sitemap file. The sitemap does not even have to be on the same host as the robots.txt file. It is also possible to reference multiple XML sitemaps in robots.txt. As an example, this may be useful if a site has one static sitemap and a dynamic one.

5. Crawl-delay

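The "Crawl-delay" directive asks crawlers to wait between requests to avoid overloading the server. For example, Crawl-delay: 10 would request that bots wait 10 seconds between requests to the website. Note that this directive is not part of the robots.txt standard and support varies: some crawlers, such as Bingbot, honor it, while Googlebot ignores it.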

Example of robots.txt

User-agent: *
Allow: /

Disallow: /_next/*.js$

Disallow: /login
Disallow: /signup
Disallow: /forgot-password

Disallow: /admin/
Allow: /admin/exception

Disallow: /api/admin/mutations/

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/dynamic-sitemap.xml
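The rule Disallow: /_next/*.js$ in the example uses two special characters: "*" matches any sequence of characters and a trailing "$" anchors the match to the end of the URL. As a rough sketch of how such a pattern could be evaluated, the hypothetical patternToRegExp helper below translates a robots.txt path pattern into a regular expression. It is an illustration under those assumptions, not the code any particular crawler runs.

```typescript
// Translate a robots.txt path pattern into a RegExp:
// "*" becomes "any characters", a trailing "$" anchors the end of the URL.
function patternToRegExp(pattern: string): RegExp {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const source = body
    .split('*')
    // Escape characters that have a special meaning in regular expressions.
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  // Patterns always match from the start of the path.
  return new RegExp('^' + source + (anchored ? '$' : ''));
}

const nextJsFiles = patternToRegExp('/_next/*.js$');
console.log(nextJsFiles.test('/_next/static/chunks/main.js')); // true
console.log(nextJsFiles.test('/_next/static/chunks/main.js.map')); // false
console.log(patternToRegExp('/admin/').test('/admin/users')); // true
```

Without the "$", the first pattern would also match the .js.map file, because rules match by prefix.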

Next.js and robots.txt

In Next.js, we can add a static robots.txt file to the root of our app directory as usual. Since version 13.3, Next.js can also generate the robots.txt file dynamically: we export a function from a robots.ts file that returns a Robots object. This approach is particularly useful for generating different rules based on certain conditions (e.g. environment variables). Let's take a look at an example of generating a Robots object in Next.js:

import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
    // ENVIRONMENT is an application-defined variable, e.g. set in a .env file
    if (process.env.ENVIRONMENT === 'development') {
        // Block every path for all crawlers in development
        return {
            rules: {
                userAgent: '*',
                disallow: '/',
            },
        };
    }

    return {
        rules: [
            {
                userAgent: '*',
                allow: '/',
                disallow: '/private/',
                crawlDelay: 5,
            },
        ],
        sitemap: ['https://example.com/sitemap.xml'],
    };
}

In the example above, we define a robots function. In the development environment, we disallow every path for all crawlers. For other environments, we return a Robots object whose rules property (a single rule object or an array of them) holds the crawling directives. Additionally, we specify the location of the sitemap.xml file using the sitemap property, which accepts a single URL or an array for multiple sitemaps.

Next.js then automatically generates the robots.txt file from the function we provided and serves it at /robots.txt.
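To make the mapping from the returned object to the text file more concrete, here is a small sketch that serializes a rules object into robots.txt lines. The RuleSketch and RobotsSketch types and the renderRobots function are simplified stand-ins invented for this illustration. They are not the Next.js types or its internal implementation, and the exact formatting that Next.js emits may differ.

```typescript
// Simplified, hypothetical stand-ins for the shapes used in the example above.
interface RuleSketch {
  userAgent: string;
  allow?: string;
  disallow?: string;
  crawlDelay?: number;
}

interface RobotsSketch {
  rules: RuleSketch[];
  sitemap?: string[];
}

// Turn the object into robots.txt text: one group per rule,
// a blank line after each group, then the sitemap references.
function renderRobots(robots: RobotsSketch): string {
  const lines: string[] = [];
  for (const rule of robots.rules) {
    lines.push(`User-agent: ${rule.userAgent}`);
    if (rule.allow !== undefined) lines.push(`Allow: ${rule.allow}`);
    if (rule.disallow !== undefined) lines.push(`Disallow: ${rule.disallow}`);
    if (rule.crawlDelay !== undefined) lines.push(`Crawl-delay: ${rule.crawlDelay}`);
    lines.push('');
  }
  for (const url of robots.sitemap ?? []) lines.push(`Sitemap: ${url}`);
  return lines.join('\n');
}

console.log(
  renderRobots({
    rules: [{ userAgent: '*', allow: '/', disallow: '/private/', crawlDelay: 5 }],
    sitemap: ['https://example.com/sitemap.xml'],
  }),
);
```

Running the sketch on the non-development rules from the example prints a User-agent group with its Allow, Disallow and Crawl-delay lines, followed by the Sitemap line.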

Conclusion

Configuring the robots.txt file is one of the best practices for website management and SEO. The robots.txt file provides instructions to search engine bots, guiding their crawl process.

The robots.txt file helps to keep private sections out of crawls, enhance crawl efficiency, prevent duplicate content issues, provide a clear sitemap reference, and direct crawler behavior for multilingual or multiregional websites, all of which contribute to the overall SEO score. Next.js adds flexibility by letting us generate the robots.txt file based on certain conditions.

Remember to regularly update the robots.txt file to keep up with the changing needs of the website. With proper configuration and regular maintenance, the robots.txt file can be a powerful tool to optimize search engine rankings.

Configuring robots.txt for Better Indexation and SEO Score