A robots.txt file is an informational document read by search crawlers that specifies which URLs can be crawled on a site.
The robots.txt file is usually hosted at the root level of the site, for example: https://www.abc.com/robots.txt.
Let's see how we can implement this in AEM the easiest way: using the dispatcher.
First, create a robots.txt file. Below is sample content for a robots.txt file, which tells crawlers the URL of the sitemap. A sitemap is a file hosted on your site that tells crawlers which pages on your site are important. It is an XML file that lists URLs to the different pages of your site; you can think of it as a simple bookmark of your entire site structure. Sitemap.xml can be auto-generated (we will look at implementing sitemap.xml in another blog) or created manually; a short illustrative example follows the robots.txt sample below.
# Any search crawler can crawl our site
User-agent: *
# Allow only the paths mentioned below
Allow: /en/
# Disallow everything else
Disallow: /
# Point crawlers to the sitemap(s) listed below
Sitemap: https://[sitename]/sitemap.xml
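For reference, here is a minimal sketch of what that sitemap.xml might look like; the URLs and dates are illustrative placeholders, not part of any real site:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <url> entry lists one important page; the <loc> values below are placeholders -->
  <url>
    <loc>https://www.abc.com/en/home.html</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
  <url>
    <loc>https://www.abc.com/en/products.html</loc>
    <lastmod>2024-01-01</lastmod>
  </url>
</urlset>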
- Robots.txt is hosted in the AEM DAM. This allows authors to update the file without depending on developers.
- The crawler request (http://<site>/robots.txt) is routed through the dispatcher.
- The dispatcher rewrites the request to the publisher DAM location /content/dam/<site>/robots.txt (see the sketch after this list).
- The publisher responds with robots.txt.
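Here is a minimal sketch of what the dispatcher-side rewrite might look like. It assumes an Apache vhost with mod_rewrite enabled and a hypothetical DAM path /content/dam/mysite/robots.txt; adjust the path to match your project:

<IfModule mod_rewrite.c>
    RewriteEngine On
    # Rewrite the crawler-facing /robots.txt to the DAM asset on the publisher
    # (/content/dam/mysite is an assumed project path, not a fixed AEM location)
    RewriteRule ^/robots\.txt$ /content/dam/mysite/robots.txt [PT,L]
</IfModule>

The dispatcher filter rules must also allow the rewritten path through to the publisher. A sketch of such a rule in dispatcher.any (the rule number 0100 is arbitrary) could be:

/0100 { /type "allow" /method "GET" /url "/content/dam/mysite/robots.txt" }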