Saturday, July 31, 2021

Publishing Robots.txt in AEM Cloud

A robots.txt file is an informational document read by search crawlers that tells them which URLs on the site can be crawled.

The robots.txt file is usually hosted at the root of the site, for example: https://www.abc.com/robots.txt.

Let's see the easiest way to implement this in AEM: using the dispatcher.

First, create a robots.txt file. Below is sample content for a robots.txt file, which tells crawlers the URL of the sitemap. A sitemap is a file hosted on your site that tells crawlers which pages of your site are important. It is an XML file listing the URLs of the different pages on your site; you can think of it as a simple index of your site structure. Sitemap.xml can be auto-generated (we will look into implementing sitemap.xml in another blog) or created manually (a minimal hand-written example follows the robots.txt sample below).

# Any search crawler can crawl our site
User-agent: *

# Allow only the paths listed below
Allow: /en/

# Disallow everything else
Disallow: /

# Crawl all sitemaps listed below
Sitemap: https://[sitename]/sitemap.xml
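
For reference, here is a minimal hand-written sitemap.xml following the sitemaps.org protocol; the page URLs are placeholders for your own site structure.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://[sitename]/en/home.html</loc>
    <lastmod>2021-07-31</lastmod>
  </url>
  <url>
    <loc>https://[sitename]/en/products.html</loc>
  </url>
</urlset>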


Before we look into the steps to implement robots.txt, let's understand how the data flows:
  • Robots.txt is hosted in the AEM DAM. This allows authors to update the file without having to depend on developers.
  • A crawler request (https://<site>/robots.txt) is routed through the dispatcher.
  • The dispatcher rewrites the request to the publisher DAM location /content/dam/<site>/robots.txt.
  • The publisher responds with robots.txt.



The dispatcher applies the rewrite rules and forwards the request to the publisher DAM path, and the publisher serves the file back through the dispatcher to the crawler.

Activities in AEM:
Log in to the AEM author environment and upload the file to the DAM. The path I am using is /content/dam/<site>/robots.txt. You can have a separate robots.txt for each site managed in AEM. Publish the file so it gets replicated to the publisher.
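
If you prefer to script the upload instead of using the Assets UI, something along these lines should work against a local AEM SDK author with basic auth (a sketch only; the host, credentials and <site> folder are placeholders, an AEM as a Cloud Service author expects a bearer token instead of basic auth, and the file still has to be published afterwards, e.g. via Quick Publish):

# Create /content/dam/<site>/robots.txt through the Assets HTTP API
curl -u <user>:<password> \
     -X POST \
     -H "Content-Type: text/plain" \
     --data-binary "@robots.txt" \
     "https://<author-host>/api/assets/<site>/robots.txt"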

Dispatcher Changes:
There are two places we need to modify in the dispatcher configuration.

Rewrite Rules: /conf.d/rewrites
In the rewrite rules, add a rule that rewrites requests for /robots.txt to the publisher DAM path.
RewriteRule ^/robots.txt$ /content/dam/<site>/robots.txt [PT,L]
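
For context, in the default AEM as a Cloud Service dispatcher project the site vhost already enables mod_rewrite and includes the rewrite rules file, so the rule above only needs to be added to that file. The wiring looks roughly like this (a sketch of the default layout, with the vhost file name as a placeholder; you normally do not need to change it):

# conf.d/available_vhosts/<site>.vhost (default wiring, sketch)
<IfModule mod_rewrite.c>
    RewriteEngine on
    Include conf.d/rewrites/rewrite.rules
</IfModule>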

Also set the Content-Disposition header to inline (for example in the site vhost under conf.d) - this allows robots.txt to be viewed in the browser. Otherwise robots.txt will be downloaded as an attachment, which will fail with crawlers.

<LocationMatch "^/content/dam/">
    # Serve DAM assets inline so robots.txt renders instead of downloading
    Header unset "Content-Disposition"
    Header set Content-Disposition inline
</LocationMatch>

Filters: /conf.dispatcher.d/filters
In the filters for the site, add a rule that allows the .txt extension for the path.
/1001 { /type "allow" /extension "(txt)" /path "/content/*" }
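
Keep in mind that dispatcher filter rules are evaluated in order and the last matching rule wins, so this allow rule must come after the broad deny rules. In a typical filters.any from the Cloud Service dispatcher project the ordering would look roughly like this (the include and rule numbers are illustrative):

# conf.dispatcher.d/filters/filters.any (illustrative ordering)
$include "./default_filters.any"

# ...existing site-specific allow rules...

# Allow .txt under /content so the rewritten /content/dam/<site>/robots.txt is served
/1001 { /type "allow" /extension "(txt)" /path "/content/*" }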


Deploy the dispatcher configuration to the cloud environment (for AEM as a Cloud Service this is done through the Cloud Manager pipeline).

Now all requests to https://<site>/robots.txt should render the contents of the robots.txt published through the DAM.
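
A quick way to verify from the command line (the host is a placeholder): the response should be a 200, Content-Disposition should be inline, and the body should match the file published from the DAM.

# Headers: expect a 200 status and "Content-Disposition: inline"
curl -s -D - -o /dev/null "https://<site>/robots.txt"

# Body: expect the robots.txt content published from the DAM
curl -s "https://<site>/robots.txt"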



