Saturday, July 31, 2021

Publishing Robots.txt in AEM Cloud

A robots.txt file is an informational document read by search crawlers that tells them which URLs on the site can be crawled.

The robots.txt file is usually hosted at the root of the site, for example: https://www.abc.com/robots.txt.

Let's see the easiest way to implement this in AEM: using the dispatcher.

First, create a robots.txt file. Below is sample content for a robots.txt file, which tells crawlers the URL of the sitemap. A sitemap is a file hosted on your site that tells crawlers which pages of your site are important. It is an XML file listing the URLs of the different pages on your site; you can think of it as a simple index of your site structure. Sitemap.xml can be auto-generated (we will look into implementing sitemap.xml in another blog) or created manually (a minimal hand-written example follows the robots.txt sample below).

# Any search crawler can crawl our site
User-agent: *

# Allow only the paths listed below
Allow: /en/

# Disallow everything else
Disallow: /

# Crawl all sitemaps listed below
Sitemap: https://[sitename]/sitemap.xml
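
For reference, here is a minimal hand-written sitemap.xml following the sitemaps.org protocol; the page URLs are placeholders for your own site structure.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://[sitename]/en/home.html</loc>
    <lastmod>2021-07-31</lastmod>
  </url>
  <url>
    <loc>https://[sitename]/en/products.html</loc>
  </url>
</urlset>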


Before we look into the steps to implement robots.txt, let's understand how the data flows:
  • Robots.txt is hosted in the AEM DAM. This allows authors to update the file without having to depend on developers.
  • A crawler request (https://<site>/robots.txt) is routed through the dispatcher.
  • The dispatcher rewrites the request to the publisher DAM location /content/dam/<site>/robots.txt.
  • The publisher responds with robots.txt.



The dispatcher applies the rewrite rules and forwards the request to the publisher DAM path, and the publisher serves the file back through the dispatcher to the crawler.

Activities in AEM:
Log in to the AEM author environment and upload the file to the DAM. The path I am using is /content/dam/<site>/robots.txt. You can have a separate robots.txt for each site managed in AEM. Publish the file so it gets replicated to the publisher.
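
If you prefer to script the upload instead of using the Assets UI, something along these lines should work against a local AEM SDK author with basic auth (a sketch only; the host, credentials and <site> folder are placeholders, an AEM as a Cloud Service author expects a bearer token instead of basic auth, and the file still has to be published afterwards, e.g. via Quick Publish):

# Create /content/dam/<site>/robots.txt through the Assets HTTP API
curl -u <user>:<password> \
     -X POST \
     -H "Content-Type: text/plain" \
     --data-binary "@robots.txt" \
     "https://<author-host>/api/assets/<site>/robots.txt"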

Dispatcher Changes:
There are two places we need to modify in the dispatcher configuration.

Rewrite Rules: /conf.d/rewrites
In the rewrite rules, add a rule that rewrites requests for /robots.txt to the publisher DAM path.
RewriteRule ^/robots.txt$ /content/dam/<site>/robots.txt [PT,L]
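
For context, in the default AEM as a Cloud Service dispatcher project the site vhost already enables mod_rewrite and includes the rewrite rules file, so the rule above only needs to be added to that file. The wiring looks roughly like this (a sketch of the default layout, with the vhost file name as a placeholder; you normally do not need to change it):

# conf.d/available_vhosts/<site>.vhost (default wiring, sketch)
<IfModule mod_rewrite.c>
    RewriteEngine on
    Include conf.d/rewrites/rewrite.rules
</IfModule>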

Also set the Content-Disposition header to inline (for example in the site vhost under conf.d) - this allows robots.txt to be viewed in the browser. Otherwise robots.txt will be downloaded as an attachment, which will fail with crawlers.

<LocationMatch "^/content/dam/">
    # Serve DAM assets inline so robots.txt renders instead of downloading
    Header unset "Content-Disposition"
    Header set Content-Disposition inline
</LocationMatch>

Filters: /conf.dispatcher.d/filters
In the filters for the site, add a rule that allows the .txt extension for the path.
/1001 { /type "allow" /extension "(txt)" /path "/content/*" }
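
Keep in mind that dispatcher filter rules are evaluated in order and the last matching rule wins, so this allow rule must come after the broad deny rules. In a typical filters.any from the Cloud Service dispatcher project the ordering would look roughly like this (the include and rule numbers are illustrative):

# conf.dispatcher.d/filters/filters.any (illustrative ordering)
$include "./default_filters.any"

# ...existing site-specific allow rules...

# Allow .txt under /content so the rewritten /content/dam/<site>/robots.txt is served
/1001 { /type "allow" /extension "(txt)" /path "/content/*" }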


Deploy the dispatcher configuration to the cloud environment (for AEM as a Cloud Service this is done through the Cloud Manager pipeline).

Now all requests to https://<site>/robots.txt should render the contents of the robots.txt published through the DAM.
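
A quick way to verify from the command line (the host is a placeholder): the response should be a 200, Content-Disposition should be inline, and the body should match the file published from the DAM.

# Headers: expect a 200 status and "Content-Disposition: inline"
curl -s -D - -o /dev/null "https://<site>/robots.txt"

# Body: expect the robots.txt content published from the DAM
curl -s "https://<site>/robots.txt"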



