Saturday, July 31, 2021

Publishing Robots.txt in AEM Cloud

A robots.txt file is an informational document read by search crawlers that tells them which URLs on the site can be crawled.

The robots.txt file is usually hosted at the root of the site, for example: https://www.abc.com/robots.txt.

Let's see the easiest way to implement this in AEM: using the dispatcher.

First, create a robots.txt file. Below is sample content for a robots.txt file, which tells the crawler the URL of the sitemap. A sitemap is a file hosted on your site, in XML format, that lists the important pages so crawlers can find them. You can think of it as a simple bookmark of your entire site structure. Sitemap.xml can be auto-generated (we will look into implementing sitemap.xml in another blog) or created manually.

#Any search crawler can crawl our site
User-agent: *

#Allow only below mentioned paths
Allow: /en/

#Disallow everything else
Disallow: /

#Crawl all sitemaps mentioned below
Sitemap: https://[sitename]/sitemap.xml
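Before publishing, you can sanity-check that the Allow/Disallow rules behave as intended by running them through Python's standard-library robots.txt parser (the host name below is just a placeholder):

```python
from urllib import robotparser

# Parse the same rules as in the sample robots.txt above.
rules = [
    "User-agent: *",
    "Allow: /en/",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Paths under /en/ are allowed; everything else is blocked.
print(rp.can_fetch("*", "https://www.abc.com/en/home.html"))  # True
print(rp.can_fetch("*", "https://www.abc.com/fr/home.html"))  # False
```

This is a quick local check only; crawlers may interpret edge cases slightly differently than Python's parser.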


Before we look at the steps to implement robots.txt, let's understand how the data flows.
  • Robots.txt is hosted in the AEM DAM. This allows authors to update the file without having to depend on developers.
  • A crawler request (http://<site>/robots.txt) is routed through the dispatcher
  • The dispatcher rewrites the request to the publisher DAM location /content/dam/<site>/robots.txt
  • The publisher responds back with robots.txt




Activities in AEM:
Log in to the AEM author environment and upload the file to the DAM. The path I am using is /content/dam/<site>/robots.txt. You can have a separate robots.txt for each site managed in AEM. Publish the file so it gets replicated to the publisher.

Dispatcher Changes:
There are two places we need to modify in the dispatcher.

Rewrite Rules: /conf.d/rewrites
In the rewrite rules, add a rule to map requests for /robots.txt to the publisher DAM path.
RewriteRule ^/robots.txt$ /content/dam/<site>/robots.txt [PT,L]

Also set the Content-Disposition header to inline - this allows robots.txt to be viewed in the browser. Otherwise robots.txt will be downloaded as a file, which fails with crawlers.

<LocationMatch "^/content/dam/.*">
    Header unset "Content-Disposition"
    Header set Content-Disposition inline
</LocationMatch>

Filters: /conf.dispatcher.d/filters
In the filters file for the site, add a rule to allow the .txt extension for the path:
/1001 { /type "allow" /extension "txt" /path "/content/*" }


Deploy the dispatcher configurations to cloud.

Now all requests to https://<site>/robots.txt should render the contents of the robots.txt published through the DAM.




Sitemap.xml in AEM Cloud

An XML sitemap is a file that lists a website's important pages, making sure Google and other search engines can find and crawl them all. It also helps search engines understand your website structure. Though search engines crawl all pages and the links within them, sites sometimes end up with a few pages that are not linked from any other page (like promo landing pages). Listing those pages in sitemap.xml speeds up content discovery and indexing.

This blog walks through the implementation of sitemap.xml using the ACS AEM Commons package. ACS AEM Commons is an open-source package, originally built and supported by Adobe Consulting Services and later turned into an open-source, AEM community-maintained package.

You can get more details on this and the other features provided by the package in the ACS AEM Commons documentation.

Let's walk through the steps to implement sitemap.xml using this package.

  1. Create an OSGi configuration for the sitemap servlet.
    • Create a file com.adobe.acs.commons.wcm.impl.SiteMapServlet-custom.cfg.json under ui.config/config. This setting will be common for all run modes.
    • Edit the file with the necessary parameters:
    • sling.servlet.resourceTypes - this should have the path to the page component in your project
    • externalizer.domain - this should have the domain name or the site name
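As a sketch, the SiteMapServlet configuration file might look like this (the resource type and the externalizer domain key below are placeholders you would replace with your project's values):

```
{
  "sling.servlet.resourceTypes": "myproject/components/structure/page",
  "externalizer.domain": "publish-mysite"
}
```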

  2. Create an OSGi configuration for the Externalizer implementation.
    • Create a file com.day.cq.commons.impl.ExternalizerImpl.cfg.json.
    • This file should be created for each run mode (dev, qa, stage, prod), as the values change per environment.
    • Entries map the various sites/domains to their URLs - local, author, and publish are provided by default by AEM.
    • Add the new site and its URL.
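A sketch of a per-environment Externalizer configuration (the "publish-mysite" key and its URL are placeholder values for your own site; the first three entries mirror AEM's defaults):

```
{
  "externalizer.domains": [
    "local http://localhost:4502",
    "author http://localhost:4502",
    "publish http://localhost:4503",
    "publish-mysite https://www.mysite.com"
  ]
}
```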

Build and deploy the changes to the AEM instance. Now a request to any page with the .sitemap.xml extension should render the sitemap for that page.

Additionally, you can add a dispatcher rewrite rule for /sitemap.xml that maps to the site home page with the extension ".sitemap.xml" - e.g. /content/<site>/us/en.sitemap.xml, where en is the root home page of the site. This will generate a sitemap with links to all pages in the site, starting from the root page and drilling down to all child pages.
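Following the same pattern as the robots.txt rule earlier, such a rewrite in /conf.d/rewrites might look like this (the content path is a placeholder for your site's root page):

```
RewriteRule ^/sitemap.xml$ /content/<site>/us/en.sitemap.xml [PT,L]
```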