Understanding caching
<- Previous: Explanation of Configuration Files
This document explains how Dispatcher caching works and how it can be configured.
Caching Directories
We use the following default cache directories in our baseline installations:
- Author
/mnt/var/www/author
- Publisher
/mnt/var/www/html
As each request traverses the Dispatcher, it follows the configured rules to determine whether to keep a locally cached copy of the response for eligible items.
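Conceptually, a request path maps onto a file under the farm's cache directory. Here is a minimal sketch of that mapping (the function name `cache_path` is hypothetical; the real Dispatcher also applies cache rules, authorization checks, and query-string handling before writing anything to disk):

```python
import posixpath

def cache_path(docroot: str, request_path: str) -> str:
    """Map a request URI path to its would-be cache file under docroot."""
    # Normalize to guard against path traversal like /../../etc/passwd
    safe = posixpath.normpath("/" + request_path.lstrip("/"))
    return docroot.rstrip("/") + safe

# A publish-side request for a page maps into /mnt/var/www/html:
print(cache_path("/mnt/var/www/html", "/content/page.html"))
# -> /mnt/var/www/html/content/page.html
```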
Configuration Files
The Dispatcher controls what qualifies as cacheable in the /cache {
section of any farm file.
In the AMS baseline configuration farms, you'll find includes like the one shown below:
/cache {
  /rules {
    $include "/etc/httpd/conf.dispatcher.d/cache/ams_author_cache.any"
  }
}
When creating rules for what should and should not be cached, please refer to the official Dispatcher documentation.
Caching Author
We've seen a lot of implementations where people don't cache author content.
They are missing out on a huge improvement in performance and responsiveness for their authors.
Let's talk about the strategy used to configure our author farm to cache properly.
Here is the base /cache {
section of our author farm file:
/cache {
  /docroot "/mnt/var/www/author"
  /statfileslevel "2"
  /allowAuthorized "1"
  /rules {
    $include "/etc/httpd/conf.dispatcher.d/cache/ams_author_cache.any"
  }
  /invalidate {
    /0000 {
      /glob "*"
      /type "allow"
    }
  }
  /allowedClients {
    /0000 {
      /glob "*.*.*.*"
      /type "deny"
    }
    $include "/etc/httpd/conf.dispatcher.d/cache/ams_author_invalidate_allowed.any"
  }
}
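For reference, the author's .vhost file would point its DocumentRoot at the same directory. A minimal sketch, with a placeholder ServerName (only the DocumentRoot line is taken from the baseline above):

```
<VirtualHost *:80>
  ServerName author.example.com
  # Must match the /docroot in the author farm file
  DocumentRoot /mnt/var/www/author
</VirtualHost>
```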
The important things to note here are that /docroot
is set to the cache directory for author, and that the DocumentRoot
in the author's .vhost
file matches the farm's /docroot
parameter. The cache rules $include statement pulls in the file /etc/httpd/conf.dispatcher.d/cache/ams_author_cache.any,
which contains these rules:
/0000 {
  /glob "*"
  /type "deny"
}
/0001 {
  /glob "/libs/*"
  /type "allow"
}
/0002 {
  /glob "/libs/*.html"
  /type "deny"
}
/0003 {
  /glob "/libs/granite/csrf/token.json"
  /type "deny"
}
/0004 {
  /glob "/apps/*"
  /type "allow"
}
/0005 {
  /glob "/apps/*.html"
  /type "deny"
}
/0006 {
  /glob "/libs/cq/core/content/welcome.*"
  /type "deny"
}
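These ordered glob rules can be sketched as follows. The assumption here is that, as in the Dispatcher, the last rule whose glob matches the request path wins; Python's `fnmatch` is used as a stand-in for the Dispatcher's glob matcher, so edge cases may differ:

```python
from fnmatch import fnmatch

# (glob, type) pairs in the same order as the .any file above
RULES = [
    ("*", "deny"),
    ("/libs/*", "allow"),
    ("/libs/*.html", "deny"),
    ("/libs/granite/csrf/token.json", "deny"),
    ("/apps/*", "allow"),
    ("/apps/*.html", "deny"),
    ("/libs/cq/core/content/welcome.*", "deny"),
]

def cacheable(path: str) -> bool:
    decision = "deny"
    for glob, rule_type in RULES:
        if fnmatch(path, glob):
            decision = rule_type  # last matching rule wins
    return decision == "allow"

print(cacheable("/libs/granite/ui/foo.js"))  # clientlib code under /libs: True
print(cacheable("/libs/foo.html"))           # rendered HTML under /libs: False
print(cacheable("/content/en/page.html"))    # regular author content: False
```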
In an author scenario, content is changing all the time and on purpose. You only want to cache items that are not going to change frequently.
We have rules to cache /libs
because those files are part of the baseline AEM install and won't change until you install a Service Pack, Cumulative Fix Pack, Upgrade, or Hotfix. Caching these elements makes a lot of sense and brings real benefits to the authoring experience for end users of the site.
/apps
is where custom application code lives. If you're developing your code on this instance, it will prove very confusing when you save a file and don't see it reflected in the UI because a cached copy is being served. The intention here is that deployments of your code into AEM are infrequent, and part of your deployment steps should be to clear the author cache. Again, the benefit is huge: your cacheable code runs faster for end users.
ServeOnStale (AKA Serve on Stale / SOS)
This is one of those gems of a feature in the Dispatcher. If the publisher is under load or has become unresponsive, it will typically return a 502 or 503 HTTP response code. When that happens and this feature is enabled, the Dispatcher will serve whatever content is still in the cache as a best effort, even if it's not a fresh copy. It's better to serve something you have than to show an error message that offers no functionality.
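The fallback decision can be sketched as follows (the function and return values are hypothetical illustrations, not the Dispatcher's internals):

```python
def respond(upstream_status, cached_copy, serve_stale_on_error):
    """Decide what to send the client when the publisher may be failing."""
    if upstream_status < 500:
        return ("upstream", upstream_status)  # healthy: serve the fresh render
    if serve_stale_on_error and cached_copy is not None:
        return ("stale-cache", 200)           # best effort: an old copy beats an error
    return ("error-page", upstream_status)    # nothing cached: surface the error

print(respond(503, "<html>old page</html>", True))  # ('stale-cache', 200)
print(respond(503, None, True))                     # ('error-page', 503)
```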
This setting can be set in any farm but only makes sense to apply it on the publish farm files. Here is a syntax example of the feature enabled in a farm file:
/cache {
  /serveStaleOnError "1"
}
Caching pages with Query params / Arguments
By default, when the Dispatcher sees a request with query parameters (e.g. /content/page.html?myquery=value), it will skip caching the file and go directly to the AEM instance. It considers such a request a dynamic page that shouldn't be cached. This can hurt cache efficiency, which shows how much query parameters can affect your site performance.
By default you want to set the ignoreUrlParams
rules to allow *,
meaning that all query parameters are ignored and all pages are cached regardless of the parameters used.
Here is an example where someone has built a social-media deep-link reference mechanism that uses a query parameter in the URI to know where the visitor came from.
Ignorable Example:
- https://www.we-retail.com/home.html?reference=android
- https://www.we-retail.com/home.html?reference=facebook
The page is 100% cacheable but doesn't get cached because the arguments are present.
Configuring your ignoreUrlParams
as an allow list will fix this issue:
/cache {
  /ignoreUrlParams {
    /0001 { /glob "*" /type "allow" }
  }
}
Now when the Dispatcher sees such a request, it will ignore the fact that the request has the query parameter ?reference and still cache the page.
Dynamic Example:
- https://www.we-retail.com/search.html?q=fruit
- https://www.we-retail.com/search.html?q=vegetables
Keep in mind that if you do have query parameters that change a page's rendered output, you'll need to exempt them from your ignored list and make the page un-cacheable again. For example, a search page whose query parameter changes the raw HTML rendered.
So here is the html source of each search:
/search.html?q=fruit:
<html>
...SNIP...
<div id='results'>
<div class='result'>
Orange
</div>
<div class='result'>
Apple
</div>
<div class='result'>
Strawberry
</div>
</div>
</html>
/search.html?q=vegetables:
<html>
...SNIP...
<div id='results'>
<div class='result'>
Carrot
</div>
<div class='result'>
Cucumber
</div>
<div class='result'>
Celery
</div>
</div>
</html>
If you visited /search.html?q=fruit
first, it would cache the HTML with results showing fruit.
If you then visit /search.html?q=vegetables,
it would still show the fruit results.
This is because the query parameter q
is being ignored for caching purposes. To avoid this issue, you'll need to take note of pages that render different HTML based on query parameters and deny caching for those.
Example:
/cache {
  /ignoreUrlParams {
    /0001 { /glob "*" /type "allow" }
    /0002 { /glob "q" /type "deny" }
  }
}
Pages that consume query parameters via JavaScript will still fully function with those parameters ignored in this setting, because they don't change the HTML file at rest; they use JavaScript to update the browser's DOM in real time on the client. This means that if you consume a query parameter with JavaScript, it's highly likely you can ignore that parameter for page caching. Allow that page to cache and enjoy the performance gain!
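The per-parameter allow/deny evaluation can be sketched as follows. The assumption is that a parameter matched by an "allow" rule is ignored for caching, last matching rule wins, and a request is cacheable only if every parameter is ignored; `fnmatch` stands in for the Dispatcher's glob matcher:

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit, parse_qsl

# Mirrors the example above: /0001 allow "*", /0002 deny "q"
IGNORE_RULES = [("*", "allow"), ("q", "deny")]

def param_ignored(name):
    verdict = "deny"
    for glob, rule_type in IGNORE_RULES:
        if fnmatch(name, glob):
            verdict = rule_type  # last matching rule wins
    return verdict == "allow"

def request_cacheable(url):
    params = parse_qsl(urlsplit(url).query)
    return all(param_ignored(name) for name, _ in params)

print(request_cacheable("https://www.we-retail.com/home.html?reference=android"))  # True
print(request_cacheable("https://www.we-retail.com/search.html?q=fruit"))          # False
```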
Caching response headers
It's pretty obvious that the Dispatcher caches .html
pages and clientlibs (i.e. .js
and .css
files), but did you know it can also cache particular response headers alongside the content, in a file with the same name but a .h
file extension? This allows the next response to include not only the content but also the response headers that should go with it, straight from cache.
AEM can handle more than just UTF-8 encoding.
Sometimes items have special headers that help control cache TTLs, encoding details, and last-modified timestamps.
By default these values are stripped when the item is cached, and the Apache httpd web server will do its own job of processing the asset with its normal file-handling methods, which is usually limited to MIME-type guessing based on file extensions.
If you have the Dispatcher cache the asset along with the desired headers, you can preserve the proper experience and ensure all the details make it to the client's browser.
Here is an example of a farm with the headers to cache specified:
/cache {
/headers {
"Cache-Control"
"Content-Disposition"
"Content-Type"
"Expires"
"Last-Modified"
"X-Content-Type-Options"
}
}
In this example, AEM has been configured to serve headers that the CDN looks for to know when to invalidate its cache, meaning AEM can properly dictate which files get invalidated based on headers.
Auto-Invalidate Grace Period
On AEM systems with a lot of author activity and frequent page activations, you can hit a race condition where repeated invalidations occur. Heavily repeated flush requests are unnecessary, and you can build in some tolerance so that a flush isn't repeated until the grace period has passed.
Example of how this works:
Suppose 5 requests to invalidate /content/exampleco/en/
all happen within a 3-second period.
- With this feature off, you'd invalidate the cache directory /content/exampleco/en/
5 times
- With this feature on and set to 5 seconds, it would invalidate the cache directory /content/exampleco/en/
once
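The debounce behavior described above can be sketched as follows (the function name `should_flush` is hypothetical; the real Dispatcher tracks this internally per invalidation):

```python
def should_flush(last_flush_time, now, grace_period=5):
    """Return True if enough time has passed since the last flush."""
    return last_flush_time is None or (now - last_flush_time) >= grace_period

# Five invalidation requests arriving within 3 seconds:
last = None
flushes = 0
for t in [0.0, 0.5, 1.0, 2.0, 3.0]:
    if should_flush(last, t):
        flushes += 1
        last = t
print(flushes)  # 1 -- only the first request actually flushes
```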
Here is an example of this feature configured with a 5-second grace period:
/cache {
  /gracePeriod "5"
}
TTL Based Invalidation
A newer feature of the Dispatcher module is Time To Live (TTL)
based invalidation for cached items. When an item gets cached, the Dispatcher looks for the presence of cache-control headers and generates a file in the cache directory with the same name and a .ttl
extension.
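With TTL invalidation on, a cached item's lifetime is derived from response cache headers such as Cache-Control's max-age directive. A minimal sketch of that derivation (the function name is hypothetical, and the real Dispatcher also considers other headers such as Expires):

```python
import re

def ttl_expiry(cache_control_header, cached_at):
    """Compute the expiry timestamp recorded alongside a cached item."""
    match = re.search(r"max-age=(\d+)", cache_control_header)
    if match is None:
        return None  # no TTL hint in the response
    return cached_at + int(match.group(1))

expiry = ttl_expiry("max-age=300, public", cached_at=1000)
print(expiry)  # 1300 -- item considered stale 300 seconds after caching
```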
Here is an example of the feature being configured in the farm configuration file:
/cache {
  /enableTTL "1"
}
Cache Filter Rules
Here is an example of a baseline configuration for which elements to cache on a publisher:
/cache {
  /rules {
    /0000 {
      /glob "*"
      /type "allow"
    }
    /0001 {
      /glob "/libs/granite/csrf/token.json"
      /type "deny"
    }
  }
}
We want to make our published site as greedy as possible and cache everything.
If there are elements that break the experience when cached, you can add rules to remove the option to cache those items. As you can see in the example above, CSRF tokens shouldn't ever be cached and have been excluded. Further details on writing these rules can be found in the official Dispatcher documentation.
Next -> Using and Understanding Variables