Thursday, May 10, 2012

How we use Amazon CloudFront for dynamically generated content

As you might know from the previous blog entry, we store the integration flows created by our users in GitHub Gists created on the user’s behalf. This gives our users full ownership of their data and gives us great storage, collaboration and versioning from GitHub.

Our integration flows are stored as JSON and look like this:

https://gist.github.com/2519313

GitHub does a great job of pretty-printing JSON, and our JSON is rather simple; however, we wanted to make it easy to transform that JSON into a visual form, like this:

We were already using JavaScript and Canvas to render flows in the browser, so we decided to use the same JavaScript with node-canvas to render PNGs on the server side. How we did it is a topic for a separate blog post, but one thing was clear from the beginning: server-side PNG generation is a costly and resource-intensive task, so we need to take extreme care when doing it at web scale.

One obvious way to reduce the load on our servers was caching. We could have installed a custom caching solution, but we decided to go with AWS CloudFront, specifically with its custom origin option.

What is CloudFront custom origin?

When CloudFront was introduced, it did a great job at content distribution. As stated on the CloudFront page:

Amazon CloudFront delivers your static and streaming content using a global network of edge locations. Requests for your objects are automatically routed to the nearest edge location, so content is delivered with the best possible performance…

In the beginning you could only feed CloudFront with your Amazon S3 content. You put your files on Amazon S3, and CloudFront takes them from there, caches them at the edge locations and delivers them from there to your customers.

That worked great, but having S3 as the only source was definitely a limitation. In November 2011 CloudFront introduced a new ‘custom origin’ feature. With a custom origin you can feed CloudFront with any content, even dynamically generated content, right from your web server.

That was exactly what we were looking for to cache our dynamically generated images.

How to use CloudFront with custom origin

Configuring CloudFront is very simple via the AWS Management Console. Just create a new CloudFront distribution and enter the URL of your source web server as the custom origin.

After some initialization time your CloudFront distribution can be used to serve your files.

You also need to check that the resources served by your web server carry the HTTP headers that control how long CloudFront will cache them. There are well-documented requirements for the resources CloudFront will cache. For our use case the most interesting HTTP headers are Cache-Control, Date, Last-Modified and ETag.

Amazon CloudFront uses the Cache-Control header to decide how long it should cache a given resource. The HTTP spec also defines an Expires header, however Cache-Control is preferred by CloudFront.

Apart from that, we need to serve the Date and Last-Modified headers, which are needed to compute the expiry time from the Cache-Control header, and we are also going to need the ETag header (more about it later).

This is what a response from our origin server looks like (you can use curl -I http://webserver.com to see the HTTP response headers):
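
As a rough illustration of the four headers just described, here is a minimal sketch of how an origin could assemble them for a generated resource. It is not our production code: the function name buildCacheHeaders and its parameters are hypothetical, and using a SHA-1 digest of the body as the ETag is only one possible choice. Note that the HTTP spec defines entity tags as quoted strings, so the sketch wraps the digest in double quotes.

```javascript
const crypto = require('crypto');

// Build the caching headers for a generated resource.
// body: the rendered content (Buffer or string)
// lastModified: Date of the last change to the underlying data
// maxAgeSeconds: how long caches such as CloudFront may keep the resource
// now: current time, injectable so the function is easy to test
// (all of these names are assumptions made for this example)
function buildCacheHeaders(body, lastModified, maxAgeSeconds, now = new Date()) {
  // Assumption: a hex SHA-1 digest of the body is used as the ETag value.
  const digest = crypto.createHash('sha1').update(body).digest('hex');
  return {
    'Cache-Control': `max-age=${maxAgeSeconds}, public`,
    // toUTCString() produces the HTTP date format, e.g. "Thu, 10 May 2012 07:43:35 GMT"
    'Date': now.toUTCString(),
    'ETag': `"${digest}"`,
    'Last-Modified': lastModified.toUTCString(),
  };
}
```

In an Express handler the resulting object could be passed to res.set(...) before sending the image.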

HTTP/1.1 200 OK
Cache-Control: max-age=86400, public
Date: Thu, 10 May 2012 07:43:35 GMT
Etag: b784a8d162cd0b45fcb6d8933e8640b457392b46
Last-Modified: Tue, 08 May 2012 16:46:33 GMT
X-Powered-By: Express
Connection: keep-alive

According to the headers above, this resource will be cached by CloudFront for 86400 seconds, which equals one day. When we request the same URL via CloudFront, we see the following headers. The first time:

HTTP/1.0 200 OK
Cache-Control: max-age=86400, public
Date: Thu, 10 May 2012 07:43:51 GMT
ETag: b784a8d162cd0b45fcb6d8933e8640b457392b46
Last-Modified: Tue, 08 May 2012 16:46:33 GMT
X-Powered-By: Express
X-Cache: Miss from cloudfront
X-Amz-Cf-Id: u3aMd_juLM3t9sEfQLOhB75OmzO2EB6LK6n4HdhkkqwtpfexWYVlbg==,mDcVYMPxwm3fW_D7-qd93Dy8nfOrw5AsAamRSWpEu9rR7t9ZQPzOaQ==,18IhmhwGxno7iHVSWHbnL8tb0TNJwfpp6KcdznEkD0fVc2fIWt9r3w==
Via: 1.0 28edd995979e84232ebdb595b33d9deb.cloudfront.net (CloudFront)
Connection: close

Note the ‘X-Cache: Miss from cloudfront’ header. And the second time:

HTTP/1.0 200 OK
Cache-Control: max-age=86400, public
Date: Thu, 10 May 2012 07:43:51 GMT
ETag: b784a8d162cd0b45fcb6d8933e8640b457392b46
Last-Modified: Tue, 08 May 2012 16:46:33 GMT
X-Powered-By: Express
Age: 7
Content-Length: 0
X-Cache: Hit from cloudfront
X-Amz-Cf-Id: V_da8LHRj269JyqkEO143FLpm8kS7xRh4Wa5acB6xa0Qz3rW3P7-Uw==,iFg6qa2KnhUTQ_xRjuhgUIhj8ubAiBrCs6TXJ_L66YJR583xXWAy-Q==
Via: 1.0 d2625240b33e8b85b3cbea9bb40abb10.cloudfront.net (CloudFront)
Connection: close

Note the ‘X-Cache: Hit from cloudfront’ header - that means that our resource is now served directly from the CloudFront cache.
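
If you want to check this programmatically rather than by eye, the following sketch reads the same headers and reports whether the response was a cache hit and roughly how long the cached copy stays fresh. The function name describeCacheStatus is hypothetical, the header object is assumed to use lower-cased names (as Node's http module provides them), and "max-age minus Age" is a simplification of the full HTTP freshness rules.

```javascript
// Summarise the caching-related headers of a response.
// headers: plain object with lower-cased header names (an assumption of this sketch)
function describeCacheStatus(headers) {
  const xCache = headers['x-cache'] || '';
  const cacheControl = headers['cache-control'] || '';

  // Pull the max-age value (in seconds) out of Cache-Control, if present.
  const match = /max-age=(\d+)/.exec(cacheControl);
  const maxAge = match ? Number(match[1]) : null;

  // Age is only sent when the response comes from a cache; treat a missing header as 0.
  const age = headers['age'] !== undefined ? Number(headers['age']) : 0;

  return {
    hit: /^hit from cloudfront/i.test(xCache),
    maxAge,
    age,
    // Simplified remaining freshness: max-age minus the time already spent in the cache.
    remaining: maxAge === null ? null : Math.max(0, maxAge - age),
  };
}

// The second response shown above: a hit that has been cached for 7 seconds.
const second = describeCacheStatus({
  'cache-control': 'max-age=86400, public',
  'age': '7',
  'x-cache': 'Hit from cloudfront',
});
console.log(second.hit, second.remaining); // true 86393
```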

Speed comparison

So, now we can check how much faster it is to serve a single image. Please note that this is in no way a representative measurement of CloudFront's capabilities; it is just a single test of two consecutive HTTP requests. One request does an HTTP redirect and the other serves an image. As an origin server we use a single Heroku web-worker with node.js 0.6.8 on it. Here is the picture without caching:

So it took 1005 milliseconds overall, including 799 milliseconds of latency, to do these two HTTP requests (the HTTP connection was kept open). And here are the same requests served from CloudFront (after cache & DNS warm-up):

So as you can see, it’s a whopping 93 milliseconds (!) with only 35 milliseconds of latency, which is more than 10 times faster than without caching.

Small bonus - HTTP 304

Apart from that, you get a nice bonus if you serve an ETag header. An ETag is a kind of digest of the resource that modern browsers use to save bandwidth. When a browser requests a resource that is already in its cache, it sends a special If-None-Match header with the ETag of the cached copy. CloudFront servers then compare the ETag from the browser with the ETag in their cache and, if they match, respond with HTTP 304 Not Modified without sending the resource again. Your web server might do the same for you, but for dynamically generated content this needs to be handled explicitly. With CloudFront you just need to supply your content with an ETag and you get this feature for free.
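
To make the comparison concrete, here is a minimal sketch of the decision a server has to make for such a conditional request. It is an illustration only, not how CloudFront implements it: the function name conditionalStatus is hypothetical, header names are assumed to be lower-cased, and splitting the header on commas assumes the ETags themselves contain no commas (true for hex digests).

```javascript
// Decide between 200 and 304 for a GET request.
// requestHeaders: plain object with lower-cased header names
// currentEtag: the ETag of the resource as currently served, e.g. '"b784a8d1..."'
function conditionalStatus(requestHeaders, currentEtag) {
  const header = requestHeaders['if-none-match'];
  if (!header) return 200;               // unconditional request: send the resource
  if (header.trim() === '*') return 304; // "*" matches any existing representation

  // If-None-Match uses weak comparison, so a leading W/ is ignored on both sides.
  const strip = (tag) => tag.trim().replace(/^W\//, '');
  const candidates = header.split(',').map(strip);

  // 304 Not Modified if any ETag sent by the browser matches the current one.
  return candidates.includes(strip(currentEtag)) ? 304 : 200;
}
```

A handler would send headers only (no body) when this returns 304, and the full image otherwise.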

Summary

Amazon CloudFront in combination with a custom origin provides us with a great caching and content delivery service. We can use it even for dynamically generated resources. In our (simplistic) test we saw a more than 10-fold reduction in both total time (1005 ms to 93 ms) and latency (799 ms to 35 ms) when serving from CloudFront caches. With CloudFront we can declaratively control caching behavior. With aggressive caching we can significantly reduce the load on our servers, so we need fewer servers/VMs, which reduces costs.

The CloudFront service is not free. The current price for serving one GB from CloudFront is $0.120, which for our use cases is definitely justified by the lower cost of server hours.

You can find more information in the blog post about CloudFront custom origin and in the CloudFront documentation.

Update

Today (14 May 2012) Amazon released an update for CloudFront that makes it even more suitable for our use cases. With the newest update we will be able to serve all static resources, even the ones that are not dynamically generated, via one CloudFront distribution (by using multiple origins). Another addition will allow us to pass the image width as a query parameter to resize images on the server side. We will keep you posted about our new developments in this blog; in the meantime you can follow us on Twitter and read the AWS blog about the new CloudFront features.