Post Mortem – Ensuring High Availability on Azure Functions
Intro
On May 15th, 2019 at 5:01 PM UTC, API2PDF services stopped generating PDF files and began returning errors at a high rate. This is what is depicted in the image above. Three minutes later, at 5:04 PM, services resumed.
Although the outage lasted only 3 minutes and 99% of our customers did not even notice, at least one customer was severely impacted: the customer who was generating thousands of PDFs in that brief window of time. This customer runs a daily cron job that generates many thousands of PDFs at the same time every day, and had done so without issue for several months. But on this day, the service went down.
This 3-minute outage was regrettable, and we needed to get to the bottom of it immediately to prevent it from happening again.
Investigation
Our first thought was that we hit the lame 300 connection limit that Azure Functions has in place. We could not believe that this would happen since we tackled the issue last summer. When we loaded up Azure’s Health Monitoring, we saw that the max connections we hit during the downtime was 245. We were still blown away by how high that was, since we were expecting the low 100s, but it definitely did not hit 300.
With that theory discarded, we had another one: the rate of requests increased so quickly that Azure Functions did not respond fast enough to allocate additional CPU resources. If true, this would bother us. What is the point of Serverless otherwise? We did not really have a way to verify this, though, so we went to put in a support ticket with Microsoft Azure.
In the process of filling out the support ticket, Azure automatically showed us different health analytics about API2PDF’s functions app that we had not previously known about. This was impressive.
Sure enough, it flagged a limit we were unaware of, the Thread Limit, as “Degraded”. What is this? We thought there was only a connection limit of 300?!? No, there are more limits! There is also a thread limit of 512. This is super interesting because, to avoid the 300 connection limit, we optimized the heck out of the functions app to make heavy use of C#’s async, which meant higher throughput but also more threads. I suppose you can only optimize so much… We checked the analytics and, as the image above shows, we did hit that 512 Thread Limit.
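To make the comparison concrete, here is a minimal Python sketch (for illustration only; it is not part of our C# functions app) that checks the observed peaks against the two limits mentioned in this post. The 300 and 512 limits and the 245 and 512 peaks are the numbers quoted above; the function name, the dictionary layout, and the 80% warning threshold are assumptions made for the example.

```python
# Limits quoted in this post for the Azure Functions app.
LIMITS = {"connections": 300, "threads": 512}

# Peaks observed during the outage window (from the health analytics).
OBSERVED_PEAKS = {"connections": 245, "threads": 512}


def classify(peaks, limits, warn_ratio=0.8):
    """Label each resource by how close its peak came to its limit.

    warn_ratio is an assumed threshold for this sketch, not an Azure setting.
    """
    report = {}
    for name, limit in limits.items():
        ratio = peaks[name] / limit
        if ratio >= 1.0:
            status = "limit hit"
        elif ratio >= warn_ratio:
            status = "close to limit"
        else:
            status = "ok"
        report[name] = (round(ratio, 2), status)
    return report


if __name__ == "__main__":
    for name, (ratio, status) in classify(OBSERVED_PEAKS, LIMITS).items():
        print(f"{name}: {ratio:.0%} of limit ({status})")
```

Seen this way, connections were high but still under their limit, while threads were the resource that actually ran out.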
Resolution
After doing some further research, we concluded that we could not optimize our app any further and needed a new approach to ensure high availability on Azure Functions. The currently established best practice is the following:
- Deploy your Functions App to a second data center. Now we have East US 2 and North Central deployments on Azure.
- Set up Azure Traffic Manager in front of the deployments and split the traffic evenly between the two data centers, so each app receives half the traffic.
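As a rough illustration of the even split, the Python sketch below simulates weighted random selection between two endpoints. It assumes Traffic Manager's weighted routing method with equal weights; the endpoint names, function names, and seed are hypothetical, and nothing here calls an actual Azure API.

```python
import random
from collections import Counter

# Hypothetical endpoint names for the two deployments; equal weights
# model the 50/50 split described above.
ENDPOINTS = {"east-us-2": 1, "north-central": 1}


def pick_endpoint(endpoints, rng):
    """Choose one endpoint at random, proportionally to its weight."""
    names = list(endpoints)
    weights = [endpoints[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def simulate(endpoints, lookups, seed=0):
    """Count how many simulated lookups land on each endpoint."""
    rng = random.Random(seed)
    return Counter(pick_endpoint(endpoints, rng) for _ in range(lookups))


if __name__ == "__main__":
    print(simulate(ENDPOINTS, 10_000))
```

Because Traffic Manager routes at the DNS level, the split applies to DNS lookups rather than to individual HTTP requests, so real traffic may be less even than this simulation suggests.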
Voila! Once again, we are no longer close to any of the limits. However, API2PDF is growing at a rapid pace (thanks to all of you!), so we think it is well within reason that we may eventually need to deploy to a 3rd and 4th data center. The downside is that this slightly complicates the deployment process for us.
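For a back-of-the-envelope view of when more data centers might be needed, here is a small Python sketch. It assumes that load splits evenly across deployments and scales linearly, and it uses an assumed 50% target utilization as a safety margin; the function names are hypothetical. Note that the 512 thread peak was capped by the limit itself, so real demand may have been higher than this estimate uses.

```python
import math


def per_region_peak(total_peak, regions):
    """Estimated peak per deployment, assuming an even split."""
    return total_peak / regions


def regions_needed(total_peak, limit, target_utilization=0.5):
    """Smallest number of deployments keeping each below limit * target.

    target_utilization is an assumed safety margin for this sketch.
    """
    return max(1, math.ceil(total_peak / (limit * target_utilization)))


if __name__ == "__main__":
    # Peaks from the outage: 245 connections, 512 threads.
    print(per_region_peak(245, 2))       # connections per deployment
    print(per_region_peak(512, 2))       # threads per deployment
    print(regions_needed(512, 512))      # the thread peak from the outage
    print(regions_needed(2 * 512, 512))  # if that peak doubled
```

Under these assumptions, two deployments cover the load seen during the outage, and a doubling of the thread peak would call for four.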
Conclusion
Long term, we have come to the conclusion that “Serverless” is just clever marketing. It makes no sense that, to maintain high availability, we have to keep deploying to more data centers. Serverless is supposed to scale infinitely! At some point we may switch off Serverless entirely, but that probably will not be necessary for a while. Regardless, we are doing research into Docker containers and Kubernetes to make sure we stay ahead of the curve.
We wrote this post because we are committed to our promise of transparency, love sharing technical challenges with you all, and take great pride in delivering the best product possible. Always happy to receive your feedback and advice. Reach out to us on Twitter or email us at [email protected]
Cheers.