Growing Pains – On Yesterday’s Downtime
After almost four years without any major downtime, Paymo has experienced its second major outage in the past two weeks. The whole situation is unsatisfactory, and we would like to apologize to everyone affected. We’re fully focused on avoiding these types of problems in the future. In this post we’ll try to shed some light on what happened and on the next steps we’re taking to resolve these issues.
Last Week’s Downtime
Last week’s downtime was caused by a network outage at our data center (Peer1). Their investigation found that both of their transport providers relied on the same third-party long-haul provider for the underlying fiber infrastructure. That provider was performing planned fiber maintenance in the Waco, Texas area on the morning of February 12th. During the maintenance, both backbone circuits that run between San Antonio and Dallas were taken offline, which resulted in several hours of downtime for all Paymo users. Once we saw what was happening, we immediately switched to backup servers to restore the service. However, these servers failed quickly under an unexpected surge in traffic and server load. We finally got the service up and running again after our host, Peer1, resolved their network problems.
Yesterday’s Downtime
In the past few months we’ve seen accelerated growth in the Paymo user base that caught us a bit off guard. It’s the first time since we created Paymo that we were not able to correctly anticipate the resources needed to run the service smoothly and without interruption for everyone. Yesterday’s downtime was initially caused by a few SQL queries that help generate timesheet reports (sometimes with hundreds of thousands of entries). These queries were taking much longer to complete than they usually do, which in turn slowed down the rest of the queries and ultimately bogged down the whole system.
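To illustrate how a handful of slow queries can hold up everything behind them, here is a minimal Python sketch of a toy queueing model. It is not a model of Paymo's actual database: the fixed worker pool, the query durations, and the assumption that all queries arrive at once and are served in order are invented purely for illustration.

```python
import heapq

def simulate_waits(durations, workers):
    """Return how long each query waits before it starts running.

    Simplifying assumptions (not taken from the real system): every
    query arrives at time 0, queries are served in the order given,
    and the database runs at most `workers` queries at a time.
    """
    free_at = [0.0] * workers  # time at which each worker becomes free
    heapq.heapify(free_at)
    waits = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        waits.append(start)             # arrival is at t=0, so wait == start
        heapq.heappush(free_at, start + duration)
    return waits

# Eight quick queries on their own (hypothetical 0.1 s each).
fast_only = [0.1] * 8
# The same quick queries queued behind two slow report queries (hypothetical 30 s each).
with_reports = [30.0, 30.0] + [0.1] * 8

baseline = simulate_waits(fast_only, workers=2)
degraded = simulate_waits(with_reports, workers=2)

print("longest wait without reports:", round(max(baseline), 2), "s")
print("shortest wait behind reports:", round(min(degraded[2:]), 2), "s")
```

In this toy setup the quick queries wait a fraction of a second when they run alone, but every one of them waits at least as long as a report query takes once the two reports occupy both workers.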
What we are doing to avoid service interruptions in the future:
Short-term mitigation (yesterday):
After the first hour of downtime (in which we tried to see whether some optimizations could restore normal operations), it was clear that the current server, although very powerful, would not be able to handle the load, so we decided to migrate to new hardware. We had a plan ready for this scenario, so we went ahead with it and migrated to an Amazon instance. We booted up the most powerful instance we could get (more RAM than the old hardware and SSD-backed storage), migrated the data and changed the DNS to point to the new server. The process took about 20 minutes, and after the DNS switch it took about 10 minutes for the server to handle the initial surge of traffic and stabilize.
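For readers curious about the DNS step, here is a minimal Python sketch of how one might check that a hostname has started resolving to a new server after a switch. It is not the tooling we used: the function name, the parameters and the example hostname and address in the comment are hypothetical.

```python
import socket
import time

def wait_for_dns(hostname, expected_ip, resolve=socket.gethostbyname,
                 attempts=30, delay=10.0, sleep=time.sleep):
    """Poll DNS until `hostname` resolves to `expected_ip`.

    Returns the attempt number on which the new address was first seen,
    or None if it never appeared within `attempts` tries. `resolve` and
    `sleep` are injectable so the function can be exercised offline.
    """
    for attempt in range(1, attempts + 1):
        try:
            if resolve(hostname) == expected_ip:
                return attempt
        except OSError:
            # Resolution failures (socket.gaierror is an OSError) count
            # as "not switched yet"; keep polling.
            pass
        if attempt < attempts:
            sleep(delay)
    return None

# Hypothetical usage (hostname and address are placeholders):
# wait_for_dns("app.example.com", "203.0.113.10")
```

Note that this only reflects what the local resolver returns. Other clients may keep seeing the old address until their cached records expire, which is one reason traffic arrives at a new server gradually rather than all at once.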
Medium-term mitigation (1-2 months):
We will be investigating whether a custom-built server backed by SSD storage, or by custom storage hardware like the Fusion-io ioDrive, will deliver more performance for the database server.
Long-term mitigation:
We are almost in the alpha stage of the upcoming Paymo version. The new application was redesigned from the ground up, with a new architecture that will support distributed servers, which will allow us to handle scaling more easily.
Thanks for your understanding and patience. Please let us know if you have any questions, feedback or concerns.