Lime Go Incident 2016-09-05
This Monday, Lime Go users experienced issues using the product. The problems originated from a bug in how virtual machines are handled by our hosting provider. This bug brought down the Lime Go search engine, affecting large parts of the product. Functionality was restored for most customers during Monday, and all customers were free of issues on Tuesday.
We would like to start off by saying that we are truly sorry for any trouble you experienced during this incident. Below is an account of what happened and the actions we have taken.
Our search engine experienced a breakdown early on Monday morning. Following a previous incident, we had already taken several precautions to minimize the effect of such an incident. This time, however, the error originated from a bug in how our hosting provider, Amazon Web Services, handles virtual machines, or more exactly, Docker containers. We do have redundancy when it comes to our machines, and as long as at least one search engine container stays alive, the index survives and no reindexing is needed. The simultaneous failure of all the search engine containers was a scenario that we didn’t believe could or should happen. We were wrong. The bug caused both machines to become overloaded and unresponsive, and the nature of the error required us to reindex our search engine. We quickly brought a third machine online, which could replicate some of the search engine's data to safety, but much of the data was irretrievable. Additional machines were then brought online and started rebuilding the search engine index. By 09:00 all of this was underway and affected customers had been informed of the issues.
Our Lime Go customers are divided into six clusters, and we managed to process two clusters simultaneously before reaching bottlenecks in our database machine. As each cluster was completely reindexed, full functionality was restored for the customers in that cluster. This blog post was created around lunchtime, giving affected customers more information about the current status.
Late Monday evening our cache started to be overwhelmed and became a bottleneck for the remaining clusters. This slowed the process down somewhat. Early on Tuesday morning we discovered that one of the clusters we had thought unaffected had actually been damaged. A reindex of this cluster was started, and all customers were brought back to full functionality early Tuesday afternoon.
In the wake of this incident we have taken the following precautions:
- Search engine reindexing will now be much faster thanks to code rewrites. This will significantly reduce the time it takes us to fully restore functionality in case of a complete breakdown.
- The Docker management software has been patched by our hosting provider, and we have upgraded to the latest version.
- New alarms have been set up to alert us to this type of problem. This will enable us to act even faster, hopefully before any customer experiences any problems.
We thank you for your patience and understanding. Keeping Lime Go fast and always available is at the very core of our business, and we are committed to delivering just that.