We are doing everything we can to mitigate the impact and resolve the situation as quickly as possible, and we apologize for the inconvenience caused. We strongly recommend that users log in through the browser extensions to access their vault. Most users should have access that way, though some may still see warnings that they are in “offline mode”.
We will continue to update our user base and appreciate your patience.
Update: 1:28 pm EDT
Though one of our data centers remains completely down, the service is generally stable and should be available to the majority of users (with the exception of login favicons). Some users may see connection errors but should still be able to access their data. We continue to work as quickly as possible to get the service back to 100%.
Update: 4:13 pm EDT
Most users should now be able to connect to LastPass browser extensions and LastPass.com without errors, though favicons still may not sync. We continue to closely monitor the situation.
August 13, 2014: Post Mortem of Yesterday’s Outage
As noted in our original post, on August 12th, 2014 a data center that LastPass relies on went down around 4 am Eastern Time. Below, we have outlined the timeline of events as they unfolded at the data center and with the LastPass service at large.
We again sincerely apologize for the inconveniences caused, and want to assure our community we are moving forward stronger than before, as we remain deeply committed to the security and reliability of our service for our users.
CEO of LastPass
Summary of Events
The majority of users were unaffected, because we had proper redundancy in place to deal with the loss of a data center, as well as built-in offline access via the LastPass browser extensions. However, during our efforts to scale at the secondary data center to ensure sufficient capacity at the peak of the day, we inadvertently worsened the situation through human error. Our team has takeaways from the experience and will be implementing changes going forward, as detailed in the concluding statements below.
We did receive a full RFO from our data center confirming that the BGP routing table issues affecting other companies yesterday played a role as well. For more, see: http://www.zdnet.com/internet-hiccups-today-youre-not-alone-heres-why-7000032566/
Timeline of Events (EDT)
3:50 am – We detected extreme latency and packet loss between one of our data centers and most major networks, including inter-connectivity with the other data center.
3:54 am – Our monitoring system detected the situation as critical and paged two operators.
4:00 am – We contacted our data center provider regarding the issue we were experiencing with their service.
5:00 am – With no update from our impacted data center provider, we switched from two data centers to run entirely on the second data center and disabled the affected data center.
6:00 am – We noticed that IPv6 had suddenly started working at the now-disabled data center, making it clear to us that major networking changes were being made.
7:00 am – Our report was escalated by the impacted data center provider.
8:00 am – We determined that the outage would likely be extended, so we executed a plan to add some spare machines into load balancing at the second data center to ensure we would have plenty of spare capacity at the peak of the day.
8:15 am – We began to receive alerts of intermittent connectivity issues at our second (now only) data center.
8:30 am – A small percentage of users reported logout errors that prevented them from utilizing offline mode.
9:00 am – We continued trying to work with our impacted data center provider, but received no updates on the situation or information on resolution.
9:30 am – Latency and connectivity issues increased at the second (now only) data center, which we began investigating.
10:00 am – We received acknowledgement from our impacted provider that this was a widespread problem. They indicated that they would reload the core routers and noted that it might be an extended outage.
10:30 am – The impacted data center’s network went completely down.
12:00 pm – We tracked down the source of an issue at the second data center: 3 machines we had added were running at 100Mbps instead of Gigabit (despite having Gigabit cards and being connected to Gigabit switches), and their network links were saturated.
12:45 pm – We resolved the issue with the 3 additional machines and fully restored service, still running on the second data center only, though favicons remained disabled.
2:15 pm – The impacted provider indicated they were fully online, though our machines there remained unreachable for us.
2:30 pm – We authorized the impacted data center staff to reboot our networking equipment, with no effect.
3:30 pm – We discovered and resolved the underlying issue that caused some users to be logged off immediately after login.
3:45 pm – Members of our team arrived at the impacted data center, and verified that our networking equipment was still down.
4:15 pm – We completed a swap to spare equipment, bringing the impacted data center back online.
8:45 pm – We completed testing and confirmed that replication to the secondary data center looked good, and we were fully restored with both data centers active again.
Conclusions & Lessons Learned
As a result of yesterday’s events, we have formed the following key takeaways and action steps:
- We have moved our status page to be hosted outside our network, since it was inaccessible for periods of time.
- In an effort to gather more detailed information for our community, we delayed communicating about the situation. Going forward, we will share what information we have, however sparse, and work to update the community from there, via the blog, the status page, our social accounts, and email where appropriate.
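- Our monitoring checks now verify port speed, so that a machine negotiating 100Mbps instead of Gigabit is flagged before it is added to load balancing.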
- We are considering moving to another data center provider.
- In an effort to improve the situation, we worsened it through our actions, and we will be more cautious in taking preventative actions when running on a single data center.
- We’re moving to a hosted model for DNS that includes external service checks.
- Though we designed some systems to be ‘non-critical’, such as favicons for sites, we’ll be improving our systems to minimize visual disruption during a massive outage.
- A small number of users were unable to access the service offline; we continue to investigate and test this.
- We will be implementing more disaster and redundancy tests of our systems to better prepare for a catastrophic, single data center scenario.
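
To illustrate the port speed takeaway above, here is a minimal sketch of what such a check could look like. It is not the check we actually run. It assumes Linux hosts, where the kernel exposes the negotiated link speed in Mbps at /sys/class/net/<interface>/speed, and it assumes Gigabit is the expected speed. The function names, the `EXPECTED_MBPS` constant, and the `sysfs_root` parameter are hypothetical and exist only for this example.

```python
from pathlib import Path

# Assumption for this sketch: every production link should negotiate Gigabit.
EXPECTED_MBPS = 1000


def read_link_speed(iface, sysfs_root="/sys/class/net"):
    """Return the negotiated link speed in Mbps, or None if it is unavailable."""
    speed_file = Path(sysfs_root) / iface / "speed"
    try:
        value = int(speed_file.read_text().strip())
    except (OSError, ValueError):
        # Missing interface, link down, or unreadable content.
        return None
    # The kernel reports -1 when the speed is unknown.
    return value if value > 0 else None


def check_port_speed(iface, expected_mbps=EXPECTED_MBPS, sysfs_root="/sys/class/net"):
    """Return (ok, message) so a monitoring system can alert when ok is False."""
    speed = read_link_speed(iface, sysfs_root)
    if speed is None:
        return False, f"{iface}: link speed unknown"
    if speed < expected_mbps:
        return False, f"{iface}: negotiated {speed} Mbps, expected {expected_mbps} Mbps"
    return True, f"{iface}: {speed} Mbps"


if __name__ == "__main__":
    # "eth0" is a placeholder interface name.
    ok, message = check_port_speed("eth0")
    print("OK" if ok else "CRITICAL", message)
```

Running a check like this before a spare machine enters load balancing would flag a link that negotiated 100Mbps despite Gigabit hardware, which is the condition described in the 12:00 pm timeline entry.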