LastPass Service Disruption: What Happened and What’s Next

Yesterday, LastPass suffered a six-hour disruption. While some users were still able to access their LastPass vault during this time, many were not, which we acknowledge is unacceptable for a tool that gives individuals and businesses access to their important information. Many users are understandably asking what happened, why the issues lasted as long as they did, and what we’re doing to make sure nothing like this happens again. First and foremost, we want to apologize to our users and to clarify that you do not need to take any action on your account. We also want to share what we know, along with an early look at what we’re doing in response.

What Happened

On November 20th, 2018, around 9am EST, we began to experience connectivity issues in our LastPass datacenters that resulted in login errors for many of our LastPass users. The team responded immediately, but we had difficulty pinpointing the root cause, for the reasons detailed below. Eventually, we determined that a server had failed in a way that overwhelmed the internal network, slowing down other servers and network devices as well as the connectivity between our datacenters. This resulted in slow or failed logins globally. In addition, the oversaturation of our systems slowed the execution of LastPass’ pre-established response efforts, the processes designed to bring LastPass back to an operational state in the case of a disruption.

During this time, many users were able to access their passwords through offline mode, which provides access to your vault even when you have no internet connection or our servers are inaccessible; this mode is automatically enabled when no connection is detected. Because connectivity was intermittent for users throughout the day, however, offline mode did not kick in properly, causing additional frustration and confusion for some users. As you’ll see below, this discrepancy yielded one of the first actionable takeaways of the day: a careful review of our offline mode to make improvements where necessary and to capture additional failure scenarios such as this one.
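
To illustrate the failure scenario described above, here is a minimal sketch of how a client might decide to fall back to a locally cached vault when connectivity is intermittent rather than fully absent. It is not LastPass's actual implementation; the class name `OfflineFallback`, the window size, and the failure threshold are all hypothetical.

```python
from collections import deque


class OfflineFallback:
    """Decide when to serve a locally cached vault (hypothetical sketch).

    Instead of switching to offline mode only when no connection is
    detected, track the outcome of recent server requests and fall back
    once too many of them have failed.
    """

    def __init__(self, window: int = 10, max_failure_ratio: float = 0.5):
        # Both parameters are illustrative assumptions, not real settings.
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.max_failure_ratio = max_failure_ratio

    def record(self, success: bool) -> None:
        """Record whether the latest request to the server succeeded."""
        self.outcomes.append(success)

    def should_use_offline(self) -> bool:
        """Return True when recent requests fail often enough to go offline."""
        if not self.outcomes:
            return False
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes) >= self.max_failure_ratio


fallback = OfflineFallback(window=4, max_failure_ratio=0.5)
for ok in (True, False, True, False):  # intermittent connectivity
    fallback.record(ok)
print(fallback.should_use_offline())  # prints True: 2 of 4 requests failed
```

A sliding window like this treats a flapping connection as a reason to go offline, whereas a simple "is there a connection right now?" check can keep answering yes while most requests fail.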

Ruling Out a Security Issue, Including DDoS

Early in our investigation and mitigation efforts, we sought to determine whether this was the result of a security incident. We carried out a thorough security audit and reviewed our security-related systems for patterns or activity that might indicate our systems had been impacted; no such indicators were identified. As part of our standard internal procedure, we also engaged our perimeter defense and detection vendor, who advised that there was no indication that we were experiencing an attack against our network.

We determined that the issue originated within our own infrastructure, and there was no indication that an external party had accessed our servers or taken any nefarious action.

Time to Resolution

There were many factors that made this a complex issue to identify, which led to a much longer-than-expected service interruption for our users. We’re highlighting a few so you understand the LastPass team’s line of questioning, investigations, and actions during the incident.

  • No spike in network traffic volume: We have extensive monitoring, at both the network and the individual server level, intended to help us quickly pinpoint network-related issues. However, at no time during the connectivity issues did we see a spike in the volume of our network traffic, internal or external. Eventually we were able to identify that a node was sending corrupt network packets to our internal network, causing packet loss and slowing down other servers. We promptly restarted the LastPass services on this server, but the problem turned out to be at the network stack level, which was time-consuming to diagnose and could only be resolved by taking the node off the network.
  • New code release Tuesday morning: The LastPass engineering team had completed a release of new code, and we initially believed this release to be the cause of the issue. The team spent considerable time putting LastPass into emergency mode and rolling back the release. This process normally takes less time, but the systems were slow due to the connectivity problems. The code release proved not to be the issue, but we had to complete the rollback before we looked at other potential sources.
  • Parallel events: At the same time, other popular online services like Facebook, Instagram, and a major network provider were experiencing downtime or intermittent connectivity, so our team worked with our vendors and service providers to determine if external factors were contributing to the issues that LastPass was experiencing.
  • Redundancy/failover discussion: With one datacenter having issues, the first solution is to fail over to a secondary datacenter. The team initiated the failover work during the incident, but the service was restored before the failover was completed.
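
The first point above, a faulty node with no accompanying traffic spike, can be illustrated with a small sketch. It assumes that per-node interface counters (packets received successfully, receive errors, drops) are sampled periodically, and it flags a node by its error ratio rather than by its traffic volume. The `CounterSample` type, the node names, the numbers, and the 1% threshold are hypothetical and are not taken from LastPass's monitoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Cumulative interface counters for one node (hypothetical shape)."""
    rx_packets: int  # packets received successfully
    rx_errors: int   # packets received with errors (e.g. corrupt)
    rx_dropped: int  # packets dropped


def error_ratio(prev: CounterSample, curr: CounterSample) -> float:
    """Fraction of packets in the interval that were bad or dropped."""
    good = curr.rx_packets - prev.rx_packets
    bad = (curr.rx_errors - prev.rx_errors) + (curr.rx_dropped - prev.rx_dropped)
    total = good + bad
    return bad / total if total > 0 else 0.0


def suspect_nodes(samples, threshold=0.01):
    """Return the sorted names of nodes whose error ratio meets the threshold."""
    return sorted(
        name
        for name, (prev, curr) in samples.items()
        if error_ratio(prev, curr) >= threshold
    )


# Two nodes with similar traffic volume; only node-b sees many bad packets.
samples = {
    "node-a": (CounterSample(1_000_000, 10, 5), CounterSample(1_100_000, 12, 6)),
    "node-b": (CounterSample(1_000_000, 10, 5), CounterSample(1_098_000, 4_010, 1_005)),
}
print(suspect_nodes(samples))  # prints ['node-b']
```

Both nodes handle roughly the same number of packets in the interval, so a volume-based alert would stay silent; the ratio of bad to total packets is what separates them.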

Ultimately, we added server capacity and removed the malfunctioning component, which restored normal operation of the LastPass service.

Conclusions & Key Takeaways

The LastPass team has already had many detailed conversations about what happened during the incident and is beginning to move forward with essential action items designed to prevent this from happening again. Those include:

  • Bringing on additional datacenter capacity, which is scheduled to be online and live by the end of 2018.
  • Adding clarity to our internal response playbook, including improving our external status pages for transparent communication.
  • Evaluating and selecting a tool that allows for a deeper level of network monitoring and optimized efficiency of diagnostics across all components.
  • A review of offline mode, including additional test scenarios, documentation, and external communication to users regarding the use cases of offline mode, and how to enable it, so you’re always able to access your LastPass vault.

The bottom line: The failure of the component and the subsequent overload of the system were highly atypical and led our incident responders toward steps that cost them extra time. We are taking the measures mentioned above in order to continue to provide our valued users with the secure and reliable service they have come to expect.