LivePerson Incident #SEV-106 - Root Cause Analysis
Date: January 11, 2024
Severity: SEV1
Start time: 01:17 PM CT
End time: 02:43 PM CT
Duration: 1 hour, 26 minutes
Summary
On January 11, 2024, at 1:17 PM CT, LivePerson’s Tenfold Cloud Operations team observed increasing operation latency on the Tenfold Platform. LivePerson immediately assembled a war room and began an investigation. Engineering teams determined that the incident was affecting the data streaming subsystem, which caused high latency for all users of the Tenfold Platform, including voice agents using the Tenfold Application and admin users of the Tenfold Dashboard.
Engineering teams attributed the incident to new staging data streaming components that had been brought into service prematurely. These components were being built in preparation for a major data streaming upgrade planned for the Tenfold Platform, and they triggered the delays and latency observed by users. Once the cause was identified, engineering teams immediately decommissioned the new components, allowing the Tenfold Platform to return to normal operating conditions. The incident was resolved at 2:43 PM CT, when latency metrics returned to normal.
Customer Impact
During the January 11, 2024 incident, all customers of LivePerson’s Tenfold solution experienced significant delays in operations, with some operations timing out.
Post-Incident Analysis
Prior to the incident, the data streaming system was scheduled for a major upgrade after the holiday freeze period. In preparation, the infrastructure teams had been building new, upgraded data streaming components for a blue-green style upgrade. Some of these new components were undergoing preliminary testing and were incorrectly configured to connect to the production cluster. This triggered a rebalance process, which introduced high levels of latency.
The mitigation was to immediately shut down the new data streaming components and allow the platform to return to the normal operating state.
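The misconfiguration described above, in which test components were pointed at the production cluster, is the kind of error a startup check can catch before any connection is made. The following is a minimal sketch only, not LivePerson's actual tooling: the names StreamingClientConfig, CLUSTER_ENVIRONMENTS, and validate_environment, as well as the host addresses, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamingClientConfig:
    """Hypothetical configuration for a data streaming component."""
    component_env: str        # environment the component belongs to, e.g. "staging"
    cluster_addresses: tuple  # addresses the component would connect to


# Hypothetical inventory mapping each cluster address to the environment it serves.
CLUSTER_ENVIRONMENTS = {
    "stream-prod-1.example.internal:9000": "production",
    "stream-staging-1.example.internal:9000": "staging",
}


def validate_environment(config):
    """Raise ValueError if a target address is unknown or serves another environment."""
    for address in config.cluster_addresses:
        cluster_env = CLUSTER_ENVIRONMENTS.get(address)
        if cluster_env is None:
            raise ValueError(f"unknown cluster address: {address}")
        if cluster_env != config.component_env:
            raise ValueError(
                f"{config.component_env} component must not connect to "
                f"{cluster_env} cluster at {address}"
            )


if __name__ == "__main__":
    # A staging component mistakenly pointed at a production address.
    misconfigured = StreamingClientConfig(
        component_env="staging",
        cluster_addresses=("stream-prod-1.example.internal:9000",),
    )
    try:
        validate_environment(misconfigured)
    except ValueError as err:
        print(f"refusing to start: {err}")
```

Calling a check like this before a component opens any connection would stop a staging component from joining a production cluster, under the assumption that an accurate inventory of cluster addresses is maintained.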
Corrective Actions
During the January 11, 2024 incident, LivePerson’s engineering teams mitigated the issue by removing the offending components from service.
As preventative measures against similar incidents, LivePerson is implementing the following long-term corrective actions: