Degraded performance issue.
Incident Report for Tenfold
Postmortem

LivePerson Incident #SEV-106 - Root Cause Analysis 

Date: January 11, 2024

Severity: SEV1

Start time: 01:17 PM CT

End time: 02:43 PM CT

Duration: 1 hour, 26 minutes

Summary

On January 11, 2024, at 1:17 PM CT, LivePerson’s Tenfold Cloud Operations team observed increasing operation latency on the Tenfold Platform. LivePerson immediately assembled a war room and began an investigation upon notification of the issue. Engineering teams observed that the incident was affecting the data streaming subsystem, which caused high latency for all users of the Tenfold Platform, including voice agents using the Tenfold Application and admin users of the Tenfold Dashboard.

Engineering teams attributed the incident to new staging data streaming components that were prematurely brought into service. These components were being built in preparation for a major data streaming upgrade planned for the Tenfold Platform, and they triggered the delays and latency observed by users. Once the cause was identified, engineering teams immediately decommissioned the new components and allowed the Tenfold Platform to return to normal operating conditions. At 2:43 PM CT, the incident was resolved when latency metrics returned to normal.

Customer Impact

During the January 11, 2024 incident, all customers of LivePerson’s Tenfold solution experienced significant delays in operations, with some operations timing out.

Post-Incident Analysis

Prior to the incident, the data streaming system was scheduled for a major upgrade after the holiday freeze period. In preparation for this upgrade, the infrastructure teams had been building new, upgraded data streaming components for a blue-green style upgrade. The investigation found that some of the new components were undergoing preliminary testing and were incorrectly configured to connect to the production cluster. Their connection triggered a rebalance process, which introduced high levels of latency.

The mitigation was to immediately shut down the new data streaming components and allow the platform to return to the normal operating state. 
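
A startup check that refuses to run a non-production component against a production address is one possible guard against this class of misconfiguration. The sketch below is illustrative only and does not describe LivePerson's actual tooling: the address list, the StreamClientConfig type, and the validate_stream_config function are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass

# Hypothetical production addresses, for illustration only.
PRODUCTION_CLUSTER_ADDRESSES = frozenset({"stream-prod-1.example.internal"})


@dataclass(frozen=True)
class StreamClientConfig:
    """Hypothetical configuration for a data streaming component."""
    environment: str          # e.g. "staging" or "production"
    cluster_addresses: tuple  # addresses the component will connect to


class EnvironmentMismatchError(RuntimeError):
    """Raised when a non-production component targets production."""


def validate_stream_config(config: StreamClientConfig) -> None:
    """Raise if a non-production component points at a production address."""
    if config.environment == "production":
        return
    overlap = PRODUCTION_CLUSTER_ADDRESSES.intersection(config.cluster_addresses)
    if overlap:
        raise EnvironmentMismatchError(
            f"{config.environment} component configured for production "
            f"addresses: {sorted(overlap)}"
        )
```

Calling validate_stream_config before a component opens any connection would make a misconfigured staging component fail fast rather than join the production cluster.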

Corrective Actions

During the January 11, 2024 incident, LivePerson’s engineering teams mitigated the issue by removing the offending components from service.

As preventative measures against similar incidents, LivePerson is implementing the following long-term corrective actions:

  • Team members involved in infrastructure upgrades will be educated on configuration and standard process for upgrades of this type. (Completed on January 12, 2024)
Posted Jan 18, 2024 - 17:12 CST

Resolved
This incident has been resolved.
Posted Jan 11, 2024 - 16:07 CST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 11, 2024 - 15:10 CST
Investigating
We are currently investigating reported performance issues.
Posted Jan 11, 2024 - 14:41 CST
This incident affected: Tenfold Dashboard and Application Functionality (Dashboard, Chrome Extension).