Operational delay for some customers.
Incident Report for Tenfold
Postmortem

LivePerson Incident #SEV-103, SEV-105, SEV-108, SEV-109 - Preliminary Root Cause Analysis 

This preliminary assessment is pending further in-depth analysis of the incident to confirm the root cause and corrective actions.

Summary

On Monday, January 8th, 2024, at 3:54 PM EST, LivePerson’s Tenfold Cloud Operations team observed increasing operational latency followed by a database connection timeout on the Tenfold Platform. In parallel, Tenfold customers reported that logins to the Tenfold Application and Tenfold Dashboard were failing. Upon notification of the issue, the NOC engaged our on-call team, created a war room, and started an investigation.

During our investigation, it was determined that the primary platform database was unresponsive. The evidence pointed to corrupt records in the primary database node. We modified the service to fail over incoming requests to a secondary database server. Attempts to resume the impacted services were unsuccessful, and outside resources were engaged. After collaboration between Engineering and third-party infrastructure providers, it was determined that the affected server needed to be updated because of an unforeseen Domain Name System (DNS) misconfiguration.

At 9:05 PM EST, it was confirmed that the failover had succeeded and that new incoming requests were being processed correctly. At 9:15 PM EST, after monitoring showed incoming server requests being handled successfully and services returning to normal working conditions, the issue was marked as resolved.

Subsequent outages, including service disruptions on January 9th, January 16th, and January 17th, 2024, have been identified as related to the original January 8th incident. While the affected database node has remained out of service, the remaining nodes have experienced similar stability issues. Efforts to restore operational stability are ongoing and include in-depth consultation with the database vendor and infrastructure provider, along with internal LivePerson architecture groups.

Customer Impact

Tenfold Customers were unable to perform the following actions:

  • Log in to the Tenfold Application or Tenfold Dashboard
  • Perform any call actions
  • Perform any CRM actions
  • Access Analytics data

Corrective Actions

To resolve the January 8th incident, LivePerson performed a failover to the secondary database node used by the underlying login process for the LivePerson Agent Connector for Salesforce and subsequently updated DNS configurations, returning services to normal conditions. The subsequent incidents have required similar failovers and restarts to restore service with the remaining operational nodes.

To resolve the January 9th incident, LivePerson performed a failover from the new primary node to the new secondary database node. Additionally, attempts were made to return to a 3-node configuration, without success. Service was restored with the new configuration of a primary and a secondary database node in a pooled architecture.

To resolve the January 16th, 2024 incident, LivePerson performed a restart of the primary database node and removed the secondary node from service (but not from replication). Restarts of all platform microservices were required to clear the issue and restore normal service.

To resolve the January 17th, 2024 incident, LivePerson performed a restart of the primary database node and platform microservices. Normal operation was restored after the actions.

During the January 17th service window, LivePerson performed a configuration change to bring the database architecture back to a known-good operating mode. This operation was successful, with stable service and improved latency observed afterward.

The implementation of an additional backup database node is in progress. The node will be added to the production environment after thorough testing and in accordance with the change management policy. Planning has begun for architectural simplification and platform upgrades to further improve the stability of the service.

Posted Jan 18, 2024 - 17:08 CST

Resolved
This incident has been resolved.
Posted Jan 09, 2024 - 17:15 CST
Monitoring
A mitigation has been applied and services are returning to normal operation. We are closely monitoring the performance of all services.
Posted Jan 09, 2024 - 16:32 CST
Update
The investigation continues with our Engineering teams.
Posted Jan 09, 2024 - 14:24 CST
Update
We are continuing to investigate this issue.
Posted Jan 09, 2024 - 13:25 CST
Investigating
We are currently investigating reports of operational delays for some customers.
Posted Jan 09, 2024 - 11:54 CST
This incident affected: Tenfold Dashboard and Application Functionality (Dashboard, Chrome Extension).