Tenfold Dashboard - Analytics Data
Incident Report for Tenfold
Postmortem

LivePerson Incident #SEV-103, SEV-105, SEV-108, SEV-109 - Preliminary Root Cause Analysis 

This preliminary assessment is pending further in-depth analysis of the incident to confirm the root cause and corrective actions.

Summary

On Monday, January 8th, 2024, at 3:54 PM EST, LivePerson’s Tenfold Cloud Operations team observed increasing operation latency followed by a database connection timeout on the Tenfold Platform. In parallel, Tenfold customers reported that login to the Tenfold Application and Tenfold Dashboard was failing. Upon notification of the issue, the NOC engaged our on-call team, created a war room, and started an investigation.

During our investigation, it was determined that the primary platform database was unresponsive. The evidence pointed to corrupt records in the primary database node. We modified the service to fail over incoming requests to a secondary database server. Attempts to resume the impacted services were unsuccessful, and outside resources were engaged. After collaboration between Engineering and third-party infrastructure providers, it was determined that the affected server needed to be updated because of an unforeseen misconfiguration of the Domain Name System (DNS).

At 9:05 PM EST, it was confirmed that the failover had succeeded and that new incoming requests were being processed correctly. At 9:15 PM EST, after monitoring confirmed that incoming server requests were succeeding and services had returned to normal working conditions, the issue was marked as resolved.

Subsequent service disruptions on January 9th, January 16th, and January 17th, 2024, have been identified as related to the original January 8th incident. While the affected database node has remained out of service, the remaining nodes have experienced similar stability issues. Efforts to restore operational stability are ongoing and include in-depth consultation with the database vendor, the infrastructure provider, and internal LivePerson architecture groups.

Customer Impact

Tenfold Customers were unable to perform the following actions:

  • Log in to the Tenfold Application or Tenfold Dashboard
  • Perform any call actions
  • Perform any CRM actions
  • Access Analytics data

Corrective Actions

To resolve the January 8th incident, LivePerson performed a failover to the secondary database node utilized by the underlying login process for the LivePerson Agent Connector for Salesforce and subsequently updated DNS configurations, returning services to normal conditions. The subsequent incidents have required similar failovers and restarts to restore service with the remaining operational nodes.

To resolve the January 9th incident, LivePerson performed a failover from the new primary node to the new secondary database node. Attempts to return to a 3-node configuration were unsuccessful. Service was restored with the new configuration of a primary and a secondary database node in a pooled architecture.

To resolve the January 16th, 2024 incident, LivePerson restarted the primary database node and removed the secondary node from service (but not from replication). Restarts of all platform microservices were required to clear the issue and restore normal service.

To resolve the January 17th, 2024 incident, LivePerson restarted the primary database node and the platform microservices. Normal operation was restored after these actions.

In the January 17th service window, LivePerson performed a configuration change to bring the database architecture back to a known good operating mode. The operation was successful: stable service and improved latency were observed afterward.

An additional backup database node is being implemented and will be added to the production environment after thorough testing, in accordance with the change management policy. Planning has begun for architectural simplification and platform upgrades to improve the stability of the service.

Posted Jan 18, 2024 - 17:09 CST

Resolved
This incident has been resolved and all Analytics information should be displayed as expected.
Posted Jan 09, 2024 - 14:19 CST
Monitoring
A fix has been implemented, and we can confirm that all Analytics data since yesterday's incident has been restored.

Please let us know if you see any issues with Analytics data by contacting your Voice Channel Customer Care team.
Posted Jan 09, 2024 - 11:41 CST
Identified
The issue has been identified and partially resolved: Dashboard Analytics data from 9 AM EST onward is now available. You should be able to see this data in your Analytics Dashboards.

Our Engineering team continues to work on repopulating Analytics data from the end of the outage (~9 PM EST yesterday, January 8th), and we will update you once that data is available again. Thank you for your continued patience during this unusual situation.
Posted Jan 09, 2024 - 10:59 CST
Investigating
The Tenfold site reliability team is investigating an issue in which Dashboard Analytics data has not been visible since the resolution of yesterday's incident (https://status.tenfold.com/incidents/3qwdcqsjzrqn). At this time, users will not see any Tenfold Dashboard Analytics call data from before or after yesterday's incident. The team is investigating the cause of this behavior, and more information will be provided as it becomes available.

Note that the Dashboard is accessible, and this issue only impacts the visibility of phone call data. We have confirmed that the data is being delivered and tracked in our database.

Feel free to subscribe for the most up-to-date notifications here: How to Subscribe
Posted Jan 09, 2024 - 07:01 CST
This incident affected: Tenfold Dashboard and Application Functionality (Dashboard) and Corporate Site (www.tenfold.com).