ECS Impairment
Incident Report for Evolve IP
Postmortem

Incident Summary:

The first alarm was received on April 21, 2021 at 1:57 PM EDT, and engineers immediately began troubleshooting. By 2:15 PM EDT, engineers had determined that the fastest path to resolution was to restart core applications, and services were restored by 2:20 PM EDT.

 

Root Cause: 

A software component called the Service Locator, which is a subsystem of another component called the Node Manager, was overwhelmed, causing its delays to grow exponentially. This set off a complex chain reaction that erroneously removed application servers from service. The Node Manager provides several key functions besides the Service Locator, including a mechanism that automatically migrates customers to redundant servers when a malfunction is detected. When the Service Locator is overwhelmed in this way, an application server can falsely detect neighboring servers as malfunctioning when they are not. When this occurs, the Node Manager incorrectly places healthy servers out of service, resulting in an outage.
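
The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the latency model, the timeout value, the function names (lookup_latency, servers_marked_down), and the server names are all hypothetical and are not taken from the Node Manager or Service Locator code. It shows how a timeout-based health check can flag healthy servers once lookup delays grow exponentially with backlog.

```python
# Illustrative simulation only. All names and numbers are hypothetical
# and do not reflect the actual Node Manager or Service Locator.

def lookup_latency(pending_requests, base_ms=5.0, capacity=100):
    """Assumed latency model: latency doubles for every `capacity`
    requests queued beyond what the locator can absorb."""
    overload = max(0, pending_requests - capacity)
    return base_ms * 2 ** (overload / capacity)


def servers_marked_down(servers, pending_requests, timeout_ms=50.0):
    """Return the servers a peer would flag as failed.

    Every server in `servers` is healthy. A server is flagged only
    because the lookup used for the health check exceeded the timeout.
    """
    latency = lookup_latency(pending_requests)
    return [s for s in servers if latency > timeout_ms]


servers = ["app-1", "app-2", "app-3"]

# Normal load: the lookup answers in 5 ms, so nothing is flagged.
normal = servers_marked_down(servers, pending_requests=80)

# Overload: the lookup takes 160 ms, so every healthy server is flagged.
overloaded = servers_marked_down(servers, pending_requests=600)
```

In this model no server has actually failed; the false detections come entirely from the slow lookup, which is the pattern the root cause describes.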

 

Resolution:

The Node Manager and all related services were restarted, after which they resumed operating as expected.

 

Remediation:

On April 25, 2021, Evolve IP modified the Service Location service to distribute requests across many servers to prevent the condition that was observed during the incident. The software developers are further refining the code to optimize the performance of this subsystem, and we expect to schedule maintenance shortly to deploy this improvement. The exact maintenance date is still to be determined, but the release is expected in May.
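
As a rough illustration of why distributing requests helps, the sketch below spreads lookups across several locator instances in round-robin order. The report does not state which distribution strategy was used, so round-robin, the function name distribute, and the instance names are assumptions made for the example only.

```python
from collections import Counter
from itertools import cycle

# Hypothetical sketch. Round-robin is an assumed strategy; the actual
# distribution method is not described in this report.

def distribute(requests, locators):
    """Assign each request to a locator instance in round-robin order
    and return how many requests each instance received."""
    load = Counter()
    for _request, locator in zip(requests, cycle(locators)):
        load[locator] += 1
    return load


locators = ["locator-a", "locator-b", "locator-c", "locator-d"]

# 600 requests that would otherwise hit one instance are split evenly,
# so each instance handles 150.
load = distribute(range(600), locators)
```

With the load split this way, each instance sees a quarter of the backlog, which keeps any single instance further from the overloaded state seen during the incident.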

Posted May 03, 2021 - 19:21 EDT

Resolved
Services have continued to function normally and this incident is being marked resolved.

A Service Incident Report will be provided in the postmortem section within 5 business days, including the Incident Summary, Root Cause, Resolution, and Remediation required.
Posted Apr 21, 2021 - 22:55 EDT
Monitoring
Engineers have addressed the issue and we will continue to monitor.
Posted Apr 21, 2021 - 14:20 EDT
Identified
The issue has been identified and Engineers are working towards a resolution.
Posted Apr 21, 2021 - 14:18 EDT
Investigating
We are currently investigating an issue with ECS. Our Engineers are actively engaged and we will provide updates as they become available.
Posted Apr 21, 2021 - 14:13 EDT
This incident affected: CCaaS (US) (ECS).