Incident Summary:
The first alarm was received April 21, 2021 @ 1:57 PM EDT and engineers immediately began troubleshooting. By 2:15 PM EDT engineers determined the fastest path to resolution was to restart core applications and services were restored by 2:20 PM EDT.
Root Cause:
A software component called the Service Locator, which is a sub-system of another component called the Node Manager, was overwhelmed, creating an exponential delay. This set off a complex chain reaction that erroneously removed application servers. The Node Manager provides several key functions besides the Service Locator, including a mechanism which automatically migrates customers to redundant servers when any malfunction is detected. In this condition, an application server can falsely detect neighboring servers as malfunctioning when they are not. When this occurs, the Node Manager incorrectly places servers out of service, resulting in an outage.
Resolution:
The Node Manager and all related services were restarted to resume operating as expected.
Remediation:
On April 25, 2021, Evolve IP modified the Service Location service to distribute requests across many servers to prevent the condition that was observed during the incident. The software developers are further refining the code to optimize the performance of this subsystem and we expect to schedule maintenance shortly to add this code improvement; the exact date of the maintenance is still to be determined but is expected to be released in May.