ECS Application Incident
Incident Report for Evolve IP
Postmortem

Incident Summary:

The first alarm was received at 1:22 PM EDT, notifying the engineering team of an issue with the Node Manager, which manages system-level redundancy and failover. Over the next 14 minutes, additional critical alarms reported that application servers were failing. Engineering worked to clear the issue and restart services, and the last customer came back online at 2:18 PM EDT.

 

Root Cause: 

ECS contains a Node Manager that ensures that the active node is operating in the expected manner (all processes are up and running with the proper resources available), and that a standby node is readily available should there be an issue with the active node. A token is used to continually monitor the status of all active and standby nodes. ECS then relies on a Service Locator to track the “check-in” status of the token and the node that it is monitoring at any given time.

ECS experienced a communication delay between application servers, which caused the Node Manager to break down and remove hosts from service. We believe this delay was due to a timeout within the Service Locator, caused by contention for resources with other processes on the host. The timeout delayed the return of the active node's status via the token. The underlying infrastructure (routing/switching/physical hosts) remained available throughout the incident.
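
As a rough illustration of the failure mode described above, the following Python sketch shows how a check-in timeout can remove a healthy node from service when its status report is merely delayed. This is not the ECS implementation: the `Node` class, the `evaluate_check_ins` function, the node names, and the 5-second timeout are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str             # "active" or "standby"
    last_check_in: float  # seconds, on an arbitrary clock
    in_service: bool = True


def evaluate_check_ins(nodes, now, timeout_s):
    """Return the names of nodes removed because their check-in is stale.

    The monitor only sees the age of the last check-in, so it cannot tell
    a failed node apart from a healthy node whose report was delayed.
    """
    removed = []
    for node in nodes:
        if node.in_service and now - node.last_check_in > timeout_s:
            node.in_service = False
            removed.append(node.name)
    return removed


# Hypothetical values: the active node is healthy, but its check-in was
# delayed (for example, by resource contention on the host).
nodes = [
    Node("app-1", "active", last_check_in=92.0),   # delayed report
    Node("app-2", "standby", last_check_in=98.5),  # checked in on time
]
removed = evaluate_check_ins(nodes, now=100.0, timeout_s=5.0)
print(removed)
```

In this sketch the active node is taken out of service even though nothing is wrong with it, which mirrors why the underlying infrastructure could remain available while hosts were still removed.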

 

Resolution:

The Service Locator was restarted, which reinitialized the token to reset at node 1 and resume operating as expected.

 

Remediation:

Evolve IP is re-architecting the ECS framework to give discrete resources (e.g., dedicated VMs) to the Node Manager and subsequently to the Service Locator. This will eliminate the risk of a timeout that disrupts the monitoring of active and standby nodes. This maintenance is scheduled for early morning on Saturday, April 17, 2021, and there will be no downtime.

Posted Apr 16, 2021 - 16:41 EDT

Resolved
Services have continued to function normally and this incident is being marked resolved.

A Service Incident Report will be provided in the postmortem section within 5 business days, including the Incident Summary, Root Cause, Resolution, and Remediation required.
Posted Apr 08, 2021 - 17:21 EDT
Update
Customers have confirmed systems are operational.
Posted Apr 08, 2021 - 14:41 EDT
Update
Services continue to function as normal and our engineers will continue to monitor the system throughout the rest of the day.
Posted Apr 08, 2021 - 14:26 EDT
Monitoring
Our engineers have restored services and the systems are back online.
Posted Apr 08, 2021 - 14:14 EDT
Identified
Our engineers are taking actions that will restore affected services shortly.
Posted Apr 08, 2021 - 13:52 EDT
Investigating
We are currently investigating this issue.
Posted Apr 08, 2021 - 13:44 EDT
This incident affected: OSSmosis (ECS Admin Portal) and UCaaS (Telephony).