ECS Application Responsiveness
Incident Report for Evolve IP
Postmortem

Purpose:

This document provides a summary of an infrastructure or application interruption. Interruptions are analyzed to determine the root cause and future remediation.

Incident Summary:

On 1/4/2021 at 8:15 AM EST, we received the first customer report: “ECS will not load the Bar is grayed out and trying to connect”. At 10:15 AM EST, our engineers detected and began investigating an issue related to server connections. By 10:25 AM EST, a second customer report had been received and correlated, and global incident INC-36549 was opened for investigation. At approximately 11:10 AM EST, Evolve IP engineers took action to reduce the impact of the issue by removing two underperforming servers from the pool, while also enhancing monitoring. About an hour later, we determined that although symptoms had improved, there was still evidence of periods of latency, which were causing application slowness and freezing. By 2:30 PM EST, the engineers had identified a process that was a likely cause. They restarted it at 2:45 PM EST, after which we observed a significant performance improvement. We have remained in a monitoring state since that time. Our software team has since found and fixed a software bug, and the fix was deployed successfully on 1/10/2021 at 2:00 AM EST.

Root Cause: 

A software deficiency was found that blocked connections and caused delays for end users.

Resolution:

Two software improvements were implemented: one to prevent blocked connections, and one to prevent the system from having to wait for a connection before processing queued messages.

Remediation:

This section reviews and documents areas for improvement, so that we can better partner with our Clients on future incidents and provide a high level of client satisfaction.

1. The software improvements were deployed on 1/10/2021 at 2:00 AM EST.

2. Additional monitoring was implemented to better detect this condition, and alert thresholds were increased to improve sensitivity.

Posted Jan 12, 2021 - 16:20 EST

Resolved
Our engineering team has monitored the ECS system since the maintenance performed on Sunday and we have not logged any new tickets related to this incident. At this time, services have continued to function normally and this incident is being marked resolved.
Posted Jan 12, 2021 - 16:13 EST
Update
The scheduled maintenance has been completed. We will continue to monitor today and tomorrow.
Posted Jan 10, 2021 - 09:12 EST
Update
We have identified a bug that was creating locks, which impacted performance on desktop apps. A hotfix has been developed, has passed full regression testing, and will be applied during a maintenance window from 2 AM to 6 AM ET on Sunday, January 10th, with a few minutes of downtime. https://status.evolveip.net/incidents/pwbg7m7f2nkv
Posted Jan 07, 2021 - 13:34 EST
Update
We have observed no further issues with application responsiveness and services are functioning as normal. Our Engineers will continue to monitor the system throughout the rest of the day.
Posted Jan 05, 2021 - 10:21 EST
Monitoring
Our Engineers will rebalance web servers during a maintenance window this evening, with no impact to services. Monitoring of the system will continue throughout the evening and into tomorrow morning.
Posted Jan 04, 2021 - 16:51 EST
Identified
Some servers were found to be underperforming and were removed from a high-availability pool of servers. At this time, service has continued to function normally and this incident is being marked as identified. Our Engineers will continue to monitor the system throughout the rest of the day.
Posted Jan 04, 2021 - 13:27 EST
Update
We have observed no further issues with application responsiveness. Our Engineers will continue to monitor the system throughout the rest of the day.
Posted Jan 04, 2021 - 12:12 EST
Update
We are continuing to investigate this issue.
Posted Jan 04, 2021 - 11:38 EST
Update
Our Engineers are actively investigating the incident, and additional information will be provided as it becomes available.
Posted Jan 04, 2021 - 10:58 EST
Investigating
We are currently investigating this issue.
Posted Jan 04, 2021 - 10:47 EST
This incident affected: CCaaS (US) (ECS).