Purpose:
This document is to provide a summary of any infrastructure or application interruption. Interruptions are analyzed to determine the root cause and future remediation.
Incident Summary:
On 01/04/2021 at 10:15 AM EST, our engineers detected and were investigating an issue related to server connections. Prior to that, at 8:15 AM EST, we had received the first customer report, “ECS will not load the Bar is grayed out and trying to connect”. By 10:25 AM EST, a second customer report was received and correlated, and global incident INC-36549 was opened for investigation. At approximately 11:10 AM EST, Evolve IP engineers took action to reduce the impact of the issue by removing two underperforming servers from the pool, while also enhancing monitoring. About an hour later, it was determined that while symptoms had improved, there remained evidence of some periods of latency, which was causing application slowness/freezing. By 2:30 PM EST the engineers identified a process that was a likely cause and restarted it at 2:45 PM EST—after which we observed a significant performance improvement. We have remained in a monitoring state since that time and our software team has since found and fixed a software bug, which was deployed successfully on 1/10/2021 at 2:00 AM EST.
Root Cause:
A software deficiency was found to block connections and cause delays to end users.
Resolution:
Two software improvements were implemented: one to prevent blocked connections, and one to prevent the system from having to wait for a connection before processing queued messages.
Remediation:
This section is to review and document areas for improvement, to better partner with our Clients on future incidents and providing a high-level of client satisfaction.
1. The software improvements were deployed on 1/10/2021 at 2:00 AM EST.
2. Additional monitoring was implemented to better detect this condition and alert thresholds were increased to improve sensitivity.