Purpose:
This document is to provide a summary of any infrastructure or application interruption. Interruptions are analyzed to determine the root cause and future remediation.
Incident Summary:
On Wednesday, January 6th at 8:20 AM ET, a license key was updated in the Help Center while a background index on one of our nodes was running. This caused the index on that node to become corrupted, which impacted application performance and degraded system functionality. At 8:40 AM ET the node was dropped from the server cluster. This restored functionality to all users. A ticket was opened with the vendor who reviewed logs and performed a health check on the node. At approximately 10:50 AM ET the identified node was brought back into the cluster and services were functioning as expected.
Root Cause:
Corruption of an ongoing index resulted in slow and degraded system functionality
Resolution:
The identified corrupted node was immediately removed from service and all traffic was transferred to the healthy. No downtime was experienced during this transition. The node was restarted and re-indexed to restore it to expected performance. Once health checks were completed, the node was added back to the cluster and traffic redistributed.
Remediation:
This section is to review and document areas for improvement, to better partner with our Clients on future incidents and providing a high-level of client satisfaction.