Issue Description

Starting Thursday, January 11, there were reports of occasional 500 errors when users requested pages in the Mimir application (Dashboard, Assignments, or Project pages).

Root Cause Analysis

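A server in a load-balanced cluster was malfunctioning and was not reporting its failed status anywhere. Because the load balancer kept sending it a share of the traffic, only the sessions routed to the faulty server saw errors, which is why the failures were intermittent.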

Resolution

We identified the faulty server and had it fixed by AWS. Going forward, we are investigating ways to monitor for outages like this more proactively to ensure we can resolve them as quickly as possible.
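
As an illustration of what more proactive monitoring could look like, the sketch below probes each server in a cluster directly instead of waiting for the server to report its own status. It is a minimal example under stated assumptions, not a description of Mimir's actual monitoring: the function name `find_unhealthy_nodes`, the node names, and the `probe` callable are all hypothetical. A real probe would likely issue an HTTP request to each node's health endpoint, bypassing the load balancer.

```python
from typing import Callable, Iterable, List


def find_unhealthy_nodes(
    nodes: Iterable[str],
    probe: Callable[[str], int],
    attempts: int = 3,
) -> List[str]:
    """Return the nodes that fail every probe attempt.

    `probe(node)` is a hypothetical callable that returns an HTTP status
    code for the node, or raises OSError if the node cannot be reached.
    """
    unhealthy = []
    for node in nodes:
        failures = 0
        for _ in range(attempts):
            try:
                status = probe(node)
            except OSError:
                status = None  # Treat an unreachable node as a failed probe.
            if status is None or status >= 500:
                failures += 1
        # Flag a node only if every attempt failed, to avoid alerting on a
        # single transient error.
        if failures == attempts:
            unhealthy.append(node)
    return unhealthy


# Stand-in probe for demonstration: one node always answers with a 500.
def fake_probe(node: str) -> int:
    return 500 if node == "node-b" else 200


print(find_unhealthy_nodes(["node-a", "node-b", "node-c"], fake_probe))
# prints ['node-b']
```

Probing every node individually matters here because a check made through the load balancer would succeed most of the time, masking a single faulty server in the same way users saw only occasional errors.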
