Temporary test unavailability and 1.5 hrs of downtime
Incident Report for Calibre
Postmortem

On December 3rd, Calibreapp.com suffered approximately 1 hour 30 minutes of downtime after difficulties during a routine data migration. The downtime was followed by a period of degraded performance.

During the data migration, tests recorded prior to December 3rd were temporarily unavailable to view. New tests continued to run, but their aggregation was delayed by the ongoing migration, so their results were also temporarily unavailable.

No data was lost.

Monday 3rd December, 4:25pm AEST

Forty-five minutes into the data migration, we noticed drastically degraded Postgres database performance, which brought Calibreapp.com down for almost an hour.

Monday 3rd December, 5:30pm AEST

Calibreapp.com was brought back up while still experiencing degraded performance due to the migration load.

Monday 3rd December, 9:30pm AEST

A routine vacuum and the automatic daily database backups started running against the same table that was being migrated, which caused further issues.

Tuesday 4th December, 8:00am AEST

By Tuesday, the migration had processed data back to September 2018. Timeline metrics were fully available, but detailed reports for those tests were still unavailable to view.

We continued to monitor the database.

Wednesday 5th December, 7:27pm AEST

After numerous process efficiency fixes and the replacement of a database replica, the remaining queue backlog was processed smoothly and the service returned to full availability.

Posted Dec 10, 2018 - 21:00 AEDT

Resolved
This incident has been resolved.
Posted Dec 03, 2018 - 04:15 AEDT