Downtime Report : After the dust has settled…

Late last week the main GNOME database server suffered a major crash which resulted in extended downtime for major services such as bugzilla, blogs, and anything else requiring a MySQL backend. There was some data loss, but less than 24hrs worth. This means a few blog posts were lost, and some stats data. A few bugzilla bug reports and comments were lost as well but overall, considering the nature of the hardware failure, the data loss was minimal.

We were able to bring the machine back up with help from the data center technician, briefly, but it failed yet again. It was becoming clear that we would be needing additional help.

The next step was to get the IBM on-site support technician. I think the best way to describe the IBM support staff is.. thorough. They have a long diagnostic process. Apparently this includes downloading onboard logs, updating and flashing firmware. After these diagnostics and updates the machine came back online, and quickly collapsed under a pile of RAID errors. As I’m sure you can guess, this was bad news.

The good news (if you can call it that) at this point was that it was clear the motherboard and controller were bad so the next step was to replace the motherboard. The bad news was that there was not a replacement on site. A replacement motherboard was overnighted to the data center, but by this point it was nearing the weekend and it was clear that the machine would not be in a reliable state by end of business. It was decided that restoring from backups onto a different machine would be the best plan.

Thanks goes to Owen for taking the lead on this effort. He was able to gather the backups and begin the restoration process. It took some time to transfer the backups, which are stored in Raleigh, and get them to the datacenter in Phoenix. This was done and a new database server was setup. Again, the only data that was lost was that between the previous backup (24hr cycle) and the hardware failure. Everything was imported into the new database server and services began to come back online.

There remained a bit of work to migrate services to point to the new server, but that was minimal, and handled quickly. At this point everything appears to be online and stable. Before long we will want to migrate back to the original database server, but I’m sure we’ll save this until after the GNOME 3 release.

Overall I think the process was handled well. The downtime was longer than any of us would have liked, but we did the best with what we have.

All of the admins pitched in and did the work that needed to be done. Thanks goes to everyone on the team for their help, but especially Owen for taking the lead on coordinating the on-site technician and restoring from backups. We expect to be stable and reliable for the upcoming release, and we appreciate everyones patience during the process.

Leave a Reply

Your email address will not be published. Required fields are marked *