March | 2011 | GNOME Sysadmin Team

In preparation for the big release next week the GNOME Sysadmin team has been kickin’ ass and taking names! Today (finally, my fault) we finished the migration and upgrade of the blogs.gnome.org WordPress installation to 3.1. The site has also been moved to a new RHEL 6 virtual machine. Some of you may notice some weirdness as DNS propagates, but it won’t be long before everyone is using the new site.

When you log in I’m sure you’ll be able to tell that the site has been upgraded. The admin UI has been polished quite a bit, and if you are logged in you’ll notice a bar at the top of your browser, both inside and outside of the admin panel, providing added functionality. It reminds me a bit of what you might have seen if you’ve used Blogger.

For those interested in the changes, you can read the WordPress 3.1 release notes here.

In any case, this is one more ticket we’re able to close and I’m happy to finally have this done. It took longer than I had anticipated due to repeated testing (I didn’t want to clobber anything!). It is better to be safe than sorry after all.

As usual, if you run into anything wonky please drop us a line at gnome-sysadmin@ or drop into IRC to #sysadmin.

Late last week the main GNOME database server suffered a major crash, which resulted in extended downtime for major services such as Bugzilla, blogs, and anything else requiring a MySQL backend. There was some data loss, but less than 24 hours’ worth. This means a few blog posts and some stats data were lost. A few Bugzilla bug reports and comments were lost as well, but overall, considering the nature of the hardware failure, the data loss was minimal.

We were able to bring the machine back up briefly with help from the data center technician, but it failed yet again. It was becoming clear that we would need additional help.

The next step was to get the IBM on-site support technician. I think the best way to describe the IBM support staff is... thorough. They have a long diagnostic process. Apparently this includes downloading onboard logs and updating and flashing firmware. After these diagnostics and updates the machine came back online, and quickly collapsed under a pile of RAID errors. As I’m sure you can guess, this was bad news.

The good news (if you can call it that) at this point was that it was clear the motherboard and controller were bad so the next step was to replace the motherboard. The bad news was that there was not a replacement on site. A replacement motherboard was overnighted to the data center, but by this point it was nearing the weekend and it was clear that the machine would not be in a reliable state by end of business. It was decided that restoring from backups onto a different machine would be the best plan.

Thanks go to Owen for taking the lead on this effort. He was able to gather the backups and begin the restoration process. It took some time to transfer the backups, which are stored in Raleigh, to the data center in Phoenix. Once that was done, a new database server was set up. Again, the only data lost was that between the previous backup (24-hour cycle) and the hardware failure. Everything was imported into the new database server and services began to come back online.

There remained a bit of work to migrate services to point to the new server, but that was minimal, and handled quickly. At this point everything appears to be online and stable. Before long we will want to migrate back to the original database server, but I’m sure we’ll save this until after the GNOME 3 release.

Overall I think the process was handled well. The downtime was longer than any of us would have liked, but we did the best we could with what we had.

All of the admins pitched in and did the work that needed to be done. Thanks go to everyone on the team for their help, but especially Owen for taking the lead on coordinating the on-site technician and restoring from backups. We expect to be stable and reliable for the upcoming release, and we appreciate everyone’s patience during the process.

Upgraded to 3.1 and Migrated to RHEL6

Downtime Report: After the dust has settled…

We run the servers that run GNOME.