Category Archives: maintenance

Downtime Report : After the dust has settled…

Late last week the main GNOME database server suffered a major crash, which resulted in extended downtime for major services such as Bugzilla, blogs, and anything else requiring a MySQL backend. There was some data loss, but less than 24 hours’ worth. This means a few blog posts and some stats data were lost. A few Bugzilla bug reports and comments were lost as well, but overall, considering the nature of the hardware failure, the data loss was minimal.

With help from the data center technician we were able to bring the machine back up briefly, but it failed again. It was becoming clear that we would need additional help.

The next step was to get the IBM on-site support technician. I think the best way to describe the IBM support staff is... thorough. They have a long diagnostic process. Apparently this includes downloading onboard logs and updating and flashing firmware. After these diagnostics and updates the machine came back online, and quickly collapsed under a pile of RAID errors. As I’m sure you can guess, this was bad news.

The good news (if you can call it that) at this point was that it was clear the motherboard and controller were bad, so the next step was to replace the motherboard. The bad news was that there was no replacement on site. A replacement motherboard was overnighted to the data center, but by this point it was nearing the weekend and it was clear that the machine would not be in a reliable state by end of business. It was decided that restoring from backups onto a different machine would be the best plan.

Thanks go to Owen for taking the lead on this effort. He was able to gather the backups and begin the restoration process. It took some time to transfer the backups, which are stored in Raleigh, to the data center in Phoenix. Once this was done, a new database server was set up. Again, the only data that was lost was that between the previous backup (24hr cycle) and the hardware failure. Everything was imported into the new database server and services began to come back online.

There remained a bit of work to migrate services to point to the new server, but that was minimal, and handled quickly. At this point everything appears to be online and stable. Before long we will want to migrate back to the original database server, but I’m sure we’ll save this until after the GNOME 3 release.

Overall I think the process was handled well. The downtime was longer than any of us would have liked, but we did the best we could with what we had.

All of the admins pitched in and did the work that needed to be done. Thanks go to everyone on the team for their help, but especially Owen for taking the lead on coordinating the on-site technician and restoring from backups. We expect to be stable and reliable for the upcoming release, and we appreciate everyone’s patience during the process.

Maintenance Schedule : 2010-11-15 16:00 UTC

The GNOME master LDAP server has had a failing drive in its RAID set for some time now. This last week we were able to replace the failing drive and re-sync. So far this has not caused any service interruption, but we want to verify the update by rebooting the server. We plan on doing this on 2010-11-15 16:00 UTC. We do not expect this to interrupt service for more than a few minutes, but would like to schedule a one-hour window to allow for any unexpected errors or problems.

Please make a note of this downtime in your schedule as this will disrupt access to most other servers, including the use of git.

Please let the GNOME Sysadmin Team know if you have any questions or concerns about this maintenance.

Maintenance Downtime 2010-10-26 : Report

All –

We just finished another maintenance window, which went great. All but two servers were rebooted, and all lights are green on our monitoring server. If I have somehow managed to miss something, please let me know and I’ll attend to it right away.

The goal of this maintenance window was to apply kernel and other updates to all servers, as well as to ensure that all services are configured properly to start at boot time. There are a few remaining issues related to the latter, but fewer than the last time we did this. We’re making progress, and things are looking better!

Lessons learned from this downtime:

  • progress (secondary DNS, l10n.gnome.org) still has service issues. I will attend to these this week.
  • signal (monitoring server) could use some tuning in regard to check intervals. It listed many services as “flapping” when they shouldn’t.

I will look into addressing these this week.
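
To illustrate what “flapping” means here, below is a simplified Python sketch of flap detection based on how often a check result changes state. This is not the algorithm our monitoring software uses; the function names, the sample histories, and the 50% threshold are assumptions made purely for illustration.

```python
def percent_state_change(states):
    """Percentage of consecutive check results that differ from the one before."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (len(states) - 1)


def is_flapping(states, threshold=50.0):
    """Flag a service as flapping when its state changes too often (assumed threshold)."""
    return percent_state_change(states) >= threshold


# Hypothetical check histories, oldest result first.
stable = ["OK"] * 10
single_reboot = ["OK"] * 4 + ["CRITICAL"] * 2 + ["OK"] * 4
noisy = ["OK", "CRITICAL"] * 5

for name, history in [("stable", stable), ("single_reboot", single_reboot), ("noisy", noisy)]:
    print(name, round(percent_state_change(history), 1), is_flapping(history))
```

A single reboot only produces two state changes, so with a reasonably long history it stays well under the threshold. With a short history or a low threshold the same reboot can be reported as flapping, which is the kind of tuning mentioned above.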

As usual, if you have any questions or other feedback you know where to find me.

Christer

Infrastructure Downtime: 2010-10-26 10:00am MDT – 11:00am MDT

The GNOME Sysadmin team would like to propose a maintenance window for 2010-10-26 10:00am MDT – 11:00am MDT (UTC -6). This window will include a short downtime of all services in order to apply kernel updates and other errata. If this time window is a concern to anyone, please let us know as soon as possible.
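
For anyone converting this window to another time zone, here is a minimal Python sketch. It relies only on the UTC -6 offset stated above; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# MDT is taken to be UTC-6, as stated in the announcement.
MDT = timezone(timedelta(hours=-6), "MDT")

window_start = datetime(2010, 10, 26, 10, 0, tzinfo=MDT)
window_end = window_start + timedelta(hours=1)

# Convert both ends of the window to UTC.
start_utc = window_start.astimezone(timezone.utc)
end_utc = window_end.astimezone(timezone.utc)

print(start_utc.strftime("%Y-%m-%d %H:%M UTC"))
print(end_utc.strftime("%Y-%m-%d %H:%M UTC"))
```

In UTC the window runs from 16:00 to 17:00 on 2010-10-26.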

A second reminder will go out an hour before this downtime.

Infrastructure Downtime: 2010-10-14 10:00am MDT – 11:00am MDT

I would like to propose a short downtime window for progress and socket for 2010-10-14 10:00am – 11:00am MDT. These machines manage the following services:

progress.gnome.org:

socket.gnome.org:

The purpose of this downtime is to apply errata and kernel updates, and to continue to streamline our procedure and documentation.

If anyone has any concerns about this date/time, please let me know. A second email will be sent just prior to the start of this downtime.

Thank you,

Christer

Maintenance Downtime 2010-10-06: Report

This morning we had scheduled maintenance on all GNOME servers, which caused rolling outages of services. All servers should now have all the latest security errata applied, and all services should be available. In the interest of transparency, below you’ll find an outline of our maintenance and any issues we had:

Task:

  1. Reboot all servers to apply the latest kernel updates and ensure all other errata were applied cleanly.

Issues / Lessons Learned:

  1. We were reminded that our LDAP server requires manual intervention when rebooting. This needs hardware attention/replacement, but is no longer covered under any support contract. In the future Owen will need to manually bring the machine back up via console/KVM access.
  2. When rebooting servers, the LDAP server and NFS server should be last. These both host critical services related to the functionality of the other servers.
  3. The server that hosts the translations website (l10n.gnome.org) has problems starting its services on boot. Manually starting services is required.
  4. The server hosting bugzilla and git was problematic coming back online. This requires more investigation, and it is unknown whether it’ll be a consistent problem.

Our maintenance did extend beyond the originally announced schedule because of some of the unexpected issues above, but now we’re aware of them and can prepare for them in the future. We appreciate your patience while we brought everything back online.

Planned Improvements

Based on the above, here are some things we will do to improve and streamline future maintenance windows:

  1. Ensure Owen is available and ready to bring the LDAP server back online if/when rebooted.
  2. Communicate the downtime schedule on the gnome-infrastructure list as well as the devel-announce-list. We will aim for 48 hours’ notice as well as a reminder just before the outage begins.
  3. Before the next maintenance window we will address issue #3 above, so that services on that server no longer need to be started manually.
  4. Our standard operating procedure for rebooting servers will be updated to include a priority list and dependency list (reboot order).
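
As a rough sketch of what such a reboot order could look like, the Python below derives it from a dependency list with a topological sort. The only rule taken from this report is that the LDAP and NFS servers go last; the host names and the shape of the `DEPENDENTS` table are hypothetical, and `graphlib` requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter

# Hypothetical host names. Each key maps a server to the servers that depend
# on it; those dependents must be rebooted before the key itself.
DEPENDENTS = {
    "ldap": {"git", "bugzilla", "l10n"},
    "nfs": {"git", "bugzilla", "l10n"},
    "git": set(),
    "bugzilla": set(),
    "l10n": set(),
}


def reboot_order(dependents):
    """Return a reboot order in which every server comes after its dependents."""
    return list(TopologicalSorter(dependents).static_order())


order = reboot_order(DEPENDENTS)
print(" -> ".join(order))
```

Keeping the dependency list as data means the order is recomputed whenever a server is added, and a circular dependency raises an error instead of silently producing a bad order.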

Again, thank you for your patience during our maintenance. This all goes towards a better and more mature infrastructure.

As usual, if you have any concerns, questions, or praise to share with us feel free to drop by and let us know.

Christer

Infrastructure Downtime: 2010-10-06 10:00am EDT – 11:00am EDT

The GNOME Infrastructure Team is planning regular maintenance for Wed 2010-10-06 at 10:00am EDT (UTC -4). This will include brief downtime for all major services while security errata are applied.

Please be sure to finish any work and log out of any servers before that time.

The expected maintenance window is 10:00am – 11:00am (1hr).

If you have any questions or concerns, please contact us on irc.gnome.org.