Maintenance Downtime 2010-10-06: Report

This morning we had scheduled maintenance on all GNOME servers, which caused rolling outages of services. All servers should now have all the latest security errata applied, and all services should be available. In the interest of transparency, below you’ll find an outline of our maintenance and any issues we had:

Task:

  1. Reboot all servers to apply the latest kernel updates and ensure all other errata were applied cleanly.

Issues / Lessons Learned:

  1. We were reminded that our LDAP server requires manual intervention when rebooting. The machine needs hardware attention or replacement, but it is no longer covered under any support contract. In the future Owen will need to manually bring the machine back up via console/KVM access.
  2. When rebooting servers, the LDAP server and NFS server should be rebooted last, because both host critical services that the other servers depend on.
  3. The server that hosts the translations website (l10n.gnome.org) has problems starting its services on boot. Manually starting the services is required.
  4. The server hosting Bugzilla and Git had problems coming back online. This requires more investigation, and it is unknown whether it’ll be a consistent problem.

Our maintenance did extend beyond the originally announced schedule because of some of the unexpected issues above, but now we’re aware of them and can prepare for them in the future. We appreciate your patience while we brought everything back online.

Planned Improvements:

Based on the above, here are some things we will do to improve and streamline future maintenance windows:

  1. Ensure Owen is available and ready to bring the LDAP server back online if and when it is rebooted.
  2. Communicate the downtime schedule on the gnome-infrastructure list as well as the devel-announce-list. We will aim for 48 hours’ notice as well as a reminder just before the outage begins.
  3. Before the next maintenance window we will address issue #3 above, so that the affected services no longer need to be started manually.
  4. Our standard operating procedure for rebooting servers will be updated to include a priority list and dependency list (reboot order).
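
As an illustration of what such a priority and dependency list could look like, here is a minimal Python sketch that derives a reboot order from a dependency map. It is not our actual procedure: the host names and the dependency map are hypothetical placeholders, and the only thing taken from this report is that the LDAP and NFS servers should be rebooted last. It assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each server maps to the set of servers it depends on.
# These names and dependencies are placeholders, not the real GNOME inventory.
DEPENDS_ON = {
    "ldap": set(),
    "nfs": set(),
    "l10n": {"ldap", "nfs"},
    "bugzilla-git": {"ldap", "nfs"},
}


def reboot_order(depends_on):
    """Return hosts in reboot order, with the most depended-upon hosts last."""
    # static_order() yields dependencies before their dependents, which is the
    # order in which services would normally be started.
    startup_order = list(TopologicalSorter(depends_on).static_order())
    # Reversing it puts the servers that everything else relies on at the end.
    return list(reversed(startup_order))


if __name__ == "__main__":
    for position, host in enumerate(reboot_order(DEPENDS_ON), start=1):
        print(f"{position}. {host}")
```

With this placeholder map, the two servers that nothing else provides for come first and the LDAP and NFS servers come last. TopologicalSorter raises graphlib.CycleError if the map contains a circular dependency, which would be a useful early warning when the list is maintained by hand.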

Again, thank you for your patience during our maintenance. This all goes towards a better and more mature infrastructure.

As usual, if you have any concerns, questions, or praise to share with us, feel free to drop by #sysadmin and let us know.

Christer