The GNOME master LDAP server has had a failing drive in its RAID set for some time now. This past week we were able to replace the failing drive and re-sync the array. So far this has not caused any service interruption, but we want to verify the repair by rebooting the server. We plan to do this on 2010-11-15 at 16:00 UTC. We do not expect this to interrupt service for more than a few minutes, but would like to schedule a one-hour window to allow for any unexpected errors or problems.
Please make a note of this downtime in your schedule as this will disrupt access to most other servers, including the use of git.
Please let the GNOME Sysadmin Team know if you have any questions or concerns about this maintenance.
We just finished another maintenance window, which went great. All but two servers were rebooted, and all lights are green on our monitoring server. If I have somehow managed to miss something, please let me know and I’ll attend to it right away.
The goal of this maintenance window was to apply kernel and other updates to all servers, as well as to ensure that all services are configured properly to start at boot time. There are a few remaining issues related to the latter, but fewer than the last time we did this. We’re making progress, and things are looking better!
Lessons learned from this downtime:
- progress (secondary DNS, l10n.gnome.org) still has service issues. I will attend to these this week.
- signal (monitoring server) could use some tuning in regard to check intervals. It listed many services as “flapping” when they shouldn’t.
I will look into addressing these this week.
As usual, if you have any questions or other feedback you know where to find me.
The GNOME Sysadmin team would like to propose a maintenance window for 2010-10-26 10:00am MDT – 11:00am MDT (UTC -6). This window will include a short downtime of all services in order to apply kernel updates and other errata. If this time window is a concern to anyone, please let us know as soon as possible.
A second reminder will go out an hour before this downtime.
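For readers in other time zones, the window above can be converted to UTC with a few lines of Python. This is only a convenience sketch: it assumes the fixed UTC -6 offset stated in the announcement, and the helper name `to_utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset taken from the announcement: MDT is UTC -6.
MDT = timezone(timedelta(hours=-6), "MDT")


def to_utc(local):
    """Convert a timezone-aware datetime to UTC (hypothetical helper)."""
    return local.astimezone(timezone.utc)


# Announced window: 2010-10-26 10:00am to 11:00am MDT.
start = datetime(2010, 10, 26, 10, 0, tzinfo=MDT)
end = datetime(2010, 10, 26, 11, 0, tzinfo=MDT)

if __name__ == "__main__":
    print("start:", to_utc(start).isoformat())
    print("end:  ", to_utc(end).isoformat())
```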
I would like to propose a short downtime window for progress and socket for 2010-10-14 10:00am – 11:00am MDT. These machines manage the
The purpose of this downtime is to apply errata and kernel updates, and to continue to streamline our procedure and documentation.
If anyone has any concerns about this date/time, please let me know. A second email will be sent just prior to the start of this downtime.
This morning we had scheduled maintenance on all GNOME servers, which caused rolling outages of services. All servers should now have all the latest security errata applied, and all services should be available. In the interest of transparency, below you’ll find an outline of our maintenance and any issues we had:
- Reboot all servers to apply the latest kernel updates and ensure all other errata were applied cleanly.
Issues / Lessons Learned:
- We were reminded that our LDAP server requires manual intervention when rebooting. The machine needs hardware attention or replacement, but it is no longer covered under any support contract. In the future Owen will need to manually bring the machine back up via console/KVM access.
- When rebooting servers, the LDAP server and NFS server should be last. These both host critical services related to the functionality of the other servers.
- The server that hosts the translations website (l10n.gnome.org) has problems starting its services on boot. They currently have to be started manually.
- The server hosting bugzilla and git was problematic coming back online. This requires more investigation, and it is unknown whether it’ll be a consistent problem.
Our maintenance did extend beyond the originally announced schedule because of some of the unexpected issues above, but now we’re aware of them and can prepare for them in the future. We appreciate your patience while we brought everything back online.
Based on the above, here are some things we will do to improve and streamline future maintenance windows:
- Ensure Owen is available and ready to bring the LDAP server back online if/when rebooted.
- Communicate the downtime schedule on the gnome-infrastructure list as well as the devel-announce-list. We will aim for 48hrs notice as well as a reminder just before the outage begins.
- Before the next maintenance window we will address the third issue above, where services on the l10n.gnome.org server have to be started manually after boot.
- Our standard operating procedure for rebooting servers will be updated to include a priority list and dependency list (reboot order).
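As a rough illustration of what a dependency list and reboot order could look like, here is a minimal Python sketch. The host names and the dependency map are hypothetical placeholders, not our actual inventory, and `reboot_order` is a made-up helper. It assumes Python 3.9 or later for `graphlib`, and it follows the rule noted above that the servers everything else depends on (LDAP and NFS) are rebooted last.

```python
from graphlib import TopologicalSorter

# Hypothetical hosts, for illustration only.
# Each host maps to the set of hosts whose services it depends on.
DEPENDS_ON = {
    "ldap": set(),
    "nfs": set(),
    "git": {"ldap", "nfs"},
    "l10n": {"ldap", "nfs"},
    "monitoring": {"ldap"},
}


def reboot_order(depends_on):
    """Return hosts ordered so each host precedes the hosts it depends on.

    The topological sort yields providers first, so the result is reversed
    to put shared providers such as LDAP and NFS at the end.
    """
    providers_first = list(TopologicalSorter(depends_on).static_order())
    return providers_first[::-1]


if __name__ == "__main__":
    for position, host in enumerate(reboot_order(DEPENDS_ON), start=1):
        print(position, host)
```

A circular dependency in the map makes `static_order()` raise `graphlib.CycleError`, which is a useful sanity check on the list itself.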
Again, thank you for your patience during our maintenance. This all goes towards a better and more mature infrastructure.
As usual, if you have any concerns, questions, or praise to share with us feel free to drop by #sysadmin and let us know.
The GNOME Infrastructure Team is planning regular maintenance for Wed 2010-10-06 at 10:00am EDT (UTC -4). This will include brief downtime for all major services while security errata are applied.
Please be sure to finish any work and log out of any servers before that time.
The expected maintenance window is 10:00am – 11:00am (1hr).
If you have any questions or concerns, please contact us in #sysadmin on irc.gnome.org.