I thought I’d give a quick progress report after taking a long three-day weekend. While my break was nice, I came back to quite a few tickets and emails to take care of. I think I’ve managed to get ahead of everything again. We’ll see how long that holds up this week!
This morning was a lot of Bugzilla/RT queue management. Account creations, mailing list creation, migrating a project from Google Code to GNOME, etc. Nothing terribly exciting. I’ll also need to remember not to let the moderation queue go for that long again. Usually I attend to it once daily. After nearly four days it took quite some time to moderate everything in all the queues!
One nice thing that I did manage to finish today was the addition of more monitors in Nagios. We had to wait for a firewall exception at the Red Hat data center, but I’m now able to remotely monitor much, much more on a large number of servers. Today, for starters, I added a monitor for load averages. I was also able to fix the monitors for a few MySQL server checks. I’ll continue to add more until I feel like all the critical bits are covered.
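For anyone curious what a load average monitor boils down to, here is a minimal sketch in Python. It is not the check we actually run. The thresholds and the function names `classify_load` and `format_status` are made up for illustration; the only things taken from Nagios itself are the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the one-line status output.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical thresholds for the 1, 5 and 15 minute load averages.
WARN = (5.0, 4.0, 3.0)
CRIT = (10.0, 8.0, 6.0)


def classify_load(load, warn=WARN, crit=CRIT):
    """Return a plugin state for a (1min, 5min, 15min) load tuple."""
    if any(value >= limit for value, limit in zip(load, crit)):
        return CRITICAL
    if any(value >= limit for value, limit in zip(load, warn)):
        return WARNING
    return OK


def format_status(state, load):
    """Build the single status line a plugin prints for the monitoring server."""
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[state]
    return "LOAD %s - load average: %.2f, %.2f, %.2f" % ((label,) + tuple(load))


def main():
    """Read the local load averages and return the matching exit code."""
    try:
        load = os.getloadavg()
    except OSError:
        print("LOAD UNKNOWN - could not read load average")
        return UNKNOWN
    state = classify_load(load)
    print(format_status(state, load))
    return state
```

A real plugin script would finish with `sys.exit(main())` so that the exit code reaches the monitoring server, and would normally take the thresholds as command-line arguments instead of hard-coding them.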
What else is on the list for this week?
- Submit Sysadmin hackfest proposal to take place at SCALE 9x in Feb.
- Document all allotted IPs and corresponding hostnames.
- Attend to the nearly full filesystem on window.
- Re-address building the RHEL 6 VM for the wiki migration.
Let’s hope we can get all of this done this week. Fingers crossed.
The GNOME master LDAP server has had a failing drive in its RAID set for some time now. This last week we were able to replace the failing drive and re-sync. So far this has not caused any service interruption, but we want to verify the update by rebooting the server. We plan on doing this on 2010-11-15 16:00 UTC. We do not expect this to interrupt service for more than a few minutes, but would like to schedule a one-hour window to allow for any unexpected errors or problems.
Please make a note of this downtime in your schedule as this will disrupt access to most other servers, including the use of git.
Please let the GNOME Sysadmin Team know if you have any questions or concerns about this maintenance.
We just finished another maintenance window, which went great. All but two servers were rebooted, and all lights are green on our monitoring server. If I have somehow managed to miss something, please let me know and I’ll attend to it right away.
The goal of this maintenance window was to apply kernel and other updates to all servers, as well as ensure that all services are configured properly to start at boot time. There are a few remaining issues related to the latter, but fewer than the last time we did this. We’re making progress, and things are looking better!
Lessons learned from this downtime:
- progress (secondary DNS, l10n.gnome.org) still has service issues.
- signal (monitoring server) could use some tuning in regard to check intervals. It listed many services as “flapping” when they shouldn’t.
I will look into addressing these this week.
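To give a feel for why a run of reboots can trip the “flapping” alarm, here is a rough sketch of the idea behind flap detection: look at the recent check results and measure how often the state changed. This is a simplified, unweighted version. As I understand it, Nagios weights recent state changes more heavily than older ones, so treat the function names and the 20% threshold below as assumptions for illustration only.

```python
def percent_state_change(states):
    """Percentage of consecutive check results whose state differs."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (len(states) - 1)


def is_flapping(states, high_threshold=20.0):
    """Flag a service once its state-change percentage reaches the threshold."""
    return percent_state_change(states) >= high_threshold


# One reboot seen in a history of 21 checks: two changes out of twenty.
one_reboot = ["OK"] * 10 + ["CRITICAL"] + ["OK"] * 10
print(percent_state_change(one_reboot), is_flapping(one_reboot))

# A service bouncing several times during a short history of 6 checks.
bouncing = ["OK", "CRITICAL", "OK", "OK", "CRITICAL", "OK"]
print(percent_state_change(bouncing), is_flapping(bouncing))
```

The short history in the second case is the point: the fewer results a change is measured against, the larger each change looks, which is one reason check intervals and thresholds are worth tuning together.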
As usual, if you have any questions or other feedback you know where to find me.
The GNOME Sysadmin team would like to propose a maintenance window for 2010-10-26 10:00am MDT – 11:00am MDT (UTC -6). This window will include a short downtime of all services in order to apply kernel updates and other errata. If this time window is a concern to anyone, please let us know as soon as possible.
A second reminder will go out an hour before this downtime.
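If you need the window in UTC or want to double-check the reminder time, the arithmetic can be sketched in a few lines of Python. This assumes a fixed offset of UTC -6 for MDT, as stated above; the helper name `to_utc` is just for illustration.

```python
from datetime import datetime, timedelta, timezone

# MDT modelled as a fixed UTC -6 offset, matching the announcement.
MDT = timezone(timedelta(hours=-6), "MDT")

start = datetime(2010, 10, 26, 10, 0, tzinfo=MDT)
end = start + timedelta(hours=1)        # one-hour window
reminder = start - timedelta(hours=1)   # reminder goes out an hour before


def to_utc(moment):
    """Format a timezone-aware datetime as a UTC timestamp string."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


print("reminder:", to_utc(reminder))
print("start:   ", to_utc(start))
print("end:     ", to_utc(end))
```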
The GNOME Infrastructure Team is planning regular maintenance for Wed 2010-10-06 at 10:00am EDT (UTC -4). This will include brief downtime for all major services while security errata are applied.
Please be sure to finish any work and log out of any servers before that time.
The expected maintenance window is 10:00am – 11:00am (1hr).
If you have any questions or concerns, please contact us in #sysadmin on irc.gnome.org.