This week has been yet another adventure in my little part of the world. I’ve primarily only been able to do regular maintenance on the GNOME Infrastructure, not having been able to make time for any of the larger projects or tasks on my list. I’ve got a few that I’m going to try and find time to tackle today, but currently my schedule is not entirely up to me.
For those that don’t know (which is very likely all but one or two of you), my wife and I are expecting our third child any day now. The actual due date isn’t until December, but she’s been put on strict bed-rest for the past two weeks to try and prolong the pregnancy as long as possible. We doubt very much we’ll make it until December, and we’ve already been to labor and delivery at the hospital twice now. Luckily they’ve been able to calm things down and send us home. In any case, we’ve had a few scares that the baby will come too early, so I’ve needed to take on a lot more responsibilities at home.
As I mentioned, primarily I’ve only been able to tackle everyday maintenance tickets: account creations, mailing list management, and other low-hanging fruit. By my count I have closed about a half-dozen RT tickets (account requests and updates). I’ve also worked recently with the marketing team to put something in place for a yet-to-be-announced GNOME 3.x related project. Stay tuned for those details this coming week.
If I’m able to get some backup (read: a babysitter) for a few hours I’ve got a few projects I think I can tackle. I hope to be able to get them finished today or tomorrow. We’ll see.
Could be that the next time I report I’ll have a new addition. Here’s hoping that isn’t the case... yet.
This week has been an interesting one for me with a lot going on personally. This has, unfortunately, kept me from some of my duties as a Sysadmin, but not completely. Below is a report of what I’ve been working on over the past week.
This week we saw the release (finally!) of Red Hat Enterprise Linux 6. We’ve started discussing a migration plan for the GNOME Red Hat servers, but this’ll still take some time. We’ll eventually need some help in testing GNOME services as they are migrated, so stay tuned here.
In addition I’ve spent some time this last week working with the moderators team. This is the (small) group of contributors that handles the mailman mailing list queue moderation. We’ve made some nice improvements to our procedures, but there are still only a few contributors on the team. If you’re interested in contributing to GNOME in a non-technical way, this might be a good place for you. Please let us know. Send an email to moderators@ or see the Moderators Wiki Page.
This also applies to any current list owner that would like to delegate their list management to the team.
Beyond that I’ve only been able to manage time for normal daily maintenance: RT tasks (accounts), minor user updates (SSH keys, etc.), and server monitoring.
As I mentioned at the beginning of this post, my schedule has been a bit random, so the next few weeks may be a bit unpredictable. If you have any questions or concerns I should still be available via normal channels during the week.
We just finished another maintenance window, which went great. All but two servers were rebooted, and all lights are green on our monitoring server. If I have somehow managed to miss something, please let me know and I’ll attend to it right away.
The goal of this maintenance window was to apply kernel and other updates to all servers, as well as ensure that all services are configured properly to start at boot time. There are a few remaining issues related to the latter, but fewer than the last time we did this. We’re making progress, and things are looking better!
Lessons learned from this downtime:
- progress (secondary DNS, l10n.gnome.org) still has service issues. I will attend to these this week.
- signal (monitoring server) could use some tuning in regard to check intervals. It listed many services as “flapping” when they shouldn’t.
I will look into addressing these this week.
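For anyone curious how a service ends up marked as “flapping”, here is a rough sketch of the general idea: look at a window of recent check results and measure how often the state changed. This is not the configuration or code of our monitoring server; the function names, the window contents, and the 50% threshold are all assumptions made up for illustration.

```python
def percent_state_change(history):
    """Percentage of consecutive check results that differ from the previous one."""
    if len(history) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return 100.0 * changes / (len(history) - 1)


def is_flapping(history, threshold=50.0):
    """Hypothetical rule: flag a service once its state-change percentage reaches the threshold."""
    return percent_state_change(history) >= threshold


if __name__ == "__main__":
    # One failed check in a short window counts as two state changes.
    recent = ["OK", "OK", "CRITICAL", "OK", "OK"]
    print(percent_state_change(recent), is_flapping(recent))
```

With a short window, one slow or failed check is enough to cross a threshold like this, which is one reason the check interval and the window length may be worth tuning together.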
As usual, if you have any questions or other feedback you know where to find me.
This morning we had scheduled maintenance on all GNOME servers, which caused rolling outages of services. All servers should now have all the latest security errata applied, and all services should be available. In the interest of transparency, below you’ll find an outline of our maintenance and any issues we had:
- Reboot all servers to apply the latest kernel updates and ensure all other errata were applied cleanly.
Issues / Lessons Learned:
- We were reminded that our LDAP server requires manual intervention when rebooting. This needs hardware attention/replacement, but is no longer covered under any support contract. In the future Owen will need to manually bring the machine back up via console/KVM access.
- When rebooting servers, the LDAP server and NFS server should be last. These both host critical services related to the functionality of the other servers.
- The server that hosts the translations website (l10n.gnome.org) has problems with starting its services on boot. Manually starting services is required.
- The server hosting bugzilla and git was problematic coming back online. This requires more investigation, and it is unknown whether it’ll be a consistent problem.
Our maintenance did extend beyond the originally announced schedule based on some of the above unexpected issues, but now we’re aware of them and can prepare for them in the future. We appreciate your patience while we brought everything back online.
Based on the above, here are some things we will do to improve and streamline future maintenance windows:
- Ensure Owen is available and ready to bring the LDAP server back online if/when rebooted.
- Communicate the downtime schedule on the gnome-infrastructure list as well as the devel-announce-list. We will aim for 48hrs notice as well as a reminder just before the outage begins.
- Before the next maintenance window we will address the third issue above (services on the l10n.gnome.org server that have to be started manually after boot).
- Our standard operating procedure for rebooting servers will be updated to include a priority list and dependency list (reboot order).
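As a rough illustration of what a dependency-based reboot order could look like, the sketch below derives one from a small dependency map, so that servers providing critical services (LDAP, NFS) are rebooted last. The role labels and the dependency edges are hypothetical placeholders, not our real inventory, and this is not the actual procedure we will adopt. It uses `graphlib` from the Python standard library (Python 3.9 or newer).

```python
from graphlib import TopologicalSorter

# Hypothetical role labels and dependency edges, for illustration only.
# Each key depends on the services provided by the hosts in its set.
DEPENDS_ON = {
    "l10n": {"ldap", "nfs"},
    "bugzilla-git": {"ldap", "nfs"},
    "nfs": {"ldap"},
    "ldap": set(),
}


def reboot_order(depends_on):
    """Return hosts ordered so that each host comes before everything it depends on."""
    # static_order() yields dependencies first; reversing it puts them last.
    dependencies_first = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(dependencies_first))


if __name__ == "__main__":
    for position, host in enumerate(reboot_order(DEPENDS_ON), start=1):
        print(position, host)
```

If the map ever contains a circular dependency, `TopologicalSorter` raises `graphlib.CycleError`, which would be a useful sanity check on the dependency list itself.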
Again, thank you for your patience during our maintenance. This all goes towards a better and more mature infrastructure.
As usual, if you have any concerns, questions, or praise to share with us feel free to drop by #sysadmin and let us know.