Maintenance Downtime 2010-10-26: Report

All –

We just finished another maintenance window, which went great. All but two servers were rebooted, and all lights are green on our monitoring server. If I have somehow managed to miss something, please let me know and I’ll attend to it right away.

The goal of this maintenance window was to apply kernel and other updates to all servers, and to ensure that all services are configured to start at boot time. There are a few remaining issues related to the latter, but fewer than the last time we did this. We’re making progress, and things are looking better!

Lessons learned from this downtime:

  • progress (secondary DNS, l10n.gnome.org) still has service issues.
  • signal (monitoring server) could use some tuning in regard to check intervals. It listed many services as “flapping” when they shouldn’t.

I will look into addressing these this week.
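
To illustrate why a round of reboots can trip a "flapping" alert, here is a minimal sketch of one common approach: counting how often a service changed state across its recent check results. This post does not say which algorithm or thresholds our monitoring server uses, so the function names, the 50% threshold and the sample data below are all assumptions made for illustration.

```python
def percent_state_change(states):
    """Return the percentage of consecutive check results that differ."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (len(states) - 1)


def is_flapping(states, threshold=50.0):
    """Flag a service whose state changed in at least `threshold` percent of checks.

    The threshold is a hypothetical value, not one taken from our setup.
    """
    return percent_state_change(states) >= threshold


# One reboot seen through a short history: down once, then back up.
reboot_window = ["OK", "OK", "DOWN", "OK", "OK"]

# The same reboot seen through a longer history of 21 results.
longer_history = ["OK"] * 16 + reboot_window

print(percent_state_change(reboot_window), is_flapping(reboot_window))
print(percent_state_change(longer_history), is_flapping(longer_history))
```

With only a handful of results in view, a single reboot accounts for half of the transitions and looks like flapping. Over a longer history the same reboot is a small blip, which is the kind of tuning the monitoring checks need.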

As usual, if you have any questions or other feedback, you know where to find me.

Christer

Sysadmin Hackfest Proposal – SCALE

Calling all Volunteers –

It’s still a ways away, but SCALE is happening this February and I’d like to propose a Sysadmin Hackfest! Currently both Jeff and I plan to be there (we’ll also be doing the GNOME booth), and we’d love to see anyone else there. We’ve got some pending tickets that are perfect for a weekend hackfest, such as an openvpn setup, LDAP improvements, etc. The more people we can round up the more we can get done!

If you’re in the area, or plan on attending SCALE, please let us know if you’d be willing to contribute some time to a hackfest. Even if you’re not familiar with the above technologies, or don’t have root on the servers, I guarantee we can find something for you to do. We can knock out a lot of bugzilla tickets in one weekend. It’ll be record-breaking! (Can you tell I’m excited about a hackfest!?)

Please let me know on or off-list if you’d be available to show up and help out. We’ll need a rough head count to present to the board to make it an “official” hackfest, so please don’t be shy.

I’d like to be able to present a tentative head count at the end of the week, so start thinking about your plans! For me, it’s the draw of sunny California in Feb (instead of three feet of snow in Salt Lake City!)
Thanks!

Christer

Improved Mailman List Moderation with Listadmin

The GNOME mail ecosystem is a very busy one, encompassing hundreds of mail aliases, and dozens of mailing lists. Tens of thousands of emails flow through our mail server each day. These numbers grow almost daily and keeping all of this maintainable is a challenge. Historically we have done a pretty good job keeping on top of things, but every now and then something gets away from us and we’re reminded that keeping things simple, and using the right tool for the job is the best way to go.

In the past I’ve managed mailing lists using a Perl-based tool called “listadmin”. listadmin allows you to moderate pending mailman queues from the command-line, which is simpler than navigating through the web interface for each list. I used listadmin for years to moderate Ubuntu lists, but oddly enough when I started working within the GNOME community listadmin didn’t work reliably. Fixing listadmin has been a priority for me for the past few months, and finally we’re there! Thanks to the contribution of a community member, Raymond Lu, we’ve found a fix for listadmin that works on the GNOME mailing lists.

This post is for all of the mailing list administrators and moderators out there.

Installation

In order for listadmin to work reliably on GNOME mailing lists, you’ll need to grab the latest version from Debian squeeze or apply a patch manually. (Details regarding the patch are outlined in the .diff.gz file in that link.) It seems there have been some interface and internationalization changes in mailman, and the patch takes care of those. Once you’ve got this version installed or patched, see the configuration options below:

Configuration

The listadmin configuration is pretty straightforward. You define the URL, password and list name for each mailing list you moderate, and you can optionally configure the default action and log file. Then simply run listadmin, and you’ll be prompted for the action to take on each pending message. Below is an example configuration, .listadmin.ini, for two of the GNOME mailing lists:

adminurl http://mail.gnome.org/mailman/admindb/{list}
default discard
log ~/.listadmin.log

password s3cr3t!
cheese-list@gnome.org

password p@ssw0rd!
ekiga-list@gnome.org

These, of course, are not the real passwords, but they show how to assign different passwords to different lists. If you are a list moderator, I would suggest you set up a similar configuration for all the lists that you moderate. Then you can simply run listadmin every few days and easily keep on top of your lists. If all moderators were able to do as much, none of the lists would ever get away from us!
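
If you moderate many lists, typing the file by hand gets tedious. Below is a minimal sketch of generating it with a small Python script. It only uses the directives shown in the example above (adminurl, default, log, password); the function name and the layout of the input dictionary are my own assumptions, not part of listadmin.

```python
def render_listadmin_config(adminurl, default_action, log_path, lists_by_password):
    """Build the text of a listadmin configuration file.

    `lists_by_password` maps a moderator password to the list addresses
    that share it (a hypothetical input layout chosen for this sketch).
    """
    lines = [
        f"adminurl {adminurl}",
        f"default {default_action}",
        f"log {log_path}",
    ]
    for password, addresses in lists_by_password.items():
        lines.append("")  # blank line between groups, as in the example
        lines.append(f"password {password}")
        lines.extend(addresses)
    return "\n".join(lines) + "\n"


config_text = render_listadmin_config(
    adminurl="http://mail.gnome.org/mailman/admindb/{list}",
    default_action="discard",
    log_path="~/.listadmin.log",
    lists_by_password={
        "s3cr3t!": ["cheese-list@gnome.org"],
        "p@ssw0rd!": ["ekiga-list@gnome.org"],
    },
)
print(config_text)
```

Save the output as .listadmin.ini in your home directory. Since the file holds passwords in plain text, it is worth making it readable only by you (for example with chmod 600).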

If you know any list moderators that could make use of listadmin, please forward this on to them. Whether it be for GNOME (particularly for GNOME!), or another project, it sure is a time saver!

If you’ve got any questions about setting up listadmin, need a reminder regarding your moderator credentials, or would simply like to help out with list moderation, feel free to contact me or any of the other core Sysadmins. We’re more than happy to help!

Infrastructure Downtime: 2010-10-26 10:00am MDT – 11:00am MDT

The GNOME Sysadmin team would like to propose a maintenance window for 2010-10-26 10:00am MDT – 11:00am MDT (UTC -6). This window will include a short downtime of all services in order to apply kernel updates and other errata. If this time window is a concern to anyone, please let us know as soon as possible.

A second reminder will go out an hour before this downtime.
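
For anyone outside the Mountain time zone, the window can be converted to UTC with a few lines of Python. This is a small sketch that relies only on the UTC -6 offset stated above; the variable and function names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# MDT as a fixed offset, using the UTC -6 value from the announcement.
MDT = timezone(timedelta(hours=-6), "MDT")

window_start = datetime(2010, 10, 26, 10, 0, tzinfo=MDT)
window_end = datetime(2010, 10, 26, 11, 0, tzinfo=MDT)
reminder_at = window_start - timedelta(hours=1)  # the second reminder


def as_utc(moment):
    """Format a timezone-aware datetime as a UTC timestamp string."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


print("Reminder:    ", as_utc(reminder_at))
print("Window start:", as_utc(window_start))
print("Window end:  ", as_utc(window_end))
```

The same approach works for any other announcement: swap in the stated offset and the local start and end times.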

Wanted: Perl Guru

Here at the GNOME Foundation we maintain a large number of mailman-powered mailing lists, which facilitate discussion on development and related projects. The maintenance and moderation (read: spam filtering) of these lists can become a burden on the list maintainer(s), and many of them fall behind.

I have maintained a number of Free Software mailing lists over the years, and the best solution that I’ve found to keep on top of this is a tool called ‘listadmin’. listadmin is a Perl-based command line utility that communicates with the mailman web-interface and handles the moderation of mailing lists. I’ve found this tool to be a huge timesaver with the half-dozen lists that I normally maintain. To be honest, as part of my morning routine I run ‘listadmin’ and I complete the moderation of over a dozen lists in under a minute. It is really quite nice.

For some reason, listadmin is problematic with our GNOME mailing lists (the lists I mention above are Ubuntu-related lists, which is a different mailman version than we run here at the GNOME Foundation). It’ll work with some lists and not with others, and the reason is unclear. I’ve tried troubleshooting it a bit, but my Perl doesn’t go too far beyond local system administration scripting.

If anyone out there on the internets considers themselves a Perl Guru and would like to donate some time toward this cause, please contact me. It would be a great benefit to the foundation, and a real time saver for list moderators if we could figure out the issue with listadmin (perhaps it just requires a small patch). The ability to maintain one list or a dozen lists becomes simple with this tool, and would really allow us to catch up and get a handle on some of these pending mail queues.

If you are interested, please find me on irc.gnome.org, or email me at cedwards AT I HATE THE SPAM gnome org.

Infrastructure Downtime: 2010-10-14 10:00am MDT – 11:00am MDT

I would like to propose a short downtime window for progress and socket for 2010-10-14 10:00am – 11:00am MDT. These machines manage the following services:

progress.gnome.org:

socket.gnome.org:

The purpose of this downtime is to apply errata and kernel updates, and to continue to streamline our procedure and documentation.

If anyone has any concerns about this date/time, please let me know. A second email will be sent just prior to the start of this downtime.

Thank you,

Christer

Maintenance Downtime 2010-10-06: Report

This morning we had scheduled maintenance on all GNOME servers, which caused rolling outages of services. All servers should now have all the latest security errata applied, and all services should be available. In the interest of transparency, below you’ll find an outline of our maintenance and any issues we had:

Task:

  1. Reboot all servers to apply the latest kernel updates and ensure all other errata were applied cleanly.

Issues / Lessons Learned:

  1. We were reminded that our LDAP server requires manual intervention when rebooting. The machine needs hardware attention or replacement, but it is no longer covered under any support contract. In the future Owen will need to manually bring the machine back up via console/KVM access.
  2. When rebooting servers, the LDAP server and NFS server should be last. These both host critical services related to the functionality of the other servers.
  3. The server that hosts the translations website (l10n.gnome.org) has problems starting its services on boot. The services currently have to be started manually.
  4. The server hosting bugzilla and git was problematic coming back online. This requires more investigation, and it is unknown whether it’ll be a consistent problem.

Our maintenance did extend beyond the originally announced schedule because of some of the unexpected issues above, but now we’re aware of them and can prepare for them in the future. We appreciate your patience while we brought everything back online.

Planned Improvements

Based on the above, here are some things we will do to improve and streamline future maintenance windows:

  1. Ensure Owen is available and ready to bring the LDAP server back online if/when rebooted.
  2. Communicate the downtime schedule on the gnome-infrastructure list as well as the devel-announce-list. We will aim for 48hrs notice as well as a reminder just before the outage begins.
  3. Before the next maintenance window we will address issue #3 above, so that services no longer need to be started manually.
  4. Our standard operating procedure for rebooting servers will be updated to include a priority list and dependency list (reboot order).
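
As a rough illustration of the reboot-order idea in item 4, a dependency list can be turned into an ordering with a topological sort. This is only a sketch: the host names and the dependency map below are hypothetical stand-ins, and the one rule taken from our lessons learned is that the LDAP and NFS servers go last.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency list: each host maps to the hosts it relies on.
depends_on = {
    "web": {"ldap", "nfs"},
    "bugzilla-git": {"ldap", "nfs"},
    "l10n": {"ldap"},
    "ldap": set(),
    "nfs": set(),
}


def reboot_order(depends_on):
    """Return hosts ordered so that a host others rely on is rebooted after them."""
    # static_order() yields dependencies before their dependents ...
    dependencies_first = list(TopologicalSorter(depends_on).static_order())
    # ... so reversing it puts the critical hosts at the end.
    return list(reversed(dependencies_first))


order = reboot_order(depends_on)
print(" -> ".join(order))
```

With this map, ldap and nfs always land in the last two positions. A circular dependency would raise graphlib.CycleError, which is a useful sanity check on the list itself.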

Again, thank you for your patience during our maintenance. This all goes towards a better and more mature infrastructure.

As usual, if you have any concerns, questions, or praise to share with us, feel free to drop by and let us know.

Christer

Infrastructure Downtime: 2010-10-06 10:00am EDT – 11:00am EDT

The GNOME Infrastructure Team is planning regular maintenance for Wed 2010-10-06 at 10:00am EDT (UTC -4). This will include brief downtime for all major services while security errata are applied.

Please be sure to finish any work and log out of any servers before that time.

The expected maintenance window is 10:00am – 11:00am (1hr).

If you have any questions or concerns, please contact us on irc.gnome.org.

Hello World!

I don’t even know where to start! There have been a lot of changes and improvements going on within the GNOME Sysadmin Team lately, and I thought I should share a few of the things we’ve been working on, as well as some of the changes we have planned.

As Olav mentioned in our inaugural post, I’ve been hired as the GNOME part-time Sysadmin. This role includes a number of responsibilities, many of which will be transparent (and/or boring) to most of you, but there are a few that will affect the community, and that feels like a good place to begin.

First of all, we’re going to be publishing our progress and changes regularly on this blog. This will include scheduled maintenance on services (reboots, downtime, etc). Our plan is to schedule maintenance windows, and give the community adequate notice before we take down any services. We haven’t forgotten that our job is to make sure you can do your job, and improved communication is a key part of that.

Second, we’ve spent the past two weeks focusing on trying to clean up our rough edges and establish our baseline for the future. Part of this is defining where our priorities lie, and which projects are the most important. While we have an existing list of pending tickets in bugzilla, we’d also like to hear from the rest of you on what you’d like to see done. I can’t promise that we’ll be able to please all of you all the time, but we also can’t address your issues if we don’t know what they are! In this regard, if you have any issues we need to know about, please stop by bugzilla and file a ticket.

Beyond that, please feel free to contact me directly with any concerns or ideas you might have. I’m happy to discuss them with you.

Christer