New GNOME sysadmin: Andrea Veri

We’ve added a new member to the GNOME sysadmin team: Andrea Veri.

Andrea has been handling the accounts queue for a very, very long time. He’s also involved with the GNOME membership committee, which handles applications for GNOME Foundation membership as well as the elections. And now he’s a GNOME sysadmin too.

Aside from GNOME, he also works within the Fedora sysadmin/infrastructure team, does some Fedora packaging and is a Debian Developer.

Personally, I’m looking forward to him still handling every single account request, on top of cleaning up our infrastructure and documenting it 😛

New servers

Small update regarding sysadmin things:

  • Two new machines have been added, replacing the very old hardware that no longer has a support contract (button, menubar, window, container). Stephen Smoogen assisted in getting them racked up and networked.
  • Our RHEL entitlements expired. Bastien Nocera assisted in extending them, and we now have a procedure for keeping them updated.
  • Our RHEL5 machines now run RHEL 5.7. This was sorely needed, as the sssd version in 5.6 was really buggy.
  • Our mail and DNS server often locks up. We’re unsure of the cause; it seems to happen after heavy spam connections (from loads of IP addresses). The machine is old, but the problem does not seem hardware related. Hopefully RHEL 5.7 fixes it. That said, we’ll migrate all services off this machine anyway (due to the lack of a support contract).
  • We still lack three RHEL entitlements (ETA: next week?). We need those before we can continue moving services off the obsolete hardware.

A little uneventful

I think it’s been a little while since my last post, so I thought I’d toss something out there to our loyal readers (all three of you!). What have we been working on this week? Let’s see…

I pushed hard (maybe too hard) to get the new servers out of my house. The Foundation purchased two new servers and we’ve been trying to coordinate their installation for quite some time now. It finally looks like we have the correct address, contacts and coordination to get them put in. Fingers crossed, I should be able to get them out this weekend.

Let’s see, prior to that we had the big GNOME 3.0 release, which overall went well. It put some pressure on a few places, but that turned out to be a good thing, because we found where we could improve. Each time we’re pushed we learn one more thing about how we can streamline and improve our setup. It’s a good challenge, if anything.

I’m happy to say that, after the initial tweaking, we stood up to Slashdot, which, as you might imagine, is not a simple feat for many sites.

Ohh, a few more somewhat sysadmin-related announcements have been made regarding the Snowy project. If you’re not familiar with Snowy, think “Ubuntu One”, but open source and specific to Tomboy. You can find out more here and here.

For now that’s all I’ve got. Signing off.

Migration aftershocks

As you may have read, we’ve finally made the move to WP 3.1 and migrated from the WPMU installation to the integrated multi-site support in the mainline WordPress latest release. This happened yesterday afternoon, and other than some DNS propagation we didn’t anticipate any problems.

Well, it turns out that some of the caching plugins and configuration didn’t properly migrate / upgrade / function after the move, so the server went down under heavy load overnight. I spent some time this morning tweaking and monitoring, then tweaking again, until I felt the right caching solution was in place. It is now applied to all blogs within the multi-site network and should improve performance significantly.

During this monitoring and tweaking I was closely watching the error_log output for anything else that might need attention. One other thing did turn up: outdated and unmaintained WordPress themes that were causing PHP parse errors.
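
For the curious, here’s a minimal sketch of how one might hunt for themes with parse errors. The wp-content path is illustrative and this isn’t the exact procedure we used; it simply runs PHP’s built-in lint check (php -l) over every theme file.

```python
#!/usr/bin/env python
# Hypothetical sketch: lint every PHP file under the themes directory
# with `php -l` and report the files that fail to parse.
import subprocess
from pathlib import Path

THEMES_DIR = Path("/var/www/blogs/wp-content/themes")  # illustrative path

for php_file in sorted(THEMES_DIR.rglob("*.php")):
    # `php -l` exits non-zero when the file contains a parse error.
    result = subprocess.run(
        ["php", "-l", str(php_file)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"Parse error in {php_file}:")
        print(result.stdout.strip() or result.stderr.strip())
```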

I spent some time trying to find upgraded versions of the themes, and updated where I could. There were three themes for which I was unable to find updated versions, so it was decided to clobber them entirely.

If you’ve found that your blog has suddenly reverted to the default theme, it is likely that you were using one of the outdated themes that had to be removed. My apologies. Please feel free to select a different theme or suggest a new theme for installation.

As usual, please accept our apologies for these service disruptions.

Upgraded to 3.1 and Migrated to RHEL6

In preparation for the big release next week, the GNOME Sysadmin team has been kickin’ ass and taking names! Today (finally; my fault) we finished the migration and upgrade of the blogs.gnome.org WordPress installation to 3.1. The site has also moved to a new RHEL 6 virtual machine. Some of you may notice some weirdness as DNS propagates, but it won’t be long before everyone is using the new site.

When you log in I’m sure you’ll be able to tell that the site has been upgraded. The admin UI has been polished quite a bit, and while logged in you’ll notice a bar at the top of your browser, both inside and outside of the admin panel, providing added functionality. It reminds me a bit of what you might have seen if you’ve used Blogger.

For those interested in the changes, you can read the WordPress 3.1 release notes here.

In any case, this is one more ticket we’re able to close and I’m happy to finally have this done. It took longer than I had anticipated due to repeated testing (I didn’t want to clobber anything!). It is better to be safe than sorry after all.

As usual, if you run into anything wonky please drop us a line at gnome-sysadmin@ or find us on IRC.

Downtime Report: After the dust has settled…

Late last week the main GNOME database server suffered a major crash, which resulted in extended downtime for major services such as Bugzilla, blogs, and anything else requiring a MySQL backend. There was some data loss, but less than 24 hours’ worth. This means a few blog posts were lost, along with some stats data and a few Bugzilla bug reports and comments. Overall, though, considering the nature of the hardware failure, the data loss was minimal.

We were able to bring the machine back up briefly with help from the data center technician, but it failed yet again. It was becoming clear that we would need additional help.

The next step was to bring in the IBM on-site support technician. I think the best way to describe the IBM support staff is… thorough. They have a long diagnostic process, which apparently includes downloading onboard logs and updating and flashing firmware. After these diagnostics and updates the machine came back online, then quickly collapsed under a pile of RAID errors. As I’m sure you can guess, this was bad news.

The good news (if you can call it that) was that it was now clear the motherboard and controller were bad, so the next step was to replace the motherboard. The bad news was that there was no replacement on site. A replacement motherboard was overnighted to the data center, but by that point it was nearing the weekend and it was clear the machine would not be in a reliable state by end of business. It was decided that restoring from backups onto a different machine would be the best plan.

Thanks go to Owen for taking the lead on this effort. He gathered the backups and began the restoration process. It took some time to transfer the backups, which are stored in Raleigh, to the data center in Phoenix. Once that was done, a new database server was set up. Again, the only data lost was that between the previous backup (we run a 24-hour cycle) and the hardware failure. Everything was imported into the new database server and services began to come back online.

There remained a bit of work to repoint services at the new server, but that was minimal and handled quickly. At this point everything appears to be online and stable. Before long we will want to migrate back to the original database server, but I’m sure we’ll save that until after the GNOME 3 release.

Overall I think the process was handled well. The downtime was longer than any of us would have liked, but we did the best with what we had.

All of the admins pitched in and did the work that needed to be done. Thanks go to everyone on the team for their help, especially Owen for taking the lead on coordinating the on-site technician and restoring from backups. We expect to be stable and reliable for the upcoming release, and we appreciate everyone’s patience during the process.

Another busy week

I realized today that it’s been a week or two since my last public update regarding the GNOME Sysadmin team. We’ve been working away on things and are making good progress. Let me just toss out a few things that we’ve done in the past week:

  • Worked with the marketing team to publish gnome3.org. If you haven’t seen the announcements or the site, take a minute and check it out.
  • Added some spam filtering to the foundation blog and a few others upon request. Hopefully that has already shown improvements.
  • Submitted the Sysadmin hackfest proposal to take place at SCALE 2011. See http://live.gnome.org/Hackfests/Sysadmin-SCALE9x
  • Did some follow-up regarding hardware donations.
  • Worked with a sponsor who has graciously agreed to donate domain transfers and registration for all the Foundation domains.
  • Continued work on Nagios, including a new Python web interface, automatic per-host monitor generation and more efficient testing (see the sketch after this list).
  • Started the RHEL 6 build out and testing with two VMs. We’ll be migrating from nss_ldap to sssd for authentication and caching, which is part of this update.
  • Spent time researching and planning a migration of blogs.gnome.org to a subversion-managed WP 3.x installation. This is ongoing.
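
As a rough illustration of the per-host monitor generation mentioned above, a script along these lines can render Nagios object definitions from a simple host list. The host addresses and check commands below are invented for the example; they aren’t our production configuration.

```python
#!/usr/bin/env python
# Hypothetical sketch: generate Nagios host and service definitions
# from a list of (hostname, address) pairs. Hosts, addresses and
# check commands are illustrative only.
HOSTS = [
    ("window", "192.0.2.10"),
    ("menubar", "192.0.2.11"),
]

CHECKS = [
    ("Load Average", "check_nrpe!check_load"),
    ("Disk Usage", "check_nrpe!check_disk"),
]

HOST_TEMPLATE = """define host {{
    use        generic-host
    host_name  {name}
    address    {address}
}}
"""

SERVICE_TEMPLATE = """define service {{
    use                  generic-service
    host_name            {name}
    service_description  {desc}
    check_command        {command}
}}
"""

def render() -> str:
    """Render one host block plus a service block per check, per host."""
    parts = []
    for name, address in HOSTS:
        parts.append(HOST_TEMPLATE.format(name=name, address=address))
        for desc, command in CHECKS:
            parts.append(SERVICE_TEMPLATE.format(
                name=name, desc=desc, command=command))
    return "\n".join(parts)

if __name__ == "__main__":
    # Write the output to a file included by nagios.cfg, for instance.
    print(render())
```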

All of this has just been the highlights of the last seven days. There has of course also been all the regular maintenance that goes along with it.

I think things are really going well and we’re keeping our queues clean as best we can. Next month three of the Sysadmins will be together at SCALE and we’ll have a lot more to report regarding our planned projects then.

As usual, if you need anything from us please email me (cedwards AT gnome DOT org) or find us on irc.gnome.org.

Progress Report – Jan 10, 2011

I thought I’d give a quick progress report after taking a long three day weekend. While my break was nice, I came back to quite a few tickets and emails to take care of. I think I’ve managed to get ahead of everything again. We’ll see how long that holds up this week!

This morning was a lot of Bugzilla/RT queue management: account creations, mailing list creation, migrating a project from Google Code to GNOME, etc. Nothing terribly exciting. I’ll also need to remember not to let the moderation queue go that long again. Usually I attend to it once daily; after nearly four days it took quite some time to moderate everything in all the queues!

One nice thing I did manage to finish today was the addition of more monitors in Nagios. We had to wait for a firewall exception at the Red Hat data center, but I’m now able to remotely monitor much, much more on a large number of servers. Today, for starters, I added a monitor for load averages, and I was able to fix the monitors for a few MySQL server checks. I’ll continue to add more until I feel all the critical bits are covered.
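
To give a flavour of what one of these checks looks like, here’s a toy load-average check in the style of a Nagios plugin. The thresholds are invented and this isn’t the plugin we actually deployed; it just shows the standard exit-code convention such checks follow.

```python
#!/usr/bin/env python
# Toy Nagios-style plugin: report the load averages and exit with the
# standard Nagios status codes (0=OK, 1=WARNING, 2=CRITICAL).
import os
import sys

WARN, CRIT = 4.0, 8.0  # illustrative thresholds

one, five, fifteen = os.getloadavg()

status, label = 0, "OK"
if five >= CRIT:
    status, label = 2, "CRITICAL"
elif five >= WARN:
    status, label = 1, "WARNING"

print(f"LOAD {label} - load average: {one:.2f}, {five:.2f}, {fifteen:.2f}")
sys.exit(status)
```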

What else is on the list for this week?

  • Submit Sysadmin hackfest proposal to take place at Scale 9x in Feb.
  • Document all allotted IPs and corresponding hostnames.
  • Attend to the nearly full filesystem on window.
  • Re-address building the RHEL 6 VM for the wiki migration.

Let’s hope we can get all of this done this week. Fingers crossed.

Monitoring and Security Improvements

Here I am sneaking in some last minute updates before the end of the year. I know I just posted an update a few days ago, but I’ve implemented some additional improvements since then that I wanted to share.

First of all, and I think this is pretty cool, we’ve implemented HTTP Strict Transport Security for all GNOME domains that require SSL. For the end-user this primarily means bugzilla and tomboy-online. If you’re not familiar with HTTP Strict Transport Security, check out the spec here. Essentially, the web server sends a specific header, ‘Strict-Transport-Security’, to the client browser, including a TTL. Supported browsers (currently Chrome and Firefox 4) store this value and, for as long as the TTL is valid, connect directly over https on subsequent visits. This means that once you’ve visited bugzilla over SSL, any future connection inside that TTL window will go straight to the SSL site and never touch the non-SSL one. (Normally, connections not explicitly made to https are redirected from http to https.)
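
If you’d like to see the header for yourself, a quick probe along these lines (a sketch using Python’s standard library; the exact max-age we chose isn’t shown here) will print whatever Strict-Transport-Security value a server sends:

```python
#!/usr/bin/env python
# Sketch: fetch the response headers from an HTTPS site and print the
# Strict-Transport-Security header, if the server sends one.
import http.client

conn = http.client.HTTPSConnection("bugzilla.gnome.org", timeout=10)
conn.request("HEAD", "/")
response = conn.getresponse()
hsts = response.getheader("Strict-Transport-Security")
conn.close()

if hsts:
    # Typically something like "max-age=604800" -- the TTL in seconds.
    print(f"HSTS header: {hsts}")
else:
    print("No Strict-Transport-Security header found.")
```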

I think this is a pretty cool security addition, and closes another sysadmin ticket!

Second, my main project for the day has been our monitoring system. It got some attention today by way of adding a few missed hosts as well as some remote checks for services. We’re now tracking the health of most of our MySQL servers and watching a few new values such as load average. I’ve also added checks to monitor the status of our SSL certificates; these will start alerting us when a certificate is within two weeks of expiration.
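
For the curious, the essence of such a certificate check looks something like the sketch below (a simplified stand-in, not our actual Nagios plugin): connect over TLS, read the certificate’s expiry date, and complain when it’s less than two weeks away.

```python
#!/usr/bin/env python
# Simplified sketch of an expiry check: connect over TLS, read the
# certificate's notAfter date, and warn within two weeks of expiry.
import socket
import ssl
from datetime import datetime

def days_until_expiry(host: str, port: int = 443) -> int:
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter looks like "Jan 10 12:00:00 2026 GMT"
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (expires - datetime.utcnow()).days

if __name__ == "__main__":
    remaining = days_until_expiry("mail.gnome.org")
    if remaining < 14:
        print(f"WARNING: certificate expires in {remaining} days")
    else:
        print(f"OK: certificate expires in {remaining} days")
```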

Ohh, I also added a certificate to mail.gnome.org, which previously didn’t have one. This now allows (but doesn’t yet require) secure connections to the mailing lists, archives and administration pages. I felt this was important given that credentials are passed between client and server for mailman list administration and queue management, which is another task the Moderators team and I handle daily.

I think we’ve made some good progress this week and hopefully it sets a good precedent for the new year.

Christer

December Update 2010

Just before the end of the year I thought I’d give one last status update on what the team and I have been working on. As you may imagine, things have been slow due to the holiday season, but we’re still here and still keeping the gears moving.

The most recent success has been improving and fixing our monitoring solution, Nagios. Just last night I finally implemented SSL properly for Nagios administration logins and set up redirects from the old URLs. Currently Nagios is admin-only, but I am considering a public view so that the GNOME community can get a glance at what we monitor and check the status of services and hosts.

Another task we’re working on is preparing to migrate some hosts to RHEL 6. We’ve got the RHEL 6 images imported into our build system, but unfortunately we’re stuck on a networking issue in the automated installation. I think once the whole team is back from holiday we’ll get it figured out and have some hosts built. The first boxes on the list for RHEL 6 are the wiki, snowy and blogs.gnome.org.

We recently updated the mail server filter to remove the SORBS RBL. They changed their policies and added some questionable address ranges, which caused us problems, so it has been replaced with a different RBL.
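
For anyone wondering how an RBL (DNS blacklist) check actually works: the mail server reverses the octets of the connecting IP, prefixes them to the RBL’s DNS zone, and treats a successful A-record lookup as a listing. A minimal sketch, with a placeholder zone rather than the list we actually switched to:

```python
#!/usr/bin/env python
# Sketch of a DNSBL lookup: reverse the IP's octets, append the RBL
# zone, and interpret a successful A-record lookup as "listed".
import socket

RBL_ZONE = "rbl.example.org"  # placeholder zone, not our actual RBL

def is_listed(ip: str, zone: str = RBL_ZONE) -> bool:
    reversed_ip = ".".join(reversed(ip.split(".")))
    query = f"{reversed_ip}.{zone}"
    try:
        socket.gethostbyname(query)  # an answer (e.g. 127.0.0.x) means listed
        return True
    except socket.gaierror:
        return False

if __name__ == "__main__":
    print(is_listed("192.0.2.99"))  # documentation address, for illustration
```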

We’re also looking to consolidate the GNOME related domains into a single registrar. We’re shopping around for a dependable, free software friendly registrar. If you have any suggestions, please comment or contact the team. We’re very interested in input regarding where our domains can call home.

Beyond this I’ve mainly done general administration and maintenance: mailing list queue moderation, account updates and creation, taking care of a corrupt table in the piwik database… you know, the general day-to-day stuff. As of today I’ve clocked about 33 hours for the month. I hope, during this final stretch before the new year, to add ten more hours to that number and really tackle some more bugs.

As usual, if you have any questions or comments for the team please let us know. We’re happy to help; we just ask that you communicate and follow up with us about any issues you have.

Christer

We run the servers that run GNOME.