Maintenance Downtime 2010-10-06: Report

This morning we had scheduled maintenance on all GNOME servers, which caused rolling outages of services. All servers should now have all the latest security errata applied, and all services should be available. In the interest of transparency, below you’ll find an outline of our maintenance and any issues we had:

Task:

  1. Reboot all servers to apply the latest kernel updates and ensure all other errata were applied cleanly.

Issues / Lessons Learned:

  1. We were reminded that our LDAP server requires manual intervention when rebooting. The machine needs hardware attention or replacement, but it is no longer covered under any support contract. In the future Owen will need to manually bring the machine back up via console/KVM access.
  2. When rebooting servers, the LDAP server and NFS server should be last. Both host critical services that the other servers depend on.
  3. The server that hosts the translations website (l10n.gnome.org) has problems starting its services on boot, so they must be started manually.
  4. The server hosting bugzilla and git had problems coming back online. This requires more investigation, and it is unknown whether it’ll be a consistent problem.

Our maintenance ran past the originally announced window because of some of the unexpected issues above, but now we’re aware of them and can prepare for them in the future. We appreciate your patience while we brought everything back online.

Planned Improvements

Based on the above, here are some things we will do to improve and streamline future maintenance windows:

  1. Ensure Owen is available and ready to bring the LDAP server back online if/when rebooted.
  2. Communicate the downtime schedule on the gnome-infrastructure list as well as the devel-announce-list. We will aim for 48 hours’ notice as well as a reminder just before the outage begins.
  3. Before the next maintenance window we will address issue #3 above, so that services no longer need to be started manually after a reboot.
  4. Our standard operating procedure for rebooting servers will be updated to include a priority list and dependency list (reboot order).
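
As a rough illustration of the reboot-order idea in item 4, here is a minimal sketch that derives an order from a dependency list. The host labels and the dependency mapping are assumptions made up for this example (they are not our actual inventory), and it uses Python's standard graphlib module rather than any tool we have in place.

```python
from graphlib import TopologicalSorter

# Hypothetical host labels; each entry maps a host to the hosts it depends on.
DEPENDS_ON = {
    "l10n": {"ldap", "nfs"},
    "bugzilla-git": {"ldap", "nfs"},
    "ldap": set(),
    "nfs": set(),
}


def reboot_order(depends_on):
    """Return hosts so that each host is rebooted before the hosts it depends on."""
    # static_order() lists dependencies first; reversing it puts them last,
    # which matches the "LDAP and NFS last" lesson above.
    dependencies_first = list(TopologicalSorter(depends_on).static_order())
    return dependencies_first[::-1]


order = reboot_order(DEPENDS_ON)
print(order)
```

A circular dependency would make TopologicalSorter raise graphlib.CycleError, which is probably the behaviour we would want from a checklist generator anyway.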

Again, thank you for your patience during our maintenance. This all goes towards a better and more mature infrastructure.

As usual, if you have any concerns, questions, or praise to share with us, feel free to drop by #sysadmin and let us know.

Christer

Infrastructure Downtime: 2010-10-06 10:00am EDT – 11:00am EDT

The GNOME Infrastructure Team is planning regular maintenance for Wed 2010-10-06 at 10:00am EDT (UTC -4). This will include brief downtime for all major services while security errata are applied.

Please be sure to finish any work and log out of any servers before that time.

The expected maintenance window is 10:00am – 11:00am (1hr).
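
For anyone outside the US Eastern timezone, here is a small sketch of converting the window to UTC. It assumes only the UTC -4 offset stated above and uses Python's standard datetime module; the variable names are just for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset taken from the announcement (UTC -4).
ANNOUNCED_OFFSET = timezone(timedelta(hours=-4))

start = datetime(2010, 10, 6, 10, 0, tzinfo=ANNOUNCED_OFFSET)
end = start + timedelta(hours=1)  # expected window is one hour

start_utc = start.astimezone(timezone.utc)
end_utc = end.astimezone(timezone.utc)
print(start_utc.isoformat(), "to", end_utc.isoformat())
```

Under that assumption the window is 14:00 to 15:00 UTC.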

If you have any questions or concerns, please contact us in #sysadmin on irc.gnome.org.

Hello World!

I don’t even know where to start! There have been a lot of changes and improvements going on within the GNOME Sysadmin Team lately, and I thought I should share a few of the things we’ve been working on, as well as some of the changes we have planned.

As Olav mentioned in our inaugural post, I’ve been hired as the GNOME part-time Sysadmin. This role includes a number of responsibilities, many of which will be transparent (and/or boring) to most of you. A few will affect the community, though, and that feels like a good place to begin.

First of all, we’re going to be publishing our progress and changes regularly on this blog. This will include scheduled maintenance on services (reboots, downtime, etc). Our plan is to schedule maintenance windows, and give the community adequate notice before we take down any services. We haven’t forgotten that our job is to make sure you can do your job, and improved communication is a key part of that.

Second, we’ve spent the past two weeks focusing on trying to clean up our rough edges and establish our baseline for the future. Part of this is defining where our priorities lie, and which projects are the most important. While we have an existing list of pending tickets in bugzilla, we’d also like to hear from the rest of you on what you’d like to see done. I can’t promise that we’ll be able to please all of you all the time, but we also can’t address your issues if we don’t know what they are! In this regard, if you have any issues we need to know about, please stop by bugzilla and file a ticket.

Beyond that, please feel free to contact me directly with any concerns or ideas you might have. I’m happy to discuss them with you.

Christer

Mail.gnome.org statistics

In case you haven’t heard, the GNOME Foundation hired a system administrator, Christer Edwards.

Christer was already a volunteer GNOME sysadmin, so he already knows a lot about the GNOME infrastructure. He has fixed various things already, but I’ll leave it up to him to blog about that. The one thing I really like is that he cleaned up the Logwatch output for the various hosts that GNOME has. After that I asked him to clean up the Logwatch output for menubar (mail.gnome.org), which was 3.5MB, and he did :).

So now finally we can easily see some data for mail.gnome.org for Monday September 20:

Postfix

      238   *Warning: Connection concurrency limit reached
        1   SASL authentication failed
       29   Miscellaneous warnings

  520.825M  Bytes accepted                       546,124,884
    1.675G  Bytes delivered                    1,798,214,094
 ========   ================================================

    63595   Accepted                                  19.73%
   258725   Rejected                                  80.27%
 --------   ------------------------------------------------
   322320   Total                                    100.00%
 ========   ================================================

      757   Reject relay denied                        0.29%
     7655   Reject HELO/EHLO                           2.96%
   196808   Reject unknown user                       76.07%
    13564   Reject recipient address                   5.24%
     1320   Reject sender address                      0.51%
      318   Reject client host                         0.12%
    37547   Reject RBL                                14.51%
      756   Reject header                              0.29%
 --------   ------------------------------------------------
   258725   Total Rejects                            100.00%
 ========   ================================================

     3690   4xx Reject recipient address              21.26%
    13667   4xx Reject sender address                 78.74%
 --------   ------------------------------------------------
    17357   Total 4xx Rejects                        100.00%
 ========   ================================================

   185662   Connections made
    89877   Connections lost
   185650   Disconnections
    60230   Removed from queue
     1854   Delivered
   135819   Sent via SMTP
     4809   Forwarded
       45   Resent
     4317   Deferred
   140470   Deferrals
     2050   Bounce (local)
     2271   Bounce (remote)
      356   Expired and returned to sender
        1   DSNs delivered
     2622   DSNs undeliverable

     9055   Connection failure (outbound)
     1870   Timeout (inbound)
    11557   Illegal address syntax in SMTP command
       13   Numeric hostname
       45   SMTP commands dialog error
     4629   Excessive errors in SMTP commands dialog
    40629   Hostname verification errors
       27   Hostname validation error
       23   Enabled PIX workaround
        7   SASL authenticated messages

Amavisd-new

    21374   Clean passed                              90.02%
      121   Spam passed                                0.51%
      121   Bad header passed                          0.51%
       16   Malware blocked                            0.07%
     2111   Spam blocked                               8.89%
        1   Banned file name blocked                   0.00%
 --------   ------------------------------------------------
    23744   Total Messages Scanned                   100.00%
 ========   ================================================

    21495   Ham                                       90.53%
     2232   Spam                                       9.40%
 --------   ------------------------------------------------
    23744   Total Messages Scanned                   100.00%
 ========   ================================================

        2   MIME error
     2458   Extra code modules loaded at runtime

Clamav

 Viruses detected:
    HTML.Phishing.Bank-1259: 2 Time(s)
    HTML.Phishing.Bank-593: 1 Time(s)
    W32.Sality.Q-1: 2 Time(s)
    Worm.Mydoom.I: 9 Time(s)
    Worm.Mydoom.M: 5 Time(s)

Note that mail.gnome.org is the mailhub for GNOME. All outgoing mail (mailing lists, bugmail, etc.) and incoming mail (spammers, spammers, spammers and some minor valid mail) for all machines is handled by mail.gnome.org. From the logs you can easily see that we get regular distributed dictionary attacks (the high number of unknown user rejections), plus the greylisting that was deployed (also done by Christer).
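
To make the dictionary-attack point concrete, here is a quick sketch that recomputes a few of the percentages from the Postfix counts above. The numbers are copied from the Logwatch output; the pct helper and variable names are just for this example.

```python
# Counts copied from the Postfix section of the Logwatch report above.
accepted = 63595
rejected = 258725
reject_unknown_user = 196808
reject_rbl = 37547


def pct(part, whole):
    """Percentage of whole represented by part, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


total = accepted + rejected
print("accepted:", pct(accepted, total))
print("rejected:", pct(rejected, total))
print("unknown user share of rejects:", pct(reject_unknown_user, rejected))
print("RBL share of rejects:", pct(reject_rbl, rejected))
```

Roughly three out of four rejections are for unknown users, which is what you would expect when someone is guessing addresses.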

PS: As you noticed, there is now a GNOME sysadmin blog. It is syndicated at http://news.gnome.org/.
