Afternoonus horribilis

August 16, 2007

So yesterday evening I was hacking on ekiga, compiling it all over again. For some reason the load began to rise, and the box soon became unresponsive… something it had already done a few days ago. Sigh. That one had ended with an unclean reboot.

This time too, I ended up rebooting the box with the magic SysRq keys, asking for a sync first, of course. But on reboot… the BIOS came up fine, and then grub stopped with a sinister error 16.

Since it’s not the only computer at home, I could look up what that means: something is wrong with the filesystem. Ouch.

Easy escape: since I’m in the process of choosing a new laptop to buy, I had very recently burned a couple of live CDs (Ubuntu and Mandriva). So I tried to boot from them.

Ha! That assumed the flaky CD burner, which had successfully burned them just a few days before, was still able to work. It wasn’t.

So I found myself with a box unable to boot from the hard disk, and unable to boot from CD. Things got interesting at that point.

I ended up peeking in the BIOS, trying to find out what that thing could boot from (I have an external USB hard disk). Ah! It is supposed to boot from the network too.

It took some time to dig out an appropriate ethernet cable to hook the laptop to another box. I installed a DHCP server and a TFTP server, fought for a long time to make things work, and finally the laptop booted debian’s netinstall. Yeah!

Unfortunately, debian’s netinstall image has everything needed to configure network interfaces, but nothing for hard disk access. Ouch.

Well, that’s easy: just get another kernel + initrd.img, then reboot. Yeah again!

Since grub had issues booting, I went straight to /boot/grub/ to see what it looked like. All I got was a list of the files in there, each with “permission denied” next to it. That explained a lot.

One little mv later, I had no more buggy /boot/grub/ to hinder me. But since I highly doubted that would be enough, I decided to go for reconstructive surgery: I still had grub’s .deb in /var/cache/apt/archives/, so with little effort (ar & tar) I could rebuild the directory, except for menu.lst and the system map. That had to be enough.
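
For anyone without ar and tar at hand, the same extraction can be sketched in Python. This is only an illustration under assumptions, not the commands used that night: a .deb is an ar archive whose data.tar.* member holds the package files, and the function names below (iter_ar_members, read_deb_files) are made up for this example. The path prefix you pass in depends on where your grub package keeps its stage files.

```python
import io
import tarfile

AR_MAGIC = b"!<arch>\n"


def iter_ar_members(data):
    """Yield (name, payload) for each member of an ar archive held in memory."""
    if not data.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    offset = len(AR_MAGIC)
    while offset + 60 <= len(data):
        header = data[offset:offset + 60]
        if header[58:60] != b"`\n":
            raise ValueError("corrupt ar member header")
        # Name field: 16 bytes, space padded, sometimes with a trailing slash.
        name = header[0:16].decode("ascii").rstrip().rstrip("/")
        # Size field: 10 bytes, decimal, space padded.
        size = int(header[48:58].decode("ascii").strip())
        start = offset + 60
        yield name, data[start:start + size]
        # Member data is padded so the next header starts on an even offset.
        offset = start + size + (size % 2)


def read_deb_files(deb_bytes, prefix):
    """Return {path: content} for regular files under `prefix` in a .deb."""
    for name, payload in iter_ar_members(deb_bytes):
        if name.startswith("data.tar"):
            break
    else:
        raise ValueError("no data.tar.* member found")
    files = {}
    # "r:*" lets tarfile detect gzip, bzip2 or xz compression by itself.
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:*") as tar:
        for member in tar:
            path = member.name[2:] if member.name.startswith("./") else member.name
            if member.isfile() and path.startswith(prefix):
                files[path] = tar.extractfile(member).read()
    return files
```

To use it, read the cached package into memory with open(path, "rb").read(), call read_deb_files(deb_bytes, "usr/lib/grub/") (that prefix is an assumption, so check your own package), and write the returned contents into a fresh /boot/grub/.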

And it was: since it had no menu, grub gave me a command line. Typing qwerty on an azerty keyboard was a game after what I had already gone through, but I could finally boot into my poor system.

Both my system and home partitions had many, many errors, and needed repair.

The final steps for reviving the system were just playing with grub-install and update-grub, a little customization, and update-grub again.

I think reiserfsck did a good job repairing my partitions, but I’m still wondering:

  • what did I lose apart from /boot/grub? (perhaps I could have salvaged it if I had run the check from the debian image instead of after booting);
  • how come I lost this directory?! I don’t think it was in use when the problem occurred, so I’m bewildered at how it could be affected…

I think I’ll keep my backups even more up to date.

5 Responses to “Afternoonus horribilis”

  1. Snark Says:

    Hmmmm… I could probably indeed use ZFS, but doesn’t RAID-Z require several disks, something unlikely to be found on a laptop?

  2. Snark Says:

    Gasp… I removed the comment I was replying to… stupid me :-/

  3. Someone Says:

    the main advantage of ZFS for you is that it is an atomic transactional file system: it’s never in an inconsistent state, even after catastrophic os crashes. RAID-Z is useful to mirror your data over two (or more) disks, partitions or any block device (large files) in your cheap disk, so that if your disk goes crazy and scratches some part, you still have an automatic backup in a different part of the disk. and there’s always the ditto blocks in ZFS, also, which are automatic extra backup blocks of the important meta-information of the filesystem. there are many other interesting features, like being able to do snapshots of the whole disk at any time, much like versioning in SVN/CVS.

    more here: http://www.opensolaris.org/os/community/zfs/whatis/

    All operations are copy-on-write transactions, so the on-disk state is always valid. There is no need to fsck(1M) a ZFS file system, ever. Every block is checksummed to prevent silent data corruption, and the data is self-healing in replicated (mirrored or RAID) configurations. If one copy is damaged, ZFS detects it and uses another copy to repair it.

    ZFS introduces a new data replication model called RAID-Z. It is similar to RAID-5 but uses variable stripe width to eliminate the RAID-5 write hole (stripe corruption due to loss of power between data and parity updates). All RAID-Z writes are full-stripe writes. There’s no read-modify-write tax, no write hole, and — the best part — no need for NVRAM in hardware. ZFS loves cheap disks.

    But cheap disks can fail, so ZFS provides disk scrubbing. Like ECC memory scrubbing, the idea is to read all data to detect latent errors while they’re still correctable. A scrub traverses the entire storage pool to read every copy of every block, validate it against its 256-bit checksum, and repair it if necessary. All this happens while the storage pool is live and in use.

  4. Someone Says:

    oh, btw, I usually see this unresponsiveness under high loads in linux when the swap partition is too small or is not mounted at all.

  5. Snark Says:

    Eh, I’m sold on ZFS — but without RAID, since I don’t plan to put several disks in a laptop 😉

