I’ve been using BackupPC to take offsite backups of all my machines over the network for over a year. It seemed to work well enough, and I believed it would always email me if it hadn’t been able to back up a machine for a few days.
Yesterday I discovered that it has not done a successful backup of one of my machines since March! I just suddenly noticed on the status screen that instead of a table of 8 backups (2 full and 6 incr), only 3 were shown — 2 full, both dating back to March, and 1 “partial” from the day before yesterday. Looking at the logs I see this:
2006-07-24 06:00:05 full backup started for directory /data/work; updating partial 678
2006-07-24 06:20:28 full backup started for directory /home; updating partial 678
2006-07-24 06:20:34 Got fatal error during xfer (fileListReceive failed)
2006-07-24 06:20:39 Backup aborted (fileListReceive failed)
2006-07-24 06:20:39 Saved partial dump 678
Exactly the same thing has been happening every day for the past 4 months. BackupPC didn’t email to tell me. Its email system was definitely working, because during that time it did mail me about a machine that was offline for a while. So it appears it doesn’t bother to send mail to notify you of a failed backup!
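Until the notification side is sorted out, an independent check on the log is one possible stopgap. The following is only a rough sketch: `check_backup_log` is a made-up helper name, the log path is something you would have to supply for your own install, and the two patterns are simply the failure messages from the log excerpt above, not an exhaustive list of what BackupPC can write.

```shell
# Hypothetical helper: scan a backup log for the failure messages seen above.
# Prints any matching lines. Returns 1 if failures were found, 0 if none,
# and 2 if the log could not be read at all.
check_backup_log() {
    logfile="$1"
    if [ ! -r "$logfile" ]; then
        echo "cannot read log: $logfile" >&2
        return 2
    fi
    # These two patterns are taken from the log excerpt; others may exist.
    if grep -E 'Backup aborted|Got fatal error' "$logfile"; then
        return 1
    fi
    return 0
}
```

Run from cron against the relevant host log, the matching lines become the job's output, which cron normally mails to the owner of the crontab.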
I had no idea what might be causing this. It just started out of the blue, having worked flawlessly before March. It only affected one machine. The configuration had not changed. It always failed on /home but was apparently ok with /data/work.
Something weird in /home? To find out, I set tar loose on it:
$ cd /home; tar cf - . >/dev/null
tar: ./jammin/.gxine/socket: socket ignored
tar: ./jammin/.kde/kdeinit-\:0: socket ignored
tar: ./sarah/.totem.sarah: socket ignored
tar: ./sarah/.xine/session.0: socket ignored
Surely not. Surely it couldn’t be something as trivial as a couple of stale socket files causing my backup to fail? Well, I’m not using any of those programs, so I deleted the sockets and told BackupPC to start a full backup. What do you know? It worked.
So is it that it doesn’t like sockets? Or has the poor thing got confused by the funny characters in the filename of that KDE one? I’ll test this out at a later date when my backups have recovered.
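Rather than reading through tar’s warnings, `find` can list such files directly. This is a minimal sketch, assuming the culprits are special files: `find_special_files` is just an illustrative name, and including named pipes alongside sockets is my own guess at what else might be worth checking, not something I have confirmed upsets BackupPC.

```shell
# Hypothetical helper: list special files under a directory (default: current).
# -type s matches Unix domain sockets, -type p matches named pipes (FIFOs).
# Including FIFOs is an assumption; only sockets showed up in the tar output.
find_special_files() {
    dir="${1:-.}"
    find "$dir" \( -type s -o -type p \) -print
}
```

Calling `find_special_files /home` prints one path per line, so the result can be reviewed before anything is deleted.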
There are three major failings by BackupPC here. The first is failing over a simple socket or dodgy filename, and not giving much clue why. The second is not bothering to email when a backup fails halfway through. But most concerning of all is that it kept trying to add to the same partial backup instead of starting a new one, so I no longer have 2 weeks’ worth of incrementals even for the part of the backup that succeeded. Every day, yesterday’s backup was being overwritten by today’s. If I needed to recover a version of a file in /data/work from 2 days ago, I couldn’t. That sucks.
This has made me realise something. Quis custodiet ipsos custodes? Why am I relying on *one* backup solution? It’s a SPOF, and it has quite spectacularly F’d. I still want to find out why, and ideally fix it, but I’m also going to start setting up something else alongside. Since BackupPC is server-driven, the alternative should be client-driven. All recommendations welcome. The two major requirements are that it must support ssh, and be bandwidth-efficient because I’m backing up over ADSL.
All the client machines run Linux.