OFF: Drives and backups (was: Re: +++stop press+++advertising aid item french hassan+++one man isolator+++)
Paul Mather
paul at GROMIT.DLIB.VT.EDU
Mon Aug 25 12:23:26 EDT 2008
On 25 Aug 2008, at 10:52 AM, Arjan Hulsebos wrote:
> On Mon, 25 Aug 2008 09:31:24 -0500, Carl Edlund Anderson wrote
>> Well, up until now, I use the incredibly dumb and brute force method
>
> Out of curiosity, has anyone gone through the drill of restoring their
> data from a backup? We once backed up a fileserver, swapped its disk
> for a fresh one, and did a restore. The process seemed to be
> successful, only to find out at the next boot that mere random
> patterns had been written to the disk. The only way to get a usable
> disk again was a low-level format....
>
> Not to scare you in any way, but if you're _not_ using Carl's brute
> force method, do go through the drill. It's worth it.
I've restored from backups several times, including, I'm happy to say
as a Mac user, a successful complete system restore from a Time
Machine backup. (I've also done many successful restores from Tivoli
TSM regular backups.)
Usually, when I'm archiving data to DVD, I'll include an MD5 checksum
of all the data on the disc along with the files themselves. Then, I
check the written data against that. It's no protection against the
DVD subsequently becoming unreadable, but at least it guards against
the backup-wrote-random-gibberish-instead-of-the-data-you'd-intended
scenario you mention above. All decent backup software will let you
verify a backup, too. DDS DAT drives, which I've used in the past
for backups, feature read-after-write tape heads, which verify data
as it is written.
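If you want to roll your own checksum manifest, here's a minimal
sketch of the idea in Python (my own illustration; the "MD5SUMS"
manifest name and the function names are just placeholder
assumptions):

    import hashlib
    import os

    def md5sum(path, bufsize=1 << 20):
        # Compute the MD5 digest of a file, reading in chunks so
        # large files don't have to fit in memory.
        h = hashlib.md5()
        with open(path, "rb") as f:
            while True:
                chunk = f.read(bufsize)
                if not chunk:
                    break
                h.update(chunk)
        return h.hexdigest()

    def write_manifest(root, manifest="MD5SUMS"):
        # Record a digest for every file under root, one per line.
        with open(manifest, "w") as out:
            for dirpath, _, filenames in os.walk(root):
                for name in sorted(filenames):
                    path = os.path.join(dirpath, name)
                    out.write("%s  %s\n" % (md5sum(path), path))

    def verify_manifest(manifest="MD5SUMS"):
        # Re-read every file named in the manifest and compare
        # digests against what was recorded.
        ok = True
        with open(manifest) as f:
            for line in f:
                digest, path = line.rstrip("\n").split("  ", 1)
                if md5sum(path) != digest:
                    print("MISMATCH: %s" % path)
                    ok = False
        return ok

The idea: run write_manifest over the staging directory, burn the
MD5SUMS file along with the data, then mount the finished disc and
run verify_manifest against the mounted copy.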
One of the things to remember about hard drives is that though they
have automatic bad sector reallocation, it is only triggered on a
write. So, if you have data sitting on a sector that subsequently
goes bad, there's nothing the drive can do about it. That's where
redundant schemes like RAID come in: the array can try to recover
the data automatically from the other drives, or from parity
information. Alas, this is also where you can discover that other
drives in the RAID, say the same model from the same manufacturer or
batch, have failed in similar fashion, and the multiple failures
cause the RAID itself to fail.
Because of this cluster failure phenomenon, enterprise-level RAID
controllers will usually have an option to periodically "police" the
entire surface of all attached drives, making sure the data are
still readable.
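You can approximate that patrol in userland. Here's a rough Python
sketch (my own illustration, not any controller's actual mechanism)
that just reads every file end-to-end and reports the ones the drive
can no longer return:

    import os

    def patrol_read(root):
        # Read every file under root in full. This forces the drive
        # to fetch each sector holding live file data, so latent
        # read errors surface now rather than during a RAID rebuild.
        bad = []
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        while f.read(1 << 20):
                            pass
                except (IOError, OSError) as e:
                    print("READ ERROR: %s (%s)" % (path, e))
                    bad.append(path)
        return bad

The obvious limitation: it only touches sectors holding file data,
not free space or filesystem metadata, which is why the real thing
lives in the controller (or in the filesystem, as below).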
Arjan, you might want to look into using the ZFS filesystem for your
fileservers. One of its main design features is that it tries not to
trust data coming from the various subsystems unless it can verify
them, so it employs checksumming and redundancy at several levels.
It tries to be proactive about data integrity, too: it has a "scrub"
function that tries to discover bad sectors and, in a redundant
configuration, automatically resilvers when bad data are discovered.
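For example (assuming a pool named "tank"; substitute your own pool
name), you'd kick off and monitor a scrub with:

    zpool scrub tank
    zpool status tank

"zpool status" reports the scrub's progress and any checksum errors
found and repaired along the way.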
ZFS is supported by Solaris and FreeBSD 7+, to give two examples.
(Mac OS X 10.5 Leopard features read-only ZFS support and a read-write
ZFS kernel module via the Developer Zone. Hopefully, Time Machine in
10.6 will use ZFS as its underlying file system...)
Cheers,
Paul.
e-mail: paul at gromit.dlib.vt.edu
"Without music to decorate it, time is just a bunch of boring production
deadlines or dates by which bills must be paid."
--- Frank Vincent Zappa