So last week my big storage box started acting up. Random resets, dropping a drive; all in all, not good.
So let me give you a quick rundown of this storage box. I am running FreeNAS. I have a total of 11 drives currently. 9 of these are 2TB drives, configured as 3 RAID 5 style groups (raidz vdevs, in ZFS terms) and then striped across, giving me a total of 9.63TB of storage with redundancy. There is also a small OS drive and a 128GB SSD just for cache. I store everything here: all my video and photo work, my media collection. My ESXi environment mounts iSCSI off this thing. So it’s pretty critical to my geek life.
I did all sorts of testing. Flashed the OS drive. Replaced the OS drive. No matter what I did: 4 minutes of uptime, kernel panic and reboot.
So I ordered new parts which arrived yesterday. I take the system out of the rack, put it on the table, open it up…. found the problem…
Ouch. A small fire in my server.
So things went from bad to worse. Shortly after finding and fixing this, I reinstalled the OS and brought everything up for a 24 hour burn-in. This worked. Ok good, let’s go back to the SD card for the OS. Fresh install, 24 hour burn-in. Let’s go!!
12 hours in. System reboots. Doesn’t come online… No prob, I’ll fix it when I get home…….. (do you see the foreshadowing here? ’cause I didn’t)
I get home, not booting right… Ok, reinstall the OS…. nope. Ok, maybe the SD card is bad. Back to the SSD. Nope..
Clean OS. No auto-import. Everything is fine…. import the ZFS volume…. kernel panic. Dead..
Time to research. Ok so from the inter-webs my prognosis is “screwed, data gone.”
Apparently desktop (non-ECC) memory and ZFS are to blame here. It’s not like I wasn’t trying to keep my data safe. I had 3 raidz vdevs in a ZFS pool.
So after contemplating all my poor poor data I decided to try to recover it.
Disk scans (SpinRite for 36 hours) = nothing
zdb scan (multiple hours, but it kept crashing because it ran out of swap) = nothing
OpenIndiana live cd = nothing
Finally I found a post where someone talked about trying to force-import the volume as read-only. I figured, “hey, I’ve already spent 4 days trying to recover, why not.”
So I boot up FreeNAS, get on the console and type:
zpool import -f -o readonly=on -R /mnt vol
It didn’t kernel panic….. wait, what?!
Holy $%^&* IT MOUNTED!!! I’m jumping through directories all giddy that my data may still be intact. But read-only isn’t going to do me much good. Need drives!!!!
I don’t have 10TB of external drives…. AJ!!!!
So I go to my buddy’s and steal all his externals. I plug them all in at once and start the very very very slow copy. After 5 days of copying to externals I was finally able to rebuild and start putting my data back.
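For anyone in the same spot, here is a rough sketch of that rescue as a script. It only prints the commands unless you set DRY_RUN=0. The pool name `vol` and the `/mnt` root come from the command above; the destination `/mnt/rescue` is made up for illustration, and rsync is an assumption, just one reasonable way to do the copy.

```shell
#!/bin/sh
# Sketch of a read-only rescue import followed by a copy to other disks.
# By default every command is only printed; set DRY_RUN=0 to really run it.
# POOL and ALTROOT match the command above. DEST is a hypothetical path.
POOL="${POOL:-vol}"
ALTROOT="${ALTROOT:-/mnt}"
DEST="${DEST:-/mnt/rescue}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command, and execute it only when DRY_RUN=0.
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

rescue_import() {
    # -f              force the import even if the pool looks in use
    # -o readonly=on  import the pool read-only
    # -R /mnt         mount everything under an alternate root
    run zpool import -f -o readonly=on -R "$ALTROOT" "$POOL"
}

rescue_copy() {
    # Assumption: rsync as the copy tool (not necessarily what was used here).
    # -a keeps permissions and timestamps, -v lists files as they are copied.
    run rsync -av "$ALTROOT/$POOL/" "$DEST/"
}
```

Calling `rescue_import` and then `rescue_copy` prints the two commands in order, so you can eyeball them before letting anything touch the pool.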
So now the lessons learned:
- Regularly check that your offsite backups are working
- Build a secondary NAS for snapshot backups (this box will eventually be at AJ’s since we have a VPN between our places)
- Identify what is replaceable and what isn’t, and dump the irreplaceable stuff somewhere else too.
This was a long process but it’s coming to a close. I will be doing snapshots of critical data to a secondary FreeNAS box. Once the initial snapshot is done, I will take the box to AJ’s and the snapshots will continue to back up there.
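The snapshot plan can be sketched with plain ZFS commands: one full send to seed the second box, then incremental sends after that. FreeNAS can also schedule snapshots and replication from its own interface, so treat this as an illustration of the idea rather than the actual setup. The dataset `vol/critical`, the target `backup/critical` and the host `backup-nas` are hypothetical names, and commands are only printed unless DRY_RUN=0.

```shell
#!/bin/sh
# Sketch of snapshot replication to a second box over ssh.
# All names below are hypothetical placeholders, not the real layout.
SRC="${SRC:-vol/critical}"
DST="${DST:-backup/critical}"
REMOTE="${REMOTE:-backup-nas}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command line, and execute it only when DRY_RUN=0.
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then
        sh -c "$*"
    fi
}

snapshot() {
    # $1 = snapshot name, for example a date stamp.
    # -r also snapshots every child dataset.
    run "zfs snapshot -r $SRC@$1"
}

send_full() {
    # $1 = snapshot to seed the remote box with.
    # -R sends the dataset and its children as one replication stream.
    run "zfs send -R $SRC@$1 | ssh $REMOTE zfs receive -F $DST"
}

send_incremental() {
    # $1 = snapshot the remote already has, $2 = the new snapshot.
    # -i sends only the changes between the two.
    run "zfs send -R -i $SRC@$1 $SRC@$2 | ssh $REMOTE zfs receive -F $DST"
}
```

The full send is the slow one, which is why it makes sense to do it while both boxes sit on the same network; the incremental sends are what would go over the VPN afterwards.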