OT: Calling MGoNerds/CompSci people - Heavy Duty Data Backup
Hello - this post will probably seem Greek to nearly everyone here but I'm hoping there's a few of you out there that work at Amazon AWS/Google GCE/Microsoft Azure or have a NAS/SAN at home.
The problem - I have a growing archive of Michigan Athletics video (4-5TB now). Eventually I can expand my NAS (Windows Server 2016 using ReFS - to hell with RAID card firmware bugs) to 32TB (8x8TB in RAID1) but I will run out of room at some point. I usually like to leave one spare 'dual bay' worth of free space open in the event I need to pull two hard drives and replace them with two larger ones. For example, if I had 8x4TB, 16TB useable, I'd like to leave 4TB free so I can swap out 2x4TB for 2x8TB + 6x4TB to give the NAS room to 'breathe' and grow.
The big problem I'm facing - cold storage. I'd ideally like to move older seasons onto cheap 2x2TB hard drives that are $40 apiece (far cheaper in the long run than Amazon Glacier). Each file saved on the disk would have a CHECKSUM (to detect bitrot) and then I'd have a program run a comparison to see if the CHECKSUMS between the two disks match for every file, or at least report that one of the files has a discrepancy compared to the other.
For the CompSci guys- apparently PAR3 uses Reed Solomon correction codes - is it worth it to generate these recovery records for a single 15-20GB file (this is about the average size of a football game)?
Does anyone know of an application (freeware or otherwise) that creates checksums of each file on a drive or directory, writes them to a text file, and then re-scans and compares the historical checksum that was written previously with the new checksum?
This is pretty OT but I'm leaving it up because
a) It's probably not that greek to most of you, and
b) I'd like to know the answer.
I definitely can't speak for the rest of the board, but I understood none of this post.
EDIT: correction, I do know what Michigan athletics videos are...
(I am one so this is ok. :P )
i don't know how good you are w/ coding, but the checksum thing sounds like a relatively simple task to accomplish using Python + pretty much any database.
this seems like a good place to start: https://stackoverflow.com/questions/3431825/generating-an-md5-checksum-…
I second the StackOverflow post, although I don't think Python or a database is needed. The script in the SO's top answer maps nicely to the author's request.
Bash scripting is either in Windows Server 2016 now or soon to be there.
And make sure you have a copy of the data somewhere else. Preferably on the other side of the planet (in case of a large meteor strike).
EDIT: Actually, the StackOverflow post I was looking at was the following
https://stackoverflow.com/questions/36920307/md5-all-files-in-a-directo…
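Those SO answers boil down to hashing in fixed-size chunks so a 15-20GB game file never has to fit in memory. A minimal Python sketch of that idea (the function name is my own, not from the SO post):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1MB chunks so multi-GB game files never load into RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Store the hex digest in a text file next to the video and you have exactly the "historical checksum" the OP described.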
That's a good SO post. I'm going to write something and put it on Github.
How do you load a replacement file if you intend to only maintain a single archival original file and the data is expected to corrupt slightly over time? Perhaps I'm missing the forest for the trees.
The 2x2TB external drives would have identical, mirrored data on them. I'd 'check' each drive every few years, validating the new checksums against the historical ones. If there was a mismatch, hopefully there wasn't bitrot on the other drive, and I can copy the file over.
First time around: connect the drive via a USB3 dock, generate checksums with the mythical application I described above, then pull the drive and let it sit on a shelf. Every few years I'd reconnect the drive and have the mythical application scan it, generate new checksums, and compare them against the historical checksums to detect bitrot.
Essentially - 'cold storage' RAID, but the two drives are independent of each other, just a mirror image in case bitrot occurs on one of the drives.
I 'cold storage' the old seasons (I can fit two seasons onto a 2TB drive) to move them off of the 'hot' NAS to free up space.
And I believe they would work in my context. But here's some advice I found
https://secure.clcbio.com/helpspot/index.php?pg=kb.page&id=181
The issue is that I have (between football/basketball/hockey) about 150-200 files on disk. I'd probably need to write an application that generates md5s for each file initially. Years later, I'd like the application to perform a scan to generate a new md5 and compare it against the old md5. If the new and old md5s match, great. If not, alert that the md5 on the file has changed. Hopefully the md5 on the 'mirror' drive is still the same.
Just verifying that in this plan, the storage of the hashes does not reside on the actual cold-stored drives. If the "known good" hash record gets ruined, the checks don't matter.
This is obvious, I know; but I have seen some really dumb mistakes screw up otherwise awesome plans. Didn't notice this detail listed elsewhere, so I decided to come say the obvious.
The md5sum utility from Linux can both create initial hashes and later verify those hashes.
$ md5sum -b * > archive_hashes.txt
$ md5sum -c archive_hashes.txt
The first line creates hashes for all named files and stores them in archive_hashes.txt. The second line reads hashes and file names from archive_hashes.txt and checks that the named files haven't changed.
Combine with the find utility or otherwise wrap in a bash script to recurse through your directory tree. If you are a Windows only person, I recommend installing cygwin to gain access to a mountain of Linux command line utilities including bash and md5sum.
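For Windows-only folks who'd rather not install cygwin, a rough Python equivalent of `md5sum -b * > archive_hashes.txt` that also recurses like `find` could look something like this (a sketch, not a drop-in replacement; the manifest mimics md5sum's "hash, two spaces, path" format):

```python
import hashlib
import os

def write_manifest(root, manifest_path):
    """Recurse through root and write one 'md5  relative/path' line per file.

    Keep manifest_path OUTSIDE the tree being hashed, or the manifest
    would be hashed while it is still being written.
    """
    with open(manifest_path, "w") as out:
        for dirpath, _dirs, files in os.walk(root):
            for name in sorted(files):
                full = os.path.join(dirpath, name)
                h = hashlib.md5()
                with open(full, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                out.write(f"{h.hexdigest()}  {os.path.relpath(full, root)}\n")
```

Run it once when the drive goes on the shelf, and diff a fresh manifest against the old one at each check-up.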
come with bashing tools?
You are thinking April 1 each year at the Diag for said hashing, which is beyond the scope of this thread.
OP: I thought your post was at least a 4-star (out of 5) on the OT scale. Quite interesting.
Unsolicited advice: I'd give some thought to the size of your archive and its ultimate purpose before going hog-wild on technology specs. (Perhaps you've already done this.) Yeah, storage isn't all that expensive, and you seem to know what you're doing, but are all five terabytes worth keeping?
It wouldn't be more than $50-60 a year to have a really high quality long term archive.
I could ask the Bentley Library people - 'hey...do you guys want this stuff?' but then we'd probably get into a conversation of where I got this video and yeah, that would probably be where the conversation ends.
I feel that solution is designed primarily for backups that increment in small amounts - the issue is I'm generating around 600-700GB new data a year.
The technical problem I'm having is finding a solution that programmatically analyzes the integrity of my 'cold storage' data. Sure, I could generate an MD5 for each football game. But I'd like an application to point at a drive and tell it, you should find the md5 of the file you scanned a few years ago in the same directory as the file itself. Re-scan the file - if the md5 matches, continue with the next file. If not, alert me and tell me that the md5 of this file has changed.
thanks, I'll check it out.
but check this blog see if anything works for you.
https://www.raymond.cc/blog/7-tools-verify-file-integrity-using-md5-sha…
I see some good software there
This will get an upvote from me every time... :)
If you want cheap freeware, just write a routine yourself using the "comp" command available in windows.
https://technet.microsoft.com/en-us/library/bb490883.aspx
This is a byte by byte comparison.
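If you'd rather stay in script-land than write batch files around comp, Python's standard library has the same byte-by-byte comparison built in; a tiny wrapper (the function name is mine) might be:

```python
import filecmp

def bytes_identical(path_a, path_b):
    """True only if the two files have identical contents."""
    # shallow=False forces filecmp to actually read and compare the bytes,
    # rather than trusting os.stat metadata (size/mtime)
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Handy for spot-checking that a file on drive 1 still matches its twin on drive 2, without involving checksums at all.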
There are many freeware or cheap software packages that do this too, but I have never used them. Buyer beware.
http://www.files32.com/Byte-by-byte-Comparison.asp
Aside from the checksums, have you considered optical media for archival storage? The new M-discs are supposed to last an extremely long time without degrading, and the cost isn't significantly different than blu ray. At least something to consider.
I read that even the big guys (Facebook/Google) were still using optical media and programmatically loading/unloading disks out of cold storage when the data was needed.
The issue though is that a 25-pack spindle is $61...That's $61 for 625GB of storage. I could get 4TB (2x2TB) for $80 - http://www.ebay.com/itm/162441343250?ssPageName=STRK:MESINDXX:IT&_trksi…
And the drives would basically spin for a day or two every two years, so it's not like there's a huge load being placed on them. And I'd back them up to Crashplan, which supports external drives, which are then disconnected.
Are you more concerned about cost or data preservation? I interpreted your post as being more concerned with the latter. You can burn multiple copies on M-discs and store them offsite in case the building burns down.
I think it's still cheaper for me to get 2x2TB drives, mirror the data to both, write an application that generates checksums, and give one drive to a friend. Every few years, I'll pick them up and scan them for any bitrot.
I was with you until you said you wanted to use a nascent, unproven Microsoft file system due to your fear of bugs in RAID card firmware. If there is any big software company out there right now who's entirely forgotten how to publish good quality code, it's MS.
It comes from the Windows Server group, not the operating system group.
Personally I'd take hardware RAID over software any day, but I work in IT. Maybe budget precludes buying a decent card/motherboard, but running software RAID, especially from M$ would make me nervous if I cared about the data.
I understood every word of this post, and I am so happy I decided not to take that Amazon interview and work in games instead.
...your flux capacitor? It might solve the issue.
So I work for GCE, although what you want is GCS (Google Cloud Storage). You should look at Coldline storage $.007 GB/Month. You get instant access to the data and all of this backup/maintenance is managed for you.
I guess the question is whether this is worth $35/Month
Thanks, I just looked up the calculator - https://cloud.google.com/storage/pricing
$35/month is a bit out of my price range, but perhaps I can convince Brian to have this as a business expense? I'd be ok with having the Hoke/Rodriguez years live in cold storage.
I think I'm still leaning towards my 2x2TB sitting on my shelf, having mirrored data, with MD5s for each file on each drive. Pulling the drives off the shelf every few years during the offseason and running a scan/comparison on each drive.
Oh, and uploading the data from the external to Crashplan (they let you connect an external, upload the data, and disconnect the external). Crashplan has been around awhile, who knows how long they'll last though - https://www.code42.com/about-code42/
I would assume that any small player in the backup space is going to die a slow death. The economics of storage are brutal.
If you are willing to do the work on your own storage media, that's the cheapest way to go. You run the risk of a single-source catastrophic event (e.g. house fire) -- but Michigan football video from the Hoke era is the least of your concerns in that event.
I tried to rally Brian to the idea of cloud web servers a while back -- I don't think IT is his favorite topic. I even used MGoBlog as my interview topic to get me a job at Google: it's a good case study for a site with spiky loads and weak underlying infrastructure.
I just read on r/datahoarder that Crashplan is shutting down, imminently.
But yeah, I have my requirements sheet written out for the application I'm going to write. It will scan a directory/drive, placing a .txt file with both the MD5/SHA256 of the file in the same directory. The GUI will have another button that will scan a directory looking for the MD5/SHA256 of each file, re-scan the file, and compare the results. Or, you can feed the application a file/folder full of .txt files that contain the filenames/directory structure of a scanned drive (in case the MD5/SHA256 files themselves get corrupted - it's text, I'll e-mail it to myself and place it in a hundred other places) and it will tell you the results.
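A bare-bones Python sketch of the verify half of that tool, assuming one `<filename>.md5.txt` sidecar per video (the sidecar naming and the function name are my assumptions, not a finished MGoArk design):

```python
import hashlib
import os

def verify_sidecars(root):
    """Re-hash every file that has a '<name>.md5.txt' sidecar next to it.

    Returns a list of (path, stored_md5, fresh_md5) tuples for mismatches;
    an empty list means no bitrot was detected.
    """
    problems = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".md5.txt"):
                continue  # don't hash the sidecars themselves
            sidecar = os.path.join(dirpath, name + ".md5.txt")
            if not os.path.exists(sidecar):
                continue  # file was never hashed; skip it
            with open(sidecar) as f:
                stored = f.read().strip()
            h = hashlib.md5()
            full = os.path.join(dirpath, name)
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            if h.hexdigest() != stored:
                problems.append((full, stored, h.hexdigest()))
    return problems
```

Run it against each of the two mirror drives; a file that mismatches on one drive but verifies on the other is exactly the "copy it over from the twin" case described above.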
I will call it MGoArk - I'll keep one of the hard drives, my MGoFriend (Hi Eric!) will get the other.
Two by two!
You could upload the MD5/SHA256 to a free tier of storage and compare against that. You can assume that the free tier is backed by triplicate storage with bit error correction, so it's more or less always a golden copy.
Alternatively, you can safely assume that no one is going to care about corrupted pixels of Sam McGuffie's umpteenth concussion. Or M00N. Or Shane Morris vs. Minnesota. Or anything else that happened between Jan 2007 and Aug 2015 (except Denard videos -- those should be in the cloud right now)
I don't know any of the tech stuff, but I do know that if all your backups and your backup's backups are in the same place, then a fire, flood or a break-in could be catastrophic. If it's super critical to not lose anything you might want to put Copy 2 in a separate location.
And I don't know the answer to this, but does Youtube compress/alter the original uploaded video? If not, you could upload to Youtube, then download at some future point using one of the many browser extensions available for that purpose.
Your post reminded me of an interesting (to me, at least) article I read in IEEE Spectrum about the movie and TV industries' problems with storage. For those interested, I recommend checking out: http://spectrum.ieee.org/computing/it/the-lost-picture-show-hollywood-a…
Money quote:
“There’s going to be a large dead period,” he told me, “from the late ’90s through 2020, where most media will be lost.”
Fight the good fight, my MGoBrother. Preserve that UofM history. This is the most on-topic post ever.
Keep circulating the tapes!!!!
Interesting unforeseen consequence of cameras going digital.
Have you considered writing to optical discs? Check out M-discs - blu rays designed for archiving. Can last 100-150 yrs. Then you can just upload to SkyNet.
Even if you are using the NAS for fast access...disc backup wouldn't be a bad idea. No need to worry about continual corruption and the checksum process. Granted, this will be more expensive than cheap HDs.