Back That Thing Up —
I’m going to mention two backup projects. Both have been under way for some time, but the world randomly decided the end of November 2016 was the big day, so here I am.
The first is that the Internet Archive is adding another complete mirror of the Wayback Machine to one of our satellite offices in Canada. Due to the laws of Canada, to be able to do “stuff” in the country, you need to set up a separate company from your US concern. If you look up a lot of major chains and places, you’ll find they all have Canadian corporations. Well, so does the Internet Archive, and that separate company is in the process of getting a full backup of the Wayback Machine and other related data. It’s 15 petabytes of material, or more. It will cost millions of dollars to set up, and that money is already going out the door.
So, if you want, you can go to the donation page and throw some money in that direction and it will make the effort go better. That won’t take very long at all and you can feel perfectly good about yourself. You need read no further, unless you have an awful lot of disk space, at which point I suggest further reading.
Whenever anything comes up about the Internet Archive’s storage solutions, there’s usually a fluttery cloud of second-guessing and “big sky” suggestions about how everything is being done wrong and why not just engage a HBF0_X2000-PL and fark a whoziz and then it’d be solved. That’s very nice, but there are about two dozen factors in running an Internet Archive that explain why RAID-1 and Petabyte Towers combined with self-hosting and non-cloud storage have worked for the organization. There are definitely pros and cons to the whole thing, but the uptime has been very good for the costs, and the no-ads-no-subscription-no-login model has been working very well for years. I get it – you want to help. You want to drop the scales from our eyes and you want to let us know about the One Simple Trick that will save us all.
That said, when this sort of insight comes out, it’s usually back-of-napkin and done by someone who will be volunteering several dozen solutions online that day, and that’s a lot different than coming in for a long chat to discuss all the needs. I think someone volunteering a full coherent consult on solutions would be nice, but right now things are working pretty well.
There are backups of the Internet Archive in other countries already; we’re not that bone stupid. But this would be a full, consistently and constantly maintained backup in Canada, and one that would be interfaced with other worldwide stores. It’s a preparation for an eventuality that hopefully won’t come to pass.
There’s a climate of concern and fear that is pervading the landscape this year, and the evolved rat-creatures that read these words in a thousand years will be able to piece together what that was. But regardless of your take on the level of concern, I hope everyone agrees that preparation for all eventualities is a smart strategy as long as it doesn’t dilute your primary functions. Donations and contributions of a monetary sort will make sure there’s no dilution.
So there’s that.
Now let’s talk about the backup of this backup that a great set of people have been working on.
About a year ago, I helped launch INTERNETARCHIVE.BAK. The goal was to create a fully independent distributed copy of the Internet Archive that was not reliant on a single piece of Internet Archive hardware and which would be stored on the drives of volunteers, with 3 geographically distributed copies of the data worldwide.
Here’s the current status page of the project. We’re backing up 82 terabytes of information as of this writing. It was 50 terabytes last week. My hope is that it will be 1,000 terabytes sooner rather than later. Remember, this is 3 copies, so each terabyte backed up needs three terabytes of volunteer disk space.
For some people, a terabyte is this gigantically untenable number and certainly not an amount of disk space they just have lying around. Other folks have, at their disposal, dozens of terabytes. So there’s lots of hard drive space out there, just not evenly distributed.
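For the napkin-math inclined, the three-copies arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and it assumes the quoted figures (50, 82 and 1,000 terabytes) count unique data rather than the total across all copies.

```python
# Back-of-napkin arithmetic for the storage figures quoted above.
# Assumption: the quoted terabyte figures count unique data, not all copies.
# The function name is hypothetical and not part of the IA.BAK tooling.

COPIES = 3  # the project aims for 3 geographically distributed copies


def volunteer_space_needed(unique_tb, copies=COPIES):
    """Return the volunteer disk space (in TB) needed to hold
    `unique_tb` terabytes of data at `copies` copies each."""
    return unique_tb * copies


if __name__ == "__main__":
    figures = [
        ("last week", 50),
        ("as of this writing", 82),
        ("the hoped-for goal", 1000),
    ]
    for label, unique_tb in figures:
        total_tb = volunteer_space_needed(unique_tb)
        print(f"{label}: {unique_tb} TB backed up needs {total_tb} TB of disk")
```

In other words, every terabyte that joins the backup is a request for three terabytes spread across volunteers.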
The IA.BAK project is a complicated one, but the general situation is that it uses the program git-annex to maintain widely-ranged backups from volunteers, with “check-in” of data integrity on a monthly basis. It has a lot of technical meat to mess around with, and we’ve had some absolutely stunning work done by a team of volunteering developers and maintainers (and volunteers) as we make this plan work on the ground.
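To give a feel for what a data-integrity “check-in” means, here is a minimal Python sketch of the general idea: record a checksum for each file once, then re-hash later and report anything missing or changed. This is not how IA.BAK or git-annex is implemented (git-annex has its own checksum verification, for example via `git annex fsck`); the function names and the manifest format are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_in(root, manifest):
    """Return the relative paths that are missing or no longer match."""
    root = Path(root)
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

A volunteer-side job could call `build_manifest` when data first arrives and run `check_in` on a schedule; an empty list means every file still matches what was originally stored.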
And now, some thoughts on the Darkest Timeline.
I’m both an incredibly pessimistic and optimistic person. Some people might use the term “pragmatic” or something less charitable.
Regardless, I long ago gave up assumptions that everything was going to work out OK. It has not worked out OK in a lot of things, and there are a lot of broken and lost things in the world. There’s the pessimism. The optimism is that I’ve not quite given up hope that something can be done about it.
I’ve now dedicated 10% of my life to the Internet Archive, and I’ve dedicated pretty much all of my life to the sorts of ideals that would make me work for the Archive. Among those ideals are free expression, gathering of history, saving of the past, and making it all available to as wide an audience, without limit, as possible. These aren’t just words to me.
Regardless of whether one perceives the coming future as one rife with specific threats, I’ve discovered that life is consistently filled with threats, and only vigilance and dedication can break past the fog of possibilities. To that end, the Canadian Backup of the Internet Archive and the IA.BAK projects are clear bright lines of effort to protect against all futures dark and bright. The heritage, information and knowledge within the Internet Archive’s walls are worth protecting at all cost. That’s what drives me and why these two efforts are more than just experiments or configurations of hardware and location.
So, hard drives or cash, your choice. Or both!
Categorised as: Archive Team | computer history | Internet Archive | jason his own self
We gave a few bucks yesterday, pal. Keep up the great work!
I just realized I have several empty TB in my ZFS array. So here ya go.
I particularly like that IA.BAK lets me specify how much space to keep free, so I don’t need to worry about running out if I need to use that space myself.