ASCII by Jason Scott

Jason Scott's Weblog

The Backing Up of the Internet Archive Continues: Hop In

A little over a month ago, I cooked up a grandiose plan with the Archive Team to back up the Internet Archive, and discussed what that might entail and how one might approach it.

How’s that thing going, anyway?

Remarkably well would be the best description.

The working group (and let’s be clear – I’m doing the least amount of “working” in the Working Group) has cooked up a bunch of language and procedures, and has taken on volunteers at a great clip.

Currently, the system lets additional clients (volunteers) join up and be fed the most-needed shards (data sets). It then checks on these clients, alerting them if they haven’t checked in for a couple of weeks and expiring them after a month without a check-in, at which point the lost client’s dataset is backfilled.
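For the curious, the check-in rules described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the names `client_status` and `shards_to_backfill`, the client record layout, and the exact thresholds (14 days for "a couple of weeks", 30 days for "a month") are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed thresholds: the post only says "a couple of weeks" and "a month".
ALERT_AFTER = timedelta(days=14)
EXPIRE_AFTER = timedelta(days=30)


def client_status(last_checkin, now):
    """Classify a client by how long ago it last checked in."""
    idle = now - last_checkin
    if idle >= EXPIRE_AFTER:
        return "expired"   # client is dropped; its shards need new homes
    if idle >= ALERT_AFTER:
        return "alert"     # client gets a reminder
    return "ok"


def shards_to_backfill(clients, now):
    """Collect the shards held by expired clients so they can be reassigned."""
    lost = set()
    for client in clients:
        if client_status(client["last_checkin"], now) == "expired":
            lost.update(client["shards"])
    return lost


# Hypothetical example data.
now = datetime(2015, 5, 1)
clients = [
    {"name": "a", "last_checkin": now - timedelta(days=3), "shards": {"shard1"}},
    {"name": "b", "last_checkin": now - timedelta(days=45), "shards": {"shard2"}},
]
lost_shards = shards_to_backfill(clients, now)
```

In this toy run, client "b" has been silent for 45 days, so its shard ends up in the set to be handed to someone else.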

We’ve intentionally and unintentionally punched clients in the gut and watched the system recover. We’ve also added a leaderboard, as well as feedback on which clients tend to have what.

Currently, the IA.BAK system stands at 27 terabytes backed up. To some, this might sound like a drop in the bucket compared to the vast stores of the Archive, but that’s because they’re looking at it differently from the way that has emerged during the project’s research efforts.

For example:

  • The IA.BAK project is housed on zero Internet Archive infrastructure.
  • Only data and collections accessible to the public are backed up.
  • Each item is backed up to three separate locations other than the Archive.
  • Collections of items are hand-chosen for historical value/rareness on the net.

The result of many discoveries along the way, these sorts of choices came from discussion, testing, and some very smart volunteers throwing ideas back and forth at each other.

Obviously, having the whole thing not depend on Internet Archive at all quickly became the goal – even the website explaining how it all works isn’t hosted there. As for the public-only, it was important that the project not depend on having some insider access or knowledge (after all, this might be a useful thing for other major data stores). And that three-other-locations thing is murder – we’re already up to almost 60 terabytes of volunteered, shared space.
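To make the “most-needed shard” idea concrete, here is a rough Python sketch of how a coordinator might pick what to hand a new client, given the three-copies target. It is a guess at the general shape, not how IA.BAK actually does it: `most_needed_shard`, the `copy_counts` mapping, and the tie-breaking rule are all hypothetical.

```python
# From the post: each item should live in three locations other than the Archive.
TARGET_COPIES = 3


def most_needed_shard(copy_counts, already_held=()):
    """Return the shard with the fewest known copies that is still below
    the target, skipping shards this client already holds.
    Returns None when every shard is fully replicated."""
    candidates = {
        shard: copies
        for shard, copies in copy_counts.items()
        if copies < TARGET_COPIES and shard not in already_held
    }
    if not candidates:
        return None
    # Fewest copies first; ties broken by shard name so the result is stable.
    return min(candidates, key=lambda shard: (candidates[shard], shard))


# Hypothetical copy counts for three shards.
counts = {"shard1": 3, "shard2": 1, "shard3": 2}
pick = most_needed_shard(counts)
```

Here `shard2` gets picked first because only one copy of it exists, which is the kind of ordering that makes every new volunteer's disk count for the most.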

Finally, the hand-chosen aspect has been particularly enlightening – given this approach to backing things up, what collections would the world truly be poorer for not having? As we walk through the various piles of history on the Archive, the team of IA.BAK volunteers is finding some really wonderful sets, items which could use a little spotlight for the world to check out anew. This is, after all, both an expedition and an experiment.

The client suite for becoming one of our volunteer storage spaces is now many times easier to use, and a lot of error correction has been built in. We’re not quite to the “space on your laptop or desktop” E-Z install phase yet – it’d be good if your disks were connected to the internet constantly, and if you had, say, more than 500 GB of free disk space lying around.

The system is built so you can remove a collection you don’t want to back up (and it won’t return), and so you can start using some of that provided disk space for your own purposes again, leaving whatever gigabytes remain for the project. In other words, you can put idle disk space to work the same way SETI@home puts idle CPU time to work. We have people contributing a half-terabyte drive they aren’t using, while others are going for the gusto and offering tens of terabytes.

So, if this intrigues you, please come visit the IA.BAK homepage, see how we’re doing (after just a month!) and learn how you might help.

Categorised as: Archive Team | Internet Archive

One Comment

  1. Peter says:

    I’m intrigued. Mostly worried about the network use and load. Disk-space is cheap, The network is a limited resource in many environments. Can network usage be throttled and scheduled from within the client? Is this pure archival, meaning a onetime download and no uploads?