ASCII by Jason Scott

Jason Scott's Weblog

Archivebot and Automatic Betterment —

One of the most successful (and ongoing) projects that Archive Team has done is Archivebot. It’s a crowdsourced archiving operation that goes after anything that needs archiving, be it webpages, tweets, videos and other online scrapings, and allows them all to be captured in a competent, useful manner, 24/7. It has grown over the years and is ridiculously flexible now, with a command language so variant that it has this unbelievably high quality manual for all it can do.

Archivebot is at its best with on-the-ground shifting-sands events, like tragedies/celebrations, they-said/others-said controversies, and capturing things the way they “are” just before the new news flattens the pages before they’re taken down. If you watch the Archivebot Twitter account, you can see the bot in action, in real time.

As it goes along, generating these WARC’d pages, it stores them in 100 gigabyte chunks. (Yes, Gigabytes.) Usually between 100-300gb of this data comes in a day. The chunks are then uploaded into a collection, where they are then absorbed into the Wayback machine within a day or two, meaning the world benefits from the data almost immediately.

ArchiveBot has two mascot images. They are both accurate:

VY5wNAvslogo_header

Looking at the Archive’s statistics, some of these pages are called up through the wayback machine hundreds of thousands of times. (One of them has gone past 600,000 recalls.) Many, many more are called “merely” tens of thousands of times.

Occasionally, Brewster gets an idea in his head, and I hear of it, and fly with it. (It’s nice to have a boss that inspires.) One of them was that the new Internet Archive interface is rather visual in nature (by default), and so it would be nice if some of these more data-heavy items in the collection had some sort of visual component to them, if possible.

I set off on that work last year. The idea was to use a WARC playback mechanism (WebArchivePlayer) to bring back pages out of the WARC files, take screenshots, and then upload those screenshots into the items as “previews” of what’s inside.

It’s taken a while, because these have to be downloaded, then played back, then screenshot, then checked if the screenshots are any good, and then removing the ones that don’t have any data in them, and so on.

But it’s coming along really well.

news.ycombinator.com-shallow-20150322-124509-4t3kz-00000.warc.gzwww.cubcentral.org-inf-20150322-141813-3c758-00000.warc.gz

The played-back WARC files work remarkably well. Naturally, if something is a Javascript or embedded-object nightmare, it doesn’t look good at all. But many do.

I’ve got the script working chronologically right now, so it’s doing the oldest items first and then moving on from there. It takes it anywhere from a half-hour to a couple hours to make the preview images. Most of that is because I’m giving every single grab a chance to produce images, and some of those grabs go into the gigabytes.

archivebot

The result, which is now 99% automated, are pages after pages of beautifully rendered, verified-as-historically-relevant or at least gawkishly-fascinating web pages and sites. The thumbnails look good, and going to the individual chunks will give you an interesting (and potentially disturbing) slideshow of events and strangeness.

I mention this all for two reasons. The first is just to put a line in the sand at a point in Archivebot’s journey to reflect on how far along that amazing project has come. It continues to innovate, thanks to the efforts of an all-volunteer force, and addresses the ever-changing aspects and requirements of being a chronicle of the archiving of the web.

But the other is the template that this thumbnail and slide-generating aspect represents at the Internet Archive, which is heavily machine-augmented human work. I go through the items that come out of the contraption, pulling out the sideways-broken ones and the weirdly-off rejects, leaving thousands of screenshots with no human intervention whatsoever. It’s grinding away 24/7, doing something both useful and not worth throwing a person at. It’s how I think a lot of the work will continue to be handled with an ever-increasing workload.

And it’s fun.

00_coverimage00000_Header00000_Header (1)


Categorised as: Archive Team | computer history | Internet Archive

Comments are disabled on this post


Comments are closed.