ASCII by Jason Scott

Jason Scott's Weblog

One Year Update: Jason and the Internet Archive

I officially started work at the Internet Archive over a year ago.

Let’s remove any tension – it has been a fantastic year, where I have gotten more done in the way of preservation and computer history work than in my entire previous 40 years combined. The Internet Archive seems to like me, I really like them, and I’m staying.

Here I am with the boss:

Brewster and I are two rather different people, and although the Venn diagram of our interests does not intersect everywhere or even close to it, what we do share in terms of goals and passions is very similar. There’s no hidden agenda with this guy – the headquarters in SF isn’t secretly a meth lab, we’re not actually some lobbying group or anti-whatever think tank trying to destroy anything. There is only the Mission, the goal to bring as much human knowledge as universally as possible, and to preserve and keep all manner of knowledge as reliably as possible.

Oh, there’s occasional office flare-ups and disagreements and I’m sure some clenched fists and not every day is an endless buffet of awesome, but every single person in this organization understands the Mission and pretty much 100% of the disagreement is how best to achieve that mission with what resources there are (or to gain new resources). That’s rather refreshing from, oh, let’s say, every other goddamn place I’ve worked at, where the goals of some people are “get to retirement age” combined with others who mostly have signed up for “do absolutely nothing until you either get bored and leave or get fired”. That’s not going on here. It was a shocking office culture to run into, everyone just kind of pressing in towards the overarching mission without being waylaid by one group trying to undermine the others for some other bonzo reason unrelated to what the place was going for. Again: People leave, people join this place, but they all understand that dream, that plan, that hope, that dream. Maybe this happens elsewhere, but not in my previous lines of work. So that’s somewhat mind-blowing on a daily basis.

I will feel really stupid if I start listing out names of co-workers and then miss some, so I will tell you that I have someone who is a “handler” for me, and she is far and away one of the best bosses I’ve ever had, and I’ve had a few really, really great bosses in my time. I have people I sit with when I’m in San Francisco who are brilliant, hardworking people (again, all aimed at this goal) who do stunning work. We have communication channels where various groups talk, and it’s like shoving your face into a Brilliance Fountain 24/7. I’m not making this stuff up to butter anyone up.

Remember, this isn’t people all sitting around figuring out how to monetize farting or who are blowing up paradigms with slide-scale infradoobles using Ruby on Crack combined with Hibbledoo Middleware. This is a non-profit online library providing petabytes (petabytes!) of data to millions in the most efficient way possible. Speaking of which…

One of the job descriptions/goals for me was “bring in data”.

I just checked the internal tracker to see how I’ve been doing on the upload front. Very well, apparently – I have uploaded 120 terabytes of data. That’s into 82,438 individual items, which could be anything from texts or songs up through to .tar files of web captures. When I started, I said my goal was to upload a terabyte of data a month. As I am apparently doing ten times that amount, I’ll consider that goal met.
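For the skeptical, the "ten times that amount" claim is easy to check. What follows is only a back-of-the-envelope sketch: it assumes the 120 terabytes landed over exactly 12 months and uses decimal units (1 TB = 1,000 GB), and the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the upload numbers quoted above.
# Assumptions (not from the tracker): exactly 12 months of uploading,
# decimal units where 1 TB = 1000 GB.
TOTAL_TB = 120
ITEMS = 82_438
MONTHS = 12
GOAL_TB_PER_MONTH = 1

tb_per_month = TOTAL_TB / MONTHS                 # actual monthly rate
times_goal = tb_per_month / GOAL_TB_PER_MONTH    # how far past the goal
avg_item_gb = TOTAL_TB * 1000 / ITEMS            # mean size of one item

print(f"{tb_per_month:.0f} TB/month, {times_goal:.0f}x the goal")
print(f"average item size: about {avg_item_gb:.2f} GB")
```

The average hides a lot, since an item can be anything from a single text to a huge .tar file, but it gives a rough sense of scale.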

I’ve brought in so much “stuff”, in fact, that it would be nearly impossible for me to tell you all of it. Let’s throw out some highlights.

I was asked to look into bringing in software. So, I started out with CD-ROM shareware discs, not dissimilar to what I have with cd.textfiles.com. Well, that has been a wild success. I am ready to declare The Internet Archive as the largest collection of shareware on the Internet. Seriously. First, there’s over 1,100 CD-ROMs and DVD-ROMs contained in the CD-ROM collection. But oh, it gets better. You see, functionality was added this year to allow you to browse inside the ISO images. Feast your eyes inside this CD-ROM, for example. You just add a slash at the end of the ISO image reference and there you are. But let’s go even further than that: Let’s take a GIF file of Winter from 1991: http://archive.org/download/SoMuchSharewareV1_918/SoMuchSharewareV1_1991.iso/GIFS/WINTER2H.GIF

You see how you can reference a file inside a CD-ROM image in a permanent URL that can be pulled from anywhere? That’s why, as far as I’m concerned, The Internet Archive now has well over four million shareware programs, artworks and documents online. At least. That’s a game changer. And this year? We’re going to double it.
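The URL pattern in that WINTER2H.GIF link can be sketched in a few lines of Python. This is an illustration inferred from that one example, not a documented API: the function name `iso_member_url` and its parameters are hypothetical, and it assumes every item follows the same download/item/image/path layout.

```python
from urllib.parse import quote

# Base taken from the example link above.
DOWNLOAD_BASE = "http://archive.org/download"


def iso_member_url(item, iso_name, member_path=""):
    """Build a URL pointing inside an ISO image stored in an item.

    Hypothetical helper: the layout is inferred from a single example URL.
    With an empty member_path the URL ends in a slash, which is the
    "add a slash at the end" trick for browsing the image's contents.
    """
    url = "/".join([DOWNLOAD_BASE, quote(item), quote(iso_name)]) + "/"
    if member_path:
        # quote() leaves "/" alone by default, so subdirectories survive.
        url += quote(member_path.lstrip("/"))
    return url


print(iso_member_url("SoMuchSharewareV1_918",
                     "SoMuchSharewareV1_1991.iso",
                     "GIFS/WINTER2H.GIF"))
```

Called with the item, image, and path from the example, this reproduces the Winter GIF link exactly; leave off the last argument and you get the browsable listing for the whole disc.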

 

Computer magazines. Lots and lots and lots of computer magazines. Out of print, fondly remembered and otherwise obscure magazines on a range of technical subjects, currently the province of attics and basements and long-unopened warehouses and a smattering of living spaces – now up and readable.

This collection of computer magazines, as well as a smaller Spanish-language set, constitutes 30 years of technical publication and well over a thousand individual issues, many of those in the hundreds-of-pages range, which means there’s a lot of history squished into all this data. I’ve already been informed of university and high school classes out there using these issues to bring up discussions of history or to point out aspects of computer technology that have shifted or changed. Some of the issues have indexes already (the Compute! Magazine collection is a shining example) and I hope more will get them over time. I’ve got lots more issues to add, too.

Manuals! Damn, do I love getting manuals up where people don’t have to search like crazy to find them. It actually saves the environment in some small way, since people will happily buy older equipment knowing they can get the manual easily and make use of the item. So manuals are a big deal:

Arcade manuals. DEC manuals. Synthesizer manuals. Commodore manuals. Whenever I track down a cache of these or get sent them, they go up. I want to be able to have someone grab any piece of equipment new or old and understand what exactly everything does on it, and maybe even the why.

Audio! Video! 59,000 open-licensed albums. 2,100 nights of live and club music. Hours of GET LAMP raw interviews. A complete port of 10 years of Jesse Thorn’s The Sound of Young America. Bit by Bit. There are many other such audio and video projects where I use scripts to get them into the archive as collections – part of my work has been writing stuff to inject massive amounts of data into archive.org’s servers so that the uploading is the least of the issues. Which brings us to:

FUCK YEAH, ARCHIVE TEAM. I can’t begin to really describe how much data Archive Team has brought in – so many people working together to take snapshots of important things that are being shut down with poor or no notice, as well as proactive “panic downloads” where we recognize things are on the outs and we grab as good a copy as we can.

Like the Internet Archive itself, Archive Team’s collections are not always meant to be short-term beneficial and in fact are pretty clunky – 50GB .tar files and the like. What they are meant to be is raw material for later efforts and rescue of lost data – the panic downloads are basically someone stepping in at the present time and running the duper just before a whole range of data disappears forever. Some of it will be absorbed into the Wayback Machine. Some will be filleted for their GIFs or mp3s or who knows what else. And still others will result in data, meaningful first-generation data about how people used computers or how solutions were found to old problems. Or maybe we’ll just laugh at the hair.

I’d go off more on Archive Team but I’m scheduled for something like a half-dozen speaking engagements around the world this year related to it, so I’ll probably just link to those talks when they come out. Actually, here’s a talk I gave about it a couple months ago, which is hosted at, and took place at, the Internet Archive.

As we speak, Archive Team is uploading something like 25 gigabytes an hour into the Internet Archive. Chew on that for a bit. So many good people, so much good work, on both sides of the wire.
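To help with the chewing, here is a rough projection of what that rate adds up to. It is a sketch under one big assumption that the figure above does not actually promise: that 25 gigabytes an hour holds steady around the clock. It also uses decimal units (1 TB = 1,000 GB) and a 30-day month.

```python
# Rough projection of the Archive Team upload rate quoted above.
# Assumption: the rate is constant 24 hours a day, which is a
# simplification for illustration, not a measured figure.
GB_PER_HOUR = 25

gb_per_day = GB_PER_HOUR * 24             # one full day
tb_per_month = gb_per_day * 30 / 1000     # 30-day month, 1 TB = 1000 GB
tb_per_year = gb_per_day * 365 / 1000     # 365-day year

print(f"{gb_per_day} GB/day")
print(f"{tb_per_month:.0f} TB/month")
print(f"{tb_per_year:.0f} TB/year")
```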

This is getting a bit long, and I’ll split more out into entries this year to give context and meaning, but the upshot is that this has been a very successful year, a lot of amazing things are happening and continue to happen, and every single waking moment I spend related to this “job” is what I’ve always wanted to do.

And that’s pretty nice. Thanks for taking the gamble, Brewster!

 


Categorised as: Archive Team | computer history | jason his own self


3 Comments

  1. I’d start ranting about how awesome you were for doing all this, but it’d probably end up as an archived comment. 🙂

    Anyway, thanks for your perspective. The world *will* need that someday, if not now.

  2. Randall’s right, archive.org is one of a handful of the most important things on the Internet. Don’t think we’re not watching or don’t notice (we’re just quiet about it).

    Party on Wayne,

  3. Peter says:

    Yes, archive.org is pretty much the best website ever. From The Computer Chronicles and NetCafe to the live music archive and archives of art films and old commercials. I downloaded the PC-SIG shareware CD-ROM iso from archive.org, after learning about it from a Computer Chronicles episode on shareware from archive.org, and thanks to the Wayback Machine, I was able to download old RealAudio files of a show on Atari I enjoyed back at the turn of the millennium called Back In Time. Thanks, archive.org!