ASCII by Jason Scott

Jason Scott's Weblog

Life Inside Brewster’s Magnificent Contraption

This essay is going to tell you at the end to subscribe to the Internet Archive. If you want to go ahead and do that now, the link is here. You can also make a one-time donation at that link.

I joined the Internet Archive in March of 2011, after a very short meeting that came a few weeks after I asked for employment at a conference held at the Internet Archive’s headquarters in San Francisco, CA.

So it’s been a little over two years since that meeting and my hiring. Let’s review, oh.. let’s review everything.

I work primarily out of my home in New York, but the Internet Archive is located in San Francisco. It’s in one of the most beautiful buildings one can imagine, a renovated Christian Science church (it hasn’t been a church in a long time) that now has servers, offices, and all sorts of amazing one-off architectural individualities.

Some time ago, I took a bunch of photos within the walls of the building, just because I figured they should have them. I’ll put a bunch of those here, so you can see what I’m talking about when I say that it feels like a little bit of heaven every time I’m on site.

For what feels like 99.9% of the world, the Internet Archive has precisely one feature and one purpose: the Internet Archive Wayback Machine, an immediate call-back into webpages going back nearly 20 years. It has saved so many necks, proven so many wrongs, and brought back so much history, that I completely understand why it’s considered one of the internet’s crown jewels. There are people working very hard to keep it endlessly scraping data from the web, all the time, and even to give it occasional injections of older data acquired from other sources. No joke, we’re talking petabytes of web history stored in it. It’s a miracle and you should be giving money to the Archive on the principle of that solitary feature alone.

(Fun fact, while I’m here: the entire web crawl collection of the failed search engine Cuil, over 310 terabytes, was donated to the Internet Archive and is in the Wayback Machine. You can even download the raw data.)

But take it from me: knowing about the Wayback Machine at the Internet Archive and little else is like going to Walt Disney World, riding Space Mountain once, and then going back to your car in the parking lot.

The Internet Archive’s collections are massive, truly massive. Many petabytes. They’re just about to surpass having 2 million books available to read or download. I could spend the rest of the year aiming you at the different collections, like Atari Computer Related Books or Dance Manuals or Cookbooks or whatever else I feel like throwing at you. Millions of books! Online! Right now!

Let’s set aside books for a moment. How about audio? This place is loaded with audio, from podcasts and old time radio programs all the way through to a truly astounding amount of music shows and jazz collections and of course a world-class Grateful Dead live recording collection.

The films! There are tens of thousands of films in this place. There’s television shows about computers. There’s the stunning diversity of the Prelinger Archives. There’s sports videos. There’s tons and tons of feature films, just waiting for you.

But how about miracles? Have you even taken a little time to visit the TV News Archive, one of the more stunning search engine accomplishments of the 21st century? Just go in there, type in a search term, and watch the Internet Archive do something that would normally take you hundreds and hundreds of hours to accomplish, and do it in seconds.

When I was hired, it was to improve the software collection, which sat among these other massive collections but hadn’t gotten the attention it deserved. It’s gotten that attention and I wrote an entire entry about it. Summary: Largest collection of vintage software on the internet. Period. I’ve been busy.

No, I’ve been really busy. Checking my upload statistics, here’s what I’ve added to the Internet Archive: Over 169,000 individual objects, totaling 245 terabytes.

Ow.

I have a name for the system that drives the Internet Archive: Brewster’s Magnificent Contraption. The methodology for making sure data isn’t lost, that URLs stay static for decades, and that you can make wholesale improvements to the underlying machinery without disruption… well, it’s a wonder to behold. I behold it frequently.

The Contraption uses the cheapest hardware possible. It strives to ensure you will always, always, have access to the original file uploaded. This alone makes the Internet Archive stand head and shoulders above so many other environments that take your 13 gigabyte MPEG original and shell-game you into a 256 MB “pretty ok” version for the web.

You can add things to the Contraption using a browser-based upload, FTP, S3-“like”, and physically mailing material in. It’ll all funnel in quickly, get treated by endless scripts that poke, prod, and derive alternate versions, and then pop up on the web for immediate, complete, free download. It just happens, most of the time. Data is flooding into the place from all directions. I’m doing a lot as an individual, but I’m one of many entities bringing in this material, and the Contraption takes it all.

[Photo: boxes of external disk drives]

(Every time a picture of these boxes of external disks comes by, someone writes a missive along the lines of “what are you doing! externals are so much more expensive!”. However, this was during the great disk drive shortage after floods in Thailand knocked out a bunch of factories, and careful investigation revealed that buying the external versions of drives was notably, notably, cheaper than the so-called “bulk” or “internal” drives. This is the kind of attention to detail the Archive is soaked with.)

Does the Contraption have quirks? Hell yes, it does. You learn to deal with them – for example, it passes duties from machine to machine, so you’ll see an item in the collection “populate”, that is, sit there with no real links, and then a preview clip shows up, and a few minutes later a link or two shows up, until maybe an hour or a day later, all the different derivatives have done their work and shown up as well. A little weird, but at the end of the day, you personally did nothing, the machines did the work, and there’s this wonderfully complete collection of file variations to choose from.

Every thing or set of things is an Item. An Item can be grouped with other Items into a Collection. A group of Collections can be grouped under yet another Collection, making them all Sub-Collections. You get used to it. It’s not easy to go through the stacks, but it keeps the stacks sane.
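If the nesting sounds abstract, here is a rough sketch of the idea in Python. To be clear, this is my own toy model and not the Archive's actual code or schema: the class names, the `path` and `items` helpers, and the identifiers in the example are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    """Anything with an identifier: an Item, or a Collection of entries."""
    identifier: str
    # Back-reference to the enclosing Collection; left out of repr and
    # comparisons so the parent/member cycle never recurses.
    parent: Optional["Collection"] = field(default=None, repr=False, compare=False)

    def path(self) -> List[str]:
        """Identifiers from the top-level Collection down to this entry."""
        chain = []
        node = self
        while node is not None:
            chain.append(node.identifier)
            node = node.parent
        return list(reversed(chain))


@dataclass
class Item(Entry):
    """A thing or set of things, here reduced to a list of file names."""
    files: List[str] = field(default_factory=list)


@dataclass
class Collection(Entry):
    """A group of Items and/or other Collections (the Sub-Collections)."""
    members: List[Entry] = field(default_factory=list)

    def add(self, entry: Entry) -> Entry:
        """Place an Item or Sub-Collection inside this Collection."""
        entry.parent = self
        self.members.append(entry)
        return entry

    def items(self) -> List[Item]:
        """Every Item underneath this Collection, at any depth."""
        found = []
        for member in self.members:
            if isinstance(member, Collection):
                found.extend(member.items())
            elif isinstance(member, Item):
                found.append(member)
        return found


if __name__ == "__main__":
    # Hypothetical identifiers, purely for illustration.
    software = Collection("software")
    games = software.add(Collection("vintage_games"))
    disk = games.add(Item("example_game_disk", files=["disk.img"]))
    print(disk.path())
```

The point of the sketch is only that a Sub-Collection is just a Collection that happens to sit inside another one, which is why walking the stacks takes a little getting used to.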

The tradeoff is that the Archive is putting in hundreds of millions of these items, and once they’re up, they’re pretty much up forever at the same URL. You’re not going to go to this collection of Yours Truly, Johnny Dollar radio shows in a month or six months, or feasibly a decade or a century and have it no longer be there. It’ll shift machines and storage media and a whole host of other aspects but it’ll stay there.

I stopped thinking about storage, and bandwidth, a long time ago. How much storage do I get, I’m often asked. “Enough”, I say. How much bandwidth? “Enough”. It’s just not a factor I bring into my calculations these days. What I’m focused on, entirely, is acquiring the data that’s out there, on the net, shoved into hard drives and out-of-the-way sites and who knows where else. Stuff that people felt didn’t have a home or which they were terrified would get Reddit or Slashdot or Gizmodo attention and then put them in the poorhouse or in a permanent 404. Not with the Archive. I spend a lot of time contacting people, helping them get the data into the site, and then letting them hotlink us to hell, forever serving the data for their audience without being worried about what the next ISP bill will bring. It is handled. That’s Magnificent in its own right.

As I said, I’ve been inside the Contraption for some time and I’ve totally gone native. I still get the occasional surprise, the weird unexpected output, the hilarious balloon crash of script vs. data. But they’re easily dealt with, and the great people who I work with help it get sorted out.

The people!

It may seem like the most mundane and obvious thing, but when people ask me about the life with a non-profit like the Internet Archive, and want to know what I find most unique and striking, it’s simply this: everybody has the same goal.

Now, let’s not kid ourselves – there’s disagreements and bickering and flameouts and hand-waving, but they’re all doing it because they’re trying to get to the same goal. Different paths, different approaches, same goal: Gather the Largest Free Collection of Information on the Internet to Provide to the World. I spent a lot of years in places where if you asked folks around the office and the divisions what the “goal” was, they’d often have to tell you their personal goals because they’d know balls-all about what the “company” ultimately wanted to “do”. It just wasn’t part of the discussion. Here, people are on the same page. They want to make stuff come in. They want it to be useful, and they want stuff to go out. Everywhere. To everyone. Now. As fast and as efficiently as possible. This is breathtaking if you came from a world where your company would do something insane because of reasons and forces you were not privy to and not welcome to investigate or try to understand. Your head is down. At this place, your head is up.

If you work at the Internet Archive for three years, they make a little terra cotta figure of you and it lives at the Archive. There’s a lot of these around. Show me the other places that care about your time like that – there aren’t many anymore. (You have to go pretty far afield, like, say, Blizzard Entertainment.) I still have a year to go!

So look, this place is pretty amazing. But it’s also a non-profit that utilizes grants, selling of scanning/archiving services, and donations to stay afloat. When Archive Team and I were starting to blow out their projections for disk space usage for the year, I got right up into the financials of the place. I’m allowed to tell you them.

The Internet Archive, all of it, including the Wayback Machine, the hardware, the bandwidth, and the people, costs $12 million a year.

If you know anything about what stuff costs, this is by far the best bargain out there. $12 million is peanuts for what this place provides to the world. It also barely makes this amount every year since economic times went a little south, and so I resolved I was going to do something about it.

Hence, the aforementioned, the presaged, the I-gave-you-plenty-of-warning request that you consider getting a subscription to the Internet Archive. It’s great to send in a bunch of cash, and believe me it is really important the place get donations of that sort, but subscriptions provide something one-time donations do not: stability. They give the non-profit a regular income from people. They smooth out the hard choices that have to be made each year when the board and administration look at the budget. They allow the Archive to grow.

Even though it is technically not in my job description, I have shoved it in there anyway – I want the archive to have a huge number of subscribers, people who benefit from this place constantly and who can throw $5, $25, $100 a month at the place for what they do, to ensure they keep doing it.

If I can help shepherd money into the Archive’s operations, money that gets squeezed for every last bit of value, the resulting benefits are enormous. The drive space increases, the bandwidth increases, the hosted material comes in at a faster clip. The world benefits.

My boss, Brewster Kahle, has a pretty amazing biography. Go check the writeups about it out there. There’s a lot of them. And they all point out that in the 1990s, he ended up, through boom time sales, with a lot of money.

He could have bought a huge-ass boat. He could have bought a sports team. He could have cornered the market on software patents and made massive culinary books. He could have done anything with that money, and just driven off with it in a handmade car into the sunset. But he didn’t.

He did this.

He hired some amazing people and knocked on a lot of doors and he never stopped dreaming. This puts him pretty far up there in my book.

And so while he’s not the type to shake trees for cash, I am. Especially when the money does something this utterly wonderful. So I’ll say it again.

Subscribe to the Internet Archive. Get people you know to subscribe. Talk to me or to the Archive itself about ways you might want to donate funds or resources to it. (They’re tax-deductible, after all.) This is some of the best money you could possibly spend.

All Hail The Contraption!

 


Categorised as: computer history | jason his own self


4 Comments

  1. Fippy Darkpaw says:

    I’m in for $5/mo. And I hope we get to see your terra cotta figurine in a year. 🙂

  2. I don’t know where I’ll find myself in the future but the years I spent working with the Archive will always be among the best of my career.

  3. ersi says:

    Unfortunately, this isn’t tax deductible to us that aren’t in the US.

    Fortunately, I don’t care. I’m a bit hesitant on how much I want to subscribe to – but $10/monthly is just $120/year. That seems like a given!

  4. iPadCary says:

    If Jason doesn’t mind, I’d like to cite an absolute realworld instance of the
    very important work the Internet Archive [IA] does.

    There’s a listener-supported FM radiostation here in NYC called WBAI.
    Jason’s appeared numerous times as a guest on various shows.
    Right now, WBAI is in the midst of a pledge drive for emergency funding.

    On August 7th, one particular show on WBAI, the 30 — yes: THIRTY — years & counting
    “The Personal Computer Show”, wanted to use as an incentive to get people to donate money
    during this emergency funding drive something they called “The Lost Interview”.
    One of the show’s hosts had done a half-hour Q&A at the 1986 COMDEX with a then relatively unknown Bill Gates on what he thought the future of personal computing would be.
    They called it “The Lost Interview” because when it was time to play it
    on the air that year, the cassette tape with the interview on it had been misplaced.
    The hosts tore the studio apart looking for it, checked their personal stashes of
    cassette tapes at home, etc., etc.
    Nothing.
    To reiterate: this is 1986 when all this happens.

    Fast forward to 2013.
    The host who conducted the interview gets an email from somebody at The Washington Post
    saying they found the interview.
    The host, understandably elated, asks: “Where, in God’s name?”
    Do I have to tell you the answer? lol
    Yes, of course: he found it in an IA datafarm archive.
    Don’t believe me?
    Well, listen to the host tell the story himself > http://www.pcradioshow2.org/2013/pcrs_2013_aug7_32k.mp3 > 32:00
    And “The Lost Interview” itself is here > http://archive.org/details/pra-IZ1146

    So I just wanted to get that out there as, again, a very realworld example
    of the importance, the utility & the necessity of the work of the good people at IA.
    Thank you.