ASCII by Jason Scott

Jason Scott's Weblog

The Frightening Cornucopia

I am an extremely lucky person.

I’m lucky for a host of reasons, but in this particular case, I’ve been matched up with the Perfect Job very early in my life – my 40s. Some people get it earlier, of course, but many more get it later, if at all. Life at the Internet Archive is just what I wanted it to be. Conflicts are barely anthills. Achieved dreams loom in every direction. Triumphs have been many, failures often more hilarious than troublesome.

When I joined in 2011, I was given several overarching aspects to think about, and I added a few of my own. One of them was software and another was the emulation in a browser thing, both of them going quite swimmingly. Another was to spiff up the donation page, and while at this exact second the design’s a little cramped for the holiday matching fund drive, the flexibility of the new design and the addition of subscriptions turned out to be well worth my attention.

So, 2014 looms. What’s got my attention and why did I use a word like frightening in the title of this entry?

First of all, I’m not “done” with the JSMESS project and I’m certainly not done adding software items to the archive – those will continue and may even dwarf the rest of what I’m doing for some time to come. They’re both big, important things and I’m working on them nearly daily, as are many others.

We needed easier money donation, and we needed software emulation in the browser, and now we have that, and it will get better as time goes on.

In 2014, I want to go after two other weaknesses in the Internet Archive arsenal: Metadata and Discovery. (And maybe Accessibility if we can swing it).

When I interact with professional librarians and archivists, or even folks who are really, really into the subjects that I’m focusing on (vintage software, crazy old crap), the conversation quickly turns to how in fact these items are being described and given metadata. And then the question of how it can possibly be found at all.

So, in the very specific realm of software, bear in mind we’re making up for decades of institutional neglect. Oh, hobbyists and intense amateurs were getting shit done, let’s not diminish that work at all. But it was all being done under this cloud of “are we in trouble” that meant that the hosting and interaction of the materials meant that a few random brave souls would make good collections (Home of the Underdogs for binaries, MobyGames for metadata, for ROMs) and then things would go south for a variety of reasons and the information and data would disappear again, sometimes for good. No institution stepped in. Not really. And so here we are, with the Internet Archive now stepping in. Become the largest historical collection in the world? Check.

To do this, we absorbed many terabytes of data, from a wide range of software. Some people were very specific about high-quality descriptions and naming. Others…. were not. But again, to make up for lost time, in it went.

Same with old documents related to computers, old videos, old audio. My philosophy has been, and continues to be, get it online first. GET IT ONLINE FIRST. Deal with EVERYTHING ELSE LATER.

If it’s online, it’s not in a box in a basement or attic. If it’s online, it can be commented on. If it’s online, it can be shifted around effortlessly and included in greater and greater things. And if it’s online, it isn’t rotting on some piece of magnetic plastic or dimpled plastic or broken plastic. Granted, we’re buying a whole other range of long-term problems putting it on spinning disks and what have you, but the long-term preservation of the item is now a whole lot easier, should we be responsible. Being online is a great thing.

Once stuff is online, and as I just implied, an awful lot of stuff is now online, then we can talk about metadata, organization, discoverability.

And that time is now.

I unintentionally got quoted all over the archiving and library scenes when, in a talk I was giving at the New York Public Library, I said “Metadata is a Love Note to the Future”. This rang true with a lot of people, and it speaks to the oddness of what metadata is and who and how it serves.

Intense, machine-searchable information about artifacts and collections, be they digital or physical or whatever, has a value that is primarily based on faith. You can enjoy the object right now in your hands, but turning it into a photograph or a .wav file and then tacking on a whole range of information you might not have even had at the moment, is preparing for a future that you have no idea about.

I assure you, there are hundreds of books contemplating the nature of objects in past, present and future, and how we as human beings interact and interface with these objects. I’m not going there. But I’m going to say that the effort put into generating contextual data about an item provides all sorts of benefits, but almost completely in theory unless you know you have an audience waiting for it. That makes it a very tough sell for people to ‘just do’, like they might bookmark or do a retweet or notation in a weblog. It’s involved. You usually have to pay people. And if you pay people, it gets expensive quickly.

So my efforts will be to make metadata generation for items on the Internet Archive as painless, as collaborative, as rewarding as possible. I’ll likely utilize custom scripts, wikis, let’s-raise-the-barn events and shout-outs for folks to get involved however they want to. I also will work on automation of same, where a person is signing off on the efforts of machines, instead of typing in the year when the stupid thing is telling you the year right there and in a billion obvious locations.

It’s a tough problem with a lot of moving parts! Hence it’s a goal, to be implemented over time and with endless refinements as I progress. I’ll let you know how that goes.

Even more fundamental is the issue of Discovery and Exploration.

There are people who have no idea the Internet Archive exists, Wayback or digital media or anything. There are people who only know it for Wayback. And then there’s people who know it “pretty well”, knowing we have a whole bunch of audio and video and books and software. You are likely among this last group.

And you still have no idea, no idea, how much stuff is at the Internet Archive and its collections.

I just checked The Thing That Tells Me Stuff and it tells me that in my time at the Archive I have personally uploaded 229,000 individual “items” (some of which are grouped files) for a total of 262 terabytes of data.

I’m throwing a lot in, but I’m hardly the only one throwing a lot in. Some of my co-workers in the “collections” group I work at have shoved in millions of individual items, ranging from documents and journals through to the video, audio, and so on. Let’s not even touch the wayback, which has over 368 billion (with a b) URL captures.

When I send you somewhere, say, deep into a collection of magazines or over to some Apple II documentation or up into a massive audio record… well, forget the surface, we’re not even scratching the surface of the surface.

It is a terrifying, frightening cornucopia. It is a horn of plenty so pitch-dark with content that I am not 100% convinced the problem is solvable, unless the nature of humanity changes overnight and even then we’re talking a couple years of hard work.

But there you go. In conjunction with other efforts by other folks at the Archive, the plan is to make strides in discoverability, usefulness and access to the vast and ever-growing stacks of the Internet Archive, which, again, I promise you, are massive.

Every site that has a forward-facing website and then terabytes of goodness down the line has this exact problem, by the way. Every museum and archive with warehouses and storage units extending into the darkness has the problem as well. It’s not a new problem, but it’s one I’m willing to tackle.

Hey, if they weren’t called ratholes, everyone would want to go down them.


Categorised as: computer history | Internet Archive



  1. Josh Renaud says:

    Glad to hear about your efforts to improve metadata and discovery. One thing I have often wished for is a way to search the text of magazines and books that you’ve added in the IA computer collections. The text is all there… the magazines have been OCR’ed and you can search _within_ a magazine. But there’s no way to search _all_ of them to find the particular issue that might discuss what you’re researching.

  2. Steve says:

    Not the best idea, but have you considered checking every item with an image, and then asking people what the image says (possibly showing them what the OCR software sees?)?

  3. Chris Orcutt says:


    This is a marvelous post. Clear, informative, and inspiring.

    Two thoughts:

    1. Perhaps such an application already exists, but if not, you and your Archive Team guys might consider developing an application that enables relative laypeople to do some of the metadata tagging for you. For each category of material in the Internet Archive, there are, say, 100 possible metadata checkboxes. As the title and description of each item (if a description now exists) comes up, the person doing the tagging clicks all of the appropriate checkboxes for that item, and then you could have fields that contain other descriptors not included among the checkboxes. The bottom line here is, it would be good if this metadata project were able to be done by technology laypeople. I’m sure you could get a lot of librarians and archivists to participate as well.

    2. In American English, commas and periods *always* go inside quotation marks. For example, my favorite short stories are “The Lady and the Dog,” “The Chaste Clarissa,” and “The Five Forty-Eight.” Another example: “JSMESS is really humming along right now,” Jason said. “But there’s a lot of work to be done at the Internet Archive in metadata, discovery and accessibility.”

    A great post, Jay. Really.

    Keep up the great work, and I’ll see you on the 17th.


  4. Jason Scott says:

    I’ll look more deeper into this problem of “comma placement”, “style rules”, and “grammar”, at another time.

  5. Jason Scott says:

    On a more serious note, yes, one of the major fronts of attack to this problem will be utilizing already-extant collaboration software and tagging/classification tools (there are actually many) into a way that works with the unique aspects of the Internet Archive. This is impossible without the audience participating.

    • Chris Orcutt says:

      I think it will be really exciting if you guys can open it up to librarians and archivists at large. You could turn this into an international campaign to preserve data and to draw more attention to the work the Internet Archive does.

  6. TPRJones says:

    If only you could turn metadata tagging on the internet archive into a facebook game, you could have it done in no time flat. As well as making a ton of micro-donations to the Archive if you monetize your FTP metadata game by selling in-game items.

    But more seriously, are you considering some smartphone apps to allow easy and quick access to reviewing and tagging? Rather than trying to get people to sit down for an evening of metadata production, I bet selling it as something anyone can spend their ten minutes on the bus doing to help would work well, if you can make it a smooth experience to get into and out of quickly.

  7. ted says:

    as far as ease of navigation and design goes, i’d use as a template. an overwhelming resource of text/audio/video, much like internet archive, but extremely well laid out in terms of visual presentation. plus, the guest curator feature on the right-hand side is an interesting touch that might work for internet archive.