(I do not speak for my employer. I am just very loud.)
Pretty much any organization of any size has certain themes, beliefs and outlooks baked into them. Some of them might be obvious from the outside. Others are so inherent that the members might not even notice they’re completely steeped in it.
At the Internet Archive, there’s a philosophy set about access and acceptance of materials and presentation of said materials that’s pretty inherent throughout the engineering and the website. Paraphrased, in my own words, it’s this:
- Always provide the original.
- Never ask why a user wants something.
- Now is better than tomorrow.
- We can hold it.
- Many inexpensively is better than one or none luxuriously.
- Never send a person where a machine can go.
- Enjoy yourself.
Some of this exhibits itself in how people use the site – they can grab anything, they can get a “library card” account but they don’t have to, and they can embed or direct-download anything they want. While the machines will derive out versions of the content, you can always find that massive .AVI, .PDF or .WAV that the content came from. They don’t keep user logs to any real degree. They don’t get in the way.
Internally, the rest shows itself in engineering and code – use commodity hardware which will break more often but which can be bought in much greater amounts instead of “Ol’ Trusty” that’s intended to work for five years without fail and “Ol’ Trusty” is all we have because we can’t afford more. The code will put an item up before it’s fully “baked”, that is, you’ll see the original .AVI file for a video item and maybe 20-60 minutes later, another derivation will show up, and then maybe another one after that. This sliding window of material population really confuses the end users in some cases, I’m sure. But it means you get it now, now, now, instead of when it’s all wrapped in a bow.
As things currently stand, and based on my now three years (!) of working for the organization and going out into the world to speak about the place and get feedback, the resulting good and bad of this approach is this:
- Good: Nobody is doing what we’re doing in many cases, we have so much stuff, every time I wander there I lose an evening walking the stacks.
- Bad: The site looks like poop, and it’s pretty hard to find the stuff.
So, to get out ahead of “poop look”, efforts are underway to redesign the site, and what I’ve seen, I really like. That’s all I’ll say because it’s not my project.
Regarding finding material and there being stuff, I think the priorities of the Archive have been really firm and right-minded: get the stuff first, quibble on accessibility or presentation later. Turning things away is how tragedy happens. What’s worse – something was taken in and put into a big storehouse? Or something was offered, and because it failed to have a MARC record or a metadata post-it-note on the outside of the archival-quality container file, it was sent back out into the night?
But the real miracle, the one that is perhaps really not obvious from the outside, is how much of the Internet Archive’s work is done by machinery and code.
When an item is uploaded, the user can designate and mention all sorts of aspects of what was sent in – the title, the description, when it was made, who made it, and a bunch of other interesting data attributes. The format allows a lot of extension, so if you want to indicate which of the 300 audio files you uploaded have dog sounds and which ones are recorded using a specific type of microphone, you can do that. It might not mesh with other items all that well, but that’s not your problem – you’re adding things that a machine might not ever know.
But a set of machines at the Archive do know a lot about your item, and will do work to add it all and create other versions of your item. For example, you can upload a .zip file of .jpeg images, and if you happen to name it *_images.zip, it will create a .PDF file of it, an OCR’d version of any text in it, and an animated GIF file of the pages. With movies, it will take a massive .AVI and it will create a thumbnail set, a web-ready version (if it can), and so on. And bear in mind, this collection of tests is massive – it tries to determine the average pixels per inch, the orientation of texts, the framerate of the video, the number of tracks in a collection of MP3s and if there’s any tagging built in. It does a lot. And most importantly, with zero human intervention.
By “controversy”, of course, I mean “people murmuring under their breath in the area of disciplines the Archive overlaps with”. Other organizations and practicioners of the arts of archiving, you see, have their own baked-in philosophies and credos, spoken and unspoken. And they don’t exactly see eye to eye.
Some I’ve encountered and observed:
- Machines can’t beat people.
- Zero metadata beats inaccurate metadata.
- Digital is a Cult. Physical is a Truth.
- Another six months won’t hurt.
- Who are you and why do you want this?
- Pick a format, document it utterly, and use it forever.
- Justify, Justify, Justify
There’s many more. Some come from policy, some from personality, and some from how people are brought up into the discipline. We’ve destroyed the term “disruptive” as being meaningful in discussions, but the concept that a new outlook or idea could fundamentally change the nature of the realm it is part of is still quite valid. To some extent, the Internet Archive is an upending of century-old approaches, while still loving and promoting the shared beliefs:
- Our history depends on our artifacts and writings.
- Education without context is flimsy and transient.
- Reading is fundamental.
- What happened before is important tomorrow.
- Humanity is worth the trouble.
In my capacity in outreach, I find myself in a lot of conferences, restaurant tables, hallways and sidewalks talking to people who believe in these shared beliefs but don’t buy 100% into what the Archive is up to. They question whether a Robot Army is the way to do this inherently human activity, that of cataloging and classifying, of summarizing and representing.
The problem, ironically, is that people think of it as binary: all machine, all people.
Where we are now, the machine takes a rough stab and occasionally a refined stab at what comes in the front door. It will try to OCR the text, it’ll figure out the orientation or how many pages or what baked in records exist in the digital object, and it will report those. It would also appreciate your input as uploader, thank you very much, but it doesn’t stop dead waiting for you, either.
To this end, the resulting output, especially the machine-generated side, is not perfect. But most importantly, it can be overruled. Always. It can always be shoved aside as “that’s not perfect, this is perfect”, but the amount of items getting that “perfect” treatment are going to always be a small percentage of total. They just are.
So, this week, I was working on a way to make the endless piles of texts on the Archive more accessible. The solution I cooked up was to take the OCR’d text generated for all “texts” classified objects, throw them into a word frequency generator, remove the obvious stupid ones, and put that up into the Archive. That actually has worked out pretty well.
It’s not, perfect, of course. Never perfect. But here’s what it returned (and put up) in 10 seconds of analysis on a 945 (!) page book on Architecture:figure; landscape; design; standards; soil; concrete; architecture; water; surface; aggregate; landscape architecture; saver standards; asphalt concrete; tor landscape; standards tor; water table; water level; standards lor; lor landscape
The “standards lor” stuff doesn’t fly – it’s an error. But the vast, vast majority of it is what a person might reasonably need to know “what the hell is this book about”. You can make decisions in a very short time if this is the book you want to browse through. You have more information than you had before.
Similarly, you can probably guess what these books are about from the keywords:software; ibm; computer; graphics; apple; color; disk; program; commodore; game; hard drive; hard disk; word processor; disk drive; megabyte hard; deluxe paint; sale price; retail price; public domain moog; modular; output; arturia; modulation; input; filter; frequency; manual; sequencer; moog modular; modulation input; connection jack; key follow; low pass; keyboard follow; audio output; audio input; input connection; trigger input iso; wedding; lovegrove; julie; bride; pictures; chapter; shoot; shot; picture; wedding day; wedding photography; light matters; healthy profits; business strategies; wedding photographers; opposite figs; finoncial mastery; exposure compensation
Again, perfect? No. But each of these was generated, automatically, and without a miserable intern or low-paid person doing a job that would probably never be funded in the first place. But those keywords tell you a lot, and they’re getting the job done, even if you have to keep an eye out for what exactly “finoncial mastery” is.
And frankly, nothing stops the addition of a second set of scripts for quality control, that provide lists of all the generated tags and allowing a person to go “that one doesn’t look quite right” and to have it taken away. The difference is, now it’ll be one person overseeing hundreds or thousands of items at once, using the brainpower so that in one weekday they will do more resulting work than a year of the most highly-trained, perfect and precious professional dedicated to metadata entry. And in the case of the Issue of “Compute!” Magazine, the Moog Synthesizer Manual, and the Professional Wedding Photographer book above, you’ll get what you need, now now now.
And as a side note: I love this is what my mind is being used for. I love that I work for a place where this sort of thinking is what is needed. And I love what the result of this effort is – a place where millions of items are flying out the front door every single day, spirited away for a thousand reasons, and making the world a better place.
I can’t imagine doing anything else. Keyword: “happiness”.