The Robot Army of Good Enough — May 13, 2014

(I do not speak for my employer. I am just very loud.)

Pretty much any organization of any size has certain themes, beliefs and outlooks baked into it. Some of them might be obvious from the outside. Others are so inherent that the members might not even notice they’re completely steeped in them.

At the Internet Archive, there’s a set of philosophies about access, acceptance of materials, and presentation of said materials that runs throughout the engineering and the website. Paraphrased, in my own words, it’s this:

Always provide the original.
Never ask why a user wants something.
Now is better than tomorrow.
We can hold it.
Many inexpensively is better than one or none luxuriously.
Never send a person where a machine can go.
Enjoy yourself.

Some of this exhibits itself in how people use the site – they can grab anything, they can get a “library card” account but they don’t have to, and they can embed or direct-download anything they want. While the machines will derive out versions of the content, you can always find that massive .AVI, .PDF or .WAV that the content came from. They don’t keep user logs to any real degree. They don’t get in the way.

Internally, the rest shows itself in engineering and code: use commodity hardware, which will break more often but can be bought in much greater amounts, instead of an “Ol’ Trusty” that’s intended to work for five years without fail and is all we have because we can’t afford more. The code will put an item up before it’s fully “baked”; that is, you’ll see the original .AVI file for a video item and maybe 20-60 minutes later, another derivation will show up, and then maybe another one after that. This sliding window of material population really confuses the end users in some cases, I’m sure. But it means you get it now, now, now, instead of when it’s all wrapped in a bow.

As things currently stand, and based on my now three years (!) of working for the organization and going out into the world to speak about the place and get feedback, the resulting good and bad of this approach is this:

Good: Nobody is doing what we’re doing in many cases, we have so much stuff, every time I wander there I lose an evening walking the stacks.
Bad: The site looks like poop, and it’s pretty hard to find the stuff.

So, to get out ahead of “poop look”, efforts are underway to redesign the site, and what I’ve seen, I really like. That’s all I’ll say because it’s not my project.

Regarding finding material and there being stuff, I think the priorities of the Archive have been really firm and right-minded: get the stuff first, quibble on accessibility or presentation later. Turning things away is how tragedy happens. What’s worse – something was taken in and put into a big storehouse? Or something was offered, and because it failed to have a MARC record or a metadata post-it-note on the outside of the archival-quality container file, it was sent back out into the night?

But the real miracle, the one that is perhaps really not obvious from the outside, is how much of the Internet Archive’s work is done by machinery and code.

When an item is uploaded, the user can designate and mention all sorts of aspects of what was sent in – the title, the description, when it was made, who made it, and a bunch of other interesting data attributes. The format allows a lot of extension, so if you want to indicate which of the 300 audio files you uploaded have dog sounds and which ones are recorded using a specific type of microphone, you can do that. It might not mesh with other items all that well, but that’s not your problem – you’re adding things that a machine might not ever know.

But a set of machines at the Archive do know a lot about your item, and will do work to add it all and create other versions of your item. For example, you can upload a .zip file of .jpeg images, and if you happen to name it *_images.zip, it will create a .PDF file of it, an OCR’d version of any text in it, and an animated GIF file of the pages. With movies, it will take a massive .AVI and it will create a thumbnail set, a web-ready version (if it can), and so on. And bear in mind, this collection of tests is massive – it tries to determine the average pixels per inch, the orientation of texts, the framerate of the video, the number of tracks in a collection of MP3s and if there’s any tagging built in. It does a lot. And most importantly, with zero human intervention.
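
As a rough illustration of how a derive step like this might decide what to produce, here is a minimal sketch. The rule table, the function name plan_derivatives, and the derivative labels are all hypothetical. Only the *_images.zip and .AVI behaviours come from the description above, and this is not the Archive's actual code.

```python
import fnmatch

# Hypothetical rule table: filename pattern -> derivatives to attempt.
# The labels are invented for illustration only.
DERIVE_RULES = [
    ("*_images.zip", ["pdf", "ocr_text", "animated_gif"]),
    ("*.avi", ["thumbnails", "web_video"]),
]


def plan_derivatives(filename):
    """Return the derivatives a machine would try for one uploaded file."""
    name = filename.lower()
    for pattern, derivatives in DERIVE_RULES:
        if fnmatch.fnmatchcase(name, pattern):
            return list(derivatives)
    # Unknown formats are still stored; there is just nothing to derive.
    return []
```

The original file is never replaced in this sketch. The plan only lists extra versions to build alongside it, which matches the "always provide the original" rule.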

And here’s where the “controversy” happens.

By “controversy”, of course, I mean “people murmuring under their breath in the area of disciplines the Archive overlaps with”. Other organizations and practitioners of the arts of archiving, you see, have their own baked-in philosophies and credos, spoken and unspoken. And they don’t exactly see eye to eye.

Some I’ve encountered and observed:

Machines can’t beat people.
Zero metadata beats inaccurate metadata.
Digital is a Cult. Physical is a Truth.
Another six months won’t hurt.
Who are you and why do you want this?
Pick a format, document it utterly, and use it forever.
Justify, Justify, Justify.

There are many more. Some come from policy, some from personality, and some from how people are brought up into the discipline. We’ve destroyed the term “disruptive” as being meaningful in discussions, but the concept that a new outlook or idea could fundamentally change the nature of the realm it is part of is still quite valid. To some extent, the Internet Archive is an upending of century-old approaches, while still loving and promoting the shared beliefs:

Our history depends on our artifacts and writings.
Education without context is flimsy and transient.
Reading is fundamental.
What happened before is important tomorrow.
Humanity is worth the trouble.

In my capacity in outreach, I find myself in a lot of conferences, restaurant tables, hallways and sidewalks talking to people who believe in these shared beliefs but don’t buy 100% into what the Archive is up to. They question whether a Robot Army is the way to do this inherently human activity, that of cataloging and classifying, of summarizing and representing.

The problem, ironically, is that people think of it as binary: all machine or all people.

Where we are now, the machine takes a rough stab and occasionally a refined stab at what comes in the front door. It will try to OCR the text, it’ll figure out the orientation or how many pages or what baked-in records exist in the digital object, and it will report those. It would also appreciate your input as uploader, thank you very much, but it doesn’t stop dead waiting for you, either.

To this end, the resulting output, especially the machine-generated side, is not perfect. But most importantly, it can be overruled. Always. It can always be shoved aside as “that’s not perfect, this is perfect”, but the number of items getting that “perfect” treatment is always going to be a small percentage of the total. It just is.
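
The "it can be overruled" rule can be sketched in a few lines. This is an assumption about how such a merge could work, not a description of the Archive's metadata code, and the function name and field names are hypothetical.

```python
def merge_metadata(machine, human):
    """Combine machine guesses with human-supplied fields.

    Human values always win; machine values fill whatever is left blank.
    """
    merged = dict(machine)
    for key, value in human.items():
        # An empty human field is treated as "no opinion", not as an override.
        if value not in (None, "", []):
            merged[key] = value
    return merged
```

Because the machine's guesses are only a starting layer, nothing has to wait for a person, and nothing a person says is ever lost to the machine.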

So, this week, I was working on a way to make the endless piles of texts on the Archive more accessible. The solution I cooked up was to take the OCR’d text generated for all “texts” classified objects, throw them into a word frequency generator, remove the obvious stupid ones, and put that up into the Archive. That actually has worked out pretty well.
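
For the curious, here is a minimal sketch of that kind of word frequency generator, assuming plain OCR text as input. It is not the script that runs at the Archive: the stop list, the function name extract_keywords, and the cutoffs are all made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop list; a real one would be much longer.
STOPWORDS = {
    "the", "and", "of", "to", "a", "in", "is", "it", "that", "for",
    "on", "with", "as", "are", "be", "this", "by", "or", "an", "at",
}


def _keep(word):
    # Drop stopwords and very short tokens, which are often OCR noise.
    return word not in STOPWORDS and len(word) > 2


def extract_keywords(text, top_words=10, top_pairs=10):
    """Return the most frequent words, then the most frequent word pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    singles = Counter(w for w in words if _keep(w))
    # Only count pairs of adjacent words where both words survive filtering.
    pairs = Counter(
        f"{a} {b}" for a, b in zip(words, words[1:]) if _keep(a) and _keep(b)
    )
    keywords = [w for w, _ in singles.most_common(top_words)]
    keywords += [p for p, _ in pairs.most_common(top_pairs)]
    return keywords
```

Joining the result with "; ".join(...) gives a single keyword line. A stop list only knows correctly spelled words, so a misread word sails straight through it and into the output.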

It’s not perfect, of course. Never perfect. But here’s what it returned (and put up) in 10 seconds of analysis on a 945 (!) page book on Architecture:

figure; landscape; design; standards; soil; concrete; architecture; water; surface; aggregate; landscape architecture; saver standards; asphalt concrete; tor landscape; standards tor; water table; water level; standards lor; lor landscape

The “standards lor” stuff doesn’t fly – it’s an error. But the vast, vast majority of it is what a person might reasonably need to know “what the hell is this book about”. You can make decisions in a very short time if this is the book you want to browse through. You have more information than you had before.

Similarly, you can probably guess what these books are about from the keywords:

software; ibm; computer; graphics; apple; color; disk; program; commodore; game; hard drive; hard disk; word processor; disk drive; megabyte hard; deluxe paint; sale price; retail price; public domain

moog; modular; output; arturia; modulation; input; filter; frequency; manual; sequencer; moog modular; modulation input; connection jack; key follow; low pass; keyboard follow; audio output; audio input; input connection; trigger input

iso; wedding; lovegrove; julie; bride; pictures; chapter; shoot; shot; picture; wedding day; wedding photography; light matters; healthy profits; business strategies; wedding photographers; opposite figs; finoncial mastery; exposure compensation

Again, perfect? No. But each of these was generated, automatically, and without a miserable intern or low-paid person doing a job that would probably never be funded in the first place. But those keywords tell you a lot, and they’re getting the job done, even if you have to keep an eye out for what exactly “finoncial mastery” is.

And frankly, nothing stops the addition of a second set of scripts for quality control that provide lists of all the generated tags and allow a person to go “that one doesn’t look quite right” and have it taken away. The difference is, now it’ll be one person overseeing hundreds or thousands of items at once, using their brainpower so that in one weekday they will get more done than a year of work by the most highly-trained, perfect and precious professional dedicated to metadata entry. And in the case of the issue of “Compute!” magazine, the Moog synthesizer manual, and the professional wedding photographer book above, you’ll get what you need, now now now.
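
A quality-control pass like that could look something like the sketch below. No such script is shown in this post, so everything here is hypothetical: the function names, the item identifiers, and the guess that keywords appearing on only one item deserve a look first.

```python
from collections import defaultdict


def build_review_list(item_keywords):
    """Invert {item: [keywords]} into (keyword, [items]) pairs, rarest first.

    Assumption: keywords seen on only one item are the likeliest OCR
    errors, so they are listed first for the reviewer.
    """
    index = defaultdict(list)
    for item, keywords in item_keywords.items():
        for keyword in keywords:
            index[keyword].append(item)
    return sorted(index.items(), key=lambda kv: (len(kv[1]), kv[0]))


def reject_keywords(item_keywords, rejected):
    """Return a copy of the mapping with every rejected keyword removed."""
    rejected = set(rejected)
    return {
        item: [k for k in keywords if k not in rejected]
        for item, keywords in item_keywords.items()
    }
```

One rejection applies to every item carrying that keyword, which is where the one-person-over-thousands-of-items leverage comes from.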

And as a side note: I love that this is what my mind is being used for. I love that I work for a place where this sort of thinking is what is needed. And I love what the result of this effort is – a place where millions of items are flying out the front door every single day, spirited away for a thousand reasons, and making the world a better place.

I can’t imagine doing anything else. Keyword: “happiness”.

Categorised as: Internet Archive

8 Comments

Jason Scott says:

May 14, 2014 at 12:48 am

Sidebar:

This is the same philosophy that inhabits the book scanning operation of the Internet Archive. Using Canon cameras to take photos of books under glass from a distance will never be the same as splitting apart the book, bringing it in under a $50,000 scanner, and using a team of forensic professionals to get the be-all end-all 5000dpi scan of the material. Then again, for 99% of what is being scanned, you don’t need all that, either.

Instead, the Archive does a non-destructive scan of books, and stores the books. For most purposes (like information transfer and verification), this 300-500dpi scan is quite enough and does the job well. And they can take in two pages every 3-5 seconds using the current method. More books, more materials, faster, and less expensive. It works well.

When someone, say a film company or a researcher, suddenly discovers that a specific book or image from some page is in high demand or has particularly deep value, then the scanned online book works as a preview, and a guide of what to ask for for the super-quality scan. Then the super-quality scan can be done at that time. If every page was treated this way, it’d take a couple days to do a book, instead of the dozens or even 100 books that can be done in a week.

It took me a while to buy in, but I’ve bought in. It’s the way to go.

E[X] says:

May 14, 2014 at 7:21 am

I wish the same philosophy were followed at the Computer History Museum. They acquired a lot of materials on Engelbart’s NLS 7 years ago, but none of it is available through the internet (AFAIK).
It’s a shame that such an important part of the history of computers is not on the internet.

Michael. says:

May 14, 2014 at 1:49 pm

You’ve convinced me. 1. Something is better than nothing. As robots can provide something far quicker and easier than people can, then: 2. Robots are perfectly fine for providing that something. 3. Where something is not sufficient, the original still exists, and people come along and make that something better.

And, if you do the software on the web-side right, you get the added: 4. Put it on the web, and people will fix it for you. See also the NLA’s Trove digitised newspaper, where anyone can come along and suggest fixes for the OCRed text. You don’t even have to trust people. You can leave the bad OCRed text there, and wait until you get enough confirmations before you actually fix it. It’s amazing.

My only concern is that you potentially end up with all this material at one location, with all the potential problems that centralisation can mean. And because it’s so many terabytes and petabytes, it’s almost impossible to make a good copy that can be stored on the other side of the continent (or even on another continent). But I’d be happy to be wrong on this issue.

Jason Scott says:

May 14, 2014 at 2:28 pm

The point of the way things are set up on Archive.org is that if people see data they are really nervous about the longevity of, they can easily grab a copy to store elsewhere. The process for mirroring collections or items at the site is actually rather simple.

I do worry we don’t have enough people doing this.

Simon Crowley says:

May 14, 2014 at 4:20 pm

I’m reminded of an Edward R. Murrow quote: “Difficulty is the excuse history never accepts.” You’re doing good work, and I’m glad you’re doing it.

daggar says:

May 16, 2014 at 2:24 pm

As a symbol of this dedication, Mr. Scott heads the post with original source data – a seven megabyte photo unencumbered by reduction or optimization algorithms.

ericnystrom says:

May 20, 2014 at 6:50 pm

Great idea with generating keywords from the Internet Archive’s OCR’d text. I’ve been taking that approach with the 9CHRIS project (http://9chris.org) which tries to help users find relevant documents in a set of historic records, briefs, and transcripts bound together in volumes digitized by the Internet Archive. It’s certainly not a perfect approach, but it’s far more accessible with the automated keywords than without, and “pretty good now” beats “best, later” almost every time.

infinitelyremote says:

June 16, 2014 at 1:13 pm

I love that YOU are there too, Jason and that you have taken the time to lift the veil a bit on one of the most important resources on the Internet. I am a firm believer in “if you can get a computer to do the job then do so.” The whole “not ready for prime time” is for the birds. So what if an archiving system is less than perfect?… Most folks are content knowing that Billy’s 4th birthday party pictures are on drive D. If they need to retrieve them bad enough they will.

To put it another way: on moving day, write the name of the room on boxes when packing them with items – no need to waste precious time listing contents and colors. At the new location, if you need a colander, look for the box marked kitchen. When you open the box and find a pair of socks along with the colander, throw the socks into your new bedroom dresser – you don’t have to put them in the box marked bedroom, because you’re home now.

anyway… all the best!
