Scanning: Some Thoughts —
If perfect is the enemy of done, scanning has whipped out a switchblade and is standing next to perfect, ready to kick done’s ass.
Scanning is probably the number one thing I deal with, in terms of a process that folks can do themselves and which everyone has completely wild ideas about. How to scan, what to scan, and why scanning should be done. It’s a morass, a mess, a jungle of weird social mores and beliefs and anger and delight.
Before I go super-depth into this subject, I’ll take the approach of a quick pop-science list so you can grab that and run into the night, if that’s all you were looking for.
Here’s a quick scanning primer:
- If you’ve scanned something nobody has before, you win.
- If you scan something somebody has before, but better, you win.
- If you scan something as well as somebody before but add metadata, you win.
- If your scanned, metadata-laden collection of something is browsable/downloadable, you win.
- If you scan something worse than what’s out there, you lose.
- If you scan something and destroy it in scanning it, you lose.
- If you refuse to scan something because you’re scared, you lose.
Or, as Michael Pollan would say, “Scan things. As much as possible. Mostly ephemera.”
Let me spend the rest of this going into what I mean about all this.
Life found me in 2008-9 in the basement of Steve Meretzky, who at the time still lived in Massachusetts in an absolutely lovely home that his kids had moved out of, and which, ultimately, he and his wife Betty Rock sold to move out west. (Sadly, the family that bought it tore it to the ground.) In the basement, Steve had collected years of material connected to his game-making history, and the history of several companies he’d worked at, including Infocom and Boffo. And I mean, he’d collected EVERYTHING – memos, scraps of paper, sketches, maps, even technical printouts, ads for materials tangentially related to game making, and invitations to parties and events he was running. He’d put them in binders and he kept the binders in a shelf. As I was making a documentary about subjects related to Steve, I asked if I could scan them to some amount.
Initially, I was in the basement. I sat down there for hours and scanned and scanned. Eventually, Steve trusted me enough to lend me the binders for an extended period of time, and I drove them home, these precious one of a kind materials, and I set up a scanning station in my office. All in all, I scanned a pile of these binders, with something like 9,000 pages in total.
I did NOT scan every single scrap of paper. I definitely did not scan all his binders.
The binders are now in the care of Stanford University, who have them available but charge for various types of access. I understand why they do this. I also know they have no immediate plans to scan them.
To scan, I used a $350 Epson scanner. The scanner was hooked up to a Windows laptop running Vuescan software. I scanned these documents at 600 dots per inch (dpi), which is relatively high. I scanned them into .TIFF files, which is a lossless format. I also made .jpeg versions on the fly as I scanned, so that there were easier-to-browse (but lossy) versions of the pages. It worked out to many gigabytes of material.
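For a rough sense of what "many gigabytes" means, here is a back-of-the-envelope sketch. The page size (US letter, 8.5 x 11 inches), the 24-bit color depth and the lack of any compression are all my assumptions, not details of the actual scans, so treat the result as an upper bound rather than a measurement.

```python
def uncompressed_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Raw size of one scanned page, before any (lossless) compression.

    bytes_per_pixel=3 assumes 24-bit RGB; a grayscale scan would use 1.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Assumed page: US letter at 600 dpi, 24-bit color.
page = uncompressed_bytes(8.5, 11, 600)

# Roughly 9,000 pages, per the count given earlier in this post.
total = page * 9000

print(f"{page / 1e6:.1f} MB per page")
print(f"{total / 1e9:.1f} GB for 9,000 pages")
```

Under those assumptions a single page is about 101 MB raw and the whole pile comes in under a terabyte. Lossless TIFF compression such as LZW would normally bring that down, by an amount that depends on the page.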
Now, the reason these aren’t public yet is purely a case of decency – I need to go through the pages, find every time a person is mentioned, and assemble a version of the pages with that person in them to get signoff from that person. So I’d assemble everything from or mentioning Dave Lebling, and make a multi-dozens or multi-hundreds set of Dave Lebling pages for him to consider OK. When this project happens, all the material I scanned will go on archive.org.
600 DPI is pretty intense. Here’s a scan from a page:
That’s graph paper, which Steve used to do maps of his work. From this dpi, it’s possible to get a very nice closeup shot for, say, a documentary, which is why I did this work in the first place. GET LAMP has a bunch of intensely close shots of design documents, and you go in close enough that you can see indentation and pen strokes, not to mention tiny flaws in printer output and individual dots in color magazines.
Let’s step back and study this non-hypothetical hypothetical.
I’ve definitely taken a bunch of this material from the realm of “will never be seen” to “may likely be seen by fans and interested parties” by scanning it – someone could go to Stanford (or Steve’s house, way back when) and get to look at it as well, but that was a very small number of people compared to who might see it now. So, that’s good.
I scanned it at 600dpi. This means that it’s possible to get pretty close to the material in question and zoom in if text is small, or if you want to do a high-definition photo or screenshot of the material.
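To put a number on "pretty close": the sketch below works out how much paper fills a video frame at one scan pixel per frame pixel. The 1920x1080 frame size is my assumption for illustration, not necessarily what any given documentary was shot or edited in.

```python
def inches_covered(frame_px, dpi):
    """Inches of paper that span frame_px pixels at a given scan resolution."""
    return frame_px / dpi


# Hypothetical HD frame; swap in whatever your project uses.
FRAME_W, FRAME_H = 1920, 1080

for dpi in (300, 600):
    w = inches_covered(FRAME_W, dpi)
    h = inches_covered(FRAME_H, dpi)
    print(f"{dpi} dpi: frame shows {w:.1f} x {h:.1f} inches of paper")
```

At 600 dpi the frame can sit on a patch of paper about 3.2 by 1.8 inches without any upscaling; at 300 dpi that same frame has to cover twice the width and height, so the tightest clean closeup is only half as tight.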
I put it in TIFF, meaning that the scan I did was not translated down into a “good enough” but low disk-space version of the material – you will not get the lossy artifacts that JPEG introduces when you zoom in. Image zoom routines might even give a pretty good illusion of higher resolution. The tradeoff in terms of space consumption is minimal in the era of the sub-$50 terabyte hard drive.
And, of course, by scanning these I made sure that information-wise, these pieces of paper and information about sales figures, design documentation, interoffice memos and whatever else are not left on a single piece of paper in a dude’s house – Stanford has a copy of the scans I did, and I have them in a number of other locations as well. So they’re “saved” or “preserved” by some definitions of the word, and definitely the ones that people use when they find out something is just lying around somewhere, like, say, a dude’s basement.
(As an aside, these binders were already getting moldy – I swapped out the binders and Steve inquired about why I did that, until I showed him how the old ones looked compared to the new ones, and he was quite glad to see the old ones thrown out.)
Let’s step to the side and talk about everything that’s terrible.
First of all, I’m using a flatbed scanner, and flatbed scanners definitely can have dust, hair, imperfections in the glass, all of which might lead to some of the images being a little bit weird looking. I might have pressed a thumb against something, and if I didn’t notice it for a while, that print might be discerned here and there if you know to look for it. In other words, a person touched this all in a non-sterile environment. You do your best – you wipe the glass, make sure to have alcohol nearby, but it’s just not perfect.
Next, 600 dpi is awesome for some things, but to throw out the originals would be a crime. I didn’t get all the items, I possibly missed the backside of a two-sided document, I didn’t mark down the metadata about how the notebooks were arranged… a whole set of information would be lost by going about it that way. And you never know when you need to zoom in a little farther…
Finally, a TIFF is not paper. A scan is not the object. The photo of something leaves you a couple steps away from the original item, no matter how good the lighting, how true the color, how right the process.
The fact is, almost any scan can be ripped apart depending on your level of standards, and what you think a scan needs to represent. In the item above, a certain level of people are delighted to be able to see photographs of an Atari 400 and related products, the fact it came from Sears, the prices for everything, the launch/available titles, and, maybe, the unique grille of the television set the example machine is connected to. This is all information and if you didn’t have the catalog this came from, you just got some great information.
Or you can concentrate on the smaller resolution, the fact you can see ghosting from the other side of this thin catalog page, and the yellowing/discoloring. There’s likely a dozen other optical and arrangement flaws I’m missing, but they’re there. For some, these are incalculably fatal mistakes. It moves this from the realm of captured history to cheap trinket.
Let’s break it down further.
As I said above:
- If you’ve scanned something nobody has before, you win.
Maybe this isn’t the ideal specimen. Assuming they didn’t destroy the catalog to get this scan, it’s still a great thing they did it. There’s information aplenty in this thing, it proves the page existed, it proves the catalog existed, that Sears sold these, and it gives you a range of pointers to dig deeper, should the mood or necessity strike you.
Of course it could be done better – you could scan it at ludicrous DPI. You could do multiple scans at various contrast/brightness levels, and recompose a top-quality version (that maybe never existed in reality, mind) that would be suitable for a poster or an art book. You could scan the other side, and using some amazing algorithmic mojo that probably killed some graduate student’s year remove all trace of the back page from this scan. You could even painstakingly RE-DO the page from scratch, using this as an artistic guide while you vector this to perfection, meaning near-infinite zoom capability.
The flaw, the miserable mistake that I’ve now seen over the years, is the people who think that this scan, not being perfect, should not be done.
I get the mails. I have the conversation. The people who sit on items that are important, that have historical heft, who are waiting for some mythical moment in time when the ability to scan something perfectly every time, conducted by themselves or by some institution paying for it, will ensure the majesty of the item be maintained forever. I know these people. They are among my people.
The most potent of the arguments they have against doing ‘okay’ with scanning is that the ‘okay’ scan will flood out any future attempts to scan some material, because someone “did it” and nobody will want to do it anymore. The craptastico initial version will be the “winner” and that’ll be that, the history is lost.
I happen to think this is the position taken by lonely competitive personalities.
It’s faith-based either way, since it relies on actions not yet taken or actions not yet avoided, but there’s plenty of examples of re-doing something so that it’s a better version than was before, or taking an extant item and remixing it into a more complete or contextual experience. I happen to think that doing an ‘okay’ scan (without doing an intentionally poor scan) is an excellent first step – as long as it’s paired with the approach of non-destruction.
- If you scan something and destroy it in scanning it, you lose.
I have never seen two parties come to a conclusive agreement if one of them is a bearded nerd going “NO, NO, NO!”. But I will say that passion as regards destruction of an item is an understandable reaction. In many cases, items are bound, glued, stapled, attached, and otherwise not compatible with scanners as we tend to deal with them. There’s a decision tree there: Get a good scan as best you can without wrecking it, meaning some information doesn’t make it? Or do you destroy the pages, cut them up, remove staples, and end up with a broken not-quite-awesome pile of what used to be the item, and better scans?
Well, there’s subtlety at work here, and nerds don’t always do subtle. There’s definitely the ideal of a “frame off restoration” in the realm of cars, where you take the car completely apart, down to the frame, and fix the frame (maybe even fabricating a new one) to eventually end up with a better-than-new car at the end. Similarly, if you’re seriously dealing with an item that is so important to capture every little detail (say, the first Worldcon program or a prototype magazine that came from a publisher’s private collection), then you’re in a realm of recovery and intensity that is likely being accompanied by funding or donations anyway.
Or, another way to look at it, is that it’s “just a magazine” or “just a book”, and so pulling it apart to get a better scan might be a willing sacrifice for you. Certainly many of these materials have been thrown away, with that person doing nothing to save it. One being gutted to bring an item online might, with that perspective, be worthwhile.
It puts me in a precarious moral position, but I’m used to those: I do not like that it’s done. I will regardless take items in which this was done and use them. If one thinks of the destruction as being inevitable, then destruction plus scanning almost makes sense, and certainly scanning plus availability is the best of a poor situation.
But that’s the core issue of this entire scanning situation. I’ll limit it to what we call the “vintage computing” culture, but the scanning of these older technical materials amounts to an altruistic act in service of a murky future. It is not clear what people will use these for, or what part they will play in lives, or how valuable the information being saved is. It is a gamble, a shot in the dark, and the question becomes, quickly, how important is it for these items to be “perfect”, or, again, whether we will be satisfied with digital copies of material without any remnant of the analog, real-world materials (paper, floppies, cassettes) they came from.
To some extent, we’re in good shape – nobody has issued a concerted effort to wipe computer history from the face of the earth, nobody has banned it, nobody murders the people behind it, and the items, while experiencing dips and valleys in perceived value, are generally considered to be “neat”, i.e., worth keeping in the attic a few more years.
There are a lot of copies of magazines out there, especially the big ones that people remember like BYTE, Creative Computing, Compute!, A+, and so on. If it handled home computers, or video games, chances are there’s quite a few copies out there and a person who issues a concerted enough effort and is willing to outlay a bit of cash in various silly directions could get a complete set. That’s not the situation with, say, event programs from computer and hacking and vintage festivals. It’s definitely not the situation with corporate memos or warranty cards or identification badges. The more we move away from “periodical”, the murkier it gets. And when dealing with people, I find a lot of folks put value on and understand the meaningfulness of “magazines” or “newsletters” and less on, say, the free bookmark from a prominent computer store that existed in the 1970s.
This is why, like I said above, scanning ephemera, which doesn’t usually have a binding and which can fit in a flatbed scanner, is generally better for people to be scanning than books and magazines. In general. If you want to “help”, that’ll “help” more than anything. 600+ DPI. Use TIFF or another lossless format. Look it up.
But what if you want to scan books and magazines anyway?
In terms of the best-case situation for scanning a book without destroying it, I am, perhaps, luckier than most. I have a $25,000 book scanner in my house.
Last year, I requested, and got, one of the Internet Archive-created book scanners. They’re called Scribes, and they’re a masterwork of metal, glass, optics and mechanisms designed to allow easy scanning of books.
I have it installed in a room in my house. To my great shame, this year has been very, very busy and it is only recently we have it functioning to the point that I can really begin scanning books in earnest. But here it is, and it’s ready to take in books.
It has been incredibly informative to see how Internet Archive (and the books group there) have approached scanning of books, and the different advantages and disadvantages of this approach.
The Internet Archive Scribes are a great example of choosing what’s important to you and going with it, even if other stuff has to take a bit of a hit. The Scribe presents a v-shaped holder that a book is cradled on, to which two high-quality cameras take a photo of each page at the same time. The resulting pages are then stored on the server, and you turn the page of the book, and then do it again. Here’s what a station looks like:
The advantages are this: You can scan a book really goddamned quickly. It’s possible to do 1000 pages an hour if you’re really on top of your game and the book isn’t a finicky mess. 1000 pages an hour! No joke. You can blow through a standard 200-300 page book in about 10-20 minutes or less if you’re (again) lucky. The V-shaped cradle means you go RIGHT up to the binding of the book and you do not break the binding to do so. In other words, the original item is not destroyed.
To do this, again, a high-quality camera is taking the photo – but it is, after all, a camera some distance away. DPI can be between 300 and 600, and it will never be as good as putting it into a flatbed and letting that insane camera element slowly drift across the page, pulling in every last optical feature from a half-inch away. But it will be very good, and you will get a hell of a lot of books doing it this way. The Internet Archive is able to add a new book up every ninety seconds, 24/7, using this method.
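If you want to check the speed claims above, the arithmetic is short. This is just a sketch using the figures quoted in this post (1000 pages an hour at best, one book every ninety seconds); the function names are mine, and real sessions will be slower once finicky books and re-shoots get involved.

```python
def minutes_to_scan(pages, pages_per_hour=1000):
    """Minutes needed for one book at a steady page rate (best case)."""
    return pages * 60 / pages_per_hour


def books_per_day(seconds_per_book=90):
    """Books added in 24 hours at one book every seconds_per_book seconds."""
    return 24 * 60 * 60 // seconds_per_book


for pages in (200, 300):
    print(f"{pages} pages: {minutes_to_scan(pages):.0f} minutes")

print(f"{books_per_day()} books per day")
```

So a 200-300 page book lands at 12 to 18 minutes at full tilt, which squares with the "10-20 minutes" estimate, and one book every ninety seconds around the clock works out to 960 books a day.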
Doing odd objects, like placemats or software boxes, is much more difficult with this setup. Magazines, especially ones with shiny paper, are also a bit of trouble. This is the tradeoff. Stick with books, and there’s a LOT of books, and you do very well by these machines. Otherwise, you work a little harder. (And sometimes working harder is quite worth it.)
The V shape design is similar to something Google uses, and something similar to what the DIY Bookscanner uses. It’s a method that requires human labor (no automatic feeder or page turner) and it results in images that need cropping and adjusting afterwards.
This is the secret sauce of the Internet Archive Scribe – the processing software is amazing. It presents pages nicely, lets you declare items as metadata, rescan images, remove broken items, do cropping to sets of pages, and otherwise get repairs going very quickly on these books before wrapping the whole thing up in a pretty bow and shoving it into the Internet Archive ingestion system, itself a wonder of programming and automatic repair. You also can declare what pages are tables of contents, chapter headings, covers, indexes... and so the resulting item is much more usable. Without being destroyed. And only a hair less readable than a carefully-by-hand, take-all-day process of a flatbed and/or cutting the book completely up to turn it into a pile of flat sheets.
It is a wonder.
The letter that asked me to write about scanning asked for me to touch on legal issues. Here’s a hint from someone who’s actually been through the legal system: there’s no point in discussing the legal issues. No. Point.
Scan because you’re concerned that there’s no record of an item that is easily accessible to a future audience. Scan because you think you have something unique and want to ensure there’s multiple copies of it, even if those other copies are simply digital files. Give your work to people who will hold it for you. Put it up yourself. But if you’re truly asking yourself if you can do what you are doing, nothing I say is going to give you advice on that. I am neither a legal expert nor a pep rally. If you’re uncomfortable with doing something you think you should be doing, look around until you find someone who is comfortable with doing that thing and give it to them. That’s all I have for you.
Finally, there’s some inherent question in the whole process of scanning.
Scanning is, at its most fundamental nature, a photograph. It’s an analogue, a rendering, a rendition of a thing, an item. It allows some percentage, always less than one hundred, of that item to be in multiple locations and reach a wider and more diverse audience. It is a leverage and a bargain against oblivion and elitism, where the chances improve that this item’s information and nature will travel far beyond a single place.
It is also a foundation, one can hope – a foundation of existence and reliability for an item to be expanded upon. It can be contextualized, parodied, referenced. It stands as some level of proof of an idea, and it holds itself in the center of more in-depth historical research and tales. The nature of digital creation is not one of preservation in the sense of an exacting clone of a once-offline object – it’s to create a new object imbued with the advantages of the digital world and the memories of what it once was.
Arguments and hand-waving come when different factions of creators, historians, fans, caretakers, professionals and gatekeepers all converge on the original and digital objects that came from the same source, and they bemoan the attributes unique to these objects and what advantages and disadvantages each hold. There has been a lot of wasted time and inches both printed and displayed over this fundamental nature. What matters to me, ultimately, is that human beings saw some value in an artifact that came from the past, and they want to sustain it, for reasons selfish and charitable, for the future.
That’s why we scan. And that’s why we scan again and again.
Hope this helped.
[…] —Scanning: Some Thoughts, Jason Scott, October 3, 2013 […]
Dear Jason, thanks for this great writeup! May I ask why you used TIFF over PNG? I’ve seen people do scanning and saving in TIFF format, but accidentally use lossy JPEG compression instead of lossless LZW. Of course this depends very much on the interface of the software used, which might change all the time. PNG doesn’t even offer lossy compression in its specs and seems simple enough to implement that even Internet Explorer can display it, but are there any other drawbacks to it?
It’s pure format bias at this point for me. I happen to choose tiff because I have always known it as an extremely old yet dependable full bit space format. PNG is still the young upstart with uneven implementation and support to me. As long as it’s not lossy, I don’t care.
TIFF is certainly an older, more established format. From the perspective of someone starting out without knowing much about file formats, it’s also a more confusing one. TIFF has a lot of options for supporting many use cases– but if all you want is a flat lossless image, it’s just extra things to be confused about.
Alpha channels? Color profiles? Spot colors? To someone who knows how to manage a color space, these options are indispensable. To the dude with a boxful of old notebooks and a desktop printer/scanner, they’re mystifying.
For that gentleman, he can select “save as PNG” and be done. If more programs saved to TIFF with intelligent defaults, instead of interrogating the user for the settings, this wouldn’t be an issue.
so when are you going to scan those boardwatch magazines? you’re one of the few people who has them and people sent them to you to be used.
Def looking forward to some Boardwatch scans. <3
Thanks for this Jason.
If you had written it just a couple of months ago, I would have referenced it in an assignment. It’s very well written, and is quite persuasive about some of the reasons originals shouldn’t be destroyed. Of course, space is always going to be an issue, and in many cases having a “good enough” scan, means that the original can be discarded to make way for other material. And then because there wasn’t a proper backup system in place (off-site for example), the digital version was destroyed in a fire. And everything was lost. Ahem.
i’ve pointed my best man, Bob, at your page. he’s wrapping up a project to scan the plate stack of the 100s of thousands of astro photos taken at the harvard college observatory from the mid-19th through late 20-th Cs. he’s built the scanner and the plate cleaner. the team is now working in the basement of the stacks, eta: 5 yrs @ 40 hrs / wk.
That project is amazing!
How about an open source book scanner that turns the pages for you?
I have some stuff here, mostly old computer manuals, that aren’t scanned yet. They aren’t exactly rare though. What frustrates me is a lot of this stuff was originally typeset on a computer and exported to Postscript at one point, but the company who published the manuals (*cough* Apple) doesn’t have any interest in releasing a PDF or something.
“Form is henceforth divorced from matter. In fact, matter as a visible object is of no great use any longer, except as the mould on which form is shaped. Give us a few negatives of a thing worth seeing, taken from different points of view, and that is all we want of it”
From Oliver Wendell Holmes on early stereo photography, even then they were ready to do away with originals!
I remember typing in BASIC data statements from a magazine article (probably COMPUTE) that became an assembler for my VIC 20. This still amazes me. Magazines were an incredible resource back then. They were a real lifeline. This is hard to imagine with the ridiculous volume of free information available today.
This makes me feel better about scanning my old journals and then burning them in a cleansing fire. No one will ever want them but me, and I probably don’t want them either, but why take the chance?