ASCII by Jason Scott

Jason Scott's Weblog

Destroying the History to Save It —

Hi Jason,

I have a question for you, I’m hoping you can help me decide what to do.

I have a lot of vintage computer magazines and books. Not Information Cube level “lots”, but still, boxes and boxes of them.  A lot of this stuff is destined for AtariMagazines.com and AtariArchives.org. Some of it, I don’t have permission to post but it’s still interesting and good stuff, and maybe I’ll have permission one day.

2.5 years ago I moved all this stuff from my house to my new office. I unpacked some of it, never unpacked a lot of it. Now, my family and I are planning to move from northern California to Portland, Oregon this summer. Which means moving all of these boxes of magazines once again.

Which brings me to my question. I have a great duplex scanner. Two actually. Should I just cut the bindings off these magazines and digitize them all? Should I just decide that the content is the important part, and not fetishize the objects themselves? Right now, they’re hard to access, the information is impossible to search, etc. Or is it better to have the actual *thing*?

If you could digitize everything in the info cube, but destroy the originals in the process, would you?

And what about particularly rare mags — early issues that are hard to find, expensive on ebay?

My feeling is that for the stuff that isn’t extremely rare, I should just digitize it, bringing it one step closer to OCR and getting it online. . . or at least easily searchable on a hard drive. Then toss the original paper and move on. But I would like a sanity check from you on this.

One related thing that may interest you. I feel that OCRing these magazines is critical – something I have been doing for years at AtariMagazines.com and AtariArchives.org. But as I’m sure you know, OCR alone is not that great, you need human proofreaders to clean up the text if it’s going to be online. So I am creating a tool that will OCR pages, then send the OCRd text and images of the corresponding pages to Amazon Mechanical Turk, to have actual human people proofread and correct the text. It will allow me to get a LOT of high-quality, human-proofread OCR quickly. Basically it will be like Project Gutenberg’s  Distributed Proofreaders project, but it could be used with any text, not just PD text. (In addition to me using it for old computer magazines, it will be a web-based service that businesses could use.) Is that tool something that would help you in your preservation efforts?

Thanks for your thoughts on this stuff,

Kevin

Hey there, Kevin. Thanks for thinking of me.  Sorry it took so long to respond to this.

I am sorry that this strange, weird little world of computer and technological history has to experience the same issue as so many other realms do – that of doing terrible things in the name of good.  I shouldn’t be surprised this is the case. But one could always hope that just as computers seem to be the tool to end all tools, the machine that makes machines that make even better machines, there might have been a chance it wouldn’t fall prey to the same Faustian bargains extant in a thousand other situations. But there we have it.

In the case of documents and materials that are perfect bound, that is, attached by adhesive like so:


Well, with current scanning technology the best way to absolutely get the most effective scan/snapshot of the material is to destroy the binding. Just break that poor thing apart, scan it flat in a nice scanner, and then end up with a broken, used, impossible-to-keep pile of paper.

Now, don’t get me wrong – there’s been an enormous amount of effort applied out there to deal with the binding-being-broken issue. For example, some scanners of particularly rare books take a head-on photo of a flat book page and then use all sorts of mathematical trickery to calculate the curvature of the pages from the binding to flatten them out. Google does it when they scan books for their massive blorb of content. A lot of really smart people are working on that problem, and if you’ve never heard of Unpaper before now… well, you’re welcome.

But at the end of the day, in the currency of the present, the absolute best material to have would be a series of loose paper sheets that you can scan flat, at a nice high resolution. And if you have something that you can get into that form, the resulting scans will be much better – but again, you’ll have destroyed the source material in the process. Wrecked it.

This is a huge internal debate for me. Huge. As big as it gets.

After much thought, I came up with the following rule-set for the day I destroy something to save it.

IF I have a document or paper set that requires some level of destruction to scan properly AND IF I have three copies of it AND IF there is no currently-available digital version of the document AND IF there is a call or clamor for this document set THEN AND ONLY THEN I will split the binding and scan at a very high resolution and additionally apply OCR and other modern-day miracles to the resulting document so that the resulting item is, if not greater than the original, more useful to the world.

This is, as you might imagine, an impossibly high standard. So high, I haven’t had anything pass it yet.
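The rule-set above already reads a bit like code, so here is a minimal sketch of it as a Python predicate. The class name, field names, and function name are all hypothetical, invented for illustration, and "three copies" is read as "at least three", which is an assumption rather than something the rule states.

```python
from dataclasses import dataclass


@dataclass
class Holding:
    """One document or paper set in a collection (hypothetical model)."""
    needs_destructive_scan: bool   # must the binding be split to scan it properly?
    copies_owned: int              # physical copies on hand
    digital_version_exists: bool   # is a digital version currently available?
    in_demand: bool                # is there a call or clamor for it?


def may_split_binding(item: Holding) -> bool:
    """Return True only when every condition of the rule-set holds."""
    return (
        item.needs_destructive_scan
        and item.copies_owned >= 3          # assumption: "three" means "at least three"
        and not item.digital_version_exists
        and item.in_demand
    )


if __name__ == "__main__":
    newsletter_run = Holding(
        needs_destructive_scan=True,
        copies_owned=1,
        digital_version_exists=False,
        in_demand=False,
    )
    print(may_split_binding(newsletter_run))
```

Because the conditions are joined with AND, failing any single one is enough to keep the razor blade away, which is why so little ever passes.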

I’ve certainly embarked on large scanning projects before – for a year I scanned over 7000 pages of documents from Steve Meretzky’s collection, a scanning project that saved a lot of time for the archive that eventually took those documents over.  I also scanned these items at an insane rate,  800 dpi, meaning that you could see this level of detail in the final images:

In his case, though, I didn’t have to worry about hurting these one-of-a-kind copies of Meretzky’s notes and papers – they were all in a binder and they could be brought out, scanned, and put back. I was lucky. And, by extension, a lot of people are lucky. (There are still plans to put all these scans on archive.org – ideally in a few months.)

Sitting in my cube are entire collections of magazines, entire runs of all the issues that ever came out. The “IF there is a call or clamor” part of the above statement usually kicks in first and I haven’t scanned them in. For example, if you want an entire run of a newsletter dedicated to the typesetting software TeX; well… I got a box I can show you. But it just hasn’t seemed justified to go and scan that all in, in hopes someone will find it interesting. I’ve been focusing on other things as of late.

And then every once in a while, I discover someone has embarked on a project that I would normally be doing if my ruleset had been achieved, but since they have a smaller ruleset, they got there quicker. Such it was, recently, that it turns out someone is scanning in a bunch of issues of BYTE magazine.

Here’s the thread in question.  The scanning fellow shows up regularly and points to a multi-hundred-megabyte PDF file of an issue of BYTE magazine, including a nice introduction and overview of the contents, and the resulting downloaded file is easy to read, browse, and enjoy. It is very, very hard to look this gift horse in the mouth and find faults – I mean, this guy is scanning hundreds of pages, very quickly, and providing them for free. But here we go, finding cavities anyway…

Somewhere in the middle of the love fest that is this thread, someone points out that one of the pages is scanned improperly in the PDF, and a page is missing. The response from the scanning gentleman, frankly, chills me to the bone:

“I will fish the magazine out of the garbage and get those fixed.”

So after scanning these magazines, he immediately trashes them. Whoop, right into the bin. Now, the PDFs are great, but they’re not exactly excellent. The resolution is sub-par (so you can’t easily read many of the ads or look at details) and any printing or close-up viewing of the page is blurry indeed. But that’s it, they’re in the trash and gone.

Somewhere along the line, I convinced myself of this way of thinking: Well, instead of being a guy who owns these and throws them out, here’s a guy who scans them and puts them online, and then throws them out. This is the same internal gymnastics that makes it possible for me to vaguely respect all those boring-ass condo villages in the suburbs, because at the very least putting all those people in tightly-packed shitboxes sure beats that same amount of people taking up a hundred times the space with houses sporting massive useless lawns. The upside, you see.

But this falls apart quickly when one investigates what happens next: people begin sending the scanner/destructor their own copies of BYTE. Now he’s not just destroying his collection, he’s unwittingly convinced other people to give up their collections to the cause, destroying even more copies along the way, copies that can never be scanned at a better resolution, or given a chance to be cleaned up from said higher resolutions before being turned into a standards compliant and quick-as-lightning PDF. (With an archive of the original TIFFs around, as well.)

I have to stress – there’s no evil at work here.  Scanner-destroyer is donating a lot of time for this project.  People are benefiting from this effort, as they can read issues of BYTE that they never read or heard of when they were younger. BYTE was a world-class magazine in the 1970s and 1980s – as good as it gets in a technical realm. It’s a pleasure to read and provides hours of thoughtfulness afterwards. It’s good. It’s worth saving.

But this situation, this striking-the-balance problem of destruction versus saving, of trash and triumph – it’s one I haven’t really had to address yet, and I know that that day will come, and with it will be some very sad, very intense feelings as I take a razor blade to something fate and respect entrusted to my care.

I will not enjoy that day at all.


Categorised as: computer history

Comments are disabled on this post


17 Comments

  1. Benj Edwards says:

    I favor destroying a magazine or book’s binding during scanning if and only if:

    1. The magazine/book is not rare. That means:

    a. It was printed at least tens of thousands of times — preferably hundreds of thousands.
    b. It is known that others (particularly collectors, museums, or libraries) possess and preserve physical copies of the same work. At least one known, unmodified physical copy should always remain in the world.

    3. You can actually scan it properly. And I mean it. Very high resolution (at least 600 dpi) in a lossless image format (i.e. TIFF, not JPG, not PDF), possibly OCRed (although that can always be applied later), and no missing pages, image artifacts, or color aberrations that render the photos unviewable or the text unreadable. You need to scan the front and rear cover, every page (including ads), and you should probably scan what is printed on the binding (if anything) before you destroy it. You need to scan the document at such a quality that no one will ever need to scan it again for any foreseeable reason. There are almost always unforeseeable reasons, which is why some physical copies must remain.

    4. You can actually distribute your scanned work widely enough so other people won’t repeat the work, thus destroying more copies of the publication and wasting everyone’s time.

    That last one is vitally important. Without a huge repository of scanned material or a healthy underground trading scene for these publications, there’s almost no point in taking the time to scan them.

    All that being said, good luck and happy scanning!

  2. Vitorio says:

    The DIY Book Scanner forums have a lot of living knowledge about pro-am book and paper archiving: http://diybookscanner.org/

    Their users have their own scanning and post-processing software called Scan Tailor: http://scantailor.sourceforge.net/

  3. Shadyman says:

    I’d go for the DIY Book Scanner first and foremost, though I wonder if there isn’t some way to non-permanently (ie, no glue) reattach a stripped magazine in some form of folder or document cover that would clamp it along the bound edge…

  4. Swizzle says:

    I think turning the documents/magazines into digital copies is far more valuable than the physical medium. How many people are going to be able to read and examine the physical copies? How many people are going to be able to read and examine the digital copies? I think a lot of people will find them of value – many who probably didn’t even know the magazine existed before it was scanned.

    Hopefully people research scanning before they start though, and keep the original files scanned at a very high resolution so they can create better copies later instead of just having the downgraded PDF result. Even if they don’t – I still believe the digital copy is far more useful for everyone than a physical copy.

    If the document is rare and has some significance (imo – not everything needs to be saved forever physically, but there is little reason not to do it digitally) then care should be taken to scan it in a way that doesn’t destroy it.

    I’ve scanned a lot of documents over the years and most of them have gone into the trash when I’m done with them. I have the digital files afterwards – and so does everyone else who could possibly want it. It brings a tear to my eye to see the documents destroyed in the process – but the end result is for the better good.

  5. Chris says:

    I’ve struggled with this very idea myself. On one hand, I have boxes and boxes of documents that I want to keep and preserve, on the other hand I just don’t have enough room in my tiny house to properly store them anymore.

    I am leaning toward the argument that a good digital copy of something is much more useful to me and the rest of the world right now than the moldy deteriorating original will be after I’m dead and some cleaning crew comes in my house and throws it in the trash.

  6. Thomas says:

    Hi! A very interesting article I enjoyed reading it….

    I have a couple of comments on the BYTE scans… I am aware these are not museum quality. I think one attribute when measuring the value of a document is the number of people who read it. The BYTEs that are available now are being read by many people. When very high quality copies are made available years from now the number of interested readers will be less.

    The archiving and preservation of high quality copies is important… but most of the people who will peruse them years from now will not remember their childhoods or get that nostalgia kick that current readers are experiencing.

    BYTE is the sixth magazine series I have done in two years (along with 97 books on the Atari 8-bit)… Well over 100,000 pages. Probably nothing compared to some of the efforts here.. but for the people that really enjoyed these magazines and books in the 80’s the reduced quality is worth not waiting for more qualified people to preserve them.

    As far as tossing them out.. Guilty! I usually hold the magazines for a week after people start reading them. Twice I have thrown them out early by mistake and had to fish them out of the garbage. I used to keep them and put them in sheet protectors but after stuffing 4000 pages in sheet protectors and putting them on a shelf… I never looked at them again.. I had a digital copy.

    BYTE will need to be scanned again to preserve it. Another complete set will need to be destroyed. That used to really bother me.. a lot. Destroying issue #1 of BYTE and issue #1 of ANTIC was painful… but I think the extra readers that are enjoying it digitally now are worth the physical set that would be taking up space in my (and a few others’) closets.

  7. Jayson Smith says:

    I tend to agree that, where it is known that many copies existed and several do still exist, digital versions are more useful to the world at large than physical copies. For one thing, a digital copy of something can much more easily be copied from place to place, and can be in hundreds or thousands of locations at the same time. Whereas, obviously any single physical object can never be in more than one place at the same time. If something happens (fire, flood, other natural disaster) in the one location where that physical copy is housed, boom, it’s gone, just like that. All that having been said, I would tend to agree that, where possible, at least one physical copy should always remain in the world. Better if several copies can exist in different parts of the world. Just my thoughts.

  8. J. Miller says:

    The solution I would use for perfect-bound and stapled things is to remove the binding – in the case of perfect-bound stuff, this would involve an industrial-class paper guillotine capable of slicing the binding right off in one go – high resolution scans (1600 dpi) to compressed TIFFs and then downsampled and OCRed to PDFs for distribution, and then organize and preserve the remaining pages in another binding (plastic page protectors coming to mind).

    Scanning doesn’t require destroying the pages, just the binding, and you can always re-bind them. The binding has no value (unlike say a genuine antique book where the binding method is no longer used). Content is king, and it’s important to get that content into a state where it has multiple redundant backups in multiple media. That way, even if the media is lost, the message is still heard.

    (The actual best preservation media is to have the original layout documents, so you can further duplicate them. That’s not likely to happen, though.)

  9. Jim says:

    Since we’re talking about viewing these books as historical artifacts, think of the situation like this – would you rather have a chance of keeping the original copy of Homer’s Odyssey in perfect condition or guaranteed access to the words of the book, plus a chance at the original pages, minus binding? First the information needs to be secured, then the physical form, then we can get into the MINT+++ ratings and all that.

    • Jason Scott says:

      Very true, Jim. But let’s also talk about how we’re breaking the binding on that original copy of Homer, scanning it via whatever appears to be the current good technology, then immediately throwing the pages into the fire, going “well, there’ll be others”.

  10. TPRJones says:

    I think it is the content that is important, and any decisions made should center around preserving and cataloging the content in the best way possible. It’s easy to get wrapped up in the history and the value of the physical object, but without the content the object itself would have no particular value. Trade that stack of Byte magazines for a stack of Ladies Home Journal from the same time period, and – at least to most readers here – it suddenly becomes a lot less valuable.

    These are not holy artifacts. Their content is what matters. The rest is just scraps of pulped timber pressed into flat planes and covered in inks. The only reasons to preserve the physical forms are because you haven’t had time to scan them yet or because you believe that a better scan can be done at a later date.

  11. Richard Wheston says:

    TPRJones: Well, there it is. To some people, the physical magazine/book/catalog/whatever *is* the important artifact. The content is indeed secondary to the presentation.

    Which, to some extent, I can get behind. But then the question stops being “should I cut the binding off to scan it” and starts being “why am I even considering taking this priceless piece of history out of its double-locked nitrogen-atmosphere fireproof room?”

    Jason makes this point with the “does another copy exist?” question. But to ask this question implies that if the answer is “no”, then the conclusion will be to not risk damage to the artifact by running it through a scanner.

  12. Chris M says:

    I just wish Ziff-Davis (or whoever owns them these days) would make the entire back catalog of rags like PC Magazine, Windows Sources, Computer Shopper, PComputing and others available electronically.

  13. Ed S says:

    I worked as a professional archivist for a number of years (not bragging, just a fact) and I can testify that similar questions come up all the time, and the answer depends on the priorities and goals of whatever person or institution is doing the imaging. An institution would be unlikely to take on a project like BYTE, for Intellectual Property reasons and because there are so many waiting projects involving older or truly unique material.

    If I had a collection of thick, perfect-bound magazines on thin, clay-coated paper (such as BYTE) and I wanted to digitize them, I would not chop off the bindings. The reason is that the paper is very thin and shiny — and its character probably varies quite a bit over the run of the magazine. It won’t feed well through a multi-page duplex scanner: about every 20 sheets, you would get one that sticks or crumples. Unless he has some really good machinery to hand, I expect that guy scanning BYTE is just laying the pages one by one on a regular flatbed. Not the most efficient way to do it, IMO

    For a run of magazines of that kind, I would image them using a camera stand with a V-shaped support, the way most of the stands on that DIY Scanner forum are set up. Don’t use cheap cameras, but you don’t need to go overboard either. Something on the order of the Canon Digital Rebel, 16 megapixels, would do. The setup needs two cameras, preferably of the same make and model. If you want to capture color images, you will be learning a little bit about studio lighting, especially how to avoid reflections from the glossy paper. You would need to know a little bit about lenses also, but the photography knowledge required is not huge. You can get the equivalent of 400 ppi without too much difficulty, and that is usually enough to capture the resolution of the original printing process that was used to make the magazine. One can buy a professionally made copy stand of this V-shape design (ATIZ makes one with a V-shaped glass that comes down and presses the pages flat) for 6-7 thousand dollars, which is why people go the build-it-myself route. The reality is that imaging documents properly doesn’t require arcane equipment but it does require time, knowledge and commitment, and realistically, the time you put into it will be your biggest cost no matter which route you take. Maybe easier to find someone who does it as a hobby and loves it, and work out some kind of quid-pro-quo.

    A less expensive (in money) but more costly (in time) route is to get a flat scanner that is designed so that the scanning glass goes right up to one edge of the unit. These are made for scanning bound items. However, they require the person to manually lift and reorient the book for every scan, and half the scans need to be rotated in post processing.

    In any case, I think there is really no point in chopping the magazines. In magazines like that, the glue used to join the papers often seeps between the sheets, so you have to pull them apart individually even if the binding is cut off. The tear marks left behind by that process are visible on a number of the sheets in that guy’s scan of BYTE. But as Jason said, the guy is putting a lot of effort into it so criticism should be moderated by appreciation.

  14. Ed S says:

    One additional thought: when imaging materials, it is important to scan at the appropriate resolution for that type of material, and with regard to the kind of output you want. Scanning at too high a resolution can waste a lot of your time, because the machine works much more slowly and you may be capturing useless data. Most flatbed scanners won’t scan any more finely than 600 ppi. They will let you set the resolution higher, but they just scan at 600 and interpolate the image up to the higher setting. Waste of time. But it gets worse. Quite a number of scanners can scan at 600 ppi, but the optical equipment in the scanner cannot focus that finely. You get a scan at 600, but the image details blur out at maybe 400, so the extra 200 ppi are useless. Believe me, I’ve made all these mistakes. The details of this work are not obvious.

  15. […] Scott raised the question on his blog of where the limits of archiving lie. May one destroy something… […]
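The resolution figures traded in the comments above (roughly 400 ppi from a 16-megapixel camera, 600 dpi as a floor, 800 and 1600 dpi scans) can be sanity-checked with a little arithmetic. The sketch below is only that: it assumes an 8.5 x 11 inch page, 24-bit color, and a hypothetical 4896 x 3264 pixel sensor standing in for "16 megapixels". None of those numbers come from the post or the comments, and the function names are made up for illustration.

```python
PAGE_W_IN, PAGE_H_IN = 8.5, 11.0   # assumed page size in inches
BYTES_PER_PIXEL = 3                # assumed 24-bit RGB, uncompressed


def uncompressed_mb(dpi: int) -> float:
    """Uncompressed size of one scanned page in megabytes (10**6 bytes)."""
    width_px = round(PAGE_W_IN * dpi)
    height_px = round(PAGE_H_IN * dpi)
    return width_px * height_px * BYTES_PER_PIXEL / 1e6


def camera_ppi(sensor_w_px: int, sensor_h_px: int) -> float:
    """Best-case ppi when one portrait page exactly fills the camera frame.

    The tighter of the two dimensions limits the resolution.
    """
    long_px = max(sensor_w_px, sensor_h_px)
    short_px = min(sensor_w_px, sensor_h_px)
    return min(long_px / PAGE_H_IN, short_px / PAGE_W_IN)


if __name__ == "__main__":
    for dpi in (400, 600, 800, 1600):
        print(f"{dpi} dpi: {uncompressed_mb(dpi):.2f} MB per page")
    print(f"camera: {camera_ppi(4896, 3264):.0f} ppi at best")
```

Under those assumptions the camera tops out a little under 400 ppi, in line with the "equivalent of 400 ppi" estimate, and a single uncompressed page grows from about 45 MB at 400 dpi to about 718 MB at 1600 dpi. That growth is the practical cost behind the warning about scanning at a higher resolution than the material or the optics can justify.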