ASCII by Jason Scott

Jason Scott's Weblog

Change Computer History Forever: Well, Here We Are

When Brewster hired me in 2011, he had the foresight to recognize I’d spread in many directions once I was under the auspices of the Internet Archive, but he definitely had one overarching goal with my employment. Paraphrasing, it was this: the Archive had done very well with books, music, visual items, and of course websites – but it was sorely lacking in the realm of software. My provided goal was to do for software what archive.org has been doing for all these other mediums.

In short summary, I have done that.

Thanks to the additions of the Shareware CD Archive, the TOSEC archive, the FTP site boneyard, and the Disk Drives collection, along with the encouraged hosting of the Classic PC Games library and the (in-process) integration of Fileplanet, the Internet Archive is the largest collection of historical software online in the world. Find me someone bigger.

Through these terabytes (!) of software, the whole of the software landscape of the last 50 years is settling in. But since software is just that, programs and materials, it’s best to have some documentation and writing regarding it as well.

I’m well along on that too: the Computer Magazines collection is well over 10,000 individual issues of computer magazines and journals. If they’re not magazines, they might be newsletters and there’s a Computer Newsletters collection for that, with thousands of THOSE issues as well. Or books! Maybe you’re looking for books and in the Folkscanomy project, I’ve set aside a section just of computer books. Obviously there might be some hardware issues or information you would need, so be aware the Folkscanomy collection has an electronics section that veers here and there into the computer and programming realms as well.

No, no, SIT DOWN. I’m not done. The mirroring of the amazing Bitsavers Project means that over 25,000 documents that group has been digitizing for almost 20 years are right online, readable and downloadable at a whim. I’ve been separating them into company type, but I currently do that by hand so it’s uneven. Either way, an automatic process now does the ingestion, meaning anywhere from 10 to 100 new documents enter that library a week.

Regular ephemera? I’ve been doing a little of that on the side and working with people. It’s called the Reader Service collection. Gems aplenty there, I can promise you.

So, between all this material, with much more coming in, the Internet Archive gives you unrestricted access to the largest collection of computer history and software in the world, bar none. Bar none.

So what’s the problem?

Well, our metadata is shit, I can tell you that. We’re not good at having all the careful twee metadata entry that most archives and libraries demand. If you look at, say, the Apple I manual we have online, it’s kind of just that – an Apple I manual. Not much detail, page listing, context. It’s just there. Preserved, easily accessed, easily read – but not described all that much. That’s a thing. People in more formal disciplines might call that a showstopper. I call it a minor issue for the moment, but one worth improving.

The other weirdness is that a lot of material is inside other archives that have to be browsed using Archive.org’s file browser. So here are some examples: The insides of a DOOM Level CD-ROM. A view of the entire software output for the Colecovision. The racy insides of the Devil’s Doorknob BBS. There they are, but you have to do a little digging.

Yes, this is a crate digger’s paradise. The cries of “Look what they had, they didn’t even know they had it” should echo through these stacks. The superior feeling of being the first to find a rare demo of a game that nobody ever ultimately released. The citation you note deep in an advertisement in a computer magazine for a promised hardware family that never came to fruition, or did with radically scaled-back qualities. It’s in there.

But these are problems of effort, not of possibility. That’s all they are.

More importantly, here’s the question I now ask the culture, the world, the people who might read this or get pointed to it.

Are you ready for this? Are you in?

What I mean is that for well over 20 years now, I’ve been in the world and the culture of the software collector, the curator, the theorist, the fan. That is my life, to have been part of this group. Some of them have gone into some very professional circles with this hope in their heart to bring something like this around, but an awful lot of life and fear and reality has gotten in the way. A lot of people are well on their way towards these goals, to have this much online, this much available, this much right there and allowing us to do the Next Steps.

Well, we’re here. Now what.

There is now a fully-accessible, worldwide-reachable, massive-bandwidth and completely unrestricted collection of computer history up right now, in these collections I’ve just mentioned. Some are mirrors of incredible projects that have been around long before this moment, and let me not diminish their continued work. But some of these efforts needed that little extra bit of access, that ease of reading and downloading, and now that is here. The URLs on archive.org are designed to be permanent. This link to a little running cat (NEKO, which has been around since Macintosh days) will, barring incredible disaster, be around for a very long and dependable time. So will this collection of 30 gigabytes of Amiga software. And notably, over 360 people have downloaded that 30 gigabyte collection, absconding like Bilbo Baggins out of the mountain. Fine! Enjoy! Have a great time! But the point is, if someone asks for where it came from, they can point right here, and here it is. In a library. Online, like it belongs.

So where are you?

Where are the students of computer history who needed primary source material, downloadable images and PDF files of every description from which to make their thesis statements and reports?

Where are the bloggers and essayists who are putting together in-depth, critical, long-reaching and ranging assessments of historical events to provide context to today?

Where are the people dedicated to busting some of these lame-ass software patents that have clogged and destroyed so much innovation, all in the name of some corporate worship that says that someone patenting breathing oxygen is helping the world improve?

When do I get to see the brilliance of works like this that shed amazing new light into these old things?

This is it, folks. This is the ideal world I’ve heard whispered about, referenced, and planned for a very long time. It’s here. I know you might have expected it to land with an earth-shattering boom but it was a slow and steady flowering on the Internet Archive’s servers. The Archive of Historical Computer Software is here, and it is very, very large.

Blow me away.

 

 


Categorised as: computer history | jason his own self



41 Comments

  1. Kate Bowers says:

    Don’t call metadata “twee.” Metadata is the stuff of discovery and citation. Discovery is what brings scholars, and citation is to the humanities what repeatability is to experimental science–nothing can be considered scholarship without it.

    It might look like an old lady in lavender and lace to you, but metadata is also the boiled-down and inter-connected essence of the thing it represents. Useful stuff, metadata.

    Maybe you should hook up with some archivists and librarians, do something brilliant and find a way to auto-generate and share some metadata for all that stuff you’ve been collecting. Putting something in ArchiveGrid (beta.worldcat.org/archivegrid/) would be a good start.

    Where are the scholars? They are where the metadata has shined a light on a collection. Well, at least that’s where the historians are: http://bit.ly/UsAgwu

  2. I’ve been wanting to do some automated projects with the collection of CD-ROMs, starting with determining how much duplication there is (like every single shareware CD-ROM having a copy of the Doom demo) and then finding the unique stuff. Perhaps build a search index of all the textual information (e.g., I’m fascinated with knowing where game companies used to be headquartered).

    Further, a little pre-processing might enable users to find specific files within the opaque CD-ROM images and download individual files they seek.

    I’m glad the metadata deficiency is an acknowledged shortcoming. I also have trouble finding different collections. Often, my only saving grace is that I still have pointers in my RSS feed (here’s hoping that service doesn’t go anywhere).
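
    One way the duplication survey described above could start is by hashing every file on every extracted disc and grouping by digest. This is only a minimal sketch: the `disc_dirs` layout (disc name mapped to a local directory of already-extracted files) and the function names are assumptions made for illustration, not anything archive.org provides.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(disc_dirs):
    """Split files into those shared across locations and those seen once.

    disc_dirs: dict of disc name -> directory of that disc's extracted files
    (a hypothetical local layout, assumed for this sketch).
    Returns (duplicates, unique), both keyed by SHA-256 hex digest.
    """
    by_hash = defaultdict(list)
    for disc, root in disc_dirs.items():
        root = Path(root)
        for path in sorted(root.rglob("*")):
            if path.is_file():
                location = (disc, path.relative_to(root).as_posix())
                by_hash[sha256_of(path)].append(location)
    duplicates = {h: locs for h, locs in by_hash.items() if len(locs) > 1}
    unique = {h: locs[0] for h, locs in by_hash.items() if len(locs) == 1}
    return duplicates, unique
```

    Calling `find_duplicates({"disc_a": "/tmp/disc_a", "disc_b": "/tmp/disc_b"})` would return the files that appear more than once (the Doom demo case) and the files that appear exactly once. Byte-identical hashing misses re-zipped or renamed variants, so it gives a lower bound on duplication.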

    • Jason Scott says:

      Well, I’ve taken some steps at pre-processing (more needs to be done, obviously). For example, if you look at an earlier CD-ROM uploaded, like Hall of Fame CD-ROM, you can see a file listing (in two formats) as well as a generated graphics gallery of all images in the directory. Obviously a description went in there too.

      Just the pure act of acquiring and uploading these CD-ROM images is taking a long time – so that’s what I’ve been focusing on.

  3. This post is just all kinds of awesome! I had no idea archive.org had so many non-website things. I grew up with a Commodore 64 and later I dialed many BBS systems, so seeing the magazines and BBS snapshots really brought me back.

    You mentioned you really were hoping to see cool things created and researched based on this material being available.

    I wanted to make huge, massive wallpaper-sized images of classic Commodore game ads often found in the old computer magazines of the time. I was able to find a torrent of all the Compute Gazette and Run magazines. I then wrote some code to determine whether each page was an ad or not, and then wrote some more code to remove duplicates and piece all the images together.

    Here are the final results:
    http://telparia.com/Commodore_Game_Ads_1.jpg
    http://telparia.com/Commodore_Game_Ads_2.jpg
    http://telparia.com/Commodore_Game_Ads_3.jpg
    http://telparia.com/Commodore_Game_Ads_4.jpg
    http://telparia.com/Commodore_Game_Ads_5.jpg
    http://telparia.com/CommodoreMagazineCovers.jpg

    I wrote a blog post about it: http://cosmicrealms.com/blog/2012/12/31/c64-magazine-game-wallpaper-generator/

    Thanks a lot for a great post. I’m gonna go browse around archive.org :)

  4. stbalbach says:

    There’s a cool utility called httpfs that lets you browse remote files (such as .iso) as if they were local thus not requiring full download to browse contents. Seems useful for this archive. Unix only though (maybe someone can port to cygwin for Windows).

    Download:
    http://sourceforge.net/projects/httpfs/files/httpfs2/

    Example:
    https://gist.github.com/vasi/5365782
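
    As a rough illustration of why a full download isn't needed: a client can ask for just the bytes it wants with an HTTP Range header. The sketch below fetches nothing itself; it builds the Range header for an ISO 9660 image's primary volume descriptor (sector 16, 2048-byte sectors) and parses that one sector. Whether httpfs works exactly this way internally is an assumption here, and the function names are made up for this example.

```python
import struct

SECTOR_SIZE = 2048   # standard ISO 9660 logical sector size
PVD_SECTOR = 16      # the primary volume descriptor sits in sector 16


def pvd_range_header():
    """Return an HTTP Range header covering only the volume descriptor."""
    start = PVD_SECTOR * SECTOR_SIZE
    end = start + SECTOR_SIZE - 1   # Range end offsets are inclusive
    return {"Range": f"bytes={start}-{end}"}


def parse_pvd(sector):
    """Pull the volume label and image size out of one descriptor sector."""
    if len(sector) < SECTOR_SIZE or sector[0] != 1 or sector[1:6] != b"CD001":
        raise ValueError("not an ISO 9660 primary volume descriptor")
    label = sector[40:72].decode("ascii", errors="replace").rstrip()
    # Both fields are stored little-endian first, then big-endian.
    (block_count,) = struct.unpack_from("<I", sector, 80)
    (block_size,) = struct.unpack_from("<H", sector, 128)
    return {"label": label, "size_bytes": block_count * block_size}
```

    With the standard library, the header could be passed as `urllib.request.Request(url, headers=pvd_range_header())`, assuming the server honors Range requests and replies with 206 Partial Content; the 2048 bytes of the response body would then go to `parse_pvd`.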

  5. Daniel says:

    Wow, that’s beyond amazing! One question, however… Regarding the TOSEC collection, for instance, how did you guys get away with making games available given the copyright restrictions?

  6. mikecane says:

    >>>I’m well along on that too: the Computer Magazines collection is well over 10,000 individual issues of computer magazines and journals.

    Are there any plans to scan these as searchable text?

    • Jason Scott says:

      In every case, there’s OCR text that is generated from them and accessible from the item. In the future, a search engine would be a good idea. Frankly, I’m not impressed with the OCR as it is, and it should be improved. At the very least, some attempt should be made to capture the table of contents of these issues. Also, the scan quality varies, although a few people who scanned things in have gone ahead and re-scanned them at better quality, since they now know there’s a nice large home for the scans.

      • ihtoitihtoit says:

        Maybe a project for those nice fellas over at Project Gutenberg? Sorry to say, even today OCR’d text isn’t all it’s cracked up to be. Human proofreading is still needed. I would say though, that I have a vast legal library (over six million pages), entirely converted to PDF and fully indexed. Where the content is scanned pages, I’ve run the automated OCR, embedded that as metadata without proofing and indexed that. Works pretty well, I can find most things I’m looking for fairly quickly.

  7. [...] Scott explains in a blog. The collection is said to already comprise several terabytes of code from the past 50 [...]

  8. Bruce says:

    Wow. Just wow. The sheer scale is stunning.

    Actually, one question which might lead to others. If you search for an ISO file in the archive, you can find it easily enough via a web search (e.g. the Devil’s Doorknob one listed above) but there doesn’t appear to be any obvious link from the main collection page to the isoviewer. I saw an archive.org forum post by someone suggesting it a year ago, with no replies.

    Am I missing something obvious in the interface, some really obvious reason why it’s a bad idea, or is there a more meta-level reason why that doesn’t exist yet (or is it something that I could somehow fix?)

    B>

    • Jason Scott says:

      If you go to the TOSEC collection, you can see links to the ISO viewer. The fact is, these uploaded ISOs all need some love, with a combination of generated scripts to link to the browsers, along with better pulled-in data from the ISOs themselves split out. I expect to increase this over the next couple of months and make these better.

  9. [...] of the largest collections of old software available anywhere, according to Internet Archive’s Jason Scott. That includes recent additions of Shareware CD Archive, TOSEC archive, FTP site boneyard, Disk [...]

  10. Archivist says:

    Wow, yeah that’s fantastic! I will for sure spend some time searching for some old games to play with an emulator!

  11. [...] to partnerships with a number of independent archives, including TOSEC archive, the FTP site boneyard, the Shareware CD Archive, and Classic PC Games [...]

  12. Mike Rhode says:

    Are you interested in having more old software sent to you? I’ve got a bunch of 1990s 3 1/2″ floppies with games and other programs on them that I’d be glad to pack and mail. Some were shareware, and others are boxed commercial releases.

    • Steve says:

      I’ll answer as a member of Archive Team: YES! He would like any and all old software, floppies, CDs, etc.

      • Mike Rhode says:

        Ok, send me an address to mail them to and I’ll get them out soon – hopefully next Monday. I’m glad to find a home for these as I’m a professional archivist myself.

      • Robin Lake says:

        Likewise. I’ve never thrown anything away and I’m 74 years old! Rather than ship you all 200+ cartons of computer-related materials, if you could let the world know what specifically you want/need, I’d adhere to your wishes and filter the stuff before sending.

        • Steve says:

          Not speaking for Jason here, but I’m pretty certain he’d take every last box of materials you’d be willing to share.
          You should definitely send him an email though so you can work out an arrangement with him: jason at textfiles.com

  13. [...] explained by Scott himself on his official blog, those in charge of the Internet Archive wanted to [...]

  14. [...] – Former 404 guest and Internet Archivist releases Web’s largest collection of historical software. [...]

  15. [...] O’Reilly Radar > ASCII > “Fixing E.T. The Extra-Terrestrial for the Atari 2600” > “E.T. (Atari 2600) [...]

  16. Mark Marino says:

    Jason,

    Congratulations. This addition will mean an awful lot to those researchers working in Critical Code Studies. Is there a tab or homepage for code searches on the archive yet?

  17. [...] Computer Software Archive (Jason Scott) — The Internet Archive is the largest collection of historical software online in the world. Find me someone bigger. Through these terabytes (!) of software, the whole of the software landscape of the last 50 years is settling in. (And documentation and magazines and …). Wow. [...]

  18. Jim S says:

    Aren’t you looking for something like http://manx.classiccmp.org/ which indexes documentation?

  19. D.C.D. says:

    any chance you’ll interlink the TOSEC sets with archived emulators so that I can load Super Mario World and play it in browser?

  20. Mike Rhode says:

    My banker’s box went in the mail to him on Friday. It turned out to be mostly games, about 1/2 shareware; and some software like Multimate wordprocessor. And a disk for a Microsoft Mouse, which someday may be of some sort of historical interest. ;^)

  21. [...] Ascii archives of 50 years of programs, with docs. [...]

  22. Jeremy Nimmo says:

    Awesome! Incidentally, by saving Fileplanet (which has kinda sucked for around a decade) did 3dgamers.com’s great software archive get saved by accident? I can remember being really pissed off when IGN sucked in and destroyed that site.. and then seeing that they sucked in Gamespy and thinking- oh, here we go again.

  23. laserghost says:

    You’re archiving history. Those myriads of demos, maps, mods, little gaming tidbits are history from a previous era. Thank you for your effort.

  24. Hmm.

    Well, BetaArchive (http://www.betaarchive.com/) has 4.7 TB of beta and abandonware software and games.. (not to mention SDKs, etc!)

  25. The IBM Stretch archive is a collection of items, primarily documents, relating to the development of the IBM 7030 (“Stretch”) computer project from 1955 to 1961.