Thoughts on a Collection: Apple II Floppies in the Realm of the Now — March 15, 2017

I was connected with The 3D0G Knight, a long-retired Apple II pirate/collector who had built up a set of hundreds of floppy disks acquired from many different locations and friends decades ago. He generously sent me his entire collection to ingest into a more modern digital format, as well as the Internet Archive’s software archive.

The floppies came in a box without any sort of sleeves for them, with what turned out to be roughly 350 of them removed from “ammo boxes” by 3D0G from his parents’ house. The disks all had labels of some sort, and a printed index came along with it all, mapped to the unique disk ID/Numbers that had been carefully put on all of them years ago. I expect this was months of work at the time.

Each floppy is 140k of data on each side, and in this case, all the floppies had been single-sided and clipped with an additional notch with a hole punch to allow the second side to be used as well.

Even though they were packed a little strangely, there was no damage anywhere, nothing bent or broken or ripped, and all the items were intact. It looked to be quite the bonanza of potentially new vintage software.

So, this activity is at the crux of the work going on with both the older software on the Internet Archive and what I’m doing with web browser emulation and increasing easy access to the works of old. The most important thing, over everything else, is to close the air gap – get the data off these disappearing floppy disks and into something online where people or scripts can benefit from them and research them. Almost everything else – scanning of cover art, ingestion of metadata, pulling together the history of a company or cross-checking what titles had which collaborators… that has nowhere near the expiration date of the magnetized coated plastic disks going under. This needs us and it needs us now.

The way that things currently work with Apple II floppies is to separate them into two classes: Disks that Just Copy, and Disks That Need A Little Love. The Little Love disks, when found, are packed up and sent off to one of my collaborators, 4AM, who has the tools and the skills to get data off particularly tenacious floppies, as well as doing “silent cracks” of commercial floppies to preserve what’s on them as best as possible.

Doing the “Disks that Just Copy” is a mite easier. I currently have an Apple II system on my desk that connects via a USB-to-serial connection to my PC. There, I run a program called Apple Disk Transfer that basically turns the Apple into a Floppy Reading Machine, with a pretty interface and everything.

Apple Disk Transfer (ADT) has been around a very long time and knows what it’s doing – a floppy disk with no trickery on the encoding side can be ripped out and transferred to a “.DSK” file on the PC in about 20 seconds. If there’s something wrong with the disk in terms of being an easy read, ADT is very loud about it. I can do other things while reading floppies, and I end up with a whole pile of filenames when it’s done. The workflow, in other words, isn’t so bad as long as the floppies aren’t in really bad shape. In this particular set, the floppies were in excellent shape, except when they weren’t, and the vast majority fell into the “excellent” camp.

The floppy drive that sits at the middle of this looks like some sort of nightmare, but it helps to understand that with Apple II floppy drives, you really have to have the cover removed at all times, because you will be constantly checking the read head for dust, smudges, and so on. Unscrewing the whole mess and putting it back together for looks just doesn’t scale. It’s ugly, but it works.

It took me about three days (while doing lots of other stuff) but in the end I had 714 .dsk images pulled from both sides of the floppies, which works out to 357 floppy disks successfully imaged. Another 20 or so are going to get a once-over but probably are going to go into 4AM’s hands to get final evaluation. (Some of them may in fact be blank, but were labelled in preparation, and so on.) 714 is a lot to get from one person!

As mentioned, an Apple II 5.25″ floppy disk image is pretty much always 140k. The names of the floppies are mine, taken off the label, or added based on glancing inside the disk image after it’s done. For a quick glance, I use either an Apple II emulator called Applewin, or the fantastically useful Apple II disk image investigator Ciderpress, which is frankly the gold standard for what should be out there for every vintage disk/cartridge/cassette image. As might be expected, labels don’t always match contents. C’est la vie.
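Since a standard image is pretty much always 140k, a quick size check is a cheap way to spot the oddballs in a pile of freshly pulled files. The sketch below is an illustration only, not part of the workflow described here. It assumes the standard 16-sector layout (35 tracks of 16 sectors at 256 bytes each) and a lowercase ".dsk" extension; the function name and folder argument are made up for the example.

```python
from pathlib import Path

# Assumed geometry: the standard 16-sector 5.25" layout.
TRACKS = 35
SECTORS_PER_TRACK = 16
BYTES_PER_SECTOR = 256
EXPECTED_SIZE = TRACKS * SECTORS_PER_TRACK * BYTES_PER_SECTOR  # 143,360 bytes (140k)


def odd_sized_images(folder):
    """Return (filename, size) pairs for .dsk files that are not exactly 140k."""
    results = []
    for path in sorted(Path(folder).glob("*.dsk")):
        size = path.stat().st_size
        if size != EXPECTED_SIZE:
            results.append((path.name, size))
    return results
```

Anything this returns is not necessarily bad, just worth a second look in an emulator or disk image browser.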

As for the contents of the disks themselves, this comes down to what the “standard collection” was for an Apple II user in the 1980s who wasn’t afraid to let their software library grow utilizing less than legitimate circumstances. Instead of an elegant case of shiny, professionally labelled floppy diskettes, we get a scribbled, messy, organic collection of all range of “warez” with no real theme. There’s games, of course, but there’s also productivity, utilities, artwork, and one-off collections of textfiles and documentation. Games that were “cracked” down into single-file payloads find themselves with 4-5 other unexpected housemates and sitting behind a menu. A person spending the equivalent of $50-$70 per title might be expected to have a relatively small and distinct library, but someone who is meeting up with friends or associates and duplicating floppies over a few hours will just grab bushels of strange.

The result of the first run is already up on the Archive: A 37 Megabyte .ZIP file containing all the images I pulled off the floppies.

In terms of what will be of relevance to later historians, researchers, or collectors, that zip file is probably the best way to go – it’s not munged up with the needs of the Archive’s structure, and is just the disk images and nothing else.

This single .zip archive might be sufficient for a lot of sites (go git ‘er!) but as mentioned infinite times before, there is a very strong ethic across the Internet Archive’s software collection to make things as accessible as possible, and hence there are nearly 500 items in the “3D0G Knight Collection” besides the “download it all” item.

The rest of this entry talks about why it’s 500 and not 714, and how it is put together, and the rest of my thoughts on this whole endeavor. If you just want to play some games online or pull a 37mb file and run, cackling happily, into the night, so be it.

The relatively small number of people who have exceedingly hard opinions on how things “should be done” in the vintage computing space will also want to join the folks who are pulling the 37mb file. Everything else done by me after the generation of the .zip file is in service of the present and near future. The items that number in the hundreds on the Archive that contain one floppy disk image and interaction with it are meant for people to find now. I want someone to have a vague memory of a game or program once interacted with, and if possible, to find it on the Archive. I also like people browsing around randomly until something catches their eye and to be able to leap into the program immediately.

To those ends, and as an exercise, I’ve acquired or collaborated on scripts to do the lion’s share of analysis on software images to prep them for this living museum. These scripts get it “mostly” right, and the rough edges they bring in from running are easily smoothed over by a microscopic amount of post-processing manual attention, like running a piece of sandpaper over a machine-made joint.

Again, we started out with 714 disk images. The first thing done was to run them against a script that has hash checksums for every exposed Apple II disk image on the Archive, which now number over 10,000. Doing this dropped the “uniquely new” disk images from 714 to 667.
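For the curious, the general shape of that kind of hash check can be sketched in a few lines of Python. This is not the script used here, just a minimal illustration: `known_hashes` is a hypothetical set of SHA-1 hex digests for images already on the Archive, and the choice of SHA-1 alone, plus the extra step of catching repeats within the incoming batch, are assumptions of the example.

```python
import hashlib


def sha1_of(path):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_new_and_duplicates(image_paths, known_hashes):
    """Separate images whose hash is unseen from those already accounted for.

    known_hashes is a hypothetical set of hex digests for existing images.
    """
    seen = set(known_hashes)
    new, duplicates = [], []
    for path in image_paths:
        digest = sha1_of(path)
        if digest in seen:
            duplicates.append(path)
        else:
            new.append(path)
            seen.add(digest)  # assumption: also drop repeats inside this batch
    return new, duplicates
```

A matching hash means a byte-for-byte identical image. It says nothing about images that differ by a single bit.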

Next, I grouped disk images that are part of the same product into one item: if a paint program has two floppy disk images for each of the sides of its disk, those become a single item. In one or two cases, the program spans multiple floppies, so 4-8 (and in one case, 14!) floppy images become a single item. Doing this dropped the total from 667 to 495 unique items. That’s why the number is significantly smaller than the original total.
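How images get matched to a product is a judgment call, and nothing here says it was automated. Purely as an illustration, if the filenames happened to end in markers like "(Side A)" or "Disk 2" (a naming convention assumed for this example, not one the collection is known to follow), the grouping step could be sketched like this:

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed convention: a trailing "Side X" or "Disk N" marker, optionally in
# parentheses, where the label is a number or a single letter.
_SUFFIX = re.compile(
    r"\s*[-_(]?\s*(?:side|disk)\s*(?:[0-9]+|[A-Za-z])\s*\)?\s*$",
    re.IGNORECASE,
)


def product_key(filename):
    """Strip the side/disk marker to get a key shared by one product's images."""
    stem = Path(filename).stem
    key = _SUFFIX.sub("", stem).strip().lower()
    return key or stem.lower()  # fall back if the whole name was a marker


def group_into_items(filenames):
    """Map each product key to the sorted list of image files that belong to it."""
    items = defaultdict(list)
    for name in filenames:
        items[product_key(name)].append(name)
    return {key: sorted(names) for key, names in items.items()}
```

With real-world labels as messy as these, anything like this would only produce a first draft for a human to correct.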

Let’s talk for a moment about this.

Using hashes and comparing them is the roughest of rough approaches to de-duplicating software items. I do it with Apple II images because they tend to be self contained (a single .dsk file) and because Apple II software has a lot of people involved in it. I’m not alone by any means in acquiring these materials and I’m certainly not alone in terms of work being done to track down all the unique variations and most obscure and nearly lost packages written for this platform. If I was the only person in the world (or one of a tiny sliver) working on this I might be super careful with each and every item to catalog it – but I’m absolutely not; I count at least a half-dozen operations involved in Apple II floppy image ingestion.

And as a bonus, it’s a really nice platform. When someone puts their heart into an Apple II program, it rewards them and the end user as well – the graphics can be charming, the program flow intuitive, and the whole package just gleams on the screen. It’s rewarding to work with this corpus, so I’m using it as a test bed for all these methods, including using hashes.

But hash checksums are seriously not the be-all for this work. Anything can make a hash different – an added file, a modified bit, or a compilation of already-on-the-archive-in-a-hundred-places files that just happen to be grouped up slightly different than others. That said, it’s not overwhelming – you can read about what’s on a floppy and decide what you want pretty quickly; gigabytes will not be lost and the work to track down every single unique file has potential but isn’t necessary yet.

(For the people who care, the Internet Archive generates three different hashes (md5, crc32, sha1) and lists the size of the file – looking across all of those for comparison is pretty good for ensuring you probably have something new and unique.)
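To make that concrete, here is a small Python sketch that computes the same four values (md5, crc32, sha1 and size) for a blob of data using only the standard library. The function names and the dictionary layout are inventions of this example, not the Archive's own format.

```python
import hashlib
import zlib


def fingerprint(data):
    """Compute md5, crc32, sha1 and size for a bytes object."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "sha1": hashlib.sha1(data).hexdigest(),
        "size": len(data),
    }


def fingerprint_file(path):
    """Fingerprint a file on disk; a 140k disk image fits in memory easily."""
    with open(path, "rb") as handle:
        return fingerprint(handle.read())


def same_image(fp_a, fp_b):
    """Treat two files as identical only if every recorded value agrees."""
    return fp_a == fp_b
```

Comparing all four values at once costs nothing extra and makes an accidental match on any single checksum irrelevant.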

Once the items are up there, the Screen Shotgun whips into action. It plays the programs in the emulator, takes screenshots, leafs off the unique ones, and then assembles it all into a nice package. Again, not perfect but left alone, it does the work with no human intervention and gets things generally right. If you see a screenshot in this collection, a robot did it and I had nothing to do with it.

This leads, of course, to scaring out which programs are a tad not-bootable, and by that I mean that they boot up in the emulator and the emulator sees them and all, but the result is not that satisfying.

On a pure accuracy level, this is doing exactly what it’s supposed to – the disk wasn’t ever a properly packaged, self-contained item, and it needs a boot disk to go in the machine first before you swap the floppy. I intend to work with volunteers to help with this problem, but here is where it stands.

The solution in the meantime is a Java program modified by Kevin Savetz, which analyzes the floppy disk image and prints all the disk information it can find, including the contents of BASIC programs and textfiles. Here’s a non-booting disk where this worked out. The result is that this all gets ingested into the search engine of the Archive, and so if you’re looking for a file within the disk images, there’s a chance you’ll be able to find it.
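To give a feel for what "disk information" means at the simplest level, here is a rough Python sketch that lists the catalog of a disk image. It is not the Java program mentioned above and does far less. It assumes a DOS 3.3 disk stored as a DOS-order 140k image, with the VTOC at track 17, sector 0 and catalog sectors holding seven 35-byte file entries each. ProDOS disks, Pascal disks, and anything copy-protected would need different handling.

```python
SECTOR_SIZE = 256
SECTORS_PER_TRACK = 16
IMAGE_SIZE = 35 * SECTORS_PER_TRACK * SECTOR_SIZE  # 143,360 bytes

# Common DOS 3.3 type codes (low 7 bits of the type byte); others show as "?".
FILE_TYPES = {0x00: "T", 0x01: "I", 0x02: "A", 0x04: "B"}


def read_sector(image, track, sector):
    """Slice one sector out of a DOS-order image (assumed layout)."""
    start = (track * SECTORS_PER_TRACK + sector) * SECTOR_SIZE
    return image[start:start + SECTOR_SIZE]


def list_catalog(image):
    """Return (type, sector_count, name) for each live DOS 3.3 catalog entry."""
    if len(image) < IMAGE_SIZE:
        raise ValueError("image is smaller than a standard 140k disk")
    vtoc = read_sector(image, 17, 0)
    track, sector = vtoc[1], vtoc[2]  # location of the first catalog sector
    entries, visited = [], set()
    while track != 0 and (track, sector) not in visited:
        visited.add((track, sector))  # guard against looping catalog chains
        catalog = read_sector(image, track, sector)
        if len(catalog) < SECTOR_SIZE:
            break  # link points outside the image
        for index in range(7):
            entry = catalog[0x0B + index * 35:0x0B + (index + 1) * 35]
            if entry[0] in (0x00, 0xFF):  # never used, or deleted
                continue
            # Names are stored with the high bit set and padded with spaces.
            name = bytes(b & 0x7F for b in entry[3:33]).decode("ascii").rstrip()
            file_type = FILE_TYPES.get(entry[2] & 0x7F, "?")
            sector_count = entry[33] | (entry[34] << 8)
            entries.append((file_type, sector_count, name))
        track, sector = catalog[1], catalog[2]  # follow the link to the next sector
    return entries
```

A listing like this is exactly the sort of plain text a search engine can index, even when the disk itself refuses to boot.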

Once the robots have their way with all the items, I can go in and fix a few things, like screenshots that went south, or descriptions and titles that don’t reflect what actually boots up. The amount of work I, a single person, have to do is therefore reduced to something manageable.

I think this all works well enough for the contemporary vintage software researcher and end user. Perhaps that opinion is not universal.

What I can say, however, is that the core action here – of taking data away from a transient and at-risk storage medium and putting it into a slightly less transient, less at-risk storage medium – is 99% of the battle. To have the will to do it, to connect with the people who have these items around and to show them it’ll be painless for them, and to just take the time to shove floppies into a drive and read them, hundreds of times… that’s the huge mountain to climb right now. I no longer have particularly deep concerns about technology failing to work with these digital images, once they’re absorbed into the Internet. It’s this current time, out in the cold, unknown and unloved, that they’re the most at risk.

The rest, I’m going to say, is gravy.

I’ll talk more about exactly how tasty and real that gravy is in the future, but for now, please take a pleasant walk in the 3D0G Knight’s Domain.