Just Solve the Problem Month 2012: Nitty Gritty
This will be somewhat long and somewhat involved. It’s a posting meant to give my personal positions and assessments on the Just Solve the Problem 2012 Project, which is “Solve the File Formats Problem”. Ideally, it answers everything but specific choices down the line. If not, the Wiki will have answers. Here we go.
General Purpose of the Project – What is the “Problem”?
I call the File Format issue a “Problem”, and I go farther than that to say that it’s a problem deep enough that it requires hundreds of people a month to work through. For some that’s a very optimistic estimate and for others, well, it’s not clear there’s a “problem”. So let me explain that.
The “Problem” goes like this. There are people, a lot of people, who have information that is encoded in some sort of format, be it electronic or wrapped in colored ribbons or stamped with some bizarre upside-down chicken scratch. They’re faced with an issue that the thing they have within their possession is missing a final piece or set of pieces to release the information within. If you have a floppy disk, for example, you need a drive to read it. But even if you have the drive to read it, the machine to read the data might not be with you, and even if you have THAT and you are able to put the disk in the drive next to the machine, you might not know how to get the information in the format on the drive into something you can read. It’s a problem, you see.
(Now, there’s a SECOND layer of information you’re going to miss no matter what, and I’m just mentioning it to be complete – obviously if there is no record of the context of this data, that’s not going to be evident no matter what you do. If this poem is the last poem a person wrote before going missing, or if this scribbled set of English words that looks like a shopping list is in fact a calculation for committing a bank heist in code, then no amount of what you do is probably going to find that. For that, you need the efforts of lore, of interviews, and of collecting context. But that’s not part of this problem.)
So, the File Format problem comes into contact with the lives of hundreds of thousands of people. In many cases, they take the most efficient route: Fuck it. It’s old, it’s probably useless, you’ve gotten by without it for 15 years, chuck that crap, we need that room for a guest bed.
But that’s not sufficient if, say, you’re an archive or a library that has been gifted with many floppy disks created by a celebrated artist who died young and left a lot of mystery behind them. Then you not only want the data off the floppy disk, you want to really understand the format of the files on the floppy disk, including whether or not you can recover files, find changes in the files, or dig out all manner of other data that might have significance. As that article I linked to shows, it has significance indeed – critical changes in lyrics and structure, even to the point of showing possible intentions to change the work further. The simple miracle of pulling data off long-obsolete floppies becomes a bigger problem as you try to understand the formats, and even worse, understand unexpected side-benefits of the formats. There’s a lot there.
So assume, then, that what’s hidden away on that “dead media” or inside some file folders in a .zip file you found has actual significance.
For most institutions and individuals, this sort of interest/dalliance has a very specific path: you have a pile of one kind of crap, maybe two (108 floppy disks, 12 data cassettes) and you want the stuff “off” it however you can. Having looked around, you might or might not find solutions, although solutions do abound. And you are certainly never going to finish getting the 108 floppy disks and 12 data cassettes finished and then launch into a crusade to find every piece of magnetic media on your suburban block and volunteer to help everyone excise their data from the plastic coffins all of it languishes on.
My line of interest and work puts me in touch with a lot of people in the “history” biz, be it the professional archivers and librarians or the intense hobbyists of vintage and retro computing. And what really started to stick with me was the way that almost all of them had this file format problem, and had come up with some level of solution to it. Someone might go so far as to make a product, or a utility, or a code library to deal with it. For example, the ANSILOVE project does a fantastic job of taking ANSI Artwork (a hack of a DOS-based text encoding system that got used to make great pieces of art) and drilling down deep into interpreting it, covering as wide a range of files as it can, to save the text for future generations. Trust me, for 99.5% of all cases, it “solves” the ANSI Art translation problem. There are exceptions and there will ALWAYS be exceptions, but generally, if you want to see these files presented in a wide range of modern platforms, this utility/project will do it for you.
That said, a lot of people who remember that obscure format might have no idea something like ANSILOVE exists, might not know to go look for it, wouldn’t know where to start. Even though the solution exists, it might as well not exist for these people because they don’t even have the faith or the thinking to consider the solution might not consist of finding vintage hardware, dragging what they want into the old thing, and losing months trying to make the environment work again.
Now, expand this specific situation out. Keep expanding it.
No, really, really expand it out and now you’re running into The File Format Problem writ large, the wide spectrum of missed communication, lost information, and obscurity that things are suffering in. That’s what’s being addressed here.
And when I hang out with people who have an interest in some aspect of this issue, they say the same things: The problem is unsolvable, because it’s too big – too much is left to do, too many things have to be searched, there’s no funding to research this forever. We can’t make it happen, and we’ll just have to make do, keep writing grant proposals, and hope for the best.
This project is meant to put that idea to bed – to make it so that only the most obscure, customized, no-record-exists-of-the-data-or-the-format situations will linger on. To make it so that if someone gives you something on media, you can say “yeah, I think we can work with this” and be surprised when it doesn’t work out, instead of the other way around.
But There Are a Lot of Already Existent Projects Like This, Don’t You Know
I’ve said this a dozen times now and here it goes again: Just Solve the Problem is ABSOLUTELY NOT THE FIRST TIME THIS HAS BEEN TRIED AND IS DEFINITELY NOT THE ONLY SUCH PROJECT UNDERWAY IN THE PRESENT. There are plenty of versions of this project out there. Tons. To imply otherwise would make me a liar and blind on top of it, because many folks have come out to let me know, just in case I didn’t find out myself.
What distinguishes this project from all those similar efforts is the following:
- We have no affiliation whatsoever. There’s no organization with its own politics and biases in place, there’s nobody to go “woah, hey, let’s not go there”, and there’s certainly not anyone who’s going to pull the plug because someone flies a little too close to the sun.
- The white-hot effort is very directed within a 30 day period. Yes, this will stay up after 30 days, but the idea is that people are able to work on this project immediately and know that a bunch of other people are right there, working with them. There will be many eyes, many hands, involved in this effort and you’ll reload to see more and more changes go on throughout the month.
- I will say this even more explicitly in the next section, but this project doesn’t shield its eyes from these other projects and sites; it embraces them. It depends on them. What makes the File Format Problem project even somewhat achievable is the very existence of all these other resources.
Freed of these (entirely legitimate) boundaries of budget and scope the other projects and sites have, Just Solve The Problem 2012 can go in directions heretofore unexplored or left as “frivolous” and “wasting the budget”. That’s the big deal.
And the other big deal is that the effort to enumerate all these items is absolutely a public domain effort. The basic tenet is that all the collation and combination and addition of cross-referenced information this project brings can seep back into all the linked projects. The people working away can paw through the piles of data this project brings and then pick back whatever they want – it’s like they got a team of researchers and contributors for absolutely free. The rising tide that lifts all boats.
So let’s go into the basic fact of the project:
The Project Initially is the Collation of Already Extant Information!
The quote from William Gibson is this:
The future is already here — it’s just not very evenly distributed.
The idea being that a lot of what we think of as “the future” exists, but only in limited areas, available to researchers or the rich or otherwise prevented from being universal. This same situation exists with file format information – it is very, very rare for things that were put into a format to have not had documentation generated for them. In the case of a lot of file formats, software might have been written that reads and writes in that format. The code might function as the documentation, or in someone’s head right now is the format information that would unlock the data from its obscure setup.
I expect very little original research to be necessary to solve this problem for the vast, vast, vast majority of the file formats being addressed. There are people who are well versed in the BetacamSP format. There’s machines, there’s documentation, there’s examples and there’s available tapes. It’s just not right here, just like all of that same sort of stuff for punched cards and piano rolls and Lotus 1-2-3 files is not right here... but again, it can be.
Myself and a veritable army of volunteers have been uploading Shareware CD-ROM images to archive.org. We’re well past 1,500 CD-ROM images. And we’ve got a couple thousand more to go up. Well, buried on those shareware CD-ROMs are tens of thousands of utilities, written in the present day of the file formats they use, that can transfer between formats, commit action on those formats, and create new files using those formats – and that’s not even counting the documentation, which often shows off the file header information or back-channel knowledge of the file formats being used. The concrete answers to thousands of file format questions are just sitting there, waiting for someone to connect them up.
Good work has been done on the other directories and sites by their staff. In many cases, they have limited resources of disk space, bandwidth, contributors or funding to go too far. We can take what they have and integrate it and link right back to them. We can make it that if someone finds they have an IFF/ILBM image from an Amiga, well hell yes we’re going to have a page with every last piece of collated information, including code and writing, that will help them make that stuff live again.
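To make the “file header information” idea a little more concrete, here is a minimal sketch in Python of what that kind of documentation lets you do: look at the first few bytes of a mystery file and take a guess at what it is. The byte signatures listed are the well-known ones for these formats; the function names `sniff_format` and `sniff_file` and the little signature table are made up for illustration and are not part of any tool this project uses.

```python
import struct

# Well-known leading byte signatures ("magic numbers") for a few formats.
# This table is a tiny illustrative sample, nowhere near a complete registry.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"fLaC", "FLAC audio"),
    (b"PK\x03\x04", "ZIP archive"),
]


def sniff_format(header: bytes) -> str:
    """Guess a format name from the first bytes of a file (hypothetical helper)."""
    # IFF containers start with "FORM", then a 4-byte big-endian chunk size,
    # then a 4-byte type ID such as "ILBM" for Amiga bitmap images.
    if len(header) >= 12 and header[:4] == b"FORM":
        (size,) = struct.unpack(">I", header[4:8])
        form_type = header[8:12].decode("ascii", errors="replace")
        return f"IFF/{form_type} (chunk size {size})"
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 16 bytes of a file on disk and guess its format."""
    with open(path, "rb") as f:
        return sniff_format(f.read(16))
```

A guess like this only gets you to the right page of documentation. It says nothing about whether the rest of the file is intact, and plenty of older formats have no signature at all.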
Realize, therefore, that there will be volunteers on this project who will do nothing but shuttle between websites and add links to the Just Solve the Problem Wiki. That’s all they need do – wander into the entry for FLAC and dump in a hundred informative links, and then move onto the entry for Wax Cylinder and add those. They don’t have to knock on doors, or make phone calls, or run endless nights of coding and experimentation – they have to take someone else’s experimentation and endless nights of coding and link to it. That’ll be quite a heroic act in itself!
What kind of Person Will Be Involved?
I’ve sketched out some roles that people might play in the Wiki:
- Explorers are on the never-ending quest to find more file formats, more obscure references to file formats, and hidden away gems and information the File Format Problem can use. They reference the Sources or even find brand new Sources to use and add them to that list.
- Backfillers go to already-extant entries and add in greater details, including summaries, links to pages others have written in the web about the format and the subject, and acquiring some select images or items to represent the format. They pull from the Sources but also just do the basic effort to make a page be something more than blank.
- WikiWonks look over a given page and fix the MediaWiki encoding so that the item is more easily readable. If you create general templates that can serve out pages better, we’ll apply them to the whole of the pages that fall under the template. The more time freed up to acquire information, the better.
- Essayists are writing or referencing sets of documents to create critical new histories or descriptions that go far beyond a technical view of a format. If there are litigation, research, or human aspects to the format that should at least get a summary, the Essayists are adding them.
All of these are in flux and you need not be one the whole time. But all are needed and all will play a vital part.
I’ve glanced at the Wiki and you do things a Certain Way and So You Are DOOOOOOOMED
So, my experience in some quarters with a project like this is that people feel they need an entire specification written out, must float it amongst committees, must run it for authorization and sign-off from authorities, and only then begin the slow process of applying for a grant to make it all happen.
Yeah, well, guess what. We’re already underway.
As of this writing, we’re dumping in general headings of file formats, building up a huge source directory (including sites, documents, books and other materials), and kind of flinging together the ontology as we go along. Give it a week, it’ll all be different.
I thought you hated Wikis.
I’m known as a major Wikipedia critic but that’s a very different thing than the software itself, MediaWiki. I happen to like that software very, very much. And with it having run under the white-hot testing environment of Wikipedia, I know the software holds up, and holds out. For the act of collaboration, of calling together all this data and then implementing templates and automatically generated directories, it’s a great way to go. I’m not concerned about the software suddenly hitting some upper limit as we do this. We can concentrate on getting the problem under control.
So now what?
Well, all I ask is that you try.
Write to me at firstname.lastname@example.org that you want to register an account on the Wiki. Give me the username you want. Come on and poke around.
If you think the project is worthwhile, tell your friends or colleagues or communities about what we’re doing. Rope them in – get everyone to pitch in.
I wish we had a thousand people working on this. 1000 people for 30 days would demolish this problem. The resulting directory of file formats and links would be a breathtaking version 1.0 of reference material, the go-to location for getting started on reading or saving a file that otherwise would languish and disappear. It’d be a place where you’d know who to contact and what to use. It’d just solve the problem.
Let’s do it.
This project sounds very interesting. I know this may seem overkill, and yes, I know this project’s middle name is sort of overkill, but I hope people create archives of all those sites to which the wiki links. Of course, this is to avoid situations where umpteen years from now, someone goes to the wiki, finds a link to the specification for insert really really obscure file format here, which is the only known record of that file format, clicks that link, and is greeted with a real, live, working example, not of the obscure, nobody else even remembers it, file format, but of an HTTP error 404.
Isn’t a lot of this material already in Wikipedia?
Seems like duplication… but a good idea overall.
A lot of material IS on Wikipedia, as well as on a couple hundred other scattered sites.
Besides the fact that Wikipedia has shown a propensity to randomly delete articles based on “not notable” or “lacking citation”, it’s also not geared towards providing long-form collections of overviews of file formats – that is, if an article gets too long (according to an arbitrary measurement) or too detailed (according to an arbitrary measurement), editors will hack it at the knees to make it fit the bed.
But more importantly, there are sites that are dedicated to listing file formats – and there are sites dedicated to file formats that are wikis! The purpose of this project is to throw 30 days of effort into it from a lot of people and then provide all that information back to all these sources. If the project was painstakingly rewriting already-extant Wikipedia entries, using just that source for its items, then yes, this would be duplication. But we’re focusing on drawing together lots of information and getting it into a form that will both make it easier to look up a file format and benefit a bunch of other attempts to solve this problem.