Let’s Just Solve the Problem Month — July 2, 2012

I had really good success when I put out the call for the Archive Team, so let’s try that again, with an entirely new idea.

I would like to declare November 2012 the very first Let’s Just Solve the Problem Month.

Here’s how it works, and what problem I want to solve.

As that sexy pontificator Clay Shirky has said on several occasions, instead of getting hung up on whether Wikipedia is great or not great, realize that Wikipedia represents a massive expenditure of energy recovered from not watching television. Not only that, but Wikipedia is one of what could be many different things happening that benefit the world. All you need is a dash of organization, a clear set of principles, and off you go.

I buy into this.

I also buy into the idea behind National Novel Writing Month, which has at its core that everyone has at least one (incredibly shitty, possibly unreadable, vogon-level-quality) novel inside them, and by setting aside one month of you being encouraged, forced, guilted and tortured, you will blow out one 50,000-word novel in that time. What happens next is up to you – burn it and move on, take it aside and polish it until you’re the next JK Rowling (or Hunter S. Thompson), or whatever tickles your fancy. But at the end, YOU WROTE A NOVEL BEFORE YOU DIED. Not bad.

What I know to be true is that there are a number of “problems” out there that need to be solved, that need one single thing to push them from “impossible” to “solved”, or, at least, “1.0”. And that thing that it needs is a lot of human thinking. Often rote, often boring, but necessary, to slam that thing out.

So since I got to come up with this idea, let me declare the first month, November 2012, to be SOLVE THE FILE FORMAT PROBLEM MONTH.

Here’s the problem, in more detail:

In the last couple centuries, we’ve created a number of self-encapsulated data sets, or “files”. Be they letters, programs, tapes, stamped foil, piano rolls, you name it. And while many of those data sets are self-evident, a fuck-ton are not. They’re obscure. They’re weird. And worst of all, many of them are the vital link to scores of historical information.

Everyone knows this problem. It’s why old novelists cry they can’t pull their first novel out of WordPerfect. It’s why someone who used U-matic tapes to record the first meetings of a famous protest group goes “oh well”. It’s why, in all things, someone looks at anything older than five years, and goes “bye”, figuring there’s nothing they can do.

And I’ve had to listen to the mewings about this problem for at least 20 years now, in various forms. A lot. And then the person lights up about maybe solving this problem, and then dims and says “well, we can’t really solve the problem”. Because they know – it’d take an army of people to do it.

Let’s make that goddamned army.

And before I give you a battle plan, let me say: This will solve a major issue. This will give thousands, later millions, access to a whole range of materials now shut off from each other. Stuff being made after 2012 will be scrutinized to see if it has made ways to access it clear. Stuff made before? We’ll have docs, or a thread, or even a few first steps towards understanding what it was. People writing modern software will be able to make filters or plugins that use these standards – it’ll drop from being a needless rathole to being a simple matter of writing a perl library or a javascript routine to pull the data in and make it work with the new thing. That will be very helpful indeed.

Battle Plan:

In October, I’ll be making noise about this happening. We’ll have a logo, and we’ll have some preliminary work done.
It’ll be a big wiki, with people taking various roles of the exciting and boring parts, working on a structure, yanking in what we need.
We’ll scour the internet, and online and offline worlds to pull in every potential format ever. If it sounds like a hierarchy issue, yes indeed it is… but classification’s bugbear is a distant second to acquiring the wealth of formats now extant.
We’ll acquire examples of the formats, links to programs that deal with the formats, known variations or problems with the format, and so on.
We’ll keep doing this from the low-hanging GIF and JPEG and PNG documentation, to the aforementioned piano rolls, microfiche, obscure barcode formats and disk layouts of Cray platters. We’ll just keep doing it.
At the end of the month, having had our knees on the chest of this problem for 30 days, we’ll be dragged off the problem, kicking and screaming and still punching, and see where we are.
The resulting work will be open-licensed and available to anyone.
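To make the battle plan a little more concrete, here is a minimal Python sketch of what one entry in such a collection might hold, plus the simplest possible use of it: identifying a file by its leading bytes. The `FormatEntry` class, its field names, `REGISTRY` and `identify` are all hypothetical, invented for illustration and not part of any planned wiki structure. The only facts assumed are the well-known leading signatures of the three low-hanging formats named above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FormatEntry:
    """One hypothetical registry entry; fields mirror the battle plan."""
    name: str
    signatures: List[bytes]                                  # known leading "magic" bytes
    sample_files: List[str] = field(default_factory=list)    # examples of the format
    programs: List[str] = field(default_factory=list)        # tools that read or write it
    known_problems: List[str] = field(default_factory=list)  # variations, quirks


# Illustrative entries only: the three easy formats mentioned in the plan.
REGISTRY = [
    FormatEntry("GIF", [b"GIF87a", b"GIF89a"]),
    FormatEntry("JPEG", [b"\xff\xd8\xff"]),
    FormatEntry("PNG", [b"\x89PNG\r\n\x1a\n"]),
]


def identify(data: bytes) -> Optional[str]:
    """Return the name of the first entry whose signature starts the data."""
    for entry in REGISTRY:
        if any(data.startswith(sig) for sig in entry.signatures):
            return entry.name
    return None  # unknown: exactly the case the project is meant to shrink
```

Calling `identify(open(path, "rb").read(16))` on a file would be the typical use. Real formats often need more than a prefix check (offsets, trailers, container structure), which is why the entry also leaves room for notes on variations and problems.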

Now, if you just read all this, let out a big “pffffff” and are having your fingers twitching with the urge to write about how this is all impossible, just get the fuck out now. The project doesn’t need you, now or ever. Just enjoy the summer, grasshopper, and come knocking on the ant’s door in December when we’re at 1.0.

But if you read this and said “Well, I could take a shot at it, might be worth a few hours”, then you’re EXACTLY what is needed.

Think what giving a month every year will do for a problem like this. There’s plenty of others – but this is one that has vital meaning to the work I’ve done with Archive Team and to the hundreds of archivists and historians I’ve met over the past few years. If this problem is in some way handled, if an OED of formats is blown out, lives will change – projects thought undoable will be doable, and the flood of old information saved will be incalculable.

So who’s with me? SEE YOU IN NOVEMBER.

Categorised as: Archive Team | computer history

42 Comments

Dag Ågren says:

July 2, 2012 at 10:42 am

Well, I’ve sort of started already. I’ve been slowly investigating and implementing old archive formats for The Unarchiver, and I have the only open-source implementation of at least a handful of archive formats by now. Some are available only as source code and some I’ve written some documentation for here.

Not sure if I’ll have the time to help out further, but I might. I’m interested in seeing how it turns out, anyway.
Benjamin Ragheb says:

July 2, 2012 at 11:35 am

Gosh, I don’t know if I’ll be able to help with this AND write a novel, but I’ll try!
Vince Mulhollon says:

July 2, 2012 at 11:37 am

Go go go. I know a lot of retrocomputing-ish formats (pdp-8 OS/8 disk format, etc) and where to get the info for many more.
Jason Scott says:

July 2, 2012 at 12:13 pm

Enjoyable! We’re doing some preliminary planning (you can’t have a party without knowing who’s catering or what bar we’re renting out) on this page: http://www.archiveteam.org/index.php?title=Just_Solve_the_Problem_2012
Call to arms: solve the file format problem « Unsustainable Ideas says:

July 3, 2012 at 2:49 am

[…] digital objects from their (and our own) past. That’s why I’m really pumped up about Jason Scott’s call that we make November 2012 as “SOLVE THE FILE FORMAT PROBLEM MONTH”. It’s a great […]
Euan Cochrane says:

July 3, 2012 at 3:29 am

Great venture!

I can’t emphasize enough the need to document variants of standard formats and to associate them with documentation about software environments that create them.

I guess I’ll have to get involved in November…
Edward O'Connor says:

July 3, 2012 at 10:40 am

If you’re looking for specs that need writing, the Web platform is missing several specs: http://wiki.whatwg.org/wiki/Specs_todo
Stephen Abrams says:

July 3, 2012 at 10:48 am

Within the library and archives community there are two public format registries that are first steps towards dealing with this problem: PRONOM, from the UK National Archives, http://nationalarchives.gov.uk/PRONOM/Default.aspx; and (newly announced today), the Unified Digital Format Registry (UDFR), from the University of California Curation Center (UC3), funded by the Library of Congress, http://udfr.org/.
Just Solve the Problem Month / The Opt Out Weblog says:

July 3, 2012 at 10:52 am

[…] July 3rd, 2012Independence, Other Efforts “What I know to be true is that there are a number of “problems” out there that need to be… […]
bowerbird says:

July 3, 2012 at 11:02 am

been working on this problem for about a decade
— started off with project gutenberg e-texts —
honing a _simple_ solution i’m about to take wide.

gonna use kickstarter to raise money so that i can
put all the source-code, etc., in the public-domain.
so if people would prefer to contribute a few bucks
rather than a few hours of time, i’d appreciate that…

interested people can e-mail me for a sneak preview:
> bowerbird@aol.com

-bowerbird
Larry Masinter says:

July 3, 2012 at 11:05 am

I’ve been thinking about the file format identification and description problem since the late 80’s. Most recently, I think the breakthrough for me has been to recognize that the identity of a “format” (and its versions and variants) evolves over time, and that we can associate both formats, their specifications, and their implementations.

See http://www.w3.org/wiki/Evolution
Jerome McDonough says:

July 3, 2012 at 11:55 am

Actually, currently working on trying to establish an instance of Ontowiki with customized forms for entering data on file formats that will allow A. anyone to add information to extend the knowledgebase and B. maintain it in RDF/OWL behind the scenes so it will be processable by other machines (for things like DROID, JHOVE). This is part of the Preserving Virtual Worlds work and will require some refactoring of the original PVW ontology but we should have this up and running before November, so….
Leslie Johnston says:

July 3, 2012 at 5:00 pm

You know we have extensive documentation on file formats at the Library of Congress: http://www.digitalpreservation.gov/formats/index.shtml. And I see that Stephen Abrams has already shared the links for the UDFR format registry and the PRONOM format registry. And Jerry – the UDFR that launched this week is built on Ontowiki and supports data entry. You should touch base with Stephen.
- Caroline Arms says:
  
  October 25, 2012 at 5:24 am
  
  FYI to the “Just Solve the Problem” team:
  The format descriptions at http://www.digitalpreservation.gov/formats/index.shtml are derived from XML documents. If those will be useful, the whole set can be downloaded from http://www.digitalpreservation.gov/formats/fddXML.zip
  
  Individual XML files are also available using a standard pattern: The XML source for http://www.digitalpreservation.gov/formats/fdd/fdd000328.shtml is at http://www.digitalpreservation.gov/formats/fddXML/fdd000328.xml
  - Jason Scott says:
    
    October 25, 2012 at 9:08 am
    
    Thank you, Caroline!
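The URL pattern Caroline Arms describes above can be sketched as a small Python helper. This is an illustration only: the function name `fdd_xml_url` and the regular expression are assumptions, built from the single example pair of URLs given in her comment, and the code does not fetch anything or check that a given XML file actually exists.

```python
import re

# Matches format-description pages of the shape shown in the comment above,
# e.g. http://www.digitalpreservation.gov/formats/fdd/fdd000328.shtml
FDD_PAGE = re.compile(
    r"^(https?://www\.digitalpreservation\.gov/formats)/fdd/(fdd\d+)\.shtml$"
)


def fdd_xml_url(page_url: str) -> str:
    """Map a format-description page URL to its XML source URL."""
    match = FDD_PAGE.match(page_url)
    if match is None:
        raise ValueError(f"not a format description page URL: {page_url}")
    base, fdd_id = match.groups()
    return f"{base}/fddXML/{fdd_id}.xml"
```

For the example in the comment, `fdd_xml_url("http://www.digitalpreservation.gov/formats/fdd/fdd000328.shtml")` yields the `fddXML/fdd000328.xml` address quoted there.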
Jason Scott says:

July 3, 2012 at 5:49 pm

I am delighted to hear from many friends old and new! Thanks for all coming out.

Yes, part of the tenaciousness and difficulty of this problem is that so many people are trying to solve it, and to solve it in a way of interest to that group, and with an eye towards their goals.

What I am proposing to do is to bring a white-hot laser of focus into this problem for 30 manageable days by a huge variety of folks. One of the things this project will do is drain all already-extant information about all formats into a wiki, and build links to a whole other range of materials. With luck, we can also have the incoming information formatted in a way that the other efforts could pull the information BACK into their little cubbyholes.

That’s the goal here – pour a lot of energy into this, get a real grip on as much of everything as possible, and drill down in every direction. I can’t predict how many people will go for it, but imagine 1000 people for 30 days.
Response to the “call to arms” post « Unsustainable Ideas says:

July 4, 2012 at 2:32 am

[…] I wrote in excitement at Jason Scott’s call to arms to make November 2012 the month to “solve the file format problem“. While excited, I’m not quite clear yet what the problem is, and suggested some […]
Chris Rusbridge says:

July 4, 2012 at 2:47 am

I gather you’ve seen my blog post so you know I’m excited about this. I’ve written lots about aspects of this problem in the past, too, and have even tried (and failed) to do something about it. I’d really like to contribute, the question is, how do we do that? Where do I sign up? What can I do (not been a coder for almost 20 years, so that’s not much help)?
klaatu says:

July 4, 2012 at 5:57 am

Great idea! I feel like I’m a broken record when I warn artist friends that their art could be locked inside a format that no app will decipher in X number of years and blah blah blah, open formats, blah blah. No one listens. No one but Jason Scott, that is!
David Boddie says:

July 5, 2012 at 1:21 pm

I’ll definitely be interested in taking part in this effort, even if only in a small way. I’ve been writing tools to pull data out of files in obscure formats on a non-mainstream computing platform (RISC OS) for over a decade, and information about these needs to be collected somewhere. I can at least start by collecting some links together.

There were potentially quite a few documents created in undocumented, proprietary formats on this platform, many by teachers and schoolchildren because of the widespread use of the platform in schools. It would be good to recover some of that history before it is too late.
Ben says:

July 7, 2012 at 5:11 pm

Maybe set up an announcement mailing list to get the army ready for command?
Kara Van Malssen says:

July 9, 2012 at 9:08 am

LOVE this so much. I teach Digital Preservation for NYU’s Moving Image Archiving and Preservation MA program. I plan on incorporating JUST SOLVE THE PROBLEM into the syllabus for this November, and get the students contributing! In class, out of class, final assignments, etc. I can’t wait. If there’s anything in particular you want 10-15 students to work on, starting in September, please let me know!
Ed Summers says:

July 9, 2012 at 9:39 am

I’m a big fan of an experiment to “drain all already-extant information about all formats into a wiki”. I’m still a bit confused about what the problem is though. Is the problem identifying the file formats, using the file formats, or both? It seems to me that the second problem isn’t very tractable. I’m glad others have chimed in with data sources that would be good to drain. There is also the Wikipedia Digital Preservation Project which aims to improve the content of articles about file formats including their use of the File Format Infobox. I know you aren’t the biggest fan of Wikipedia, but have you thought at all about pointing your laser in that direction?
Scott Francis says:

July 10, 2012 at 6:57 am

+1 for mailing list; it’s a long time between now and October and attention spans on the Internet are notoriously short. Some kind of reminder in my inbox (if not full-on discussions) would be very helpful.
Ludwig Ertl says:

July 17, 2012 at 4:02 am

If looking for some file format Documentation, I’m usually having a look at http://www.wotsit.org
They have documentation for at least a few file extensions.

I wrote some programs to recover data from bad Canon Starwrite 80 typewriter disks, maybe it’s useful for someone who is still using this old typewriter, so maybe I should contribute it..?
stelt says:

July 20, 2012 at 12:52 am

What I would love: a file explorer extension (both web browser based and native(-ly wrapped) to Windows, MacOS, Linux) that not only gives me human-readable information on the file format, but also gives me user-friendly functionality (that is of course continuously improving, based on some on-line database/wiki):
– An improved “open with” could give me not (only) installed programs that have registered themselves within the OS file associations, but also web services and OS-native open source programs ready for download. Again of course not just providing a link and leaving it to the end user, but handling the hassle for him/her.
– Extra columns in this filebrowser providing extra information. If it’s XML, is it well-formed? If it’s HTML, is it valid? Does it have dead links? Is it mobile-ready? Is it accessible? Will it give problems in certain browsers? For SVG, does it have semantics? For text, what encoding, what language, what’s the % of spelling errors? You see we need some plug-in scheme here. Plug-ins referring back to OS-native and web services. Flexible, not always running all the checks, not locking on slow stuff (either local or not).
– Extra filetype analysis tools, as the extension isn’t always enough.

Let’s not only fix the problem, but also make it easier for the end user not to grow the problem than to grow it. Provide super user-friendly functionality for open formats and provide some functionality for closed formats too and this will even direct the lazy end user who doesn’t (want to) know about the file type problem in a positive direction.
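The plug-in scheme stelt describes above could be sketched along these lines in Python. Everything here is hypothetical: the `CHECKS` table, the check names and `run_checks` are invented for illustration, and the two checks shown (XML well-formedness and UTF-8 decodability) are only the simplest of the columns mentioned. Selecting checks by name reflects the point about not always running everything.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, Optional


def xml_well_formed(data: bytes) -> bool:
    """True if the bytes parse as a well-formed XML document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


def valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical plug-in table: check name -> function producing one column.
CHECKS: Dict[str, Callable[[bytes], bool]] = {
    "xml_well_formed": xml_well_formed,
    "valid_utf8": valid_utf8,
}


def run_checks(data: bytes, names: Optional[Iterable[str]] = None) -> Dict[str, bool]:
    """Run only the requested checks (all of them if none are named)."""
    selected = list(CHECKS) if names is None else list(names)
    return {name: CHECKS[name](data) for name in selected}
```

New checks would be added by registering another function in the table; slower ones (dead links, spelling) would presumably want to run asynchronously, which this sketch does not attempt.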
Alex Keller says:

July 24, 2012 at 3:16 am

Likely to result in XMLeTeX. Though I will both be spreading the word and helping come Nov. 🙂
Robert Marker says:

July 25, 2012 at 6:58 am

Not much of a coder but am one hell of a tester. I can bring any file or system to its knees. Just keep pushing the limits until something hangs and then go looking at the logs! This proprietary or otherwise file issue is a pet peeve. I will watch this site and contribute if and when I can. Go get em folks.
Jayson Smith says:

July 25, 2012 at 1:07 pm

I know of at least a few physical object formats that nevertheless have computer data, namely, Casio ROM packs and Yamaha Playcards. These were music cartridge/card formats introduced in the early 80’s by their respective manufacturers. Apparently the Casio Rompack patent provides loads of information, while the Yamaha Playcard patent provides no information about the encoding of the data on the magstrip, although it seems to be a simple AFSK system, however, I don’t know of anyone who knows any more than that. The Yamaha Playcard system never had a writeable version, so only a finite number of distinct cards exist. Casio RomPacks, as I understand it, did have a recordable version, although it was probably backed up by battery.
Darkstar says:

August 8, 2012 at 12:58 am

Yay! GO GO GO!
I’ll try to add some stuff about an old DOS based tape backup system called Central Point Backup, with which I recently had to restore some 200 DDS2 tapes (I had to build a DOS based, networked 386 for that, it would have been so much more convenient if there had been a way to do it on windows/unix)

-Darkstar
Val says:

August 17, 2012 at 1:27 pm

I’m not sure what I can do, and this is an intimidating goal, but if I can do anything to help I’ll be gobsmacked. Read wiki page. Waiting to learn more.
JCB says:

August 26, 2012 at 9:43 am

Excellent idea, Jason.

I’ve mentally flagged this as the “undead media project”, as a tip of the hat to the old Bruce Sterling effort (which sadly seems to have gone the way of what it was trying to document).

If there’s anything my modest skills can help with, I will certainly try to do so. I’m certainly going to get the word out amongst the classic computing and hackerspace denizens I know.
Legacy document formats « Unsustainable Ideas says:

September 27, 2012 at 2:51 am

[…] the Jason Scott November File Format month of action comes closer [update: original post here and wiki page here], and also as I wrestle with trying to access some 50 or so Powerpoint 4.0 files […]
The PowerPoint 4.0 adventure: what did I learn? « Unsustainable Ideas says:

October 15, 2012 at 5:24 am

[…] for the latest state) was to see what I could learn from it, with half an eye on the Jason Scott November month of action on file formats, see also planning here). So, what did I learn that might be of more general interest than the […]
Open letter to Microsoft on specs for obsolete file formats « Unsustainable Ideas says:

October 22, 2012 at 7:34 am

[…] call for action to “Solve the File Format Problem” scheduled for this November (original post here and wiki page here http://www.archiveteam.org/index.php?title=Just_Solve_the_Problem_2012). Jason […]
Chris Muller - Raider of the Lost Archives says:

October 24, 2012 at 9:26 am

Jason,
This is a great idea. Full disclosure: I make my living doing some of the things you’ve talked about (figuring out old tape, disk and file formats). So there’s a personal interest in keeping some stuff close to the vest*. But I do have spasms of altruism and would like to contribute where I can without strangling my cash flow. I’ll bet there are a number of elder-geeks like me that have tidbits stashed away, currently out of google-reach. This type of crowd-sourcing could go a long way.
Chris
- Jason Scott says:
  
  October 24, 2012 at 9:40 am
  
  Chris, there will always be money in transferring and collating/sorting/classifying old data for people, enterprises and other organizations. None of this eats into that. In theory, it increases it because being aware of all the file formats means that no disk is worthless because it has stuff from THAT application, the one nobody can read from.
  - Chris Muller - Raider of the Lost Archives says:
    
    October 24, 2012 at 10:29 am
    
    Jason, disagree and agree. Content analysis and conversion is a bigger part of what I do than is media conversion. One small example: over the past few years I’ve gotten some tapes with files in the oldest mainframe versions of SPSS (70’s, early 80’s), not compatible with current versions. No one had been able to convert, and nobody had funding to support immediate efforts. Contacted IBM, who now owns SPSS, and even the professor who originally created it–he didn’t have any documentation or significant memory of it. So over the months I spent many hours puzzling it out, finally wrote a converter. Now, a few times a year, an academic will run across some of this old stuff and I make a modest amount converting it. I agree the knowledge shouldn’t go to waste, and there are certainly others like me that can contribute, but it means giving something up.
    Now for the agreement part. There are bound to be other situations where the information is extremely unlikely to produce revenue. Also, some of the work done by folks like me is funded by NSF and similar organizations. It might be helpful to urge funders of work that involves converting arcane stuff to require that special knowledge gained during the project (and perhaps source code) be made available to the public. In that case it could be fair for them to pay a bit more for compliance.
Larry Masinter says:

October 24, 2012 at 9:58 am

http://seattleareaarchivists.wordpress.com/resources/webography-of-resources/preservation/

has links to many existing registries of file formats and documentation representing multi-year efforts.
http://www.nationalarchives.gov.uk/PRONOM/Default.aspx
looks promising
Chris Muller - Raider of the Lost Archives says:

October 25, 2012 at 11:25 am

Of course, arcane media/backup/file-system formats are also an obstacle to accessing legacy data. Proprietary backup formats, for example, such as HP’s fbackup which isn’t compatible with other Unix variants. A whole bunch of optical disk formats were created in the early days by vendors that no longer exist. Do any of these format-info repositories cover these things?
Heather Bowden says:

October 25, 2012 at 12:39 pm

I’m IN.
Chris Muller - Raider of the Lost Archives says:

October 26, 2012 at 6:37 am

related topic: Data At Risk Inventory (DARI)

One of the factors that puts data at risk is the very thing we’re discussing here: the format is no longer understood.

More about DARI for anyone interested…

CODATA, the Committee on Data for Science and Technology, is an interdisciplinary Scientific Committee of the International Council for Science (ICSU) that was established 40 years ago. Among its groups is the “Data at Risk Task Group” (DARTG). See http://ils.unc.edu/~janeg/dartg/

A major goal of DARTG is to create an Inventory of data that are at risk, and whose unique scientific information is in danger of being lost to posterity. (The Inventory will become the foundation for a Phase II project to design a series of missions to rescue that information.)

Interested parties are encouraged to submit descriptions of the endangered data through the submission form: http://ibiblio.org/data-at-risk/contribution
