ASCII by Jason Scott

Jason Scott's Weblog

Some Random, Unsorted Thoughts on Sorting

A few people, when I recently talked about the new hard drives I bought, asked me about how I sort things, since they have absolute tons of random files as well. Totally understandable, and I’ll happily talk about it, but I have to warn you that I don’t do anything according to any known code or formatting. I do what works for me.

If nothing else, I have to stress the most important rule, which I picked up, of all places, from AEleen Frisch’s book “Essential System Administration”. In her book, she tells an anecdote, which I will now tell to you.

“I learned about the importance of reversibility from a friend who worked in a museum putting together ancient pottery fragments. The museum followed this practice so that if better reconstructive techniques were developed in the future, they could undo the current work and use the better method. As far as possible, I’ve tried to do the same with computers, adding changes gradually and preserving a path by which to back out of them.”

A little white-hot cube of brilliance, that is. And that’s the #1 thing: any methods I provide or come up with, or which you use, must be ones that, down the road, you can completely undo as better technology and techniques become available. Specific to the sorting of files, this means I don’t kill off compilations, delete metadata, undo ISOs, or otherwise split apart that which can’t be immediately unsplit. I also, whenever possible, try to keep things together that were always together. In all cases, it’s because as time goes on, things get better.

I have a FreeBSD file server using Samba to allow my Windows box to interact with the hard drives. This is important because it lets me choose utilities that work in Windows as well as scripts and applications that work in FreeBSD/Linux. So I get whatever does the job best.

You can’t survive, once you go past a few tens of thousands of files, without some sort of doubles checker. I use a freeware program called CloneSpy as well as some Perl scripts that find duplicates. I actually have a version of the Perl script that always deletes the newest files that are doubled; this lets me run it automatically as needed and kill off the redundant newcomers.
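For anyone who wants to try the same idea, here is a rough sketch of a "keep the oldest, flag the newer doubles" checker. It is written in Python rather than Perl, and it is not the script described above: the function names, the use of SHA-256, and the dry-run default are all assumptions made for illustration. It defaults to only listing the newer copies, in keeping with the reversibility rule, and deletes nothing unless asked.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_newer_duplicates(root):
    """Return paths under root whose content duplicates an older file."""
    # Group by size first: a file with a unique size cannot have a double,
    # so it never needs to be hashed.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_content = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_content[(size, file_digest(path))].append(path)

    redundant = []
    for paths in by_content.values():
        # Oldest modification time first; ties are broken by path.
        paths.sort(key=lambda p: (os.path.getmtime(p), p))
        redundant.extend(paths[1:])  # everything after the oldest copy
    return sorted(redundant)


def remove_newer_duplicates(root, dry_run=True):
    """List the newer copies; delete them only when dry_run is False."""
    victims = find_newer_duplicates(root)
    for path in victims:
        if not dry_run:
            os.remove(path)
    return victims
```

Calling remove_newer_duplicates("DOCUMENTS") returns the list of files it would remove; passing dry_run=False is the step that cannot be undone, so it is worth reading the list first.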

I am always erring on the side of “get it again” if I can’t recall if I downloaded anything. As a result of that, I have a lot of doubled data; I just found recently that I had over 40 GB of redundant data collected on 15 hard drives across three machines. That’s a lot of downloading the same stuff. But better that than the sad keening I get from people who can’t believe that yyysite.com has gone under and nobody kept a copy. I keep a lot of copies.

So first, I split stuff into generic massive folders. In my case, it’s IMAGES, MOVIES, AUDIO, WEBSITES, APPLICATIONS, and DOCUMENTS. I acquire something, and throw it into one of these massive headers. That’s good enough sorting for my needs, on the spur of the moment. At least it’s generally there.

Underneath each one are arbitrary collections. So, for DOCUMENTS, I have sub-folders like MANUALS, MAGAZINES, BOOKS, and so on. Under MOVIES we have sub-headings like MUSIC VIDEOS, MUSICAL EVENTS, PRESENTATIONS, TECHNICAL DEMOS, CAMERA DEMOS. I built each one up when I had a collection of movies that would fill such a directory. As you can see, these are arbitrary. Is something a technical demo or a camera demo? Is it both? I choose one, randomly.

Under DOCUMENTS/BOOKS I will likely have thousands of documents representing books (and textfiles and PDFs and so on). So, if it starts getting big, I add subfolders under THAT like POSTERS, FICTION, TECHNICAL, SCIENCE, HAM RADIO, and so on. Each one gets a bunch of books.

Now, you would likely split things up differently, and we would probably disagree on what goes into TECHNICAL and what goes into SCIENCE. And indeed, sometimes I will yank something out of one folder and put it elsewhere.

But what I’m doing in all this is reducing the size of any given directory. Instead of having to stare, dumbly, at a multi-thousand-file data dump that I can barely get through the “A”s without glazing over, I have a few trees I can browse in.
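If you would rather have the machine tell you which directories have grown into data dumps, something like the following Python sketch would do it. The function name and the limit of 1000 files are assumptions picked for illustration; the right threshold is whatever makes your own eyes glaze over.

```python
import os


def crowded_directories(root, limit=1000):
    """Return (file_count, path) pairs for directories with more than limit files."""
    crowded = []
    for dirpath, _dirs, files in os.walk(root):
        # Only files directly inside this directory are counted,
        # not those in its subfolders.
        if len(files) > limit:
            crowded.append((len(files), dirpath))
    # Most crowded first; ties are ordered by path.
    crowded.sort(key=lambda item: (-item[0], item[1]))
    return crowded
```

Running it over a top-level folder such as DOCUMENTS gives a list of candidates for splitting into sub-folders, biggest first. What the sub-folders should be called is still up to you.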

We get into an advantage of my personality, which is that I have an unnatural attraction to classification and sorting. I will sit for hours and hours and hours, taking a big pile and adjusting it into dozens of smaller piles, arranged along a hierarchy. I do this all the time, both on my computer and in my office and in a bunch of other locations. (I straighten places I visit, for example.) So for me, this whole approach works because I have so much fun sorting it.

Now, and this is important (and I’ve mentioned this before), what is going on with this data is that it is all STATIC. That is, as opposed to dynamic. This stuff has a specific aspect about it, that is, once I grab “it”, “it” is basically done as far as my interaction with it. I might read it or look at it, but “it” stays the same. This works for movies, documents of a collected nature, music, and so on. I have it and that’s that. So this data is all kept in one place.

In other folders, I have more dynamic stuff, like e-mail I’ve sent, documentary in-process stuff, raw footage, work documents, and so on. This stuff is still being worked with, still being engaged. So it doesn’t make ANY sense to put it on this static location. I might, if it strikes me, put a backup folder on the same drive as the static folder, but that’s simply for redundancy, not because it should be there. I’m basically piggybacking on the infrastructure already there, like leaving my valuables at work because work is unusually protected or secure.

That said, once I’m done working with my dynamic stuff (new job, documentary is complete), it becomes static and is shoved onto the drive as needed.

So this, in a very simple nutshell, is how I approach my data. I do not pretend it would work for everyone, and I’m not overly interested in hearing about improvements to my system. It morphs, adds and deletes ideas. But for now, that’s how the terabyte storage is split up. And I tell you, I can get an idea in my head (where’s that podcast I wanted? Where’s that old website with the cool pictures I saved?) and I can get to it within a very short time, sometimes a few seconds. That’s good enough for me.




4 Comments

  1. sclozza says:

    In terms of personality, we are the same. I just couldn’t cope with having a mess of randomly named files. Whenever I am showing one of my project supervisors at university some work on her laptop, I cannot fathom how she can get anything done with her random-file-dump approach. Similarly, she can’t understand why I would go to so much trouble to make so many subdirectories.

  2. Ryan says:

    First off, thanks very much for listening to our requests….

    I agree that this makes a lot of sense (and it’s the way I store my own meager collections), but I’ve always wanted more.

    I find saving webpages especially frustrating and was wondering how you went about it. Saving pages from the browser seems inefficient, especially when you want to save more than one page from a site, and you end up with rewritten links to a folder you now have to keep static. Do you bother with scripts or Teleport-like programs for your web archiving on a whim? Do you include any metadata of your own in terms of the original URL/WHOIS data, or the links by which you came to find the now archived page?

    That concept of adding personal metadata to a collection is what always got me going and was the root of the question I didn’t fully ask in the last comment. I continue to dump media into categories like you do, but it kills me that there’s such a lack of context and the original file name is often useless in an organizational sense, but incredibly important with respect to proper [read: OCD?] archiving.

    I imagine some kind of card catalog would be impossible on the scale at which you acquire files, but is this something you daydream about too? For instance, what do you do when you’re pulling down photos and they’re named “1.jpg” “2.jpg” etc. Likely you’ve encountered tons of those, so do you have a rule for renaming them? Do you have little ID files that go along with them that list the source, original filename, creator, breadcrumbs of any kind?

    This is the magical, impossibly complicated system I was hoping to hear you describe….

    Do you aspire for that level of organization? Have you worked on a partial solution at any time? Do you have something like this in place, but consider it a trade secret?

  3. Just wonderin' says:

    So when are you gonna put up a torrent with all this stuff?

  4. Lazlo Nibble says:

    I don’t have anywhere near as much stuff as Jason does, but there’s still quite a lot. In another instance of that weird Jason/Lazlo synchronicity I file things in pretty much the same way: vague buckets that get split into less-vague buckets as the vague buckets get too big to manage.

    That’s text. Images are handled similarly but with one caveat: I actually do tag ’em with IPTC metadata (usually just keywords) whenever possible. I wish there was an easy way to do that with plaintext. There are some standards for it (like TEI-C) but the tool stacks are aimed at people with MLS degrees…