Some Random, Unsorted Thoughts on Sorting
A few people, when I recently talked about the new hard drives I bought, asked me about how I sort things, since they have absolute tons of random files as well. Totally understandable, and I’ll happily talk about it, but I have to warn you that I don’t do anything according to any known code or formatting. I do what works for me.
If nothing else, I have to stress the most important rule, which I picked up from, of all places, AEleen Frisch’s book “Essential System Administration”. In her book, she tells an anecdote, which I will now tell to you.
“I learned about the importance of reversibility from a friend who worked in a museum putting together ancient pottery fragments. The museum followed this practice so that if better reconstructive techniques were developed in the future, they could undo the current work and use the better method. As far as possible, I’ve tried to do the same with computers, adding changes gradually and preserving a path by which to back out of them.”
A little white-hot cube of brilliance, that is. And that’s the #1 thing: any methods I provide or come up with, or which you use, must be ones that, down the road, you can completely undo as better technology and techniques become available. Specific to the sorting of files, this means I don’t kill off compilations, delete metadata, undo ISOs, or otherwise split apart that which can’t be immediately unsplit. I also, whenever possible, try to keep things together that were always together. In all cases, it’s because as time goes on, things get better.
I have a FreeBSD file server using Samba to allow my Windows box to interact with the hard drives. This is important because it lets me choose utilities that work in Windows as well as scripts and applications that work in FreeBSD/Linux. So I get whatever does the job best.
You can’t survive, once you go past a few tens of thousands of files, without some sort of doubles checker. I use a freeware program called CloneSpy as well as some Perl scripts that find duplicates. I actually have a version of the Perl script that always deletes the newest files that are doubled; this lets me run it automatically as needed and kill off the redundant newcomers.
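For anyone who wants to try the idea, here is a rough sketch of a doubles checker in Python. It is not my Perl script and not how CloneSpy works inside; it is just one common approach, assuming that two files with the same size and the same SHA-256 hash are the same file. The function names (find_duplicates, newest_copies) are made up for this example. It sorts each group of doubles oldest-first by modification time, so everything after the first entry is a redundant newcomer.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return groups of identical files under root, each sorted oldest-first."""
    # Cheap first pass: only files of equal size can be identical.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a double
        # Expensive second pass: hash only the candidates.
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        for same in by_hash.values():
            if len(same) > 1:
                same.sort(key=lambda p: (os.path.getmtime(p), p))
                groups.append(same)
    return groups


def newest_copies(groups):
    """Return every path except the oldest one in each group of doubles."""
    return [path for group in groups for path in group[1:]]


if __name__ == "__main__":
    # Only lists candidates; nothing is deleted here.
    for path in newest_copies(find_duplicates(".")):
        print(path)
```

Note that this version only prints the newcomers instead of deleting them. That keeps the step reversible: you can look over the list first and decide what actually goes.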
I am always erring on the side of “get it again” if I can’t recall if I downloaded anything. As a result of that, I have a lot of doubled data; I just found recently that I had over 40GB of redundant data collected on 15 hard drives across three machines. That’s a lot of downloading the same stuff. But better that than the sad keening I get from people who can’t believe that yyysite.com has gone under and nobody kept a copy. I keep a lot of copies.
So first, I split stuff into generic massive folders. In my case, it’s IMAGES, MOVIES, AUDIO, WEBSITES, APPLICATIONS, and DOCUMENTS. I acquire something, and throw it into one of these massive headers. That’s good enough sorting for my needs, on the spur of the moment. At least it’s generally there.
Underneath each one are arbitrary collections. So, for DOCUMENTS, I have sub-folders like MANUALS, MAGAZINES, BOOKS, and so on. Under MOVIES we have sub-headings like MUSIC VIDEOS, MUSICAL EVENTS, PRESENTATIONS, TECHNICAL DEMOS, CAMERA DEMOS. I built each one up when I had a collection of movies that would fill such a directory. As you can see, these are arbitrary. Is something a technical demo or a camera demo? Is it both? I choose one, randomly.
Under DOCUMENTS/BOOKS I will likely have thousands of documents representing books (and textfiles and PDFs and so on). So, if it starts getting big, I add subfolders under THAT like POSTERS, FICTION, TECHNICAL, SCIENCE, HAM RADIO, and so on. Each one gets a bunch of books.
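If you want to stamp out a skeleton like this in one go, a small Python sketch along these lines would do it. The folder names are the ones mentioned above; the TREE layout, the function names, and the idea of scripting it at all are assumptions for the sake of illustration, since I build mine by hand as collections grow.

```python
import os

# Nested dicts: each key is a folder, each value holds its sub-folders.
TREE = {
    "IMAGES": {},
    "MOVIES": {
        "MUSIC VIDEOS": {},
        "MUSICAL EVENTS": {},
        "PRESENTATIONS": {},
        "TECHNICAL DEMOS": {},
        "CAMERA DEMOS": {},
    },
    "AUDIO": {},
    "WEBSITES": {},
    "APPLICATIONS": {},
    "DOCUMENTS": {
        "MANUALS": {},
        "MAGAZINES": {},
        "BOOKS": {
            "POSTERS": {},
            "FICTION": {},
            "TECHNICAL": {},
            "SCIENCE": {},
            "HAM RADIO": {},
        },
    },
}


def iter_paths(tree, prefix=""):
    """Yield every folder in the tree as a relative path, parents first."""
    for name, children in tree.items():
        path = os.path.join(prefix, name)
        yield path
        yield from iter_paths(children, path)


def build(root, tree=TREE):
    """Create the folder skeleton under root, leaving existing folders alone."""
    for rel in iter_paths(tree):
        os.makedirs(os.path.join(root, rel), exist_ok=True)
```

Because exist_ok=True leaves existing folders untouched, running build() again after adding a new heading to TREE only creates what is missing.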
Now, you would likely split things up differently, and we would probably disagree on what goes into TECHNICAL and what goes into SCIENCE. And indeed, sometimes I will yank something out of one folder and put it elsewhere.
But what I’m doing in all this is reducing the size of any given directory. Instead of having to stare, dumbly, at a multi-thousand-file data dump that I can barely get through the “A”s without glazing over, I have a few trees I can browse in.
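One way to spot the directories that are due for a split is to count the files sitting directly in each one. The sketch below is a guess at how you might do that in Python; the function name crowded_directories and the default limit of 1000 files are arbitrary choices for this example, not a rule I follow.

```python
import os


def crowded_directories(root, limit=1000):
    """Return (directory, file_count) pairs with more than limit files.

    Only files directly inside a directory are counted, not files in its
    sub-folders. The most crowded directory comes first.
    """
    crowded = []
    for dirpath, _dirs, files in os.walk(root):
        if len(files) > limit:
            crowded.append((dirpath, len(files)))
    return sorted(crowded, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, count in crowded_directories("."):
        print(count, path)
```

Anything it lists is a candidate for a new layer of sub-folders.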
We get into an advantage of my personality, which is that I have an unnatural attraction to classification and sorting. I will sit for hours and hours and hours, taking a big pile and adjusting it into dozens of smaller piles, arranged along a hierarchy. I do this all the time, both on my computer and in my office and in a bunch of other locations. (I straighten places I visit, for example.) So for me, this whole approach works because I have so much fun sorting it.
Now, and this is important (and I’ve mentioned this before), what is going on with this data is that it is all STATIC. That is, as opposed to dynamic. This stuff has a specific aspect about it, that is, once I grab “it”, “it” is basically done as far as my interaction with it. I might read it or look at it, but “it” stays the same. This works for movies, documents of a collected nature, music, and so on. I have it and that’s that. So this data is all kept in one place.
In other folders, I have more dynamic stuff, like e-mail I’ve sent, documentary in-process stuff, raw footage, work documents, and so on. This stuff is still being worked with, still being engaged. So it doesn’t make ANY sense to put it on this static location. I might, if it strikes me, put a backup folder on the same drive as the static folder, but that’s simply for redundancy, not because it should be there. I’m basically piggybacking on the infrastructure already there, like leaving my valuables at work because work is unusually protected or secure.
That said, once my dynamic stuff is finished (new job, documentary is complete), it becomes static and is shoved onto the drive as needed.
So this, in a very simple nutshell, is how I approach my data. I do not pretend it would work for everyone, and I’m not overly interested in hearing about improvements to my system. It morphs, adds and deletes ideas. But for now, that’s how the terabyte storage is split up. And I tell you, I can get an idea in my head (where’s that podcast I wanted? Where’s that old website with the cool pictures I saved?) and I can get to it within a very short time, sometimes a few seconds. That’s good enough for me.
