ASCII by Jason Scott

Jason Scott's Weblog

Now That’s What I Call Script-Assisted-Classified Pattern Recognized Music

Merry Christmas; here is over 500 days (12,000 hours) of music on the Internet Archive.

Go choose something to listen to while reading the rest of this. I suggest either something chill or perhaps this truly unique and distinct ambient recording.


Let’s be clear. I didn’t upload this music, I certainly didn’t create it, and actually I personally didn’t classify it. Still, 500 days of music is not to be ignored. I wanted to talk a little bit about how it all ended up being put together in the last 7 days.

One of the nice things about working for a company that stores web history is that I can use it to do archaeology against the company itself. Doing so, I find that the Internet Archive started soliciting “the people” to begin uploading items en masse around 2003. This is before YouTube, and before a lot of other services out there.

I spent some time tracking dates of uploads, and you can see various groups of people gathering interest in the Archive as a file destination in these early 00’s, but a relatively limited set all around.

Part of this is that uploading to the Archive was a somewhat non-intuitive effort; as people figured it all out, they started using it, but a lot of other people didn’t. Meanwhile, YouTube and other also-rans came into being and picked up a lot of the “I just want to put stuff up” crowd.

By 2008, things start to take off for Internet Archive uploads. By 2010, things take off so much that 2008 looks like nothing. And now it’s dozens or hundreds of multimedia uploads a day through all the Archive’s open collections, not counting others who work with specific collections they’ve been given administration of.

In the case of the general uploads collection of audio, which I’m focusing on in this entry, the number of items is now at over two million.

This is not a sorted, curated, or really majorly analyzed collection, of course. It’s whatever the Internet thought should be somewhere. And what ideas they have!

Quality is variant. Finding things is variant, although the addition of new search facets and previews has made that better over the years.

I decided to do a little experiment: slight machine-assisted “find some stuff” sorting. Let it loose on 2 million items in the hopper, see what happens. The script was called Cratedigger.

Previously, I did an experiment with keywording on texts at the Archive – the result was “bored intern” level, which was definitely better than nothing, and in some cases, that bored intern could slam through a 400-page book and determine a useful word cloud in less than a couple of seconds. Many collections of items I uploaded have these word clouds now.

It’s a little different with music. I went about it this way with a single question:

  • Hey, uploader – could you be bothered to upload a reference image of some sort as well as your music files? Welcome to Cratediggers.

Cratediggers is not an end-level collection – it’s a holding bay to do additional work, but it does show that the vast majority of people would upload a sound file and almost nothing else. (I’ve not analyzed quality of description metadata in the no-image items – that’ll happen next.) The resulting ratio of items-in-uploads to items-for-cratediggers is pretty striking – fewer than 150,000 items out of the two million passed this rough sort.
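For the curious, the rough sort boils down to something like the sketch below. This is not the actual Cratedigger script, just a minimal illustration of the idea: an item passes only if it has at least one sound file and at least one image alongside it. The extension lists, the function names, and the sample items are all assumptions made up for this example, and it works on plain lists of file names rather than talking to the Archive.

```python
from pathlib import PurePosixPath

# Assumption: which extensions count as "a reference image" and "a sound file".
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}
AUDIO_EXTENSIONS = {".mp3", ".ogg", ".flac", ".wav"}


def extensions(filenames):
    """Return the set of lowercased file extensions found in an item."""
    return {PurePosixPath(name).suffix.lower() for name in filenames}


def passes_rough_sort(filenames):
    """True if the item has at least one audio file and at least one image."""
    found = extensions(filenames)
    return bool(found & AUDIO_EXTENSIONS) and bool(found & IMAGE_EXTENSIONS)


# Hypothetical items: identifier -> list of file names in that item.
items = {
    "demo-album": ["01-intro.mp3", "02-outro.mp3", "cover.jpg"],
    "lone-track": ["track.mp3"],
    "photo-only": ["scan.PNG"],
}

# Keep only the identifiers that would go into the holding collection.
selected = [name for name, files in items.items() if passes_rough_sort(files)]
print(selected)
```

Only the item with both music and a cover survives, which is the whole trick: the question is cheap to ask and it throws away most of the pile.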

The Bored Audio Intern worked pretty OK. With just a few parameters sent its way, the Cratediggers collection ended up building on itself by the thousands without me personally investing time. I could then focus on more specific secondary scripts that do things in an even lazier manner, ensuring laziness all the way down.

The next script allowed me to point to an item in the Cratediggers collection and say “put everything by this uploader that is in Cratediggers into this other collection”, with “this other collection” being spoken word, sermons, or music. A person who uploaded music that got into Cratediggers generally uploaded other music. (Same with sermons and spoken word.) That held up well enough that the helper scripts did amazingly well as I ran them. I didn’t have to do much beyond that.
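The “everything by this uploader” step can be sketched in the same spirit. Again, this is an illustration under stated assumptions and not the real helper script: the record layout (an `uploader` string and a `collections` list per item), the function name `move_uploader_items`, and the sample data are all hypothetical, and nothing here writes to the Archive.

```python
def move_uploader_items(items, seed_identifier, holding, target):
    """Move every item by the seed item's uploader from `holding` to `target`.

    `items` maps identifier -> {"uploader": str, "collections": list of str}.
    Returns the identifiers that were moved, in dictionary order.
    """
    uploader = items[seed_identifier]["uploader"]
    moved = []
    for identifier, record in items.items():
        # Only touch items by the same uploader that sit in the holding bay.
        if record["uploader"] != uploader or holding not in record["collections"]:
            continue
        record["collections"].remove(holding)
        if target not in record["collections"]:
            record["collections"].append(target)
        moved.append(identifier)
    return moved


# Hypothetical records; uploaders and collection names are made up.
items = {
    "album-one": {"uploader": "alice@example.com", "collections": ["cratediggers"]},
    "album-two": {"uploader": "alice@example.com", "collections": ["cratediggers"]},
    "sunday-talk": {"uploader": "bob@example.com", "collections": ["cratediggers"]},
    "old-upload": {"uploader": "alice@example.com", "collections": ["general-audio"]},
}

moved = move_uploader_items(items, "album-one", "cratediggers", "music")
print(moved)
```

Pointing at one item drags along everything else that uploader got into the holding bay, and leaves alone both other uploaders and items that never passed the first sort.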

As of this writing, the music collection contains over 400 solid days of music. The items are absolutely genre-busting, ranging from industrial and noise all the way through beautiful jazz and a cappella. There are one-of-a-kind rock and acoustic albums, and simple field recordings of live events.

And, ah yes, the naming of this collection… Some time ago I took the miscellaneous texts and writings and put them into a collection called Folkscanomy.

After trying to come up with the same sort of name for sound, I discovered a very funny thing: you can’t really attach any two words involving sound together without some company or manufacturer already using the name. Trust me.

And that’s how we ended up with Folksoundomy.

What a word!

The main reason for this is I wanted something unique to call this collection of uploads that didn’t imply they were anything other than contributed materials to the Archive. It’s a made-up word, a zesty little portmanteau that is nowhere else on the Internet (yet). And it leaves you open for whatever is in them.

So, about the 500 days of music:

Absolutely, one could point to YouTube and the mass of material being uploaded there as being superior to any collection sitting on the Archive. But the problem is that they have their own robot army, which is a tad more evil than my robotic bored interns; you have content scanners that have both false positives and strange decorations, you have ads being put on the front of things randomly, and you have a whole family of other small stabs and jabs getting in the way of an enjoyable experience every single time. Internet Archive does not log you, require a login, or demand other handfuls of your soul. So, for cases where people are uploading their own works and simply want them to be shared, I think the choice is superior.

This is all, like I said, an experiment – I’m sure the sorting has put some things in the wrong place, or we’re missing out on some real jewels whose uploaders didn’t think to add a “cover” or icon to the files. But as a first swipe, I moved 80,000 items around in 3 days, and that’s more than any single person can normally do.

There’s a lot more work to do, but that music collection is absolutely filled with some beautiful things, as is the whole general Folksoundomy collection. Again, none of this is me, or some talent I have – this is the work of tens of thousands of people, contributing to the Archive to make it what it is, and while I think the Wayback Machine has the lion’s share of the Archive’s world image (and deserves it), there’s years of content and creation waiting to be discovered for anyone, or any robot, that takes a look.

Categorised as: Internet Archive

One Comment

  1. Tedd Tiger says:

    As always, Jason, your energy and drive is inspiring. You look amazing, also. If I could ask one thing, do you have an update for those of us backers to your films Kickstarter? I’d love to know how you’ve been getting on, if anything. 🙂
