ASCII by Jason Scott

Jason Scott's Weblog

All The Podcasts: An Update

Well, it’s been a little while since I announced that I was setting off to archive all the podcasts. I figured I’d take some time to let you know how all that’s been going.

Obviously, my documentary and then my regular duties have been taking priority over the project, so initially the goal was just to get a sense of how big the whole thing really is, what would be involved, and what effect that would have on my resources. To do this, I started using a program called doppler, which is a multi-threaded podcast downloader that is also wired into the ipodder directory, enabling me to click and say “gimme all”, and then “keep all forever”, and so on. I figured if I opened the floodgates, I’d get a good chunk of the stuff.

Now, as it turned out, Doppler wasn’t up to this task. I hardly blame the developers for this situation; how many of their users would be expected to be downloading 1,200 feeds? So I started to run into problems of corruption, doubled feeds, and other annoyances. I ended up having to write a perl program that went through the doppler configuration files and cleaned them. It was picking stuff up, but only sort of.

After a couple months of jamming doppler on full, and keeping in mind I was just doing this in the background, low-priority compared to my documentary work, I had about 1200 podcasts and about 11,000 files, equalling 150 gigabytes of data. Large, but not stunning. If you play the game of saying that mp3s are about a megabyte a minute (which is very, very rough and doesn’t take a lot of factors into account), I have downloaded over 104 solid days of podcasts.
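
For anyone who wants to check that figure, here is a minimal sketch of the arithmetic. It assumes 1 gigabyte is counted as 1,000 megabytes and takes the megabyte-a-minute rule at face value; the function name is made up for illustration.

```python
# Rough rule of thumb from the post: MP3 audio runs about 1 megabyte per minute.
MB_PER_MINUTE = 1.0


def listening_days(total_gigabytes, mb_per_gb=1000):
    """Estimate continuous playback time, in days, for a collection size."""
    minutes = total_gigabytes * mb_per_gb / MB_PER_MINUTE
    return minutes / 60 / 24


if __name__ == "__main__":
    # 150 GB -> 150,000 minutes -> 2,500 hours -> a little over 104 days
    print(f"{listening_days(150):.1f} days")
```

Counting a gigabyte as 1,024 megabytes instead pushes the estimate closer to 107 days, which is well within the roughness of the rule itself.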

So let’s step aside a moment and go over where I am in the collecting process.

I had at this point collected a large amount of audio files, but hardly a comprehensive collection. Some of the podcast sets are incomplete, just representing what was in recent XML pages, and not going back far enough. Some of them are doubled. Others are weird, broken files, not really podcasts; some people didn’t implement them correctly and they don’t have any audio files at all, just a JPEG file and a PDF file and nothing else. In other words, it’s definitely big, but it’s also a big mess.

This is that critical juncture I mentioned in the previous entry. My collection is neither convenient nor complete; neither well-sorted nor easily browsable. It is, basically, a huge crap-pile of audio files.

Obviously, I will press on. But here, after taking stock of my initial collecting, I have begun the process of re-doing things right.

First of all, I’ve had to stop using Doppler. It was fun and easy to use and a nice client, but I was using it on a Windows box to access a samba share presented on my freebsd file machine (which has a couple terabytes of disk space, in case you’re wondering where I’m putting all this). This was tying up two machines for no good reason, and also was adding a ton of network traffic that didn’t need to be there (pulling it to the windows box and then throwing it all back at the freebsd box).

So now I switched to bashpodder. Actually, that’s not really true. I switched to taking bashpodder and IMMEDIATELY dropping its transmission and rewriting it basically from scratch. So I use the “secret sauce” line that yanks all the mp3s out of an XML file, but the rest of it all is using my directory structure and approach, and is using whatever URL is stored in the directory with the mp3s as the place to check the feeds… and it’s also now grabbing a copy of the XML file and archiving it as well.

The reason behind this last move is that I’m finding that a lot of the XML feeds have all sorts of important information in them, like descriptions, text, ideas and explanatory paragraphs that are not anywhere else. I have no way, right now, to match them up with the file, but maybe someday. Or someone else will do it. Either way, it will enable me to keep the whole collection somewhat sane. This is a case where I am Collecting for the Future, not for my own information/education. I may never read these archived XML files again; but they’ll at least be somewhat near the MP3 files they reference, so someone studying that particular podcast can see information about the MP3s that would otherwise be lost.
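
To give a feel for what "yanking the mp3s out of an XML file" involves, here is a small sketch. It is not the bashpodder line and not the rewritten script described above; it assumes a plain RSS 2.0 feed where each episode is an enclosure element with a url attribute, and the sample feed and function name are invented for illustration.

```python
import xml.etree.ElementTree as ET


def mp3_urls(feed_xml):
    """Return the MP3 enclosure URLs found in an RSS 2.0 feed, in feed order."""
    root = ET.fromstring(feed_xml)
    urls = []
    for enclosure in root.iter("enclosure"):
        url = enclosure.get("url", "")
        # Ignore any query string when checking the file extension.
        if url.split("?")[0].lower().endswith(".mp3"):
            urls.append(url)
    return urls


# Made-up feed: two audio episodes and one non-audio enclosure.
SAMPLE = """<rss version="2.0"><channel>
  <title>Example Show</title>
  <item><title>Episode 2</title>
    <enclosure url="http://example.com/ep2.mp3" type="audio/mpeg" length="0"/></item>
  <item><title>Cover art</title>
    <enclosure url="http://example.com/cover.jpg" type="image/jpeg" length="0"/></item>
  <item><title>Episode 1</title>
    <enclosure url="http://example.com/ep1.MP3?x=1" type="audio/mpeg" length="0"/></item>
</channel></rss>"""

if __name__ == "__main__":
    print(mp3_urls(SAMPLE))
```

Because the function works on the raw feed text, the same string can be written to disk next to the audio files, which is all that archiving the XML needs. Feeds with broken markup will raise a parse error here, so a real collector would have to decide what to do with those.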

The new directory structure allows another situation which I knew was coming but wasn’t handling yet: splitting the collection among more than one drive. Right now, the whole collection is mirrored, but in point of fact we were heading happily towards the 240gb limit of the hard drive, which meant we were going to have to use multiple drives, and I am currently avoiding RAID. So this new structure means that the information on how to get the mp3s is located within the directory with the mp3s themselves, so multiple scripts can be running. I’ve written the scripts so you can say things like “check and download all the feeds starting with the letter ‘A'” so the scripts don’t bother each other.
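
The "one directory per podcast, with its feed URL stored inside" layout can be sketched like this. The file name feed.url and the function name are assumptions made for the example; the post does not say what the real scripts call anything.

```python
import tempfile
from pathlib import Path

URL_FILE = "feed.url"  # hypothetical file name; the post does not name one


def feeds_starting_with(root, letter):
    """Return (directory name, feed URL) pairs for every podcast directory
    under root whose name starts with the given letter (case-insensitive)."""
    matches = []
    for entry in sorted(Path(root).iterdir()):
        url_file = entry / URL_FILE
        if (entry.is_dir()
                and entry.name.lower().startswith(letter.lower())
                and url_file.is_file()):
            matches.append((entry.name, url_file.read_text().strip()))
    return matches


if __name__ == "__main__":
    # Build a throwaway collection with made-up feeds to show the selection.
    with tempfile.TemporaryDirectory() as root:
        for name in ("Alpha Show", "another cast", "Beta Hour"):
            feed_dir = Path(root) / name
            feed_dir.mkdir()
            (feed_dir / URL_FILE).write_text(f"http://example.com/{name[0]}.xml\n")
        print(feeds_starting_with(root, "a"))
```

Since each directory carries everything needed to refresh it, one process can work through the A's while another handles the B's, and a directory can be moved to a different drive without updating any central list.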

So now we have everything a little cleaner. But! In cleaning up the situation, I found rapidly that I was missing a ton of podcasts. As might be expected, my half-hearted “download everything” selection was working, generally, but missing a lot of feeds with broken links or weird syntax, or otherwise not making sense to Doppler. So, with my new scripts in place, I’ve started downloading. And downloading… and downloading.

As of right now, I have downloaded 20 gigabytes of podcasts on this day alone. And I’m nowhere near done. This is because I’ve started the second phase in this type of collecting; hand-checking my feeds to be accurate, and filling in gaps where needed. So now I don’t just get the last 5 mp3s of a feed; I go back and get ALL of them.

To help me, I’ve written a script that allows me to add a podcast to the directory. This script will check to see if the feed is already in use anywhere else by any other feed (meaning it’s a double if we were to add it), and then, if it isn’t, creates the directory for the feed, puts the URL in there for later script use, and then issues an immediate download of the feed. All of the things it does can be called at the command line, so I can have a script that gets a bunch of “new” podcasts, and adds the unique new ones AND downloads them immediately. Good stuff.
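
The duplicate check and directory setup might look roughly like the sketch below. It is an illustration under assumptions, not the actual script: the feed.url file name and both function names are invented, and the immediate download step is left out.

```python
import tempfile
from pathlib import Path

URL_FILE = "feed.url"  # hypothetical file name; the post does not name one


def known_urls(root):
    """Collect every feed URL already stored under the collection root."""
    return {f.read_text().strip() for f in Path(root).glob(f"*/{URL_FILE}")}


def add_podcast(root, name, url):
    """Create a directory for a new feed unless its URL is already tracked.

    Returns True when the feed was added and False when it is a double.
    """
    url = url.strip()
    if url in known_urls(root):
        return False
    feed_dir = Path(root) / name
    feed_dir.mkdir(parents=True)  # raises FileExistsError on a name clash
    (feed_dir / URL_FILE).write_text(url + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(add_podcast(root, "Example Show", "http://example.com/feed.xml"))
        # Same URL under a different name is rejected as a double.
        print(add_podcast(root, "Example Show Again", "http://example.com/feed.xml"))
```

This only catches exact URL matches. Two addresses that point at the same feed (with and without a trailing slash, say) would still slip through, so some hand-checking remains.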

Let’s step aside even further as I talk a little bit more about my opinion about podcasts.

As said in the previous entry, I do not consider podcasts particularly new or revolutionary. That said, I am happy that people are making such an effort to record themselves and then slave away to make those recordings available to the largest number of people they can get the attention of. To an anthropologist, it’s like this huge self-service oral history project. Maybe they’re talking about tech issues or news items of the day or other “disposable” subjects, but not always, and there’s a lot more information in these than you might expect.

One of my favorite ironic works is Maciej Cegłowski’s “Audioblogger Manifesto”, which was created to attempt to point out the folly of audio weblogs, and how the bandwidth of information was nothing compared to text blogs. Of course, Maciej didn’t intend this to be ironic, but guess what, it ended up being just that. (You can read a transcript of his speech at his site as well).

While he goes on about how audio files are a step back, how they remove the advantages of hypertext and force your audience down a single path without the ability to skip around or add additional information, you’re hearing music play in the background. In other words, you are getting two simultaneous streams of information in your ears. You can hear Maciej’s voice dripping with either sarcasm or stilted emotion, depending on your point of view. He talks about all the disadvantages of spoken word without pointing out all the advantages, like how his pauses and expression come across in ways they wouldn’t with the written word; this is why, for example, many authors going back many years have gone on speaking tours, reading from their own works; you never saw Mark Twain show up to one of his many engagements, point to a copy of Huck Finn, and say “Just shut up and read.”

He is, essentially, a man sitting on the porch watching a new airplane fly over, going “nothin’ wrong with walkin’. Don’t like flyin'”. Or a guy seeing new antibiotics going “Make the kid walk it off.” In other words, even now, 8 months after he wrote his words, Google and other sites are adding text search of audio files, and people are taking radio shows (which have been around for decades by the way, and aren’t mentioned in Maciej’s timeline) and putting them in podcasts so you choose what you want to hear when you want to hear it. We are seeing the things he claimed were the disadvantages of audioblogging turned around into advantages. His own joke is on himself.

Speaking of misguidedness, for all my liking the audio form as a means of expression, I contend quite heavily that there are a lot fewer podcasts than people are trying to puff up in the growing Podcast Industry. This is because, for example, I don’t count shows that basically play pre-recorded music in a set order. That knocks out a good amount of them. I also don’t count shows that are basically re-run professional productions, like FM/AM talk radio shows. I am COLLECTING these, make no mistake, but you’re basically photocopying already-existing material in both cases and then making them available on demand. That’s a little different than sitting down for an hour or half-hour regularly and talking about a set of subjects.

There’s also what I call the 4 Month Death Wall. This wall exists, basically, in all projects, not just podcasts. I’ve seen it in Zines, BBSes, high school bands, relationships, gardening projects and anything else that requires constant or semi-constant attention. You get to a point where you’ve been doing something for a while, it takes some amount of your time, and then it bumps up against the rest of your life. You make a choice at that juncture; and the vast majority of people choose to shitcan it.

After four months, you will have been doing a lot of work on your show. You will have been spending a lot of time before it preparing, and a lot of time afterwards cleaning up. And for a lot of people, the thrill has gone. At that point, they tend to hang up their mike, sit back, and get their life back. This contributes to what they now call “churn”.

I don’t see how the churn in podcasting will be any different. And that solves two problems for me: having a constantly growing collection grow exponentially, and keeping track of a specific site. I suspect as time goes on, my collection will have more and more dead feeds, missing any new updates until finally they 404 out and I have, trapped in amber, a little bit of online history.

One final thing before I finish this “small” update. This is easily the largest amount of data I’ve ever tried bringing in at once. It’s really stretching my mental muscles trying to keep track of everything, handle issues of bandwidth and sort information via scripts. It is, if nothing else, an incredibly beneficial exercise. I feel like, doing this, I could go on to collect most anything. So there’s always that.



  1. jj says:

    mmm… I find this all incredibly fascinating… i’ve managed to figure out what podcasts are, thanks to the context clues in your blog… and i am now sufficiently interested to wish to create one.

    i admire your itch to collect, and your skill in doing so.


  2. Jason Scott says:

    It’s less skill than bull-headed energy and laziness. The more that scripts can do, the less that I have to do, and the more likely I will continue the project.

    I just upgraded to a second 250gb drive, which is now filling.