ASCII by Jason Scott

Jason Scott's Weblog

Podcast Work —

I figure it’s worth it to describe each step along the way of this Podcast collecting thing, in case people are actually going to use this as a template for their own collecting, or for some other weird purpose like trying to discover my “secret sauce” for all this.

When I’m not emotionally invested in a collection and don’t get a particular joy out of mulling over each new item (like I do with my “last straw” collection), then I want to script absolutely as much of it as I possibly can, lest I lose personal interest and the collection languishes as a result.

In the case of the podcasts, we’re in luck, since, as mentioned before, the items in the collection are doing their damndest to be collected; so they end up providing these nice little RSS feeds for me.

So the structure I currently have is:

NAME OF PODCAST/.url          (The URL of the RSS Feed)
NAME OF PODCAST/.xml/         (Directory with grabbed RSS Feeds)
NAME OF PODCAST/filename.mp3
NAME OF PODCAST/filename.mp3
NAME OF PODCAST/filename.mp3
[...]

The name of the directory functions as the name of the feed. If this ends up being insufficient, I’ll put a .fullname file inside that will override the name for the purpose of reports.
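
If I ever do need that .fullname trick, the report end of it is only a few lines. A sketch of the idea (not something I actually run yet):

#!/bin/sh
# Sketch: when reporting names, prefer a .fullname override if one exists,
# otherwise fall back to the directory name itself.
cd /podcasts/podcasts
for each in *
do
  if [ -f "$each/.fullname" ]
  then
    cat "$each/.fullname"
  elif [ -d "$each" ]
  then
    echo "$each"
  fi
done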

Here’s the full source of the PODLIST script, which creates a file with a list of all the podcast directories and their feed names:

#!/bin/sh
#
# PODLIST: Go and build up the feed list from the podcast directory.
cd /podcasts/podcasts
# Start fresh with an empty list.
: > PODCASTLIST
for each in *
do
  # Skip anything without a .url file (like this list itself).
  if [ -f "$each/.url" ]
  then
    FOF=`cat "$each/.url"`
    echo "$FOF ($each)" >> PODCASTLIST
  fi
done

…nothing too big. So you end up with a file called PODCASTLIST which lists all the podcasts. I use it as a quick reference when I add new feeds; if the new feed matches anything in that list, I already have it.
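
That “do I already have it” check is nothing fancier than a grep against the list; the feed URL here is made up:

grep -i "http://example.com/somefeed.xml" /podcasts/podcasts/PODCASTLIST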

Next, I need a script to download a podcast, checking for the .url file first. This is it, the core of the downloading:

#!/usr/local/bin/bash
#
# PODSUCKER: Jason Scott's Script for Pulling Podcasts
#
# Based off of BASHPODDER and its Secret Sauce
# Originally by Linc 10/1/2004
# Find the original script at
# http://linc.homeunix.org:8080/scripts/bashpodder
#
# Modified by James Rayner, iphitus@gmail.com
# www.iphitus.tk

PODCASTDIR=/podcasts/podcasts
cd "$PODCASTDIR"
# Use the current directory as the location of all the directories.
for MUSICBIN in "${1}"*
do
  cd "$PODCASTDIR"
  echo "[%] $MUSICBIN"
  # Is there a .url file? If not, then this isn't really a podcast.
  if [ -f "$MUSICBIN/.url" ]
  then
    cd "$MUSICBIN"
    # Grab a copy of the feed to store locally, stamped with the date.
    if [ ! -d ".xml" ]
    then
      mkdir .xml
    fi
    podcast="`cat .url`"
    timedate=`date '+%Y%m%d%H%M%S'`
    wget --output-document=".xml/$timedate" "$podcast"
    # Pull the feed a second time and strip out every url="..." reference.
    file=$(wget --tries=5 --append-output=.retrieval.log -q \
      "$podcast" -O - | tr '\r' '\n' | tr \' \" | sed -n \
      's/.*url="\([^"]*\)".*/\1/p')

    # Download each referenced file; -nc skips anything already here.
    for url in $file
    do
      echo "    $url"
      wget -nc "$url"
    done
  fi
done

Scary if you’ve never seen Bourne Shell script in action; stupid if you use Perl, weird if you use Bourne. Basically, I go into a directory, grab the URL, pull a copy to store locally, then re-pull it and pull out all the mp3 files referenced and download them. WGET does the cool thing of making sure I don’t re-download a file I already have.
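
If you want to see what that tr-and-sed pipeline actually does, feed it a fake enclosure line like the ones these feeds carry (the URL here is made up):

echo '<enclosure url="http://example.com/show/episode1.mp3" length="123" type="audio/mpeg"/>' \
  | tr '\r' '\n' | tr \' \" | sed -n 's/.*url="\([^"]*\)".*/\1/p'
# prints: http://example.com/show/episode1.mp3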

Why do I pull the stuff TWICE? Well, I could probably get the information from the pulled file, but regardless, I keep a copy of the file in the .xml directory so that, down the line, any information stored in the .xml file that isn’t directly related to mp3 filenames is saved for history. So I’m trying to think ahead here.
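
If the double-pull ever starts to feel wasteful, the same extraction works against the saved copy. A sketch, assuming you point it at the newest file in .xml (the datestamped names sort in order, so tail -1 is the most recent):

# Sketch: pull the mp3 URLs out of the most recently saved feed copy
# instead of hitting the server a second time.
cd "/podcasts/podcasts/NAME OF PODCAST"
latest=`ls .xml | tail -1`
tr '\r' '\n' < ".xml/$latest" | tr \' \" | \
  sed -n 's/.*url="\([^"]*\)".*/\1/p'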

Then we need a script to add new unique feeds. Here we go:

#!/bin/sh
#
# ADDPODCAST - Add a Unique Podcast, if it exists
#              addpodcast <directory name> <RSS feed URL>

if [ "$1" ]
then
  POG="$1"
  POF="$2"
  echo "Checking for $POF.."
else
  echo "What is the RSS feed URL for this podcast:"
  read POF
fi

if [ ! "`grep -i $POF /podcasts/podcasts/PODCASTLIST`" ]
then
  if [ "$2" ]
  then
    echo "Calling it $1.."
  else
    echo "It's a newbie!"
    echo "What is the name of this directory?"
    read POG
  fi

  mkdir "/podcasts/podcasts/$POG"
  echo "$POF" > "/podcasts/podcasts/$POG/.url"
  echo "$POF" >> "/podcasts/podcasts/PODCASTLIST"
  echo "Added. Now sucking it down."
  /podcasts/podsucker.sh "$POG"
else
  echo "Already got it, chief:"
  grep -i "$POF" /podcasts/podcasts/PODCASTLIST
fi

A lot is going on here. First of all, it looks very weird because it can be called with arguments at the command line, or, if no command line options are given, it asks you to supply them.

If the RSS feed you give it is already in the list, it bails out and tells you so. Otherwise, it creates the directory, gives it a .url file with the RSS feed’s URL, and then calls the podcast-grabbing script (Podsucker).
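
So from the command line, adding one by hand looks like this (the name and URL are made up):

./addpodcast "Example Show" "http://example.com/exampleshow/rss.xml"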

By now a couple of things should be clear (besides that I write really sloppy code): you create a bunch of little tiny scripts that each do one thing well, and then have each script call the others, so you can concentrate on the little tasks instead of trying to make a monolithic script from hell. And you try to handle a bunch of contingencies where things error out.

I attended a talk about programming held by Tom Jennings of FidoNet (I’d interviewed him a year earlier, but also attended this talk), and one of his big statements was that 95 percent of a programmer’s work is handling errors. Sure, there’s design, workflow, and structure to work on, but the rest of the work is trying to handle the stupidity of man or the unexpected contingencies. He’s quite right.

So, we have:

– The thing that will download a podcast’s files
– The thing that will list all the podcasts we have
– The thing that will add a new podcast directory and call the above

And so if I let this stuff run, it will do a very good job, without my efforts, of downloading all the podcasts I find.
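
“Let this stuff run” just means something kicks PodSucker off on a schedule; run with no argument, its "${1}"* glob matches every directory, so it sweeps the whole collection. I’m assuming plain old cron here, and the 4 a.m. hour is arbitrary:

# Crontab sketch: pull every feed once a night (the hour is arbitrary).
0 4 * * * /podcasts/podsucker.sh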

So, where do I get these podcasts from? Well, I have to scrape other sites. “Scrape” isn’t my favorite verb to describe the process, but it’s what I do. Here we get into insane magic mojo.

I give unto you, now, my one questionable script. Given a number, it will go to the site PODCASTDIRECTORY.COM and pull down the URL of the XML feed for that podcast, as well as its official name, and then run the addpodcast script.

#!/bin/sh
#
# GRAB from the podcastdirectory.com site.

# Pull down the directory page for this podcast's ID number.
wget -O beets "http://www.podcastdirectory.com/podcasts/index.php?iid=$1"
# Line 240 of that page has what we want: FLAM ends up as the podcast's
# name, FLIM as its RSS feed URL.
FLAM=`head -240 beets | tail -1 | sed 's/.*<p><b>//g' | sed 's/<\/b><p>.*//g'`
FLIM=`head -240 beets | tail -1 | sed 's/.*<p><a href=//g' | sed 's/><span class=.*//g' | sed "s/'//g"`
rm beets
./addpodcast "$FLAM" "$FLIM"

Crazy, huh. A small number of lines, but thanks to the scripts I previously worked on, it will go to podcastdirectory, get the name and RSS feed URL, and then add it to my collection if it doesn’t exist. I only pull a few K (and no images) from podcastdirectory, so I don’t feel overly bad. And it’s for the good of history, anyway.
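
Since the script takes the directory’s ID number as its only argument, sweeping a range of them is one more loop. A sketch (the range is made up, I’m calling the script grab here, and the sleep is just me being polite to the site):

#!/bin/sh
# Sketch: hand a range of podcastdirectory.com ID numbers to the GRAB script.
i=1
while [ $i -le 100 ]
do
  ./grab $i
  sleep 5              # don't hammer the site between pulls
  i=`expr $i + 1`
done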

So there we go, a little insight. Next time, I’ll go into some of the other issues of this process.

In case you’re keeping track at home, I am currently pulling down 1,299 feeds, and currently have 17,059 files to show for it. 202 gigabytes of podcasts. And I’m just getting warmed up.




2 Comments

  1. jon sofield says:

    Interested in your ideas around podcastlists. We are embarking on significant R&D investments in this space and I was wondering whether you would be interested in talking with us.

  2. c01eman says:

    i take this script, modify it for my own needs… and thank the author for it… i wouldn’t have a clue where to start without it