Nice Try, Archiver-Hater — February 12, 2008

I find it easier to generally just grab anything that catches my attention for more than a few seconds. Copy it, download it, PDF it, whatever makes the most sense, shove it into a directory with a description of it (if any) and then forget about it. It takes me 10 seconds, and I think of all the times we wish somebody 5, 10, 15 years ago had done this, and how we’d all be a little happier for it. So I do this all the time.

One of the ways I do this is to use a program called wget, which can be poorly summarized as “a web browser that does a single thing”. In fact, it can do many things, but what it basically does is allow you to interact with web-based and net-based assets such that you can say “go get this”. So if there’s a URL to an image, you can wget it. If there’s a site you want to download, you can say “wget the site and everything it has on it”. It can even let you go to password-protected stuff and grab a copy, update just what’s new, and so on. It’s very nice. I use it all the time.

Here’s my incantation:

wget -r -l 0 -np -nc http://www.somewebsite.com

Every once in a while, though, it doesn’t work. I go to download something and I get a big fat error. Like here:

wget http://stevenpoole.net/th/TriggerHappy.pdf
--23:57:37--  http://stevenpoole.net/th/TriggerHappy.pdf
=> `TriggerHappy.pdf'
Resolving stevenpoole.net... 64.13.232.191
Connecting to stevenpoole.net|64.13.232.191|:80... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
23:57:37 ERROR 500: Internal Server Error.

Hey! Something broke. I can’t get this file. The file, in this case, is a book about videogames, some academic nib-nob, that the author has released to a creative commons license, in PDF form. It’s perfect for pdf.textfiles.com, so I tried to wget it. And failed.

I am then forced to pull out the big guns, the secret weapons that ensure my continued success in this rough and tumble world of high security:

wget --user-agent=EatDeliciousPoop http://stevenpoole.net/th/TriggerHappy.pdf
--03:01:33--  http://stevenpoole.net/th/TriggerHappy.pdf
=> `TriggerHappy.pdf'
Resolving stevenpoole.net... 64.13.232.191
Connecting to stevenpoole.net|64.13.232.191|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,633,505 (2.5M) [application/pdf]
100%[====================================>] 2,633,505    647.63K/s    ETA 00:00
03:01:38 (585.47 KB/s) - `TriggerHappy.pdf' saved [2633505/2633505]

Wow! That’s amazing! By merely indicating that my “User Agent” (the identifying string a client sends along with each request) was NOT wget (the default), but that old standard “Eat Delicious Poop”, I suddenly downloaded it with no problems.
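For anyone who would rather script this than type it, here is a rough Python equivalent of the `--user-agent` trick, using only the standard library. It is a sketch and not anything wget does internally: the `build_request` helper is a made-up name for illustration, and the request is only constructed here, not sent.

```python
import urllib.request

# URL and User-Agent string taken from the wget session above.
URL = "http://stevenpoole.net/th/TriggerHappy.pdf"


def build_request(url, user_agent):
    """Build (but do not send) a GET request carrying a chosen User-Agent.

    Hypothetical helper for illustration only.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request(URL, "EatDeliciousPoop")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))  # EatDeliciousPoop
```

Passing `req` to `urllib.request.urlopen` would actually perform the download; that step is left out so the sketch runs without touching anyone's server.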

So what’s at play here is there’s a rule in the webserver’s configuration that if the user-agent string is “wget”, return a 500 error and throw that bastard out. If it’s anything else, however, roll out the red carpet and let that esteemed colleague download your precious data.
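To make the shape of that rule concrete, here is a guess at its logic written as a few lines of Python. The real thing lives in the webserver's configuration and I have not seen it, so the exact matching (a case-insensitive check for a string starting with "wget") is an assumption, and `status_for` is a made-up name. Note that wget's default string looks like `Wget/<version>`; the version below is just an example.

```python
def status_for(user_agent):
    """Return the HTTP status a naive User-Agent filter would send.

    Assumption: the rule rejects any User-Agent that starts with
    "wget", ignoring case, and lets everything else through.
    """
    if user_agent.lower().startswith("wget"):
        return 500  # throw that client out
    return 200      # roll out the red carpet


print(status_for("Wget/1.10.2"))       # 500
print(status_for("EatDeliciousPoop"))  # 200
```

Nothing in this check looks at request rate, bandwidth, or how many files were pulled, which is why changing one string defeats it.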

I’m picking on Steven today, but I run into this crap all the time. Bear in mind it’s not checking for number of links, or bandwidth utilization, or any metric that would actually indicate abuse. It’s just looking for the most basic, most surface judgment, profiling really, and then making a snap decision: NO. Oh, I’m sure some people are unaware their servers do this, but a lot actually think they’re helping something. They’re not.

You want to be all super-hacker and automatic-defense systems and shit? Easy enough; bury a link to a textfile in your site, somewhere near the surface. Don’t embed it or have it load automatically. Make it so you have to actively pull that file down. If someone does it, then they’re spidering. Pretty simple. A browser wouldn’t do it and someone like me, targeting a single file, doesn’t get ensnared in your bear trap. Ban that IP for 24 hours. Congratulations, warrior.
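The bear trap can be sketched in a couple dozen lines of Python. This is a minimal illustration under stated assumptions, not a drop-in defense: the trap path, the `SpiderTrap` class, and its `allow` method are all hypothetical names, bans are kept in memory only, and the clock is injectable so the 24-hour expiry can be checked without waiting a day.

```python
import time

TRAP_PATH = "/do-not-fetch.txt"  # hypothetical hidden link target
BAN_SECONDS = 24 * 60 * 60       # the 24-hour ban


class SpiderTrap:
    """Ban any IP that requests the trap file; leave everyone else alone."""

    def __init__(self, clock=time.time):
        self._clock = clock       # injectable for testing
        self._banned_until = {}   # ip -> timestamp when the ban ends

    def allow(self, ip, path):
        """Return True if this request should be served."""
        now = self._clock()
        if self._banned_until.get(ip, 0) > now:
            return False          # still inside an active ban
        if path == TRAP_PATH:
            # Only a spider following every link would ask for this file.
            self._banned_until[ip] = now + BAN_SECONDS
            return False
        return True


trap = SpiderTrap()
print(trap.allow("10.0.0.1", "/th/TriggerHappy.pdf"))  # True
print(trap.allow("10.0.0.2", TRAP_PATH))               # False
print(trap.allow("10.0.0.2", "/th/TriggerHappy.pdf"))  # False
```

Someone fetching a single file never touches the trap path, so they are never banned, whatever their User-Agent says.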

Of course, if you’re offering a book online, or an artifact, or some other item, one would think you’d be happy someone was wgetting it, meaning they were attempting to place it somewhere, instead of just viewing it inline in their browser, ready to switch off to the next animated GIF or site that catches their eye.

Don’t hate archivers. We outlast you.

9 Comments

GWB says:

February 13, 2008 at 9:21 am

Or, more properly, the archivers might not outlast anyone–but the archives will.

Jason Scott says:

February 13, 2008 at 9:26 am

If one falls, another will take our place.

We are archivers. We are legion.

Chris Barts says:

February 13, 2008 at 10:05 am

Too bad “Trigger Happy” was written by someone who can’t even get basic facts straight. The PDP-1 was not a mainframe and it was not hulking. In fact, the main, defining feature was that it was small and cheap enough MIT would let students like Greenblatt and off-the-wall professors like Minsky play with it. That is the sole reason people remember it as fondly as they do.

Oh, yeah, great article otherwise. I’d emphasize the fact people like you can adapt to just about any server-side restriction and that wget alone is flexible enough to look enough like a human that software can’t really filter it (even to the point of inserting random delays into its sequence of requests so it looks like a human instead of a program).

Gene Buckle says:

February 13, 2008 at 10:07 am

Our Archives _never_ forget.

Matt Brubeck says:

February 13, 2008 at 12:01 pm

Maybe he’s just trying to keep out Richard Stallman.

Richard Warezwolf says:

February 13, 2008 at 12:50 pm

There is one on-line community that finds it hilarious to “archive” content on websites that they find offensive to their hive mind (using wget). They in fact use the word ‘archive’ to get around their own self-imposed rule against fucking with other sites.

how i hate them

–RW

Josef Kenny says:

February 13, 2008 at 3:29 pm

I just wgot ascii.textfiles.com 😛 It’s quite a good idea actually. Next time my hard drive dies (yup, maxtor) I can entertain myself by flicking through the archives in Ubuntu.

steven says:

March 24, 2008 at 5:07 pm

Gosh, you are so much cleverer and cooler than me, the mere author of “some academic nib-nob”. (Er, it’s not actually academic, but whatever.) I am shivering in awe.

What was actually going on was that some idiot was wgetting the same file from the same IP address about 50 times a minute for several hours, thus costing me money. I actually didn’t know how to do anything more complicated than ban the user-agent (which worked for this idiot), and didn’t have time to find a better solution. So please aim your self-righteousness somewhere else.

Sarah says:

April 28, 2009 at 6:36 am

I think both people have valid points. Often people seriously abuse wget and strip whole sites… often for their own money-grabbing purposes (which costs you money in terms of bandwidth).

But simply blocking wget stops a lot of archive systems people are using.
Perhaps revisiting a 100% kill wget policy after the problem has gone away so that it throttles the connection would be better. Who can say, most people apply the bandaid and leave it at that.

But there is no real need for righteous indignation from either side.
