I find it easier to generally just grab anything that catches my attention for more than a few seconds. Copy it, download it, PDF it, whatever makes the most sense, shove it into a directory with a description of it (if any) and then forget about it. It takes me 10 seconds, and I think of all the times we wish somebody 5, 10, 15 years ago had done this, and how we’d all be a little happier for it. So I do this all the time.
One of the ways I do this is to use a program called wget, which can be poorly summarized as “a web browser that does a single thing”. In fact, it can do many things, but what it basically does is allow you to interact with web-based and net-based assets such that you can say “go get this”. So if there’s a URL to an image, you can wget it. If there’s a site you want to download, you can say “wget the site and everything it has on it”. It can even let you go to password-protected stuff and grab a copy, update just what’s new, and so on. It’s very nice. I use it all the time.
Here’s my incantation:
wget -r -l 0 -np -nc http://www.somewebsite.com
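For anyone who doesn't have the short flags memorized, here is a rough sketch in Python that builds the same command with wget's long option names, one comment per flag. The function name `mirror_command` is made up for illustration; the options themselves are standard wget ones.

```python
def mirror_command(url):
    """Build the wget incantation above as an argument list (hypothetical helper)."""
    return [
        "wget",
        "--recursive",   # -r: follow links and fetch what they point to
        "--level=0",     # -l 0: no limit on recursion depth (0 means infinite)
        "--no-parent",   # -np: never climb above the starting directory
        "--no-clobber",  # -nc: skip files that already exist locally
        url,
    ]
```

Handing the result to `subprocess.run(mirror_command("http://www.somewebsite.com"), check=True)` should behave like the one-liner, assuming wget is installed and on the PATH.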
Every once in a while, though, it doesn’t work. I go to download something and I get a big fat error. Like here:
wget http://stevenpoole.net/th/TriggerHappy.pdf
--23:57:37-- http://stevenpoole.net/th/TriggerHappy.pdf
=> `TriggerHappy.pdf'
Resolving stevenpoole.net... 220.127.116.11
Connecting to stevenpoole.net|18.104.22.168|:80... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
23:57:37 ERROR 500: Internal Server Error.
Hey! Something broke. I can’t get this file. The file, in this case, is a book about videogames, some academic nib-nob, that the author has released to a creative commons license, in PDF form. It’s perfect for pdf.textfiles.com, so I tried to wget it. And failed.
I am then forced to pull out the big guns, the secret weapons that ensure my continued success in this rough and tumble world of high security:
wget --user-agent=EatDeliciousPoop http://stevenpoole.net/th/TriggerHappy.pdf
--03:01:33-- http://stevenpoole.net/th/TriggerHappy.pdf
=> `TriggerHappy.pdf'
Resolving stevenpoole.net... 22.214.171.124
Connecting to stevenpoole.net|126.96.36.199|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,633,505 (2.5M) [application/pdf]
100%[====================================>] 2,633,505 647.63K/s ETA 00:00
03:01:38 (585.47 KB/s) - `TriggerHappy.pdf' saved [2633505/2633505]
Wow! That’s amazing! By merely indicating that my “User Agent” (the tag a browser sends along with every request) was NOT wget (the default), but that old standard “Eat Delicious Poop”, suddenly I downloaded it with no problems.
So what’s at play here is there’s a rule in the webserver’s configuration that if the user-agent string is “wget”, return a 500 error and throw that bastard out. If it’s anything else, however, roll out the red carpet and let that esteemed colleague download your precious data.
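To make that concrete, here is a small self-contained Python sketch of such a rule, run against a throwaway local server. It is a guess at the behavior, not the actual configuration on that site: the handler name `PickyHandler`, the substring match on "wget", the helper names, and the sample agent string `Wget/1.10` are all assumptions for illustration.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class PickyHandler(BaseHTTPRequestHandler):
    """Hypothetical rule: any User-Agent mentioning wget gets a 500."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if "wget" in agent.lower():
            self.send_error(500)  # throw that client out
            return
        body = b"precious data\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_status(url, user_agent):
    """Request url with the given User-Agent and return the HTTP status code."""
    # An empty ProxyHandler keeps local requests away from any configured proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def demo():
    """Start the picky server on a free local port and try both agents."""
    server = HTTPServer(("127.0.0.1", 0), PickyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/" % server.server_port
    try:
        return (
            fetch_status(url, "Wget/1.10"),
            fetch_status(url, "EatDeliciousPoop"),
        )
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the status seen by each agent in turn. Nothing about the request changes between the two calls except the label the client puts on itself.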
I’m picking on Steven today, but I run into this crap all the time. Bear in mind it’s not checking for number of links, or bandwidth utilization, or any metric that would actually indicate abuse. It’s just looking for the most basic, most surface judgment, profiling really, and then making a snap decision: NO. Oh, I’m sure some people are unaware their servers do this, but a lot actually think they’re helping something. They’re not.
You want to be all super-hacker and automatic-defense systems and shit? Easy enough; bury a link to a textfile somewhere at the surface of your site. Don’t embed it or have it load on its own. Make it so you have to actively pull that file down. If someone does it, then they’re spidering. Pretty simple. A browser wouldn’t do it and someone like me, targeting a file, doesn’t get ensnared in your bear trap. Ban that IP for 24 hours. Congratulations, warrior.
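A minimal sketch of that bear trap, in Python, might look like the following. Everything named here is hypothetical: the trap path, the `SpiderTrap` class, and the choice of a 403 for banned clients are assumptions, and a real site would hook this into whatever server it already runs.

```python
import time

BAN_SECONDS = 24 * 60 * 60        # the 24-hour ban from the text
TRAP_PATH = "/do-not-follow.txt"  # hypothetical buried textfile


class SpiderTrap:
    """Bans any IP that pulls down the trap file, for BAN_SECONDS."""

    def __init__(self, clock=time.time):
        self._clock = clock      # injectable so the ban expiry can be tested
        self._banned_until = {}  # ip -> timestamp when the ban ends

    def is_banned(self, ip):
        expires = self._banned_until.get(ip)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._banned_until[ip]  # ban has run out, forget it
            return False
        return True

    def handle(self, ip, path):
        """Return the HTTP status to send for this request."""
        if self.is_banned(ip):
            return 403
        if path == TRAP_PATH:
            # Only something following every link ends up here.
            self._banned_until[ip] = self._clock() + BAN_SECONDS
            return 403
        return 200
```

The point of the design is that the decision rests on behavior (fetching a file no person would fetch) rather than on the name the client gives itself, so someone grabbing a single PDF never trips it.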
Of course, if you’re offering a book online, or an artifact, or some other item, one would think you’d be happy someone was wgetting it, meaning they were attempting to place it somewhere, instead of just viewing it inline in their browser, ready to switch off to the next animated GIF or site that catches their eye.
Don’t hate archivers. We outlast you.