In some circles, I’m known as the guy who wrote Fuck the Cloud.
Yet as of this past weekend, I have three Amazon EC2 instances doing massive amounts of screenshots of ZX Spectrum programs (thousands so far) using the Screen Shotgun.
Nobody has specifically come after me about this, but I figured I’d get out ahead of it and reiterate what I meant by Fuck the Cloud, since the lesson is still quite relevant.
So, the task of Screen Shotgunning still takes some amount of Real Time: an emulator is run in a headless Firefox program, the resulting output is captured and analyzed a bit, and then the resulting unique images are shoved into the entry on archive.org so that you get a really nice preview of whatever this floppy or cartridge has on it. That process, which really works best as one run per machine, takes a couple of minutes per program. Multiply that by the tens of thousands of floppies I intend to do this against, and letting it run on a spare machine (or even two) is not going to fly. I need a screenshot army, a pile of machines to do this task at the same time, and then get those things up into the collections ASAP.
A perfectly reproducible, time-consuming task that can be broken into discrete chunks. In other words, just the sort of task perfect for….
Well, let’s hold up there.
So, one thread or realm of developer/programmer/bystander would say “Put it in the Cloud!” and this was the original thing I was railing about. Saying “Put it in the Cloud” should be about as meaningful a statement as “computerize it” or “push it digital”. The concept of “The Cloud” was, when I wrote my original essay, so very destroyed by anyone who wanted to make some bucks jumping on coat-tails, that to say “The Cloud” was ultimately meaningless. You needed the step after that move to really start discussing anything relevant.
The fundamental issue for me, you see, is pledging obfuscation and smoke as valid aspects of a computing process. To get people away from understanding exactly what’s going on, down there, and to pledge this as a virtue. That’s not how all this should work. Even if you don’t want to necessarily be the one switching out spark plugs or filling the tank, you’re a better person if you know why those things happen and what they do. A teacher in my past, in science, spent a significant amount of time in our class describing every single aspect of a V-8 engine, because he said science was at work there, and while only a small percentage of us may go into laboratories and rockets, we’d all likely end up with a car. He was damn right.
Hiding things leads to corruption. It leads to shortcuts. It starts with someone telling you all is well, and then all the wheels fall off at 6am on a Sunday. And then you won’t know where the wheels even were. Or that there were wheels. That is what I rail against. “The Cloud” has come to literally mean anything people want.
No, what I wanted was a bunch of machines I could call up and rent by the hour or day and do screenshots on.
And I got them.
Utilizing Amazon’s EC2 (Elastic Compute Cloud) is actually pretty simple, and there are an awful lot of knobs and levers you can mess with. They don’t tell you what else is sharing your hardware, of course, but they’re upfront about what datacenter the machines are in, what sort of hardware is in use, and all manner of reporting on the machine’s performance. It took me less than an hour to get a pretty good grip on what “machines” were available, and what it would cost.
You pay by the “machine hour” for these, and I was using a machine that cost $0.47 an hour. Within a day, you’re talking about $11. Not a lot of money, but that would add up. The per-hour cost also helped me in another way: it made me hunt down inefficiencies. I realized that uploading directly to archive.org was slowing things down, because each upload had to wait in line for the inbox. Shoving things into a file folder on a machine I had inside the Internet Archive was much faster, since it just ran the file transfer and was able to go to the next screenshot. Out of the 2 minutes each program takes, the file upload became completely negligible: maybe 1-2 seconds of uploading and done, versus 1-2 minutes putting it carefully into an item. Efficiency!
I then tried to find the least expensive machine that still did the work. After some experimentation (during which I could “transfer the soul” of my machine to another version), I found that c3.large did the job just fine at $0.12/hr, a major savings. That’s what it’s running on for now.
Because I knew what I was dealing with, that is, a machine that was actually software to imitate a machine that was itself inside an even larger machine and that machine inside a datacenter somewhere in California… I could make smarter choices.
The script to “add all the stuff” my screen shotgun needs sits on a machine that I completely control at the Internet Archive. The screenshots that the program takes are immediately uploaded away from the “virtual” Amazon machine, so a sudden server loss will have very little effect on the work. And everything is designed so that it’s aware other “instances” are adding screenshots – if a screenshot already exists for a package, the shotgun will move immediately to the next one. This means I can have multiple machines gnaw on a 9,000 item collection (from different ends and in the middle) like little piranhas and the job will get done that much quicker.
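To make the piranha idea concrete, here is a minimal sketch in Python. It is not the actual Screen Shotgun code: the names `worker`, `run_round_robin`, and `simulate` are made up for illustration, the shared `done` set stands in for "does this item already have a screenshot on the Archive?", and the round-robin loop only fakes three instances running at once.

```python
from typing import Iterator, List, Set, Tuple


def worker(name: str, items: List[str], done: Set[str],
           log: List[Tuple[str, str]]) -> Iterator[str]:
    """One hypothetical instance walking the collection in its own order."""
    for item in items:
        if item in done:
            continue  # a screenshot already exists, so move straight on
        done.add(item)  # stand-in for "run the emulator, capture, upload"
        log.append((name, item))
        yield item  # give the other instances a turn


def run_round_robin(workers: List[Iterator[str]]) -> None:
    """Fake concurrency: let each worker handle one item per round."""
    active = list(workers)
    while active:
        for w in list(active):
            if next(w, None) is None:  # this worker has run out of items
                active.remove(w)


def simulate(n_items: int = 9) -> Tuple[List[str], List[Tuple[str, str]]]:
    items = [f"item-{i}" for i in range(n_items)]
    done: Set[str] = set()
    log: List[Tuple[str, str]] = []
    mid = n_items // 2
    run_round_robin([
        worker("front", items, done, log),                      # from the start
        worker("back", items[::-1], done, log),                 # from the end
        worker("middle", items[mid:] + items[:mid], done, log), # from the middle
    ])
    return items, log


if __name__ == "__main__":
    items, log = simulate()
    for name, item in log:
        print(name, item)
```

Every item gets handled exactly once, and no instance needs to know the others exist. On real machines the check and the upload are not one atomic step, so two instances could occasionally grab the same item; the worst case there is a bit of duplicated work, not a broken collection.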
In other windows, as I type this, I see new screenshots being added every 20 seconds to the Archive. That’s very nice. And the total cost for this is currently 36 cents an hour (three instances at $0.12 each), in which time a thousand screengrabs might be handled.
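The arithmetic behind those numbers fits in a few lines of Python. The rates and the 2 minutes per program come from above; the helper names `fleet_cost` and `hours_needed` are made up for this sketch, and the estimate for a whole 9,000 item collection assumes the work splits perfectly evenly across three instances with nothing skipped, so treat it as a rough guess and not a measured figure.

```python
from decimal import Decimal


def fleet_cost(hourly_rate: str, instances: int = 1, hours=1) -> Decimal:
    """Rental cost in dollars for some number of instances over some hours."""
    return Decimal(hourly_rate) * instances * hours


def hours_needed(items: int, minutes_per_item: int, instances: int) -> Decimal:
    """Wall-clock hours, assuming the work divides evenly (a big assumption)."""
    return Decimal(items) * Decimal(minutes_per_item) / (60 * instances)


if __name__ == "__main__":
    # The first machine I tried, left running for a full day.
    print(fleet_cost("0.47", hours=24))        # 11.28
    # Three c3.large instances for one hour.
    print(fleet_cost("0.12", instances=3))     # 0.36
    # Rough guess for a 9,000 item collection at 2 minutes per program.
    hours = hours_needed(9000, 2, 3)
    print(hours, fleet_cost("0.12", instances=3, hours=hours))
```

Decimal is used here only so the cents come out exact; floats would do fine for a back-of-the-envelope check.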
I’m not “leveraging the power of the cloud”. I’m using some available computer rental time to get my shit done, a process that has existed since the first days of mainframes, when Digital and IBM would lease out processing time on machines they sold to bigger customers, in return for a price break.
It is not new.
But it does rule.