WHAT the Cloud? — July 30, 2014

In some circles, I’m known as the guy who wrote Fuck the Cloud.

Yet as of this past weekend, I have three Amazon EC2 instances doing massive amounts of screenshots of ZX Spectrum programs (thousands so far) using the Screen Shotgun.

Nobody has specifically come after me about this, but I figured I’d get out ahead of it and reiterate what I meant by Fuck The Cloud, since the lesson is still quite relevant.

So, the task of Screen Shotgunning still takes some amount of Real Time – that is, an emulator is run in a headless Firefox program, the resulting output is captured and analyzed a bit, and then the resulting unique images are shoved into the entry on archive.org so that you get a really nice preview of whatever this floppy or cartridge has on it. That process, which really works best one at a time per machine, takes a couple of minutes per program; multiply that by the tens of thousands of floppies I intend to do this against, and letting it run on a spare machine (or even two) is not going to fly. I need a screenshot army, a pile of machines to do this task at the same time, and then get those things up into the collections ASAP.
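As a rough illustration of the "resulting unique images" step (this is not the actual Screen Shotgun code), one way to drop repeated captures is to hash each frame and keep only the first copy of each. The function name `unique_frames` is hypothetical, and the sketch assumes frames arrive as raw bytes and that "unique" means byte-for-byte identical; a real capture pipeline might well need a fuzzier comparison.

```python
import hashlib


def unique_frames(frames):
    """Return the frames with exact duplicates removed, keeping first-seen order.

    `frames` is assumed to be an iterable of bytes objects (e.g. PNG data).
    """
    seen = set()
    kept = []
    for data in frames:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(data)
    return kept
```

Hashing keeps the comparison cheap even when an emulator run produces hundreds of near-static captures, since only the digests need to be held in memory for the lookup.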

A perfectly reproducible, time-consuming task that can be broken into discrete chunks. In other words, just the sort of task perfect for….

Well, let’s hold up there.

So, one thread or realm of developer/programmer/bystander would say “Put it in the Cloud!” and this was the original thing I was railing about. Saying “Put it in the Cloud” should be about as meaningful a statement as “computerize it” or “push it digital”. The concept of “The Cloud” was, when I wrote my original essay, so very destroyed by anyone who wanted to make some bucks jumping on coat-tails, that to say “The Cloud” was ultimately meaningless. You needed the step after that move to really start discussing anything relevant.

The fundamental issue for me, you see, is pledging obfuscation and smoke as valid aspects of a computing process. To get people away from understanding exactly what’s going on, down there, and to pledge this as a virtue. That’s not how all this should work. Even if you don’t want to necessarily be the one switching out spark plugs or filling the tank, you’re a better person if you know why those things happen and what they do. A teacher in my past, in science, spent a significant amount of time in our class describing every single aspect of a V-8 engine, because he said science was at work there, and while only a small percentage of us may go into laboratories and rockets, we’d all likely end up with a car. He was damn right.

Hiding things leads to corruption. It leads to shortcuts. It starts with someone telling you all is well, and then all the wheels fall off at 6am on a Sunday. And then you won’t know where the wheels even were. Or that there were wheels. That is what I rail against. “The Cloud” has come to literally mean anything people want.

No, what I wanted was a bunch of machines I could call up and rent by the hour or day and do screenshots on.

And I got them.

Utilizing Amazon’s EC2 (Elastic Compute Cloud) is actually pretty simple, and there are an awful lot of knobs and levers you can mess with. They don’t tell you what else is sharing your hardware, of course, but they’re upfront about what datacenter the machines are in, what sort of hardware is in use, and all manner of reporting on the machine’s performance. It took me less than an hour to get a pretty good grip on what “machines” were available, and what it would cost.

I started with their free tier, i.e. a clever “try before you buy” level of machine, but running an X framebuffer and an instance of Firefox and then making THAT run a massive javascript emulator was just a little too much for the thing. I then went the other way and went for a pretty powerful box (the c3.2xlarge is the type) and found it ran my stuff extremely well – in fact, compared to the machine I was using to do screenshots, it halved the time necessary to get the images. Nice.

You pay by the “machine hour” for these, and I was using a machine that cost $0.47 an hour. Within a day, you’re talking about $11. Not a lot of money, but that would add up. The per-hour cost also helped me in another way – it made me hunt down inefficiencies. I realized that uploading directly to archive.org was slowing things down – it had to wait in line for the inbox. Shoving things into a file folder on a machine I had inside the Internet Archive was much faster, since it just ran the file transfer and was able to go to the next screenshot. Out of the 2 minute time per program, the file upload was actually completely negligible – maybe 1-2 seconds of uploading and done, versus 1-2 minutes putting it carefully into an item. Efficiency!

I then tried to find the least expensive machine that still did the work. After some experimentation (during which I could “transfer the soul” of my machine to another version), I found that c3.large did the job just fine – at $0.12/hr, a major savings. That’s what I’m running on for now.
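The arithmetic behind that choice is easy to check. The sketch below uses only the hourly prices quoted in this post (2014 figures, not current AWS pricing) and the three instances mentioned at the top; the names `PRICES` and `daily_cost` are made up for the example.

```python
HOURS_PER_DAY = 24

# Hourly prices as quoted in the post; assumed constant around the clock.
PRICES = {
    "c3.2xlarge": 0.47,
    "c3.large": 0.12,
}


def daily_cost(hourly_rate, machines=1):
    """Cost in dollars of running `machines` instances for 24 hours."""
    return round(hourly_rate * HOURS_PER_DAY * machines, 2)


if __name__ == "__main__":
    for name, rate in PRICES.items():
        print(f"{name}: ${daily_cost(rate):.2f}/day for one machine")
    print(f"three c3.large: ${daily_cost(PRICES['c3.large'], 3):.2f}/day")
```

On those numbers, three of the smaller machines running all day still come in under the daily cost of a single c3.2xlarge.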

Because I knew what I was dealing with, that is, a machine that was actually software to imitate a machine that was itself inside an even larger machine and that machine inside a datacenter somewhere in California… I could make smarter choices.

The script to “add all the stuff” my screen shotgun needs sits on a machine that I completely control at the Internet Archive. The screenshots that the program takes are immediately uploaded away from the “virtual” Amazon machine, so a sudden server loss will have very little effect on the work. And everything is designed so that it’s aware other “instances” are adding screenshots – if a screenshot already exists for a package, the shotgun will move immediately to the next one. This means I can have multiple machines gnaw on a 9,000 item collection (from different ends and in the middle) like little piranhas and the job will get done that much quicker.
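The "move on if it's already done" idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the real shotgun script: `work_order`, `run_worker`, `has_screenshot` and `take_screenshot` are hypothetical names, and rotating the list so each worker starts at a different point is just one way to get the "different ends and in the middle" effect.

```python
def work_order(items, worker_index, worker_count):
    """Rotate the item list so each worker begins at a different offset."""
    if not items:
        return []
    start = (len(items) * worker_index) // worker_count
    return items[start:] + items[:start]


def run_worker(items, worker_index, worker_count, has_screenshot, take_screenshot):
    """Walk the collection, skipping anything that already has a screenshot.

    `has_screenshot(item)` and `take_screenshot(item)` are callables supplied
    by the caller; in a real setup they would talk to the remote archive.
    Returns the list of items this worker actually processed.
    """
    processed = []
    for item in work_order(items, worker_index, worker_count):
        if has_screenshot(item):
            continue  # another instance (or an earlier run) got here first
        take_screenshot(item)
        processed.append(item)
    return processed
```

There is no locking here, so two workers can occasionally grab the same item in the gap between the check and the upload. The cost of that is a little duplicated work rather than a wrong result, which seems like a fair trade for not needing a coordinator.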

In other windows, as I type this, I see new screenshots being added every 20 seconds to the Archive. That’s very nice. And the total cost for this is currently 36 cents every hour, in which time a thousand screengrabs might be handled.

I’m not “leveraging the power of the cloud”. I’m using some available computer rental time to get my shit done, a process that has existed since the first days of mainframes, when Digital and IBM would lease out processing time on machines they sold to bigger customers, in return for a price break.

It is not new.

But it does rule.

Categorised as: computer history | Internet Archive | jason his own self


6 Comments

ChrisG says:

July 30, 2014 at 10:59 pm

I agree on the cloud. Your data is best kept on a short leash.

anachostic says:

July 31, 2014 at 2:32 am

You’ve been thinking about this long before me, but I’m surprised this viewpoint never came up: http://anachostic.wordpress.com/2014/05/15/heads-in-the-cloud/

Bruce says:

August 2, 2014 at 11:53 pm

In the 60s when IBM invented the cloud they called it Utility computing by analogy with power or water utilities.

Asterisk says:

August 4, 2014 at 7:21 pm

Using commodity storage/bandwidth/CPU-power provided by a third party doesn’t seem to be “the cloud” that people hype, and that you rightly criticized in “Fuck the Cloud!”. Services like Amazon EC2, Linode, etc., are just an evolution of the traditional web-hosting paradigm: you’re still running your own code using resources that are under your own direct control. Hosted platforms like these are like leasing a house: you may have to make regular payments to keep living there, but while you do live there, you have just as much privacy and control over your activities as you would if you owned the house.

Software-as-a-service, on the other hand, is more like living in a hotel; you don’t really control anything, and you’re subject to the mercy of the people you’re obtaining the service from. You don’t control the code, and you don’t have direct access to the servers it’s running on: you’re simply doing data transactions with someone else’s server. SaaS is the “cloud” we need to be wary of, and EC2-like solutions are a step in the right direction.

natecull says:

August 11, 2014 at 8:48 am

I have to wonder just how secure even Amazon S3 volumes are. Since if AWS doesn’t outright generate the private key for you, their custom hypervisor probably knows when you make the S3 API call, and it certainly has access to the RAM that contains your private key. They talk about how secure everything is and how they don’t store your keys and there are armed guards watching all the sysops, but from where I’m sitting my AWS credentials are exactly as secure as one quiet little shell script running under the CEO’s account.

And they host a private AWS instance for the CIA, but I’m sure they refuse to cooperate at all with the US intelligence community. Or have any desire to make any commercial advantage from several hundred thousand companies’ financial records and patented research. They just sit there every day, day after long boring day, patrolling the hot aisles and saluting at guard change every hour, protecting everyone’s secrets and manfully resisting the urge to peek.

And I know that technically, any of thousands of small hosting companies could also look at what’s on the VM instances they host. Well, except that small web hosts typically run off-the-shelf software and don’t have their own custom hardware and software.

It’s just that if I were an evil billionaire Bond villain with my own space program, owning ALL OF THE WORLD’S COMPUTERS (MUHAHAHAHA) would be fairly high up there on my To Do list, right after blowing up the Moon.

We can trust Jeff Bezos, though, right? Anyone with rockets must be a nice guy.

natecull says:

August 11, 2014 at 8:59 am

(I don’t actually know how important privacy of data or security of private keys is to people running VM cloud instances; it just seems to me to be the elephant in the cloud that nobody’s talking about, because there seems mathematically to be no way to keep private keys of any kind secure in hosted computers at all. And maybe it’s just best ignored? What do people think? Is there any global systemic risk to hosting collapsing into three or four huge US providers? Can an Amazon or IBM be sued into oblivion if they read a single byte of encrypted user data? I want to be told that yes, I’m overreacting.)
