ASCII by Jason Scott

Jason Scott's Weblog

Statement by Jason Scott on Archive Team —

archiveteamlogo

In 2009, I proposed the idea that rapidly-shutting-down websites should be quickly downloaded and saved by a fast acting and future-minded group of web archivists. I called them the A-Team, or Archive Team, and said they’d be the last resort for online history that was otherwise doomed. An unbelievably huge amount of web history had been deleted by that point, subject to capricious whims of startups and dot-com tomfoolery that resulted in quick shut-offs.

This proposal resonated with a good number of people, and they volunteered time, money and effort to make this team a reality.

In the interim years, I have done dozens of appearances, interviews and statements regarding the goals and ideas of Archive Team. With my skills in public speaking and a willingness to stand in front of crowds, I’ve been the face of Archive Team to many people. I’m still involved with the project daily, and provide advice, make connections, and do whatever else I can to help with the goals of the Team.

In 2011, on the strength of a talk I presented at the Personal Digital Archiving Conference, I was hired by the Internet Archive as a free-range archivist, working with the non-profit on a huge realm of projects. Some of these have been internal efforts and some have been widely public, like the collection of software or preservation of various musical archives.

All of this brings to bear a very important distinction:

Archive Team is not an Internet Archive Project, and I am not the owner of Archive Team.

Archive Team has no offices, no phone number, and a website that is a wiki that any reasonable person or team member can edit. While I am a (generally) beloved figure who is appreciated for his public speaking skills and snappy dressing, Archive Team has collectively disagreed with me, and some projects have been approached in completely different ways than I would have approached them.

The Internet Archive does NOT take every single bit of data Archive Team produces (although they take the vast, vast majority) and Archive Team is not “paid for” or “owned by” the Internet Archive in any way.

As the saying goes, if I was hit by a bus later this afternoon, Archive Team would be very sad, but would get back to work by the evening. The early evening.

While I am pleased that the Team listens to my opinion as much as it does, there have been projects and efforts that were started, refined, and producing output long before I wandered into the communication channel. While my presence is then acknowledged, the Team has continued the given effort as it would have anyway, preserving and rescuing online history from total oblivion for the good of society and future generations.

Archive Team is an idea, and the idea is far beyond me or even the Internet Archive at this point. I expect this to continue.

Meanwhile, I have been actively avoiding the front of buses.

Thanks.


OASIS at SXSW: An Asking Thing —

Every time I’ve spoken at the South by Southwest Festival in Austin, Texas, I’ve always had someone else arranging all the details, including the panel and the attendant responsibilities of getting it to pass muster. This time, however, I’ve put together something neat, and the process needs, nay, requires you to really get out there and pound the pavement.

I’m a busy person and endless pavement pounding is not my usual deal, but if you are so inclined, and since the deadline to vote is within a week, I’d like to call your attention to the SXSW “Panel Picker” and a proposed on-stage conversation to be held between two very odd retro-oriented folks.

Here’s the full information:

http://panelpicker.sxsw.com/vote/38216

In Ernest Cline’s novel READY PLAYER ONE, the world’s population spends most of its time inside the OASIS, a simulator and ultimate operating system that provides access to an endless amount of games, videos and media to the world. At the Internet Archive, a non-profit dedicated to open access to as much content as possible, a new in-browser interface allows instant access to tens of thousands of microcomputer programs, console games, and emulated vintage hardware. Author Ernest Cline and Internet Archive’s software curator Jason Scott discuss the similarities between OASIS and the Archive, the consequences and results of a world with endless vintage computing access, and what parts of the 1980s they’re working the hardest to save.

It’ll help if you have read Ready Player One, so there’s Ernest Cline’s page about it. It is a fun novel about a world where, among other things, endless amounts of old videogames are available.

It’ll also help if you are aware of the Console Living Room, a project I’ve spearheaded where, among other things, endless amounts of old videogames are available.

The SXSW system made me jump through an enormous number of hoops to get into that panel picker. They have a very large number of things you have to read and describe, and it appears that having a panel of a guy from Austin (Ernest) and a guy from NY (me) throws us a few rungs back, but my hope is that the surreal experience of a book coming true, and what that means for the world at large, will overcome that.

The deadline to vote is at midnight tonight. I couldn’t bring myself to do the massive amount of canvassing to beg for votes that I think is being encouraged. So my hope is you will vote if you want to, and if not, we’ll figure something out. Ernest and I have a mutual admiration and I look forward to doing stuff with him in the future.

Anyway, vote before Midnight, or leave a comment, or whatever floats your 1980s-loving boat. Thanks.


The Need for JSMESS Speed —

This is another call for help with JSMESS. I promise you that I will get back to more general computer history soon, but this project is really important and changing the entire paradigm of how software is presented is pretty high up the list right now.

I also know this pattern of describing the issue and then calling out for help can get pretty repetitive, but the combination of JavaScript conversion, browser interaction, and the entire MESS/MAME project itself means there’s all sorts of strangeness happening in the gaps between.

Let’s focus on the good part first… This program works very well. Almost miraculously, it will run a whole variety of software, game cartridges, and images and present them inside of your browser. Sometimes it’s a bit rough, sometimes it’s a bit slow, sometimes the mere overwhelming user interface between the original item and it being in a browser window makes things strange. But even a few years in, I will set up a full screen image of a game console playing a scrolling classic and I will completely forget how this is happening. It is seriously the bomb.

logo

A while ago I put out a call to have the sound issue looked at. A top-notch developer named Katelyn Gadd stepped forward, helped us create an entirely new sound device, and in doing so fixed about 20 major problems with sound. She also gave all of us a master class in understanding what the boundaries and hurdles are in browser sound in general. Summary: lots, although they are working to change standards to make it better.

The sound situation and resolution was amazing enough to inspire me to try it again. This time, it is speed.

There are a variety of angles of attack for making the JSMESS system run faster in the browser.

Obviously, it helps if the Emscripten compiler gets more efficient, and work is being done in that direction. Just a year ago (can it have really been just a year?), the ColecoVision emulation was working at 14% speed. Now it consistently runs at 100% even on slower systems. Work on this is ongoing, and the Emscripten development team stays in almost constant communication with us, so that’s being handled.

Obviously hardware will get better over time, but we’re not exactly going to sit back and wait. But stay on point, computer industry!

The browsers themselves are rapidly increasing the speed of their JavaScript engines. The website arewefastyet.com lets you watch nightly tests being run against JavaScript engines so that we may notice that these things are getting damn fast. Again, not my department, not willing to wait.

Certainly, the emulator itself has been working to speed things up, but it might not surprise you to learn that speed is willingly sacrificed in the name of accuracy, to make sure that all the aspects of incoming images are handled and that everything can be, if not future-proof, at least future-existent. If it slows things down for a while, the MAME/MESS team is not bothered by it. It would be nice if somebody went to work on the emulation team itself to optimize things and generally help track stuff down, but that’s a rant for a future entry. Until then, speedups and slowdowns on the emulation can have a pretty drastic effect on the JavaScript version as well.

So that leaves a number of efforts to make the resulting JavaScript output as machine-friendly and fast as possible. It also means a situation where simple code changes applied to the emulator source code result in the JavaScript version being that much more efficient.

To help jumpstart things, we have created a page about the need for speed. We’re trying to lay out, in terms that will be of use to a developer or coder, exactly what we’re looking for.

If you’ve got the skills to get involved with this, or know someone who does, it would be great to hear from you. It would have an amazing effect on a pretty important project, and we’ve seen cases where one or two simple insights from a new team member makes the entire thing run that much better.

We’ve really come along. There’s a ways to go, and I’m hoping that by writing this we can reach someone who can make a difference.

Let’s speed this thing up.

screenshot_07


Screenshots Forever and Ever Until You Can’t Stand it —

The Screen Shotgun, as mentioned before, is continuing its never-ending quest to play tens of thousands of games and programs, take screenshots, and upload them into the Internet Archive.

Like any of these tinker-y projects, I’ve written a bunch of additional support, error checking, and what-have-you into the process, so that it can handle weird situations while being left alone for days on end. There’s still the occasional mess-up, but the whole thing can be reviewed later and I can pinpoint mistakes and re-do them with little effort. It’s a win all around.

There’s now a routine called BOUNDARYISSUES that looks at any emulated program and figures out where the edges are – it’s no big deal and the routine is probably hugely inefficient but it’s nice to keep my hands on the programming side of things, even a little. Thanks to BOUNDARYISSUES some machines that have less than two dozen known software packages are getting screenshots, since the program will do the cropping work and it’s not reliant on my procrastination or free time.
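
The guts of BOUNDARYISSUES aren't reproduced here, but the idea is simple enough to sketch. Here's a rough Python approximation using Pillow – my own stand-in, not the actual routine – which assumes the emulator screen is the only thing that differs from the page background:

  # Rough sketch of an auto-cropping pass in the spirit of BOUNDARYISSUES.
  # Assumption: the top-left pixel is the page background, and the emulator
  # screen is the only region that differs from it.
  from PIL import Image, ImageChops

  def find_screen_bounds(path):
      img = Image.open(path).convert("RGB")
      background = Image.new("RGB", img.size, img.getpixel((0, 0)))
      diff = ImageChops.difference(img, background)
      # getbbox() returns the smallest box containing every non-background pixel.
      return diff.getbbox()  # (left, upper, right, lower), or None if blank

  def crop_to_screen(path, out_path):
      bounds = find_screen_bounds(path)
      if bounds:
          Image.open(path).crop(bounds).save(out_path)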

And how many winners there are!

screenshot_25

There won’t be an endgame to this anytime soon – I’m now ingesting hundreds of floppies, thousands of already-ingested floppies, and whatever else I can find online. The Screen Shotgun has work cut out for it for some time to come.

So thanks to the industrialization of the screenshot, it’s giveaway time!

I’ve decided to throw some galleries of these screenshots on Flickr, because what the hell, I have an unlimited account and I love finding what the definition of “unlimited” is. So, enjoy:

Feel free to use these any way you want, for whatever you want. Watermark them and I’ll track you down and humiliate you like a hole in your pants. Make art! Do criticism! Set up retro slideshows in your raver club! This art represents hundreds of thousands of hours of work by thousands of people – it’s worth browsing through. (ZX and Atari 800 are my favorites at the moment.) I’ll be adding more sets soon.

So yeah, I’m writing this one in the “success” column. This is years of work, done in a month. But so much more to do!

Get going, shotgun.

00_coverscreenshot (1) 00_coverscreenshot (2) 00_coverscreenshot (3) 00_coverscreenshot (5) 00_coverscreenshot (6) 00_coverscreenshot (7)


WHAT the Cloud? —

In some circles, I’m known as the guy who wrote Fuck the Cloud.

Yet as of this past weekend, I have three Amazon EC2 instances doing massive amounts of screenshots of ZX Spectrum programs (thousands so far) using the Screen Shotgun.

Nobody has specifically come after me about this, but I figured I’d get out ahead of it, and reiterate what I meant about Fuck The Cloud, since the lesson is still quite relevant.

00_coverscreenshot (12)

So, the task of Screen Shotgunning still takes some amount of Real Time – that is, an emulator is run in a headless Firefox instance, the resulting output is captured and analyzed a bit, and then the unique images are shoved into the entry on archive.org so that you get a really nice preview of whatever this floppy or cartridge has on it. That process, which really works best one item at a time per machine, takes some number of minutes; multiply that by the tens of thousands of floppies I intend to do this against, and letting it run on a spare machine (or even two) is not going to fly. I need a screenshot army, a pile of machines to do this task at the same time, and then get those things up into the collections ASAP.

A perfectly reproducible, time-consuming task that can be broken into discrete chunks. In other words, just the sort of task perfect for….

Well, let’s hold up there.

IMG_3701

So, one thread or realm of developer/programmer/bystander would say “Put it in the Cloud!” and this was the original thing I was railing about. Saying “Put it in the Cloud” should be about as meaningful a statement as “computerize it” or “push it digital”. The concept of “The Cloud” was, when I wrote my original essay, so very destroyed by anyone who wanted to make some bucks jumping on coat-tails, that to say “The Cloud” was ultimately meaningless. You needed the step after that move to really start discussing anything relevant.

The fundamental issue for me, you see, is pledging obfuscation and smoke as valid aspects of a computing process. To get people away from understanding exactly what’s going on, down there, and to pledge this as a virtue. That’s not how all this should work. Even if you don’t want to necessarily be the one switching out spark plugs or filling the tank, you’re a better person if you know why those things happen and what they do. A teacher in my past, in science, spent a significant amount of time in our class describing every single aspect of a V-8 engine, because he said science was at work there, and while only a small percentage of us may go into laboratories and rockets, we’d all likely end up with a car. He was damn right.

Hiding things leads to corruption. It leads to shortcuts. It ends up with someone telling you all is well, and then all the wheels fall off at 6am on a Sunday. And then you won’t know where the wheels even were. Or that there were wheels. That is what I rail against. “The Cloud” has come to mean literally anything people want.

No, what I wanted was a bunch of machines I could call up and rent by the hour or day and do screenshots on.

And I got them.

samurai

Utilizing Amazon’s EC2 (Elastic Computing) is actually pretty simple, and there’s an awful lot of knobs and levers you can mess with. They don’t tell you what else is sharing your hardware, of course, but they’re upfront about what datacenter the machines are in, what sort of hardware is in use, and all manner of reporting on the machine’s performance. It took me less than an hour to get a pretty good grip on what “machines” were available, and what it would cost.

I started with their free tier, i.e. a clever “try before you buy” level of machine, but running an X framebuffer and an instance of Firefox and then making THAT run a massive javascript emulator was just a little too much for the thing. I then went the other way and went for a pretty powerful box (the c3.2xlarge is the type) and found it ran my stuff extremely well – in fact, compared to the machine I was using to do screenshots, it halved the time necessary to get the images. Nice.

You pay by the “machine hour” for these, and I was using a machine that cost $.47 an hour. Within a day, you’re talking $10. Not a lot of money, but that would add up. The per-hour cost also helped me in another way – it made me hunt down inefficiencies. I realized that uploading directly to archive.org was slowing things down – it had to wait in line for the inbox. Shoving things into a file folder on a machine I had inside the Internet Archive was much faster, since it just ran the file transfer and was able to go to the next screenshot. Out of the 2 minute time per program, the file upload was actually completely negligible – maybe 1-2 seconds of uploading and done, versus 1-2 minutes putting it carefully into an item. Efficiency!

I then tried to find the least expensive machine that still did the work. After some experimentation (during which I could “transfer the soul” of my machine to another version), I found that c3.large did the job just fine – at $0.12/hr, a major savings. That’s what is running it for now.
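
For the curious, the back-of-the-envelope math is straightforward. A tiny sketch, using the hourly rates above; the item count, per-item time and instance count are illustrative (and the bigger box was roughly twice as fast in practice, which narrows the gap):

  # Back-of-the-envelope cost estimate for a screenshot run.
  # The hourly rates are the ones quoted above; everything else is illustrative.
  def estimate(items, minutes_per_item, instances, rate_per_hour):
      wall_hours = (items * minutes_per_item) / 60.0 / instances
      cost = wall_hours * instances * rate_per_hour
      return wall_hours, cost

  for name, rate in [("c3.2xlarge", 0.47), ("c3.large", 0.12)]:
      hours, cost = estimate(items=9000, minutes_per_item=2, instances=3, rate_per_hour=rate)
      print(f"{name}: ~{hours:.0f} wall-clock hours, ~${cost:.2f}")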

00_coverscreenshot (11)

Because I knew what I was dealing with, that is, a machine that was actually software to imitate a machine that was itself inside an even larger machine and that machine inside a datacenter somewhere in California… I could make smarter choices.

The script to “add all the stuff” my screen shotgun needs sits on a machine that I completely control at the Internet Archive. The screenshots that the program takes are immediately uploaded away from the “virtual” Amazon machine, so a sudden server loss will have very little effect on the work. And everything is designed so that it’s aware other “instances” are adding screenshots – if a screenshot already exists for a package, the shotgun will move immediately to the next one. This means I can have multiple machines gnaw on a 9,000 item collection (from different ends and in the middle) like little piranhas and the job will get done that much quicker.
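
The “has this item already been shot?” check is the only coordination the instances need. Here’s a hedged sketch of that check using the internetarchive Python library – the real shotgun is shell script, and the screenshot filename suffix is just an example:

  # Sketch of the skip-if-already-done check that lets multiple instances
  # chew on the same collection without colliding. The suffix is illustrative.
  from internetarchive import get_item

  def already_screenshotted(identifier, suffix="_screenshot.gif"):
      item = get_item(identifier)
      return any(f.get("name", "").endswith(suffix) for f in item.files)

  def next_candidates(identifiers):
      for identifier in identifiers:
          if not already_screenshotted(identifier):
              yield identifier  # still needs a pass from this instance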

In other windows, as I type this, I see new screenshots being added to the Archive every 20 seconds. That’s very nice. And the total cost for this is currently 36 cents an hour, in which time a thousand screengrabs might be handled.

I’m not “leveraging the power of the cloud”. I’m using some available computer rental time to get my shit done, a process that has existed since the first days of mainframes, when Digital and IBM would lease out processing time on machines they sold to bigger customers, in return for a price break.

It is not new.

But it does rule.

screenshot_01


The JSMESS Sound Emergency —

UPDATE: I’m happy to say a developer has come forward and we’re out of the woods on sound. It’s not perfect, but the web audio API isn’t perfect and we’re much better armed for interacting with it now. Thanks, everyone.

logo

Spread this one far and wide.

It’s rare I get anything close to desperate, but we’re somewhere in the realm of “stunningly frustrated” and so I can see where things are going. I can state the problem simply, and hopefully you or someone who you can reach out to, can be the one to do the (likely minor) work to make this happen.

Essentially, JSMESS has a real sound issue.

MESS does not – the program handles sound nicely and stuff sounds really great, just as its recreation of computers and other features is great. In most cases, these amazing MESS features have translated nicely into JSMESS. But not sound.

IMG_2866

I have thrown a lot of good people at this morass. We’ve done a massive amount of work trying to get sound to improve. We have cases where it is very nice, and cases where it is horrible, grating.

It is holding back the project now. People want to hear the sound. Right now, it is simply not dependable enough to turn on at the Internet Archive. I want to be able to turn it on.

Like a lot of problems to solve with the web, we have two test cases you can try out: The Wizard test and the Criminal test.

Here is the Wizard Test. It’s an emulator playing the Psygnosis game “Wiz n’ Liz” on an emulated Sega Genesis. This is extremely tough on the browser – almost nothing can play it at 100% speed.

Here is the Criminal Test. It is an emulator playing Michael Jackson’s Smooth Criminal as rendered on a Colecovision. It is not tough on the browser at all. Almost everything should be able to do it at 100% or basically 100%.

In both cases, Firefox will play the emulator faster and will sound better. Chrome will generally do well, but will be slower. Internet Explorer will have zero sound. Safari… well, depends on the day. (And Opera is dead – it’s essentially a reskinned Chrome, just as SeaMonkey is a rebuilt Firefox.)

colecojackson

So, what do we know?

Well, part of this whole mess was a switch over to the Web Audio API. Mozilla’s browser had a nice format before that worked well, but only on Firefox. In theory, the new API will eventually work everywhere.

Here is a helpful chart describing that compatibility. So we’re working toward this Web Audio API.

My belief is only a relatively small number of people will be able to help. I am happy to entertain all ideas, discuss all possibilities. You can come to #jsmess on EFNet if you have IRC, or you can just e-mail me at audio@textfiles.com. I am willing to spend all the time you need to ramp up, or try any suggestion.

In the past, fresh eyes have helped us greatly to get MESS to the fantastic position it is now, where it can play tens of thousands of programs for hundreds of platforms. Here’s hoping your fresh eyes might help us further.

 


A Very Big Sort, or The Epic of Deaccessioning —

This has been years in the making.

IMG_7266

When my living space looked like this photo, it was just vaguely problematic. Considering what I do and what is specifically needed day to day, this project and storage situation was an issue.

But that was a while ago. Now we’re at this:

IMG_4977

See, that’s seriously out of control.

There are two main contributing factors to that state: I had to quickly consolidate from other parts of the house I’m renting when space was needed for other items, and I simply did not have time to address incoming material when it came in, so it became a matter of just finding space for things mailed in and saving them for later.

This is to say nothing of the shipping container, which currently looks like this:

IMG_2693

So, that’s a lot of stuff. That’s a 40-year-old’s ability to acquire material and pack it into a storage space, with a splash of “divorced in his 30s” and “moved out of the house”.

But here come the big changes.

The book scanner is back up again. That provides me the ability to scan in materials before finding them a home elsewhere. My rule is nothing leaves the house in printed form unless it’s digitized in some fashion, and I have a copy of the digitization.

With the books scanned and going to a home, that then leaves magazines.

With the magazines scanned and going to a home, that then leaves academic papers and proceedings.

Believe it or not, just getting those out will probably clear things up beyond belief – I’d say a MAJORITY of material in the container and my room are printed materials of this sort.

As I begin going through the books, I check for them on the Internet Archive’s Open Library site. If the book has been scanned already – and it’s quite shocking just how much the Archive has scanned over the past few years – then I know I don’t have to scan it. I spend a small portion of the time I would have spent scanning adding some background information to the Open Library entry, so that the book has a better look, making sure the cover image looks good, and so on. Then it’ll go into a box of outgoing material.
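
Open Library has a public search API, so the “has this been scanned already?” question can be semi-automated. A rough helper along these lines – whether a hit really matches the exact edition in hand is still a human call:

  # Query Open Library for scanned editions of a title. The query parameters
  # are the standard search.json fields; matching the edition is still manual.
  import requests

  def openlibrary_scans(title, author=None):
      params = {"title": title, "has_fulltext": "true"}
      if author:
          params["author"] = author
      resp = requests.get("https://openlibrary.org/search.json", params=params, timeout=30)
      resp.raise_for_status()
      docs = resp.json().get("docs", [])
      # "ia" lists Internet Archive identifiers of scanned editions, when present.
      return [(d.get("title"), d.get("ia", [])) for d in docs]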

With that going on, I’ll be getting a pile of to-be-scanned books and an outgoing pile to be sent away. That brings up the next issue.

Where do they go?

IMG_0758

I do not throw out books. Let’s repeat that – I do not throw out books. I go a little further, in fact, and I will not give the books to a place likely to throw out the books. I consider that right up there with the cowardly act of leaving your longtime pet with the vet to be put down while you tearfully drive home. You did the deed, you just didn’t own up to taking responsibility. So I’m not donating/contributing these books to a place that is likely to toss them.

This cuts down the potential field dramatically.

For books related to games and gaming, a home is already in place – The International Center for Electronic Games at the Strong Museum, in Rochester, NY. I’ve built up a great reputation with those folks, and they are delighted to be getting that set of material. They don’t mess around. They get things done:

IMG_1017IMG_1054IMG_1056

One of the reasons I really like them is that they have an honest research library and space you can work in – you can get a hotel nearby, go in, and do actual work, with a table and even a locker to store the printed materials between days. It’s what I really want for this.

Because, you see, it’s not about me having the most stuff. It really never was. I am not interested in going after as many lost collections as possible, pushing them into a bigger pile, and declaring victory. I want this stuff accessible and useful.

About 4-5 times in the last two years, people have approached me asking if I have X, and the answer, in 3 of those cases, is essentially that I do have X, but X is buried way the fuck down in the shipping container, and good goddamned luck. Well, that’s not right at all.

So off they’ll go.

The question that remains is where.

I’ll be packing these scanned-or-verified books into boxes, putting them in bags before doing so, and then looking for a place these will go.

The place will have to have a phone number, people on salary, and physical space dedicated to storing and accessing the materials.

I want to talk to them and I will probably want to tour them.

So that search begins – here’s hoping I find one.

In the meantime, now begins what I hope is the next phase – slimming down my collection while making it available to the maximum amount of people.

To the Scanner!

IMG_5080


MindCandy: The Last Bright Star Before the Media Dims —

unnamed

The MindCandy series got me started on this whole “get it down in a movie” trip.

Created with a love of the demoscene, a dedication to capturing the demos as accurately as possible, and, most importantly, an explanation of the entire process from beginning to end, MindCandy was a breath of fresh air. DVDs were still relatively new in 2002, and while there were a few weird examples of discs using the format’s features, few showed the dedication to making the most of the format that MindCandy did.

As I began work on the BBS Documentary, it was MindCandy’s inspiration (and their staff) which gave me the push I needed to make the final DVD as nice as it could be.

MindCandy Volume 1: PC Demos was followed up a few years later by MindCandy Volume 2: Amiga Demos. It was in every way as good as the first. They sold pretty well – they made back the cost but they’ll never make back the time spent to make them.

Then, finally, they released MindCandy Volume 3, a Blu-Ray/DVD combo that did its best to use all the insane capabilities of Blu-Ray and bring high-resolution captures of PC demos to the next generation of media equipment. It is truly an exquisite package, of a near and dear quality.

But of course, times had changed.

The trick of moving from CDs to DVDs, and from DVDs to Blu-Ray, has turned out to be a cul-de-sac in the journey of access to material. I saw this in 2010, and released GET LAMP with a gold coin because I knew it was going to be difficult to get people to buy physical media. By 2010, people were asking “Why can’t I download this?” And by 2011, people were asking “Why do I have to download this? Can’t I just stream it? Everywhere?” This world changed very fast.

Yes, there are still people who prefer the physical media. They want a nice package, a sense of an experience when they get the show in their mail. They are a shrinking group, and while they should be catered to, they are out of the realm of the majority of people. Some even think they’re part of this group and they seriously are not. Not really.

It’s pretty obvious where the world is heading, and so this graphical treat by Hornet (who designed the DVD and software to do amazing captures that are still used by the Demoscene) is the bright brilliant sunset of a spectacular triptych of works.

The model these guys should have gone with is Patreon (make top-quality exports and contextual interviews about demos, and release a set for money each month), but Patreon didn’t exist until recently, so here we are. A missed boat.

MindCandy 1 and MindCandy 2 sold out of their DVD media years ago. In response, MindCandy has released both of these products as Creative Commons-licensed downloads. You can grab them both from the site.

MC3_Cover1Complete_1280_x_709

And now the last volume has been given a viking funeral, with the remaining stock being dropped to $12. I’m sure they’re taking a bath on this. The announcement said they made 2,500 copies and this was the last 700. Since this came out three years ago (2011), that’s slow sales and I’m sure this was a huge expense.

So, my learned advice to you is this: buy this artifact, this excellent work and package as it rounds out a short but sweet arc of physical media meant to be the next generation.

Oh, and it’s top-notch.

 


Rise of the Screen Shotgun —

Continuing the thoughts I had in the previous entry, I’ve been working on a side-project to improve access to all the software and console games I’ve been uploading to the archive.

To some percentage of the populace, it is a simple and obvious thing, but that’s how a lot of efficient breakthroughs happen – doing simple and obvious things.

Forever. To everything.

Expanding out the console living room collection on archive.org to 2,300 cartridges representing 21 console systems had a not-entirely-unexpected side effect: it blew up the volunteer metadata gang. When things were in the hundreds with these cartridges, a small handful of folks could add some descriptions and other metadata pairs to the entries and reasonably get through them all. Not so anymore.

I’ve got some brave individuals moving through the sets, and they’re heroes to me for doing so. But the pain of a couple thousand cartridges will be nothing compared to the inevitable hundreds of thousands of individual disks I’m going to end up ingesting. There are multiple thousands of disks in my room as we speak, not to mention all the other collections I’m working to make playable. It doesn’t scale. It can’t. But I’ve got a step in the right direction.

Mused about by Andrew Perti in the metadata-entry IRC channel we hang out in, and implemented by Kyle Way and myself, is a system I’m calling the Screenshotter or the Screen Shotgun, depending on your mood and taste.

The goal with it is to automate, 100%, the creation of screenshots and informative image grabs of these many, many software projects with essentially zero or minimal human intervention. I’ve been running it for about a week.

It is working very, very well.

00_coverscreenshot (4) 00_coverscreenshot 00_coverscreenshot (1) 00_coverscreenshot (2)

Again, these are generated automatically – other than writing the robot that’s doing this, I didn’t make them happen and I certainly didn’t sit there doing screengrabs and turning them into usable, clickable screenshots. And I definitely didn’t shove them into Internet Archive entries for the software items – that was done by a script.

It gets better, too. Click on this image:

Alex_Kidd_in_the_Enchanted_Castle_1988_Sega_JP_en_screenshot

It’s an animated GIF file, and it goes on for a while. It’ll show the title screen, some credits, and a little bit of gameplay (in this case, attract mode gameplay). It’ll be possible to say “re-generate this, but press this key at the end”, so I can finesse some of these (hence it being “minimal” human interaction – I can nudge if I look at dozens of screenshotted programs and realize a software item fell down).

Having these screenshots often verifies a ton of properties: who made it, what kind of program it is (miscategorized?), what the selections are (if it shows a menu), and what one might expect if it’s run in JSMESS – since it is being run in JSMESS.

It’s a simple enough process. Steal it from me and refine it.

screenshot_03

A machine that will be living the doomed life of eternally playing software packages is installed with:

  • An X server (actually Xvfb, a virtual X server used for testing, which has a fraction of the footprint).
  • A copy of Firefox.
  • ImageMagick, that eternal, ubiquitous bastard of image manipulation.
  • Fdupes.
  • The Internet Archive Command-Line Interface.

Some of that you likely know about, others you might not. They’re all readily available, however, and not that secret.

On a very high level, here's what my script does (a rough code sketch follows the list).

  • Assume the X server is running.
  • Start Firefox, running the JSMESS emulator/player for a piece of software.
  • Wait roughly 50 seconds (about how long it takes the JSMESS “machine” to boot on my shared, weighed down server).
  • Take 40 screenshots, cropped to JUST the JSMESS player window, with a 4 second delay between each.
  • Run FDUPES against the 40 resulting screenshots, to get rid of shots of static or unchanging images. Sometimes this pulls it down to a single shot, and other times all 40 stay.
  • Upload the resulting full-size unique screenshots into the Internet Archive item, making one the “representative” based on (ahem) being the largest. (This actually works pretty well – often the largest is the one with the most variety, hence often the title screen or gameplay).
  • Compress the screenshots into an animated GIF and upload that.
  • Get rid of all the evidence, and kill Firefox.
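
For the shell-averse, here is that loop compressed into a rough Python sketch. The real thing is a pile of shell script; the command flags, timings and paths below are approximations, not the actual implementation:

  # Illustrative, compressed version of the Screen Shotgun loop described above.
  import glob, os, subprocess, time

  DISPLAY = ":99"            # assumes Xvfb is already running on this display
  BOOT_WAIT = 50             # seconds for the JSMESS "machine" to boot
  SHOTS, SHOT_DELAY = 40, 4

  def shotgun(emulator_url, crop_geometry, out_prefix):
      env = {**os.environ, "DISPLAY": DISPLAY}
      browser = subprocess.Popen(["firefox", emulator_url], env=env)
      try:
          time.sleep(BOOT_WAIT)
          for i in range(SHOTS):
              # "import" is ImageMagick's X11 capture tool; crop to the player window.
              subprocess.run(["import", "-window", "root", "-crop", crop_geometry,
                              f"{out_prefix}_{i:02d}.png"], env=env, check=True)
              time.sleep(SHOT_DELAY)
          # Drop identical frames (static or unchanging screens) before uploading.
          subprocess.run(["fdupes", "-dN", "."], check=True)
          # Stitch the survivors into an animated GIF.
          subprocess.run(["convert", "-delay", "100"]
                         + sorted(glob.glob(f"{out_prefix}_*.png"))
                         + [f"{out_prefix}.gif"], check=True)
      finally:
          browser.kill()     # get rid of the evidence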

Obviously, things can go wrong, but my favorite failures are the ones where the resulting screenshot shows an error from the software itself:

Alien_Soldier_1995_Sega_EU_en_screenshot

In this way, the software is basically doing double duty as a free-labor Quality Assurance department, making it instantly obvious as I walk the collections that Something Was A Little Off. Often it’s a simple matter of making a slight change to the emulator parameters or putting the software into a quiet corner because it just doesn’t work. That’s less frustration for future users, and one less worry about whether the collection is accessible.

screenshot_02 (1)

Notably, this destroys the Screenshot Economy.

There’s a certain amount of work that exists to generate metadata or representative imagery related to, well, anything. But in this case, the work needed is to generate representative screenshots, attach them to the right name, and keep track of them. It’s both simple and boring yet annoying enough that if someone “stole” your screenshots, you might get really angry at them. You might even watermark your screenshots, since you slaved away putting all of them together. You might hound people you saw as lifting the screenshots away, even though they’re using them for purely informative purposes and you yourself didn’t really generate any of the art inside.

Whatever, to that debate. I’m about to put either tens of thousands or hundreds of thousands of screenshots up. And you can use them any darn way you please. The Screenshotgun will not get angry with you. And I just follow the Screenshotgun’s lead.

screenshot_30

There are solid debate chunks available for the Gift Horse Dental Inspection Squad, of course. These are screenshots off an emulator, for example – if you want “true authentic” screenshots as one might take off a monitor, then these are not those. Some folks might not be happy that the Screenshot Economy has been taken down with a flood of “inferior” images, and the old adage of Perfect is the Enemy of Done can rear its head. Obviously, the emulator can be dinged in terms of color reproduction, aspect ratio, brightness/sharpness, and how it renders the graphics/text related to the original hardware. No doubt some of that error-ridden mess will creep into the crop.

I am not worried.

Over time, this process will refine itself. Right now, I have to do some initial setup related to cropping the emulator window. Different machines yield slightly different Firefox rendering, and with that comes the window showing up at different coordinates. Once that setup is done for a given machine, however, the screenshot robot can just be run, again and again, and most importantly, it can run in a way that overwrites a previous screenshotting, replacing old bad with new good. I suspect that as time goes on, revisions or improvements to the simple scripting I have in place will handle contingencies of keypresses, performance, detecting bad frames, and so on.

Until then… many, many software items on archive.org’s Console Living Room and ultimately the entire software collection are going to get some excellent illustration, essentially out of thin air. And it won’t murder a volunteer or pile of volunteers to do it. Give me a week or two to get all the consoles properly filled with these screenshots, and then enjoy the beginning of a world where you can see what’s coming if you click the emulation link.

If you want a picture of the future, imagine a rack-mounted server joylessly playing videogames and taking screenshots.

Forever.

00_coverscreenshot (5)


The Robot Army of Good Enough —

(I do not speak for my employer. I am just very loud.)

Pretty much any organization of any size has certain themes, beliefs and outlooks baked into them. Some of them might be obvious from the outside. Others are so inherent that the members might not even notice they’re completely steeped in it.

IMG_4934

At the Internet Archive, there’s a set of philosophies about access, acceptance of materials, and presentation of said materials that’s pretty inherent throughout the engineering and the website. Paraphrased, in my own words, it’s this:

  • Always provide the original.
  • Never ask why a user wants something.
  • Now is better than tomorrow.
  • We can hold it.
  • Many inexpensively is better than one or none luxuriously.
  • Never send a person where a machine can go.
  • Enjoy yourself.

Some of this exhibits itself in how people use the site – they can grab anything, they can get a “library card” account but they don’t have to, and they can embed or direct-download anything they want. While the machines will derive out versions of the content, you can always find that massive .AVI, .PDF or .WAV that the content came from. They don’t keep user logs to any real degree. They don’t get in the way.

Internally, the rest shows itself in engineering and code – use commodity hardware that will break more often but can be bought in much greater amounts, instead of an “Ol’ Trusty” that’s intended to work for five years without fail and is all we have because we can’t afford more. The code will put an item up before it’s fully “baked”: you’ll see the original .AVI file for a video item, and maybe 20-60 minutes later another derivation will show up, and then maybe another one after that. This sliding window of material population really confuses the end users in some cases, I’m sure. But it means you get it now, now, now, instead of when it’s all wrapped in a bow.

As things currently stand, and based on my now three years (!) of working for the organization and going out into the world to speak about the place and get feedback, the resulting good and bad of this approach is this:

  • Good: Nobody is doing what we’re doing in many cases, we have so much stuff, every time I wander there I lose an evening walking the stacks.
  • Bad: The site looks like poop, and it’s pretty hard to find the stuff.

So, to get out ahead of “poop look”, efforts are underway to redesign the site, and what I’ve seen, I really like. That’s all I’ll say because it’s not my project.

Regarding the amount of stuff and the difficulty of finding material in it, I think the priorities of the Archive have been really firm and right-minded: get the stuff first, quibble about accessibility or presentation later. Turning things away is how tragedy happens. What’s worse – that something was taken in and put into a big storehouse? Or that something was offered, and because it failed to have a MARC record or a metadata post-it note on the outside of the archival-quality container file, it was sent back out into the night?

But the real miracle, the one that is perhaps really not obvious from the outside, is how much of the Internet Archive’s work is done by machinery and code.

When an item is uploaded, the user can designate and mention all sorts of aspects of what was sent in – the title, the description, when it was made, who made it, and a bunch of other interesting data attributes. The format allows a lot of extension, so if you want to indicate which of the 300 audio files you uploaded have dog sounds and which ones are recorded using a specific type of microphone, you can do that. It might not mesh with other items all that well, but that’s not your problem – you’re adding things that a machine might not ever know.

But a set of machines at the Archive do know a lot about your item, and will do work to add it all and create other versions of your item. For example, you can upload a .zip file of .jpeg images, and if you happen to name it *_images.zip, it will create a .PDF file of it, an OCR’d version of any text in it, and an animated GIF file of the pages. With movies, it will take a massive .AVI and it will create a thumbnail set, a web-ready version (if it can), and so on. And bear in mind, this collection of tests is massive – it tries to determine the average pixels per inch, the orientation of texts, the framerate of the video, the number of tracks in a collection of MP3s and if there’s any tagging built in. It does a lot. And most importantly, with zero human intervention.
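
That upload-with-whatever-metadata-you-like behavior is exposed through the Archive's command-line and Python tools. A small sketch with the internetarchive library – the identifier and the extra metadata keys are invented for illustration, while the *_images.zip naming convention is the trigger described above:

  # Hand the derive machinery a zip of page images plus free-form metadata.
  # Identifier and custom keys below are made up; the filename convention is the
  # one mentioned in the text.
  from internetarchive import upload

  upload(
      "example-scanned-magazine",
      files=["example-scanned-magazine_images.zip"],  # *_images.zip -> PDF, OCR, GIF derives
      metadata={
          "title": "Example Scanned Magazine, Issue 1",
          "mediatype": "texts",
          "date": "1987",
          "scanning-notes": "an arbitrary extra key, as the format allows",
      },
  )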

And here’s where the “controversy” happens.

By “controversy”, of course, I mean “people murmuring under their breath in the area of disciplines the Archive overlaps with”. Other organizations and practitioners of the arts of archiving, you see, have their own baked-in philosophies and credos, spoken and unspoken. And they don’t exactly see eye to eye.

Some I’ve encountered and observed:

  • Machines can’t beat people.
  • Zero metadata beats inaccurate metadata.
  • Digital is a Cult. Physical is a Truth.
  • Another six months won’t hurt.
  • Who are you and why do you want this?
  • Pick a format, document it utterly, and use it forever.
  • Justify, Justify, Justify

There’s many more. Some come from policy, some from personality, and some from how people are brought up into the discipline. We’ve destroyed the term “disruptive” as being meaningful in discussions, but the concept that a new outlook or idea could fundamentally change the nature of the realm it is part of is still quite valid. To some extent, the Internet Archive is an upending of century-old approaches, while still loving and promoting the shared beliefs:

  • Our history depends on our artifacts and writings.
  • Education without context is flimsy and transient.
  • Reading is fundamental.
  • What happened before is important tomorrow.
  • Humanity is worth the trouble.

In my capacity in outreach, I find myself in a lot of conferences, restaurant tables, hallways and sidewalks talking to people who believe in these shared beliefs but don’t buy 100% into what the Archive is up to. They question whether a Robot Army is the way to do this inherently human activity, that of cataloging and classifying, of summarizing and representing.

The problem, ironically, is that people think of it as binary: all machine or all people.

Where we are now, the machine takes a rough stab and occasionally a refined stab at what comes in the front door. It will try to OCR the text, it’ll figure out the orientation or how many pages or what baked in records exist in the digital object, and it will report those. It would also appreciate your input as uploader, thank you very much, but it doesn’t stop dead waiting for you, either.

To this end, the resulting output, especially the machine-generated side, is not perfect. But most importantly, it can be overruled. Always. It can always be shoved aside as “that’s not perfect, this is perfect”, but the number of items getting that “perfect” treatment is always going to be a small percentage of the total. They just are.

IMG_2422_s

So, this week, I was working on a way to make the endless piles of texts on the Archive more accessible. The solution I cooked up was to take the OCR’d text generated for all “texts”-classified objects, throw it into a word frequency generator, remove the obvious stupid results, and put that up onto the Archive. That actually has worked out pretty well.
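
The generator itself is nothing exotic. A minimal version of the idea looks something like this – the stopword list, cutoffs, and two-word phrase handling are my own choices, not the production script's:

  # Minimal word/phrase frequency pass over OCR'd text.
  import re
  from collections import Counter

  STOPWORDS = {"the", "and", "for", "that", "with", "this", "are", "was",
               "from", "have", "has", "not", "you", "but", "can", "will"}

  def keywords(text, top=12):
      words = [w for w in re.findall(r"[a-z]+", text.lower())
               if len(w) > 2 and w not in STOPWORDS]
      singles = Counter(words)
      bigrams = Counter(zip(words, words[1:]))   # crude two-word phrases
      top_words = [w for w, _ in singles.most_common(top)]
      top_pairs = [" ".join(p) for p, _ in bigrams.most_common(top // 2)]
      return top_words + top_pairs

  # keywords(open("ocr_output.txt").read()) -> e.g. ["figure", "landscape", ...]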

It’s not perfect, of course. Never perfect. But here’s what it returned (and put up) in 10 seconds of analysis on a 945 (!) page book on architecture:

figure; landscape; design; standards; soil; concrete; architecture; water; surface; aggregate; landscape architecture; saver standards; asphalt concrete; tor landscape; standards tor; water table; water level; standards lor; lor landscape

The “standards lor” stuff doesn’t fly – it’s an error. But the vast, vast majority of it is what a person might reasonably need to know “what the hell is this book about”. You can make decisions in a very short time if this is the book you want to browse through. You have more information than you had before.

Similarly, you can probably guess what these books are about from the keywords:

software; ibm; computer; graphics; apple; color; disk; program; commodore; game; hard drive; hard disk; word processor; disk drive; megabyte hard; deluxe paint; sale price; retail price; public domain
 
moog; modular; output; arturia; modulation; input; filter; frequency; manual; sequencer; moog modular; modulation input; connection jack; key follow; low pass; keyboard follow; audio output; audio input; input connection; trigger input
 
 
iso; wedding; lovegrove; julie; bride; pictures; chapter; shoot; shot; picture; wedding day; wedding photography; light matters; healthy profits; business strategies; wedding photographers; opposite figs; finoncial mastery; exposure compensation

Again, perfect? No. But each of these was generated, automatically, and without a miserable intern or low-paid person doing a job that would probably never be funded in the first place. But those keywords tell you a lot, and they’re getting the job done, even if you have to keep an eye out for what exactly “finoncial mastery” is.

And frankly, nothing stops the addition of a second set of scripts for quality control, providing lists of all the generated tags and allowing a person to say “that one doesn’t look quite right” and have it taken away. The difference is, now it’ll be one person overseeing hundreds or thousands of items at once, using their brainpower so that in one weekday they will do more resulting work than a year of the most highly-trained, perfect and precious professional dedicated to metadata entry. And in the case of the issue of Compute! magazine, the Moog synthesizer manual, and the professional wedding photography book above, you’ll get what you need, now now now.
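
That second-pass idea is easy enough to sketch too: dump the machine-generated tags for an item, let a person strike the bad ones, and push the trimmed list back. A speculative sketch with the internetarchive library – the identifier handling and the use of the "subject" field are illustrative:

  # Speculative QC pass: review generated tags and remove the ones a human rejects.
  from internetarchive import get_item, modify_metadata

  def review_tags(identifier):
      item = get_item(identifier)
      tags = item.metadata.get("subject", [])
      if isinstance(tags, str):
          tags = [tags]
      keep = [t for t in tags if input(f"keep '{t}'? [Y/n] ").lower() != "n"]
      if keep != tags:
          modify_metadata(identifier, {"subject": keep})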

blopps

And as a side note: I love that this is what my mind is being used for. I love that I work for a place where this sort of thinking is what is needed. And I love what the result of this effort is – a place where millions of items are flying out the front door every single day, spirited away for a thousand reasons, and making the world a better place.

I can’t imagine doing anything else. Keyword: “happiness”.