ASCII by Jason Scott

Jason Scott's Weblog

Geocities: Why Hello, Everybody —

Well, hasn’t it been a fascinating few days indeed.

Remember that, late last week, this project wasn’t even on the horizon. I mean, Archive Team was, and I had mechanisms and structure in place to call out to people, but there was no indication that Geocities was experiencing any problems or future changes. So this hit me at the same time it hit a lot of people who found it out – randomly, and spontaneously. The moment I heard was the moment I vowed to copy off as much as I could and store it away, for historical reasons. I did the same thing when FileFront announced closure; in that case, I brought some people to the table but the team that ultimately did most of the coordination and saving didn’t need my attention, and then the old owners came back and repurchased it anyway, obviating the problem. (The people I brought in kept a massive backup of a lot of FileFront from that; we still have it.) This FileFront FireDrill did wake the beast, so to speak. When Geocities started officially burning oil, we were already prepared to respond quickly. And we did.

As usual, I wrote inflammatory descriptions of our intentions. And, as occasionally is the case, this got passed around a bit. And then Slashdot got a hold of it and put a story out.

Now, Slashdot in 2009 is a hell of a lot different than the Slashdot of 1999; I’ve written about this trend before. Where once a Slashdotting packed a devastating wallop of terrifying magnitude, now it’s more in the area of a florist’s shop on the Thursday before Valentine’s Day; bustling, busy, but nobody’s breaking down doors or climbing over the counter. Additionally, they linked to my Dreamhost-hosted weblog and website, and while the Archive Team wiki’s got some lingering programming issues I need to fix, the weblog itself (this thing you’re reading) has this wonderful program called SuperCache that should really be a requirement if you run a WordPress weblog – it makes it so your site does a fraction of the computational effort in the event of a Slashdotting or Digg or Reddit. So the waters rose a little outside my little weblog, and the commentary increased to 50 or more comments after the story instead of my usual 2-3, and then it faded.

As a side note, the usual intellectual pineapples of Slashdot are still in force – by far, my favorite one is where someone indicates I am part of the “Cult of Save Everything” and it is a sin for me to do this, at which point someone goes “what does it cost you for him to do this”, to which he responds “Eventually, someone will demand government money to save this junk! And then it will cost me, the taxpayer.” That, in itself, is classic Slashdot Awesome. 

That said, there were some excellent comments as well, as always, among the muck; like the person who pointed out that the Rosetta Stone is simply an announcement of a tax reduction, but the fact it’s in three languages provided a vital link to understanding previously unknown writing.

Like the fading rock star who can’t hit the high notes anymore, or the actor who hasn’t been in a blockbuster in decades, Slashdot may be down but can certainly get the attention of the influential. Or at least, the journalists. So within a short time the story of Archive Team’s Geocities Project ended up on Computer Buyers News, BoingBoing Gadgets, The Register (far and away my favorite), and even this classic rip-from-random-sources article in Web Host Industry Review. I wasn’t contacted about any of these.

Some other entities contacted me for interviews or statements, and I obliged. The NPR show Future Tense did a phone interview and so if you want to hear me say what I’ve been saying in my own voice, here you go. The whole show is a whopping five minutes, so it wouldn’t take you that long, should it interest you.

Anyway, so hooray, we got some attention, and more importantly to me, the idea and debate of “what to do when a major repository of community creation is destroyed for pure business reasons and with little or poor warning” got to the front of a few weblogs, Twitter feeds and e-mails. This is kind of the most important aspect for me, you see. I live about 3500 feet from the Charles River, which used to have textile mill waste poured directly into it as a disposal method – the river would actually turn colors based on what dye was in use; kids would swim in that crap. I have a 1988 letter from an administrator at Digital Equipment Corporation announcing, from a Digital e-mail address, that he will no longer allow the passage of Fidonet traffic through Digital’s network equipment because the inventor of Fidonet was gay. Through reaction, attention-getting, and insistence, qualities once considered okay become not so okay, and ideally logic wins out. Right now it is not shameful or bad business practice to dispose of data immediately when it suits you – the practice of holding data in escrow for retrieval for up to a year seems, to me at least, a step towards righting a wrong that many don’t consider a wrong at all – or don’t even think about. Trans Fats for the nerd set, I guess. So attention may or may not have that effect but I did something.

So speaking of doing something, the project is now in full bore. The process of grabbing material is not the difficult part – between a small handful of folks, the 200 gigabyte-a-day number could be maintained indefinitely, but we paused for a day or two while figuring out the best places to put it, the way to handle crazy filesystem issues, and all the logistical stuff you don’t initially think about when you run naked into the snow at midnight. I think we’ve got most of that handled at this point, and as we’re rsyncing between multiple “pools” of people with the capacity to hold all this incoming stuff, it’s just awesome to see the history literally flowing through the pipes. This is a lot of data, people; as I have indicated, this is an enormous cross-section of humanity, ranging from academics and historians through to music collectors, science fiction fans, conspiracy theorists and prideful craftsmen. I’ve only occasionally glanced at stuff when a funny or interesting directory name goes by, but I am rewarded heavily by what’s here. As the amount of data grows (we are somewhere in the terabyte range, I believe, but further optimizations must be done), I expect this to only get better and better.

Oh, and did I mention the backlash?

Well, first of all, the usual bubbling mass of people who work with or for Yahoo and affiliated companies are not pleased about the general tactic I’ve taken of calling Yahoo all sorts of bad names and generally insulting the company’s good name. I’m sure some portion of them will have extra time to rip into me about this characterization, considering Yahoo is about to lay a bunch of them off. If someone calls your company a barrel of bastard monkeys for doing what you can sort of origami yourself mentally into thinking is Good Business, you’re more than entitled to call me a heap of insulting names and indicate I am an ignorant foghorn interrupting your evening reading. Whatever makes the pillow feel like a vacation.

But many, many other folks came out of the woodwork, either to thank me and the project, ask further questions, or volunteer assistance. In fact, there’s almost too many people volunteering assistance. There’s only a limited amount of spectrum in the realm of grab everything off of Geocities, after all. So after the pile of people doing industrial-grade downloading and syncing, I’ve mostly been asking people to:

  • Help improve the Archive Team Wiki. 
  • Go find obscure sets of linked Geocities sites from things like mailing lists, usenet, forums, and so on.
  • Track down obscure Geocities history and articles, so we can add it to the collection.
  • Await further instructions.

Some people want to be The Hero and don’t want to wait out until the next phase of things happens – we’re going to end up with terabytes of this data and then we’re going to see where it goes, who we donate it to, how it might be stored, curated, and so on – so they’re somewhat antsy. I can’t do much about this, other than to point out that there’s so much to Geocities that you could probably soldier off on your own and not be completely redundant grabbing data. I’ve had people doing their own thing for 50 gigabytes and while 27 of it has overlapped, that means I got 23 more gigabytes of 150k Geocities HTML files and all the other attendant stuff in these sites. Millions of files we wouldn’t have gotten sooner.
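
For the curious, the overlap arithmetic above (50 gigabytes grabbed, 27 already held, 23 new) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not what anyone on the team actually runs: the manifest format (a path mapped to a size in bytes), the function name split_overlap, and the sample paths and sizes are all made up.

```python
def split_overlap(main_manifest, volunteer_manifest):
    """Split a volunteer's grab into bytes already held and bytes that are new.

    Both manifests are assumed to map a site-relative path to a size in bytes.
    """
    overlap_bytes = 0
    new_bytes = 0
    new_paths = []
    for path, size in volunteer_manifest.items():
        if path in main_manifest:
            overlap_bytes += size      # redundant with the main collection
        else:
            new_bytes += size          # something we did not have yet
            new_paths.append(path)
    return overlap_bytes, new_bytes, sorted(new_paths)


# Hypothetical toy manifests; paths and sizes are invented for the example.
main = {
    "Area51/1000/index.html": 150_000,
    "Area51/1000/bg.gif": 20_000,
}
volunteer = {
    "Area51/1000/index.html": 150_000,    # already in the main collection
    "Tokyo/2000/index.html": 120_000,     # new
    "Tokyo/2000/midi/theme.mid": 30_000,  # new
}

overlap, new, paths = split_overlap(main, volunteer)
print(overlap, new, paths)
```

Comparing by path alone assumes two people who grabbed the same path got the same file; comparing content hashes would be the safer variation, at the cost of reading every file.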

What matters is stuff is being grabbed – I found discussions from sites that are into some obscure or non-mainstream hobby, who have gone “Holy crap, all our best old stuff is on Geocities!” and they’ve launched into projects to mirror the stuff; that warms my heart. And I’ve watched people on the Archive Team site go off and work on pretty damn complicated solution sets to deal with archiving a site that has millions of files – once they finish that stuff, the next Geocities-level crisis will be ever smoother to handle. Oh, it’s good. It’s good beyond good.

Other than that, not much to tell you. Running statistics against a set of machines that holds something like 8-10 million files is tedious and not overly informative. When we get to a certain breakpoint, I’ll give you the statistics that some like to hear – how many individual sites saved, how many different kinds of files we got, and so on. Right now, I’m just collecting hard drives from generous donators, downloading like crazy, and coordinating what needs coordinating. It’s going great!

Oh, one other bit.

With the implosion of Geocities imminent, a couple companies are stepping in with the shark-toothed smile and the outreached arms to capture up all these potential refugees. A couple contacted me. Here’s Dreamhost (with historically interesting story attached), and Jimdo Lifeboat (with theoretically-better long-term free hosting offer). I will not vouch for either; I’m just passing along who mailed in. Always keep a local backup, and don’t trust anybody.

More on this fascinating saga as it unfolds.


Categorised as: computer history | housecleaning


9 Comments

  1. MaxMouse says:

    Well this answers my question. Suck it in, worry about it later.

  2. disambiguated says:

    Dude . . . just . . .

    Thank you.

  3. Pat says:

    A quick question, Jason. Have you made any attempts to talk to someone at Yahoo/Geocities about this? It would seem to me to be in their interest to ship you the Geocities sites en masse and not need to have their servers hammered by crawlers, some of which are grabbing redundant data. It would make your end easier too, with one data store to contend with and at least reduce your worries about not covering everything.

  4. I believe you are helping the cause immensely by being unapologetic about how “artfully horrific” GeoCities pages are. Too many people will be caught up in arguing over their usefulness now, and some will mistakenly try to encourage those people to appreciate the pages for their own sake. But you cut right to the bone: this is part of the historical record and that is reason alone to preserve it.

    Also, you use curse words often enough to attract attention without being a gratuitous potty-mouth.

  5. Jason Scott says:

    Fuck yeah!

    As for contacting Yahoo directly, other members of the team are attempting this – it’s probably not best for me to take the lead on that.

  6. Jordan Cole says:

    1) I’ve done a transcript of the Future Tense episode, because audio is harder to keep around than text (and less convenient): http://ratafia.info/post/101890695/transcript-of-rescuing-geocities

    2) When you say ‘Track down articles for the collection’, are you asking for a collected history like Wikipedia offers, or links to informative writing? And do you want it added to the GeoCities page on the Archive Team wiki?

  7. Nathan says:

    I was curious if you were going to make an effort to back up another bastard child of the Geocities/Yahoo merger, that of the Webring?

    Just curious,

    Nathan

    PS Of course, thank you for your work!

  8. Flack says:

    I’ve been thinking about this for a couple of days. It would take more programming skills than I have, but … I was thinking of some sort of client/server software package where you (the Archive Team) would run a server and people who have resources to donate (mainly, bandwidth and storage) would run the client side. Then you (the server) could run whatever discovery scripts you have, and it could assign that work out to the clients. That would distribute the load, speed things up and alleviate redundancy (until the point where all the data could be reassembled). It would also allow those of us with available bandwidth and storage to help!

  9. [...] shutdown the service. Thanks in huge part to Jason Scott and the Archive Team’s tireless work and campaigning, an enormous amount of Geocities has been preserved in the Internet Archive as well as [...]