ASCII by Jason Scott

Jason Scott's Weblog

A Torrent of Attention

It has been an interesting few days.

Creating the archive of Geocities content from the Archive Team’s collection took my machine roughly 10 days to compress. The resultant collection of .7z files is 642 gigabytes, expanding out to 909 gigabytes. Then I began creating the actual .torrent file, which is merely a collection of pointers to the files that trackers and clients use. This took 13 hours, and I had to do it twice: it turns out the default “piece size” is 256k, which sent the machine up into the 2 million plus “pieces”, and a LOT of clients do not like getting two million entries in anything. Rejiggering to 16MB “pieces” did the trick. But it still took another 13 hours.
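For the curious, the piece arithmetic can be sketched roughly like this. It assumes “642 gigabytes” means 642 × 10^9 bytes and that the two piece sizes are the binary 256 KiB and 16 MiB; the function names are made up for illustration. The one part that is not an assumption is that a .torrent carries one 20-byte SHA-1 hash per piece.

```python
# Rough sketch of the piece-count arithmetic; sizes below are assumptions.
ARCHIVE_BYTES = 642 * 10**9   # assumption: "642 gigabytes" read as decimal GB
SHA1_BYTES = 20               # a .torrent stores one 20-byte SHA-1 hash per piece


def piece_count(total_bytes, piece_bytes):
    """Pieces needed to cover total_bytes (the last piece may be short)."""
    return (total_bytes + piece_bytes - 1) // piece_bytes  # integer ceiling


def hash_table_bytes(total_bytes, piece_bytes):
    """Size of the concatenated piece hashes inside the .torrent file."""
    return piece_count(total_bytes, piece_bytes) * SHA1_BYTES


if __name__ == "__main__":
    for label, size in (("256 KiB", 256 * 1024), ("16 MiB", 16 * 1024 * 1024)):
        pieces = piece_count(ARCHIVE_BYTES, size)
        print(label, pieces, "pieces,", hash_table_bytes(ARCHIVE_BYTES, size), "bytes of hashes")
```

Under those assumptions the 256 KiB run lands between two and three million pieces, which matches the “2 million plus” above, and the 16 MiB run comes in under forty thousand.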

A few of us in the Archive Team IRC channel did some testing, and we’re off on a roll. The swarm has been in the hundreds range since.

I’ve been sending out e-mails about the torrent existing over the past week to the over 800 people who requested to be notified. This slow rollout isn’t because I think the torrent can’t handle it – it’s just that Gmail does not make it easy to run little scripts against collections of mail to extract a mailing list. So there’s a little copy-paste action going on and I am not going to do that full time. A few hundred of those folks have gotten notified and I’ll probably be done with the full list shortly.
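If the requests were sitting in a plain mbox file instead of Gmail, the “little script” might look something like the sketch below. This is not what was actually done here (that was copy-paste); the file name and function name are hypothetical, and it assumes each request’s sender is the address to notify.

```python
import mailbox
from email.utils import getaddresses


def sender_addresses(mbox_path):
    """Return the sorted, de-duplicated sender addresses found in an mbox file."""
    seen = set()
    for message in mailbox.mbox(mbox_path):
        # get_all returns every From header; str() guards against Header objects
        from_headers = [str(h) for h in message.get_all("From", [])]
        for _name, address in getaddresses(from_headers):
            if address:
                seen.add(address.lower())
    return sorted(seen)


if __name__ == "__main__":
    # "requests.mbox" is a hypothetical export of the notification requests
    for address in sender_addresses("requests.mbox"):
        print(address)
```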

And then came the press.

So, I’m going to punch the press in the whizzer for a paragraph or two.

The whole point of this exercise was to draw attention to the issues and cause that Archive Team is involved in: preserving digital heritage and lambasting entities/companies that treat user-generated content like so much trash. I think the issue transcends anything I’m mucking around with and represents a real and vital issue as more of life moves online. By boiling things into “Geocities as a Torrent”, attention was sought, attention was got. But along the way, I’ve gotten another taste of contemporary news-gathering, and the stratification of quality is getting ridiculous.

On the one hand, I’ve got reporters like Ken Gagne of Computerworld and Lauren Schenkman of Science Magazine who have contacted me, spoken to me on the phone, and then gone off and gotten related individuals on the phone or e-mail to discuss the issues. They’re doing this with pretty fast turnaround. And I guarantee I’m probably a tad spoiled by reporters like Stacy Schiff, who spoke to me for hours to get background on her excellent Wikipedia article, or Kim Zetter, who shows that you can write an informative article without being fawning.

And then come the slightly-slapdash ones, who write articles using my one weblog post as their source, but then go off to find some additional illustration. Not really great, but then again these are newish organizations not really interested in a whole lot of standards when it comes to telling the stories. Pleasant surprise should occur when they get things right. (For example, a lot of places wrote that the torrent is 900gb and will expand out to terabytes, something nowhere in anything I’ve written.)

One that made me go off the rails was this article in PC Magazine, written by Sara Yin, which named an employer I had quit 10 years ago, spelled the name of that employer wrong, ignored the original weblog post about this, and never involved contacting me once. So I made a little noise about it, got a few buttercups up in arms that I’d be so mean, and ultimately got a few additional insights into perceptions of my personality.

Oh, sure, PC Magazine made a correction, but not before it got syndicated to hell, with the wrong information baked in. And the corrections do not follow. It was especially galling as PC Magazine was an entity that I was reading like a bible in my teens, even submitting software for their new PC Disk Magazine subsidiary because I thought it was such a point of pride to be in its pages. Well, obviously not anymore – now they have crap farmers using the first three Google links to write inaccurate stories and still calling themselves “reporters” in a land with people like Schiff and Schenkman. For shame.

Anyway.

There have been some amusing podcasts mentioning the situation, for example Infosec Daily has the story at the end and Dan Misener did a recorded interview with me that was so much fun and got the message across so clearly that it’s actually included in the torrent. Even This Week In Tech mentioned the event, comparing it to zombies and yelling “BRAAIIIIINS” and hey, whatever works for you.

Right now, there’s only one seed machine, but I am duping the archive over to a portable drive, and a number of individuals and organizations are mailing me hard drives to get copies to seed as well. So for anyone seeing that the top seeds are “merely” at 8 percent or some lower number, that torrent is about to speed up dramatically.

I’m glad the word got out about this. Even if people choose not to download the data (and come on, this is a hell of a lot of data), they remembered Geocities one last time, and remembered what Yahoo did. Maybe that’ll change something down the road.

So there we go. One last thing – another Geocities archiving project, Reocities, was done by Jacques Mattheij, who is such an awesome dude and so perfect as a counterpart to what Archive Team is doing, I hereby call out some tech conference to bring us both in for a panel. We will fucking kill the room, I guarantee it! Kick out some lame “how to distribute your blah” speech and give us 90 minutes. Trust me. Get on that.

Oh, and PS: I put all of my Geocities archive on this:

Was it really that hard to keep around, Yahoo?


Categorised as: computer history


14 Comments

  1. Hex says:

    “I put all of my Geocities archive on this… Was it really that hard to keep around, Yahoo?”

    That makes me a sad panda. Although it would have blown the me-from-1996 (with my first ever website, on geopages.com)’s mind.

  2. sg_ says:

    “I put all of my Geocities archive on this… Was it really that hard to keep around, Yahoo?”
    I don’t know, maybe if you include the data that you didn’t pull, Yahoo would need /two/ of those drives.

  3. Jobermallow says:

    Hey Jason, just wanted to say that the attention is well deserved, even if certain folks are bad at fact-checking. While I feel like I’m hopping on the bandwagon, I too wrote a post on the Geocities Archive on my blog, Peasant Muse, inspired largely by efforts of Archive Team, you, and others like you. Keep up the good work! More attention is needed to highlight the presence and disappearance of these digital archives.

  4. l.m.orchard says:

    Problem is, all of that data was probably hosted on a horde of old 1990’s era 5.25″ Quantum Bigfoot drives that were screaming away in a sagging, dusty server rack in a forgotten data center whose last ops guy had been laid off years ago by Yahoo! and had the machine inventory and ssh keys on the laptop they wiped on his last day.

    But, I’m sure the real situation had to have been *slightly* better than that 🙂

  5. ap says:

    jeez, that scd guy is quite the white knight. Also, trying to follow twitter conversations is a headache.

  6. Decius says:

    Are we in questionable legal territory regarding the use of information in the archive? Academic work and archives fall under the fair use doctrine, I think, but would the extreme example of selling 1 TB drives filled with Geocities for profit be acceptable?

    I suspect that Geocities TOS basically said “All your stuff belongs to us.” Yahoo! would have acquired all of that, and now has publicly disclaimed their ownership. I’m not sure if that quite qualifies as putting it into public domain.

    Any other armchair lawyers have not-legal-advice opinions on the matter?

  7. Alkivar says:

    Decius… with regard to selling a drive with the content on it… Jason has no profit motive… The cost involved would likely be merely the expense of the drive and shipping… so I can’t see someone successfully arguing in court that this is a for-profit plan.

  8. Chris says:

    “Are we in questionable legal territory regarding the use of information in the archive?”

    I hereby waive all of my rights to the content contained in the three or four crappy Geocities webpages that I created fifteen years ago. Enjoy.

  9. Paul Wong says:

    Hey Chris,

    First off, for a guy who lives the internet, your e-mail is very difficult to find.

    Anyway, I work with NPR and would like to know if you’re interested in an interview with one of our radio hosts. We’re digging the idea of the necessity of human creativity and its place in history and commerce and think that you’d be a neat interlocutor for this kind of conversation.

    And yeah, we’re definitely going to ask about the Geocities project. It is pretty darn cool.

    Thank you in advance for your consideration. I hope to hear from you soon.

    Sincerely,
    -Paul

  10. i tried to download certain parts of the torrent, but the tracker was down and when i tried
    to download 600mb of the files, …. and let it run for 5 days, it said i had 3years to wait.

    so i’m not sure if i have the old torrent or this other torrent you are talking about.
    i didnt get an email so i assume that the old one was shitcanned and the tpb version is the one to get.

    really have doubts about a torrent this huge surviving in the wild, but good luck.

    what i would suggest next time is getting a few cheap seedboxes, uploading them via ftp or private torrent and starting it off on the right foot.

  11. Richard Wheston says:

    900 GB of storage space isn’t too expensive.

    Getting your arse sued off by every media company in existence is *quite* expensive.

    Indeed, how many times do you think the “flaming flying skull” animated GIF appears in that archive?

  12. Ymgve says:

    Hi, there’s a problem with the torrent. (Well, more of a Windows problem, but..)

    There are (at least) two files that only differ in case:
    LOWERCASE/geocities-x-M.7z.001
    LOWERCASE/geocities-x-m.7z.001

    At least on uTorrent 2.04, this leads to problems since Windows thinks those are the same file, which leads to the one overwriting the other, which then in return destroys the info for the chunk where the other file resides.
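The kind of collision described in this comment can be spotted before a download starts with a small check like the one below. It is only a sketch: the function name is made up, the file list would have to come from whatever tool reads the .torrent, and `str.casefold()` is an approximation of how Windows compares names rather than an exact match for it.

```python
from collections import defaultdict


def case_collisions(paths):
    """Group paths that differ only by letter case.

    Returns a list of groups; each group holds two or more paths that a
    case-insensitive filesystem would treat as the same file.
    """
    groups = defaultdict(list)
    for path in paths:
        # casefold() is an approximation of case-insensitive comparison
        groups[path.casefold()].append(path)
    return [sorted(names) for names in groups.values() if len(names) > 1]


if __name__ == "__main__":
    # The two names come from the comment above; the third is a made-up control.
    listing = [
        "LOWERCASE/geocities-x-M.7z.001",
        "LOWERCASE/geocities-x-m.7z.001",
        "LOWERCASE/example-unaffected.7z.001",
    ]
    for group in case_collisions(listing):
        print(" <-> ".join(group))
```

Each reported group is a set of files that cannot coexist in one Windows directory, so all but one of them would need to be skipped or renamed on that platform.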

  13. Ian says:

    The same name, different case files have been driving me mad, kept getting errors when downloading.
    There are 142 filenames that are affected.
    Set client to not download one of each alternate file for now.

  14. Ian says:

    The Windows-unfriendly filenames seem irrelevant as the torrent hasn’t had a seed for 2 weeks (currently stuck on 44.62%). Why make a big song and dance about it if the torrent isn’t going to be seeded properly?