ASCII by Jason Scott

Jason Scott's Weblog

Bump Not Sage: Saving 4Chan —

Probably the best part of following a logical-conclusion path is when people supporting you with pumping fists, hoots, and hollers start to pump their fists a bit less and do a lot less hooting.

So let me inform you all and the world that, after many months of work and negotiation, I have acquired 10 million expired threads from 4chan’s history. Roughly half a decade’s worth.

Why? Because it’s part of online history, a study of the human soul when untethered by identity, a way to confirm statements made years ago… any range of reasons which I could not hope to compose out of the air for you. That’s not my job. My job is to save things. And now I’ve saved this.

It’s going on archive.org over the next week. I’ll let you know when it’s done. It’s dozens of gigabytes, and I have it in XML, HTML and MySQL formats, all of which show different parts of the data. (Conversion strips out some data that original formats might not have, and so on.)

An awful lot of history that we have at our fingertips is because someone, somewhere, hit “save” instead of “delete”. Someone did that in this case, and so here we go.

Plan accordingly.

Update: This has been cancelled (postponed, really, for a few years). Please read this weblog entry.


Categorised as: computer history | jason his own self


16 Comments

  1. Jamie dubs says:

    Amazing! Is your archive current? I’ve actually been working on a 4chan archive/search engine and would love to swap notes or contribute data

  2. chronomex says:

    That’s really surprising; I thought I heard most expired 4chan threads were dropped on the floor. (Notably excepting /r9k/…) Does this include images, or is it just text?

    I would presume that images are what I heard as being deleted on expiry, so I suspect that it’s text-only.

  3. Alex Leavitt says:

    Do you know the date range?

  4. Michael Kohne says:

    At dozens of gigs, I suspect there’s at least some images. Now, what interesting sociological research can we do with 10 million 4chan threads?

    Nicely done, Jason. Did you come up with a way to get the expired stuff on an ongoing basis? I really suspect there’s at least a few papers for the psych majors in that data.

  5. durr says:

    does this include cp threads?

  6. Toshiaki says:

    The images are not there.

    “interesting sociological research”? Clearly, you have never visited 4chan.

    This archive is only useful if you want to know the first utterance of ‘fgsfds’.

  7. Anonymous says:

    MOAR CP Plz

  8. Anonymous says:

    nah just kidding…AMAZING project though…can’t wait to see how it turns out! 4chan IMHO is the epitome of what the internet is, can be, will be, and has been. It’s one of the only things on the internet that can ONLY be on the internet. It really is a piece of history, try to get the national archives to store it for you!

  9. Kevin says:

    Contents aside, I’m curious to see volume changes over time. How might they map against other events?

  10. “So let me inform you all and the world that, after many months of work and ne…”…

    So let me inform you all and the world that, after many months of work and negotiation, I have acquired 10 million expired threads from 4chan’s history. Roughly half a decade’s worth.[...]It’s going on archive.org over the next week….

  11. [...] Years of 4chan Five years of 4chan is being added to Archive.org. Crazy. [...]

  12. Torley says:

    Heard about this via Waxy too! What a chunk of Internet.

    It would be wonderful to find accessibly exciting ways to browse these archives and the evolutions of memes (and other items of confusion-causing cultural significance).

    And generate other stats, such as a Google Trends-style graph showing how many times the words “FAIL” and “WIN” were used over time.

  13. Anonymous says:

    This is really sad. Please don’t do it. Does nobody see the value in transience anymore?

  14. David says:

    Why would anyone care if images are not included?

  15. [...] Jason’s acquired about 5 years worth of expired threads from the internet’s House of Awful Shit and Meme Factory, 4chan. Yes, this is culturally important. Just trust me. [...]

  16. Anonymous says:

    @Anonymous:
    Why should only the people who were alive and active at a given moment be allowed to enjoy said moment? There’s something to be said for fleeting events, but there’s a lot more to be said for preserving a fairly massive piece of history.

    @David:
    Well, they’re part of the history too.