{"id":13446,"date":"2014-11-26T11:02:04","date_gmt":"2014-11-26T11:02:04","guid":{"rendered":"http:\/\/mith.umd.edu\/?p=13446"},"modified":"2020-10-08T16:00:24","modified_gmt":"2020-10-08T20:00:24","slug":"miths-ed-summers-discusses-ferguson-twitter-archive","status":"publish","type":"post","link":"https:\/\/mith.umd.edu\/miths-ed-summers-discusses-ferguson-twitter-archive\/","title":{"rendered":"MITH\u2019s Ed Summers discusses his Ferguson Twitter archive"},"content":{"rendered":"<p><i>Cross-posted and edited from a blog entry on medium.com: <\/i><a href=\"https:\/\/medium.com\/on-archivy\/on-forgetting-e01a2b95272\"><i>On Forgetting and hydration<\/i><\/a><i>.<\/i><\/p>\n<p>After writing about the <a href=\"http:\/\/inkdroid.org\/journal\/2014\/08\/30\/a-ferguson-twitter-archive\/\">Ferguson Twitter<\/a> archive a few months ago, I received requests from three people both outside and within University of Maryland, for access to the data. My response to the external academic researchers was to point them to <a href=\"https:\/\/dev.twitter.com\/overview\/terms\/policy#6._Be_a_Good_Partner_to_Twitter\">Twitter\u2019s Terms of Service<\/a> which says:<\/p>\n<blockquote><p><i>If you provide Content to third parties, including downloadable datasets of Content or an API that returns Content, you will only distribute or allow download of Tweet IDs and\/or User IDs.<\/i><\/p>\n<p><i>You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a \u201csave as\u201d button) of up to 50,000 public Tweets and\/or User Objects per user of your Service, per day.<\/i><\/p>\n<p><i>Any Content provided to third parties via non-automated file download remains subject to this Policy.<\/i><\/p><\/blockquote>\n<p>It\u2019s my understanding that I can share the data with others at the University of Maryland, but I am not able to give it to the external parties. What I can do is give them the Tweet IDs. 
But there are 13,480,000 of them.<\/p>\n<p>So that\u2019s what I\u2019m doing today: publishing the tweet IDs using a CC-BY license. You can download them from the Internet Archive:<\/p>\n<blockquote><p><i><a href=\"https:\/\/archive.org\/details\/ferguson-tweet-ids\">https:\/\/archive.org\/details\/ferguson-tweet-ids<\/a><\/i><\/p><\/blockquote>\n<h3><b>Hydration<\/b><\/h3>\n<p>On the one hand, it seems unfair that this portion of the public record is unshareable in its most information-rich form. The barrier to entry to using the data seems set artificially high in order to protect Twitter\u2019s business interests. These messages were posted to the public Web, where I was able to collect them. Why are we prevented from re-publishing them since they are already on the Web? Why can\u2019t we have lots of copies to keep stuff safe? More on this in a moment.<\/p>\n<p>Twitter limits users to 180 API requests every 15 minutes. A user is effectively a unique access token. Each request can hydrate up to 100 Tweet IDs using the <a href=\"https:\/\/dev.twitter.com\/rest\/reference\/get\/statuses\/lookup\">statuses\/lookup<\/a> REST API call. So:<\/p>\n<pre style=\"font-size: 10pt;\">180 requests * 100 tweets = 18,000 tweets\/15 min\r\n                          = 72,000 tweets\/hour\r\n<\/pre>\n<p>At that rate, hydrating all 13,480,000 tweets will take about 7.8 days (13,480,000 tweets at 72,000 per hour is roughly 187 hours). This is a bit of a pain, but realistically it\u2019s not so bad. I\u2019m sure people doing research have plenty of work to do before running any kind of analysis on the full data set. And they can use a portion of it for testing as it is downloading. But how do you download it?<\/p>\n<p><a href=\"http:\/\/gnip.com\/\">Gnip<\/a>, who were recently acquired by Twitter, offer a rehydration API. Their API is limited to tweets from the last 30 days, and similar to Twitter\u2019s API you can fetch up to 100 tweets at a time. Unlike the Twitter API you can issue a request every second. 
So this means you could download the results in about 1.5 days. But these Ferguson tweets are more than 30 days old. And a Gnip account costs some indeterminate amount of money, starting at $500\u2026<\/p>\n<p>I suspect there are other hydration services out there. But I adapted <a href=\"http:\/\/github.com\/edsu\/twarc\">twarc<\/a>, the tool I used to collect the data, which already handled rate-limiting, to also do hydration. Once you have the tweet IDs in a file you just need to install twarc and run it. Here\u2019s how you would do that on an Ubuntu instance:<\/p>\n<pre style=\"font-size: 10pt;\"><code>\r\nsudo apt-get install python-pip\r\nsudo pip install twarc\r\ntwarc.py --hydrate ids.txt &gt; tweets.json\r\n<\/code>\r\n<\/pre>\n<h3><b>Archive Fever<\/b><\/h3>\n<p>Well, not really. You will have <i>most <\/i>of them. But you won\u2019t have the ones that have been deleted. If a user decided to remove a Tweet they made, or decided to remove their account entirely, you won\u2019t be able to get their Tweets back from Twitter using their API. I think it\u2019s interesting to consider Twitter\u2019s Terms of Service as what <a href=\"http:\/\/ischool.umd.edu\/faculty-staff\/katie-shilton\">Katie Shilton<\/a> would call a <a href=\"http:\/\/mith.umd.edu\/dialogues\/katie-shilton-finding-values-levers-building-ethics-into-emerging-technologies\/\">value lever<\/a>.<\/p>\n<p>The metadata-rich JSON data (which often includes geolocation and other behavioral data) wasn\u2019t exactly posted to the Web in the typical way. It was made available through a Web API designed to be used directly by automated agents, not people. A tweet appears on the Web, but it\u2019s in with the other half a trillion tweets out on the Web, all the way back to the <a href=\"https:\/\/twitter.com\/biz\/status\/21\">first one<\/a>. You have to ask for it individually with your Web browser. 
Its representation format is HTML, which doesn\u2019t lend itself to computer processing in the same way as the highly structured JSON.<\/p>\n<p>Requiring researchers to go back to the Twitter API to get this data and not allowing it to circulate freely in bulk means that users have an opportunity to remove their content. Sure, it has already been collected by other people, and it\u2019s pretty unlikely that the NSA are deleting their tweets. But if you squint right, Twitter is taking an ethical position for their publishers to be able to remove their data: to exercise their right to be forgotten, allowing them to remove a teensy bit of what Maciej Ceg\u0142owski calls <a href=\"http:\/\/idlewords.com\/bt14.htm\">informational toxic waste<\/a>.<\/p>\n<p>As any archivist will tell you, forgetting is an essential and unavoidable part of the archive. Forgetting is the <i>why <\/i>of an archive. Negotiating what is to be remembered and by whom is the principal concern of the archive. Ironically, it seems it\u2019s the people who deserve it the least, those in positions of power, who are often most able to exercise their right to be forgotten. Maybe putting a value lever back in the hands of the people isn\u2019t such a bad thing. If I were Twitter I\u2019d highlight this in the API documentation. I think we are still learning how the contours of the Web fit into the archive. I know I am.<\/p>\n<p><i>If you are interested in learning more about value levers you can download a pre-print of Shilton\u2019s <\/i><a href=\"http:\/\/mith.umd.edu\/wp-content\/uploads\/2014\/11\/ShiltonSTHVpreprint.pdf\"><i>Value Levers: Building Ethics into Design<\/i><\/a><i>.<\/i><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Cross-posted and edited from a blog entry on medium.com: On Forgetting and hydration. 
After writing about the Ferguson Twitter archive a few months ago, I [&hellip;]<\/p>\n","protected":false},"author":38,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[65,77],"tags":[26,190],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v15.0 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>MITH\u2019s Ed Summers discusses his Ferguson Twitter archive &ndash; Maryland Institute for Technology in the Humanities<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/mith.umd.edu\/miths-ed-summers-discusses-ferguson-twitter-archive\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"MITH\u2019s Ed Summers discusses his Ferguson Twitter archive &ndash; Maryland Institute for Technology in the Humanities\" \/>\n<meta property=\"og:description\" content=\"Cross-posted and edited from a blog entry on medium.com: On Forgetting and hydration. 
After writing about the Ferguson Twitter archive a few months ago, I [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/mith.umd.edu\/miths-ed-summers-discusses-ferguson-twitter-archive\/\" \/>\n<meta property=\"og:site_name\" content=\"Maryland Institute for Technology in the Humanities\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/UMD.MITH\" \/>\n<meta property=\"article:published_time\" content=\"2014-11-26T11:02:04+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-10-08T20:00:24+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/mith.umd.edu\/wp-content\/uploads\/2018\/10\/MITH-logostack-square-grn.png\" \/>\n\t<meta property=\"og:image:width\" content=\"300\" \/>\n\t<meta property=\"og:image:height\" content=\"300\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/mith.umd.edu\/#website\",\"url\":\"https:\/\/mith.umd.edu\/\",\"name\":\"Maryland Institute for Technology in the Humanities\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":\"https:\/\/mith.umd.edu\/?s={search_term_string}\",\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/mith.umd.edu\/miths-ed-summers-discusses-ferguson-twitter-archive\/#webpage\",\"url\":\"https:\/\/mith.umd.edu\/miths-ed-summers-discusses-ferguson-twitter-archive\/\",\"name\":\"MITH\\u2019s Ed Summers discusses his Ferguson Twitter archive &ndash; Maryland Institute for Technology in the 
Humanities\",\"isPartOf\":{\"@id\":\"https:\/\/mith.umd.edu\/#website\"},\"datePublished\":\"2014-11-26T11:02:04+00:00\",\"dateModified\":\"2020-10-08T20:00:24+00:00\",\"author\":{\"@id\":\"https:\/\/mith.umd.edu\/#\/schema\/person\/4948a8fd2a5a93beae6c42416d218254\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/mith.umd.edu\/miths-ed-summers-discusses-ferguson-twitter-archive\/\"]}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/mith.umd.edu\/#\/schema\/person\/4948a8fd2a5a93beae6c42416d218254\",\"name\":\"Ed Summers\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/mith.umd.edu\/#personlogo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/11f1fafc642e7f0a7e5851f5d98bd66e?s=96&d=mm&r=g\",\"caption\":\"Ed Summers\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","_links":{"self":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/13446"}],"collection":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/users\/38"}],"replies":[{"embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/comments?post=13446"}],"version-history":[{"count":1,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/13446\/revisions"}],"predecessor-version":[{"id":21106,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/13446\/revisions\/21106"}],"wp:attachment":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/media?parent=13446"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/categories?post=13446"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/tags?post=13446"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}