{"id":18210,"date":"2017-01-25T12:00:56","date_gmt":"2017-01-25T17:00:56","guid":{"rendered":"http:\/\/mith.umd.edu\/?p=18210"},"modified":"2020-10-08T15:59:34","modified_gmt":"2020-10-08T19:59:34","slug":"tracking-changes-diffengine","status":"publish","type":"post","link":"https:\/\/mith.umd.edu\/tracking-changes-diffengine\/","title":{"rendered":"Tracking Changes With diffengine"},"content":{"rendered":"<p>Our most respected newspapers want their stories to be accurate because once the words are on paper, and the paper is in someone\u2019s hands, there\u2019s no changing them. The words are literally fixed in ink to the page, and mass produced into many copies that are pretty much impossible to recall. Reputations can rise and fall based on how well newspapers are able to report significant events. But of course physical paper isn\u2019t the whole story anymore.<\/p>\n<p>News on the web can be edited quickly as new facts arrive, and more is learned. Typos can be quickly corrected\u2013but content can also be modified for a multitude of purposes. Often these changes instantly render the previous version invisible. Many newspapers use their website as a place for their first drafts, which allows them to craft a story in near real time, while being the first to publish breaking news.<\/p>\n<p>News travels fast in social media as it is shared and reshared across all kinds of networks of relationships. What if that initial, perhaps flawed version goes viral, and it is the only version you ever read? It\u2019s not necessarily fake news, because there\u2019s no explicit intent to mislead or deceive, but it may not be the best, <a href=\"http:\/\/www.forbes.com\/sites\/kalevleetaru\/2017\/01\/01\/fake-news-and-how-the-washington-post-rewrote-its-story-on-russian-hacking-of-the-power-grid\/#780dc24e291e\">most accurate <\/a>news either. Wouldn\u2019t it be useful to be able to watch how news stories shift in time to better understand how the news is produced? 
Or as <a href=\"https:\/\/twitter.com\/jftitone\">Jeanine Finn<\/a> memorably put it: how do we understand the news <a href=\"https:\/\/jeaninefinn.me\/2016\/11\/15\/understanding-fake-news-in-2016-before-the-truth-gets-its-pants-on\/\">before truth gets its pants on<\/a>?<\/p>\n<p>As part of MITH\u2019s participation in the <a href=\"https:\/\/www.docnow.io\/\">Documenting the Now<\/a> project we\u2019ve been working on an experimental utility called <a href=\"https:\/\/github.com\/docnow\/diffengine\">diffengine<\/a> to help track how news is changing. It relies on an old and quietly ubiquitous standard called <a href=\"https:\/\/en.wikipedia.org\/wiki\/RSS\">RSS<\/a>. RSS is a data format for syndicating content on the Web. In other words it\u2019s an automated way of sharing what\u2019s changing on your website, and for following what changes on someone else\u2019s. News organizations use it heavily. When you listen to a podcast you\u2019re using RSS. If you have a blog or write on Medium an RSS feed is quietly being generated for you whenever you write a new post.<\/p>\n<p>So what diffengine does is really quite simple. First it subscribes to one or more RSS feeds, for example the <a href=\"https:\/\/www.washingtonpost.com\/rss-feeds\/2014\/08\/04\/ab6f109a-1bf7-11e4-ae54-0cfe1f974f8a_story.html?utm_term=.650a7743f53f\">Washington Post<\/a>, and then it watches to see if any articles change their content over time. 
If a change is noticed, a representation of the change, or a <a href=\"http:\/\/catb.org\/jargon\/html\/D\/diff.html\">diff<\/a>, is generated, the new version is archived at the <a href=\"https:\/\/archive.org\/\">Internet Archive<\/a>, and the diff is (optionally) tweeted.<\/p>\n<p>We\u2019ve been experimenting with an initial version of diffengine by having it track the Washington Post, the Guardian and Breitbart News, which you can see on the following Twitter accounts: <a href=\"https:\/\/twitter.com\/wapo_diff\">wapo_diff<\/a>, <a href=\"https:\/\/twitter.com\/guardian_diff\">guardian_diff<\/a> and <a href=\"https:\/\/twitter.com\/breitbart_diff\">breitbart_diff<\/a>. <a href=\"https:\/\/twitter.com\/ruebot\">Nick Ruest<\/a> at York University and Ryan Baumann at Duke University have been setting up their own instances of diffengine to track what is now 25 media outlets, which you can see in <a href=\"https:\/\/twitter.com\/ryanfb\/lists\/diffengine\/members\">this list<\/a> that Ryan is maintaining.<\/p>\n<p>So here\u2019s an example of what a change looks like when it is tweeted:<\/p>\n<blockquote class=\"twitter-tweet\" data-lang=\"en\">\n<p dir=\"ltr\" lang=\"en\">Deportation force is \u2018not happening,\u2019 Paul Ryan tells undocumented family &#8211; The Washi\u2026 <a href=\"https:\/\/t.co\/OQEpG1Inj3\">https:\/\/t.co\/OQEpG1Inj3<\/a> -&gt; <a href=\"https:\/\/t.co\/NsDNI5Dflt\">https:\/\/t.co\/NsDNI5Dflt<\/a> <a href=\"https:\/\/t.co\/t0Q6iuG2qX\">pic.twitter.com\/t0Q6iuG2qX<\/a><\/p>\n<p>\u2014 Editing the Wapo (@wapo_diff) <a href=\"https:\/\/twitter.com\/wapo_diff\/status\/819885771469553664\">January 13, 2017<\/a><\/p><\/blockquote>\n<p><script async src=\"\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/p>\n<p>The text highlighted in red has been deleted and the text highlighted in green has been added. But you can\u2019t necessarily take diffengine\u2019s word for it, right? 
Bots are <a href=\"http:\/\/firstmonday.org\/ojs\/index.php\/fm\/article\/view\/7090\/5653\">sending<\/a> all kinds of fraudulent and intentionally misleading information out on the web, especially in social media. So when diffengine notices new or changed content it uses Internet Archive\u2019s <a href=\"https:\/\/archive.org\/about\/faqs.php#1050\">save page now<\/a> functionality to take a snapshot of the page, which it then references in the tweet. So you can see the original and changed content in the <a href=\"https:\/\/archive.org\/\">most trusted public repository<\/a> we have for archived web content. You can see the links to both the before and after versions in the tweet above.<\/p>\n<p>diffengine draws heavily on the work and example of two similar projects: <a href=\"https:\/\/github.com\/j-e-d\/NYTdiff\">NYTdiff<\/a> and <a href=\"http:\/\/newsdiffs.org\/\">NewsDiffs<\/a>. NYTdiff is able to create presentable diff images and tweet them for the New York Times. But it was designed to work specifically with the NYTimes API. diffengine borrows NYTdiff\u2019s use of phantomjs for creating tweetable images. NewsDiffs, on the other hand, provides a comprehensive framework for watching changes on multiple news sites (Washington Post, New York Times, CNN, BBC, etc). But you need to be a programmer to add a <a href=\"https:\/\/github.com\/ecprice\/newsdiffs\/tree\/master\/parsers\">parser module<\/a> for a website that you want to monitor. It is also a fully functional web application which requires considerable commitment to set up and run.<\/p>\n<p>With the help of <a href=\"https:\/\/pythonhosted.org\/feedparser\/\">feedparser<\/a>, diffengine takes a different approach by working with any site that publishes an RSS feed of changes. This covers many news organizations, but also personal blogs and organizational websites that put out regular updates. 
And with the <a href=\"https:\/\/github.com\/buriy\/python-readability\">readability<\/a> module, diffengine is able to automatically extract the primary content of pages, without requiring special parsing to remove boilerplate material on a site-by-site basis.<\/p>\n<p>To do its work, diffengine keeps a small database of feeds, feed entries and version histories that it uses to notice when content has changed. If you know your way around a <a href=\"https:\/\/sqlite.org\/\">SQLite<\/a> database you can query it to see how content has changed over time. This database could be a valuable source of research data, or <a href=\"https:\/\/www.ideals.illinois.edu\/handle\/2142\/39750\">small data<\/a>, for the study of media production, or the way organizations or people communicate online. One possible direction we are considering is creating a simple web frontend for this database that allows you to navigate the changed content without requiring SQL chops.<\/p>\n<p>Perhaps diffengine could also create its own private archive of the web content, rather than relying on a public snapshot at the Internet Archive. Keeping the archive private could help address ethical concerns around documenting particular individuals or communities when conducting research. If this sounds useful or interesting please get in touch with the <a href=\"https:\/\/www.docnow.io\/\">Documenting the Now<\/a> project, by joining our <a href=\"https:\/\/docs.google.com\/forms\/d\/e\/1FAIpQLSf3E7PAXPoT-XoedpEy9UCTpDPS8kPj5JkMwpaWbuqVP0bTrQ\/viewform\">Slack channel<\/a> or emailing us at <a href=\"&#x6d;a&#x69;&#108;t&#x6f;&#58;&#x69;&#x6e;f&#x6f;&#64;d&#x6f;&#99;&#x6e;&#111;w&#x2e;&#105;o\">&#x69;&#110;&#102;o&#x40;&#x64;&#111;c&#x6e;&#x6f;&#119;&#46;&#x69;&#x6f;<\/a>.<\/p>\n<p><a href=\"https:\/\/github.com\/docnow\/diffengine\/#Install\">Installation<\/a> of diffengine is currently a bit challenging if you aren\u2019t already familiar with installing Python packages from the command line. 
If you are willing to give it a try let us know how it goes over on <a href=\"https:\/\/github.com\/docnow\/diffengine\">GitHub<\/a>. Ideas for sites for us to monitor as we develop diffengine are also welcome!<\/p>\n<p><em>Special thanks to <a href=\"https:\/\/twitter.com\/mkirschenbaum\">Matthew Kirschenbaum<\/a> and <a href=\"https:\/\/twitter.com\/gregj\">Gregory Jansen<\/a> at the University of Maryland for the initial inspiration behind this idea of showing rather than telling what news is. <a href=\"http:\/\/www.cs.umd.edu\/hcil\/\">The Human-Computer Interaction Lab<\/a> at UMD hosted an informal workshop after the recent election to see what possible responses could be, and diffengine is one outcome from that brainstorming.<\/em><\/p>\n<p><em>This page was <a href=\"https:\/\/news.docnow.io\/tracking-changes-with-diffengine-60bbbff81d7d\">originally published<\/a> on the Documenting the Now blog<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Our most respected newspapers want their stories to be accurate because once the words are on paper, and the paper is in someone\u2019s hands, there\u2019s [&hellip;]<\/p>\n","protected":false},"author":38,"featured_media":18211,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[77],"tags":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v15.0 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Tracking Changes With diffengine &ndash; Maryland Institute for Technology in the Humanities<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/mith.umd.edu\/tracking-changes-diffengine\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Tracking Changes With diffengine &ndash; Maryland Institute for 
Technology in the Humanities\" \/>\n<meta property=\"og:description\" content=\"Our most respected newspapers want their stories to be accurate because once the words are on paper, and the paper is in someone\u2019s hands, there\u2019s [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/mith.umd.edu\/tracking-changes-diffengine\/\" \/>\n<meta property=\"og:site_name\" content=\"Maryland Institute for Technology in the Humanities\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/UMD.MITH\" \/>\n<meta property=\"article:published_time\" content=\"2017-01-25T17:00:56+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2020-10-08T19:59:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/mith.umd.edu\/wp-content\/uploads\/2017\/01\/diffengine.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"600\" \/>\n\t<meta property=\"og:image:height\" content=\"521\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/mith.umd.edu\/#website\",\"url\":\"https:\/\/mith.umd.edu\/\",\"name\":\"Maryland Institute for Technology in the Humanities\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":\"https:\/\/mith.umd.edu\/?s={search_term_string}\",\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/mith.umd.edu\/tracking-changes-diffengine\/#primaryimage\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/mith.umd.edu\/wp-content\/uploads\/2017\/01\/diffengine.jpg\",\"width\":600,\"height\":521},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/mith.umd.edu\/tracking-changes-diffengine\/#webpage\",\"url\":\"https:\/\/mith.umd.edu\/tracking-changes-diffengine\/\",\"name\":\"Tracking Changes With diffengine &ndash; Maryland Institute for Technology in the 
Humanities\",\"isPartOf\":{\"@id\":\"https:\/\/mith.umd.edu\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/mith.umd.edu\/tracking-changes-diffengine\/#primaryimage\"},\"datePublished\":\"2017-01-25T17:00:56+00:00\",\"dateModified\":\"2020-10-08T19:59:34+00:00\",\"author\":{\"@id\":\"https:\/\/mith.umd.edu\/#\/schema\/person\/4948a8fd2a5a93beae6c42416d218254\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/mith.umd.edu\/tracking-changes-diffengine\/\"]}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/mith.umd.edu\/#\/schema\/person\/4948a8fd2a5a93beae6c42416d218254\",\"name\":\"Ed Summers\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/mith.umd.edu\/#personlogo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/11f1fafc642e7f0a7e5851f5d98bd66e?s=96&d=mm&r=g\",\"caption\":\"Ed Summers\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","_links":{"self":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/18210"}],"collection":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/users\/38"}],"replies":[{"embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/comments?post=18210"}],"version-history":[{"count":1,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/18210\/revisions"}],"predecessor-version":[{"id":21047,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/posts\/18210\/revisions\/21047"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/media\/18211"}],"wp:attachment":[{"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/media?parent=18210"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/categories?post=18210"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mith.umd.edu\/wp-json\/wp\/v2\/tags?post=
18210"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}