Today, we launched PiracyData.org, a site that takes the top ten most pirated movies of the week and mashes them up with data on legal online availability. Our hope is to build an extensive time-series dataset that can help shed light on the relationship between piracy and viewing options.
As might be expected with a new site, we’ve had some launch-day glitches in the accuracy of our data, and our visitors have thankfully pointed them out. We are committed to getting it right, so in the spirit of full transparency, we want to explain exactly what went wrong and how we plan to fix it.
First, let me explain in detail how our site works and the exact data sources that we are using. Every hour, PiracyData.org polls the RSS feed for TorrentFreak’s most pirated movies posts. If the new week’s data is not yet in our database, we add it and fetch each movie’s availability from CanIStream.It.
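As a rough illustration of that hourly check (a sketch, not the site's actual code), the "add it only if the week is new" logic could look like this in Python. The names `ingest_week`, `db`, and `fetch_availability` are hypothetical, and the dictionary stands in for whatever database is really used.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for the database:
# week identifier -> {IMDB ID -> availability record}
Database = Dict[str, Dict[str, dict]]


def ingest_week(
    db: Database,
    week_id: str,
    imdb_ids: List[str],
    fetch_availability: Callable[[str], dict],
) -> bool:
    """Store a week's most-pirated list once; return True if it was new."""
    if week_id in db:
        # This week was already recorded on an earlier poll, so skip it.
        return False
    # New week: look up availability for each movie exactly once.
    db[week_id] = {imdb_id: fetch_availability(imdb_id) for imdb_id in imdb_ids}
    return True
```

Calling this function on every poll is safe, because a week that is already stored is never fetched or overwritten again.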
CanIStream.It is a great site, but it is a little difficult for a computer to read. On the main site, you can’t look up a movie by IMDB ID, which is pretty much the universal identifier for movies. What you can do, however, is pull up a CanIStream.It widget using the IMDB ID.
The widget separates availability into four categories: streaming, rental, purchase, and physical DVDs. Given that this is a discussion of online piracy, we are really only interested in the first three categories, but we preserve all four. We scrape the page for movie availability on all of the services that the widget lists.
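To make the four categories concrete, here is a minimal Python sketch of how one movie's scraped availability might be represented. The category keys and the function name are assumptions for illustration, not the widget's real field names.

```python
from typing import Dict, List

# All four categories are preserved in the record.
CATEGORIES = ("streaming", "rental", "purchase", "dvd")
# Only the first three count as online availability.
ONLINE_CATEGORIES = ("streaming", "rental", "purchase")

Availability = Dict[str, List[str]]  # category -> list of service names


def is_available_online(availability: Availability) -> bool:
    """True if at least one online category lists at least one service."""
    return any(availability.get(category) for category in ONLINE_CATEGORIES)
```

A movie that is only sold on physical DVD keeps its `dvd` entry in the record but does not count as available online.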
Building our site this way has presented us with four distinct issues that we discovered only once we started getting user feedback:
1. Movie availability may change throughout the week
This is actually not a problem with our data, but with how it’s interpreted. Because the TorrentFreak data is backward-looking, reporting the most pirated movies in the previous week, we only want to report the online availability of movies as it appeared on Monday. That is, we are intentionally taking a snapshot of Monday availability. If movies become available for rental on Tuesday, we will continue to report throughout the remainder of the week that they were not available to rent on Monday, because that is most likely to reflect the state of the world during the preceding week when the piracy was happening.
A number of people have noted that Pacific Rim is now available for rental. We haven’t been able to confirm for sure, but we believe that it was added for rental at some point after we checked, and therefore this does not appear to be an error on our part. We’d appreciate it if anyone can confirm this because we want to make sure we are getting the right results.
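The Monday snapshot rule can be sketched in a few lines of Python. This is an illustration under our own assumptions: the function names and the dictionary used as a store are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict


def snapshot_monday(day: date) -> date:
    """Return the Monday of the week containing `day`."""
    # date.weekday() is 0 for Monday, so subtracting it lands on Monday.
    return day - timedelta(days=day.weekday())


def record_snapshot(snapshots: Dict[date, dict], day: date, availability: dict) -> dict:
    """Keep the first availability recorded for a week; ignore later changes."""
    # setdefault stores the value only if that week has no snapshot yet.
    return snapshots.setdefault(snapshot_monday(day), availability)
```

With this approach, a rental option that first appears on Tuesday does not change what is reported for that week, which matches the behavior described above.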
2. Some services are available on CanIStream.It that are not listed in the widget, only on the main site
In particular, The Lone Ranger is available for rental only from a Sony service, but that service is absent from the CanIStream.It widget, not only for The Lone Ranger but for all movies. Earlier today, our site reported what the CanIStream.It widget reported: that the movie is not available for rental. However, when it was pointed out to us that CanIStream.It’s main site reports that The Lone Ranger is available on Sony, we updated our data to take account of that. We are going to find a way to ensure that all services are automatically included in our dataset, but this means we may have to find another data source or resort to manual entry.
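Until there is an automatic fix, one way to handle a case like this is to layer manual corrections on top of the scraped data. The following Python sketch assumes availability is stored as a mapping from category to service names; `apply_overrides` and the "Sony" label are illustrative, not the site's actual code.

```python
from typing import Dict, List

Availability = Dict[str, List[str]]  # category -> list of service names


def apply_overrides(scraped: Availability, overrides: Availability) -> Availability:
    """Return a copy of the scraped data with manually entered services added."""
    # Copy each list so the original scraped record is left untouched.
    merged = {category: list(services) for category, services in scraped.items()}
    for category, services in overrides.items():
        bucket = merged.setdefault(category, [])
        for service in services:
            if service not in bucket:  # avoid duplicate entries
                bucket.append(service)
    return merged
```

Keeping the scraped record and the manual corrections separate makes it possible to see later which entries came from the widget and which were added by hand.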
3. In at least one instance, CanIStream.It returned to us data for the wrong movie
Here’s how the CanIStream.It widgets work: you go to the base URL “http://www.canistream.it/external/imdb/” and add the IMDB ID for the movie you are querying. For example, since Pacific Rim’s ID is tt1663662, you can see the widget for the movie at http://www.canistream.it/external/imdb/tt1663662 .
This works perfectly most of the time, but bizarrely, it doesn’t work for This Is the End, whose IMDB ID is tt1245492. When you visit http://www.canistream.it/external/imdb/tt1245492 you get the CanIStream.It widget for Jay and Seth Vs. the Apocalypse, not This Is the End. As an outlier, this caught us totally by surprise, and we updated our site to show the correct data for This Is the End. Again, this is the kind of bug we could only have caught once we had lots of eyes on the site, and we’re grateful for the feedback.
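A simple guard against this kind of mismatch is to compare the title we expect with the title the widget returns before accepting its data. The Python sketch below builds the widget URL as described above and performs that comparison. How the returned title is extracted from the widget page is not shown, and the function names are hypothetical.

```python
import re

WIDGET_BASE_URL = "http://www.canistream.it/external/imdb/"


def widget_url(imdb_id: str) -> str:
    """Build the widget URL by appending the IMDB ID to the base URL."""
    if not re.fullmatch(r"tt\d+", imdb_id):
        raise ValueError(f"not an IMDB ID: {imdb_id!r}")
    return WIDGET_BASE_URL + imdb_id


def normalize_title(title: str) -> str:
    """Lowercase the title and collapse punctuation and spacing."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def titles_match(expected: str, returned: str) -> bool:
    """True if the widget's title is the movie we asked for."""
    return normalize_title(expected) == normalize_title(returned)
```

A check like `titles_match("This Is the End", "Jay and Seth Vs. the Apocalypse")` fails, so a record like this could be flagged for manual review instead of being published.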
4. The site is built using the best available data
TorrentFreak and CanIStream.It offer extremely useful data to the public. While we’ve had some issues incorporating the CanIStream.It data, we are grateful for the data they provide. CanIStream.It’s data is typically seen as reliable, even among industry insiders. For instance, the MPAA’s site wheretowatch.org directs its users to CanIStream.It as a source.
That said, if we want to build the canonical dataset on this issue, we have to do better. We need to make sure that there are no glitches. We would like to work with anyone with access to availability data to make sure that we can compile the most accurate data possible.
We’re not exactly sure what this entails yet. We may have to get availability data directly from the services themselves. If we can secure the cooperation of the services (for example, if they would be willing to supply the date on which each movie, identified by IMDB ID, became available on their service), we could even compute availability data historically. TorrentFreak has data on pirated movies going back to 2006.
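If services did supply start dates, computing historical availability would be straightforward. The Python sketch below assumes a hypothetical table mapping each (IMDB ID, service) pair to the date the movie became available. No such dataset exists yet, and the sketch ignores the possibility that a movie is later withdrawn from a service.

```python
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical input: (IMDB ID, service name) -> first day of availability
StartDates = Dict[Tuple[str, str], date]


def available_on(start_dates: StartDates, imdb_id: str, day: date) -> List[str]:
    """List the services on which a movie was already available on `day`."""
    return sorted(
        service
        for (movie_id, service), start in start_dates.items()
        if movie_id == imdb_id and start <= day
    )
```

Running this for each movie in each week of the piracy data would give the availability that applied at the time, rather than a snapshot taken today.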
One thing is for certain: the dataset that we are proposing to build is important. We have provoked quite a reaction from people on both sides of this issue. We acknowledge that it has been a bumpy launch for our site, but we are committed to getting it right. We ask for everybody’s patience and good-faith assistance as we try to get there.