Big Yellow Taxi Data
The accidental disclosure of personal data about New York City cab rides demonstrates what can go wrong when anonymising big data.
The analysis of “big data” has become a key feature of commercial decision-making and social and scientific research. In parallel, governments have increasingly adopted “open data” policies to allow the re-use of data from their activities for such analysis, often in addition to existing freedom of information legislation.
Much information used for big data analytics in the European Union is initially “personal data” for the purposes of the Data Protection Directive (95/46/EC), i.e. information relating to an identified or identifiable natural person.
To avoid the constraints on processing personal data under the European regime, data controllers typically choose to “anonymise” it, so that the subject can no longer be identified. The recitals to the Data Protection Directive are explicit that, “the principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable”.
A specific problem arises where one dataset, anonymised so that data subjects cannot be identified, is then analysed against another dataset, allowing subjects to be re-identified. A topical example arose in respect of a dataset of New York taxi rides.
In 2014, the New York City Taxi and Limousine Commission released a dataset of all New York taxi rides in the previous year in response to a request under New York State’s Freedom of Information Law. It contained details of all yellow cab rides including locations, pick-up and drop-off times and fare amounts.
The first breach of privacy became clear when a hacker, without too much difficulty, managed to reverse the weak hashing used to obscure the medallion (car) and driver licence numbers in the document, meaning that an abundance of information became available on each driver’s activities and earnings.
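The weakness is easy to see in outline. The identifying fields were reportedly obscured with an unsalted hash, but medallion numbers follow short, well-known formats, so an attacker can simply hash every possible value and look up the originals. The following is a minimal sketch only; the four-character pattern used here is an illustrative assumption, not the real medallion scheme:

```python
import hashlib
import string
from itertools import product

def md5_hex(s: str) -> str:
    """Unsalted MD5, as reportedly used to 'anonymise' the taxi data."""
    return hashlib.md5(s.encode()).hexdigest()

def build_lookup() -> dict:
    """Hash every value in an assumed medallion format (digit, letter,
    digit, digit) — only 26,000 candidates, so this is near-instant."""
    table = {}
    for d1, l, d2, d3 in product(string.digits, string.ascii_uppercase,
                                 string.digits, string.digits):
        plain = f"{d1}{l}{d2}{d3}"
        table[md5_hex(plain)] = plain
    return table

lookup = build_lookup()

# Given a hashed medallion from the released file, recover the original:
target = md5_hex("7A42")   # stands in for a hash found in the dataset
print(lookup[target])      # → 7A42
```

Because the space of valid medallion and licence numbers is tiny compared with the space of hash outputs, hashing without a salt or key offers essentially no protection here.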
The second breach was identified by Anthony Tockar, a master’s student at Northwestern University. By searching for timestamped paparazzi photos of celebrities getting into taxi cabs, where the medallions were visible, Tockar was able to identify where, for example, Bradley Cooper went on a particular cab journey and how much he paid, including whether there was a record of a tip.
However, Tockar’s analysis went further than just “stalking” celebrities. He was successful in identifying the address of an individual who took repeat cab rides to a single strip club, Larry Flynt’s Hustler Club. From publicly available information, including Facebook, he was able to identify that person’s “property value, ethnicity, relationship status, court records and even a profile picture”.
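The mechanics of this kind of linkage attack can be illustrated with toy data: neither dataset names the rider, but an “anonymised” trip record still carries a precise time and place, which can be joined against outside information such as a timestamped photo or a social-media check-in. All records below are invented for illustration:

```python
# Toy linkage attack: join an "anonymised" trip release against
# outside observations on pick-up time and location.
trips = [  # released data: no names, but precise time/place/fare
    {"pickup": "2013-07-08 23:15", "location": "W 51st & Broadway", "fare": 9.5},
    {"pickup": "2013-07-08 23:40", "location": "E 14th & 1st Ave", "fare": 6.0},
]

sightings = [  # outside source, e.g. a timestamped paparazzi photo
    {"who": "Celebrity A", "when": "2013-07-08 23:15",
     "where": "W 51st & Broadway"},
]

# A simple join on the shared quasi-identifiers re-identifies the rider
# and attaches the "anonymous" trip details to a named person.
reidentified = [
    {**trip, "who": s["who"]}
    for trip in trips
    for s in sightings
    if trip["pickup"] == s["when"] and trip["location"] == s["where"]
]

print(reidentified)
```

The point is that time and location act as quasi-identifiers: each is harmless alone, but together they are often unique enough to single out one individual, exactly as the medallion-plus-photo matching did.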
In the UK, the Information Commissioner’s Office’s Code of Practice on Anonymisation (November 2012) proposes a process to ensure an adequate level of anonymisation. The data controller should consider the likelihood of re-identification (i) being attempted and (ii) being successful, and the quality (or “richness”) of the data once such anonymisation has taken place. The data controller should then test the data according to the level of acceptable risk and be satisfied that individuals cannot be identified before publication or disclosure.
As regards what risk is acceptable, the ICO states that the Data Protection Act 1998, as confirmed by case law, “does not require anonymisation to be completely risk free – you must be able to mitigate the risk of identification until it is remote. If the risk of identification is reasonably likely the information should be regarded as personal data”.
Personal data is subject to the regime set out in the Data Protection Act and must be processed accordingly, which might require, for example, the consent of each data subject. Anonymised data falls outside the regime unless it poses a risk of re-identification which is more than remote. This requires both a documented risk assessment and testing of the resulting dataset.
In performing a risk assessment and testing for re-identification risk, data controllers should consider the risk of comparison with all the other datasets which government, businesses and individuals have assembled and shared.
This should include the vast, user-generated datasets which exist on social media, as illustrated by the identification of the regular Hustler Club customer. Although most of us are not being followed around by photographers in the manner of Bradley Cooper, those who post selfies on Instagram and Facebook risk becoming their own paparazzi.
Disclaimer: This article is produced for and on behalf of White & Black Limited, which is a limited liability company registered in England and Wales with registered number 06436665. It is authorised and regulated by the Solicitors Regulation Authority. The contents of this article should be viewed as opinion and general guidance, and should not be treated as legal advice.