Robindra Prabhu
Norwegian Board of Technology
robindra.prabhu@teknologiradet.no
Few concepts have made as many headlines in the past few years as the term «Big Data». From its nascent beginnings in technology circles, the term has rapidly catapulted into mainstream society. In recent years it has even become a household notion in the higher echelons of government: the Obama administration has launched its multi-million dollar «Big Data Initiative» (Office of Science and Technology Policy 2012), the United Nations has established a network of experimental labs exploring the possibility of leveraging Big Data for humanitarian purposes (United Nations Global Pulse), and recently the Australian government became one of the first in the world to launch a «Big Data Strategy» for its public service (Australian Government 2013).
With a promise to fundamentally «transform the way we live, work and think» through extensive «datafication» of all things human (Mayer-Schönberger, Cukier 2013: 73–97), Big Data vows to give us unprecedented insight into complicated problems and a powerful toolkit to better understand the world around us. Given the hyperbole that often accompanies novel technology trends, one would be forgiven for mistaking the buzz for hype with little substance. Yet despite its ostensible potential, the considerable excitement it generates and the widespread agreement that Big Data will affect our society in crucial ways, there is surprisingly little clarity about what Big Data actually is and what it entails for society at large.
To be sure, there is no shortage of definitions. The «Big» alludes to unfathomable troves of digital data, in various shapes and forms, which we deliberately or passively generate in our daily interactions with technology. Then there is our enhanced ability to store, manage and extract insight from these data troves using powerful computing technology and the latest in advanced analytical techniques. But «Big Data» does not refer to a fixed quantitative threshold or clear-cut technological constraint. Indeed what is considered «big», «complex» and «advanced» varies widely. So much so that researchers have found it necessary to collate various definitions of the term «Big Data» and furnish the following meta-definition:
Big Data is a term describing the storage and analysis of large and complex datasets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning. (Ward, Barker 2013)
Perhaps it is only natural that early attempts to capture and define an elusive concept will come in many guises and possibly fall along a «moving technological axis». But while we struggle to pin down this new technology, it is important to recognise that Big Data’s entry into the mainstream is equally about cultural changes in how we think about data, its capture and analysis, and its rightful place in the fabric of society.
Whether hype or substance, and however most appropriately defined, the Big Data discourse is taking place against some profound (and I would argue exciting) changes in how we interact with our physical and social surroundings. These interactions invariably involve technologies and result in digital traces manifested in various ways, such as Internet clickstreams, location data from cell phones interacting with phone towers, data streams from credit card transactions, the logging of purchasing patterns in shops, the rich and multifaceted sensor data from an Airbus A380 in flight or the vast detectors at research labs like CERN. Moreover, the growing proliferation of low-cost sensors allows us to track and monitor objects and mechanisms in ways that were previously impossible. Farmers employ sensors to monitor moisture levels in fields, shipments of fish and fruit are monitored for temperature and location in real time as they move between continents, and people log personal health indicators using their smartphones.
Not only do we spend more time «online», but as we continue to add more «things» to the Internet, our lives become increasingly more entwined with the virtual world. And the digital traces we constantly leave behind in the virtual world now give us new handles on complex problems in the physical world.
The sceptic might demur that Big Data still has some way to go to deliver on its promise; targeted advertising and tailored movie recommendations may not appear to be the stuff of «revolutions». But fascinating applications are beginning to emerge: mobile phone data is being leveraged to map and track the spread of disease (Talbot 2013), law enforcement agencies are using predictive data-driven tools to determine the «where and when» of the next crime (The Economist 2013) and aggregate citizen sentiments are mined from large-scale social media feeds (Social Media and Post-2015) (Anderson 2008). More applications are likely to appear in the future. And as we begin to deliberate on the wider ramifications of the rather nebulous phenomenon of «Big Data», important clues to the ethical challenges ahead are likely to be found through close examination of the existing Big Data landscape. An appreciation of these challenges is of relevance not only to researchers who are dabbling with new and vast troves of data, but also to private enterprises and governmental agencies looking to harness the power and potential of Big Data.
The nascent debate around Big Data may appear to be quite polarised. As Sandra González-Bailón remarks, the discussion on the proper governance and use of all these novel data sources has split public opinion into two camps:
… the sceptics, who question the legitimate use of that data on the basis of privacy and other ethical concerns; and the enthusiasts, who focus on the transformational impact of having more information than ever before. (González 2013: 147)
Both camps have extremists who will either dismiss the Big Data phenomenon as overhyped and underwhelming, or espouse the view that we are witnessing a new era in which the proliferation of data will render theory and interpretation superfluous (Anderson 2008; Richards and King 2013).
While a healthy dose of both enthusiasm and scepticism is essential when dealing with new technologies, there are valuable lessons to be learned from the moderates on either side. Firstly, theory and interpretation are not likely to be discarded any time soon. Instead, their importance is reinforced as a sense-making tool in a growing sea of noisy data. Secondly, we would be wise to tread carefully, lest the critics be vindicated and we end up sleepwalking into a surveillance society.
As the debate and rhetoric advances and matures, it becomes important to capture the full range of nuanced challenges associated with the Big Data paradigm.
An interesting turn in this direction is provided by Richards and King in their paper «Three Paradoxes of Big Data» (Richards and King 2013). While the authors do not deny the many benefits and the substantial potential inherent in Big Data, they advocate a more pragmatic discussion, with due attention to the many-faceted implications and inherent dangers of the Big Data paradigm, by calling attention to three paradoxes in the current rhetoric: those of transparency, identity and power.
These paradoxes touch on more than one raw nerve in the current discourse on the ethical and societal implications of Big Data. A closer look at the various elements along the «Big Data chain» – namely data collection and storage, the application of analytical tools and, finally, action on the basis of the insights mined – also reveals a host of potential shortcomings in current protective measures, as well as new challenges and problems.
To date, most of the ethical concerns that have been raised relate to privacy challenges in the first link of the chain, namely the collection and storage of data. Many of these problems are not entirely new, but traditional mechanisms for ensuring privacy protection have come under increasing pressure with the advent of Big Data. Consider the following case:
As part of a contest to improve its movie recommendation service, the online movie streaming service Netflix released an anonymised dataset containing the rental and rating history of almost half a million customers. By running the anonymised dataset against ratings on the online service «Internet Movie Database», researchers were not only able to identify individuals in Netflix’s records, but also to infer the political preferences of those people (Narayanan and Shmatikov 2006).
There are similar examples of how an apparently anonymised dataset, when properly contextualised, is no longer truly anonymous. And the Big Data paradigm makes it increasingly difficult to secure anonymity, because ever more data streams are generated, stored and made available to advanced data mining techniques (Navetta 2013). As the world becomes more data rich, researchers can no longer rest content with simplistic anonymisation to mitigate ethical risks.
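The mechanics of such a «linkage attack» are disarmingly simple. The following toy sketch in Python (using pandas, with entirely invented records) joins a hypothetical «anonymised» ratings table to a public, named profile table on nothing more than shared titles, ratings and approximate dates – the kind of quasi-identifiers exploited in the Netflix case:

```python
import pandas as pd

# Hypothetical "anonymised" release: user IDs replaced with opaque tokens.
anonymised = pd.DataFrame({
    "token": ["u1", "u1", "u2"],
    "title": ["Fargo", "Brazil", "Heat"],
    "rating": [5, 4, 3],
    "week": ["2006-W01", "2006-W02", "2006-W01"],
})

# Public, named ratings gathered elsewhere (e.g. an IMDb-like site).
public = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob"],
    "title": ["Fargo", "Brazil", "Heat"],
    "rating": [5, 4, 3],
    "week": ["2006-W01", "2006-W02", "2006-W01"],
})

# A plain join on quasi-identifiers links the opaque token back to a name.
linked = anonymised.merge(public, on=["title", "rating", "week"])
print(linked[["token", "name"]].drop_duplicates())
```

No sophisticated machinery is required: a handful of shared attributes, each harmless on its own, is enough to undo the anonymisation.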
The re-identification problem does not only highlight the shortcomings of established protective measures, but also shows that focusing exclusively on the proper governance of datasets and their attributes will often fall short of capturing the nuanced ethical challenges associated with data analysis. In order to grab the bull by the horns and provide the individual with meaningful control over personal information, it is necessary to govern data usage – that is, the actual operations performed on and with the datasets – rather than focusing solely on the collection and retention of such data. Doing so, however, is challenging, to say the least.
For while the current Big Data scene may appear to be dominated by a handful of major players, such as Google, Facebook and Amazon, its ecosystem is in fact highly distributed, with a host of third-party actors operating behind the scenes who «often piggyback on the infrastructure built by the giants» (Sandberg 2012). Data collection, curation and analysis do not necessarily take place at a single point that can be subjected to robust regulatory measures.
Moreover, the technical opacity of algorithms underpinning Big Data analysis, as well as the real-time nature of such analyses, does not easily lend itself to meaningful scrutiny by way of traditional transparency and oversight mechanisms. In a world where
… highly detailed research datasets are expected to be shared and re-used, linked and analysed, for knowledge that may or may not benefit the subjects, and all manner of information exploited for commercial gain, seemingly without limit. (Dwork 2014)
it can be hard to gauge a priori which operations are socially and ethically sound and which are not. Researchers may find that seemingly innocuous operations reveal themselves as privacy-intrusive or otherwise ethically «sticky» only after they have been performed. As computational techniques become increasingly sophisticated and systems are able to extract personal information from datasets that appear harmless, relying on human intuition to define which operations are privacy-intrusive and which are not seems unsatisfactory.
Technology is likely to be at least one part of the solution. Novel approaches such as «differential privacy» leverage mathematics to ensure both consistent and high standards of privacy protection in statistical operations on datasets involving sensitive data. Differentially private algorithms satisfy mathematical conditions that allow the privacy risk involved in an operation on a dataset to be duly quantified. The algorithm intentionally blurs its output with random noise so that individuals whose data are being analysed are ensured «plausible deniability»: their presence or absence in the dataset in question has such a marginal impact on the aggregate result that there is no way of telling whether or not they were part of the dataset in the first place. Researchers can still draw value from the dataset because the «blurred» output differs only marginally from the true output and the uncertainty, or «degree of blurring», is well known. Differentially private algorithms can also keep track of and appropriately quantify the cumulative privacy risk an individual sustains through repeated or multiple queries, iteratively adding more noise to mask any personal information residing in the data (Klarreich 2012).
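A minimal sketch of the idea, assuming a simple counting query and the classic Laplace mechanism (the function and toy data below are invented for illustration, not drawn from any particular library): the scale of the added noise is set by the query’s sensitivity and a chosen privacy parameter ε, and it is precisely this parameter that lets the privacy loss of an operation be quantified.

```python
import numpy as np

def dp_count(records, predicate, epsilon):
    """Differentially private count: the true count plus Laplace noise.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so noise drawn from Laplace(1/epsilon)
    satisfies epsilon-differential privacy for that single query.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical ratings log: how many users watched a given title?
ratings = [{"user": i, "title": "Brokeback Mountain"} for i in range(42)]
print(dp_count(ratings, lambda r: r["title"] == "Brokeback Mountain", epsilon=0.5))
```

Under basic composition, k such queries at parameter ε jointly satisfy (k·ε)-differential privacy, which is one way the cumulative risk of repeated analyses can be tracked and budgeted.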
Privacy and personal data protection are often touted as the central ethical challenges presented by Big Data. While technology may certainly help mitigate some of these risks, other challenges will require strong governance and legal protections. Furthermore, while we attend to the very pressing privacy concerns raised by Big Data, we should not lose sight of the many issues that fall outside the traditional privacy debate.
With recent technological advances, the cost of collecting, storing and analysing various kinds of data snippets has decreased quite dramatically. Novel methods also allow us to interlink and make sense of various kinds of data long after they have been collected. As Alistair Croll remarks in an interesting blog post: «In the old, data-is-scarce model, companies had to decide what to collect first, and then collect it. […] With the new, data-is-abundant model, we collect first and ask questions later» (Croll 2012). This attitude was perhaps most amply illustrated by the mass surveillance activities of the NSA brought to light by the recent Snowden revelations, but it also holds true on a more general level. And this, Croll argues, is changing the way we use data.
Many remarkable successes of the Big Data paradigm, such as detecting disease outbreaks or predicting traffic jams, come from utilising data in ways that are very different from the original purpose of collection or the original context and meaning we bestowed on the data. However, Croll argues, this is a slippery slope fraught with ethical problems that go well beyond the regular privacy debate. Instead they deal with the inferences we are allowed to make and just how we act on or apply this insight. As the technical and financial barriers to what we can collect and do with data begin to crumble, the regulatory challenges to the proper use of data and analytics are likely to intensify (Soltani 2013).
As an example, Croll remarks on a study performed by the popular online dating service OkCupid, where the profile essays of some half a million users were mined for words that made each racial group in its member database statistically distinguishable from the other racial groups. According to the OkCupid blog post, «black people are 20 times more likely than everyone else to mention soul food, whereas no foods are distinct for white people» (Rudder 2010). This relatively simple study highlights just how easily information on race, sexual orientation, political standing or health can be inferred from innocuous information collected for very different purposes. Croll continues: «If I collect information on the music you listen to, you might assume I will use that data in order to suggest new songs, or share it with your friends. But instead I could use it to guess at your racial background. And then I could use that data to deny you a loan».
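OkCupid has not published its exact methodology, but even a crude frequency-ratio analysis along the following lines is enough to surface words that are disproportionately common in one group’s profile text compared with another’s. The Python sketch below is purely illustrative, with invented example texts:

```python
from collections import Counter

def distinctive_words(docs_a, docs_b, min_count=2):
    """Rank words by how much more frequent they are in docs_a than docs_b."""
    counts_a = Counter(w for d in docs_a for w in d.lower().split())
    counts_b = Counter(w for d in docs_b for w in d.lower().split())
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    scores = {}
    for word, c in counts_a.items():
        if c < min_count:
            continue
        p_a = c / total_a
        p_b = (counts_b[word] + 1) / (total_b + len(counts_b) + 1)  # smoothed
        scores[word] = p_a / p_b
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented example: which words set group A's profiles apart from group B's?
group_a = ["i love soul food and live music", "soul food sundays with family"]
group_b = ["hiking craft beer and live music", "weekend hiking and board games"]
print(distinctive_words(group_a, group_b)[:3])
```

The point is not the particular statistic used, but how little machinery is needed to turn «harmless» free-text data into markers of sensitive group membership.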
While such inferences may be partially construed as privacy issues – and legislative regulation can assist in preventing obvious transgressions – there are arguably deeper issues at play.
Inferences like the above are typically used to personalise and tailor ads, information and online experiences to individuals. Such tailoring relies on classification – the algorithmic grouping of data points (people) – and, as Dwork and Mulligan point out, such algorithms are a «messy mix of technical and human curating» and are «neither neutral nor objective», but always geared towards a specific purpose in a given context (Dwork and Mulligan 2013: 35). The objectionable or discriminatory outcome may not even be intentional or obvious to the providers of the service. As Sandberg remarks, a machine-learning algorithm trained on various data to determine the suitability of loan applicants, or even job applicants, may well «know» the race or political orientation of the applicant, even if it were not explicitly fed this information or programmed to use it. The information is baked into the data in non-obvious ways and ultimately «the algorithm will follow the data, not how we want to ‘think’» (Sandberg 2012). Once differential treatment starts following certain social, political or religious patterns, the large-scale effects on society can be profound. As Dwork and Mulligan argue, these issues have little to do with privacy and transparency, but are more about the «values embedded and reflected in classifications, and the roles they play in shaping public and private life» (Dwork and Mulligan 2013: 40).
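Sandberg’s point can be made concrete with a few lines of entirely synthetic code. In the sketch below (scikit-learn, with invented data – not anyone’s actual scoring system), a hypothetical lending model is trained only on «postcode» and «income»; because the synthetic postcode correlates with a protected attribute and the historical labels are biased, the model’s decisions end up sorted along an attribute it was never shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Protected attribute: never shown to the model.
protected = rng.integers(0, 2, n)

# "Postcode" happens to correlate with the protected attribute (a proxy),
# while "income" is the ostensibly legitimate feature.
postcode = protected + rng.normal(0.0, 0.5, n)
income = rng.normal(50.0, 10.0, n)

# Historical outcomes are biased against the protected group.
approved = ((income / 10 - 2 * protected + rng.normal(0.0, 1.0, n)) > 2).astype(int)

# The model sees only postcode and income ...
X = np.column_stack([postcode, income])
model = LogisticRegression(max_iter=1000).fit(X, approved)
predictions = model.predict(X)

# ... yet its approval rates differ sharply along the attribute it never saw.
print("approval rate, group 0:", predictions[protected == 0].mean())
print("approval rate, group 1:", predictions[protected == 1].mean())
```

Nothing in the code mentions the protected attribute at prediction time; the disparity arises because «the algorithm follows the data».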
Some cities across the U.S., notably Philadelphia, use statistical profiling techniques to determine the risk of criminal recidivism among parolees. The method relies on classification of offenders into groups for which certain statistical probabilities can be computed. While critics dismiss such methods as ethically questionable at best (should a cold calculus of past offences punish you for crimes you have not yet committed?), proponents argue that the method does nothing a parole board would not do, except with greater accuracy, without discriminatory urges and with more consistency and transparency.
The case highlights the challenging problems that we are likely to face as Big Data moves out of its nascent stage of tailored ads to affect a wider range of human activity. It also shows how easy it is to fall prey to the temptation of framing the problem as one of man versus machine. Such an approach is likely to be counterproductive. The challenge of managing «ethical risk» in a Big Data world is one that is jointly technological and sociological, and as Dwork and Mulligan succinctly put it: «[…] Big Data debates are ultimately about values first, and about math and machines only second» (Dwork and Mulligan 2013: 38).
Like other technologies in the past, as Big Data unfolds and is absorbed into society we are likely to see adjustments and changes in our current notions of privacy, civil liberties and moral guidelines. And as the hype eventually wears off and a proper equilibrium between the role of human intuition and data-driven insight is established, we will need to develop tools, guidelines and legislation to govern this new world of data and data analysis. This process will require a wide perspective, coupled with constant and close scrutiny of all links along the Big Data chain. It is an arduous task, but also one that should be exciting for all involved. Future societies might be shaped by technological advances, but technology itself is moulded by human choices. And these choices are available for us to make now.