There are myriad ways in which data about people can be used for societal benefit – in domains such as research on human health and behavior, novel services offered through electronic commerce, and improved law enforcement and national security. However, many of these uses of data raise justified concerns about privacy, as we learn that de-identified data can be readily re-identified and used in ways that have the potential to harm individuals.
In 2009, CRCS led the launch of the Privacy Tools Project, a broad, multidisciplinary effort to enable the collection, analysis, and sharing of sensitive data while providing privacy for individual subjects. A particular focus of the project is the sharing of data to support research in computational and quantitative social science. Bringing together computer science, social science, statistics, and law, the investigators on the Privacy Tools Project refine and develop definitions and measures of privacy and data utility, and design an array of technological, legal, and policy tools for dealing with sensitive data. In addition to contributing to research infrastructure around the world, the ideas developed in this project will benefit society more broadly as we grapple with data privacy issues in many other domains, including public health and electronic commerce.
The Privacy Tools Project is defining and measuring privacy in both mathematical and legal terms, and exploring alternate definitions of privacy that may be more general or more practical. The project studies variants of differential privacy and develops new theoretical results for its use in contexts where it is currently inappropriate or impractical. Differential privacy, introduced by CRCS faculty member Cynthia Dwork, visiting scholar Kobbi Nissim, and collaborators in 2006, is a rigorous mathematical definition of privacy. An algorithm is said to be differentially private if, by looking at its output, one cannot tell whether any individual’s data was included in the original dataset. In other words, the guarantee of a differentially private algorithm is that its behavior hardly changes when a single individual joins or leaves the dataset: anything the algorithm might output on a database containing some individual’s information is almost as likely to have come from a database without that individual’s information. This guarantee holds for any individual and any dataset. Therefore, regardless of how eccentric any single individual’s details are, and regardless of the details of anyone else in the database, the guarantee of differential privacy still holds. This gives a formal assurance that individual-level information about participants in the database is not leaked.
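To make the guarantee concrete, here is a minimal sketch of the classic Laplace mechanism for a counting query, one of the simplest differentially private algorithms. This is an illustrative example, not code from the project: the function names and the 1/epsilon noise scale for a sensitivity-1 count follow the standard textbook construction.

```python
import random


def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise.

    The difference of two independent Exponential(1/scale) draws
    is Laplace-distributed with mean 0 and the given scale.
    """
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1: adding or removing one
    individual's record changes the true count by at most 1, so
    Laplace noise with scale 1/epsilon suffices for the guarantee.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

Because the noise scale depends only on the query's sensitivity and epsilon, not on any individual's data, the output distribution shifts by at most a small multiplicative factor when one person's record is added or removed, which is exactly the "almost as likely" property described above.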
The goals of the Differential Privacy research group, which operates within the Privacy Tools Project, are to design and implement differentially private tools that will enable social scientists to share useful statistical information about sensitive datasets; to integrate these tools with the widely used platforms developed by the Institute for Quantitative Social Science for sharing and exploring research data; and to advance the theory of differential privacy in a variety of settings, including statistical analysis (e.g., statistical estimation, regression, and answering many statistical queries), machine learning, and economic mechanism design.
The other major area of research in the Privacy Tools Project is DataTags. DataTags is a system designed to help data holders navigate complex issues surrounding data privacy. It enables computer-assisted assessments of the legal, contractual, and policy restrictions that govern data sharing decisions. Assessments are performed through an interactive interview, in which the DataTags system asks a user a series of questions to elicit the key properties of a given dataset and applies inference rules to determine which laws, contracts, and best practices are applicable. The output is a set of recommended DataTags — simple, iconic labels that represent a human-readable and machine-actionable data policy — and a license agreement that is tailored to the individual dataset. The DataTags system is being designed to integrate with the open-source data repository software Dataverse and its suite of access controls and statistical analysis tools. It will also operate as a standalone tool and as an application that can be integrated with other platforms.
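The interview-then-infer flow can be sketched as a small rule-based function. This is a hypothetical simplification: the real DataTags questionnaire and tag vocabulary are far richer, and the property names and tag labels below are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass
class DatasetProperties:
    """Answers elicited from the interview (illustrative fields only)."""
    contains_personal_data: bool
    subjects_identifiable: bool
    serious_harm_if_disclosed: bool


def recommend_tag(props: DatasetProperties) -> str:
    """Apply simple inference rules to map dataset properties to a tag.

    Rules are checked from least to most restrictive; the first
    applicable rule determines the recommended handling level.
    """
    if not props.contains_personal_data:
        return "public"                # no sharing restrictions needed
    if not props.subjects_identifiable:
        return "controlled-public"     # de-identified; share with notice
    if props.serious_harm_if_disclosed:
        return "maximally-restricted"  # strong controls, approval required
    return "restricted"                # access agreement required
```

For example, `recommend_tag(DatasetProperties(False, False, False))` yields `"public"`, while a dataset with identifiable subjects and serious disclosure harm is routed to the most restrictive tag. The real system attaches a machine-actionable policy and a tailored license to each tag rather than a bare label.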
The research being conducted by the Privacy Tools Project is providing a better understanding of the practical performance and usability of a variety of algorithms for analyzing and sharing privacy-sensitive data. The project is developing secure implementations of these algorithms, along with legal instruments, which will be made publicly available and used to enable wider access to privacy-sensitive datasets in the Harvard Institute for Quantitative Social Science’s Dataverse Network.
Like all CRCS endeavors, the Privacy Tools Project is profoundly collaborative. It is led by SEAS Professor Salil Vadhan, whose team at CRCS partners with researchers at the Berkman Klein Center for Internet & Society (BKC), the Institute for Quantitative Social Science (IQSS), the Data Privacy Lab, and MIT Libraries’ Program on Information Science. This year, Privacy Tools team members presented at conferences around the world, including the 10th Annual Privacy Law Scholars Conference (PLSC); the 8th Annual ESPAnet Israel 2017; the Simons Institute’s Data Privacy Planning Workshop; and the Third Biennial Secure and Trustworthy CyberSpace Principal Investigators’ Meeting (SaTC PI Meeting ’17). Harvard Magazine highlighted the Privacy Tools Project in its article on privacy and security, and Principal Investigator and CRCS Professor Latanya Sweeney was recently named one of Forbes’ “20 Incredible Women Working in AI Research.”
The Privacy Tools Project is funded by the National Science Foundation, the Sloan Foundation, the US Bureau of the Census, and Google.
For more information, please visit http://privacytools.seas.harvard.edu