Identification and protection of privacy vulnerabilities - overview
Organizations, public bodies, institutes and companies gather enormous volumes of data that contain personal information. For reputation, compliance and legal reasons, this personal information needs to be de-identified before it is shared with third parties, such as analytics teams or research scientists. The de-identification process aims to achieve three goals: a) significantly and provably minimize the re-identification risk, b) maintain a high level of data utility to support the intended secondary purposes, and c) maintain the truthfulness of the data at the record level to the largest possible extent. Data owners often cannot fully assess how vulnerable their datasets are in terms of privacy, or what strategy they need to follow to provide the necessary level of protection.
Our team at IBM Research - Ireland has designed and implemented the IBM Data Privacy Toolkit for identifying privacy vulnerabilities and protecting against them. The toolkit is provided both as a library and under a service model, and covers a broad list of privacy tasks with the end goal of guiding the decision-making process for data owners.
- Datasets such as health records, data from IoT devices and free text contain personal and sensitive information about individuals, which makes data sharing challenging
- Datasets are constantly growing in volume and in the characteristics they capture, and are heavily interconnected
- Provide guidance for stakeholders to understand the risk when sharing data and which strategy to follow for safe data releases
- Develop efficient ways to detect the existence of personal and sensitive information in structured and unstructured data
- Develop algorithms that can detect privacy vulnerabilities in large datasets
- Define methodologies and metrics to calculate privacy risks and recommend data protection strategies
- A workflow that helps data owners and compliance and legal teams understand their data and its privacy vulnerabilities, and guides them through most of the decision-making process for choosing a protection strategy
- A toolkit that offers type identification, vulnerability detection, masking and anonymization for large datasets, leveraging Spark for scalability
Example of the data privacy protection workflow
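The vulnerability detection and risk metrics listed above can be illustrated with a small sketch. This is not the toolkit's algorithm or API: it is a brute-force illustration in plain Python, assuming that a "privacy vulnerability" means a combination of columns on which at least one record is shared by fewer than k individuals, and that per-record risk is the reciprocal of the size of its equivalence class. The function names, the column names and the sample records are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def class_sizes(rows, columns):
    """Count how many records share each combination of values in `columns`."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def reidentification_risk(rows, columns):
    """Return (max_risk, avg_risk), taking per-record risk as 1 / class size."""
    sizes = class_sizes(rows, columns)
    max_risk = 1.0 / min(sizes.values())
    # Summing 1/size over every record yields the number of classes,
    # so the average risk is classes / records.
    avg_risk = len(sizes) / len(rows)
    return max_risk, avg_risk


def find_vulnerabilities(rows, columns, k=2, max_width=2):
    """List minimal column combinations where some class has fewer than k records.

    Brute force over combinations of up to `max_width` columns (illustrative only).
    """
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            # Skip supersets of combinations that were already reported.
            if any(set(f) <= set(combo) for f in found):
                continue
            if min(class_sizes(rows, combo).values()) < k:
                found.append(combo)
    return found


# Hypothetical sample data: no single column isolates anyone,
# but pairs of columns do.
records = [
    {"zip": "D01", "age": 34, "gender": "F"},
    {"zip": "D01", "age": 34, "gender": "M"},
    {"zip": "D02", "age": 34, "gender": "F"},
    {"zip": "D02", "age": 51, "gender": "M"},
    {"zip": "D01", "age": 51, "gender": "F"},
    {"zip": "D02", "age": 51, "gender": "M"},
]

vulnerable = find_vulnerabilities(records, ["zip", "age", "gender"], k=2)
for combo in vulnerable:
    print(combo, reidentification_risk(records, combo))
```

With `k=2`, a combination is flagged as soon as a single record is unique on it. The exhaustive search grows combinatorially with the number of columns, so it only serves to show the idea on small inputs; it says nothing about how the toolkit scales the same task on Spark.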
Spiros Antonatos, Stefano Braghin, Naoise Holohan, Yiannis Gkoufas and Pol Mac Aonghusa. PRIMA: an End-to-End Framework for Privacy at Scale. In the 34th IEEE International Conference on Data Engineering (ICDE), 16-19 April 2018, Paris, France
A. Gkoulalas-Divanis, S. Braghin, S. Antonatos. FPVI: A Scalable Method for Discovering Privacy Vulnerabilities in Microdata. In the Second IEEE International Smart Cities Conference (ISC2), 12-15 September 2016, Trento, Italy
Vanessa López, Martin Stephenson, Spyros Kotoulas, Pierpaolo Tommasi. Data Access Linking and Integration with DALI: Building a Safety Net for an Ocean of City Data. International Semantic Web Conference (2) 2015: 186-202
A. Gkoulalas-Divanis, S. Braghin. Efficient algorithms for identifying privacy vulnerabilities. ISC2 2015: 1-8