Data Privacy - Overview
Organizations, public bodies, institutes and companies gather enormous volumes of data that contain personal information. For reputation, compliance and legal reasons, this personal information must be de-identified before it is shared with third parties, such as analytics teams or research scientists. The de-identification process aims to achieve three goals: a) significantly and provably minimize the re-identification risk, b) maintain a high level of data utility so that the data can still support its intended secondary purposes, and c) preserve the truthfulness of the data at the record level to the largest possible extent. The new era of machine learning and deep learning brings new challenges to the data privacy landscape; it is our responsibility to ensure that models are trained without revealing any personal information from the input datasets.
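One widely used way to provably bound what a released statistic reveals about any individual is differential privacy, which the prerequisites below mention as the preferred background. As an illustrative sketch only (the names `laplace_noise` and `private_count` are hypothetical, not part of any framework this project defines), the classic Laplace mechanism answers a counting query while hiding any single record's contribution:

```python
import math
import random

def laplace_noise(scale):
    # Sample from Laplace(0, scale) via the inverse CDF of a uniform draw.
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon):
    # A counting query has sensitivity 1: adding or removing one record
    # changes the count by at most 1, so Laplace noise with scale
    # 1/epsilon yields epsilon-differential privacy for this query.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

Smaller values of `epsilon` give stronger privacy but noisier answers, which is exactly the privacy/utility tension the project sets out to measure.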
This project aims to explore innovative ways to provide a framework for ensuring data privacy for machine learning models in meaningful and realistic settings. The audience for this work spans technical, legal and compliance departments, so good presentation skills will be needed to help these groups navigate the material. The project will also build the foundational metrics for capturing the balance between information loss and re-identification risk. The end goal is a research prototype that demonstrates the framework in various scenarios.
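To make the balance between information loss and risk concrete, here is a minimal sketch under assumed definitions (the helper names and the specific metrics are illustrative, not the project's actual framework): ages are generalized into fixed-width bins, re-identification risk is estimated prosecutor-style as one over the smallest group size, and information loss is the fraction of distinct values collapsed by the generalization.

```python
from collections import Counter

def generalize_age(age, bin_width):
    # Map an exact age to a coarser interval, e.g. 34 -> "30-39".
    lo = (age // bin_width) * bin_width
    return f"{lo}-{lo + bin_width - 1}"

def risk_and_loss(ages, bin_width):
    groups = Counter(generalize_age(a, bin_width) for a in ages)
    # Prosecutor-style re-identification risk: a record in a group of
    # size k is matched with probability at most 1/k, so the worst case
    # is driven by the smallest group.
    risk = 1.0 / min(groups.values())
    # Crude information loss: fraction of distinct values collapsed.
    loss = 1.0 - len(groups) / len(set(ages))
    return risk, loss
```

Widening the bins lowers the risk but raises the loss, which is the trade-off the project's foundational metrics would formalize far more carefully.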
Prerequisites:
Programming (Java 8 or Python)
Background on machine learning and deep learning
Basic background on data privacy (with a preference for differential privacy)
Basic background on information theory and statistical disclosure control
Good presentation skills