DeepAnon: Deep Data Anonymization for Large-scale Analysis of Text-heavy Data

Introduction

As collaborative data-driven research becomes the norm in the social sciences and humanities (SSH henceforth), the ability to access, move, and share datasets is critical to Danish research institutions’ ability to compete internationally. Data in SSH are often unstructured and text-heavy, and can contain personally identifiable information. For structured data, anonymization can be reduced to a two-step procedure: 1) identification of person-sensitive columns by a researcher; and 2) automated replacement of values in accordance with a unique key. (This is a simplification of the anonymization of structured data. Many advanced techniques have been proposed, but these typically consist of a method to prevent a particular de-anonymization procedure, i.e., a so-called “attack”. Such methods are therefore data-dependent and difficult to apply in a general context where the range of potential attacks is unknown.) Due to the unstructured nature of SSH data, anonymization is not such a trivial task. Data privacy therefore becomes a major obstacle that must be addressed before datasets can be released to other research communities for statistical or in-depth analysis.
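For structured data, the two-step procedure above can be sketched in a few lines of Python. This is a minimal illustration, not part of the DeepAnon design; the function name, the salt, and the `ID_` label are assumptions made for the example:

```python
import hashlib

def pseudonymize_column(rows, column, salt="pilot-secret"):
    """Step 2 of structured-data anonymization: replace values in a
    person-sensitive column (identified by a researcher in step 1)
    with stable pseudonyms derived from a unique key (the salt).

    Each distinct value maps to the same pseudonym, so record linkage
    within the dataset is preserved while the raw value is removed.
    """
    mapping = {}
    anonymized = []
    for row in rows:
        value = row[column]
        if value not in mapping:
            digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
            mapping[value] = "ID_" + digest[:8]
        new_row = dict(row)  # copy, so the input rows are not mutated
        new_row[column] = mapping[value]
        anonymized.append(new_row)
    return anonymized, mapping
```

Whether the returned mapping is stored securely or discarded depends on whether controlled re-identification by the data owner must remain possible.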

We propose DeepAnon, a novel automated approach to the anonymization of unstructured, text-heavy data, which combines multiple rule-based and statistical methods that cover a wide range of possible de-anonymization procedures. Our goal is a generic technique based on machine learning that learns the anonymization function from a set of domain-relevant training data.

Data and Research Design

To ensure maximal resource utilization, the pilot project uses the research project DeepDict as a test case. DeepDict, run by Eckhard Bick, develops a new type of lexical resource built from grammatically analysed Internet data. The project uses co-occurrence strengths between mother-daughter dependency pairs to automatically produce dictionary entries of typical complementation patterns and collocations. In relation to DeepAnon, the challenge is the use of data from the Danish Netarchive. In order to train and share the dictionary, DeepAnon needs to sanitize text elements from the Netarchive and make them GDPR compliant. We therefore need access to large text samples from the Netarchive. It is possible that we can re-use samples collected by other projects.

DeepAnon is a multi-step algorithm that combines simple pattern matching based on grammatical rules and word-lists with state-of-the-art language tools for entity recognition, tagging, and parsing. The planned prototype will be written in Python by the project manager and lead programmer. We will, however, need help with data sampling and with use of the HPC infrastructure.
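A minimal sketch of the first, rule-based stage of such a pipeline is given below. The patterns, the tiny name list, and the placeholder labels are illustrative assumptions, not the prototype's actual rules; the statistical entity-recognition stage is only indicated in a comment:

```python
import re

# Hypothetical word-list; the prototype would draw on much larger
# resources (e.g. Danish first-name registries).
NAME_LIST = {"Anders", "Mette", "Søren"}

# Rule-based patterns for common identifier formats. The first pattern
# matches the Danish CPR personal-ID format DDMMYY-XXXX.
PATTERNS = [
    (re.compile(r"\b\d{6}-\d{4}\b"), "[CPR]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b(?:\+45\s?)?\d{2}\s?\d{2}\s?\d{2}\s?\d{2}\b"), "[PHONE]"),
]

def anonymize(text):
    """Pass 1: regex patterns; pass 2: word-list lookup.
    A full pipeline would add a third, statistical NER pass here,
    replacing any remaining person entities the rules missed."""
    for pattern, label in PATTERNS:
        text = pattern.sub(label, text)
    tokens = [
        "[NAME]" if token.strip(".,;:") in NAME_LIST else token
        for token in text.split()
    ]
    return " ".join(tokens)
```

Running the patterns before the word-list matters: structured identifiers such as e-mail addresses may embed names, and replacing them first keeps the token-level pass simple.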

Discussion

With the controversial use of social media data and the GDPR just around the corner, the need for a robust, generic approach to data anonymization has never been greater. While we are developing the DeepAnon prototype for a particular research project, the ultimate goal is a tool that can be offered both as a service to other research projects and as a module in HPC pipelines at Danish research institutions. DeepAnon will be made freely available to researchers, and code written specifically for the prototype will be released under the MIT License.

The project is led by Kristoffer L. Nielbo