Consulting: Privacy in open data

Over the past six months, we consulted for the Meta-Research Center on their NWO Open Science Fund project, which aims to detect privacy violations in datasets. You can now use the web app and the R package we built to scan your datasets and avoid sharing data that violates privacy.

Together with prof. dr. Jelte Wicherts and dr. Rick Klein, we wrote the original grant proposal to improve scanning for potential privacy violations in datasets. After we observed in a previous study that 1 in 20 datasets contains identifying information, we concluded that better screening tools were needed. The NWO Open Science Fund made this possible.

In the project, we conducted a controlled study to investigate how well our algorithm identifies potential privacy violations. All in all, we were able to retrieve technical identifiers (e.g., email, IP addresses) with high sensitivity and precision, as well as high-risk direct identifiers (e.g., gender, bank information). Find the detailed results in the precision report on ResearchEquals.
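To make the terms sensitivity and precision concrete, here is a minimal sketch in Python. It is not the datacheck algorithm: the regular expressions, the `flag_cell` and `sensitivity_and_precision` helpers, and the toy cells and labels are all assumptions made up for illustration. Sensitivity is the share of cells that truly contain an identifier and get flagged; precision is the share of flagged cells that truly contain one.

```python
import re

# Illustrative patterns only; these are NOT the patterns datacheck uses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)


def flag_cell(value):
    """Return True if the cell text contains an email or IPv4-like string."""
    text = str(value)
    return bool(EMAIL_RE.search(text) or IPV4_RE.search(text))


def sensitivity_and_precision(cells, labels):
    """Compare flags against hand-made labels (True = contains an identifier)."""
    flags = [flag_cell(c) for c in cells]
    tp = sum(f and l for f, l in zip(flags, labels))      # correctly flagged
    fp = sum(f and not l for f, l in zip(flags, labels))  # flagged by mistake
    fn = sum(l and not f for f, l in zip(flags, labels))  # missed identifiers
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, precision


# Hypothetical toy data: the version string looks like an IP address
# but is labeled as harmless, so it counts as a false positive.
cells = ["jane.doe@example.org", "192.168.0.1", "42", "version 1.2.3.4", "call me at noon"]
labels = [True, True, False, False, False]

sens, prec = sensitivity_and_precision(cells, labels)
print(f"sensitivity={sens:.2f}, precision={prec:.2f}")
```

In this toy example both real identifiers are found, while the version string is flagged by mistake, which lowers precision but not sensitivity. A real evaluation, like the one in the precision report, needs far more data and a carefully labeled ground truth.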

We documented all the steps for this project on ResearchEquals and shared the source code openly on GitHub. The following diagram shows how the project developed, from before the grant proposal was written to its current state:

graph TD
    A[Privacy Protection in the Era of Open Science] --> B[Automatic Detection of Identifiers in Open Data - ADIODA]
    B --> C[Simulating data with identifiable information]
    B --> D[Synthetic identifying information for 100,000 individuals: A pseudo-population]
    C --> D
    B --> E[Aggregate dataset of open data without identifying information]
    D --> F[Precision of detecting identifying information]
    B --> G[URLs to open datasets hosted on the Open Science Framework]
    E --> F
    F --> H[`datacheck` R package]
    F --> I[`datacheck` web app]

As a practical next step, you can screen your own datasets with the datacheck R package or the datacheck web app before you share them, so that you avoid sharing privacy violations. The algorithm can also be applied more widely, for example in the editorial processes of journals. Are you interested in a walkthrough? Check out our project report on YouTube.


Thanks for reading this blog about the datacheck project!

If you are interested in working with us on your projects, you can book a free 15-minute consult with us. Liberate Science Labs is happy to consult with you to push the frontier on improving science 😊

Join us on our open journey!