
Research Data Services

Guidance, tools, and training to support faculty and students working with research data.

Sensitive Data

Researchers and their teams need to be aware of the policies and processes with which their research data must comply. Where sensitive data cannot be made public for ethical, policy, or legal reasons, research teams should consider whether de-identifying the data, i.e., removing direct identifiers, would make safe sharing possible.

Direct identifiers are those that place study participants at immediate risk of being re-identified. The following list is based on several sources, including guidance from major international funding agencies, the US Health Insurance Portability and Accountability Act (HIPAA), and the British Medical Journal.

Direct identifiers include the following (a short illustration of removing or masking them appears after the list):

  • Names or initials, as well as names of relatives or household members
  • Addresses, and small area geographic identifiers such as postal codes / zip codes
  • Telephone numbers
  • Electronic identifiers such as web addresses, email addresses, social media handles, or IP addresses of individual computers
  • Unique identifying numbers such as hospital IDs, Social Insurance Numbers, clinical trial record numbers, account numbers, certificate or license numbers
  • Exact dates relating to individually linked events such as birth or marriage, date of hospital admission or discharge, or date of a medical procedure
  • Multimedia data: unaltered photographs, audio, or videos of individuals
  • Biometric identifiers including finger or voice prints, and iris or retinal images
  • Human genomic data, unless risk was explained and consent to share data or consent for secondary use of data was received from study participants
  • Age information for individuals over 89 years old
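
As a deliberately simplified illustration, the sketch below drops direct-identifier columns from a small tabular dataset and top-codes ages over 89. It assumes pandas, and the column names and values are hypothetical; real projects should follow the de-identification plan approved for the study.

import pandas as pd

# Hypothetical study export; the column names and values are illustrative only.
df = pd.DataFrame({
    "name": ["A. Smith", "B. Jones"],
    "email": ["a.smith@example.com", "b.jones@example.com"],
    "postal_code": ["V8W 2Y2", "V8P 5C2"],
    "birth_date": ["1932-04-02", "1988-11-30"],
    "age": [92, 36],
    "response_score": [4, 2],
})

# Columns treated as direct identifiers for this example
direct_identifiers = ["name", "email", "postal_code", "birth_date"]

# Remove direct identifiers outright ...
deidentified = df.drop(columns=direct_identifiers)

# ... and top-code ages over 89, since exact ages in that range are identifying
deidentified["age"] = deidentified["age"].apply(lambda a: "90+" if a > 89 else str(a))

print(deidentified)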

For research projects involving human participants and human biological materials, these decisions must align with UVic's Human Research Ethics requirements.


Data Protection Terms

Anonymization

Direct and indirect identifiers have been removed or manipulated, with mathematical and technical guarantees that prevent re-identification.

Example: Random noise calibrated to the dataset is added so that it cannot be determined whether any particular individual is present.
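
This calibrated-noise idea underlies differential privacy, where noise scaled to a query's sensitivity is added to released statistics so that the presence or absence of any one individual cannot be inferred. A minimal sketch of a noisy counting query follows; the epsilon value and the query are illustrative, not recommendations.

import math
import random

def noisy_count(true_count: int, epsilon: float) -> float:
    """Return a count with Laplace noise calibrated to the query's
    sensitivity (1 for a counting query), hiding whether any single
    individual is present in the data."""
    scale = 1.0 / epsilon                      # sensitivity / epsilon
    u = random.random() - 0.5                  # uniform on [-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# e.g. report how many participants answered "yes" without revealing
# whether any particular person took part
print(noisy_count(true_count=42, epsilon=0.5))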

De-identification

Direct and known indirect identifiers have been removed or manipulated to break the linkage to real-world identities.

Example: Data are suppressed, generalized, perturbed, or swapped; e.g., an exact GPA of 3.2 is generalized to the range 3.0-3.5, or gender values are swapped between records (female becomes male and vice versa).
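
A minimal sketch of two of these operations, generalization and swapping, applied to illustrative records; the field names and GPA bands are assumptions, not a standard.

# Illustrative records; field names and values are hypothetical.
records = [
    {"id": 1, "gpa": 3.2, "gender": "female"},
    {"id": 2, "gpa": 2.7, "gender": "male"},
]

def generalize_gpa(gpa: float) -> str:
    """Generalization: replace an exact GPA with a coarser band."""
    bands = [(2.0, "below 2.0"), (2.5, "2.0-2.5"), (3.0, "2.5-3.0"), (3.5, "3.0-3.5")]
    for upper, label in bands:
        if gpa < upper:
            return label
    return "3.5-4.0"

for record in records:
    record["gpa"] = generalize_gpa(record["gpa"])

# Swapping: exchange a quasi-identifier between two records so that neither
# row can be linked back to its original combination of values.
records[0]["gender"], records[1]["gender"] = records[1]["gender"], records[0]["gender"]

print(records)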

Pseudonymization

Information from which direct identifiers have been removed or transformed, while indirect identifiers remain intact.

Example: Unique, artificial pseudonyms replace direct identifiers; e.g., John Doe becomes 5L7T LX619Z (a unique sequence not used anywhere else).
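
A minimal sketch of this mapping, using Python's secrets module to generate unique tokens. The token format is illustrative, and the key table linking pseudonyms back to identities must be stored separately and securely.

import secrets

# Key table linking identities to pseudonyms; in practice this is kept
# apart from the research data, with restricted access.
pseudonym_key: dict[str, str] = {}

def pseudonymize(name: str) -> str:
    """Replace a direct identifier with a unique, artificial pseudonym."""
    if name not in pseudonym_key:
        pseudonym_key[name] = secrets.token_hex(6).upper()
    return pseudonym_key[name]

print(pseudonymize("John Doe"))   # e.g. 'A31F0C9D2B47'; stable for the same input
print(pseudonymize("John Doe"))   # same pseudonym again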


Open Source Tools

Researchers may consider using algorithm-based tools to help anonymize their data and reduce the risk of re-identification. A range of open source software is available; three options are compared below, followed by a short sketch of the k-anonymity model that several of them support.

 

ARX

Website

https://arx.deidentifier.org/

Purpose

  • Anonymization
  • De-identification

System Requirement

  • Available for Windows, macOS, or Linux

Notable Features

  • Supports popular privacy models including k-anonymity and the variants ℓ-diversity, t-closeness, and β-likeness
  • Allows end users to categorize, top- and bottom-code, generalize, and transform data in more complex ways
  • Extracts data from CSV, Excel, and DBMS sources

Limitations

  • Large datasets take time to load, and computation time for large or complex datasets may be lengthy

Amnesia

Website

https://amnesia.openaire.eu/

Purpose

  • Anonymization

System Requirement

  • Available for Windows and Linux, in addition to an online version

Notable Features

  • General Data Protection Regulation (GDPR) compliant
  • Supports k-anonymity and km-anonymity
  • Anonymized data can be saved locally or published directly to Zenodo
  • May work best for clinical data, or data which are not survey data

Limitations

  • Cannot specify missing values
  • Sparse documentation for defining hierarchies

Anonimatron

Website

https://realrolfje.github.io/anonimatron/

Purpose

  • Anonymization

System Requirement

  • Available for Windows, macOS, or Linux

Notable Features

  • General Data Protection Regulation (GDPR) compliant
  • Supports popular privacy models such as k-anonymity
  • Can generate fake emails, names, or IDs

Limitations

  • The output of the anonymization process may retain certain (statistical) properties of the input dataset
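
Several of these tools are built around k-anonymity, which requires that every combination of quasi-identifier values appear in at least k records. The sketch below is a minimal check of that property, assuming pandas and illustrative column names; it is not a substitute for the tools above or for specialist support.

import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    """True if every combination of quasi-identifier values occurs in at least k rows."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Illustrative, already-generalized dataset
df = pd.DataFrame({
    "age_band": ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "region":   ["Island", "Island", "Island", "Mainland", "Mainland"],
    "score":    [4, 2, 5, 3, 1],
})

print(is_k_anonymous(df, ["age_band", "region"], k=2))   # True
print(is_k_anonymous(df, ["age_band", "region"], k=3))   # False: the 40-49 / Mainland group has only 2 rows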


Additional Resources

Managing and sharing sensitive data can be a complex undertaking that requires skill and expertise. Consult the following resources to learn more about how to share sensitive data responsibly.

Creative Commons License
This work by The University of Victoria Libraries is licensed under a Creative Commons Attribution 4.0 International License unless otherwise indicated when material has been used from other sources.