LibGuides: Research Data Services: Sensitive Data

Sensitive Data

Researchers and their teams need to be aware of the policies and processes with which their research data must comply. Where sensitive data cannot be made public for ethical, policy, or legal reasons, research teams should consider whether de-identifying the data, i.e. removing direct identifiers, is possible and would allow for safe sharing.

Direct identifiers are those which place study participants at immediate risk of being re-identified. The following list is based on various sources, including guidance from major international funding agencies, the US Health Insurance Portability and Accountability Act (HIPAA) and the British Medical Journal.

Direct identifiers include:

Names or initials, as well as names of relatives or household members
Addresses, and small area geographic identifiers such as postal codes / zip codes
Telephone numbers
Electronic identifiers such as web addresses, email addresses, social media handles, or IP addresses of individual computers
Unique identifying numbers such as hospital IDs, Social Insurance Numbers, clinical trial record numbers, account numbers, certificate or license numbers
Exact dates relating to individually-linked events such as birth or marriage, date of hospital admission or discharge, or date of a medical procedure
Multimedia data: unaltered photographs, audio, or videos of individuals
Biometric identifiers including finger or voice prints, and iris or retinal images
Human genomic data, unless risk was explained and consent to share data or consent for secondary use of data was received from study participants
Age information for individuals over 89 years old
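
As a rough illustration of how the list above might be applied to tabular records, the following Python sketch drops a set of direct identifier fields, reduces exact dates to the year, and top-codes ages over 89. The field names (`name`, `hospital_id`, `admission_date`, and so on) and the choice to keep only the year are assumptions made for this example, not requirements drawn from any of the cited sources; a real project would derive its own field list from its data and its ethics approval.

```python
from datetime import date

# Hypothetical field names; replace them with the columns in your own dataset.
FIELDS_TO_DROP = {"name", "address", "postal_code", "phone", "email",
                  "hospital_id", "birth_date"}
DATE_FIELDS = {"admission_date", "discharge_date"}


def strip_direct_identifiers(record: dict) -> dict:
    """Return a copy of one record with direct identifiers removed or coarsened."""
    cleaned = {}
    for field, value in record.items():
        if field in FIELDS_TO_DROP:
            continue  # remove the identifier entirely
        if field in DATE_FIELDS and isinstance(value, date):
            # Assumption: keeping only the year is coarse enough for this dataset.
            cleaned[field.replace("_date", "_year")] = value.year
        elif field == "age" and isinstance(value, int) and value > 89:
            cleaned[field] = "90+"  # top-code ages over 89
        else:
            cleaned[field] = value
    return cleaned


record = {
    "name": "Jane Roe",
    "email": "jane@example.org",
    "admission_date": date(2021, 3, 14),
    "age": 94,
    "diagnosis": "asthma",
}
print(strip_direct_identifiers(record))
```

Removing direct identifiers in this way does not address indirect identifiers, so the result should still be reviewed before sharing.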

For research projects involving human participants and human biological materials, these decisions must align with UVic's Human Research Ethics requirements.

Data Protection Terms

Method	Description
Anonymization	Direct and indirect identifiers have been removed or manipulated, together with mathematical and technical guarantees, to prevent re-identification. Example: Calibrated noise (meaningless data) is added to a dataset to hide whether an individual is present or not.
De-identification	Direct and known indirect identifiers have been removed or manipulated to break the linkage to real-world identities. Example: Data are suppressed, generalized, perturbed, or swapped; e.g., GPA: 3.2 → 3.0-3.5 (generalized), gender: female → gender: male (swapped).
Pseudonymization	Information from which direct identifiers have been eliminated or transformed, but indirect identifiers remain intact. Example: Unique, artificial pseudonyms replace direct identifiers; e.g., John Doe → 5L7T LX619Z (unique sequence not used anywhere else).
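
The three methods in the table can be illustrated with a short Python sketch. It is a simplified example under stated assumptions, not a vetted implementation: `generalize_gpa`, `Pseudonymizer`, and `noisy_count` are hypothetical names introduced here, the bin width of 0.5 is arbitrary, and the noise function uses Laplace noise (a common choice in differential privacy) with the standard-library `random` module, which is not suitable for production privacy guarantees.

```python
import random
import secrets


def generalize_gpa(gpa: float, width: float = 0.5) -> str:
    """De-identification by generalization: replace an exact GPA with its bin."""
    low = (gpa // width) * width
    return f"{low:.1f}-{low + width:.1f}"


class Pseudonymizer:
    """Pseudonymization: map each direct identifier to a unique random token."""

    def __init__(self) -> None:
        self._mapping = {}   # identifier -> token; must be stored securely
        self._used = set()   # tokens already issued, to guarantee uniqueness

    def pseudonym_for(self, identifier: str) -> str:
        if identifier not in self._mapping:
            token = secrets.token_hex(6).upper()
            while token in self._used:  # extremely unlikely, but checked anyway
                token = secrets.token_hex(6).upper()
            self._used.add(token)
            self._mapping[identifier] = token
        return self._mapping[identifier]


def noisy_count(true_count: int, epsilon: float) -> float:
    """Anonymization-style release: add Laplace noise to a count.

    Assumes each individual changes the count by at most 1, so the noise
    scale is 1 / epsilon. The difference of two exponential draws with the
    same rate follows a Laplace distribution centred on zero.
    """
    scale = 1.0 / epsilon
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise


print(generalize_gpa(3.2))            # prints 3.0-3.5
p = Pseudonymizer()
print(p.pseudonym_for("John Doe"))    # a random 12-character token
print(noisy_count(100, epsilon=1.0))  # a value near 100, different on each run
```

Note that the pseudonym mapping itself is a re-identification key: anyone who holds it can reverse the process, which is why pseudonymized data still leave indirect identifiers and linkage risk in place.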

Open Source Tools

Researchers may consider using algorithm-based tools to help anonymize their data and reduce the risk of re-identification. A range of open source software is available.

	ARX	Amnesia	Anonimatron
Website	https://arx.deidentifier.org/	https://amnesia.openaire.eu/	https://realrolfje.github.io/anonimatron/
Purpose	Anonymization, de-identification	Anonymization	Anonymization
System Requirement	Available for Windows, macOS or Linux	Available for Windows and Linux in addition to an online version	Available for Windows, macOS or Linux
Notable Features	Supports popular models for protecting data, including k-anonymity and the variants ℓ-diversity, t-closeness, and β-likeness; allows end users to categorize, top- and bottom-code, generalize, and transform data in more complex ways; extracts data from CSV, Excel, and DBMS sources	General Data Protection Regulation (GDPR) compliant; supports k-anonymity and k^m-anonymity; the anonymized data can be saved locally or directly to Zenodo; may work best for clinical data or other data that are not survey data	General Data Protection Regulation (GDPR) compliant; supports popular models for protecting data, such as k-anonymity; can generate fake emails, names, or IDs
Limitations	Large datasets take time to load, and computation time for large or complex datasets may be lengthy	Cannot specify missing values; sparse documentation for defining hierarchies	The output of the Anonimatron anonymization process may retain certain (statistical) properties of the input dataset
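
All three tools list k-anonymity among the models they support. A dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The Python sketch below measures k for a small set of records; it is only an illustration of the definition and does not reflect the interface of ARX, Amnesia, or Anonimatron. The function name `k_anonymity` and the sample fields (`age_band`, `postal_prefix`, `diagnosis`) are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    values for all quasi-identifiers (0 for an empty dataset)."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical, already-generalized sample data.
records = [
    {"age_band": "30-39", "postal_prefix": "V8P", "diagnosis": "asthma"},
    {"age_band": "30-39", "postal_prefix": "V8P", "diagnosis": "diabetes"},
    {"age_band": "40-49", "postal_prefix": "V8W", "diagnosis": "asthma"},
    {"age_band": "40-49", "postal_prefix": "V8W", "diagnosis": "asthma"},
    {"age_band": "40-49", "postal_prefix": "V8W", "diagnosis": "influenza"},
]

print(k_anonymity(records, ["age_band", "postal_prefix"]))  # prints 2
```

A result of 2 means that at least one combination of age band and postal prefix is shared by only two records; further generalization or suppression would be needed to reach a higher k.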

Tools for Sensitive Data
A PDF copy of the information recorded on this page for ease of viewing

Additional Resources

Managing and sharing sensitive data can be complex and requires skill and expertise. Consult the following resources to learn more about how to share sensitive data responsibly.

Sensitive Data Toolkit (Portage)
De-Identification Guidance (Portage)
A Visual Guide to Practical Data De-Identification (Kelsey Finch)
ACRL Primer for Protecting Sensitive Data in Academic Research (Association of College & Research Libraries)
Data Anonymization and De-Identification Guide (University of British Columbia)
Shades of Gray: Seeing the Full Spectrum of Practical Data De-Identification (Polonetsky et al., 2016)