Sensitive data

Sensitive data is data that could cause harm if it were made publicly available. This ranges from data that affects national security to data that would damage an individual’s reputation or personal relationships. Laws and regulations govern certain types of data, e.g. controlled unclassified information (CUI) and protected health information (PHI); however, the impact of data release should be considered for all research projects.

In data from human subjects, topics that are generally considered sensitive include PHI (more on this below), data on criminal behaviors (including drug use), disciplinary records, mental health information, sexual behavior, unique biometric information, and account or ID numbers. Outside of human subjects, what is considered sensitive will vary depending on the field. For example, geographic location may be sensitive for data on exceedingly valuable minerals, but not for crop irrigation. For all types of data, contracts and regulations may determine the sensitivity of the data.

  1. PII - Personally Identifiable Information (PII) is any information that can be used to distinguish an individual’s identity. Some types of PII are considered sensitive PII and should be treated as high risk regardless of use type: Social Security numbers and other ID numbers, financial information, biometric identifiers (e.g. retinal scans, fingerprints), vehicle identifiers and property serial numbers, etc. (Note: sensitive data that is collected strictly for business purposes should be treated as institutional data, and stored separately from research data.)

    Some types of PII are considered part of the public domain – information you would find in public directories, like your name, street and email addresses, and telephone numbers. These are all direct identifiers. Then there are the indirect identifiers: personal characteristics that, when combined, may be able to pinpoint a single individual. These are typically collected as demographic information in human subjects data, and the risk increases as the combination of characteristics becomes rarer. For example, a 68-year-old Asian woman who practices Buddhism is not identifiable among the US population, but a 68-year-old Nepalese woman who practices Buddhism in Greeley County, Nebraska (population 98% White) is very likely identifiable.
  2. PHI, HIPAA, FERPA, etc. – Much of the data about people is collected for business purposes and therefore without informed consent for research use. Two particular domains of concern are health and education.
    1. PHI + HIPAA: Personal health data that is collected for the purposes of providing medical care is called Protected Health Information (PHI) and is governed by HIPAA – the Health Insurance Portability and Accountability Act of 1996. HIPAA defines how such data can be used and with whom it can be shared. Data that comes from certain “covered entities” is automatically subject to HIPAA. De-identified data from covered entities is no longer subject to HIPAA regulations. Data that is provided directly from an individual to a researcher with informed consent is NOT PHI and is not bound by HIPAA, although the data may still be considered sensitive out of respect for persons.
    2. FERPA: Student education data is protected under the Family Educational Rights and Privacy Act (FERPA). Like PHI, this is data that was captured for a non-research purpose, and consent has not been granted for other uses. Also like PHI, de-identified data is no longer subject to FERPA regulations. Data that is provided directly from an individual to a researcher with informed consent is not bound by FERPA, although the data may still be considered sensitive out of respect for persons. The National Center for Education Statistics provides thorough guidance on reporting of aggregate information to protect student privacy.
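
The risk from indirect identifiers described under PII can be checked roughly before data is shared: count how many records share each combination of demographic values and flag the combinations that are rare. The sketch below is a simple k-anonymity-style check, not a method prescribed by HIPAA, FERPA, or this guidance. The records, the column names, and the threshold `k` are all made-up assumptions for illustration.

```python
from collections import Counter

# Hypothetical, invented records. Column names are illustrative assumptions.
records = [
    {"age_band": "65-69", "ethnicity": "Asian", "religion": "Buddhist", "county": "County A"},
    {"age_band": "65-69", "ethnicity": "Asian", "religion": "Buddhist", "county": "County A"},
    {"age_band": "65-69", "ethnicity": "Asian", "religion": "Buddhist", "county": "County A"},
    {"age_band": "65-69", "ethnicity": "Nepalese", "religion": "Buddhist", "county": "County B"},
    {"age_band": "30-34", "ethnicity": "White", "religion": "None", "county": "County B"},
    {"age_band": "30-34", "ethnicity": "White", "religion": "None", "county": "County B"},
]

# The indirect identifiers to evaluate together (an assumption for this example).
INDIRECT_IDENTIFIERS = ("age_band", "ethnicity", "religion", "county")


def group_sizes(rows, keys):
    """Count how many rows share each combination of values for `keys`."""
    return Counter(tuple(row[key] for key in keys) for row in rows)


def rows_at_risk(rows, keys, k):
    """Return the rows whose combination of `keys` is shared by fewer than k rows."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[key] for key in keys)] < k]


if __name__ == "__main__":
    # With k=2, only records that are unique on these identifiers are flagged.
    for row in rows_at_risk(records, INDIRECT_IDENTIFIERS, k=2):
        print(row)
```

A flagged record does not prove that a person can be identified, and an unflagged one is not guaranteed safe; the check only shows where a combination of characteristics is rare within the dataset itself. Choosing `k` and the set of identifiers is a judgment call that depends on the population and the applicable contracts or regulations.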