Dataset publication and de-identification techniques

The Information Privacy Act 2009 (Qld) (IP Act) regulates the publication[1] and overseas transfer of personal information.[2] This guideline provides a brief introduction to the tools and techniques for de-identifying data so that its publication can comply with the privacy principles in the IP Act.

What is personal information?

Personal information is defined in the IP Act[3] as ‘information or an opinion, including information or an opinion forming part of a database, whether true or not, and whether recorded in a material form or not, about an individual whose identity is apparent, or can reasonably be ascertained, from the information or opinion.’

It includes information which directly identifies an individual and information that can be compared or cross-referenced with other information to identify an individual. 

The advantage of data de-identification

Appropriately de-identified data can no longer be linked to an identifiable individual, which means it is no longer personal information and the IP Act no longer applies to it.

De-identification allows valuable datasets to be released to the Queensland community without infringing the privacy of the individuals the data concerns.     

The effort required to appropriately de-identify data will vary, but it does not have to be significant; common software applications such as Microsoft Excel include tools for basic aggregation and filtering.

Data masking

The simplest method of de-identification involves the removal of obvious identifiers such as an individual’s name or address.  For example, consider the following data:

Name | Address | Postcode | Age | Gender | Profession | Annual Salary
John Matthew Smith | 10 Jones Street, Heartville QLD | 4567 | 47 | Male | Truck driver | $62,000

which can become:

Postcode | Age | Gender | Profession | Annual Salary
4567 | 47 | Male | Truck driver | $62,000

While on the face of it this data has been stripped of its identifiers, it retains a relatively high potential for re-identification: the data still exists at an individual level and other, potentially identifying, information has been retained. For example, some Queensland postcodes have very small populations, and combining this data with other publicly available information[4] can make re-identification a relatively easy task.

It may be tempting for agencies to strip out all potentially identifying information; however, doing so could render the data meaningless. The fact that somewhere in Australia a truck driver is earning $62,000 a year has very limited potential use.
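By way of illustration, the masking step shown above can be performed programmatically. The following is a minimal Python sketch, assuming the records are held in a pandas DataFrame; the record and column names are taken from the example and are illustrative only.

```python
import pandas as pd

# Illustrative record only; the columns follow the example above.
records = pd.DataFrame([{
    "Name": "John Matthew Smith",
    "Address": "10 Jones Street, Heartville QLD",
    "Postcode": "4567",
    "Age": 47,
    "Gender": "Male",
    "Profession": "Truck driver",
    "Annual Salary": 62000,
}])

# Data masking: remove the direct identifiers and keep the remaining fields.
DIRECT_IDENTIFIERS = ["Name", "Address"]
masked = records.drop(columns=DIRECT_IDENTIFIERS)

print(masked)
```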

Pseudonymisation

A related method of de-identification is ‘pseudonymisation’, which involves consistently replacing recognisable identifiers with artificially generated identifiers, such as a coded reference or pseudonym. In the example above, John Smith would be assigned a randomly generated reference:

Individual reference | Postcode | Age | Gender | Profession | Annual Salary
46389ZX1 | 4567 | 47 | Male | Truck driver | $62,000

Pseudonymisation allows different information about an individual, often held in different datasets, to be correlated without directly identifying the individual. For example, the information above could be correlated with:

Individual reference | Marital status | Number of children | Highest level of education attained | Number of cars owned by household
46389ZX1 | Married | 2 | High school | 2

However, pseudonymisation also has a relatively high potential for re-identification, as the data still exists at an individual level and other potentially identifying information is retained. In addition, because pseudonymisation is generally used to track an individual across more than one dataset, if re-identification does occur, more personal information about the individual will be revealed.
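A minimal Python sketch of consistent pseudonymisation follows, assuming a simple in-memory lookup table; in practice the mapping (or any key used to generate it) must be stored securely or destroyed, because anyone who holds it can reverse the pseudonyms.

```python
import secrets

# Lookup table so the same individual always receives the same pseudonym.
_pseudonyms: dict[str, str] = {}

def pseudonymise(identifier: str) -> str:
    """Return a consistent, randomly generated reference for an identifier."""
    if identifier not in _pseudonyms:
        # A random token carries no information about the original value.
        _pseudonyms[identifier] = secrets.token_hex(4).upper()
    return _pseudonyms[identifier]

# The same name maps to the same reference, so records can be correlated
# across datasets without revealing who the individual is.
assert pseudonymise("John Matthew Smith") == pseudonymise("John Matthew Smith")
```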

Reducing the precision of information

Rendering personally identifiable information less precise can make the possibility of re-identification more remote.  Dates of birth or ages can be replaced by age groups; specific salaries can be replaced by salary ranges. 

For example, John Smith’s data now becomes:

Individual reference | Postcode | Age range | Gender | Profession | Annual Salary range
46389ZX1 | 4567 | 40-50 | Male | Truck driver | $60,000-$70,000

Related techniques include suppressing cells that contain rare values or very small counts, such as ages greater than 90, and conducting statistical analysis to determine whether particular values can be correlated with individuals. More advanced techniques, such as introducing random values or ‘adding noise’, involve altering the underlying data.[5]
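A minimal Python sketch of reducing precision with pandas follows; the band boundaries, the noise range, and the data are all illustrative assumptions, not recommended values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [47, 93, 35], "Annual Salary": [62000, 48000, 71000]})

# 'Adding noise': perturb the underlying salary by a small random amount.
rng = np.random.default_rng()
df["Annual Salary"] = df["Annual Salary"] + rng.integers(-2000, 2000, size=len(df))

# Replace exact ages with bands; everything over 90 collapses into a single
# top band, a simple form of suppression for rare values.
df["Age range"] = pd.cut(df["Age"], bins=[0, 30, 40, 50, 60, 90, np.inf],
                         labels=["<30", "30-40", "40-50", "50-60", "60-90", ">90"])

# Replace exact salaries with $10,000 bands.
df["Annual Salary range"] = pd.cut(df["Annual Salary"],
                                   bins=[0, 50000, 60000, 70000, 80000],
                                   labels=["<$50,000", "$50,000-$60,000",
                                           "$60,000-$70,000", "$70,000-$80,000"])

# Publish only the banded columns.
print(df.drop(columns=["Age", "Annual Salary"]))
```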

Aggregation

Individual data can be combined to provide information about groups or populations. The larger the group and the less specific the data about it, the lower the potential for identifying an individual within the group. An example of aggregated data would be:

Profession | State | Annual Salary | Number of drivers[6]
Truck driver | Queensland | <$50,000 | 47,507
Truck driver | Queensland | $51,000 | 10,843
Truck driver | Queensland | $52,000 | 19,876
Truck driver | Queensland | $53,000 | 8,748
Truck driver | Queensland | $54,000 | 11,414
Truck driver | Queensland | >$55,000 | 31,203#

# includes John Smith

The privacy protection provided by aggregation is weakened where the group size is small.
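A minimal Python sketch of producing an aggregated count table with pandas follows, including a simple check that suppresses counts for very small groups; the minimum group size of 10 is an illustrative assumption, not a prescribed threshold.

```python
import pandas as pd

# Assumed to hold one row per individual, with salaries already banded.
individual_records = pd.DataFrame({
    "Profession": ["Truck driver"] * 5,
    "State": ["Queensland"] * 5,
    "Annual Salary range": ["<$50,000", "<$50,000", "$50,000-$60,000",
                            "$50,000-$60,000", ">$60,000"],
})

aggregated = (individual_records
              .groupby(["Profession", "State", "Annual Salary range"])
              .size()
              .reset_index(name="Number of drivers"))

# Suppress counts for very small groups before publication.
MIN_GROUP_SIZE = 10  # illustrative threshold only
aggregated.loc[aggregated["Number of drivers"] < MIN_GROUP_SIZE,
               "Number of drivers"] = None

print(aggregated)
```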

Software assisted de-identification

A range of tools and algorithms have been developed to assist in de-identifying datasets. Most of these tools automate a range of the data manipulation techniques above and assess the potential for re-identification of individuals within the dataset. The tools have been developed primarily in other jurisdictions, and OIC is not aware of testing to determine their applicability in the Australian data environment. For example, some of the tools have integrated functions to reduce United States zip code data, which may not be suitable for Australian postcodes. Agencies should conduct their own research into whether the tools listed below, or other tools, are suited to their needs.

Some commonly available tools are:

Re-identification testing

Some legislative frameworks for privacy protection include expert de-identification testing and certification processes.  For example, the United States Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule includes procedures for the expert determination of identification risk and sets de-identification standards.  Australia does not have equivalent standards for assessing expertise in de-identification.

Testing de-identified datasets against other available data in an attempt to re-identify individuals is a prudent action for agencies.  This is often referred to as ‘penetration testing’ and is similar to security testing used by ICT professionals in hardening systems against external threats.  This approach has the advantage of evaluating the risks in a real world environment but can be resource intensive.  Testing would need to be conducted by trusted parties in secure environments to avoid inadvertent disclosure of personal information. 
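A minimal Python sketch of one simple re-identification check follows; public_df is a hypothetical, publicly available dataset assumed to share quasi-identifiers with the dataset proposed for release, and the check counts released records that link to exactly one individual in that public data.

```python
import pandas as pd

# Quasi-identifiers assumed to be present in both datasets.
QUASI_IDENTIFIERS = ["Postcode", "Age range", "Gender", "Profession"]

def count_unique_links(release_df: pd.DataFrame, public_df: pd.DataFrame) -> int:
    """Count released records that match exactly one record in the public dataset."""
    # How many public records share each combination of quasi-identifiers?
    match_counts = (public_df.groupby(QUASI_IDENTIFIERS)
                             .size()
                             .rename("matches")
                             .reset_index())
    linked = release_df.merge(match_counts, on=QUASI_IDENTIFIERS, how="left")
    # A record linking to exactly one public record is a re-identification candidate.
    return int((linked["matches"] == 1).sum())
```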

Differential privacy 

Another emerging software-based approach is to add an intervening application between the dataset and the data end-user. In this approach, referred to as ‘differential privacy’, the intervening application analyses data query results for re-identification risk and alters the results to remove identifying factors before releasing them to the end-user.

Advocates of this approach highlight the reduced need to manipulate entire datasets, as alterations occur only on an as-needed basis. However, the method retains a risk that an end-user could gain access to the original dataset by circumventing the software controls.
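A minimal Python sketch of the core mechanism follows, assuming the only query type is a count; Laplace noise calibrated to the query's sensitivity is added to the true answer before it is returned, and the epsilon value shown is illustrative only.

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float = 1.0, sensitivity: float = 1.0) -> float:
    """Return a count perturbed with Laplace noise, the basic differential privacy mechanism.

    A single individual can change a count by at most 1 (the sensitivity), so
    noise drawn from Laplace(scale = sensitivity / epsilon) limits what any one
    query result can reveal about that individual.
    """
    rng = np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# The end-user sees only the perturbed result, never the underlying dataset.
print(noisy_count(31203))
```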

[1] See Information Privacy Principle 11 and National Privacy Principle 2.
[2] See section 33 of the IP Act.
[3] In section 12 of the IP Act.
[4] For example, publicly available census data for the postcode 4490 (Cunnamulla) indicates that the population is 1,857, with 948 males and 909 females. The census data includes a range of other detailed information which could result in re-identification.
[5] See the United Kingdom Information Commissioner’s Office’s Anonymisation: managing data protection risk code of practice, available at: http://www.ico.gov.uk/for_organisations/data_protection/topic_guides/anonymisation.aspx.
[6] These numbers are fictitious and are used for the purposes of example only. Any correlation between these numbers and the actual numbers of Queensland truck drivers earning these annual salaries is coincidental.

Current as at: February 18, 2013