k-Grouping

An anonymized dataset is said to have the k-anonymity property if each person's information cannot be distinguished from that of at least k-1 other individuals whose information also appears in the dataset. For example, if each individual’s record is indistinguishable from at least 2 other records, the dataset has 3-anonymity.
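To make the definition concrete, here is a minimal sketch that measures the k of a small table. It is not part of the TripleBlind SDK; the function name `k_anonymity`, the sample records, and the choice of `zip` and `age` as quasi-identifiers are all illustrative assumptions.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the largest k for which `records` is k-anonymous.

    `records` is a list of dicts; `quasi_identifiers` names the columns
    that could be combined to single out an individual (an assumption
    the caller has to make about the data).
    """
    if not records:
        return 0
    # Count how many records share each combination of quasi-identifier values.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    # The smallest group bounds how well any one record is hidden.
    return min(groups.values())


# Hypothetical, already-generalized records.
records = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "022**", "age": "40-49", "diagnosis": "diabetes"},
    {"zip": "022**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "022**", "age": "40-49", "diagnosis": "asthma"},
]

print(k_anonymity(records, ["zip", "age"]))  # 3
```

Each combination of `zip` and `age` is shared by three records, so every record is indistinguishable from two others and the table is 3-anonymous with respect to those columns.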

Inspired by k-anonymity, TripleBlind supports configuration of a “k-Grouping” safeguard as a property of a DatasetAsset, which is honored by computations in a variety of ways.

ℹ️ Setting k to 1 positions the dataset without any protection from this safeguard, because each record is then “hiding in a crowd” consisting of only itself. In other words, setting k to 1 is equivalent to disabling k-Grouping.

k-Grouping as an aggregate grouping threshold

Blind Stats

The k-value ensures that at least k records are aggregated for each grouping in a Blind Stats computation, preventing the exposure of sensitive information that can occur when an aggregate is computed over very few records. If this threshold is not met, the operation will fail with a warning message. This prevents accidental data leakage, e.g., requesting the median of a group containing only 1 record.
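The threshold behavior can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the Blind Stats implementation: `grouped_median`, `GroupTooSmallError`, and the sample rows are hypothetical names and data introduced for this example.

```python
from collections import defaultdict
from statistics import median


class GroupTooSmallError(Exception):
    """Hypothetical error raised when a grouping has fewer than k records."""


def grouped_median(rows, group_key, value_key, k):
    """Return the median of `value_key` per group, or fail if any group is below k."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])

    # Refuse the whole operation if any grouping is smaller than k,
    # mirroring the "fail with a warning" behavior described above.
    too_small = sorted(g for g, values in groups.items() if len(values) < k)
    if too_small:
        raise GroupTooSmallError(f"groups below k={k}: {too_small}")

    return {g: median(values) for g, values in groups.items()}


# Hypothetical payroll rows: department B contains a single record.
rows = [
    {"dept": "A", "pay": 50},
    {"dept": "A", "pay": 60},
    {"dept": "A", "pay": 70},
    {"dept": "B", "pay": 90},
]

print(grouped_median(rows, "dept", "pay", k=1))  # k=1 disables the safeguard

try:
    grouped_median(rows, "dept", "pay", k=3)
except GroupTooSmallError as err:
    print("refused:", err)
```

With k set to 1 the median for department B is simply that one person's pay, which is exactly the leak the threshold is meant to prevent.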

K-Means Clustering

When using the K-Means Clustering operation, the outputs of training include the overall mean and standard deviation of each column, as well as the mean of each column within each cluster.

The following adjustment happens at the end of training, using the highest k-Grouping value of the involved datasets:

  1. Clusters with fewer data points than the k-Grouping value are marked as empty.
  2. The data points of these clusters are reassigned to the nearest remaining cluster.
  3. The remaining cluster means are recalculated as the mean of all data points within that cluster.
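The three steps above can be sketched as follows. This is a simplified, standard-library illustration rather than TripleBlind's implementation: the function `apply_k_grouping` and the sample data are hypothetical, and "nearest" is assumed here to mean the smallest Euclidean distance to the cluster mean computed during training.

```python
import math


def apply_k_grouping(points, labels, centroids, k):
    """Suppress clusters smaller than k and reassign their points.

    points:    list of coordinate tuples
    labels:    cluster index assigned to each point during training
    centroids: list of cluster means, indexed by cluster
    Returns (new_labels, new_centroids), where new_centroids maps each
    remaining cluster index to its recalculated mean.
    """
    sizes = {c: 0 for c in range(len(centroids))}
    for label in labels:
        sizes[label] += 1

    # Step 1: clusters with fewer than k points are treated as empty.
    kept = [c for c, n in sizes.items() if n >= k]
    if not kept:
        raise ValueError("no cluster meets the k-Grouping threshold")

    # Step 2: move points from emptied clusters to the nearest remaining one.
    new_labels = []
    for point, label in zip(points, labels):
        if label not in kept:
            label = min(kept, key=lambda c: math.dist(point, centroids[c]))
        new_labels.append(label)

    # Step 3: recalculate each remaining mean over all of its points.
    new_centroids = {}
    for c in kept:
        members = [p for p, lab in zip(points, new_labels) if lab == c]
        new_centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return new_labels, new_centroids


# Hypothetical training result: cluster 2 holds a single point.
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0),
          (10.0, 10.0), (10.0, 12.0), (12.0, 10.0),
          (3.0, 3.0)]
labels = [0, 0, 0, 1, 1, 1, 2]
centroids = [(0.67, 0.67), (10.67, 10.67), (3.0, 3.0)]

new_labels, new_centroids = apply_k_grouping(points, labels, centroids, k=3)
print(new_labels)        # [0, 0, 0, 1, 1, 1, 0]
print(new_centroids[0])  # (1.25, 1.25)
```

The lone point in cluster 2 is absorbed into cluster 0, so the reported means never describe a group of fewer than k records.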

k-Grouping as a minimum record threshold

Blind Query & Blind Join

The k-value is also respected in the Blind Query and Blind Join operations as a minimum record threshold on the output: a query that would return fewer than k records automatically fails with a warning message.

Protecting SQL Queries

As a best practice, we encourage using a SQL HAVING clause to enact a purposeful k-Grouping safeguard within your positioned Database Asset or within the parameterized query in your Blind Report. For instance, the query in the example script (examples/Blind_Report/1_position_bigquery_report.py) is:

query_template = """
SELECT Dept_Name, {{demographic}}, AVG({{pay}}) as average_{{pay}} from tripleblind_datasets.city_of_somerville_payroll
GROUP BY Dept_Name, {{demographic}};
"""

This can be modified to respect a k-Grouping safeguard by introducing a clause that returns only groups with at least a certain number of records:

query_template = """
SELECT Dept_Name, {{demographic}}, AVG({{pay}}) as average_{{pay}} from tripleblind_datasets.city_of_somerville_payroll
GROUP BY Dept_Name, {{demographic}}
HAVING COUNT({{demographic}}) >= 5;
"""

With this clause, you ensure that each returned group contains at least 5 members, so the report is less likely to inadvertently give a malicious actor enough information to discern personally identifiable information from its contents (e.g., the average salary of a single individual).
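To see the effect of the HAVING clause without a BigQuery connection, the following sketch runs the same query shape against an in-memory SQLite database. The `payroll` table, its columns, and its rows are hypothetical stand-ins for the example dataset. The sketch uses `COUNT(*)`, which counts every row in a group, whereas `COUNT(column)` skips rows where that column is NULL.

```python
import sqlite3

K = 5  # example k-Grouping threshold

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payroll (dept_name TEXT, gender TEXT, pay REAL)")

# Hypothetical rows: two groups of 5+ people and one group of a single person.
rows = (
    [("Police", "M", 60000.0 + 1000 * i) for i in range(6)]
    + [("Police", "F", 65000.0 + 1000 * i) for i in range(5)]
    + [("Library", "F", 48000.0)]
)
conn.executemany("INSERT INTO payroll VALUES (?, ?, ?)", rows)

base_query = """
SELECT dept_name, gender, AVG(pay) AS average_pay
FROM payroll
GROUP BY dept_name, gender
"""

# Without the safeguard, the single-person group exposes one individual's pay.
unprotected = conn.execute(base_query + "ORDER BY dept_name, gender").fetchall()

# With the safeguard, groups smaller than K are dropped from the result.
protected = conn.execute(
    base_query + "HAVING COUNT(*) >= ? ORDER BY dept_name, gender", (K,)
).fetchall()

print(unprotected)
print(protected)  # [('Police', 'F', 67000.0), ('Police', 'M', 62500.0)]
conn.close()
```

The unprotected result includes the row `('Library', 'F', 48000.0)`, which is one person's exact pay; the protected result omits it.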