K-Means Clustering

TripleBlind supports privacy-preserving K-Means clustering over vertically-partitioned datasets and can incorporate a Private Set Intersection (Blind Match) among multiple data providers into the operation.

Operation

When using add_agreement() to create an agreement on a trained model, use Operation.EXECUTE for the operation parameter.

When using add_agreement() to allow a counterparty to use your dataset for model training, or when using create_job() to train a K-Means clustering model, use the appropriate operation parameter below.
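
As an illustration, an agreement on a trained model might be created as in the hedged sketch below. Only add_agreement() and Operation.EXECUTE come from this page; the module alias tripleblind, the Asset.find() lookup, and the keyword names are assumptions, not verbatim SDK usage.

    import tripleblind as tb

    # Hypothetical lookup; the asset name and helper are placeholders.
    model_asset = tb.Asset.find("kmeans_model")

    # Grant a counterparty the right to execute the trained model.
    # The keyword names here are assumptions; consult the SDK reference.
    model_asset.add_agreement(
        with_org="partner-org-id",
        operation=tb.Operation.EXECUTE,
    )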

PSI Vertical KMeans Clustering

Use Operation.PSI_VERTICAL_KMEANS_TRAIN to identify the overlap of matching records across datasets and train a K-Means clustering model on the vertically-partitioned intersection.

Training parameters

num_clusters: Optional[int] = 8

  • Number of clusters to be created

num_iter: Optional[int] = 10

  • Maximum number of iterations the clustering algorithm runs during training

dataset: Union[Asset, List[Asset]]

  • List of positioned assets to conduct clustering across
  • The initiator of the training job must put the asset they own first in the list

psi: Dict{ "match_column": Union[str, List[str]] }

  • Name of the column to match. If not the same in all datasets, a list of the matching column names, one for each dataset in the order supplied above.
  • If a single field name is provided, each dataset must have the same name for that match_column, e.g. “ID”.

normalization: Optional[bool] = True

  • Determines whether or not TripleBlind will transform the columns of the data to zero mean unit variance.
  • This transformation is applied to the intersection of the datasets, not to the data before applying PSI.

share_cluster_size: Optional[bool] = False

  • Determines whether cluster sizes are revealed at each training iteration; revealing them makes the calculation more accurate for extremely large clusters.
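
Putting the parameters together, a training job might be created as in the sketch below. The operation name and create_job() appear on this page, but the module alias, asset lookups, keyword names, and job-lifecycle calls are assumptions rather than verbatim SDK usage.

    import tripleblind as tb

    # Hypothetical lookups; the initiator's own asset must come first.
    my_asset = tb.Asset.find("my_customer_table")
    partner_asset = tb.Asset.find("partner_customer_table")

    job = tb.create_job(
        job_name="psi-vertical-kmeans-demo",
        operation=tb.Operation.PSI_VERTICAL_KMEANS_TRAIN,
        dataset=[my_asset, partner_asset],
        params={
            "num_clusters": 8,
            "num_iter": 10,
            "psi": {"match_column": "ID"},  # same column name in both datasets
            "normalization": True,
            "share_cluster_size": False,
        },
    )

    # Submission/monitoring calls are assumptions as well.
    if job.submit():
        job.wait_for_completion()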

ℹ️ Normalization has the benefit of ensuring that each feature of the data is given the same weight in the clustering. For example, without normalization a value of 100 for home price would be considered as significant as a value of 100 for a person’s height, even though $100 of home price is insignificant while 100 inches of height is enormous. If the data is already normalized, leaving normalization on will have no noticeable impact on the performance or accuracy of the operation.

One reason not to normalize is to weight the importance of each column differently. To do this, scale each column so that its variance is proportional to the column’s weight and pass normalization=False, as in the sketch below. Note that if the dataset contains large values (like home prices), the calculation could overflow, resulting in an error or nonsensical output values. To a large extent, this can be mitigated by scaling all the data down so that the largest column variance equals 1, and shifting the data so that the mean of each column is 0.
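
A minimal numpy/pandas sketch of that weighting recipe, with made-up column names and weights:

    import numpy as np
    import pandas as pd

    # Toy data; the column names and weights are placeholders.
    df = pd.DataFrame({
        "home_price": [250_000.0, 410_000.0, 325_000.0, 515_000.0],
        "height_in": [64.0, 71.0, 68.0, 66.0],
    })
    weights = {"home_price": 1.0, "height_in": 2.0}  # relative importance
    max_w = max(weights.values())

    scaled = df.copy()
    for col, w in weights.items():
        scaled[col] -= scaled[col].mean()        # zero mean per column
        scaled[col] /= scaled[col].std(ddof=0)   # unit variance per column
        scaled[col] *= np.sqrt(w / max_w)        # variance = w, largest capped at 1

    # Pass the pre-scaled data to training along with normalization=False.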

Training outputs

cluster_means: numpy.ndarray[float64]

  • The centers, or means, of the identified clusters

is_cluster_empty: numpy.ndarray[int64]

  • Indicates whether each cluster is empty (1) or not (0).
  • Sometimes training returns empty clusters based on how the data is arranged or to maintain the privacy of individual data points.
  • Empty clusters are ignored in inference, regardless of their proximity to the data.

zmuv_mean: numpy.ndarray[float64]

  • The means of the columns.
  • Used in inference to reapply the data normalization; see 3a_local_inference.py for an example (a sketch also follows this list).

zmuv_linear: numpy.ndarray[float64]

  • The reciprocal of the standard deviations of the columns.
  • Used in inference to reapply the data normalization; see 3a_local_inference.py for an example.

labels: pandas.DataFrame

  • The cluster that each training data point is assigned to.

inertia: numpy.float64

  • The sum of squared distances between each training data point and the nearest cluster mean.
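
As a sketch of how these outputs fit together, local inference might look like the following. This is an illustrative reconstruction, not the shipped 3a_local_inference.py, and it assumes cluster_means are expressed in the normalized space.

    import numpy as np

    def local_predict(X, cluster_means, is_cluster_empty, zmuv_mean, zmuv_linear):
        # Reapply the normalization: subtract the column means, then
        # multiply by the reciprocal standard deviations.
        Xn = (X - zmuv_mean) * zmuv_linear
        # Squared distance from every point to every cluster mean.
        d2 = ((Xn[:, None, :] - cluster_means[None, :, :]) ** 2).sum(axis=2)
        # Empty clusters are ignored, regardless of proximity.
        d2[:, is_cluster_empty == 1] = np.inf
        labels = d2.argmin(axis=1)        # nearest non-empty cluster per point
        inertia = d2.min(axis=1).sum()    # sum of squared nearest distances
        return labels, inertia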

Inference parameters

dataset: Union[Asset, List[Asset]]

  • List of positioned assets to conduct clustering across
  • The initiator of the inference must put the asset they own first in the list

psi: Dict{ "match_column": Union[str, List[str]] }

  • Name of the column to match. If not the same in all datasets, a list of the matching column names, one for each dataset above, in order.
  • If a single field name is provided, each dataset must have the same name for that match_column, e.g. “ID”.

Inference outputs

labels: pandas.DataFrame

  • The cluster that each inference data point is assigned to.

inertia: numpy.float64

  • The sum of squared distances between each inference data point and the nearest cluster mean.

k-Grouping

The outputs of training include the mean and (reciprocal) standard deviation of each column (zmuv_mean and zmuv_linear), as well as the mean of each column within each cluster (cluster_means).

The following adjustment happens at the end of training, using the highest k-Grouping value of the involved datasets:

  1. Clusters with fewer data points than k-Grouping are marked as empty.
  2. The data points of these clusters are reassigned to the nearest remaining cluster.
  3. The remaining cluster means are recalculated as the mean of all data points within that cluster.
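
In numpy terms, the adjustment looks roughly like the following illustrative sketch, where labels maps each data point to a cluster and k_grouping is the highest k-Grouping value among the datasets:

    import numpy as np

    def apply_k_grouping(X, labels, means, k_grouping):
        counts = np.bincount(labels, minlength=len(means))
        empty = counts < k_grouping                  # 1. mark small clusters empty
        keep = np.flatnonzero(~empty)
        for i in np.flatnonzero(empty[labels]):      # 2. reassign their points
            d2 = ((X[i] - means[keep]) ** 2).sum(axis=1)
            labels[i] = keep[d2.argmin()]
        for c in keep:                               # 3. recompute remaining means
            means[c] = X[labels == c].mean(axis=0)
        return labels, means, empty.astype(np.int64)  # last value ~ is_cluster_empty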

Limitations

  • Currently supports clustering across two datasets only.
  • Currently supports datasets whose total size (rows × columns) is at most 20 million values.
    • e.g. 400 columns × 50,000 rows = 20 million
  • NaN values are not supported and should be imputed during preprocessing.
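
A quick pandas preprocessing pass covering the last two limitations might look like this; the file name and fill strategy are placeholders.

    import pandas as pd

    df = pd.read_csv("my_dataset.csv")  # placeholder path

    # Stay within the rows * columns <= 20 million limit.
    assert df.shape[0] * df.shape[1] <= 20_000_000, "dataset exceeds 20M values"

    # Impute NaNs before positioning the asset, e.g. with column means.
    df = df.fillna(df.mean(numeric_only=True))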