KMeans Clustering
TripleBlind supports privacy-preserving KMeans clustering over vertically partitioned datasets and can incorporate a Private Set Intersection (Blind Match) among multiple data providers into the operation.
Operation
When using add_agreement() to forge an agreement on a trained model, use Operation.EXECUTE for the operation parameter.
When using add_agreement() to allow a counterparty to use your dataset for model training, or using create_job() to train KMeans clustering, use the appropriate operation parameter below.
PSI Vertical KMeans Clustering
Use Operation.PSI_VERTICAL_KMEANS_TRAIN to identify an overlap of matching records across datasets and train a KMeans clustering model on the vertically partitioned intersection.
Training parameters
num_clusters: Optional[int] = 8
 Number of clusters to be created
num_iter: Optional[int] = 10
 Maximum number of iterations the clustering algorithm runs during training
dataset: Union[Asset, List[Asset]]
 List of positioned assets to conduct clustering across
 The initiator of the training job must put the asset they own first in the list
psi: Dict{ "match_column": Union[str, List[str]] }
 Name of the column to match. If the name is not the same in all datasets, supply a list of the matching column names, one for each dataset, in the order supplied above. If a single field name is provided, each dataset must use that same name for its match_column, e.g. “ID”.
normalization: Optional[bool] = True
 Determines whether or not TripleBlind will transform the columns of the data to zero mean and unit variance.
 This transformation is applied to the intersection of the datasets, not to the data before applying PSI.
share_cluster_size: Optional[bool] = False
 Determines whether the cluster sizes are revealed at each training iteration so that the calculation is more accurate for extremely large cluster sizes.
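The match_column rule above can be sketched in plain Python. This is a minimal illustration only: resolve_match_columns and train_params are hypothetical names, the column name "customer_id" is invented for the example, and how these parameters are actually passed to create_job() is not shown here and should be taken from the SDK itself.

```python
from typing import Any, Dict, List, Union


def resolve_match_columns(
    match_column: Union[str, List[str]], num_datasets: int
) -> List[str]:
    """Return one match-column name per dataset, in dataset order."""
    if isinstance(match_column, str):
        # A single name means every dataset uses that same column name.
        return [match_column] * num_datasets
    if len(match_column) != num_datasets:
        raise ValueError("provide exactly one match column name per dataset")
    return list(match_column)


# Hypothetical parameter dictionary using the documented names and defaults.
# The flat layout is an assumption, not the SDK's confirmed structure.
train_params: Dict[str, Any] = {
    "num_clusters": 8,
    "num_iter": 10,
    "psi": {"match_column": ["ID", "customer_id"]},
    "normalization": True,
    "share_cluster_size": False,
}

columns = resolve_match_columns(train_params["psi"]["match_column"], num_datasets=2)
print(columns)
```

Passing a single string such as "ID" yields the same name for every dataset, which mirrors the requirement that each dataset then has a column with that name.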
ℹ️ Normalization has the benefit of ensuring that each feature of the data is given the same amount of weight in the clustering. For example, without normalization a value of 100 for home price would be considered as significant as the value 100 for a person’s height, even though $100 in a home price is insignificant but 100 inches of height is incredibly significant. If the data is already normalized, then leaving normalization on will have no noticeable impact on the performance or accuracy of the operation.
A potential reason to not normalize is to apply different weights to the importance of each column. This can be done by scaling each column so that its variance is proportional to the weight of the column and passing in normalization=False. If the dataset contains large values (like home prices), then the calculation could overflow, resulting in an error or nonsensical output values. To a large extent, this can be mitigated by scaling all the data down so that the largest column variance is equal to 1 as well as shifting the data so that the mean of each column is 0.
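As a rough sketch of that preprocessing, the NumPy snippet below centers each column and scales it so that its variance is proportional to a chosen weight, with the largest column variance equal to 1. The function name scale_for_weights, the example weights, and the synthetic data are assumptions made for illustration; the result would be used with normalization=False.

```python
import numpy as np


def scale_for_weights(data: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Center each column, then scale it so that its variance is proportional
    to its weight and the largest column variance equals 1."""
    weights = np.asarray(weights, dtype=np.float64)
    centered = data - data.mean(axis=0)
    std = centered.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at zero instead of dividing by zero
    unit_variance = centered / std
    # Variance grows with the square of the multiplier, so take the square root
    # of the relative weight.
    return unit_variance * np.sqrt(weights / weights.max())


rng = np.random.default_rng(0)
home_price = rng.normal(300_000, 50_000, size=1_000)
height_inches = rng.normal(67, 4, size=1_000)
data = np.column_stack([home_price, height_inches])

# Give home price four times the weight of height.
scaled = scale_for_weights(data, weights=np.array([1.0, 0.25]))
print(scaled.var(axis=0))
```

Because every column is shifted to zero mean and no column variance exceeds 1, the large raw values in the home price column no longer risk overflow.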
Training outputs
cluster_means: numpy.ndarray[float64]
 The centers, or means, of the identified clusters
is_cluster_empty: numpy.ndarray[int64]
 Indicates whether each cluster is actually empty (1) or not (0).
 Sometimes training returns empty clusters based on how the data is arranged or to maintain the privacy of individual data points.
 Empty clusters are ignored in inference, regardless of their proximity to the data.
zmuv_mean: numpy.ndarray[float64]
 The means of the columns.
 Used in inference to reapply the data normalization. See 3a_local_inference.py for an example.
zmuv_linear: numpy.ndarray[float64]
 The reciprocal of the standard deviations of the columns.
 Used in inference to reapply the data normalization. See 3a_local_inference.py for an example.
labels: pandas.DataFrame
 The cluster that each training data point is assigned to.
inertia: numpy.float64
 The sum of squared distances between each training data point and the nearest cluster mean.
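To show how these outputs fit together, here is a minimal NumPy sketch of local cluster assignment. It is not the contents of 3a_local_inference.py; it assumes that cluster_means is expressed in the normalized space and that all arrays have already been loaded, and the function name assign_clusters and the sample values are hypothetical.

```python
import numpy as np


def assign_clusters(data, cluster_means, is_cluster_empty, zmuv_mean, zmuv_linear):
    """Label each row with its nearest non-empty cluster and return the inertia."""
    # Reapply the training normalization: subtract the column means, then
    # multiply by the reciprocal standard deviations.
    normalized = (data - zmuv_mean) * zmuv_linear
    # Squared Euclidean distance from every row to every cluster mean.
    diffs = normalized[:, None, :] - cluster_means[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=2)
    # Empty clusters are ignored, regardless of their proximity to the data.
    sq_dist[:, is_cluster_empty == 1] = np.inf
    labels = sq_dist.argmin(axis=1)
    # Inertia: sum of squared distances to the assigned cluster mean.
    inertia = sq_dist[np.arange(len(data)), labels].sum()
    return labels, inertia


cluster_means = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]])
is_cluster_empty = np.array([0, 0, 1])
zmuv_mean = np.array([10.0, 100.0])
zmuv_linear = np.array([0.5, 0.1])  # reciprocals of standard deviations 2 and 10

data = np.array([[10.0, 100.0], [12.0, 110.0], [11.0, 104.0]])
labels, inertia = assign_clusters(
    data, cluster_means, is_cluster_empty, zmuv_mean, zmuv_linear
)
print(labels, inertia)
```

The third row lies closest to the third cluster mean, but that cluster is flagged as empty, so the row is assigned to the nearest remaining cluster instead.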
Inference parameters
dataset: Union[Asset, List[Asset]]
 List of positioned assets to conduct clustering across
 The initiator of the inference must put the asset they own first in the list
psi: Dict{ "match_column": Union[str, List[str]] }
 Name of the column to match. If the name is not the same in all datasets, supply a list of the matching column names, one for each dataset above, in order. If a single field name is provided, each dataset must use that same name for its match_column, e.g. “ID”.
Inference outputs
labels: pandas.DataFrame
 The cluster that each inference data point is assigned to.
inertia: numpy.float64
 The sum of squared distances between each inference data point and the nearest cluster mean.
kGrouping
The outputs of training include the mean and standard deviation of each column as well as the means of each column, grouped by clusters.
The following adjustment happens at the end of training, using the highest kGrouping value of the involved datasets:
 Clusters with fewer data points than kGrouping are marked as empty.
 The data points of these clusters are reassigned to the nearest remaining cluster.
 The remaining cluster means are recalculated as the mean of all data points within that cluster.
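The three steps above can be sketched on plain, unencrypted arrays. This illustrates the logic only and is not the private protocol TripleBlind runs; the function name apply_k_grouping and the sample data are assumptions, and k_grouping is assumed to be at least 1.

```python
import numpy as np


def apply_k_grouping(data, labels, cluster_means, k_grouping):
    """Mark clusters smaller than k_grouping as empty, reassign their points,
    and recompute the remaining cluster means."""
    counts = np.bincount(labels, minlength=len(cluster_means))
    is_cluster_empty = (counts < k_grouping).astype(np.int64)
    if is_cluster_empty.all():
        raise ValueError("every cluster is smaller than k_grouping")

    # Squared distances to every cluster mean, with removed clusters masked out.
    sq_dist = ((data[:, None, :] - cluster_means[None, :, :]) ** 2).sum(axis=2)
    sq_dist[:, is_cluster_empty == 1] = np.inf

    # Only points from removed clusters move to the nearest remaining cluster.
    new_labels = labels.copy()
    moved = is_cluster_empty[labels] == 1
    new_labels[moved] = sq_dist[moved].argmin(axis=1)

    # Recompute each remaining mean from all points now assigned to it.
    new_means = cluster_means.copy()
    for c in np.flatnonzero(is_cluster_empty == 0):
        new_means[c] = data[new_labels == c].mean(axis=0)
    return new_labels, new_means, is_cluster_empty


data = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [5, 4]], dtype=np.float64
)
labels = np.array([0, 0, 0, 1, 1, 1, 2])
cluster_means = np.array([[1 / 3, 1 / 3], [31 / 3, 31 / 3], [5.0, 4.0]])

# Use the highest kGrouping value among the involved datasets.
k_grouping = max(3, 2)
new_labels, new_means, is_cluster_empty = apply_k_grouping(
    data, labels, cluster_means, k_grouping
)
print(new_labels, is_cluster_empty)
```

In this example the single-point cluster falls below the threshold, so its point joins the first cluster and that cluster's mean shifts toward it.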
Limitations
 Currently supports clustering across 2 datasets only.
 Currently supports datasets with a total size (rows times columns) of up to 20 million values.
 e.g. 400 columns * 50,000 rows = 20 million
 NaN values are not supported and should be imputed in the datasets during preprocessing.
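A preprocessing check along these lines could be run before positioning a dataset. This is a minimal sketch: the helper names are hypothetical, the size limit is treated as inclusive based on the example above, and column-mean imputation is only one possible strategy, not one the documentation prescribes.

```python
import numpy as np

MAX_VALUES = 20_000_000  # rows times columns, per the limitation above


def within_size_limit(num_rows: int, num_columns: int) -> bool:
    """Check the rows-times-columns limit (assumed to be inclusive)."""
    return num_rows * num_columns <= MAX_VALUES


def impute_column_means(data: np.ndarray) -> np.ndarray:
    """Replace each NaN with the mean of the non-NaN values in its column."""
    column_means = np.nanmean(data, axis=0)
    return np.where(np.isnan(data), column_means, data)


raw = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
clean = impute_column_means(raw)

print(within_size_limit(50_000, 400))
print(clean)
```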