TripleBlind User Guide

Asset User Operations

Asset Users can use assets (datasets and algorithms) positioned by other organizations, at those organizations’ discretion.

Exploring Available Assets

You can use TripleBlind's Explore Assets to browse existing assets—both data and algorithms—that their owners have made discoverable.

ℹ️Remember, discoverable assets do not allow you to view the actual data—only to learn of the existence of datasets or algorithms you can use to train models or run inferences. Before using other organizations' assets, you must first obtain the necessary permissions from those organizations.

Datasets

You can search for dataset assets in the web interface or by using the SDK.

Web Interface

In the web interface, you can search for a dataset and select it to see more information about it. You can search by keyword or by Asset ID. Keyword searches help you find datasets that may be useful to your organization's needs, while searching by Asset ID is useful when you know the precise dataset you are looking for.

If you’d like to easily access an asset later, use the bookmark icon in the top right corner of the asset card to save it to My Bookmarks. You can mark it as a Favorite and use Folders to further organize your bookmarks. An asset may appear in both the Favorites table and bookmark folders, or in one and not the other.

For tabular, file-based Dataset Assets, the Overview tab displays a 10-record set of representative data for each dataset. It is important to remember that this representative data is just that; it does not contain raw records. This allows Asset Users to visualize the shape of the data without risking privacy.

Fields containing sensitive data are “masked” with random characters or replaced with synthetic values. By default, all Assets are positioned with all fields masked to ensure complete privacy protection. The Asset Owner can selectively unmask fields considered safe for display, in which case the 10-record preview of an unmasked field contains a random sample of values from the underlying dataset.
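
To make the masking behavior concrete, here is a simplified sketch of the idea (an illustration only, not TripleBlind's implementation): every field is masked with random characters unless the owner has marked it safe for display.

```python
import random
import string

def mask_value(value: str) -> str:
    """Replace every character with a random alphanumeric character."""
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choice(alphabet) for _ in value)

def mask_record(record: dict, unmasked_fields: set) -> dict:
    """Mask every field except those the owner has marked safe for display."""
    return {
        field: value if field in unmasked_fields else mask_value(str(value))
        for field, value in record.items()
    }

record = {"name": "Jane Doe", "zip": "48226", "age": 34}
preview = mask_record(record, unmasked_fields={"zip"})
# preview["zip"] is unchanged; "name" and "age" become random characters
```

Synthetic-value replacement follows the same principle: the displayed value preserves the shape of the field without exposing the underlying record.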

The Data Profile tab for a tabular dataset contains an Exploratory Data Analysis (EDA) report, which provides summaries and statistics about the dataset without revealing the actual contents. Data scientists can use this information to determine the utility of a dataset for their purposes.

Through the Q&A tab, you can submit questions to the Asset Owner, who can respond by populating the tab with answers.

SDK

Using the SDK, you can search for a dataset using its Unique Asset ID or name. To search across all assets, use tb.Asset.find(); to search specifically for tabular datasets, use tb.TableAsset.find(). Each of these methods returns a single match, if one exists. For a more general search term, tb.Asset.find_all() returns a list of assets that match the term. To make either search dataset-specific, use the dataset=True parameter. Here are examples:

# Find a tabular dataset by name
dataset_PNB = tb.TableAsset.find("PNB Customer Database")

# Find any asset by its Unique Asset ID
dataset_PNB = tb.Asset.find("daedc7db-8e7f-4ed7-92fb-a7fa8b26fae1")

# Find all assets matching a general keyword
datasets_bank = tb.Asset.find_all("Bank")

Algorithms

You can search for algorithm assets either in the web interface or using the SDK.

Web Interface

In the web interface, you can search for an algorithm and select it to find more information about it. You can search for algorithms by keyword or by Asset ID.

The Profile tab for an algorithm contains information that varies by model type. Content may include the model type, input/output structure, and a summary of training runs.

SDK

Using the SDK, you can search for an algorithm using its Unique Asset ID or name via tb.Asset.find(). This method returns a single algorithm, if one exists. For a more general search term, tb.Asset.find_all() returns a list of assets that match the term. To make either search algorithm-specific, use the algorithm=True parameter. Here is an example:

# Find the trained model by its saved Asset ID
asset_id = tb.util.load_from("asset_id.out")

model = tb.Asset.find(asset_id, owned_by=org_id, algorithm=True)
models = tb.Asset.find_all("regression", algorithm=True)

More details about how to use these methods can be found in the TripleBlind SDK reference document.

Performing Analysis with 3rd-Party Datasets

TripleBlind offers a number of operations that allow you to safely work with 3rd-party datasets.

Blind Match

Blind Match is TripleBlind’s implementation of a well-known secure multiparty computation technique known as 🔗Private Set Intersection.

You might use Blind Match to determine:

  • whether a patient of yours has health records in another hospital’s EHR system
  • how large the overlap is between two banks’ customer bases

This operation requires multiple datasets and a shared identifier such as name or social security number. The result is a list of the overlapping identifiers—no data is shared that is not already within the Asset User’s own dataset.
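
Conceptually, each party blinds its identifiers before any comparison, so only the overlap can be learned. The sketch below illustrates the general idea behind a private set intersection using keyed hashes; it is not TripleBlind's actual protocol, which relies on secure multiparty computation.

```python
import hashlib
import hmac

def blind(identifiers, shared_key: bytes) -> dict:
    """Map a keyed hash of each identifier back to the identifier."""
    return {
        hmac.new(shared_key, ident.encode(), hashlib.sha256).hexdigest(): ident
        for ident in identifiers
    }

shared_key = b"key-agreed-out-of-band"  # placeholder shared secret
bank_a = blind({"alice", "bob", "carol"}, shared_key)
bank_b = blind({"bob", "carol", "dave"}, shared_key)

# Only blinded values are compared; each party learns just the overlap.
overlap = {bank_a[h] for h in bank_a.keys() & bank_b.keys()}
# overlap == {"bob", "carol"}
```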

Blind Report

Blind Report allows a data owner to position a database-backed query with predefined, configurable parameters. As a user, you can configure the query using those predefined options, have it run against the owner's database, and receive a report table.

This is a powerful operation that allows the data steward to permit only specific, controlled access to the data they want exposed to the consumer. For example, a report could be defined that allows a user to select a year and month, along with a company division, to generate salary statistics by ethnicity or gender for use in compliance reporting.

Any number of variables and any complexity of query are supported. See examples/Blind_Report in the SDK for documentation and more information.
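
The general shape of such a report can be sketched with an ordinary parameterized query (a plain-SQL illustration, not TripleBlind's implementation): the steward fixes the SQL once, and the consumer may only bind the predefined parameters.

```python
import sqlite3

# The data steward fixes the SQL once; consumers may bind only these parameters.
REPORT_SQL = """
    SELECT division, COUNT(*) AS headcount, AVG(salary) AS avg_salary
    FROM payroll
    WHERE year = :year AND month = :month AND division = :division
    GROUP BY division
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payroll (year INT, month INT, division TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO payroll VALUES (?, ?, ?, ?)",
    [(2024, 1, "R&D", 90000.0), (2024, 1, "R&D", 110000.0), (2024, 1, "Sales", 80000.0)],
)

# The consumer chooses values for the predefined options and receives a report table.
rows = conn.execute(REPORT_SQL, {"year": 2024, "month": 1, "division": "R&D"}).fetchall()
# rows == [("R&D", 2, 100000.0)]
```

Because the consumer never writes SQL, the steward controls exactly what can be asked of the data.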

Blind Join

Blind Join builds on Blind Match but provides powerful additional functions:

  • Identify the subset of data within a counterparty’s (or within multiple counterparties’) dataset(s) that match to features within your own dataset, and bring in additional feature information for that subset of the third-party data.
  • Perform “fuzzy matching” on identifiers that may differ slightly (e.g., name or address).
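
As a general illustration of fuzzy matching (TripleBlind performs this privately; the sketch below is not its implementation), identifier pairs are scored for similarity and accepted above a threshold:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized similarity between two identifiers (1.0 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Accept a pair of identifiers as a match if their similarity clears the threshold."""
    return similarity(a, b) >= threshold

matched = fuzzy_match("Jon Smith", "John Smith")  # small spelling difference
unmatched = fuzzy_match("Jon Smith", "Jane Doe")  # clearly different names
```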

Blind Join is a Safe with Care operation (see Privacy Assurances and Risk in the Getting Started section of the User Guide), and has the potential for misuse. TripleBlind has a number of safeguards for its use:

  • Blind Join is not permitted to return any columns the Asset Owner has masked; the assumption being that the underlying values in those columns contain PII/PHI or otherwise sensitive information.
  • Blind Join is disabled by default at our strictest security levels.
  • Unless an Agreement has been established permitting auto-approval of requests, all Blind Join operations require an informed Asset Owner approval through an Access Request.
  • k-Grouping is respected in the Blind Join operation as a minimum record threshold on the output; a join that would result in fewer than k records automatically fails with a warning message.

Blind Query

Blind Query is a powerful operation that allows you to query a report from a 3rd-party Asset Owner.

You might use a Blind Query to:

  • Access a prepared report that changes on a monthly or quarterly basis.
  • Determine whether the Asset contains data relevant to your study or investigation.

Blind Query is a Safe with Care operation (see Privacy Assurances and Risk in the Getting Started section of the User Guide), and has the potential for misuse. TripleBlind has a number of safeguards for its use:

  • Blind Query is not permitted to return any columns the Asset Owner has masked; the assumption being that the underlying values in those columns contain PII/PHI or otherwise sensitive information.
  • Blind Query is disabled by default at our strictest security levels.
  • Unless an Agreement has been established permitting auto-approval of requests, all Blind Query operations require an informed Asset Owner approval through an Access Request. The Access Request for Blind Query contains information on any SQL statements that are invoked in the operation.
  • k-Grouping is respected in the Blind Query operation as a minimum record threshold on the output; a query that would result in fewer than k records automatically fails with a warning message.

Blind Stats

Blind Stats is a privacy-preserving operation that allows an Asset User to understand a study population across multiple datasets, even when the data resides in different organizations or regions.

You might use Blind Stats to:

  • Find the mean value of a particular serum biomarker for patients diagnosed with a rare cancer type across multiple hospitals.
  • Understand whether a collaborating data provider has data that is relevant for your study on the spending habits of outdoor hobbyists.
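
The underlying idea, combining per-site aggregates rather than pooling raw records, can be sketched as follows (a simplified illustration of federated aggregation, not TripleBlind's protocol):

```python
# Each organization computes local aggregates; only these summaries leave the site.
def local_aggregates(values):
    return {"n": len(values), "sum": sum(values)}

site_a = local_aggregates([4.1, 3.9, 4.4])  # biomarker values held by hospital A
site_b = local_aggregates([4.0, 4.6])       # biomarker values held by hospital B

# The requester combines the summaries; raw records are never pooled.
total_n = site_a["n"] + site_b["n"]
mean = (site_a["sum"] + site_b["sum"]) / total_n
# mean is approximately 4.2, computed without seeing any individual record
```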

Blind Stats is a Safe operation (see Privacy Assurances and Risk in the Getting Started section of the User Guide). TripleBlind nevertheless applies a number of safeguards to its use:

  • Unless an Agreement has been established permitting auto-approval of requests, all Blind Stats operations require an informed Asset Owner approval through an Access Request. The Access Request for Blind Stats contains information on all requested statistics.
  • Requests are automatically rejected for Blind Stats operations when they would return descriptive information on groups of records that do not meet the minimum k-Grouping limits set on the involved datasets.
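
In effect, k-Grouping acts as a minimum-group-size gate on any released statistic. The sketch below is a simplified illustration of that check (a hypothetical helper, not TripleBlind's API):

```python
def release_group_stats(group_counts: dict, k: int) -> dict:
    """Release per-group counts only if every group meets the minimum threshold k."""
    too_small = [label for label, count in group_counts.items() if count < k]
    if too_small:
        raise ValueError(f"Request rejected: groups below k={k}: {too_small}")
    return group_counts

allowed = release_group_stats({"female": 120, "male": 95}, k=10)
# release_group_stats({"female": 120, "male": 4}, k=10) would raise ValueError
```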

Developing Machine Learning Models

One of the most basic operations in machine learning is supervised learning, where a model learns from labeled data so it can later predict results when it encounters similar data. The predictions can be classifications, which produce a discrete category, or regressions, where the model predicts a continuous response.

Examples of classification problems are:

  • A diagnosis or fraud detection, producing "true" or "false"
  • An image recognizer, producing "cat," "dog," "car" or "airplane"

Regression problems do things like:

  • Predict a price
  • Forecast production quantities

The technique for training a model is similar regardless of the desired output.

TripleBlind’s Blind Learning operation is one of our innovative algorithms. It enables training advanced models without compromising the privacy of the data or the intellectual property of the model. Using Blind Learning, you can train your model on distributed datasets without ever "seeing" the raw data, which guarantees privacy for the data owners.

As a deep learning practitioner using the TripleBlind toolset, you will have complete control over all phases of the model-training lifecycle, including:

  • Data collection and preparation
  • Model creation (architecture and hyperparameters)
  • Model validation
  • Inference

Data Collection and Preparation

TripleBlind gives you all the tools you need to discover and examine the characteristics of data from 3rd-party datasets without revealing any sensitive information. As noted, this can be done using the web interface and the SDK.

Model Creation

The SDK provides a NetworkBuilder which supports assembling a network using simple instructions for adding layers and configurations, similar to PyTorch. Refer to Neural Networks to learn more about the capabilities currently supported by TripleBlind. Once the model is assembled via the NetworkBuilder and built via the create_network method, the model can be trained.

Model Training and Validation

The SDK method tb.create_job() configures and creates the training process. In it, you specify the datasets involved in training, the data preprocessor(s), and the training hyperparameters. The hyperparameters are similar to those used in PyTorch and include the loss function and optimization algorithm, among others. They are passed in the params parameter of tb.create_job() and differ for each algorithm; algorithm-specific parameter specifications can be found in the Operations section of the Documentation. You will also set other parameters required for our privacy-preserving algorithms.

After making a call to tb.create_job(), the process needs to be submitted for execution using the returned job's submit() method. Several other methods should then be called to ensure the successful completion of the process; the complete typical workflow for running a process can be found in the module tripleblind.job section of the SDK Reference. This workflow automatically takes care of training across the distributed datasets and returns the trained model (as an asset) once the process is complete.
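
Putting these steps together, the typical workflow looks roughly like the sketch below. It uses only method names that appear elsewhere in this guide; the job name and parameter values are placeholders, and the snippet will not run without the TripleBlind SDK, so treat it as illustrative pseudocode and consult the SDK Reference for exact signatures.

```python
# Illustrative pseudocode -- requires the TripleBlind SDK; values are placeholders.
job = tb.create_job(
    job_name="Distributed training",
    operation=network,                 # model assembled via NetworkBuilder
    dataset=[dataset_a, dataset_b],    # remote dataset assets found earlier
    params={...},                      # loss, optimizer, and other hyperparameters
)
if job.submit():
    job.wait_for_completion()

trained_network = job.result.asset    # the trained model, returned as an asset
```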

Note: The names of the loss functions and optimization algorithms, along with their parameters, are derived from PyTorch.

After training a model, it can be downloaded and used outside of TripleBlind (i.e., locally), or used within TripleBlind to run secure inferences using FED(erated) or SMPC modes. See the Inference section below to learn about the difference between these modes.

# Download the trained model locally (this is a PyTorch model)
trained_network = job.result.asset  # result of a training job
local_filename = trained_network.download(save_as="local_MNIST.pth", overwrite=True)

# Load the model and use it as a PyTorch object
pack = tb.Package.load(local_filename)
my_model = pack.model()

Inference

The TripleBlind SDK supports three types of inference on trained models: local inference, FED(erated) inference, and SMPC inference.

Local Inference

After training, you can download the trained model and run an inference using local data. The operation and the data exist on the same machine. No privacy is guaranteed.

Here is an example of running a local inference using the downloaded model referenced above, against a single image.

input_img = image_loader("img/four.jpg")  # image_loader is a user-defined preprocessing helper
prediction = my_model(input_img).max(1)   # returns (max value, predicted class index)
print(f"The predicted label is: {prediction[1].item()}")
FED(erated) Inference

You can think of these operations as "bringing the program to the data." The Access Point executes the analysis in place against the remote data. For certain operations, intermediate data is passed between Access Points, but no raw or un-aggregated data is ever shared. All data transmission uses industry-standard SSL technologies, requiring TLS 1.2+ with SHA-256 or better cryptographic hashes, AES encryption, and 2048-bit RSA keys.

After the inference is created, the trained model is deleted from the data owner's environment.
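
An illustrative FED(erated) inference follows the same job pattern used elsewhere in this guide. The asset and dataset variables are placeholders, and the snippet requires the TripleBlind SDK, so treat it as illustrative pseudocode.

```python
# Illustrative pseudocode -- requires the TripleBlind SDK.
trained_algo = tb.Asset(trained_asset_id)

job = tb.create_job(
    job_name="FED inference",
    operation=trained_algo,
    dataset=dataset,
    params={"security": "fed"},  # FED(erated) mode
)
if job.submit():
    job.wait_for_completion()
```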

SMPC Inference (aka Blind Inference)

SMPC-based inference fully protects the privacy of both the model and the data being run through it. This is the most secure mode of operation. SMPC works by splitting both the model and the data so that neither is ever fully held by the data owner's or the algorithm owner's Access Point. The result is mathematically guaranteed privacy for both parties.

The only difference in syntax between the FED(erated) and SMPC operations is the security parameter, as shown here.

trained_algo = tb.Asset(trained_asset_id)

job = tb.create_job(
    job_name="Santander Inference",
    operation=trained_algo,
    dataset=dataset,
    params={"security": "smpc"},
)
if job.submit():
    job.wait_for_completion()

See the Tutorials and Examples contained within the SDK for specific examples of building machine learning models using TripleBlind.