AI/Machine Learning

The following use cases will help you understand how to use TripleBlind to conduct privacy-preserving modeling and inferencing on third-party data.

AI/ML Use Case #1: Model Training

Using APIs, train an AI/ML model on datasets of virtually any type. The personas represented in this use case are the Data Scientist (User) and the Dataset Owner.

Workflow

The following workflow is used to train models using TripleBlind.

  1. Initialize a TripleBlind session
  2. Register new assets or locate existing assets
  3. Explore assets
  4. Perform preprocessing and tune model parameters
  5. Train the model and get results of the training run

Steps

To execute this use case, follow these steps in your Python IDE:

1. The User authenticates with the Router and starts a Session.

import tripleblind as tb

tb.initialize(api_token=user1_token)

ℹ️ The call to tb.initialize is unnecessary if the User token is set up in the User’s tripleblind.yaml file.
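
For reference, a minimal tripleblind.yaml might look like the sketch below. The exact schema is not shown in this guide, so the key name is an assumption:

# tripleblind.yaml (hypothetical layout; consult the SDK reference for the real schema)
api_token: <your-api-token>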

2. The Owner and/or User registers datasets as new assets, or the User searches for existing assets and selects them. The first code snippet is an example of registering a new dataset asset. The second is an example of searching for an existing asset.

asset0 = tb.Asset.position(
    file_handle="/Users/john/data_munge_sql_a.csv",
    name="Data Munge Table A-001",
    desc="Example dataset containing patient information in imperial units.",
    is_discoverable=True,
)

asset0 = tb.TableAsset.find("Data Munge Table A")

3. Optionally, the User explores an EDA profile and synthetic data view of registered Assets.

Alternatively, the Owner can grant access for the Blind Sample operation so that the User can get a realistic privacy-preserving sample similar to the real data.

Owner

asset0.add_agreement(
    with_org=2,  # the User's organization ID
    operation=tb.Operation.BLIND_SAMPLE,
)

User

table = tb.TableAsset.find("Data Munge Table A")
df = table.get_sample()
print(df)

            Patient_Id  Age  Height_IN  Weight_LBS
0  3838753949679968321   58         74         134
1  1648887823711656506   37         61         212
2  7552046757277691320   66         67         246
3  9125359464938872180   34         69         216
4  5348069512603498341   82         72         198
5  1318251060642557776   59         60         166
6  2306922378047909737   19         63         183
7  1705011269891820451   82         68         188
8  2576874739707490790   53         76         187
9  4233275952277371155   80         70         135

4. The Owner adds an Agreement for a training operation such as Regression or Blind Learning to their Asset.

asset0.add_agreement(
    with_org=2,
    operation=tb.Operation.REGRESSION,
)

Alternatively, the Owner can authorize each process run by the User manually.

5. The User experiments with sample data until the optimal preprocessing steps and model parameters have been identified.

preprocess0 = (
    tb.TabularPreprocessor.builder()
    .add_column("bmi", target=True)
    .all_columns(True)
    .sql_transform(
        "SELECT Patient_Id as pid, Height_IN as height, Weight_LBS as weight, 1 / (Height_IN * Height_IN) * Weight_LBS * 703 as bmi FROM data WHERE Age > 50"
    )
    .dtype("float32")
)

# asset1 and preprocess1 are a second party's dataset and its matching
# preprocessor, prepared the same way as asset0 and preprocess0
job = tb.create_job(
    job_name="Calculated BMI example",
    operation=tb.Operation.REGRESSION,
    dataset=[asset0, asset1],
    preprocessor=[preprocess0, preprocess1],
    params={
        "regression_algorithm": "Linear",
        "test_size": 0.1,
    },
)
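
Before submitting, the Blind Sample retrieved in step 3 is handy for sanity-checking a transform locally. For example, the derived bmi target (703 × weight in pounds ÷ the square of height in inches) can be recomputed with pandas on the sample DataFrame df from step 3:

# Recompute the SQL-derived target on the blind sample as a local sanity check
check = df[df.Age > 50].copy()
check["bmi"] = 703 * check.Weight_LBS / (check.Height_IN ** 2)
print(check[["Patient_Id", "Height_IN", "Weight_LBS", "bmi"]])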

ℹ️ Model setup and training job parameters vary widely. For example, a PyTorch neural network uses one or more NetworkBuilder objects with network layer splits and different training parameters, while a linear regression model may only require the train/test split size, as in the example above.
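
For context, the layer stack such a NetworkBuilder describes resembles an ordinary PyTorch module. The sketch below is purely illustrative and does not use the real NetworkBuilder API:

import torch.nn as nn

# Illustrative only -- the actual split-network definition is built with
# tb.NetworkBuilder, covered in the SDK reference.
net = nn.Sequential(
    nn.Linear(3, 16),  # lower layers, conceptually run near the data
    nn.ReLU(),
    nn.Linear(16, 1),  # upper layers, conceptually run on the User's side
)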

6. The User runs the training job and obtains the results, including the model file and its reference ID.

if job.submit():
    job.wait_for_completion()
    # Download a local copy of the trained model
    model_file = job.result.asset.download("bmi_model.zip", overwrite=True)
    print("Trained Network Asset ID:", job.result.asset.uuid)

    # Load the model to view results
    pack = tb.Package.load("bmi_model.zip")
    model = pack.model()

    print("\nCoefficients:")
    print(model.coef_)
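
Because the trained model is itself a registered Asset, the UUID printed above is all a later workflow needs to retrieve it, for example when setting up the inference use case that follows:

# Later, locate the trained model by its reference ID
trained_model = tb.Asset.find(str(job.result.asset.uuid))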

AI/ML Use Case #2: Model Inference

Using APIs, make predictions using a registered model Asset. The personas represented in this use case are the Data Scientist (User) and the Model Owner.

Workflow

The following workflow is used to generate inferences against trained models using TripleBlind.

  1. Initialize a TripleBlind session
  2. Register a model asset or locate an existing model
  3. Preprocess input data
  4. Run predictions against the model and get results

Steps

To execute this use case, follow these steps in your Python IDE:

1. The User authenticates with the Router and starts a Session.

import tripleblind as tb

tb.initialize(api_token=user1_token)

ℹ️ The call to tb.initialize is unnecessary if the User token is set up in the User’s tripleblind.yaml file.

2. The Owner adds an Agreement for their model to be executed by the User.

model = tb.Asset.find("3142c5db-3609-42d9-beb9-d3847b642fec")

model.add_agreement(with_org=2, operation=tb.Operation.EXECUTE)

Alternatively, the Owner can authorize each access request manually.

3. The User searches for an existing model and selects it.

model = tb.Asset.find("3142c5db-3609-42d9-beb9-d3847b642fec")

4. The User preprocesses input data to work with the model.

This step varies based on the model and datasets involved. Generally speaking, the prediction inputs should match the format of the training inputs. For example, if images were resized and converted to DICOM format during training, the same should be done for inferencing. If values were encoded during training, prediction values should be encoded the same way.
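
As a concrete illustration, if numeric features were standardized with a fitted scaler during training, the same fitted scaler should transform the prediction inputs. A minimal sketch, assuming a hypothetical scaler.pkl artifact saved at training time:

import pickle

import pandas as pd

# Hypothetical artifact: the scaler fitted on the training features
with open("scaler.pkl", "rb") as f:
    scaler = pickle.load(f)

new_data = pd.read_csv("new_patients.csv")  # hypothetical prediction inputs
# Apply the identical transformation the training features received
inputs = scaler.transform(new_data[["Height_IN", "Weight_LBS"]])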

A good habit for model builders is to document these requirements in the metadata description and/or Q&A fields of their algorithm asset in the TripleBlind Router Index.

Example inference dataset preprocessing from the examples/CMAPSS_CNN example in the SDK:

import torch
from sklearn.metrics import r2_score

X, y = reformat_data(data_x, data_y)  # a user-defined encoding method

ds = torch.utils.data.TensorDataset(X, y)
test_loader = torch.utils.data.DataLoader(ds, batch_size=128)
y_pred_list = []
y_true_list = []

# Collect predictions and true labels batch by batch
with torch.no_grad():
    for X_batch, y_batch in test_loader:
        y_test_pred = model(X_batch)
        for i in y_test_pred:
            y_pred_list.append(i.numpy())
        for i in y_batch:
            y_true_list.append(i.item())

y_pred_list = [a.squeeze().tolist() for a in y_pred_list]
r2_metric = r2_score(y_true_list, y_pred_list)
print(f"R2 score(FD001_test): {r2_metric}")

5. The User runs an inference job and obtains results.

for file in files:  # files: the list of input files to score
    job = tb.create_job(
        job_name="Model test",
        operation=model,
        params={"security": "aes"},  # or "smpc"
        dataset=f"/Users/john/{file}",
    )

    if job.submit():
        job.wait_for_completion()
        filename = job.result.asset.download(save_as="result.zip", overwrite=True)
        pack = tb.Package.load(filename)
        inference_predictions = pack.records()
        print(f"Inference results: {inference_predictions}")