Model Training (tabular data)

Supervised learning is a Machine Learning approach that uses large amounts of existing input data (e.g., X-Ray images) and their outputs (e.g., diagnosis) to produce a generalizable model that can make future predictions on similar but "unseen" data. The predictions can either be classifications that produce a discrete category (e.g., predicting a disease type) or a regression that produces a continuous value (e.g., predicting stock value).

In this tutorial we will focus on training a deep learning model using an advanced privacy-preserving learning technique called Blind Learning. Blind Learning is one of our innovative algorithms which enables training advanced models without compromising the privacy of the data or the intellectual property of the model. Using Blind Learning, you can easily train your model on distributed datasets without ever "seeing" the raw data, which guarantees privacy for the data owners. To learn more about Blind Learning, visit 🔗tripleblind.com.

As a deep learning practitioner, you still have complete control over all phases of the model-training lifecycle, including:

  • Data collection and preparation
  • Model creation (architecture and hyperparameters)
  • Model validation
  • Inference

Scenario

In this tutorial, you will train a binary classification neural network on bank customer data to predict whether a customer will incur an overdraft in the next month or not. Unlike a typical training task which often takes place offline using a single dataset within one organization, this tutorial will carry out a distributed, collaborative training which uses three datasets from three different banks.

You will play the role of one bank with a local dataset which wishes to train an overdraft prediction model. You've realized you need a larger amount of similar data to train a robust, generalizable model. Using TripleBlind's Data Explorer you've identified two other banks -- JPM and PNB -- which use the same banking system and thus have the same type of data.

Working on the same type and structure of data coming from different datasets is known as training on horizontally-partitioned data, and our innovative algorithm that achieves this task is called Horizontal Blind Learning.

Each bank organizes customer data in a similar way: a CSV file containing customer history and transaction information in 200 columns, with each row representing a customer. This is also known as tabular data. We will assume all three banks store the same information for customers, including whether each customer incurred an overdraft. This overdraft field will be our target value. To train this way, each bank must have the same fields but can have a different number of customers (rows) in its dataset.
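The "same fields, different number of rows" requirement can be sketched locally with pandas. The three tiny tables and the helper name same_schema below are invented for illustration only; the tutorial datasets are much wider, and remote datasets are inspected through their Data Profile rather than loaded like this.

```python
import pandas as pd

# Toy stand-ins for three banks' tables (hypothetical columns and values).
bank_a = pd.DataFrame({"balance": [120.0, 35.5], "n_tx": [14, 3], "target": [0, 1]})
bank_b = pd.DataFrame({"balance": [980.0], "n_tx": [40], "target": [0]})
bank_c = pd.DataFrame({"balance": [5.0, 61.2, 300.0], "n_tx": [2, 9, 21], "target": [1, 0, 0]})


def same_schema(frames):
    """Return True when every frame has the same column names in the same order."""
    first = list(frames[0].columns)
    return all(list(frame.columns) == first for frame in frames[1:])


frames = [bank_a, bank_b, bank_c]
schema_ok = same_schema(frames)           # the columns must match
row_counts = [len(frame) for frame in frames]  # the row counts may differ
print(schema_ok, row_counts)
```

The check only compares column names; a fuller check would probably also compare dtypes and value ranges.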

Data discovery

TripleBlind gives you all the tools you need to discover and examine the characteristics of data from the other banks without revealing any sensitive information about the bank or their customers.

For this tutorial we will use two datasets TripleBlind has already loaded on the platform. They come from two fictional banks, "JPM" and "PNB", so that you can practice this tutorial.

  • Begin by browsing the 🔗Assets for datasets.
  • Enter "JPM" in the search box and locate the "JPM Customer Database".
  • Enter the Detail view to explore the Data Profile report. The Variables section lets you examine the distribution of each data column so you can ensure that you have the data diversity you need for each feature.
  • Repeat these steps for the "PNB Customer Database" dataset to perform the same Exploratory Data Analysis (EDA).

As discussed in the Assets tutorial, you can either use the UUID of the datasets (found in the detail view) or search by name within your code, which we will do here.

import sys
import warnings
import torch
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
import tripleblind as tb

tb.util.set_script_dir_current()

#### your dataset exists locally at the same location as this notebook, so
#### we only need the file path, our loaders will take care of the rest

#### Retrieve a dataset to use in this example from TripleBlind
dataset = tb.util.download_tripleblind_resource(
   "customer_transactions_new.csv",
   save_to_dir="data",
   cache_dir="../../../.cache"
   )

#### search for both datasets by name
dataset_PNB = tb.TableAsset.find("TEST - PNB Customer Database")
if dataset_PNB:
   print(f"Found dataset: {dataset_PNB.name}, ID: {dataset_PNB.uuid}")
else:
   print("The requested dataset does not exist")

dataset_JPM = tb.TableAsset.find("TEST - JPM Customer Database")
if dataset_JPM:
   print(f"Found dataset: {dataset_JPM.name}, ID: {dataset_JPM.uuid}")
else:
   print("The requested dataset does not exist")

We can perform some data analysis on our own dataset using pandas. The dataset has been divided into two parts, train and test. The test data will be downloaded and used later to validate the trained model locally.

df = pd.read_csv(dataset)
print(df.head())
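Beyond head(), a few more pandas calls are usually worth running on your local data before training. The sketch below uses a small made-up table (df_demo, with hypothetical columns) so that it runs on its own; on the real file you would apply the same calls to df.

```python
import pandas as pd

# Hypothetical miniature of a customer table; the real file has many more columns.
df_demo = pd.DataFrame(
    {
        "balance": [120.0, 35.5, 980.0, 5.0, 61.2, 300.0, 42.0, 77.7],
        "n_tx": [14, 3, 40, 2, 9, 21, 6, 11],
        "target": [0, 0, 0, 1, 0, 0, 0, 0],
    }
)

# Number of missing values in each column.
missing_per_column = df_demo.isna().sum()

# Fraction of rows in each class of the target column.
class_share = df_demo["target"].value_counts(normalize=True)

print(df_demo.describe())
print(missing_per_column)
print(class_share)
```

A heavily skewed class_share is an early hint that accuracy alone will be a misleading metric later on.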

Alternatively, you can discover properties of the remote datasets by visiting the details page for the datasets in the web interface. A Data Profile report contains detailed statistics about the dataset in aggregate, as shown below.

Data preprocessing

The SDK provides a wide range of functions for preprocessing datasets. Here we use TabularPreprocessor to build our dataset from all the columns in the datasets (which we examined via the web interface in the previous step). These columns will be our input values. We will also tell the preprocessor which column contains the target values, named "target" in all three datasets. The preprocessor will automatically ensure that all columns are processed consistently across all three datasets.

We select all columns as training features except the "target" column, which is marked as the training target. The dtype is a numpy type (such as float32 or float64) or str for textual data.

preproc = tb.preprocessor.tabular.TabularPreprocessor.builder()
preproc.all_columns(True)  # select all columns (features) of the dataset
preproc.add_column("target", target=True) # except the "target" column
preproc.dtype("float32") # all columns use the same datatype
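For intuition, the effect of these three preprocessor calls on a single local table can be approximated in plain pandas and numpy. This is only a rough local analogue under our own assumptions, not how the SDK implements preprocessing; the table df_demo and its columns are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row table with two features and a target column.
df_demo = pd.DataFrame({"balance": [120, 35, 980], "n_tx": [14, 3, 40], "target": [0, 1, 0]})

# "All columns" minus the target become the model inputs.
feature_columns = [col for col in df_demo.columns if col != "target"]

# Cast everything to one dtype, mirroring the dtype("float32") setting.
X = df_demo[feature_columns].to_numpy(dtype=np.float32)
y = df_demo["target"].to_numpy(dtype=np.float32)

print(feature_columns, X.shape, X.dtype, y.shape)
```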

Network definition

We will create a basic neural network architecture consisting of an input layer, four hidden layers, and an output layer. We also use dropout in one hidden layer for regularization. The SDK provides a NetworkBuilder which supports assembling a network using simple instructions for adding layers and configurations, similar to PyTorch.

This tutorial does not explain our model architecture choice; instead, it focuses on illustrating the use of the SDK in building powerful networks. Feel free to experiment with different architectures!

Notice in the network architecture below a special type of layer, builder.add_split(). The Split layer is specific to TripleBlind's Blind Learning algorithm. The Split layer enables training distributed deep learning models without having to share the datasets with the model creator. Instead, the model is split into different parts and distributed among data holders. The Split layer ensures that only model parameters from a single layer are exchanged during the training process -- preserving privacy for the involved datasets.

ℹ️ Placing the Split layer too early in the network reduces the number of layers kept on the data-owner side while increasing the computational needs on the server side. We recommend placing the Split layer somewhere in the middle of the network.

#### define the neural network architecture
builder = tb.NetworkBuilder()

builder.add_dense_layer(26, 120)
builder.add_relu()
builder.add_dense_layer(120, 160)
builder.add_relu()
builder.add_dropout(0.25)
builder.add_dense_layer(160, 200)
builder.add_relu()
builder.add_split()
builder.add_dense_layer(200, 160)
builder.add_relu()
builder.add_dense_layer(160, 10)
builder.add_relu()
builder.add_dense_layer(10, 1)

model_name = "Credit-Classifier"
training_model = tb.create_network(model_name, builder)
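To make the role of the split point more concrete, the sketch below mirrors the layer sizes above in plain PyTorch and cuts the network in two at the position of add_split(). This is an illustration under our own assumptions, not TripleBlind's implementation: the names client_part and server_part are invented, and the privacy protections applied by Blind Learning are not modelled at all.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Plain-PyTorch mirror of the layer sizes used in the tutorial network.
layers = [
    nn.Linear(26, 120), nn.ReLU(),
    nn.Linear(120, 160), nn.ReLU(),
    nn.Dropout(0.25),
    nn.Linear(160, 200), nn.ReLU(),
    # ---- position of the split ----
    nn.Linear(200, 160), nn.ReLU(),
    nn.Linear(160, 10), nn.ReLU(),
    nn.Linear(10, 1),
]
split_index = 7  # number of modules before the split

client_part = nn.Sequential(*layers[:split_index])   # would stay with the data owner
server_part = nn.Sequential(*layers[split_index:])   # would run on the other side
full_model = nn.Sequential(*layers)                   # same modules, unsplit

for module in (client_part, server_part, full_model):
    module.eval()  # disable dropout so the outputs are deterministic

x = torch.randn(4, 26)  # a batch of 4 rows with 26 features
with torch.no_grad():
    activations = client_part(x)            # shape (4, 200)
    logits_split = server_part(activations)
    logits_full = full_model(x)

print(activations.shape, logits_split.shape)
```

In this toy version the only tensor that crosses the cut is the 200-wide output of the last layer before the split, and chaining the two halves gives the same result as running the unsplit network.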

Training configuration

The SDK method tb.create_job() configures and creates the training job. Here you specify the loss function, optimization algorithm, and the rest of the hyperparameters typical in PyTorch. Additionally you will set other hyperparameters required for our privacy-preserving algorithms.

ℹ️ The names of the loss functions and optimization algorithms, along with their parameters, are derived from PyTorch. You do not have to be familiar with PyTorch to use them; you only need to know the names, such as CrossEntropyLoss for cross-entropy loss and Adam for the Adam optimizer.

#### select a proper loss function
#### we will use the "BCEWithLogitsLoss" loss function
loss_name = "BCEWithLogitsLoss"

#### select a proper optimizer and its parameters
optimizer_name = "Adam"
optimizer_params = {"lr": 0.001}

#### configure the hyperparameters

train_params = {
       "epochs": 2,
       "loss_meta": {"name": loss_name},
       "optimizer_meta": {"name": optimizer_name, "params": optimizer_params},
       "data_type": "table",
       "data_shape": [26],  # number of columns in the dataset
       "model_output": "binary",  # binary/multiclass/regression
   }
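The loss and optimizer names in the configuration above correspond to standard PyTorch classes. As a small standalone check (plain PyTorch, independent of the SDK, with made-up logits and targets), the sketch below shows that BCEWithLogitsLoss expects raw logits and matches applying a sigmoid followed by BCELoss. This is why the network above ends in a dense layer without a sigmoid.

```python
import torch
from torch import nn

# Made-up raw model outputs (logits) and binary targets.
logits = torch.tensor([[2.0], [-1.0], [0.5]])
targets = torch.tensor([[1.0], [0.0], [0.0]])

# One-step version: the sigmoid is applied inside the loss.
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Two-step version: explicit sigmoid, then plain binary cross-entropy.
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), targets)

# The optimizer name and parameters map onto torch.optim in the same way.
toy_model = nn.Linear(26, 1)
optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.001)

print(loss_from_logits.item(), loss_two_step.item())
```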

ℹ️ Visit the documentation for the complete list of hyperparameters. For example, using "test_size": 0.2 will set aside 20% of each dataset for testing purposes, reported after each epoch.

Model training

After preparing the data and creating the training configuration, you can create a training job using tb.create_job(). Specify the datasets involved in the training, the data preprocessor, and the training hyperparameters. This function will automatically take care of the actual training across the distributed datasets and return the trained model (as an asset) once the job is complete.

ℹ️ After running the following cell, you will notice that the operation is waiting for permission. As discussed in previous tutorials, your training will not start until the owners of the datasets grant access.

Alternatively, you can contact the data owners and establish an Agreement that will allow you to use their dataset without having to wait for approval each time. Refer to the Assets tutorial for more information on Agreements.

For this tutorial, TripleBlind has created Agreements that automatically grant permission to anyone running against these datasets.

job = tb.create_job(
   job_name="Training Tutorial",
   operation=training_model,
   dataset=[dataset_PNB, dataset_JPM, dataset],
   preprocessor=preproc,
   params=train_params,
)

if job.submit():
   job.wait_for_completion(wait_for_permission=True)
   if job.success:
       print()
       print("    =================================================")
       print(f"   Trained Network Asset ID:\n {job.result.asset.uuid}")
       print("    =================================================")
       print()
       trained_network = job.result.asset # retrieve the trained model
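   else:
       print(f"Training failed. {job.result.raw_status}")
       sys.exit(1)
else:
   # without a submitted job there is no trained model to continue with
   print("Job submission failed.")
   sys.exit(1)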

Download the trained model

#### Download trained model locally (this is a PyTorch model)

local_filename = trained_network.download(save_as="local.pth", overwrite=True)
print("Trained network has been downloaded as:")
print(f"   {local_filename}")

Now you can use the local copy of the model directly as a PyTorch object:

#### Suppress the PyTorch "SourceChangeWarning" before loading the model
warnings.filterwarnings("ignore")

pack = tb.Package.load(local_filename)
my_model = pack.model()

print(my_model)

Local inference

In the following example, we will illustrate how to make predictions using the model we downloaded locally in the previous step. This process is called "inference". For simplicity in this tutorial, we are going to make predictions on part of the dataset we used for training.

# Download data used for testing
data_file = tb.util.download_tripleblind_resource(
   "test_small_demo.csv",
   save_to_dir="data",
   cache_dir="../../../.cache",
)

# Load and split test data into independent X (data) and y (target) dataframes
data_X = pd.read_csv(data_file)
data_y = data_X["target"].copy()
del data_X["target"]

X = data_X.values
X = X.astype(np.float32)
X = torch.from_numpy(X)

y = data_y.values.astype(np.int64)
y = np.expand_dims(y, axis=1)
y = torch.from_numpy(y).double()

#### create data loaders using PyTorch for your local batch inference
test_tensors = torch.utils.data.TensorDataset(X, y)
test_loader = torch.utils.data.DataLoader(test_tensors, batch_size=128, shuffle=True)

print(f"Total number of batches: {len(test_loader)}")

y_pred_list = []
y_true_list = []

my_model.eval()
with torch.no_grad():
   for X_batch, y_batch in test_loader:
       y_test_pred = my_model(X_batch)
       y_test_pred = torch.sigmoid(y_test_pred)
       y_pred_tag = torch.round(y_test_pred)
       for i in y_pred_tag:
           y_pred_list.append(i.numpy())
       for i in y_batch:
           y_true_list.append(i.item())

y_pred_list = [a.squeeze().tolist() for a in y_pred_list]
df = pd.DataFrame(y_pred_list)
df.to_csv("tabular_local_predictions.csv", header=None, index=None)
print(classification_report(y_true_list, y_pred_list))
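The sigmoid-then-round step in the loop above is a fixed 0.5 decision threshold on the predicted probability. The standalone sketch below (plain PyTorch, made-up logits) shows the equivalence and how a different threshold would change the predicted labels; the 0.3 value is only an example, not a recommendation.

```python
import torch

# Made-up raw model outputs for four customers.
logits = torch.tensor([[-1.2], [-0.4], [0.3], [2.5]])

probs = torch.sigmoid(logits)              # probabilities in (0, 1)
rounded = torch.round(probs)               # what the loop above does
thresholded = (probs > 0.5).float()        # the same decision, written explicitly
lowered = (probs > 0.3).float()            # a lower threshold flags more positives

print(probs.squeeze().tolist())
print(rounded.squeeze().tolist(), lowered.squeeze().tolist())
```

Lowering the threshold trades more false positives for fewer missed positives, which can matter when the positive class is rare.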

Depending on the hyperparameters you selected, the results above may show an acceptable overall accuracy while the model still performs poorly on the positive (overdraft) class. This is not related to our underlying training protocol; rather, it is a data imbalance problem. We encourage you to experiment with different architectures and configurations to generate a better model.
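A small synthetic example shows why accuracy can look fine on imbalanced data. The labels below are invented (95 negatives, 5 positives) and the "model" simply predicts the majority class for everyone; the metrics come from scikit-learn.

```python
from sklearn.metrics import accuracy_score, recall_score

# Invented labels: 95 customers without an overdraft, 5 with one.
y_true = [0] * 95 + [1] * 5

# A useless model that always predicts "no overdraft".
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
recall = recall_score(y_true, y_pred)      # share of true positives that were found

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The accuracy is 0.95 even though not a single positive case is detected, so the per-class rows of the classification report are the ones to watch.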

ℹ️ One of the most straightforward solutions to data imbalance is to use the pos_weight parameter of the BCEWithLogitsLoss loss function. This parameter gives a larger weight ("penalty") to misclassified positive examples.

    pos_weight = tb.TorchEncoder.encode(torch.arange(17, 18, dtype=torch.int32))

For more information, visit the PyTorch 🔗BCEWithLogitsLoss documentation.
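A common heuristic, described in the PyTorch documentation, is to set pos_weight to the ratio of negative to positive examples. The sketch below computes that ratio for an invented label vector and shows its effect in plain PyTorch; how the value is then passed into the training parameters should follow the SDK documentation, which is not reproduced here.

```python
import torch
from torch import nn

# Invented labels: 90 negatives and 10 positives.
labels = torch.tensor([0.0] * 90 + [1.0] * 10)

n_pos = labels.sum()
n_neg = labels.numel() - n_pos
pos_weight = (n_neg / n_pos).reshape(1)  # negatives per positive

plain_loss = nn.BCEWithLogitsLoss()
weighted_loss = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# A confident "negative" prediction for an example that is actually positive.
logit = torch.tensor([[-2.0]])
target = torch.tensor([[1.0]])

loss_plain = plain_loss(logit, target)
loss_weighted = weighted_loss(logit, target)

print(pos_weight.tolist(), loss_plain.item(), loss_weighted.item())
```

With this weighting, a missed positive costs nine times as much as before, while the loss on negative examples is unchanged.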

Behind the Scenes

The underlying model-training algorithm used in this tutorial is called Horizontal Blind Learning. "Horizontal" refers to the fact that the distributed datasets are of the same type and include the same columns (in the case of tabular data). In contrast, "Vertical" refers to datasets of different features (e.g., columns) and possibly different data types across the organizations. See the figure below on Vertical versus Horizontal datasets.
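The two terms can also be illustrated with a toy table in pandas. The table and its columns are made up, and the concatenation and merge at the end are only there to show how the pieces relate; with Blind Learning the raw partitions are never combined like this.

```python
import pandas as pd

# Hypothetical complete table that no single organization actually holds.
full = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 4],
        "balance": [120.0, 35.5, 980.0, 5.0],
        "n_tx": [14, 3, 40, 2],
        "target": [0, 1, 0, 1],
    }
)

# Horizontal partitioning: same columns, different customers (rows).
horizontal_a = full.iloc[:2]
horizontal_b = full.iloc[2:]

# Vertical partitioning: same customers, different columns.
vertical_a = full[["customer_id", "balance"]]
vertical_b = full[["customer_id", "n_tx", "target"]]

# Conceptually, horizontal parts stack by rows and vertical parts join on a key.
rebuilt_h = pd.concat([horizontal_a, horizontal_b], ignore_index=True)
rebuilt_v = vertical_a.merge(vertical_b, on="customer_id")

print(list(horizontal_a.columns) == list(horizontal_b.columns))
print(list(vertical_a.columns), list(vertical_b.columns))
```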