Model Training (tabular data)
Supervised learning is a machine learning approach that uses large amounts of existing input data (e.g., X-ray images) and their outputs (e.g., diagnoses) to produce a generalizable model that can make future predictions on similar but "unseen" data. The predictions can be either classifications, which produce a discrete category (e.g., predicting a disease type), or regressions, which produce a continuous value (e.g., predicting a stock price).
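To make the distinction concrete, here is a minimal sketch (not part of the tutorial code) that fits a toy classifier and a toy regressor with scikit-learn; the data and models are invented purely for illustration.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy inputs

# Classification: the prediction is a discrete category (0 or 1 here)
clf = LogisticRegression().fit(X, [0, 0, 1, 1])
print(clf.predict([[2.5]]))  # -> a discrete label such as array([1])

# Regression: the prediction is a continuous value
reg = LinearRegression().fit(X, [1.1, 1.9, 3.2, 3.9])
print(reg.predict([[2.5]]))  # -> a continuous value near 2.5
```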
In this tutorial we will focus on training a deep learning model using an advanced privacy-preserving learning technique called Blind Learning. Blind Learning is one of our innovative algorithms which enables training advanced models without compromising the privacy of the data or the intellectual property of the model. Using Blind Learning, you can easily train your model on distributed datasets without ever "seeing" the raw data, which guarantees privacy for the data owners. To learn more about Blind Learning, visit 🔗tripleblind.com.
As a deep learning practitioner, you still have complete control over all phases of the model-training lifecycle, including:
- Data collection and preparation
- Model creation (architecture and hyperparameters)
- Model validation
- Inference
Scenario
In this tutorial, you will train a binary classification neural network on bank customer data to predict whether or not a customer will incur an overdraft in the next month. Unlike a typical training task, which often takes place offline using a single dataset within one organization, this tutorial carries out distributed, collaborative training across three datasets from three different banks.
You will play the role of a bank that wishes to train an overdraft prediction model on its local dataset. You've realized you need a larger amount of similar data to train a robust, generalizable model. Using TripleBlind's Data Explorer you've identified two other banks -- JPM and PNB -- which use the same banking system and thus hold the same type of data.
Training on data that shares the same type and structure but comes from different datasets is known as training on horizontally-partitioned data, and our innovative algorithm that achieves this task is called Horizontal Blind Learning.
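As a rough illustration (the column names and values here are hypothetical, and the pooled table is purely conceptual; Blind Learning never assembles it in the clear), horizontally-partitioned data looks like this:

```python
import pandas as pd

# Two fragments with identical columns but different rows: the defining
# property of horizontally-partitioned data.
cols = ["customer_id", "balance", "overdraft"]  # hypothetical column names
bank_a = pd.DataFrame([[1, 500.0, 0], [2, -20.0, 1]], columns=cols)
bank_b = pd.DataFrame([[3, 75.0, 0]], columns=cols)

# Conceptually, training proceeds as if the rows were pooled like this,
# but no party ever sees the combined raw table.
pooled = pd.concat([bank_a, bank_b], ignore_index=True)
print(pooled)
```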
Each bank organizes customer data in a similar way: a CSV file containing customer history and transaction information in 200 columns, with each row representing a customer. This is also known as tabular data. We will assume all three banks store the same information for each customer, including whether or not the customer incurred an overdraft; this overdraft field will be our target value. To train this way, each bank must have the same fields, but each can have a different number of customers (rows) in its dataset.
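For a sense of what such a model might look like, here is a hedged sketch of a small PyTorch binary classifier for this kind of tabular input. It assumes 199 input features (the 200 columns minus the overdraft target); the layer sizes are illustrative only, not the architecture used later in this tutorial.

```python
import torch.nn as nn

# Illustrative feed-forward binary classifier for 199 tabular features.
model = nn.Sequential(
    nn.Linear(199, 64),  # 199 assumed input features (200 columns - target)
    nn.ReLU(),
    nn.Linear(64, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
    nn.Sigmoid(),  # probability that the customer incurs an overdraft
)
```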
Data discovery
TripleBlind gives you all the tools you need to discover and examine the characteristics of data from the other banks without revealing any sensitive information about the bank or their customers.
For this tutorial we will use two practice datasets TripleBlind has already loaded onto the platform, belonging to the two fictional banks "JPM" and "PNB".
- Begin by browsing the 🔗Assets for datasets.
- Enter "JPM" in the search box and locate the "JPM Customer Database".
- Enter the Detail view to explore the Data Profile report. The Variables section lets you examine the distribution of each data column, so you can ensure you have the data diversity you need for each feature.
- The same discovery and Exploratory Data Analysis (EDA) can be performed on the "PNB Customer Database" dataset.
As discussed in the Assets tutorial, you can either use the UUID of the datasets (found in the detail view) or search by name within your code, which we will do here.
```python
import sys
import warnings

import torch
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

import tripleblind as tb

tb.util.set_script_dir_current()

# Retrieve the dataset used in this example from TripleBlind. Once
# downloaded, it exists locally at the same location as this notebook,
# so we only need the file path; our loaders will take care of the rest.
dataset = tb.util.download_tripleblind_resource(
    "Customer_transactions_new.csv",
    save_to_dir="data",
    cache_dir="../../../.cache",
)

# Search for both remote datasets by name
dataset_PNB = tb.TableAsset.find("TEST - PNB Customer Database")
if dataset_PNB:
    print(f"Found dataset: {dataset_PNB.name}, ID: {dataset_PNB.uuid}")
else:
    print("The requested dataset does not exist")

dataset_JPM = tb.TableAsset.find("TEST - JPM Customer Database")
if dataset_JPM:
    print(f"Found dataset: {dataset_JPM.name}, ID: {dataset_JPM.uuid}")
else:
    print("The requested dataset does not exist")
```
We can perform some data analysis on our own dataset, using pandas for exploration. The dataset has been divided into two parts, train and test; the test data will be downloaded and used later to validate the trained model locally.
```python
df = pd.read_csv(dataset)
print(df.head())
```
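A few typical follow-up checks you might run on the same DataFrame (a sketch; it assumes the target column is named "overdraft", which may differ in the actual file):

```python
print(df.shape)                        # rows x 200 columns
print(df["overdraft"].value_counts())  # class balance of the assumed target
print(df.isna().sum().sum())           # total count of missing values
print(df.describe())                   # summary statistics per column
```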
Alternatively, you can discover properties of the remote datasets by visiting the details page for the datasets in the web interface. A Data Profile report contains detailed statistics about the dataset in aggregate, as shown below.
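To preview the local validation mentioned above, here is a hedged sketch of how the downloaded test split could be scored with the `classification_report` imported earlier. Everything specific here is an assumption: the test file path, the "overdraft" target column name, and the availability of the trained PyTorch network as `model` from a later training step.

```python
# Sketch only: the file path, column name, and `model` are assumptions.
test_df = pd.read_csv("data/Customer_transactions_test.csv")  # hypothetical path
X_test = torch.tensor(
    test_df.drop(columns=["overdraft"]).values, dtype=torch.float32
)
y_test = test_df["overdraft"].values

with torch.no_grad():
    # Threshold the sigmoid outputs at 0.5 to get discrete class labels
    y_pred = (model(X_test).squeeze() > 0.5).int().numpy()

print(classification_report(y_test, y_pred))
```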