Private Set Intersection (Blind Match)

Blind Match is TripleBlind’s implementation of a well-known secure multiparty computation called Private Set Intersection (PSI). Blind Match allows data from multiple sources to be privately joined on a common identifier, returning only the identifier values that are found across all datasets. This allows a data provider to find the records it has in common with other data providers without revealing anything beyond which records belong to every set.
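To make the idea concrete, here is a toy two-party sketch of one classic PSI construction, based on commutative blinding in the style of Diffie-Hellman. It is not TripleBlind's protocol, which this tutorial does not describe, and the parameters are far too small to be secure. The function names, the modulus, and the sample values are all assumptions made for illustration.

```python
import hashlib
import math
import secrets

# Toy modulus: the Mersenne prime 2**127 - 1. Far too small for real use.
P = 2**127 - 1


def hash_to_group(value: str) -> int:
    """Map an identifier to an integer in [1, P - 1]."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_secret() -> int:
    """Pick a random exponent that is invertible modulo P - 1."""
    while True:
        k = secrets.randbelow(P - 3) + 2  # value in [2, P - 2]
        if math.gcd(k, P - 1) == 1:
            return k


def blind(values, secret):
    """Raise each group element to a party's secret exponent."""
    return [pow(v, secret, P) for v in values]


def toy_psi(items_a, items_b):
    """Return the items of A that B also holds, comparing only blinded values."""
    a, b = new_secret(), new_secret()
    # A blinds its hashed items and sends them to B.
    a_once = blind([hash_to_group(x) for x in items_a], a)
    # B blinds A's values again (order preserved) and blinds its own items.
    a_twice = blind(a_once, b)
    b_once = blind([hash_to_group(y) for y in items_b], b)
    # A blinds B's values; (h^b)^a == (h^a)^b, so shared items collide.
    b_twice = set(blind(b_once, a))
    return [x for x, t in zip(items_a, a_twice) if t in b_twice]


broker = ["123-45-6789", "111-22-3333", "222-33-4444"]  # made-up values
bank = ["999-88-7777", "222-33-4444", "123-45-6789"]  # made-up values
matches = toy_psi(broker, bank)
print(matches)
```

Neither side ever sees the other's raw identifiers in this sketch; each sees only values blinded by a secret it does not hold, and only the shared items end up equal after both blindings.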

Scenario

In this tutorial, we will perform a Blind Match between three datasets owned by three organizations: two banks and a brokerage firm. The objective is to find individuals (based on their Social Security Numbers) who are present in all three datasets. You will play the role of the brokerage, which has a local dataset (i.e., a CSV file of customer data) and wishes to work with the two banks, named IniTech and Globex. All involved parties are TripleBlind users, and the two banks have already positioned their datasets and made them discoverable. Note that the names of the organizations in this example (e.g., IniTech) differ from the names of their assets (i.e., datasets). For instance, "Santander Customer Database" is the name of the dataset owned by Globex.

You have a local CSV file, broker-licenses.csv, containing the following information:

  • id: Your brokerage customer's internal account number
  • full_name: Your brokerage customer's name
  • license number: The customer's driver's license number
  • ssn: The customer's Social Security Number (e.g. 123-45-6789)
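Because the match is made on the ssn column, it can help to check the local file before submitting it. The sketch below assumes that SSNs should follow the 123-45-6789 pattern shown above and that differently formatted values would not match; the helper name and the sample rows are hypothetical, and an in-memory string stands in for broker-licenses.csv.

```python
import csv
import io
import re

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def check_match_column(csv_file, column="ssn"):
    """Return (line_number, value) pairs whose `column` value is malformed.

    Raises ValueError if the column is missing from the header.
    """
    reader = csv.DictReader(csv_file)
    if column not in (reader.fieldnames or []):
        raise ValueError(f"column {column!r} not found in header {reader.fieldnames}")
    problems = []
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        value = (row[column] or "").strip()
        if not SSN_PATTERN.match(value):
            problems.append((line_number, value))
    return problems


# Stand-in for broker-licenses.csv; every value below is made up.
sample = io.StringIO(
    "id,full_name,license number,ssn\n"
    "1001,Alex Example,D1234567,123-45-6789\n"
    "1002,Sam Sample,D7654321,123456789\n"
)
bad_rows = check_match_column(sample)
print(bad_rows)
```

For the real file, the same function could be called with an open file handle, for example `open("broker-licenses.csv", newline="")`.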

You know your partner banks have datasets made accessible through TripleBlind's platform (i.e., the banks' corresponding Access Points), but you don't know the format yet. So the process begins with data discovery.

Data discovery

The first step in this task is to discover and access the datasets of interest. For this tutorial we will focus on TripleBlind's web interface to search for the datasets and obtain their UUIDs. Alternatively, you could achieve the same goal using the SDK (refer to the Assets tutorial to learn more).

Visit the Asset Explorer to search for your partner's data assets.

You can type the text "Santander Customer Database" in the search box to find the first dataset. Click the result to explore this dataset further.

Within the data details, you can examine a mock data table which is representative of the actual data contained within the dataset. You can also use the Data Profile report to learn additional statistical details about the contained data. For example, you can find min, max and plots of distributions for individual fields. These tools give you a feel for the dataset even though you haven’t seen the actual data contained within it.

You will need the asset’s ID to reference this dataset in your code, which can be copied from the top of the page. Note: we use the SDK search API in this example to locate the datasets. You can experiment using the dataset ID in the code directly, e.g., dataset = tb.Asset('dataset_ID')
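If you paste an ID into your code by hand, a quick format check can catch copy-and-paste slips before any request is sent. This sketch assumes asset IDs are standard UUID strings, as the term UUID above suggests; the helper name and the sample values are placeholders, not real asset IDs.

```python
import uuid


def is_well_formed_uuid(text: str) -> bool:
    """Return True if `text` parses as a UUID string."""
    try:
        uuid.UUID(text.strip())
    except ValueError:
        return False
    return True


# Placeholder values, not real asset IDs.
print(is_well_formed_uuid("123e4567-e89b-12d3-a456-426614174000"))
print(is_well_formed_uuid("dataset_ID"))
```

The check only confirms the shape of the string; it says nothing about whether an asset with that ID exists or is accessible to you.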

Repeat the same steps for the second dataset, "PNB Customer Database".

import tripleblind as tb
tb.util.set_script_dir_current()

Locating datasets via find

In addition to searching by name, we could also use a dataset ID obtained from the web interface.

We are also assuming the dataset is tabular and therefore use the specialized TableAsset class.

table1 = tb.TableAsset.find("Santander Customer Database")

table2 = tb.TableAsset.find("EXAMPLE - PNB Customer Database")

if not table1:
    print("Unable to find Globex's 'Santander Customer Database'")
if not table2:
    print("Unable to find IniTech's 'EXAMPLE - PNB Customer Database'")
if not (table1 and table2):
    # Stop here: the steps below need both datasets.
    raise SystemExit(1)

print(f"Available datasets: {table1.name}, {table2.name}")

After finding the two bank datasets, we will invoke the intersect function to compute a secure and private Blind Match between table1, table2, and the local dataset file broker-licenses.csv. It is important to specify the match_column parameter, which names the column used to identify the records shared by all three datasets (remember that we have already confirmed that both banks' datasets include an "ssn" column).

The local file is positioned on your Access Point as a temporary dataset behind the scenes because calculations occur on datasets positioned on Access Points to preserve the privacy and security of the involved data. This temporary asset will disappear after the Python program executes.

The intersect() function returns an Asset object that you can manage using the same commands we covered earlier. Specifically, you can view its contents (because you own it) and/or download the asset as a CSV file containing the intersection results (i.e., the SSN values that exist in all three datasets).

ℹ️ Remember that the operation will not start until both banks accept your request to use their datasets. Alternatively, you could create an agreement with them to allow the usage of their datasets without having to approve every single usage individually (refer to the Assets tutorial for more information on Permissions and Agreements).

For the sake of this example, you can log in as the admin of each bank via the web interface, open Access Requests, and accept the jobs. You need to accept the requests for both banks.

overlap = table1.intersect(
    intersect_with=[table2, tb.util.script_dir() / "broker-licenses.csv"],
    match_column="ssn",
)

if overlap:
    print(f"Overlap Asset ID: {overlap.uuid}")
    print("Overlapping SSN values:")
    print(overlap.dataframe)

    # Rename the newly generated asset file.
    overlap.filename = "PSI_results.zip"

    # Download the asset locally.
    overlap.download(overwrite=True)
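Once the results are on disk, a natural next step is to look up which of your own customers matched. The sketch below uses only the standard library and assumes the downloaded archive has been extracted to a CSV with an "ssn" header; that layout, the helper name, and the sample rows are assumptions, so in-memory strings stand in for both files.

```python
import csv
import io


def select_matched_rows(local_csv, result_csv, column="ssn"):
    """Return local rows whose `column` value appears in the result file."""
    matched = {row[column].strip() for row in csv.DictReader(result_csv)}
    return [row for row in csv.DictReader(local_csv) if row[column].strip() in matched]


# In-memory stand-ins for broker-licenses.csv and the extracted result CSV.
# Every value below is made up.
local = io.StringIO(
    "id,full_name,license number,ssn\n"
    "1001,Alex Example,D1234567,123-45-6789\n"
    "1002,Sam Sample,D7654321,222-33-4444\n"
)
result = io.StringIO("ssn\n222-33-4444\n")

matched_rows = select_matched_rows(local, result)
for row in matched_rows:
    print(row["id"], row["full_name"])
```

This join happens entirely on your side and uses only your own records plus the intersection you already received, so it reveals nothing further about the banks' data.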