Database Assets

In the previous tutorial you learned about positioning static data on your Access Point. This tutorial will cover a different type of dataset -- a Database Asset.

Database Assets contain tabular data, so they consist of columns (or fields) and rows (or records). Every record contains all of the fields, even when some of the values are empty.

For a CSVDataset the set of records is fixed once it has been placed on your Access Point. For a Database Asset, however, it is dynamic. The fields are always the same, but the number and content of the records come from the database at the time an algorithm is executed. The actual data is not stored on the Access Point -- only a set of instructions for connecting to the database and extracting a data view.

Connection strings

A connection string is much like a URL to reach a web page. The string specifies the type of database, login information and the specific portion of the database you want to access.

The exact format of the connection string varies for different database vendors. Typically it is in the form of dialect+driver://username:password@host:port/database. We currently support self-hosted database instances of PostgreSQL, MySQL, SQLite, Oracle, MongoDB, and Microsoft SQL Server.
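For example, a PostgreSQL connection string following that pattern might look like this (the credentials, host, and database name are placeholders):

connection_str = "postgresql://jane:s3cret@db.example.com:5432/hr_records"

The database-specific examples later in this tutorial show the exact form each supported vendor expects.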

We also support many data warehouse services, including Amazon Redshift, Databricks, Google BigQuery, Microsoft Azure Data Lake, and Snowflake. We provide helper classes to assist in creating the connection to these services.

ℹ️You may be concerned about the security of the connection string, as most contain sensitive data like a database username and password. Rest assured that this sensitive data remains firmly in your control at all times. The Database Asset is registered with the TripleBlind Router, but the actual asset is stored on your Access Point. When you define the asset, the SDK communication that creates it also occurs directly between your workstation and your Access Point. Nothing sensitive is ever sent to a server controlled by TripleBlind or by a partner that uses your asset.

Database view

Once you have provided the connection to the database, you have to describe what data you want included in this database asset. This is done by creating a view on the data using an SQL query. Only the result of the query is made available to TripleBlind users.
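For example, the following query (the column names are illustrative) would make only two fields of the people table visible through the asset:

query = "SELECT name, zipcode FROM people;"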

Snowflake Asset

As an example, we will look at defining an asset for information housed in a 🔗Snowflake cloud data warehouse.

First, connect to your Snowflake dashboard in your browser.

Here we see the database TEST_DATABASE has a table called "PEOPLE" which we want to make available on the TripleBlind platform for controlled access.

ℹ️You might be wondering why you can't just provide direct access as a shared Snowflake database. When you share a database you control who can reach your data, but you have no idea what they do with it. TripleBlind can lock usage down to only the exact algorithms you approve, which guarantees the data is never seen in the clear. You will also have a full audit trail of exactly when your data was accessed.

Positioning Database Assets with the SDK

Creating the asset

First, pull in the TripleBlind SDK library.

import tripleblind as tb
tb.util.set_script_dir_current()

Now you need the information to establish your connection with the database shown above: your username, password, and account. For Snowflake, the account is the start of the web interface URL, before "snowflakecomputing.com/" -- for example, a dashboard at https://yj91180.us-central1.gcp.snowflakecomputing.com/ means the account is "yj91180.us-central1.gcp".

To avoid storing these sensitive details in a script, they are often read from environment variables; the commented-out code below shows that approach. For simplicity, this example stores the values as plain strings.

Obviously, these credentials won't actually work. If you have your own Snowflake account and want to experiment in this notebook, feel free to substitute your own values here.

user = "steve"                       # os.environ["SNOWFLAKE_USERNAME"]
password = "password123"             # os.environ["SNOWFLAKE_PASSWORD"]
account = "yj91180.us-central1.gcp"  # os.environ["SNOWFLAKE_ACCOUNT"]
database = "TEST_DATABASE"
description = "test database"

The final thing you need to do is create the view you want to expose on the database. This is done with an SQL query. The simplest approach is to include all of the values in a particular table:

try:
   tb.asset.SnowflakeDatabase.create(
       user, password, account, database,
       query="SELECT * FROM people;",
       name="Snowflake example dataset",
       desc=description
   )
except tb.TripleblindAssetAlreadyExists:
   print("Asset already exists")

Since the query is SQL, you have full control over how much you expose in this database asset. For instance, you could limit it to certain fields, or to portions of the table using a WHERE clause.
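For example, this query (the salary column and date filter are illustrative) exposes only three fields, and only for recent hires:

query = "SELECT gender, position, salary FROM people WHERE hire_date >= '2020-01-01';"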

k-Grouping

Operations that return data (e.g. Blind Query, Blind Join, and Blind Stats) have embedded k-Grouping safeguards that reduce the risk of leaking data when fewer than a threshold number of records make up a group or the total output of an operation.

ℹ️As a best practice, we encourage using a SQL 🔗HAVING clause to enact a purposeful k-Grouping safeguard within the query in your database asset. For instance, the query in the example script (examples/Private_Query/1_owner_create_query.py) is:

report = """
SELECT Ethnicity, count(*) as Samples, avg(Total_Gross) as Average_Gross
FROM `your-gcpproject.your_datasets.payroll`
GROUP BY Ethnicity HAVING Samples >= 5
ORDER BY Ethnicity
"""

With this clause, you ensure that each group contains at least 5 members, so the report is far less likely to inadvertently reveal personally identifiable information to a malicious actor (e.g. by returning the average salary of a single individual).

Using the asset

This asset is used just like a CSVDataset -- in fact, the two are indistinguishable to the process using them. A preprocessor can be used to eliminate fields you aren't interested in for a particular operation.

For example, consider a Human Resources database that contains hundreds of fields for each employee. When you want to train a machine learning model to predict employee churn, you might only care about the fields "hire-date", "birth-date", "gender", "zipcode", "salary", and "title". The preprocessor would name these fields, and all of the others would be omitted when training the model.
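As a rough sketch of that idea (the builder call below is hypothetical -- consult the SDK reference for the actual preprocessor API):

# Hypothetical illustration only; see the SDK reference for the real API.
# The idea: name the fields to keep, and every other field is dropped
# before the operation sees the data.
churn_fields = ["hire-date", "birth-date", "gender", "zipcode", "salary", "title"]
preprocessor = tb.preprocessor.tabular(columns=churn_fields)  # illustrative call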

Back to our example, here are a few simple queries we can run using this asset:

table = tb.TableAsset.find_all("Snowflake example dataset", dataset=True)

# Generate a simple headcount
result = table.search(
   "gender",
   pattern=["m", "f"],
   match_case=False,
   silent=True,
)

# Generate a quick analysis of employee types
result = table.search(
   "position",
   pattern=["director", "vp", "manager", "clerk"],
   match_case=False,
   silent=True,
)

Using Database Assets

You can use TripleBlind's Data Explorer to browse for Database Assets and use them just like any other asset: view their properties and EDA reports, create Agreements, and grant others permission to use them just as easily as with a static file.

Other Databases

Many popular databases can be accessed this way. Here are some examples:

Amazon Redshift

Redshift is a data warehouse that has a specific method for asset creation:

tb.asset.RedshiftDatabase.create(
   host = "default.528655681674.us-east-2.redshift-serverless.amazonaws.com",
   port = 5439,
   database = "RedshiftDB",
   username = "akos",
   password = "my_pass",
   query = "SELECT * FROM public."demo_pii" where n1 > .21;",
   name = "Demo DB",
   desc = "Redshift Database using credentials",
   is_discoverable = True,
)

The Redshift connector supports additional connection 🔗options such as IAM:

tb.asset.RedshiftDatabase.create(
   host = "default.528655681674.us-east-2.redshift-serverless.amazonaws.com",
   port = 5439,
   database = "RedshiftDB",
   options = {
      "iam" : True,
      "access_key_id" : "XXXXXXXXXXXXXXXXXXXXXXXXX",
      "secret_access_key" : "XXXXXXXXXXXXXXXXXXXXXXXXX",
   },
   query = "SELECT * FROM public."demo_pii" where n1 > .21;",
   name = "Demo DB",
   desc = "Redshift Database using IAM",
   is_discoverable = True,
)

Amazon S3

tb.asset.S3Dataset.create(
   bucket_name = "9bec276a-68c7-1234-aa91-ff93c418e600",
   region = "us-east-1",
   object_name = "test_data.csv",
   aws_access_key_id = "XXXXXXXXXXXXXXXXXXXXXXXXX",
   aws_secret_access_key = "XXXXXXXXXXXXXXXXXXXXXXXXX",
   name = "Demo Asset",
   desc = "CSV dataset in an S3 bucket",
   is_discoverable = False,
)

Databricks

tb.asset.DatabricksDatabase.create(
   access_token=ACCESS_TOKEN,
   server_hostname=SERVER,
   http_path=HTTP_PATH,
   catalog=CATALOG,
   schema=SCHEMA,
   query=SQL_VIEW,
   name="Databricks Demo DB",
   desc="Databricks Database Asset",
   is_discoverable=False,
)

Google BigQuery

The BigQuery authentication mechanism is slightly different, using the BigQueryDatabase class. You must provide a GCP project ID, which will be billed for any data transfer fees associated with accessing this dataset. The specifics of costs are between you and GCP, but typically something like the first terabyte of data transfer per month is free.

Access credentials come from Google; see the 🔗JSON keyfile authentication documentation for details.

import os

credentials = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
gcp_project_id = "project-unique-202115"
bigquery_dataset = "iris"

tb.asset.BigQueryDatabase.create(
   gcp_project_id, bigquery_dataset, credentials,
   query="SELECT * FROM people;",
   name="BigQuery example dataset"
)

Microsoft Azure Blob Storage

tb.asset.AzureBlobStorageDataset.create(
   storage_account_name = "azure_demo_account",
   storage_key = "Your+SecretKey4ADSL+CSVDataNdBaot72ymJa8JBL1B/q9daL+AStOCNjBA==",
   file_system = "demo",
   key = "iris.csv",     # file system name defined in Azure control panel
   name = "Demo DB",
   desc = "CSV dataset in an Azure Blob",
   is_discoverable = True,
)

Microsoft Azure Data Lake

This connector supports Azure Data Lake Storage built on top of Azure Blob Storage.

tb.asset.AzureDataLakeStorageDataset.create(
   storage_account_name = "azure_demo",
   storage_key = "Your+SecretKey4ADSL+CSVDataNdBaot72ymJa8JBL1B/q9daL+AStOCNjBA==",
   file_system = "demo",
   path = "/datasets/csv/iris.csv",
   name = "Demo DB",
   desc = "CSV dataset in an Azure Data Lake",
   is_discoverable = True,
)

Microsoft SQL Server

# Via pyodbc
connection_str = "mssql+pyodbc://user:pass@mydsn"
# or pymssql
connection_str = "mssql+pymssql://user:pass@hostname:port/dbname"

tb.asset.DatabaseDataset.create(
   connection_str,
   query="SELECT * FROM tablename;",
   name="Oracle example dataset"
)

MongoDB

TripleBlind supports multiple MongoDB connection URI formats. See the MongoDB example in examples/Data_Connectors for more information.
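For instance, alongside the standard mongodb:// form used below, hosted clusters commonly use the DNS seed list form (the hosts and credentials here are placeholders):

# Standard connection URI
connection_str = "mongodb://user:password@mongo.example.com:27017/"
# DNS seed list URI, common for hosted clusters such as MongoDB Atlas
connection_str = "mongodb+srv://user:password@cluster0.example.mongodb.net/"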

tb.asset.MongoDatabase.create(
   connection_str = f"mongodb://{user}:{password}@{mongo_host}:{mongo_port}/",
   query = {},
   projection = {
      "gpa" : 1,
      "age" : 1,
      "admitted" : 1,
   },
   database = "student_records",
   collection = "college_acceptance",
   name = "Demo College Acceptance",
   desc = "College acceptance dataset stored in MongoDB",
   is_discoverable = False,
)

MySQL

connection_str = "mysql://user:pass@host:5432/database"

tb.asset.DatabaseDataset.create(
   connection_str,
   query="SELECT * FROM tablename;",
   name="MySQL example dataset"
)

Oracle

connection_str = "oracle+cx_oracle://user:pass@tnsname"
# or
connection_str = "oracle://user:pass@host:port/sidname"

tb.asset.DatabaseDataset.create(
   connection_str,
   query="SELECT * FROM tablename;",
   name="Oracle example dataset"
)

Postgres

You can also reference a Postgres database accessible to the Access Point. Simply build the connection string and create a generic DatabaseDataset:

connection_str = "postgresql://user:pass@host:5432/database"

tb.asset.DatabaseDataset.create(
   connection_str,
   query="SELECT * FROM tablename;",
   name="Postgres example dataset"
)

SQLite

# Use four slashes before an absolute path (three slashes for a relative path)
connection_str = "sqlite:////absolute/path/to/foo.db"

tb.asset.DatabaseDataset.create(
   connection_str,
   query="SELECT * FROM tablename;",
   name="SQLLite example dataset"
)

Support for other databases is possible. If there are data sources you would like to work with natively, please let us know by contacting your Customer Success Manager or submitting a request to 🔗Customer Support.

Positioning Database Assets using the TripleBlind Web Interface

Database assets can also be positioned by making use of an easy-to-follow wizard in the TripleBlind web interface. To begin this process, first log in to the web interface using your TripleBlind credentials. Then, navigate to the Assets page and click the New Asset button. You will be presented with a menu of database and file type options to start with.

In this example, we will configure an Amazon Redshift asset. To launch the wizard, click the Redshift icon.

Enter your hostname, database name, username, and password. If you are using AWS IAM to connect to Redshift, select the IAM radio button and input your access key and secret. Click Next.

On the next step of the wizard, you can input a query that will be used to pull data from the connected Redshift database. This query can be used to limit the data available through the positioned asset to specific columns and rows. After adding the query, click Next.

In the final step of the wizard, you can define the name and description of the asset (leveraging Markdown for formatting or embedding images). As with the SDK, the k-Grouping and visibility settings can be selected here.

Clicking the Position button will test the connection to the database. An error message will appear in the bottom left corner of the screen if the connection is not successful. Once the connection has been established, the Asset Details screen will be shown for the newly connected database asset. Nice job!