Preprocessors

The preprocessing tools prepare dataset assets into the format required as input for a specific operation. The result of preprocessing defined in a TripleBlind operation is never read directly by a human; instead, the described steps are performed by the data owner's Access Point immediately before the data is fed into the operation, and the intermediate result is discarded immediately afterward.

Tabular Preprocessors

Preprocessing can transform data in many ways. Tabular values can be normalized across a column, scaled to convert units, or otherwise manipulated as needed. You can also select which specific feature columns to include in the process.

Below is an example of a typical tabular preprocessor:

csv_pre = (
   tb.TabularPreprocessor.builder()
   .add_column("target", target=True)
   .all_columns(True)
   .dtype("float32")
)

This preprocessor specifies which column is the target (or dependent variable), and defines the training set as consisting of all columns.

Tabular preprocessors may also be used to apply transformations to specific columns. SQL queries can be defined using the sql_transform method, or Python code can be supplied using the python_transform method. Additionally, data transformers are supported for One Hot Encoding and Binning.

When defining a tabular preprocessor, the following order of operations will be followed:

  1. SQL queries or Python code will be applied, if specified, using sql_transform or python_transform.
  2. The dataset will be filtered to include fields representing the desired features and target variables, as indicated by add_columns and/or all_columns.
  3. Value-level imputation and string replacements (see Handling Missing Values) will be applied next, if specified.
  4. Column-level transformations, such as encoding and binning, will be applied as indicated by add_data_transformer and/or add_target_transformer.
  5. Data types in the output array will be cast to the type designated by dtype, if specified.
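
For illustration, here is a sketch of a preprocessor that combines several of these steps; the column names (Age, Grade, Outcome) are hypothetical, and the comments label each builder call with the corresponding step in the numbered order above:

combined_pre = (
   tb.TabularPreprocessor.builder()
   # 1. Transform the table first (written as if against a table named "data")
   .sql_transform("SELECT Age, Grade, Outcome FROM data WHERE Age IS NOT NULL")
   # 2. Select the target and feature columns
   .add_column("Outcome", target=True)
   .all_columns(True)
   # 3. Impute any remaining missing values with column means
   .handle_nan("mean")
   # 4. Bin selected feature columns
   .add_data_transformer(
       "KBinsDiscretizer",
       columns=["Age", "Grade"],
       params={"n_bins": 4},
   )
   # 5. Cast the output array
   .dtype("float32")
)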

Here is an example of a preprocessor that uses KBinsDiscretizer to bin “Age” and “Grade”:

csv_pre = (
   tb.TabularPreprocessor.builder()
   .add_column("target", target=True)
   .all_columns(True)
   .dtype("float32")
   .add_data_transformer(
       "KBinsDiscretizer",
       columns=["Age", "Grade"],
       params={"n_bins": 4},
   )
)

Transformations may also be applied to the target (dependent) variables using add_target_transformer rather than add_data_transformer.
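
For example, here is a sketch that applies OrdinalEncoder to a hypothetical categorical target column named "Outcome". Per the Usage Notes below, the target's category values would also need to be supplied via params; see the Preprocessor reference docs for the exact parameter names:

csv_pre = (
   tb.TabularPreprocessor.builder()
   .add_column("Outcome", target=True)
   .all_columns(True)
   .dtype("float32")
   .add_target_transformer(
       "OrdinalEncoder",
       columns=["Outcome"],
       # category values for "Outcome" would be supplied via params (see Usage Notes below)
   )
)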

SQL transform

The usage of a sql_transform() preprocessor may be found in examples/Data_Munging/2b_model_train.py. This powerful preprocessor permits the user to apply a SQL statement to a table before it is used within an operation. The statement may be written as if it were executing against a table named “data” with the same column names as the asset. SQL statements supplied to sql_transform can be written directly in the preprocessor block, or a local .sql file can be supplied.

# NOTE: when defining the sql_transform portion of the preprocessor, user must
# use a table named "data"
preprocess_a = (
   tb.TabularPreprocessor.builder()
   .add_column("bmi", target=True)
   .all_columns(True)
   .sql_transform(
       "SELECT Patient_Id as pid, Height_IN as height, Weight_LBS as weight, (cast(Weight_LBS as real) / (cast(Height_IN as real) * cast(Height_IN as real))) * 703.0 as bmi FROM data WHERE Age > 50 and Weight_LBS IS NOT NULL"
   )
   .dtype("float32")
)
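
As noted above, the statement can also be kept in a local .sql file rather than written inline. Here is a sketch, assuming sql_transform() accepts a path to such a file; bmi_transform.sql is a hypothetical file containing the same query as above:

preprocess_a = (
   tb.TabularPreprocessor.builder()
   .add_column("bmi", target=True)
   .all_columns(True)
   # bmi_transform.sql is a hypothetical local file holding the SELECT statement shown above
   .sql_transform("bmi_transform.sql")
   .dtype("float32")
)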

These statements are always shown to the data owner when an Access Request is sent for an operation so that they are fully aware of the processing that is being requested to run against their data. Some operations have additional protections when this preprocessor is used:

  • When using a Safe with Care operation (see Privacy Assurances and Risk in the Getting Started section of the User Guide), masked columns are not permitted to be returned by a sql_transform preprocessor.
  • SELECT * is not permitted in sql_transform preprocessors upstream of Blind Query and Blind Join. Use get_column_names() to request a list of column names and list them explicitly in the query (see the sketch after this list). This empowers the data owner to make an informed approval decision within any Access Requests against their dataset.
  • Since Blind Report is entirely defined by the data owner, this operation does not permit the use of sql_transform preprocessors.
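
For instance, here is a sketch of listing columns explicitly instead of using SELECT *, assuming get_column_names() is called on the positioned dataset asset object (dataset_asset is a hypothetical handle):

# Request the remote asset's column names so they can be listed explicitly
columns = dataset_asset.get_column_names()  # dataset_asset is a hypothetical asset handle
query = f"SELECT {', '.join(columns)} FROM data"

preprocess = (
   tb.TabularPreprocessor.builder()
   .all_columns(True)
   .sql_transform(query)
)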

Python transform

The Tabular preprocessor can also be supplied with a Python script that makes use of powerful libraries such as pandas and numpy to prepare the data into the operation-specific format. An example of this may be found in examples/Data_Munging/2b_model_train.py. The preprocessing script must be written within a pre-configured transform() function that takes in a pandas dataframe (the original data) and outputs a new pandas dataframe (the transformed data to be used in the operation). The Python code can be written directly within the preprocessor block or a local .py file can be supplied.

preprocess_b = (
   tb.TabularPreprocessor.builder()
   .add_column("bmi", target=True)
   .all_columns(True)
   .python_transform("python_transform_script.py")
   .dtype("float32")
)
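
The script supplied to python_transform() must define the transform() function described above, taking in a pandas dataframe and returning a new one. Here is a minimal sketch of what a hypothetical python_transform_script.py might contain, mirroring the BMI calculation from the SQL example (the column names are assumed to match that asset):

import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Keep patients over 50 with a recorded weight, mirroring the SQL example
    df = df[(df["Age"] > 50) & (df["Weight_LBS"].notna())].copy()
    # Compute BMI from height in inches and weight in pounds
    df["bmi"] = (df["Weight_LBS"] / (df["Height_IN"] ** 2)) * 703.0
    return df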

As with the SQL transform, the data owner is able to review the Python script supplied to the preprocessor when an Access Request is sent for an operation. The Python transform also has extra protections to ensure safety during code execution:

  • Python scripts supplied to python_transform must follow the required format of the pre-built transform() function; otherwise they will not be executed.
  • Python transform scripts execute within a dedicated, locked-down environment that prohibits importing unauthorized libraries and does not allow writing data to files, making HTTP requests, or performing other input/output.
  • Preprocessing using python_transform can only make use of pandas and numpy, and only executes on the data supplied in the operation. The only possible output is another dataframe passed to the operation in context.
  • Since Blind Report is entirely defined by the data owner, this operation does not permit the use of python_transform preprocessors.

To aid in building and testing a python_transform, two command-line utilities are provided:

  • tb preproc create "Asset" will generate a pre-configured transform() function, stored in a local preproc.py file, that can be filled in with the desired operations; it also retrieves mock data for the given asset.
  • tb preproc test data.csv preproc.py will locally run the given preprocessor using the test data and save the post-processed output for examination and debugging.

Handling Missing Values

The Tabular preprocessor provides the .handle_nan() and .replace() methods for handling missing values in positioned datasets. These methods are roughly equivalent to the pandas methods df.dropna() and df.fillna() that users are likely already familiar with.

The .handle_nan() method supports the following ways to handle missing values. These can either be applied to specific columns or the entire dataset.

  • drop - drops all rows where a NaN is present in any column
  • min, mean, median, max - fills in missing values with the corresponding column statistic
  • int or float - fills in missing values with the provided value
  • dict - allows specifying any of the above on a per-column basis

In the below example, all null values in the dataset are dropped.

pre = (
   tb.TabularPreprocessor
   .builder()
   .all_columns(True)
   .handle_nan("drop")
)

In the below example, null values in column A are replaced with the column mean, and null values in column B are replaced with the integer 42.

pre = (
   tb.TabularPreprocessor
   .builder()
   .all_columns(True)
   .handle_nan({"A": "mean", "B": 42})
)

The .replace() method supports replacing specific values in the dataset with null or other values. Each value to be changed and its replacement must be provided as a tuple (before, after). In the below example, values of -1 in column A and values of 99 in column B are replaced with null. They can then be filled with the .handle_nan() method.

pre = (
   tb.TabularPreprocessor
   .builder()
   .all_columns(True)
   .replace({"A": (-1, np.nan), "B": (99, np.nan)})
)
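
Building on the example above, .replace() can be chained with .handle_nan() so that sentinel values are first converted to null and then imputed (here with column means); as elsewhere on this page, np refers to numpy:

pre = (
   tb.TabularPreprocessor
   .builder()
   .all_columns(True)
   # First convert the sentinel values -1 (column A) and 99 (column B) to null...
   .replace({"A": (-1, np.nan), "B": (99, np.nan)})
   # ...then fill the resulting nulls with each column's mean
   .handle_nan({"A": "mean", "B": "mean"})
)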

Encoders

The data transformers currently supported are OneHotEncoder and OrdinalEncoder for encoding and KBinsDiscretizer for binning. Example usage of these preprocessors can be seen in the Random Forest example found in the SDK. These features are designated as Early Access, and they have been implemented for the following operations:

  • Regression
  • Random Forest
  • XGBoost

Transforms interface:

   def add_data_transformer(
       self,
       transform: Union[
           "OneHotEncoder",
           "OrdinalEncoder",
           "KBinsDiscretizer",
       ],
       columns: Union[List, str] = [],
       params: Dict[str, object] = {},
   ) -> "TabularPreprocessorBuilder":
   def add_target_transformer(
       self,
       transform: Union[
           "OneHotEncoder",
           "OrdinalEncoder",
           "KBinsDiscretizer",
       ],
       columns: Union[List, str] = [],
       params: Dict[str, object] = {},
   ) -> "TabularPreprocessorBuilder":

ℹ️ Details about each parameter can be found in the Preprocessor reference docs.

Usage Notes

When using the OneHotEncoder and OrdinalEncoder transformers, all unique categories must be specified across all data providers’ assets. At this time, the auto method is not supported in the privacy-preserving implementation of these transformers. The supplied categories should not mix string and numeric values within a single feature, and numeric categories should be provided in sorted order.

For example, when one hot encoding “favorite color” across three assets, all categories of colors across all assets must be specified in the preprocessor definition.

Asset 1              “Red”, “Green”
Asset 2              “Blue”, “Green”, “Purple”
Asset 3              “Red”, “Blue”, “Green”
Categories Supplied  “Red”, “Blue”, “Green”, “Purple”
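
Here is a sketch of supplying the combined category list to the encoder. The categories parameter name is assumed to mirror the scikit-learn convention, and the column name favorite_color is hypothetical; consult the Preprocessor reference docs for the exact parameter names:

pre = (
   tb.TabularPreprocessor.builder()
   .add_column("target", target=True)
   .all_columns(True)
   .add_data_transformer(
       "OneHotEncoder",
       columns=["favorite_color"],
       # every category present across all providers' assets, listed explicitly
       params={"categories": [["Blue", "Green", "Purple", "Red"]]},
   )
)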

Binning across multiple data providers should be used with caution, since the distribution of data can differ from one dataset to another. When binning across multiple datasets, it is important that each dataset be similarly distributed; otherwise, binning is most effective when applied to a single dataset. To better understand a dataset's distribution, users can use the TripleBlind EDA tools, found under the Data Profile tab when viewing a dataset.

If encoding of target values is desired, use OrdinalEncoder when working with scikit-learn-based algorithms.

KBinsDiscretizer has primarily been tested as a data (feature) transformer rather than a target transformer.