Package preprocessor

The preprocessing tools coerce data assets into the format that a specific operation needs as input. The result of preprocessing defined in a TripleBlind operation is never read directly by a human. Instead, the steps described are performed by the data owner's Access Point immediately before the data is fed into the operation, and the result is then immediately discarded.

Some preprocessors can output data in multiple formats. For example, the ImagePreprocessor can output representations of an image as a numpy.ndarray or as a Torch.Dataset. Normally this can be ignored when defining a preprocessor, because the operation simply requests the data in the format it needs.

Preprocessing can transform data in many ways. Tabular data values could be normalized across the column, scaled to convert units, or otherwise manipulated in virtually any way. Extraneous rows or columns can also be discarded, so that only the fields of interest are fed into the process.

import tripleblind as tb

# Typical CSV preprocessor for specifying which column of data to use as a
# classification label and what other data columns to include for training.
csv_pre = (
    tb.TabularPreprocessor.builder()
    .add_column("target", target=True)
    .all_columns(True)
    .dtype("float32")
)
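
The kinds of tabular transforms mentioned above (discarding columns, scaling to convert units, normalizing across a column) can be pictured with a small plain-Python sketch. This is only an illustration of the concepts: it does not use the TripleBlind API, and the helper names (select_columns, scale_column, normalize_column) and the sample rows are hypothetical.

```python
from statistics import mean, pstdev


def select_columns(rows, keep):
    """Keep only the named fields from each row (rows are dicts)."""
    return [{name: row[name] for name in keep} for row in rows]


def scale_column(rows, name, factor):
    """Multiply one column by a constant, e.g. to convert units."""
    return [{**row, name: row[name] * factor} for row in rows]


def normalize_column(rows, name):
    """Rescale one column to zero mean and unit population standard deviation."""
    values = [row[name] for row in rows]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; center it and stop.
        return [{**row, name: 0.0} for row in rows]
    return [{**row, name: (row[name] - mu) / sigma} for row in rows]


# Hypothetical sample data: heights in inches, weights in kilograms.
rows = [
    {"height_in": 60.0, "weight_kg": 50.0, "note": "a"},
    {"height_in": 70.0, "weight_kg": 70.0, "note": "b"},
    {"height_in": 80.0, "weight_kg": 90.0, "note": "c"},
]

trimmed = select_columns(rows, ["height_in", "weight_kg"])  # drop "note"
in_cm = scale_column(trimmed, "height_in", 2.54)            # inches to centimetres
normalized = normalize_column(in_cm, "weight_kg")           # z-score the weights
```

In a real operation these steps would be declared through the preprocessor builder and run by the Access Point rather than written by hand.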

Preprocessing transforms for image data can resize the images, change color spaces (e.g. grayscaling), or alter the way the colors are represented numerically.

# Preprocessor to resize images to 28x28 grayscale, using 32-bit floats.
image_pre = (
    tb.ImagePreprocessor.builder()
    .resize(28, 28)
    .convert("L")
    .dtype("float32")
)
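
To make the image transforms concrete, here is a minimal plain-Python sketch of grayscaling, resizing, and converting to floats. It is not the ImagePreprocessor implementation. It assumes that "L" means 8-bit grayscale as in Pillow's mode names and uses the ITU-R 601 luma weights that Pillow documents for that conversion; nearest-neighbor resizing is chosen only because it is short. All function names are hypothetical.

```python
def to_grayscale(pixels):
    """Convert rows of (R, G, B) tuples to rows of 8-bit luma values.

    Uses the ITU-R 601 weights (0.299, 0.587, 0.114) with integer math;
    a real imaging library may round slightly differently.
    """
    return [
        [(r * 299 + g * 587 + b * 114) // 1000 for (r, g, b) in row]
        for row in pixels
    ]


def resize_nearest(gray, width, height):
    """Resize a 2-D list of values using nearest-neighbor sampling."""
    src_h, src_w = len(gray), len(gray[0])
    return [
        [gray[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def to_floats(gray):
    """Represent each value as a float (Python floats are 64-bit;
    an array library would be needed for true 32-bit storage)."""
    return [[float(value) for value in row] for row in gray]


# Hypothetical 2x2 RGB image: red, green / blue, white.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

gray = to_grayscale(image)
resized = resize_nearest(gray, 4, 4)  # each source pixel becomes a 2x2 block
as_float = to_floats(resized)
```

Changing the numeric representation does not change the picture, only how each pixel is stored, which is why the output format can usually be left to the operation.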

Operations can work on multiple datasets, and multiple preprocessors can be defined to operate on those datasets independently – or a single preprocessor can apply to them all.

Sub-modules

preprocessor.abstract

The abstract classes in this module define the interfaces used by the concrete classes defined in this package and by custom preprocessors.

preprocessor.azure_utils
preprocessor.custom_age_provider
preprocessor.custom_date_provider
preprocessor.custom_faker_sex_provider
preprocessor.data_generator
preprocessor.dicom

Preprocessing for DICOM (Digital Imaging and Communications in Medicine) images …

preprocessor.document_input
preprocessor.error
preprocessor.image

Image preprocessing is for all sorts of "picture" data, including medical imaging …

preprocessor.isdict

The abstract class IsDict defines the to_dict and from_dict interfaces used by other abstract methods in this module in abstract.py and by concrete classes …

preprocessor.iterator

The Iterators are used internally in preprocessors which interact with file folders, such as preprocessor.image and preprocessor.dicom.

preprocessor.mongodb_utils
preprocessor.nlp
preprocessor.numpy_input

NumPy data can represent virtually any kind of numerical data model, including images, time series data, and other abstractions. Generally it is …

preprocessor.package

A Package is a single file used to hold one or more files of data. The Package is essentially a .zip archive with several specific files inside it …

preprocessor.python
preprocessor.report_parameters
preprocessor.roi_input

Images used to train Region of Interest require more than simple named "target" information for training. Specifically, coordinates for the bounding …

preprocessor.sampler
preprocessor.serialize
preprocessor.spec
preprocessor.sql
preprocessor.tabular
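
As an illustration of the to_dict and from_dict pattern mentioned for preprocessor.isdict, the sketch below shows how such an interface lets a step definition round-trip through plain data such as JSON. The class names DictSerializable and ResizeStep are hypothetical stand-ins, not the package's actual classes, and the real IsDict signatures may differ.

```python
import json
from abc import ABC, abstractmethod


class DictSerializable(ABC):
    """Hypothetical stand-in for an IsDict-style interface."""

    @abstractmethod
    def to_dict(self) -> dict:
        """Return a plain-dict description of this object."""

    @classmethod
    @abstractmethod
    def from_dict(cls, data: dict):
        """Rebuild an object from the dict produced by to_dict."""


class ResizeStep(DictSerializable):
    """Hypothetical concrete step that stores a target image size."""

    def __init__(self, width, height):
        self.width = width
        self.height = height

    def to_dict(self):
        return {"width": self.width, "height": self.height}

    @classmethod
    def from_dict(cls, data):
        return cls(data["width"], data["height"])


# Round trip: object -> dict -> JSON text -> dict -> object.
step = ResizeStep(28, 28)
payload = json.dumps(step.to_dict())
restored = ResizeStep.from_dict(json.loads(payload))
```

Keeping both directions on one interface means any class that implements it can be described as plain data and reconstructed later from that description.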