Package preprocessor
The preprocessing tools coerce data assets into the format a specific operation requires as input. The result of preprocessing defined in a TripleBlind operation is never read directly by a human. Instead, the steps described are performed by the data owner's Access Point immediately before the data is fed into the operation, and the result is then immediately discarded.
Some preprocessors can output data in multiple formats. For example, the ImagePreprocessor can output representations of an image as a numpy.ndarray or as a Torch.Dataset. Normally this can be ignored when defining a preprocessor; the operation will simply request the data in the format it needs.
Preprocessing can transform data in many ways. Tabular data values could be normalized across the column, scaled to convert units, or otherwise manipulated in virtually any way. Extraneous rows or columns can also be discarded, so that only the fields of interest are fed into the process.
import tripleblind as tb

# Typical CSV preprocessor for specifying which column of data to use as a
# classification label and what other data columns to include for training.
csv_pre = (
    tb.TabularPreprocessor.builder()
    .add_column("target", target=True)
    .all_columns(True)
    .dtype("float32")
)
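To make the tabular transforms described above concrete, here is a minimal sketch in plain NumPy of discarding a column, converting units, and normalizing a column. It does not use the TripleBlind API and is not how the TabularPreprocessor is implemented. The column names, the sample values, and the helpers `select_columns` and `min_max_normalize` are all hypothetical.

```python
import numpy as np

# Hypothetical table; the column names and values are made up for illustration.
columns = ["height_in", "weight_lb", "record_id", "target"]
data = np.array([
    [60.0, 120.0, 1.0, 0.0],
    [66.0, 150.0, 2.0, 1.0],
    [72.0, 180.0, 3.0, 1.0],
])


def select_columns(table, names, keep):
    """Return only the columns listed in `keep`, in that order."""
    indices = [names.index(name) for name in keep]
    return table[:, indices]


def min_max_normalize(column):
    """Rescale a column to the range [0, 1]; a constant column maps to zeros."""
    low, high = column.min(), column.max()
    if high == low:
        return np.zeros_like(column)
    return (column - low) / (high - low)


# Discard the extraneous "record_id" column and cast to 32-bit floats.
out = select_columns(data, columns, ["height_in", "weight_lb", "target"])
out = out.astype("float32")

# Scale to convert units: inches to centimetres.
out[:, 0] = out[:, 0] * 2.54

# Normalize the weight values across the column.
out[:, 1] = min_max_normalize(out[:, 1])
```

In a real operation these steps would be declared through the preprocessor builder and run on the data owner's Access Point; the sketch only shows the kind of arithmetic involved.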
Preprocessing transforms for image data can resize the images, change color spaces (e.g. grayscaling), or alter the way the colors are represented numerically.
# Preprocessor to resize images to 28x28 grayscale, using 32-bit floats.
image_pre = (
    tb.ImagePreprocessor.builder()
    .resize(28, 28)
    .convert("L")
    .dtype("float32")
)
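For intuition about what such a pipeline produces, the following sketch performs comparable steps in plain NumPy: grayscale conversion, resizing to 28x28, and casting to 32-bit floats. It is an illustration under stated assumptions, not the ImagePreprocessor's implementation. The helpers `to_grayscale` and `resize_nearest` are hypothetical, the ITU-R BT.601 luma weights and nearest-neighbour sampling are stand-ins chosen for simplicity, and the input is random data rather than a real image.

```python
import numpy as np


def to_grayscale(rgb):
    """Collapse an (H, W, 3) RGB array to (H, W) using BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights


def resize_nearest(image, height, width):
    """Resize a 2-D array by nearest-neighbour sampling (assumed method)."""
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[rows[:, None], cols]


# Random stand-in for a 56x84 RGB image with 8-bit channels.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(56, 84, 3), dtype=np.uint8)

# Grayscale, resize to 28x28, then represent the pixels as 32-bit floats.
small = resize_nearest(to_grayscale(rgb), 28, 28).astype("float32")
```

The result is a single-channel 28x28 array of float32 values, which is the shape of data the preprocessor definition above describes.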
Operations can work on multiple datasets, and multiple preprocessors can be defined to operate on those datasets independently – or a single preprocessor can apply to them all.
Sub-modules
preprocessor.abstract
-
The abstract classes in this module define the interfaces used by concrete classes defined in this package or by custom preprocessors.
preprocessor.azure_utils
preprocessor.custom_age_provider
preprocessor.custom_date_provider
preprocessor.custom_faker_sex_provider
preprocessor.data_generator
preprocessor.dicom
-
Preprocessing for DICOM (Digital Imaging and Communications in Medicine) images …
preprocessor.document_input
preprocessor.error
preprocessor.image
-
Image preprocessing is for all sorts of "picture" data, including medical imaging …
preprocessor.isdict
-
The abstract class IsDict defines the to_dict and from_dict interfaces used by other abstract methods in this module in abstract.py and concrete classes …
preprocessor.iterator
-
The Iterators are used internally in preprocessors which interact with file folders, such as preprocessor.image and preprocessor.dicom.
preprocessor.mongodb_utils
preprocessor.nlp
preprocessor.numpy_input
-
NumPy data can represent virtually any kind of numerical data model, including images, timeseries data and other abstractions. Generally it is …
preprocessor.package
-
A Package is a single file used to hold one or more files of data. The Package is essentially a .zip archive with several specific files inside it …
preprocessor.python
preprocessor.report_parameters
preprocessor.roi_input
-
Images used to train Region of Interest require more than simple named "target" information for training. Specifically, coordinates for the bounding …
preprocessor.sampler
preprocessor.serialize
preprocessor.spec
preprocessor.sql
preprocessor.tabular