Module `preprocessor.tabular`

Classes

class Column (name: str, aliases: List[str], target: bool)

Ancestors

IsDict
abc.ABC

Class variables

var SCHEMA
var aliases : List[str]
var name : str
var target : bool

Inherited members

IsDict:
- from_dict
- to_dict

class TabularNumpyPreprocessor (columns: List[Column], all_columns: bool, sql_transform: Optional[str], python_transform: Optional[str], dtype: Union[str, numpy.dtype, ForwardRef(None)], sk_data_transformers: Optional[List], sk_target_transformers: Optional[List], expand_input_dims: Optional[Tuple[int, int]], handle_nan: Union[str, int, float, dict, ForwardRef(None)], substitutions: Union[list, tuple, dict, ForwardRef(None)])

Preprocessor designed for tabular-style data tasks.

This family of Preprocessors is designed to manage tabular-style data, i.e. rows of data with named columns for fields. Operations are always applied in this order:

1. SQL transform OR Python transform
2. Column selection (via all_columns or an explicit column list)
3. Value replacements
4. Handling of NaN values
5. Other transforms (OneHotEncoder, etc.)
6. Numpy type casting

This order is followed regardless of the order in which the operations are specified on the builder. Hence, the following two definitions behave identically:

# Parentheses let the chained calls span several lines.
(
    tb.TabularPreprocessor.builder()
    .add_column("target", target=True)
    .all_columns(True)
    .sql_transform("SELECT target, salar as sal, amt FROM data")
    .dtype("float32")
)

(
    tb.TabularPreprocessor.builder()
    .sql_transform("SELECT target, salar as sal, amt FROM data")
    .dtype("float32")
    .all_columns(True)
    .add_column("target", target=True)
)

Ancestors

Methods

def read_asset_generator(self, asset: Union[str, pathlib.Path, Package], batch_size: int = 32) -> Iterator[numpy.ndarray]

Inherited members

TabularPreprocessor:
- dealias
- optional_target_column_name
- pandas_coerce
- set_sk_fitted_data_transform
- set_sk_fitted_target_transform
- sk_data_transform
- sk_target_transform
- target_column_name
NumpyPreprocessor:
- read_asset
- read_asset_chunked
- read_bytes
- read_file
- read_folder

class TabularNumpyTargetPreprocessor (columns: List[Column], all_columns: bool, sql_transform: Optional[str], python_transform: Optional[str], dtype: Union[str, numpy.dtype, ForwardRef(None)], sk_data_transformers: List, sk_target_transformers: List, expand_input_dims: Optional[Tuple[int, int]], handle_nan: Union[str, int, float, dict, ForwardRef(None)], substitutions: Union[tuple, dict, ForwardRef(None)])

Preprocessor designed for tabular-style data tasks.

This family of Preprocessors is designed to manage tabular-style data, i.e. rows of data with named columns for fields. Operations are always applied in this order:

1. SQL transform OR Python transform
2. Column selection (via all_columns or an explicit column list)
3. Value replacements
4. Handling of NaN values
5. Other transforms (OneHotEncoder, etc.)
6. Numpy type casting

This order is followed regardless of the order in which the operations are specified on the builder. Hence, the following two definitions behave identically:

# Parentheses let the chained calls span several lines.
(
    tb.TabularPreprocessor.builder()
    .add_column("target", target=True)
    .all_columns(True)
    .sql_transform("SELECT target, salar as sal, amt FROM data")
    .dtype("float32")
)

(
    tb.TabularPreprocessor.builder()
    .sql_transform("SELECT target, salar as sal, amt FROM data")
    .dtype("float32")
    .all_columns(True)
    .add_column("target", target=True)
)

Ancestors

Inherited members

TabularPreprocessor:
- dealias
- optional_target_column_name
- pandas_coerce
- set_sk_fitted_data_transform
- set_sk_fitted_target_transform
- sk_data_transform
- sk_target_transform
- target_column_name
NumpyTargetPreprocessor:
- read_asset
- read_asset_chunked
- read_bytes
- read_file
- read_folder

class TabularPandasPreprocessor (columns: List[Column], all_columns: bool, sql_transform: Optional[str], python_transform: Optional[str], sk_data_transformers: List, sk_target_transformers: List, handle_nan: Union[str, int, float, dict, ForwardRef(None)], substitutions: Union[list, tuple, dict, ForwardRef(None)])

Preprocessor designed for tabular-style data tasks.

This family of Preprocessors is designed to manage tabular-style data, i.e. rows of data with named columns for fields. Operations are always applied in this order:

1. SQL transform OR Python transform
2. Column selection (via all_columns or an explicit column list)
3. Value replacements
4. Handling of NaN values
5. Other transforms (OneHotEncoder, etc.)
6. Numpy type casting

This order is followed regardless of the order in which the operations are specified on the builder. Hence, the following two definitions behave identically:

# Parentheses let the chained calls span several lines.
(
    tb.TabularPreprocessor.builder()
    .add_column("target", target=True)
    .all_columns(True)
    .sql_transform("SELECT target, salar as sal, amt FROM data")
    .dtype("float32")
)

(
    tb.TabularPreprocessor.builder()
    .sql_transform("SELECT target, salar as sal, amt FROM data")
    .dtype("float32")
    .all_columns(True)
    .add_column("target", target=True)
)

Ancestors

Inherited members

TabularPreprocessor:
- dealias
- optional_target_column_name
- pandas_coerce
- set_sk_fitted_data_transform
- set_sk_fitted_target_transform
- sk_data_transform
- sk_target_transform
- target_column_name
PandasPreprocessor:
- read_asset
- read_asset_chunked
- read_bytes
- read_file
- read_folder

class TabularPreprocessor (columns: List[Column], all_columns: bool, sql_transform: Optional[str], python_transform: Optional[str], sk_data_transformers: List, sk_target_transformers: List, handle_nan: Union[str, int, float, dict, ForwardRef(None)], substitutions: Union[list, tuple, dict, ForwardRef(None)])

Preprocessor designed for tabular-style data tasks.

This family of Preprocessors is designed to manage tabular-style data, i.e. rows of data with named columns for fields. Operations are always applied in this order:

1. SQL transform OR Python transform
2. Column selection (via all_columns or an explicit column list)
3. Value replacements
4. Handling of NaN values
5. Other transforms (OneHotEncoder, etc.)
6. Numpy type casting

This order is followed regardless of the order in which the operations are specified on the builder. Hence, the following two definitions behave identically:

# Parentheses let the chained calls span several lines.
(
    tb.TabularPreprocessor.builder()
    .add_column("target", target=True)
    .all_columns(True)
    .sql_transform("SELECT target, salar as sal, amt FROM data")
    .dtype("float32")
)

(
    tb.TabularPreprocessor.builder()
    .sql_transform("SELECT target, salar as sal, amt FROM data")
    .dtype("float32")
    .all_columns(True)
    .add_column("target", target=True)
)
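
The fixed ordering can be pictured with a small stand-in builder. This is a minimal sketch, not the library's implementation: `SketchBuilder`, `STEP_ORDER`, and `plan` are hypothetical names, and each setter only records its argument so that the steps can be emitted in a fixed order afterwards.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical step names, listed in the documented execution order.
STEP_ORDER = [
    "sql_transform",
    "columns",
    "replace",
    "handle_nan",
    "transformers",
    "dtype",
]


class SketchBuilder:
    """Stand-in builder: setters record settings, plan() orders them."""

    def __init__(self) -> None:
        self._settings: Dict[str, Any] = {}

    def sql_transform(self, query: str) -> "SketchBuilder":
        self._settings["sql_transform"] = query
        return self

    def all_columns(self, value: bool = True) -> "SketchBuilder":
        self._settings["columns"] = "all" if value else "explicit"
        return self

    def handle_nan(self, method: Any) -> "SketchBuilder":
        self._settings["handle_nan"] = method
        return self

    def dtype(self, dtype: str) -> "SketchBuilder":
        self._settings["dtype"] = dtype
        return self

    def plan(self) -> List[Tuple[str, Any]]:
        # Emit steps in STEP_ORDER, ignoring the order of the setter calls.
        return [(s, self._settings[s]) for s in STEP_ORDER if s in self._settings]


query = "SELECT * FROM data"
first = SketchBuilder().all_columns(True).sql_transform(query).dtype("float32")
second = SketchBuilder().dtype("float32").sql_transform(query).all_columns(True)
same_plan = first.plan() == second.plan()
```

Because the setters only store values, the call order leaves no trace in the recorded settings, which is one simple way to get the order-independent behavior described above.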

Static methods

def builder() -> TabularPreprocessorBuilder

Instance variables

var optional_target_column_name : Optional[str]

Any defined target column name, or None if no target.

Raises

MultipleTargetColumns: Multiple targets were defined

Returns

Optional[str]: Any defined target column name, or None if no target

var target_column_name : str

The target column name

Raises

MultipleTargetColumns: Multiple targets were defined
MissingTargetColumn: No target was defined

Returns

str: The string name of the target column

Methods

def dealias(self, alias: str) -> str

Convert an alias into the actual column name.

Args

alias: The alias of a column name.

Returns

str: The actual column name for the alias, or the input string unchanged if no matching alias is found.
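
The lookup described here can be sketched as follows. This is an illustration only: `SketchColumn` is a hypothetical stand-in for `Column`, and the free function `dealias` takes the column list as an argument, whereas the real method reads it from the preprocessor.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchColumn:
    """Hypothetical mirror of Column(name, aliases, target)."""

    name: str
    aliases: List[str] = field(default_factory=list)
    target: bool = False


def dealias(columns: List[SketchColumn], alias: str) -> str:
    """Return the real column name for an alias, or the input unchanged."""
    for column in columns:
        if alias in column.aliases:
            return column.name
    return alias


columns = [
    SketchColumn("salary", aliases=["sal", "pay"]),
    SketchColumn("target", target=True),
]
resolved = dealias(columns, "sal")    # an alias of "salary"
unknown = dealias(columns, "bonus")   # not an alias of any column
```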

def pandas_coerce(self, data: pandas.core.frame.DataFrame) -> pandas.core.frame.DataFrame

Apply parts of the TabularPreprocessor to the supplied data.

Specifically, this applies all_columns, add_column, sql_transform, and python_transform.

Args

data : pd.DataFrame: Data on which to apply the preprocessor.

script : str, optional: Contents of a Python script to apply to the dataframe through the Python transform.

Returns

DataFrame: A dataframe with parts of the preprocessor applied.

def set_sk_fitted_data_transform(self, transformers)

Set the scikit-learn fitted data transform to the provided transformer(s).

Args

transformers: The fitted transformer(s) to store.

def set_sk_fitted_target_transform(self, transformers)

Set the scikit-learn fitted target transform to the provided transformer(s).

Args

transformers: The fitted transformer(s) to store.

def sk_data_transform(self) -> Tuple[object, bool]

Get the scikit-learn fitted data transformers and an indicator of whether to fit the dataset.

Returns

Tuple[object, bool]: The transformers and an indicator for whether or not to fit the dataset

def sk_target_transform(self) -> Tuple[object, bool]

Get the scikit-learn fitted target transformers and an indicator of whether to fit the target.

Returns

Tuple[object, bool]: The transformers and an indicator for whether or not to fit the target
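
One plausible way for a caller to consume a (transformers, flag) pair like the ones returned by sk_data_transform and sk_target_transform is sketched below. Everything here is an assumption made for illustration: `MinShift` is a hypothetical stand-in for a scikit-learn transformer, `apply_pair` is a hypothetical helper, and the reading that the flag means "fit before transforming" is inferred from the wording above rather than from the library's source.

```python
from typing import List, Optional, Tuple


class MinShift:
    """Hypothetical stand-in transformer: subtracts the fitted minimum."""

    def __init__(self) -> None:
        self.minimum: Optional[float] = None

    def fit(self, values: List[float]) -> "MinShift":
        self.minimum = min(values)
        return self

    def transform(self, values: List[float]) -> List[float]:
        return [value - self.minimum for value in values]


def apply_pair(pair: Tuple[MinShift, bool], values: List[float]) -> List[float]:
    transformer, should_fit = pair
    if should_fit:
        # Fit only when the flag asks for it (for example, during training).
        transformer.fit(values)
    return transformer.transform(values)


# Training-style call: the transformer is fitted on the data it transforms.
train = apply_pair((MinShift(), True), [3.0, 5.0, 4.0])

# Inference-style call: reuse an already-fitted transformer without refitting.
fitted = MinShift().fit([3.0, 5.0, 4.0])
inference = apply_pair((fitted, False), [10.0])
```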

class TabularPreprocessorBuilder

Abstract base for a preprocessor that can output data as a numpy.ndarray

Ancestors

Class variables

var SCHEMA

Methods

def add_column(self, name: Union[str, List[str]], target: bool = False) -> TabularPreprocessorBuilder

Add a column to the list of columns to include in this operation.

Args

name : str, List[str]: Column name, or a list of names for the same column. If a list is passed, the additional names are treated as aliases. To include multiple columns, call this method once per column.
target : bool: Whether this is a target column. Target columns are used in operations such as training a model. Inference operations typically do not need a target and will ignore it if set.

Returns

TabularPreprocessorBuilder: This class instance, useful for chaining.
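
The name-versus-alias rule can be sketched with plain dataclasses. This is a hedged illustration: `SketchColumn` and `make_column` are hypothetical names, and treating the first list entry as the column name is an assumption drawn from the phrase "additional names are treated as aliases".

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class SketchColumn:
    """Hypothetical mirror of Column(name, aliases, target)."""

    name: str
    aliases: List[str] = field(default_factory=list)
    target: bool = False


def make_column(name: Union[str, List[str]], target: bool = False) -> SketchColumn:
    """Assumes the first list entry is the name and the rest are aliases."""
    if isinstance(name, str):
        return SketchColumn(name=name, target=target)
    first, *rest = name
    return SketchColumn(name=first, aliases=list(rest), target=target)


# A list describes ONE column with aliases; two columns need two calls.
columns = [
    make_column(["salary", "sal", "pay"]),
    make_column("target", target=True),
]
```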

def add_data_transformer(self, transform: Union[ForwardRef('OneHotEncoder'), ForwardRef('OrdinalEncoder'), ForwardRef('KBinsDiscretizer'), ForwardRef('MultiLabelBinarizer')], columns: Union[List, str] = [], params: Dict[str, object] = {}) -> TabularPreprocessorBuilder

Define a data transformer for feature/independent variables.

Args

transform : str: The transformation to be applied to the specified column(s). Currently supported: OneHotEncoder, KBinsDiscretizer, OrdinalEncoder.
columns : str or List: The column(s) to which the specified transformation will be applied.
params : dict: A dictionary of parameters specific to the transformation named by the transform parameter. The available parameters are described in the scikit-learn documentation. For each transform type, all scikit-learn parameters are supported except for 'sparse' in OneHotEncoder. See https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

Returns

TabularPreprocessorBuilder: This class instance, useful for chaining.
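
For readers unfamiliar with the transforms named above, the following shows what a one-hot encoding does to a single column. It is written in plain Python rather than scikit-learn, so it illustrates the idea only: `one_hot` is a hypothetical helper, and the generated field names are illustrative and may not match the layout the library produces.

```python
from typing import Dict, List

Row = Dict[str, object]


def one_hot(rows: List[Row], column: str) -> List[Row]:
    """Replace `column` with one 0/1 indicator field per distinct value."""
    categories = sorted({str(row[column]) for row in rows})
    encoded = []
    for row in rows:
        # Keep every other field untouched.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            is_match = str(row[column]) == category
            new_row[f"{column}_{category}"] = 1 if is_match else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"amt": 10, "dept": "ops"},
    {"amt": 20, "dept": "dev"},
    {"amt": 30, "dept": "ops"},
]
encoded = one_hot(rows, "dept")
```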

def add_target_transformer(self, transform: Union[ForwardRef('OneHotEncoder'), ForwardRef('OrdinalEncoder'), ForwardRef('KBinsDiscretizer'), ForwardRef('MultiLabelBinarizer')], columns: Union[List, str] = [], params: Dict[str, object] = {}) -> TabularPreprocessorBuilder

Define a transformer for target/dependent variables.

Args

transform : str: The transformation to be applied to the specified target column(s).
columns : str or List: The column(s) to which the specified transformation will be applied.
params : dict: A dictionary of parameters specific to the transformation named by the transform parameter. The available parameters are described in the scikit-learn documentation. For each transform type, all scikit-learn parameters are supported except for 'sparse' in OneHotEncoder. See https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing

Returns

TabularPreprocessorBuilder: This class instance, useful for chaining.

def all_columns(self, value: bool = True) -> TabularPreprocessorBuilder

Enable or disable the use of all columns in the dataset.

Args

value: Whether to use all columns in the dataset.

Returns

TabularPreprocessorBuilder: This class instance, useful for chaining.

def dtype(self, dtype: Optional[str]) -> TabularPreprocessorBuilder

Cast an output numpy array to a given dtype.

If not explicitly set, the protocol will choose the dtype. This is ignored for non-Numpy outputs.

Args

dtype: The dtype that a numpy output will be cast into.

Returns

TabularPreprocessorBuilder: This class instance, useful for chaining.

def expand_input_dims(self, axis: Tuple[int, int])

def expand_target_dims(self, expand=True)

def handle_nan(self, method: Union[str, int, float, dict]) -> TabularPreprocessorBuilder

Specify the method for handling NaN (not a number) values found in the data.

Usage examples:

# Drop rows with NaNs present in any column:
tb.TabularPreprocessor.builder().all_columns(True).handle_nan("drop")

# Set NaNs to the value zero:
tb.TabularPreprocessor.builder().all_columns(True).handle_nan(0)

# Specify different methods for different columns. Fill NaNs found
# in column "A" with the median value of the column, and fill NaNs
# found in column "B" with the value 42:
tb.TabularPreprocessor.builder().all_columns(True).handle_nan(
    {"A": "median", "B": 42}
)

Args

method : str, int, float or dict: Method of simple replacement for any NaN found in the data, or a dict specifying the method or replacement for specific fields.

- drop: Drop all rows where NaN is present (see df.dropna()).
- mean/median/min/max: Replace NaN values with the given calculated statistic for the column.
- int or float: Replace NaN values with the given value (see df.fillna()).

Returns

TabularPreprocessorBuilder: This class instance, useful for chaining.
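
The three forms of the method argument can be sketched on plain Python rows. This is a dependency-free illustration of the semantics described above, not the library's code (which points at df.dropna() and df.fillna()); `handle_nan`, `fill_column`, and `is_nan` are hypothetical helpers, and a per-column "drop" inside a dict is not covered.

```python
import math
import statistics
from typing import Dict, List, Union

Row = Dict[str, float]
Method = Union[str, int, float]

STATS = {"mean": statistics.mean, "median": statistics.median, "min": min, "max": max}


def is_nan(value: object) -> bool:
    return isinstance(value, float) and math.isnan(value)


def fill_column(rows: List[Row], column: str, method: Method) -> None:
    """Fill NaNs in one column with a named statistic or a constant, in place."""
    if isinstance(method, str):
        present = [row[column] for row in rows if not is_nan(row[column])]
        fill = STATS[method](present)
    else:
        fill = method
    for row in rows:
        if is_nan(row[column]):
            row[column] = fill


def handle_nan(rows: List[Row], method: Union[Method, Dict[str, Method]]) -> List[Row]:
    rows = [dict(row) for row in rows]  # work on a copy
    if method == "drop":
        # Keep only rows that have no NaN in any field.
        return [row for row in rows if not any(is_nan(v) for v in row.values())]
    # A single method applies to every column; a dict names specific columns.
    per_column = method if isinstance(method, dict) else {c: method for c in rows[0]}
    for column, column_method in per_column.items():
        fill_column(rows, column, column_method)
    return rows


nan = float("nan")
rows = [{"A": 1.0, "B": nan}, {"A": nan, "B": 2.0}, {"A": 5.0, "B": 4.0}]
dropped = handle_nan(rows, "drop")
zeroed = handle_nan(rows, 0)
mixed = handle_nan(rows, {"A": "median", "B": 42})
```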

def python_transform(self, script: Union[str, pathlib.Path]) -> TabularPreprocessorBuilder

Set a Python script that transforms a data asset, as a dataframe, before use.

The provided script must have the form:

(
    tb.TabularPreprocessor.builder()
    .all_columns(True)
    .python_transform(
        '''
        import pandas as pd

        def transform(df: pd.DataFrame) -> pd.DataFrame:
            # transform the dataframe as you'd like
            return df
        '''
    )
)

For security reasons only Pandas and Numpy can be imported.

NOTE: Only one of python_transform() or sql_transform() can be used on each dataset.

Args

script : Path or str: Path to a Python script, or a multiline string holding the script (leading whitespace will be intelligently trimmed).

Returns

TabularPreprocessorBuilder: This class instance, useful for chaining.
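
The note about leading whitespace can be illustrated with the standard library. This sketch makes several assumptions: it uses `textwrap.dedent` as one possible way to trim an indented script, `load_transform` is a hypothetical helper, the script operates on a list of dicts instead of a pandas DataFrame to stay dependency-free, and the import restriction mentioned above is not modelled.

```python
import textwrap
from typing import Callable, Dict, List

Row = Dict[str, float]

# Indented the way a script embedded in a method call would be.
script = '''
        def transform(rows):
            # double every "amt" value
            return [{**row, "amt": row["amt"] * 2} for row in rows]
        '''


def load_transform(source: str) -> Callable[[List[Row]], List[Row]]:
    """Dedent the script, execute it, and return its `transform` function."""
    namespace: dict = {}
    exec(textwrap.dedent(source), namespace)  # no sandboxing in this sketch
    return namespace["transform"]


transform = load_transform(script)
result = transform([{"amt": 1.5}, {"amt": 4.0}])
```

Without the dedent step, the indented `def` line would raise an IndentationError when executed, which is why some form of trimming is needed for scripts passed as multiline strings.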

def replace(self, substitutions: Union[list, tuple, dict]) -> TabularPreprocessorBuilder

Replace matching values in the whole dataset, or in a specific field, with the given value.

Usage examples:

import numpy as np

# Change -99 to np.nan everywhere
tb.TabularPreprocessor.builder().all_columns(True).replace((-99, np.nan))

# Change -99 in column "A" to NaN, and -1 in column "B" to zero
tb.TabularPreprocessor.builder().all_columns(True).replace(
    {"A": (-99, np.nan), "B": (-1, 0)}
)

Args

substitutions : list, tuple, dict: Change a value matching "from" to a new value.

- When a tuple, it is treated as: (from, new_value)
- When a list, it must contain tuples as: [(from, new_value), …]
- When a dict, it is treated as: {"FIELDNAME": (from, new_value), …}

Returns

TabularPreprocessorBuilder: This class instance, useful for chaining.
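
The three accepted shapes of substitutions can be sketched on plain Python rows. This is an illustration under stated assumptions rather than the library's implementation: `replace_values` and `matches` are hypothetical helpers, rows are dicts instead of a dataframe, and list entries are assumed to be applied in the order given.

```python
import math
from typing import Any, Dict, List, Tuple, Union

Row = Dict[str, Any]
Pair = Tuple[Any, Any]
Substitutions = Union[Pair, List[Pair], Dict[str, Pair]]


def matches(value: Any, target: Any) -> bool:
    # NaN never equals itself, so compare NaNs explicitly.
    if isinstance(value, float) and isinstance(target, float):
        if math.isnan(value) and math.isnan(target):
            return True
    return value == target


def replace_values(rows: List[Row], substitutions: Substitutions) -> List[Row]:
    rows = [dict(row) for row in rows]  # work on a copy
    if isinstance(substitutions, dict):
        plan = list(substitutions.items())          # (field, (from, new))
    elif isinstance(substitutions, tuple):
        plan = [(None, substitutions)]              # None means every field
    else:
        plan = [(None, pair) for pair in substitutions]
    for field_name, (old, new) in plan:
        for row in rows:
            keys = [field_name] if field_name is not None else list(row)
            for key in keys:
                if matches(row[key], old):
                    row[key] = new
    return rows


rows = [{"A": -99, "B": -1}, {"A": 3, "B": -99}]
everywhere = replace_values(rows, (-99, 0))
per_field = replace_values(rows, {"A": (-99, 0), "B": (-1, 0)})
chained = replace_values(rows, [(-99, 0), (-1, 1)])
```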

def sql_transform(self, query: Union[str, pathlib.Path]) -> TabularPreprocessorBuilder

Set an SQLite query to apply to a data asset to transform it before use.

The query will be executed against a table named "data". Any valid SQLite expression can be used to rename or modify values in this transitory table before the values are used in the operation. For example, this query renames "Y" to "target" and calculates "svr" from the raw values of svr and base:

(
    tb.TabularPreprocessor.builder()
    .all_columns(True)
    .sql_transform("SELECT Y as target, (svr * base) / 2 as svr FROM data")
)

NOTE: Only one of python_transform() or sql_transform() can be used on each dataset.

Args

query : str: The SQLite query to run.

Returns

TabularPreprocessorBuilder: This class instance, useful for chaining.
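
The example query can be tried directly with Python's built-in sqlite3 module. This sketch only demonstrates the query against a table named "data"; the sample rows and column types are invented for illustration, and how the library loads an asset into that table is not shown.

```python
import sqlite3

# Hypothetical raw rows: (Y, svr, base)
raw = [(1, 4.0, 3.0), (0, 10.0, 2.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (Y INTEGER, svr REAL, base REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", raw)

# Rename Y to target and recompute svr from the raw svr and base values.
query = "SELECT Y AS target, (svr * base) / 2 AS svr FROM data"
cursor = conn.execute(query)
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()
```

Note that a comma directly before FROM is a syntax error in SQLite, so the select list must end with its last expression.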

def target_dtype(self, dtype: Optional[str]) -> TabularPreprocessorBuilder

Cast an output target numpy value to a given dtype.

If not set, the protocol will choose the dtype. This is ignored for non-Torch outputs.

Args

dtype : str: The dtype into which the target output will be cast.

Returns

TabularPreprocessorBuilder: This class instance, useful for chaining.

Inherited members

OutputNumpy:
- output_numpy
OutputNumpyTarget:
- output_numpy_target
OutputPandas:
- output_pandas
OutputTorchDataset:
- output_torch_dataset
IsDict:
- from_dict
- to_dict

class TabularTorchPreprocessor (columns: List[Column], all_columns: bool, sql_transform: Optional[str], python_transform: Optional[str], dtype: Union[str, numpy.dtype, ForwardRef(None)], expand_target_dims: bool, target_dtype: Union[str, numpy.dtype, ForwardRef(None)], sk_data_transformers: List, sk_target_transformers: List, expand_input_dims: Optional[Tuple[int, int]], handle_nan: Union[str, int, float, dict, ForwardRef(None)], substitutions: Union[list, tuple, dict, ForwardRef(None)])

Preprocessor designed for tabular-style data tasks.

This family of Preprocessors is designed to manage tabular-style data, i.e. rows of data with named columns for fields. Operations are always applied in this order:

1. SQL transform OR Python transform
2. Column selection (via all_columns or an explicit column list)
3. Value replacements
4. Handling of NaN values
5. Other transforms (OneHotEncoder, etc.)
6. Numpy type casting

This order is followed regardless of the order in which the operations are specified on the builder. Hence, the following two definitions behave identically:

# Parentheses let the chained calls span several lines.
(
    tb.TabularPreprocessor.builder()
    .add_column("target", target=True)
    .all_columns(True)
    .sql_transform("SELECT target, salar as sal, amt FROM data")
    .dtype("float32")
)

(
    tb.TabularPreprocessor.builder()
    .sql_transform("SELECT target, salar as sal, amt FROM data")
    .dtype("float32")
    .all_columns(True)
    .add_column("target", target=True)
)

Ancestors

Inherited members

TabularPreprocessor:
- dealias
- optional_target_column_name
- pandas_coerce
- set_sk_fitted_data_transform
- set_sk_fitted_target_transform
- sk_data_transform
- sk_target_transform
- target_column_name
TorchDatasetPreprocessor:
- read_file