The TableAsset wraps a generic asset, allowing the complexity of creating jobs to be completely hidden. Common operations can happen with just a few lines of code.

For example:

import tripleblind as tb

table1 = tb.TableAsset(company_1_customers)
table2 = tb.TableAsset(company_2_clients)

# Privately determine overlap between above remote tables and a local CSV
overlap = table1.intersect(
    intersect_with=[table2, "broker-licenses.csv"],
    match_column="ssn")
if overlap:
    overlap.print_content()

Classes

class StatFunc (value, names=None, *, module=None, qualname=None, type=None, start=1)

Enumeration of statistical functions supported by get_statistics()

Ancestors

enum.Enum

Class variables

var CONFIDENCE_INTERVAL: The 95% confidence interval for the sample mean. Labeled 'ci-lower' and 'ci-upper'.
var COUNT: The number of items in the data. Labeled 'n'.
var KURTOSIS: The kurtosis of the data, which is the combined weight of a distribution's tails relative to the center of the distribution. Labeled 'kurt'.
var MAXIMUM: The maximum value in the data. Labeled 'max'.
var MEAN: The arithmetic mean of the data. Labeled 'mean'.
var MEDIAN: The median value in the data. Labeled 'median'.
var MINIMUM: The minimum value in the data. Labeled 'min'.
var QUARTILES: The q1 and q3 quartiles of the data plus median. Labeled 'q1', 'mean' ,'q3'.
var SKEW: The skew (degree of asymmetry) of the data. Labeled 'skew'.
var STANDARD_DEVIATION: The standard deviation (dispersion relative to the mean) of the data. Labeled 'sd'.
var STANDARD_ERROR: The standard error for the sample mean. Labeled 'se'.
var VARIANCE: The variance (spread of the numbers) of the data. Labeled 'var'.

class TableAsset (uuid: UUID)

A specialized Asset which holds spreadsheet-style data containing rows and columns.

NOTE: Operations may download the data temporarily. This is done in a secure fashion, but local data should be treated with care.

Ancestors

Asset

Static methods

def cast(asset: Asset) -> TableAsset

Convert a generic Asset into a TableAsset

This should only be used on data known to be tabular, no validation occurs during the cast.

Args

asset : Asset: A generic Asset

Returns

TableAsset: A TableAsset object

def find(search: Optional[Union[str, re.Pattern]], namespace: Optional[UUID] = None, owned: Optional[bool] = False, owned_by: Optional[int] = None, session: Optional[Session] = None, exact_match: Optional[bool] = True) -> TableAsset

Search the Router index for an asset matching the given search

Args

search : str or re.Pattern, optional: Either an asset ID or a search pattern applied to asset names and descriptions. A simple string will match a substring or the entire string if exact_match is True, or a regular expression can be passed for complex searches.
namespace : UUID, optional: The UUID of the user to which this asset belongs. None indicates any user, NAMESPACE_DEFAULT_USER indicates the current API user.
owned : bool, optional: Only return owned assets (either personally or by the current user's organization)
owned_by : int, optional: Only return owned assets owned by the given organization ID
session : Session, optional: A connection session. If not specified, the default session is used.
exact_match : bool, optional: When the 'search' is a string, setting this to True will perform an exact match. Ignored for regex patterns, defaults to True.

Raises

TripleblindAssetError: Thrown when multiple assets are found which match the search.

Returns

TableAsset: A single asset, or None if no match found

def find_linked_records(datasets: List[Union[Asset, TableAsset, str]], match_columns: List[Union[str, Tuple(str, str)]], match_threshold: Optional[float] = 0.88, job_name: Optional[str] = None, preproc: Optional[Union[TabularPreprocessor, List[TabularPreprocessor], TabularPreprocessorBuilder, List[TabularPreprocessorBuilder]]] = None, silent: Optional[bool] = False) -> TableAsset

Find matching records between datasets

Args

dataset : Union[List[Asset], List[TableAsset]]: The two data assets to examine for linked records.
match_columns (List[Union[str, Tuple(str, str)]]): Names of the
columns to match. List of common column names or tuples
containing the column names to be paired for the linkage.
E.g. if the first dataset has columns named "fname", "lname" and
"age" and the second dataset has a columns named "first", "last"
and "age", then the columns would be specified as:
[('fname', 'first'), ('lname', 'last'), 'age')].
match_threshold : float, optional: The minimum confidence for match values. This is a value between 0.0 and 1.0. Default is 0.88.
job_name : str, optional: Reference name for this process. This name will appear in the Access Request, Process History, and Audit Reports.
preproc : Union[TabularPreprocessor, List[TabularPreprocessor]], optional: The preprocessor(s) to use against the datasets. When no preprocessor is specified, the default preprocessor selects all match columns and renames columns for linkage.
silent : bool, optional: Suppress status messages during execution? Default is to show messages.

Raises

TripleblindProcessError: Thrown when a column in return_columns is masked or when k-Grouping validation fails (output contains less than k records).
ValueError: Thrown when a parameter validation fails.

Returns

TableAsset: The output contains rows from the first dataset which had matches in the second dataset. Following those rows, the second dataset matches are reported in the match order. Both include the cosine similarity of the match.

def position(file_handle: Union[str, Path, Package, io.BufferedReader], name: str, desc: str, is_discoverable: Optional[bool] = False, cost: int = 1, allow_overwrite: Optional[bool] = False, session: Optional[Session] = None, is_dataset: bool = True, custom_protocol: Optional[Any] = None, metadata: dict = {}, k_grouping: int = 5, strategy: Optional[str] = None, unmask_columns: Optional[List[str]] = None) -> TableAsset

Place CSV data on your Access Point for use in by yourself or others.

CSV Import Rules:

The separator must be a comma, not semicolons or tabs.
First row must be a header row containing column names.
- No padding is allowed on names, extra spaces will be treated as part of the name.
- Names can optionally be enclosed in double-quotes ('"').
Legitimate values:
- The special numeric values 'inf', '-inf', 'nan' are supported. Capitalization is ignored (e.g. INF and InF are both legal)
- Boolean values can be 'true' or 'false', case is ignored
- Numeric values can be either integers (no decimals) or floating point (with a decimal).
- Scientific notation (e.g. 1.23e04 to represent 12300.0) can be use for floating point values.
- String values can be enclosed in double-quotes ('"') or not, although string containing commas must be within a quoted string.
Column type is inferred from the entire column's content. The following will impact this process:
- Missing values mixed in with booleans becomes a string column
- Any floating point value mixed in with integers makes a floating point column
- Infinite, NaN or missing values in a column with integers makes a floating point column
- Leading spaces before booleans, or the special values -Inf, Inf, or NaN will result in a string column
- Missing values are treated as NaN in floating columns (and will also force an integer column to floating point)
Column types in dataframes:
- Floating point columns are reported as 'np.float64'
- Integer columns are reported as 'np.int64'
- Boolean columns are reported as 'np.bool'
- String columns are reported as 'object'
Leading spaces get included in string columns, but will be trimmed from simple numbers
Quotes are optional for alpha-numerics but required for most strings with special characters (including commas)

Args

file_handle : str, Path, Package or io.BufferedReader: File handle or path to CSV data to place on the API user's associated Access Point.
name : str: Name of the new asset.
desc : str: Description of the new asset (can include markdown)
is_discoverable : bool, optional: Should this asset be listed in the Router index to be found and used by others?
cost : int: Price for accessing this asset, in US cents. Default is $0.01. Note: This can be overridden for specific organizations with an Agreement.
allow_overwrite : bool, optional: If False an exception will be thrown if the asset name already exists. If True, an existing asset will be overwritten.
session : Session, optional: A connection session. If not specified, the default session is used.
is_dataset : bool: Is this a dataset? (False == algorithm)
custom_protocol: Internal use
metadata : dict: Custom metadata to include in the asset
k_grouping : int: The minimum number of distinct values
strategy : str, optional: Upload strategy Use 'stream' to use a web socket, or 'post' to perform a simple post to position the asset. Default is 'stream'.
unmask_columns : [str], optional: List of field names that should be unmasked. Default to to mask all fields.

Returns

TableAsset: New asset on the Router, or None on failure

Instance variables

var dataframe : pandas.core.frame.DataFrame

The content of the asset loaded as a Pandas dataframe

Returns

pd.DataFrame: The asset data

var raw_content : str

The content of the asset loaded as a string

Returns

str: The content

Methods

def blind_join(self, intersect_with: Union[Asset, List[Asset], TableAsset, List[TableAsset]], match_column: Union[str, List[str]], return_columns: Union[str, List[str], List[List[str]]], match_fuzziness: Optional[float] = None, job_name: Optional[str] = None, preproc: Optional[Union[TabularPreprocessor, List[TabularPreprocessor], TabularPreprocessorBuilder, List[TabularPreprocessorBuilder]]] = None, silent: Optional[bool] = False, join_type: JoinType = JoinType.INNER) -> TableAsset

The intersection of datasets using exact or fuzzy matching.

The first dataset must be yours, so this method can only be invoked on an asset your organization owns.

The 'match_fuzziness' is an inverted Jaro-Winkler distance. In a Jaro-Winkler comparison, a score of 0.0 is totally different and 1.0 is perfectly matched. Therefore 'match_fuzziness' = 0.0 is a perfect match and 'match_fuzziness' = 1.0 is completely different.

Args

intersect_with : Union[Asset, List[Asset]]

A dataset Asset (or a list of Assets) to be included in the Blind Join.

match_column : Union[str, List[str]]

Name of the column to match. If not the same in all datasets, a list of the matching column names, starting with the initiator asset and then listing a name in each intersect_with dataset.

match_fuzziness : float, optional

If specified, the fuzziness threshold for match values. This is a value between 0.0 and 1.0. Default is 0.0 (perfect match).

return_columns : Union[str, List[str], List[List[str]]]

Name of the columns to be returned in the resultant dataset. This is a list of lists of column names such that each party has its own list of column names (starting, as with match_column, with the initiator asset).

str: If a single column name is provided, columns matching that name across all datasets will be returned. eg. ['a'] is interpreted as [ ['a'], ['a'], ['a'] ]

List[str]: If a list of names is provided, columns matching these names across all datasets will be returned. eg. ['a', 'b'] is interpreted as [ ['a', 'b'], ['a', 'b'], ['a', 'b'] ]

List[List[str]]: If a list of lists is provided, only columns matching the names from the respective dataset will be returned. eg. [ ['a', 'b'], ['c', 'd'] ] is interpreted as [ ['a', 'b'], ['c', 'd'] ]

NOTE: If a column name is specified for an asset and it does not exist, it is ignored. If multiple datasets have the same name for a column, the output dataset will contain suffixes to differentiate the source of each field. E.g. If "name" is in multiple datasets, the return will contain "name", "name__1" & "name__2".

job_name : str, optional

Reference name for this process. This name will appear in the Access Request, Process History, and Audit Reports.

preproc : Union[TabularPreprocessor, List[TabularPreprocessor]], optional

The preprocessor(s) to use against the datasets. When no preprocessor is specified, the default preprocessor selects all columns.

silent : bool, optional

Suppress status messages during execution? Default is to show messages.

join_type : JoinType, optional

Describes the format of the resulting dataset table. Default is INNER.

INNER: This is a SQL-like inner join where each party's return columns are included in the result set.

INNER_PARTITIONED: This is an inner join with each party's matching rows appended to the end of the dataset. There is no alignment by matching column. This format provides more privacy-protection and reveals less group-membership information than a standard inner join.

LEFT: This is a SQL-like left join where ALL records are returned from the initiator's table, with matching records from each party's return columns (where available).

Raises

TripleblindProcessError: Thrown when a column in return_columns is masked or when k-Grouping validation fails (output contains less than k records).

Returns

TableAsset: A table of matching rows from all datasets. Rows from your table with multiple matches are duplicated such that all column values from matching rows in the second dataset are returned, one per row. If the resulting table is empty, this means there were no records found in the intersection of the datasets.

def detect_outlier(self, identifier_column: str, columns: Union[str, List[str]], std_dev_limit: int = 3, job_name: Optional[str] = None, silent: Optional[bool] = False, session: Optional[Session] = None) -> List[str]

Identify rows in a dataset containing unusual values, or outliers

This operation privately analyzes the given dataset values. Each value is then compared to the mean value for the dataset, identifying rows where the value is outside the number of standard deviations given. This is known as the Z Score. For a normal distribution, 99.7% of the values will be within 3 standard deviations of the mean.

Args

identifier_column : str: Name of the column in the table containing the record identifier, e.g. "id"
columns : Union[str, List[str]]: Name of the column(s) to check for outliers.
std_dev_limit : int: The number of standard deviations outside of which a value is considered an outlier. Default is 3, which captures 99.7% of a normally distributed dataset.
job_name : str, optional: Reference name for the job with performs this task.
silent : bool, optional: Suppress status messages during execution? Default is to show messages.
session : Session, optional: A connection session. If not specified, the default session is used.

Returns

List[str]: List of identifiers for rows containing outliers.

def get_column_names(self) -> list

Fetch Column Names for a Given Asset

This should only be used on data known to be tabular. Relies on find to pass along asset.

Args

self : TableAsset: A generic TableAsset - utilize uuid

Returns

List: A list of column names or an empty list

def get_mock_data(self, session: Optional[Session] = None) -> pandas.core.frame.DataFrame

Retrieve 10 rows of sample data typical of the dataset Args:

Returns

pd.DataFrame: A table of the example data

def get_sample(self, num_samples: Optional[int] = 10, silent: Optional[bool] = False, session: Optional[Session] = None) -> pandas.core.frame.DataFrame

Generates synthetic data typical of the dataset

Result is randomly generated data similar to the actual dataset.

Args

num_samples : int, optional: Number of synthetic samples to generate. Default is 10.
silent : bool, optional: Suppress status messages during execution? Default is to show messages.
session : Session, optional: A connection session. If not specified, the default session is used.

Returns

pd.DataFrame: A table of the synthetic data

def get_statistics(self, column: Union[str, List[str]], function: Union[StatFunc, List[StatFunc]] = None, combine_with: Optional[Union[Asset, List[Asset]]] = None, group_by: Optional[str] = None, preproc: Optional[Union[TabularPreprocessor, List[TabularPreprocessor], TabularPreprocessorBuilder, List[TabularPreprocessorBuilder]]] = None, job_name: Optional[str] = None, silent: Optional[bool] = False, session: Optional[Session] = None, match_column: Optional[Union[str, list[str]]] = None) -> TableAsset

Calculate one or more statistics on the database.

NOTE: String column types will be coerced to float32 when calculating statistics, if possible. Any non-numeric string in the column will result in an error.

Args

column : str, List[str]: Name of data column(s) upon which to calculate. Column names must be common between all of the data.
function : StatFunc, List[StatFunc], optional: Function(s) to calculate. If not specified, all stats are calculated.
combine_with : Union[Asset, List[Asset]], optional: Other table(s) with the same data/columns to virtually combine for the calculation.
group_by : str, optional: Data column for grouping of data before the calculation.
preproc : TabularPreprocessor, List[TabularPreprocessor], TabularPreprocessorBuilder, List[TabularPreprocessorBuilder], optional: The preprocessor(s) to use against datasets. If a list is given, the order must be the same as the combine_with assuming the first entry is this TableAsset.
job_name : str, optional: Reference name for the job with performs this task.
silent : bool, optional: Suppress status messages during execution? Default is to show messages.
session : Session, optional: A connection session. If not specified, the default session is used.
match_column : Union[str, List[str]], optional: If not provided, the data is treated as horizontally partitioned. If provided, the data is treated as vertically partitioned and match_column is the name of the column to match. If not the same in all datasets, a list of the matching column names, starting with the initiator asset and then listing a name in each intersect_with dataset.

Returns

TableAsset: A table of the computed statistics.

def intersect(self, intersect_with: Union[Asset, str, Path, Package, List[Union[Asset, str, Path, Package]]], match_column: str, job_name: Optional[str] = None, silent: Optional[bool] = False, session: Optional[Session] = None) -> TableAsset

Perform a Private Set Intersection

Private Set Intersection (PSI) allows you to privately compare data
you own with other datasets.  The first dataset must be yours, so
this method can only be invoked on an asset your organization owns.

Args

intersect_with : Union[Asset, List[Asset]]: Other table(s) to intersect against. Can be Assets or paths to local files.
match_column : str: The name of the colum to match on each table
job_name : str, optional: Reference name for the job with performs this task.
silent : bool, optional: Suppress status messages during execution? Default is to show messages.
session : Session, optional: A connection session. If not specified, the default session is used.

Returns

TableAsset: The intersection of the datasets

def load(self, sep: Optional[str] = None, col_names: Optional[List[str]] = None, header: Optional[bool] = True) -> pandas.core.frame.DataFrame

Retrieve and load the asset into a Pandas dataframe

Args

sep : str, optional: Character separating values. Default is comma.
col_names : [str], optional: List of column names. Default is no names.
header : bool, optional: Read first line as a header? Default is True.

def mask_columns(self, col_names: Union[str, List[str]], mask_type: MockType = MockType.random, params: Optional[Dict[str, Any]] = None, session: Optional[Session] = None)

Masks columns identified by the supplied list of MaskColumns.

Args

col_names : List or str: Column name or list of names to mask.
mask_type : MockType: The mock type to use when generating data.
params : Dict[str, Any], optional: Parameters used by the mask_type. For example, for MockType.email it could be 'params={"domain": "gmail.com"}'.
session : Session, optional: A connection session. If not specified, the default session is used.

Returns

True if the operation succeeded, otherwise false.

def pretty_print(self, col_names: Optional[List[str]] = None)

Display the table nicely.

Args

col_names : [str], optional: Names to display for each column

def print_content(self, num_cols: Optional[int] = None, indent: Optional[int] = None, show_rows: Optional[bool] = False)

Print the contents of a TableAsset

NOTE: This downloads the data temporarily. This is done in a secure fashion, but local data should be treated with care.

Args

num_cols : int, optional: Number of columns for output. Default is single column.
indent : int, optional: Number of spaces to indent output. Default is zero.
show_rows : bool, optional: Should row numbers be displayed? Default is no row numbers.

def search(self, search_column: str, pattern: List[Union[str, re.Pattern]], match_case: bool = True, job_name: Optional[str] = None, silent: Optional[bool] = False, session: Optional[Session] = None) -> TableAsset

Perform string searches against rows of the given column

Args

search_column : str: The name of the column to search.
pattern : Union[str, re.Pattern]: Search pattern(s). These can be simple strings or full regex patterns.
match_case : bool, optional: For a simple string search, should the case be matched? Defaults to True.
job_name : str, optional: A reference name
silent : bool, optional: Suppress status messages during execution? Default is to show messages.
session : Session, optional: A connection session. If not specified, the default session is used.

Returns

TableAsset: The search results, each on its own row

def unmask_columns(self, col_names: Union[str, List[str]], session: Optional[Session] = None) -> bool

Unmasks columns identified by the supplied list of col_names.

Args

col_names : List or str: Column name or list of names to unmask.
session : Session, optional: A connection session. If not specified, the default session is used.

Returns

True if the operation succeeded, otherwise false.

Inherited members

Asset:
- accesspoint_filename
- activate_date
- add_agreement
- archive
- deactivate_date
- delete
- desc
- download
- filename
- find_all
- hash
- is_active
- is_discoverable
- is_valid
- k_grouping
- list_agreements
- metadata
- name
- namespace
- organization
- organization_id
- publish_to_org
- retrieve
- upload
- uuid