Module tripleblind.table_asset
Specialized Asset representing tabular data, like a spreadsheet.
The TableAsset wraps a generic asset, allowing the complexity of creating jobs to be completely hidden. Common operations can happen with just a few lines of code.
For example:
import tripleblind as tb
table1 = tb.TableAsset(company_1_customers)
table2 = tb.TableAsset(company_2_clients)
# Privately determine overlap between above remote tables and a local CSV
overlap = table1.intersect(
intersect_with=[table2, "broker-licenses.csv"],
match_column="ssn")
if overlap:
overlap.print_content()
Classes
class StatFunc (value, names=None, *, module=None, qualname=None, type=None, start=1)
-
Enumeration of statistical functions supported by get_statistics()
Ancestors
- enum.Enum
Class variables
var CONFIDENCE_INTERVAL
-
The 95% confidence interval for the sample mean. Labeled 'ci-lower' and 'ci-upper'.
var COUNT
-
The number of items in the data. Labeled 'n'.
var KURTOSIS
-
The kurtosis of the data, which is the combined weight of a distribution's tails relative to the center of the distribution. Labeled 'kurt'.
var MAXIMUM
-
The maximum value in the data. Labeled 'max'.
var MEAN
-
The arithmetic mean of the data. Labeled 'mean'.
var MEDIAN
-
The median value in the data. Labeled 'median'.
var MINIMUM
-
The minimum value in the data. Labeled 'min'.
var QUARTILES
-
The q1 and q3 quartiles of the data plus median. Labeled 'q1', 'mean' ,'q3'.
var SKEW
-
The skew (degree of asymmetry) of the data. Labeled 'skew'.
var STANDARD_DEVIATION
-
The standard deviation (dispersion relative to the mean) of the data. Labeled 'sd'.
var STANDARD_ERROR
-
The standard error for the sample mean. Labeled 'se'.
var VARIANCE
-
The variance (spread of the numbers) of the data. Labeled 'var'.
class TableAsset (uuid: UUID)
-
A specialized Asset which holds spreadsheet-style data containing rows and columns.
NOTE: Operations may download the data temporarily. This is done in a secure fashion, but local data should be treated with care.
Ancestors
Static methods
def cast(asset: Asset) -> TableAsset
-
Convert a generic Asset into a TableAsset
This should only be used on data known to be tabular, no validation occurs during the cast.
Args
asset
:Asset
- A generic Asset
Returns
TableAsset
- A TableAsset object
def find(search: Optional[Union[str, re.Pattern]], namespace: Optional[UUID] = None, owned: Optional[bool] = False, owned_by: Optional[int] = None, session: Optional[Session] = None, exact_match: Optional[bool] = True) -> TableAsset
-
Search the Router index for an asset matching the given search
Args
search
:str
orre.Pattern
, optional- Either an asset ID or a search pattern applied to asset names and descriptions. A simple string will match a substring or the entire string if exact_match is True, or a regular expression can be passed for complex searches.
namespace
:UUID
, optional- The UUID of the user to which this asset belongs. None indicates any user, NAMESPACE_DEFAULT_USER indicates the current API user.
owned
:bool
, optional- Only return owned assets (either personally or by the current user's team)
owned_by
:int
, optional- Only return owned assets owned by the given team ID
session
:Session
, optional- A connection session. If not specified, the default session is used.
exact_match
:bool
, optional- When the 'search' is a string, setting this to True will perform an exact match. Ignored for regex patterns, defaults to True.
Raises
TripleblindAssetError
- Thrown when multiple assets are found which match the search.
Returns
TableAsset
- A single asset, or None if no match found
def find_linked_records(datasets: List[Union[Asset, TableAsset, str]], match_columns: List[Union[str, Tuple(str, str)]], match_threshold: Optional[float] = 0.88, job_name: Optional[str] = None, preproc: Optional[Union[TabularPreprocessor, List[TabularPreprocessor], TabularPreprocessorBuilder, List[TabularPreprocessorBuilder]]] = None, silent: Optional[bool] = False)
-
Find matching records between datasets
Args
dataset
:Union[List[Asset], List[TableAsset]]
- The two data assets to examine for linked records.
- match_columns (List[Union[str, Tuple(str, str)]]): Names of the
- columns to match. List of common column names or tuples
- containing the column names to be paired for the linkage.
- E.g. if the first dataset has columns named "fname", "lname" and
- "age" and the second dataset has a columns named "first", "last"
- and "age", then the columns would be specified as:
- [('fname', 'first'), ('lname', 'last'), 'age')].
match_threshold
:float
, optional- The minimum confidence for match values. This is a value between 0.0 and 1.0. Default is 0.88.
job_name
:str
, optional- Reference name for this process. This name will appear in the Access Request, Process History, and Audit Reports.
preproc
:Union[TabularPreprocessor, List[TabularPreprocessor]]
, optional- The preprocessor(s) to use against the datasets. When no preprocessor is specified, the default preprocessor selects all match columns and renames columns for linkage.
silent
:bool
, optional- Suppress status messages during execution? Default is to show messages.
Raises
TripleblindProcessError
- Thrown when a column in return_columns is masked or when k-Grouping validation fails (output contains less than k records).
ValueError
- Thrown when a parameter validation fails.
Returns
TableAsset
- The output contains rows from the first dataset which had matches in the second dataset. Following those rows, the second dataset matches are reported in the match order. Both include the cosine similarity of the match.
def position(file_handle: Union[str, Path, Package, io.BufferedReader], name: str, desc: str, is_discoverable: Optional[bool] = False, allow_overwrite: Optional[bool] = False, session: Optional[Session] = None, is_dataset: bool = True, custom_protocol: Optional[Any] = None, metadata: dict = {}, k_grouping: int = 5, unmask_columns: Optional[List[str]] = None)
-
Place CSV data on your Access Point for use in by yourself or others.
CSV Import Rules:
- The separator must be a comma, not semicolons or tabs.
- First row must be a header row containing column names.
- No padding is allowed on names, extra spaces will be treated as part of the name.
- Names can optionally be enclosed in double-quotes ('"').
- Legitimate values:
- The special numeric values 'inf', '-inf', 'nan' are supported. Capitalization is ignored (e.g. INF and InF are both legal)
- Boolean values can be 'true' or 'false', case is ignored
- Numeric values can be either integers (no decimals) or floating point (with a decimal).
- Scientific notation (e.g. 1.23e04 to represent 12300.0) can be use for floating point values.
- String values can be enclosed in double-quotes ('"') or not, although string containing commas must be within a quoted string.
- Column type is inferred from the entire column's content.
The
following will impact this process:
- Missing values mixed in with booleans becomes a string column
- Any floating point value mixed in with integers makes a floating point column
- Infinite, NaN or missing values in a column with integers makes a floating point column
- Leading spaces before booleans, or the special values -Inf, Inf, or NaN will result in a string column
- Missing values are treated as NaN in floating columns (and will also force an integer column to floating point)
- Column types in dataframes:
- Floating point columns are reported as 'np.float64'
- Integer columns are reported as 'np.int64'
- Boolean columns are reported as 'np.bool'
- String columns are reported as 'object'
- Leading spaces get included in string columns, but will be trimmed from simple numbers
- Quotes are optional for alpha-numerics but required for most strings with special characters (including commas)
Args
file_handle
:str, Path, Package
orio.BufferedReader
- File handle or path to CSV data to place on the API user's associated Access Point.
name
:str
- Name of the new asset.
desc
:str
- Description of the new asset (can include markdown)
is_discoverable
:bool
, optional- Should this asset be listed in the Router index to be found and used by others?
allow_overwrite
:bool
, optional- If False an exception will be thrown if the asset name already exists. If True, an existing asset will be overwritten.
session
:Session
, optional- A connection session. If not specified, the default session is used.
is_dataset
:bool
- Is this a dataset? (False == algorithm)
custom_protocol
- Internal use
metadata
:dict
- Custom metadata to include in the asset
k_grouping
:int
- The minimum number of distinct values
unmask_columns
:[str]
, optional- List of field names that should be unmasked. Default to to mask all fields.
Returns
TableAsset
- New asset on the Router, or None on failure
Instance variables
var dataframe : pd.DataFrame
-
The content of the asset loaded as a Pandas dataframe
Returns
pd.DataFrame
- The asset data
var raw_content : str
-
The content of the asset loaded as a string
Returns
str
- The content
Methods
def blind_join(self, intersect_with: Union[Asset, List[Asset], TableAsset, List[TableAsset]], match_column: Union[str, List[str]], return_columns: Union[str, List[str], List[List[str]]], match_fuzziness: Optional[float] = None, job_name: Optional[str] = None, preproc: Optional[Union[TabularPreprocessor, List[TabularPreprocessor], TabularPreprocessorBuilder, List[TabularPreprocessorBuilder]]] = None, silent: Optional[bool] = False, join_type: JoinType = JoinType.INNER) -> TableAsset
-
The intersection of datasets using exact or fuzzy matching.
The first dataset must be yours, so this method can only be invoked on an asset your team owns.
The 'match_fuzziness' is an inverted Jaro-Winkler distance. In a Jaro-Winkler comparison, a score of 0.0 is totally different and 1.0 is perfectly matched. Therefore 'match_fuzziness' = 0.0 is a perfect match and 'match_fuzziness' = 1.0 is completely different.
Args
intersect_with
:Union[Asset, List[Asset]]
- A dataset Asset (or a list of Assets) to be included in the Blind Join.
match_column
:Union[str, List[str]]
- Name of the column to match. If not the same in all datasets, a list of the matching column names, starting with the initiator asset and then listing a name in each intersect_with dataset.
match_fuzziness
:float
, optional- If specified, the fuzziness threshold for match values. This is a value between 0.0 and 1.0. Default is 0.0 (perfect match).
return_columns
:Union[str, List[str], List[List[str]]]
-
Name of the columns to be returned in the resultant dataset. This is a list of lists of column names such that each party has its own list of column names (starting, as with match_column, with the initiator asset).
str: If a single column name is provided, columns matching that name across all datasets will be returned. eg. ['a'] is interpreted as [ ['a'], ['a'], ['a'] ]
List[str]: If a list of names is provided, columns matching these names across all datasets will be returned. eg. ['a', 'b'] is interpreted as [ ['a', 'b'], ['a', 'b'], ['a', 'b'] ]
List[List[str]]: If a list of lists is provided, only columns matching the names from the respective dataset will be returned. eg. [ ['a', 'b'], ['c', 'd'] ] is interpreted as [ ['a', 'b'], ['c', 'd'] ]
NOTE: If a column name is specified for an asset and it does not exist, it is ignored. If multiple datasets have the same name for a column, the output dataset will contain suffixes to differentiate the source of each field. E.g. If "name" is in multiple datasets, the return will contain "name", "name__1" & "name__2".
job_name
:str
, optional- Reference name for this process. This name will appear in the Access Request, Process History, and Audit Reports.
preproc
:Union[TabularPreprocessor, List[TabularPreprocessor]]
, optional- The preprocessor(s) to use against the datasets. When no preprocessor is specified, the default preprocessor selects all columns.
silent
:bool
, optional- Suppress status messages during execution? Default is to show messages.
join_type
:JoinType
, optional-
Describes the format of the resulting dataset table. Default is INNER.
INNER: This is a SQL-like inner join where each party's return columns are included in the result set.
INNER_PARTITIONED: This is an inner join with each party's matching rows appended to the end of the dataset. There is no alignment by matching column. This format provides more privacy-protection and reveals less group-membership information than a standard inner join.
LEFT: This is a SQL-like left join where ALL records are returned from the initiator's table, with matching records from each party's return columns (where available).
Raises
TripleblindProcessError
- Thrown when a column in return_columns is masked or when k-Grouping validation fails (output contains less than k records).
Returns
TableAsset
- A table of matching rows from all datasets. Rows from your table with multiple matches are duplicated such that all column values from matching rows in the second dataset are returned, one per row. If the resulting table is empty, this means there were no records found in the intersection of the datasets.
def detect_outlier(self, identifier_column: str, columns: Union[str, List[str]], std_dev_limit: int = 3, job_name: Optional[str] = None, silent: Optional[bool] = False, session: Optional[Session] = None) -> List[str]
-
Identify rows in a dataset containing unusual values, or outliers
This operation privately analyzes the given dataset values. Each value is then compared to the mean value for the dataset, identifying rows where the value is outside the number of standard deviations given. This is known as the Z Score. For a normal distribution, 99.7% of the values will be within 3 standard deviations of the mean.
Args
identifier_column
:str
- Name of the column in the table containing the record identifier, e.g. "id"
columns
:Union[str, List[str]]
- Name of the column(s) to check for outliers.
std_dev_limit
:int
- The number of standard deviations outside of which a value is considered an outlier. Default is 3, which captures 99.7% of a normally distributed dataset.
job_name
:str
, optional- Reference name for the job with performs this task.
silent
:bool
, optional- Suppress status messages during execution? Default is to show messages.
session
:Session
, optional- A connection session. If not specified, the default session is used.
Returns
List[str]
- List of identifiers for rows containing outliers.
def get_column_names(self) -> list
-
Fetch Column Names for a Given Asset
This should only be used on data known to be tabular. Relies on find to pass along asset.
Args
self
:TableAsset
- A generic TableAsset - utilize uuid
Returns
List
- A list of column names or an empty list
def get_mock_data(self, session: Optional[Session] = None) -> pandas.core.frame.DataFrame
-
Retrieve 10 rows of sample data typical of the dataset Args:
Returns
pd.DataFrame
- A table of the example data
def get_sample(self, num_samples: Optional[int] = 10, silent: Optional[bool] = False, session: Optional[Session] = None) -> pandas.core.frame.DataFrame
-
Generates synthetic data typical of the dataset
Result is randomly generated data similar to the actual dataset.
Args
num_samples
:int
, optional- Number of synthetic samples to generate. Default is 10.
silent
:bool
, optional- Suppress status messages during execution? Default is to show messages.
session
:Session
, optional- A connection session. If not specified, the default session is used.
Returns
pd.DataFrame
- A table of the synthetic data
def get_statistics(self, column: Union[str, List[str]], function: Union[StatFunc, List[StatFunc]] = None, combine_with: Optional[Union[Asset, List[Asset]]] = None, group_by: Optional[str] = None, preproc: Optional[Union[TabularPreprocessor, List[TabularPreprocessor], TabularPreprocessorBuilder, List[TabularPreprocessorBuilder]]] = None, job_name: Optional[str] = None, silent: Optional[bool] = False, session: Optional[Session] = None, match_column: Optional[Union[str, list[str]]] = None) -> TableAsset
-
Calculate one or more statistics on the database.
NOTE: String column types will be coerced to float32 when calculating statistics, if possible. Any non-numeric string in the column will result in an error.
Args
column
:str, List[str]
- Name of data column(s) upon which to calculate. Column names must be common between all of the data.
function
:StatFunc, List[StatFunc]
, optional- Function(s) to calculate. If not specified, all stats are calculated.
combine_with
:Union[Asset, List[Asset]]
, optional- Other table(s) with the same data/columns to virtually combine for the calculation.
group_by
:str
, optional- Data column for grouping of data before the calculation.
preproc
:TabularPreprocessor, List[TabularPreprocessor], TabularPreprocessorBuilder, List[TabularPreprocessorBuilder]
, optional- The preprocessor(s) to use against datasets. If a list is given, the order must be the same as the combine_with assuming the first entry is this TableAsset.
job_name
:str
, optional- Reference name for the job with performs this task.
silent
:bool
, optional- Suppress status messages during execution? Default is to show messages.
session
:Session
, optional- A connection session. If not specified, the default session is used.
match_column
:Union[str, List[str]]
, optional- If not provided, the data is treated as horizontally partitioned. If provided, the data is treated as vertically partitioned and match_column is the name of the column to match. If not the same in all datasets, a list of the matching column names, starting with the initiator asset and then listing a name in each intersect_with dataset.
Returns
TableAsset
- A table of the computed statistics.
def intersect(self, intersect_with: Union[Asset, str, Path, Package, List[Union[Asset, str, Path, Package]]], match_column: str, job_name: Optional[str] = None, silent: Optional[bool] = False, session: Optional[Session] = None) -> TableAsset
-
Perform a Private Set Intersection
Private Set Intersection (PSI) allows you to privately compare data you own with other datasets. The first dataset must be yours, so this method can only be invoked on an asset your team owns.
Args
intersect_with
:Union[Asset, List[Asset]]
- Other table(s) to intersect against. Can be Assets or paths to local files.
match_column
:str
- The name of the colum to match on each table
job_name
:str
, optional- Reference name for the job with performs this task.
silent
:bool
, optional- Suppress status messages during execution? Default is to show messages.
session
:Session
, optional- A connection session. If not specified, the default session is used.
Returns
TableAsset
- The intersection of the datasets
def load(self, sep: Optional[str] = None, col_names: Optional[List[str]] = None, header: Optional[bool] = True) -> pandas.core.frame.DataFrame
-
Retrieve and load the asset into a Pandas dataframe
Args
sep
:str
, optional- Character separating values. Default is comma.
col_names
:[str]
, optional- List of column names. Default is no names.
header
:bool
, optional- Read first line as a header? Default is True.
def mask_columns(self, col_names: Union[str, List[str]], mask_type: MockType = MockType.random, params: Optional[Dict[str, Any]] = None, session: Optional[Session] = None)
-
Masks columns identified by the supplied list of MaskColumns.
Args
col_names
:List
orstr
- Column name or list of names to mask.
mask_type
:MockType
- The mock type to use when generating data.
params
:Dict[str, Any]
, optional- Parameters used by the mask_type. For example, for MockType.email it could be 'params={"domain": "gmail.com"}'.
session
:Session
, optional- A connection session. If not specified, the default session is used.
Returns
True if the operation succeeded, otherwise false.
def pretty_print(self, col_names: Optional[List[str]] = None)
-
Display the table nicely.
Args
col_names
:[str]
, optional- Names to display for each column
def print_content(self, num_cols: Optional[int] = None, indent: Optional[int] = None, show_rows: Optional[bool] = False)
-
Print the contents of a TableAsset
NOTE: This downloads the data temporarily. This is done in a secure fashion, but local data should be treated with care.
Args
num_cols
:int
, optional- Number of columns for output. Default is single column.
indent
:int
, optional- Number of spaces to indent output. Default is zero.
show_rows
:bool
, optional- Should row numbers be displayed? Default is no row numbers.
def search(self, search_column: str, pattern: List[Union[str, re.Pattern]], match_case: bool = True, job_name: Optional[str] = None, silent: Optional[bool] = False, session: Optional[Session] = None) -> TableAsset
-
Perform string searches against rows of the given column
Args
search_column
:str
- The name of the column to search.
pattern
:Union[str, re.Pattern]
- Search pattern(s). These can be simple strings or full regex patterns.
match_case
:bool
, optional- For a simple string search, should the case be matched? Defaults to True.
job_name
:str
, optional- A reference name
silent
:bool
, optional- Suppress status messages during execution? Default is to show messages.
session
:Session
, optional- A connection session. If not specified, the default session is used.
Returns
TableAsset
- The search results, each on its own row
def unmask_columns(self, col_names: Union[str, List[str]], session: Optional[Session] = None) -> bool
-
Unmasks columns identified by the supplied list of col_names.
Args
col_names
:List
orstr
- Column name or list of names to unmask.
session
:Session
, optional- A connection session. If not specified, the default session is used.
Returns
True if the operation succeeded, otherwise false.
Inherited members