Glossary of Terms
Access Point
A Docker container running on the user's infrastructure, connecting their data and algorithm assets to the TripleBlind ecosystem.
Access Request
When a user outside of your organization wants to use an Asset you own, such as a dataset or algorithm, you must explicitly grant permission for each use. The outside organization's user will initiate a process using your Asset, which will generate an Access Request. TripleBlind provides both a web interface and a command line tool (tb.py) for approving or denying access requests. You can also establish Agreements with other organizations to automate the approval process, using the TripleBlind web interface or SDK.
Agreements
Agreements are policies which apply to your Asset. They define how it may be used, by whom, and under what conditions. If an Asset has an Agreement in place, Access Requests are automatically approved when the requested use meets the terms and limitations defined within the Agreement. Asset Owners can use Agreements to grant usage rights to other organizations. For example, Agreements can be used to allow a specific organization to use your asset for a specific purpose without waiting for your manual approval for every usage.
TripleBlind supports training a wide range of AI network architectures using standard Python PyTorch network-construction commands. Nearly any desired architecture can be built on nearly any data type (images, tabular, text, speech, video, etc.)
A trained model or standard program which can run against data.
An asset representing an operation, such as a trained neural network or a PMML definition of a statistical process.
Algorithm Connection Utility
Allows any algorithm to be connected to TripleBlind’s toolset. This means any algorithm can access data using the TripleBlind Router.
Algorithm Encryption
Algorithm encryption is one of TripleBlind's core differentiators. We have patented technology for this encryption, designed to protect the IP resident in algorithms offered via TripleBlind's toolset. TripleBlind offers four levels of encryption, each providing increasing security for algorithms: 1) downloaded local algorithm, 2) AES-256 encryption security, 3) Distributed Inference, 4) Secure Multiparty Computation (SMPC).
Models trained with TripleBlind (and, to a lesser extent, externally trained and positioned models) generate a Profile view on the asset. This profile includes basic information about the type of algorithm (for example, "PyTorch" or "XGBoost"), the shape of input data for inferring against it (for example, a 3x3 matrix or 12 values), and the shape of the output. For models trained using TripleBlind, there will also be basic performance metrics, such as accuracy found during training tests.
Understanding the profile of an algorithm helps potential users determine if it would be appropriate for their purposes, and helps actual users understand exactly how they need to format their data—for example, is it expecting a 2x3 or a 3x2 matrix?
Accuracy reports also give potential users understanding and assurances about the usage of the model.
Asset
A digital file hosted on an Access Point, belonging to and managed by a specific organization. Assets can be further classified as algorithms and datasets.
Asset Owner
An Asset Owner is a TripleBlind user that positions dataset and/or algorithm assets on their organization’s Access Point. They are typically given permission to publish assets, and may also be given permission to grant access to datasets, which allows them to approve access requests when someone wants to use an organization asset.
Asset User
Asset Users are users that utilize assets within their own organization and/or assets published by other organizations. They are able to freely use assets that exist within their own organization. Assets from another organization may be used only if an Asset Owner from that organization approves the requested operation in an access request, or if the organization and the Asset User have established an active agreement that covers the use of that asset for the requested operation.
Universal Unique ID (UUID) used to identify an asset within the TripleBlind ecosystem.
Audit Trail
TripleBlind uses cryptographic signatures to establish which specific assets are involved in a transaction. A record of each transaction is recorded. This allows each customer to know precisely which asset was used by whom, when and for what. For certain things (like medical diagnostics with FDA approved algorithms) the ability to ensure the “correct” (approved) algorithm is the one in use is important and can be proven via the Audit Trail.
Authentication / Access Token
A token unique to each user that is used to verify their access rights when using the TripleBlind system. The access token may be found in the tripleblind.yaml file in a local SDK installation. Within the TripleBlind web interface, the token may be located by clicking on the username in the top-right corner of the app and navigating to My Account. Revealing the token through the web interface will reset and generate a new authentication token for the user.
BERT
Bidirectional Encoder Representations from Transformers (BERT) encodes natural language using adjacent words to enhance contextual understanding. BERT was introduced by Google and has become ubiquitous in NLP. TripleBlind offers a BERT operation that allows for building on top of an existing model.
Optional parameter to “decorrelate” the relationship between a model’s input data and model parameters. This reduces the possibility of a training set membership attack on AI models. Without decorrelation a membership inference attack could potentially allow a malicious user to discern data which was part of the training dataset. If a member (record) in the training set can be inferred from the model, the model could reveal a patient’s identity at a fairly high level of probability.
Automatically de-identifies any type of data, at the byte-level. This non-traditional approach renders any type of data de-identified. TripleBlind has an expert opinion affirming the effectiveness of this technology. This applies to all data types, including genomic, image, tabular, voice, etc.
During multi-party model training, this is an optional layer inserted into the model to limit the information which could be extracted by a bad-actor party.
Blind Inference
Inference refers to the process of taking a model that has already been trained and using it to infer a result from new data. TripleBlind supports privacy-preserving Blind Inferences on neural networks, random forest models, XGBoost models, statistical models, and many others. In addition to privacy-preserving distributed and federated inferences, TripleBlind also offers SMPC-based inferences that mathematically guarantee the privacy of the model and the data being run through the model.
Blind Join
Blind Join builds on Blind Match (Private Set Intersection) and provides powerful additional functions. It may be used to identify the subset of data within a counterparty’s (or within multiple counterparties’) dataset(s) that match to features within your own dataset, and bring in additional feature information for that subset of the third-party data. This is done while maintaining privacy for the non-matched data in those tables. In addition to matching on an exact key, Blind Join also supports the ability to perform “fuzzy matching” on identifiers that may differ slightly (e.g., name or address). This is an efficient workhorse method when two organizations are collaborating, allowing them to easily find matching subsets of data (usually for further processing).
Blind Join is a Safe with Care operation (see Privacy Assurances and Risk in the Getting Started section of the User Guide), and has the potential for misuse. TripleBlind has a number of safeguards for its use:
- Blind Join is not permitted to return any columns the Asset Owner has masked; the assumption being that the underlying values in those columns contain PII/PHI or otherwise sensitive information.
- Blind Join is disabled by default at our strictest security levels.
- Unless an Agreement has been established permitting auto-approval of requests, all Blind Join operations require an informed Asset Owner approval through an Access Request.
k-Grouping is respected in the Blind Join operation as a minimum record threshold on the output; a join that would result in fewer than k records would automatically fail with a warning message.
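The exact-plus-fuzzy key matching described above can be sketched in plain Python using the standard library's difflib. This is only a toy illustration of the matching logic, not the privacy-preserving protocol; the datasets, key fields, and similarity threshold here are hypothetical.

```python
from difflib import SequenceMatcher

# Toy tables: your dataset and a counterparty's dataset, keyed by name.
ours = {"Alice Smith": {"age": 34}, "Bob Jones": {"age": 51}}
theirs = {"Alice Smyth": {"zip": "48226"}, "Carol White": {"zip": "10001"}}

def fuzzy_join(left, right, threshold=0.8):
    """Join rows whose keys match exactly or approximately (fuzzy match)."""
    joined = {}
    for lkey, lrow in left.items():
        for rkey, rrow in right.items():
            score = SequenceMatcher(None, lkey.lower(), rkey.lower()).ratio()
            if score >= threshold:
                # Bring in the counterparty's features for the matched subset.
                joined[lkey] = {**lrow, **rrow}
    return joined

result = fuzzy_join(ours, theirs)
# "Alice Smith" matches "Alice Smyth" despite the spelling difference;
# "Bob Jones" has no counterpart and stays unmatched (and thus private).
```

In the real operation the comparison happens under privacy-preserving computation, so neither side sees the other's unmatched records.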
Blind Learning
Patented approach allowing training across multiple data providers' datasets in a manner that is much more efficient than standard distributed machine learning approaches. For instance, Blind Learning is more efficient than Federated Learning, and the effect is more pronounced the bigger the data and the more individual providers involved. In addition, Blind Learning does not pass the model from provider to provider; therefore, it is much more "private" than Federated Learning.
TripleBlind has improved upon the existing Split Learning (see Split Learning) approach by incorporating concepts from other distributed learning methods, particularly Federated Learning. Specifically, unlike Split Learning, which requires participating entities to train sequentially, our approach enables participating entities to train their shares of the model in parallel, thereby reducing the overall training time.
Moreover, our approach mimics the Model Averaging technique used in Federated Learning to synchronize the model parameters across the participating entities. Therefore, not only does our approach improve the final model generalizability, but it also reduces the possibility of data leakage, since model updates are averaged across all participants.
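The model-averaging synchronization step described above can be sketched with plain Python lists standing in for parameter tensors. This is a minimal FedAvg-style sketch under the assumption of sample-count weighting, not TripleBlind's implementation; the parties and counts are hypothetical.

```python
def average_models(param_sets, sample_counts):
    """Weighted average of per-party model parameters (FedAvg-style)."""
    total = sum(sample_counts)
    n_params = len(param_sets[0])
    averaged = []
    for i in range(n_params):
        averaged.append(
            sum(p[i] * n for p, n in zip(param_sets, sample_counts)) / total
        )
    return averaged

# Two parties trained their model shares in parallel on 100 and 300 records.
party_a = [0.2, 0.4]
party_b = [0.6, 0.8]
global_params = average_models([party_a, party_b], [100, 300])
# party_b's parameters carry more weight because it saw more data
```

Because every participant's update is folded into the average, no single party's parameters (and hence data) dominate the synchronized model.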
Blind Match
Blind Match is TripleBlind’s implementation of a well-known secure multiparty computation known as a Private Set Intersection. Blind Match allows data from multiple sources to be privately joined based on a common identifier, returning only the values of the identifier that are found across all datasets. This allows a data provider to find the records in common with other data providers, without revealing any information apart from membership of the record to those sets. It is possible to create a common set of records (with features from multiple data providers), and use this intersection to train a model with Blind Learning.
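The following toy shows what a private set intersection returns: only the identifiers common to both sets. It uses a shared salted hash as a stand-in for the cryptography; the real Blind Match relies on SMPC protocols, not plain hashing, and the salt and identifiers here are hypothetical.

```python
import hashlib

def blind(identifier, shared_salt):
    """Hash an identifier so raw values are never exchanged directly."""
    return hashlib.sha256((shared_salt + identifier).encode()).hexdigest()

salt = "agreed-upon-salt"  # in practice established by a cryptographic protocol
hospital_ids = {"patient-001", "patient-007", "patient-042"}
lab_ids = {"patient-007", "patient-042", "patient-099"}

# Each side exchanges only blinded values; the intersection reveals
# membership of the common records and nothing about the rest.
hospital_blinded = {blind(i, salt): i for i in hospital_ids}
lab_blinded = {blind(i, salt) for i in lab_ids}
common = sorted(hospital_blinded[h] for h in hospital_blinded if h in lab_blinded)
```

Note that "patient-001" and "patient-099" never become visible to the other party; only set membership of the shared identifiers is learned.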
Blind Query
Unlike other TripleBlind operations, Blind Query is not an inherently privacy-preserving operation. Blind Query allows you to intentionally expose content to a requesting third party while maintaining visibility into the request and the capability to deny unauthorized access before the query executes. This is useful in special situations such as:
- The asset is a database view which is known to be privacy preserving (e.g. an SQL report that outputs summaries of classes of records)
- The asset is a safe output which you wish to make available to the other party
Blind Query is a Safe with Care operation (see Privacy Assurances and Risk in the Getting Started section of the User Guide), and has the potential for misuse. TripleBlind has a number of safeguards for its use:
- Blind Query is disabled by default at our strictest security levels.
- Unless an Agreement has been established permitting auto-approval of requests, all Blind Query operations require an informed Asset Owner approval through an Access Request. The Access Request for Blind Query contains information on any SQL statements that are invoked in the operation.
k-Grouping is respected in the Blind Query operation as a minimum record threshold on the output; a query that would result in fewer than k records would automatically fail with a warning message.
Blind Report
Blind Report allows you to position a database-backed query with predefined configurable parameters. Users can configure the query using these predefined options, have it run against your database, and receive a report table.
This is a powerful operation that allows the data steward to permit only specific, controlled access to the data they desire exposed to the consumer. For example, a report could be defined that allows a user to select a year and month along with a company division for generating salary statistics by ethnicity or gender for usage in compliance reporting.
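A report of this kind can be sketched with the standard library's sqlite3 module: the SQL text is fixed by the data steward, and the consumer supplies only the permitted parameter values. The table and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payroll (division TEXT, gender TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO payroll VALUES (?, ?, ?)",
    [("R&D", "F", 95000), ("R&D", "M", 91000), ("Sales", "F", 70000)],
)

# The query text is predefined by the data steward; the consumer can only
# choose parameter values, never alter the SQL itself.
REPORT_SQL = """
    SELECT gender, COUNT(*), AVG(salary)
    FROM payroll WHERE division = ?
    GROUP BY gender ORDER BY gender
"""

report = conn.execute(REPORT_SQL, ("R&D",)).fetchall()
# one aggregate row per gender within the chosen division
```

Because only aggregates leave the database and the query shape is fixed, the consumer sees exactly what the steward chose to expose and nothing more.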
Blind Sample
Blind Sample generates a realistic, privacy-preserving sample similar to the real data. The representative sample can be downloaded, examined, and used to develop your process before running against real data. This dataset can be used to understand the shape of the data and to build models offline before working against actual private data that is not visible. Blind Sample provides data scientists with data they can view and interact with, allowing them to refine their understanding and prototype the processes they will later implement with real, protected (blind) data.
The Blind Sample process respects all masking configurations set by the data owner in the Mock Data Editor within the TripleBlind web interface, and the resulting dataset will provide similar records as are visible within the asset’s Mock Data table. Unlike other positioned assets, the mock data table of a Blind Sample output dataset is not automatically masked with the “random” mask type, since the data that enters this process has already been masked appropriately by the data owner.
Blind Stats
Blind Stats calculates federated descriptive summary statistics across multiple tabular datasets. This is a powerful privacy-preserving operation that allows a dataset user to understand a study population across multiple datasets, even when the data is in different organizations or regions. Blind Stats supports count, minimum, maximum, quartiles, mean, variance, standard deviation, standard error, skewness, kurtosis, and confidence intervals.
Stratification of samples is supported through a grouping function. Requests for Blind Stats operations are automatically rejected when they would return descriptive information on groups of records that do not meet the minimum k-Grouping limits set on the involved datasets (see the k-Grouping glossary entry).
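The underlying arithmetic of federated summary statistics can be sketched in plain Python: each site shares only aggregates (count, sum, sum of squares), which combine into exact global statistics. This illustrates the principle, not TripleBlind's protocol; the site data is hypothetical.

```python
import math

def site_summary(values):
    """Each site shares only count, sum, and sum of squares - never raw rows."""
    return len(values), sum(values), sum(v * v for v in values)

def combine(summaries):
    """Combine per-site aggregates into a global mean and std deviation."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = sq / n - mean * mean  # population variance
    return n, mean, math.sqrt(variance)

site_a = site_summary([4.0, 6.0])    # data stays at site A
site_b = site_summary([8.0, 10.0])   # data stays at site B
n, mean, std = combine([site_a, site_b])
```

The combined mean and standard deviation are identical to what a pooled computation would give, yet no individual record ever left its site.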
Blind String Search
For tabular data, simple string and RegEx (regular expressions) searches can find matches and summary counts without exposing the actual data. This allows easy exploration of freeform text data without exposing the raw data.
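A summary-count search of this kind can be sketched with the standard library's re module; only the number of matching rows is returned, never the rows themselves. The sample notes and pattern are hypothetical.

```python
import re

notes = [
    "Patient reports mild headache",
    "Follow-up: headache resolved",
    "No complaints at this visit",
]

def summary_count(rows, pattern):
    """Return only a match count, so raw freeform text is never exposed."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sum(1 for row in rows if regex.search(row))

matches = summary_count(notes, r"headache")
```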
A subset of operations (Blind Sample, Blind Join, and Blind Report) can be performed with zero code. Instead, the Create New Process web page walks the user through the required parameters and allows the operation to be kicked off from the web. Results can then be retrieved for viewing or further analysis in tools like Excel.
Create Asset / Position Data
The act of putting data on an Access Point in order to be used on the platform.
Allows data from multiple providers/sources to be aggregated into a single logical database. This happens in a way that does not allow any of the data sources to see the other provider’s data. This allows users of data to create analyses that encompass much larger data than in the past. The aggregation capability allows data to be horizontally combined (such as similar tables stacked on top of each other) and vertically combined (tables with rows with similar “keys” but different columns combined “side-by-side”).
Data Asset / Dataset
An asset representing a dataset, such as a CSV file, a set of images, or a database query which produces a fixed view of tabular data. The assets generated by an operation are also commonly data assets.
Database / Data Warehouse support
Supports working with standard on-prem databases as well as data warehouses like BigQuery and Snowflake. Database assets provide live data. Data warehouse support allows those datasets to be used by others who don’t have accounts with the warehouse provider.
Data Discovery Tools
The TripleBlind web interface supports the ability to search for datasets. Allows data consumers to find interesting and appropriate datasets that are owned by either their own organization or by other organizations. Asset owners can choose whether or not to allow each of their dataset listings to be viewable by outside organizations.
Data Preprocessing Tools
These tools/parameters allow the model builder/data consumer to do standard preprocessing functions—normalization, feature engineering, etc. Pre-processing allows applying SQL syntax to modify an existing raw variable, create new variables from other raw variables, and/or filter records. Each dataset can independently be modified before use by any TripleBlind operation. SQL preprocessing is compatible with SQLite.
These tools are necessary for model builders/data consumers to position the data correctly for use. These tools are a differentiator for TripleBlind—they allow data providers to connect their data in its “raw” format and rely on the consumer to properly process it for their use. In this way, a single dataset is useful for multiple consumers. These tools are also important for data aggregation (in the event that multiple datasets need to be processed slightly differently). In sum, these tools mean data preparation isn’t a “team sport” across multiple providers (like it is when other encryption technology is applied—in those cases all data has to be prepared exactly the same way BEFORE encryption).
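SQLite-compatible preprocessing of the kind described above (deriving a new variable from raw columns and filtering records) can be sketched with the standard library's sqlite3 module; the table and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (age INTEGER, weight_kg REAL, height_m REAL)")
conn.executemany(
    "INSERT INTO raw VALUES (?, ?, ?)",
    [(25, 70.0, 1.75), (40, 90.0, 1.80), (17, 60.0, 1.70)],
)

# Feature engineering (a derived BMI column) plus a record filter,
# expressed in SQLite syntax against the raw data.
PREPROCESS_SQL = """
    SELECT age, ROUND(weight_kg / (height_m * height_m), 1) AS bmi
    FROM raw WHERE age >= 18
"""
rows = conn.execute(PREPROCESS_SQL).fetchall()
# minors are filtered out; each remaining record gains a bmi feature
```

Because the transformation runs per-consumer, the provider can leave the raw table untouched while each consumer derives exactly the view they need.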
Digital Rights Management
TripleBlind enforces strict approval (see Access Requests) and permissions (see Agreements) for all “assets” (data or algorithms). This ensures that no “asset” is used without the owner's permission. This also ensures no nefarious usage of the asset—i.e. no unauthorized use.
The application of a trained model to remote data. The intellectual property and privacy of your model is protected by TripleBlind. This can include rows of data that span organizations. When attached to live data, this would allow a diagnosis to incorporate information from a hospital, an imaging company, and even financial records in-place.
Similar to split training, but applied to simpler regression models instead of neural networks, using data from multiple organizations. Traditionally, regression models could not be built without bringing all of the data together in one location.
Documentation
Sometimes referred to as the Docs or the Portal, the Documentation contains comprehensive information on everything TripleBlind: installation guides, tutorials, videos, user guides, SDK reference docs, release notes, user manuals, and more.
Exploratory Data Analysis
Report describing the overall tabular dataset (size, number of records) as well as statistical details about each variable (column) in the dataset. The goal of the report is to enable a data user to understand the data they are working with. The variable-level report includes ranges, graphs of distributions, type of data, etc. This tool reveals significant metadata about the data without revealing any private information. Combined with Blind Sample, this provides both a strong overview and a detailed view of the data.
Users are able to submit questions to asset owners. The owner can reply to a question privately or make their answer publicly viewable. Direct communication between (potential) consumers of an asset and the owner is easy and context is usually obvious. This also helps the asset owner refine their offering to suit more consumers.
Federated Learning
Federated learning is a decentralized machine learning technique that trains a model across multiple data providers, each holding independent datasets. Typically this involves sending the model to each participant to train independently, and then averaging the resulting model updates at a central point, without any participant having to move or share their datasets. Our implementation of federated learning is seamlessly integrated with our access management, privacy controls, and auditing systems.
You can think of these operations as "bringing the program to the data." The Access Point executes the training or analysis in place against their own data. For certain operations intermediate data will be passed between Access Points, but no raw or un-aggregated data is ever shared. All data transmission uses industry standard SSL technologies, requiring TLS 1.2+ using SHA-256 or better cryptographic hashes, AES encryption and 2048 bit RSA keys.
See the glossary entry for Partitioned (Split) Data.
k-Grouping
A dataset is said to have a “k-anonymity” property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the dataset. For example, if each individual’s record is indistinguishable from 2 other records, the dataset has 3-anonymity.
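The k-anonymity of a table can be checked by finding the smallest equivalence class over its quasi-identifier columns, as in this minimal sketch (the records and column names are hypothetical):

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    groups = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return min(groups.values())

records = [
    {"zip": "482**", "age_band": "30-39", "diagnosis": "A"},
    {"zip": "482**", "age_band": "30-39", "diagnosis": "B"},
    {"zip": "482**", "age_band": "30-39", "diagnosis": "C"},
    {"zip": "100**", "age_band": "40-49", "diagnosis": "A"},
    {"zip": "100**", "age_band": "40-49", "diagnosis": "D"},
]

k = k_anonymity(records, ["zip", "age_band"])
# each record is indistinguishable from at least k-1 others
```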
Based on the concept of k-anonymity, TripleBlind supports configuration of a k-Grouping safeguard at the dataset level. The k-value ensures a minimum threshold number (k) of records are aggregated for each grouping in a Blind Stats computation, preventing exposure of sensitive information resulting from low numbers of records being operated on in aggregate. If this threshold is not met, the operation will fail with a warning message. This prevents accidental data leakage, e.g. requesting the median of a group containing only 1 record. The k-value is also respected in the Blind Query and Blind Join operations as a minimum record threshold on the output; a query that would result in fewer than k records would automatically fail with a warning message.
Setting k to 1 will position the dataset without any protection from this safeguard, as each record will be “hiding in a crowd” consisting of only that record. In other words, setting k to 1 is equivalent to disabling the safeguard.
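The threshold behavior can be sketched as a simple check that rejects any output containing a group smaller than k (the group labels here are hypothetical):

```python
from collections import Counter

def enforce_k_grouping(group_keys, k):
    """Fail when any output group would contain fewer than k records."""
    counts = Counter(group_keys)
    too_small = [g for g, n in counts.items() if n < k]
    if too_small:
        raise ValueError(f"k-Grouping violation: groups below k={k}: {too_small}")
    return dict(counts)

ok = enforce_k_grouping(["F", "F", "M", "M", "M"], k=2)  # both groups pass
try:
    enforce_k_grouping(["F", "F", "M"], k=2)  # "M" contains only 1 record
    rejected = False
except ValueError:
    rejected = True
```

With k=1 every group trivially passes the check, which mirrors the text above: a k of 1 disables the safeguard.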
The operation and the data exist on the same machine. No privacy is guaranteed.
Machine Learning (ML) Training
TripleBlind supports training a variety of machine learning models including XGBoost, RandomForest, etc.
Tabular dataset assets are always positioned with every field masked to ensure complete privacy-protection. By default, a masked text field displays values containing random letter characters of varying length, while a masked numerical field displays values randomly generated within the range of the underlying data (and at the appropriate decimal precision). The default “random” masking may be replaced with realistic, reader-friendly masking types (ex. “Address”, “Full Name”) within the Mock Data Editor. The Asset Owner also has the option to mask or unmask fields via the SDK after positioning the dataset.
Unmasking a field will randomly sample real values from the underlying raw data for display in the Mock Data table and when creating a Blind Sample. This is not recommended for any field containing sensitive data. With the exception of Blind Sample, all TripleBlind operations work with the raw underlying data using privacy preserving techniques, not masked values. Masking only comes into play with reported results, such as the output columns for a Blind Join.
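The default “random” masking described above can be sketched as follows: random letters of matching length for text fields, and in-range random values at the observed precision for numeric fields. This is an illustration of the described behavior, not TripleBlind's implementation; the field values are hypothetical.

```python
import random
import string

def mask_text(values, rng):
    """Replace each string with random letters of the same length."""
    return ["".join(rng.choice(string.ascii_letters)
                    for _ in range(len(v))) for v in values]

def mask_numeric(values, rng, decimals=1):
    """Replace each number with a random value inside the observed range."""
    lo, hi = min(values), max(values)
    return [round(rng.uniform(lo, hi), decimals) for _ in values]

rng = random.Random(0)  # seeded only so the example is reproducible
masked_names = mask_text(["Alice", "Bob"], rng)
masked_ages = mask_numeric([21.5, 63.0], rng)
# masked names keep their original lengths; masked ages stay in [21.5, 63.0]
```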
The Mock Data table on the Asset’s Overview tab displays a 10-record set of representative data for the dataset. It is important to remember that this representative data is just that; it does not contain raw values unless the Asset Owner chooses to unmask certain fields. This allows Asset Users to visualize the shape of the data without risking privacy.
Model Fingerprinting
This allows the output of an algorithm to be “systematically marked” in a way such that if the output of the model is used to create a copy of the model, the copy can be discovered. The primary purpose of this feature is to protect against model inversion attacks. If an algorithm customer accumulates enough algorithm outputs (for example, 1 million inferences from an AI Diagnostic model), said customer would have enough data to create a model of their own, which is essentially a copy of the original model. Model fingerprinting is a technique for discovering whether or not one model is the result of an inversion attack upon another model.
Allows any algorithm created with TripleBlind to be retrieved by the algorithm owner for use in any way the owner sees as appropriate. If a customer uses data to create an algorithm using TripleBlind’s tools, they do not have to use the TripleBlind toolset to use that algorithm. Users are able to retrieve (download) their algorithm and use it in any way they like outside of TripleBlind.
Allows AI models to be created across multiple files with different types of data (for example, tabular and image data to be processed together). Combining images, voice, tabular data, etc. together will bring together different “information” about a situation—for us, this is a slight expansion on our existing vertical partitioning capability.
Organizations using Okta for authentication management can easily manage users within Okta. Individual users will login via Okta tools. For an Okta organization, adding TripleBlind is as easy as selecting the Okta Tile for TripleBlind. Provisioning users is managed via the Okta tools.
ONNX
ONNX (Open Neural Network Exchange) is an open format built to represent machine learning models. ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers. TripleBlind supports a subset of ONNX operators in SMPC inference, and allows positioning models saved in the ONNX format.
Operation
An operation is a computation that will be executed on an asset or assets during a process. Agreements control which Operations are permitted for specific assets. In addition to predefined Operations such as Blind Join, Federated Learning, or Outlier Detection, an operation could also be a trained algorithm asset within your organization.
Organization Owner
The Organization Owner is the first user account associated with a new organization and has primary responsibility for the management of the organization, including Access Point setup and user administration. The Organization Owner also has access to the audit logs for all assets. Other users of the system can be given permissions to handle administrative tasks, such as user management and agreement management.
Partitioned (Split) Data
Some operations specify working specifically on “vertically-partitioned” or “horizontally-partitioned” datasets. This distinction refers to the way that a distributed dataset may be split. Vertically-partitioned datasets are tables that have the same rows but perhaps different columns (or features), e.g. a set of hospitals that have patients in common but contain different information associated with those common patients. Horizontally-partitioned datasets are tables that have the same columns but different rows, e.g. a hospital network has the same structure for their patient database, but each country has its own “chunk” of the aggregate patients’ data.
Often datasets must be partitioned for operational or regulatory reasons. Vertically-partitioned datasets are common because PII or PHI is not easily shared or moved between silos. Horizontally-partitioned datasets may be split for operational purposes, e.g. so that it may be used more efficiently for processing in the cloud. The example above describes a situation in which a dataset that otherwise would be combined is horizontally-partitioned so that it may be compliant with data residency regulations.
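The two partitioning schemes can be illustrated with toy tables: horizontal partitions stack rows with the same columns, while vertical partitions join different columns on a shared key. The patient data below is hypothetical.

```python
# Horizontal partitioning: same columns, different rows
# (e.g. each country holds its own chunk of the aggregate patient data).
us_patients = [{"id": 1, "age": 40}, {"id": 2, "age": 55}]
eu_patients = [{"id": 3, "age": 31}]
horizontal_union = us_patients + eu_patients  # tables stacked on top of each other

# Vertical partitioning: same rows (shared key), different columns per silo
# (e.g. a hospital and an imaging company hold different features per patient).
hospital = {1: {"diagnosis": "flu"}, 2: {"diagnosis": "ok"}}
imaging = {1: {"scan": "clear"}, 2: {"scan": "shadow"}}
vertical_join = {
    pid: {**hospital[pid], **imaging[pid]}   # columns combined side-by-side
    for pid in hospital.keys() & imaging.keys()
}
```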