Natural Language Processing

DistilBERT, a distilled version of Bidirectional Encoder Representations from Transformers (BERT), encodes natural language using the surrounding words in both directions to build contextual representations. The provided implementation builds on the pretrained DistilBERT model for the task of token classification, which can be used to train named-entity recognition (NER) models, for example. This protocol uses the following checkpoint from Hugging Face:

  • Model: distilbert/distilbert-base-uncased
  • Tokenizer: AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
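
For reference, the checkpoint and tokenizer can be loaded locally with the transformers library. This is a minimal sketch, not part of the protocol itself; the num_labels value is an assumption that depends on your entity set:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and the base model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=5,  # assumption: 4 entity types plus an "O" (no-entity) label
)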

This protocol has the following specifications:

  • It utilizes the Blind Learning protocol as the training paradigm.
  • It uses the recommended optimizer torch.optim.AdamW. Two of the optimizer's parameters, the learning rate and the weight decay, can be set by the user (see Parameters below).
  • It automatically counts the target entities and maps them to integers. The provided dataset must therefore include the training labels as strings rather than their integer representations (see the mapping sketch after this list and the example dataset below).
  • The protocol assumes the dataset is organized in two columns:
    • `text`: the textual data, one string per row, such as doctor notes.
    • `entities`: a list of lists per row; each inner list holds the start and end character indices (inclusive) of a target span in `text` and a string naming that span's entity type.
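
As an illustration of the automatic label mapping, the sketch below (hypothetical helper code, not the protocol's internal implementation) collects the distinct entity strings from an `entities` column and assigns each an integer id:

# Two example rows of an `entities` column, matching the dataset below.
entities_column = [
    [[0, 4, "MEDICINE_NAME"], [24, 38, "SYMPTOM"]],
    [[0, 7, "MEDICATION"], [13, 31, "DISEASE_DISORDER"]],
]

# Collect the distinct entity types and map each to an integer id.
labels = sorted({label for row in entities_column for _, _, label in row})
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}
print(label2id)
# {'DISEASE_DISORDER': 0, 'MEDICATION': 1, 'MEDICINE_NAME': 2, 'SYMPTOM': 3}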

Example two-column training dataset:

text,entities
"Advil is a medicine for severe headache.","[[0, 4, 'MEDICINE_NAME'], [24, 38, 'SYMPTOM']]"
"Atenolol for high blood pressure led to fatigue.","[[0, 7, 'MEDICATION'], [13, 31, 'DISEASE_DISORDER']]"
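
When the dataset is stored as a CSV like the one above, the `entities` column arrives as a string and must be parsed back into a list of lists. A minimal sketch using only the Python standard library, assuming inclusive end indices as in the example rows (the file name data.csv is hypothetical):

import ast
import csv

# Hypothetical file name; adjust to your dataset location.
with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        text = row["text"]
        # The entities column is serialized as a Python-style literal string.
        entities = ast.literal_eval(row["entities"])
        for start, end, label in entities:
            print(label, "->", text[start : end + 1])  # inclusive end index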

Operation

  • When using add_agreement() to allow a counterparty to use your dataset for model training, or using create_job() to train your model, use Operation.NLP_TRAIN.

Parameters

  • epochs: int = 1
  • batch_size: int = 1
  • learning_rate: float = 3e-4
  • weight_decay: float = 0.0
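
For reference, learning_rate and weight_decay are passed to the optimizer. A minimal sketch of the equivalent local configuration with the default values above (num_labels=5 is an assumption, as in the loading example):

import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=5,  # assumption: depends on your entity set
)

# Configure AdamW with the protocol's default hyperparameters.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,           # learning_rate
    weight_decay=0.0,  # weight_decay
)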

Limitations

  • Only local inference is supported for token classification tasks.
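
Because only local inference is supported, a trained model runs on your own machine. A minimal sketch, assuming the trained model has been exported as a standard Hugging Face checkpoint directory (the path ./trained_model is hypothetical):

from transformers import pipeline

# Hypothetical path to the exported, trained checkpoint.
ner = pipeline("token-classification", model="./trained_model")
print(ner("Advil is a medicine for severe headache."))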