LAMBADA Method: How to use Data Augmentation in NLU?


Better NLG and NLU with Data Augmentation

One of our previous articles covered the LAMBADA method that makes use of Natural Language Generation (NLG) to generate training utterances for a Natural Language Understanding (NLU) task, namely intent classification. In this tutorial we walk you through the code to reproduce our PoC implementation of LAMBADA.

Before you go ahead with this tutorial, we suggest having a look at our article that explains the fundamental ideas and concepts applied by LAMBADA in more detail. In this tutorial we illustrate the crucial methods in an interactive Colab notebook. Overall, we explain the key points of the code and demonstrate how to adjust parameters to match your requirements, while omitting the less important parts. You can copy the notebook using your Google account to follow along with the code. For training and testing you can insert your own data or use the data we provide.

Data Augmentation in NLU: Step 1 – Setting up the environment

We use distilBERT as the classification model and GPT-2 as the text generation model. For both, we load pretrained weights and fine-tune them. For GPT-2 we apply the Huggingface Transformers library to bootstrap a pretrained model and subsequently fine-tune it. To load and fine-tune distilBERT we use KTrain, a library that provides a high-level interface for language models, eliminating the need to worry about tokenization and other pre-processing tasks.

First, we install both libraries in our Colab runtime:

    
    
    !pip install ktrain
    !pip install transformers
    

Chatbots with Data Augmentation: Step 2 – Data

We use one of the pre-labelled chitchat data sets from Microsoft’s Azure QnA Maker. Next, we split the chitchat data set such that we obtain ten intents with ten utterances each as an initial training data set and the remaining 1047 samples as a test data set. In the following, we use the test data set in order to benchmark the different intent classifiers we train in this tutorial.
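For reference, here is a minimal sketch of how such a split could be produced with pandas. It assumes the raw chitchat data has been loaded into a dataframe with intent and utterance columns; the file name chitchat.csv and the variable names are illustrative, not part of the original notebook:

    import pandas

    # Hypothetical input: the full chitchat data with 'intent' and 'utterance' columns.
    qna_df = pandas.read_csv('chitchat.csv')

    # Keep ten intents with ten utterances each as the initial training set.
    selected_intents = qna_df['intent'].unique()[:10]
    subset = qna_df[qna_df['intent'].isin(selected_intents)]
    train_df = subset.groupby('intent').head(10)

    # All remaining samples of these intents form the test set.
    test_df = subset.drop(train_df.index)

    train_df.to_csv('train.csv', index=False)
    test_df.to_csv('eval.csv', index=False)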

Subsequently, we load the training data from the file train.csv and split it such that we obtain six utterances per intent for training and four utterances per intent for validation.

    
    
    NUMBER_OF_TRAINING_UTTERANCES = 6
    import pandas
    from sklearn.model_selection import train_test_split
    data_train = pandas.read_csv('train.csv')
    intents = data_train['intent'].unique()
    X_train = []
    X_valid = []
    y_train = []
    y_valid = []
    for intent in intents:
        intent_X_train, intent_X_valid, intent_y_train, intent_y_valid = train_test_split(
            data_train[data_train['intent'] == intent]['utterance'],
            data_train[data_train['intent'] == intent]['intent'],
            train_size=NUMBER_OF_TRAINING_UTTERANCES,
            random_state=43
        )
        X_train.extend(intent_X_train)
        X_valid.extend(intent_X_valid)
        y_train.extend(intent_y_train)
        y_valid.extend(intent_y_valid)
    
    

NLU with the LAMBADA Method: Step 3 – Training the initial intent classifier

We download the pretrained distilBERT model, transform the training and validation data from pure text into the valid format for our model and initialize a learner object, which is used in KTrain to train the model.

    
    import ktrain
    from ktrain import text

    distil_bert = text.Transformer('distilbert-base-cased', maxlen=50, classes=intents)
    processed_train = distil_bert.preprocess_train(X_train, y_train)
    processed_test = distil_bert.preprocess_test(X_valid, y_valid)
    model = distil_bert.get_classifier()
    learner = ktrain.get_learner(model, train_data=processed_train, val_data=processed_test, batch_size=10)
    

Now it’s time to train the model. We feed the training data to the network multiple times, as specified by the number of epochs. In the beginning, both monitored metrics, the loss (decreasing) and the accuracy (increasing), should indicate that the model improves with each epoch. However, after training for a while, the validation loss will increase and the validation accuracy will drop. This is a result of overfitting the training data, and it is time to stop feeding the same data to the network.

The optimal number of epochs depends on your data set, model and training parameters. If you do not know the right number of epochs beforehand you can use a high number of epochs and activate checkpoints by setting the checkpoint_folder parameter to select the best performing model afterwards.
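A sketch of what this could look like; the number of epochs and the checkpoint folder path are arbitrary:

    # Train longer than necessary and save the weights after every epoch;
    # the best-performing checkpoint can then be reloaded afterwards.
    learner.fit_onecycle(5e-5, 30, checkpoint_folder='/content/checkpoints')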

    
    N_TRAINING_EPOCHS = 12
    learner.fit_onecycle(5e-5, N_TRAINING_EPOCHS)
    
    
    begin training using onecycle policy with max lr of 5e-05...
    Train for 6 steps, validate for 2 steps
    Epoch 1/12
    6/6 [==============================] - 7s 1s/step - loss: 2.3088 - accuracy: 0.1167 - val_loss: 2.3236 - val_accuracy: 0.1000
    Epoch 2/12
    6/6 [==============================] - 0s 68ms/step - loss: 2.2913 - accuracy: 0.1333 - val_loss: 2.3084 - val_accuracy: 0.1000
    Epoch 3/12
    6/6 [==============================] - 0s 68ms/step - loss: 2.2728 - accuracy: 0.1167 - val_loss: 2.2741 - val_accuracy: 0.1000
    Epoch 4/12
    6/6 [==============================] - 0s 66ms/step - loss: 2.2039 - accuracy: 0.4167 - val_loss: 2.1981 - val_accuracy: 0.4500
    Epoch 5/12
    6/6 [==============================] - 0s 68ms/step - loss: 2.0552 - accuracy: 0.7333 - val_loss: 2.0282 - val_accuracy: 0.6000
    Epoch 6/12
    6/6 [==============================] - 0s 66ms/step - loss: 1.7596 - accuracy: 0.9000 - val_loss: 1.7276 - val_accuracy: 0.7500
    Epoch 7/12
    6/6 [==============================] - 0s 66ms/step - loss: 1.3359 - accuracy: 0.9667 - val_loss: 1.4421 - val_accuracy: 0.8250
    Epoch 8/12
    6/6 [==============================] - 0s 67ms/step - loss: 0.9690 - accuracy: 1.0000 - val_loss: 1.2494 - val_accuracy: 0.8500
    Epoch 9/12
    6/6 [==============================] - 0s 67ms/step - loss: 0.7366 - accuracy: 1.0000 - val_loss: 1.0965 - val_accuracy: 0.8750
    Epoch 10/12
    6/6 [==============================] - 0s 68ms/step - loss: 0.5735 - accuracy: 1.0000 - val_loss: 1.0089 - val_accuracy: 0.8750
    Epoch 11/12
    6/6 [==============================] - 0s 67ms/step - loss: 0.5007 - accuracy: 1.0000 - val_loss: 0.9680 - val_accuracy: 0.8750
    Epoch 12/12
    6/6 [==============================] - 0s 69ms/step - loss: 0.4451 - accuracy: 1.0000 - val_loss: 0.9504 - val_accuracy: 0.8750
    

To check the performance of our trained classifier, we wrap the trained model in a KTrain predictor and evaluate it on the test data in the eval.csv file.

    
    import numpy

    # Wrap the trained model in a predictor that accepts raw strings directly.
    predictor = ktrain.get_predictor(learner.model, preproc=distil_bert)

    data_test = pandas.read_csv('eval.csv')
    test_intents = data_test["intent"].tolist()
    test_utterances = data_test["utterance"].tolist()

    predictions = predictor.predict(test_utterances)

    np_test_intents = numpy.array(test_intents)
    np_predictions = numpy.array(predictions)

    result = (np_test_intents == np_predictions)

    print("Accuracy: {:.2f}%".format(result.sum()/len(result)*100))
    

Note that thanks to the KTrain interface we can simply feed the list of utterances to the predictor without the need to pre-process the raw strings beforehand. We get the accuracy of our classifier as an output:

Accuracy: 84.24%

NLU with the LAMBADA Method: Step 4 – Fine-tune GPT-2 to generate utterances

To fine-tune GPT-2, we use a Python script made available by Huggingface on their GitHub repository. Among others, we specify the following parameters:

  • the pretrained model that we want to use (gpt2-medium). Larger models typically generate better text output. Please note that these models require a large amount of memory during training, so make sure you pick a model that fits into your (GPU) memory.
  • the number of epochs. This parameter specifies how many times the training data is fed through the network. On the one hand, if the number of epochs is too small, the model will not learn to generate useful utterances. On the other hand, if the number is too high, the model will likely overfit and the variability of the generated text will be limited – the model will basically just remember the training data.
  • the batch size. This determines how many utterances are used for training in parallel. The larger the batch size, the faster the training; larger batch sizes require more memory, though.
  • the block size. The block size defines an upper bound on the number of tokens that are used from each training instance. Make sure that this number is large enough so that utterances are not cropped.
    
    !python finetune_gpt.py \
        --output_dir='/content/transformers/output' \
        --model_type=gpt2-medium \
        --model_name_or_path=gpt2-medium \
        --num_train_epochs=3.0 \
        --do_train \
        --train_data_file=/content/train.csv \
        --per_gpu_train_batch_size=4 \
        --block_size=50 \
        --gradient_accumulation_steps=1 \
        --line_by_line \
        --overwrite_output_dir
		

Let’s load our model and generate some utterances! To trigger the generation of new utterances for a specific intent, we provide the model with the intent as a seed ('<intent>,', e.g. ‘inform_hungry,’).

 		
    from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
    model = TFGPT2LMHeadModel.from_pretrained('/content/transformers/output/', pad_token_id=tokenizer.eos_token_id, from_pt=True)

    input_ids = tokenizer.encode('inform_hungry,', return_tensors='tf')
    sample_outputs = model.generate(
        input_ids,
        do_sample=True,
        max_length=50,
        top_k=1,
        top_p=0.9,
        num_return_sequences=10
    )

    print("Output:\n" + 100 * '-')
    for i, sample_output in enumerate(sample_outputs):
        print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
  	
    
    Output:
    ----------------------------------------------------------------------------------------------------
    0: inform_hungry,I want a snack!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
    1: inform_hungry,I want to eat!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
    2: inform_hungry,I want some food!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
    3: inform_hungry,I'm so hungry I could eat!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!...
    

This looks good! The artificially generated utterances fit the intent, but in order to be a useful addition and to improve our model, these utterances must differ from utterances used for training. The training data for the intent inform_hungry was the following:

    
    inform_hungry,I want a snack
    inform_hungry,I am very hungry
    inform_hungry,I'm hangry
    inform_hungry,Need food
    inform_hungry,I want to eat
    inform_hungry,I'm a bit peckish
    inform_hungry,My stomach is rumbling
    inform_hungry,I'm so hungry I could eat a horse
    inform_hungry,I'm feeling hangry
    inform_hungry,I could eat
		

We can see that the two utterances “I want some food” and “I’m so hungry I could eat” are not part of the training data.
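Instead of eyeballing the output, we can check for novelty programmatically. A small sketch, using the data_train dataframe from step 2; note that this only catches verbatim duplicates, not paraphrases:

    # Flag generated utterances that do not appear verbatim in the training data.
    train_utterances = set(data_train['utterance'])
    generated = ["I want some food", "I'm so hungry I could eat"]
    new_utterances = [u for u in generated if u not in train_utterances]
    print(new_utterances)  # both utterances are new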

If we are not satisfied with our generated utterances because they are all very similar or if they do not match the underlying intent, we can adjust the variability of the generated output by modifying the following parameters:

  • do_sample. This parameter must be set to True, otherwise the model will keep returning the same output.
  • top_k. This parameter specifies the number of distinct tokens that are considered for sampling at each step. The higher you set this parameter, the more diverse the output will be.
  • top_p. This parameter specifies the cumulative probability mass of the most likely tokens considered for sampling; e.g. with top_p = 0.92 the model samples from the smallest set of tokens whose combined probability is at least 92%. The higher top_p, the more diverse the output. The maximum value is 1.
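For illustration, a more permissive sampling configuration could look like this; the parameter values are examples, not a recommendation:

    # More permissive sampling parameters for more diverse utterances.
    diverse_outputs = model.generate(
        input_ids,
        do_sample=True,
        max_length=50,
        top_k=50,
        top_p=0.95,
        num_return_sequences=10
    )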

NLU with the LAMBADA Method: Step 5 – Generate and filter new utterances

We now generate the new utterances for all intents. To have a sufficiently large sample that we can choose the best utterances from, we generate 200 per intent.

    
    NUMBER_OF_GENERATED_UTTERANCES_PER_INTENT = 200

    def generate_utterances_df(n_generated, tokenizer, model, intent):
      input_ids = tokenizer.encode(intent + ',', return_tensors='tf')
      sample_outputs = model.generate(
        input_ids,
        do_sample=True,
        max_length=50,
        top_k=n_generated,
        top_p=0.92,
        num_return_sequences=n_generated
      )

      # Strip the '<intent>,' seed from the decoded text, keeping only the utterance.
      list_of_intent_and_utterances = [
        (
            intent,
            tokenizer.decode(sample_output, skip_special_tokens=True)[len(intent)+1:]
        )
        for sample_output in sample_outputs
      ]

      return pandas.DataFrame(list_of_intent_and_utterances, columns=['intent', 'utterance'])

    intents = data_train["intent"].unique()

    generated_data = pandas.DataFrame(columns=['intent', 'utterance'])

    for intent in intents:
      print("Generating for intent " + intent)
      utterances_for_intent_df = generate_utterances_df(NUMBER_OF_GENERATED_UTTERANCES_PER_INTENT, tokenizer, model, intent)
      generated_data = generated_data.append(utterances_for_intent_df)
    

After a while the data is generated, and we can have a closer look at it. First, we use our initial distilBERT classifier to predict the intent of all generated utterances. We also keep track of the prediction probability, which indicates the model's confidence in each individual prediction.

    
    predictions_for_generated = numpy.array(predictor.predict(generated_data['utterance'].tolist(), return_proba=False))
    proba_for_predictions_for_gen = predictor.predict(generated_data['utterance'].tolist(), return_proba=True)
    predicted_proba = numpy.array([max(probas) for probas in proba_for_predictions_for_gen])

    generated_data_predicted = pandas.DataFrame({"intent": generated_data['intent'],
                                                 "utterance": generated_data['utterance'],
                                                 "predicted_intent": predictions_for_generated,
                                                 "prediction_proba": predicted_proba})
    
       intent                 utterance               predicted_intent       prediction_proba
    0  body_related_question  Do you chew?            body_related_question  0.701058
    1  body_related_question  Do you have a stomach?  body_related_question  0.737520
    2  body_related_question  Do you sneeze?          body_related_question  0.741122
    3  body_related_question  Do you have teeth?      body_related_question  0.714836
    4  body_related_question  Do you have legs?       body_related_question  0.726910

Let’s have a look at some of the utterances for which the intent used for generation does not match the predicted intent.

    
    generated_data_predicted[generated_data_predicted['intent'] != generated_data_predicted['predicted_intent']].head(20)
    
         intent        utterance                                    predicted_intent       prediction_proba
    7    ask_purpose   Where do you live?                           get_location           0.745748
    20   ask_purpose   What was your greatest passion growing up?   needs_love             0.192401
    70   ask_purpose   Why are you here?                            get_location           0.683455
    182  ask_purpose   Where are you from?                          get_location           0.697122
    49   get_location  Are you in a computer?                       body_related_question  0.498938
    162  get_location  Tell me what you're doing                    ask_purpose            0.571899
    3    make_sing     I sing a song                                inform_hungry          0.358060
    18   make_sing     I want to sing                               inform_hungry          0.604815
    20   make_sing     You're so cute                               needs_love             0.266433
    41   make_sing     You're singing                               inform_hungry          0.323076

We can see that in some cases the prediction is clearly wrong. However, there are also cases where the prediction matches the utterance but not the intent used for generation. This indicates that our GPT-2 model is not perfect, as it does not always generate utterances that match the given intent.

To avoid training our classifier on corrupt data, we drop all utterances for which the intent used for generation does not match the predicted intent. Of the remaining utterances, we keep only those with the highest prediction probability scores.

    
    # Keep only utterances whose predicted intent matches the intent used for generation.
    correctly_predicted_data = generated_data_predicted[generated_data_predicted['intent'] == generated_data_predicted['predicted_intent']]
    correctly_predicted_data.drop_duplicates(subset='utterance', keep='first').sort_values(by=['intent', 'prediction_proba'], ascending=[True, False]).groupby('intent').count()
    
    intent                 utterance  predicted_intent  prediction_proba
    ask_purpose                   60                60                60
    body_related_question         41                41                41
    get_location                  48                48                48
    greet                         77                77                77
    humor related                 50                50                50
    inform_hungry                 35                35                35
    inform_tired                  67                67                67
    make_sing                     68                68                68
    needs_love                    71                71                71
    suicide risk                  67                67                67

We can see that for each intent, there are at least 35 mutually distinct utterances. To keep a balanced data set, we pick the top 30 utterances per intent according to the prediction probability.

    
    TOP_N = 30
    top_predictions_per_intent = correctly_predicted_data.drop_duplicates(subset='utterance', keep='first').sort_values(by=['intent', 'prediction_proba'], ascending=[True, False]).groupby('intent').head(TOP_N)
    

NLU with LAMBADA’s Data Augmentation: Step 6 – Train the intent classifier with augmented data

We now combine the generated data with the initial training data and split the enriched data set into training and validation data.

 
  
    data_train_aug = data_train.append(top_predictions_per_intent[['intent', 'utterance']], ignore_index=True)

    intents = data_train_aug['intent'].unique()

    X_train_aug = []
    X_valid_aug = []
    y_train_aug = []
    y_valid_aug = []
    for intent in intents:
        intent_X_train, intent_X_valid, intent_y_train, intent_y_valid = train_test_split(
            data_train_aug[data_train_aug['intent'] == intent]['utterance'],
            data_train_aug[data_train_aug['intent'] == intent]['intent'],
            train_size=0.8,
            random_state=43
        )

        X_train_aug.extend(intent_X_train)
        X_valid_aug.extend(intent_X_valid)
        y_train_aug.extend(intent_y_train)
        y_valid_aug.extend(intent_y_valid)
	

Now it’s time to train our new intent classification model. The code is similar to the one above:

    
    distil_bert_augmented = text.Transformer('distilbert-base-cased', maxlen=50, classes=intents)
    
    processed_train_aug = distil_bert_augmented.preprocess_train(X_train_aug, y_train_aug)
    processed_test_aug = distil_bert_augmented.preprocess_test(X_valid_aug, y_valid_aug)
    
    model_aug = distil_bert_augmented.get_classifier()
    learner_aug = ktrain.get_learner(model_aug, train_data=processed_train_aug, val_data=processed_test_aug, batch_size=50)
    
    N_TRAINING_EPOCHS_AUGMENTED = 11
    learner_aug.fit_onecycle(5e-5, N_TRAINING_EPOCHS_AUGMENTED)
    

Finally, we use our evaluation data set to check the accuracy of our new intent classifier.
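A sketch of this evaluation, mirroring step 3; we assume a new predictor is created for the augmented model in the same way as before:

    # Wrap the augmented model in a predictor and reuse the test data from step 3.
    predictor_aug = ktrain.get_predictor(learner_aug.model, preproc=distil_bert_augmented)
    predictions_aug = numpy.array(predictor_aug.predict(test_utterances))
    result_aug = (np_test_intents == predictions_aug)
    print("Accuracy: {:.2f}%".format(result_aug.sum() / len(result_aug) * 100))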

    
    Accuracy: 91.40%
    

We can see that the performance improved by roughly 7 percentage points, from 84.24% to 91.40%. Overall, the improvement in prediction accuracy was consistently more than 4 percentage points across all experiments we ran.

LAMBADA AI: Summary

We employed the LAMBADA method to augment data used for Natural Language Understanding (NLU) tasks. We trained a GPT-2 model to generate new training utterances and utilized them as training data for our intent classification model (distilBERT). The performance of the intent classification model improved by at least 4% in each of our tests.

Additionally, we saw that high-level libraries such as KTrain and Huggingface Transformers help to reduce the complexity of applying state-of-the-art transformer models for Natural Language Generation (NLG) and other Natural Language Processing (NLP) tasks such as classification and make these approaches broadly applicable.
