Image by mohamed Hassan from Pixabay

This notebook was adapted from the following project:


Original license of the project this notebook was adapted from:


# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# See the License for the specific language governing permissions and
# limitations under the License.


Hola! Today we will be creating a chatbot, but not just any chatbot. In this tutorial, you will create your own open-dialog chatbot, one that doesn't just have premade responses to very specific questions or commands!

The overall goal of this tutorial is to create a language learning companion where you can practice simple conversations in a language you care about. We will focus on the beautiful Spanish language in this series as I have been trying to learn the language for the past 5 years, however, you should be able to adapt this tutorial to other languages as well.

First we are going to cover some of the background material for how all this works (if you are already familiar with the GPT2 model, go ahead and skip this background section). Let's get to it!


What is GPT2?

In this post, we are going to use the GPT2 model (Generative Pre-Training 2), from the amazing paper "Language Models are Unsupervised Multitask Learners" by Alex Radford et. al. I will be giving a brief overview of this model. However, if you want a more in-depth explanation I highly recommend the blog post "The Illustrated GPT-2" by Jay Alammar.

GPT2 is what is called an autoregressive language model. This may sound complicated, but it is actually quiet simple, so lets break down what this means. Autoregressive means that the output of the model is fedback into the model as input. Here is a nice example of how that works:

Image From Deepmind

Now, a language model is usually some statistical model that gives the probability of some word given the context. So, take the following example:

An [blank] a day keeps the doctor away

A good language model would give higher probability to the word "apple" occuring in the [blank] than say the word "crocodile" since most likely encountering a crocodile daily would probably have the opposite effect.

Putting them together, we get an autoregressive language model where given some context

How much wood could a woodchuck chuck, if a woodchuck could [blank]

The statistical model then gives some probability to what the next word will be, which we will use in selecting the word. Once we have the selection we add it to our sentence and repeat the whole process again!

How much wood could a woodchuck chuck, if a woodchuck could chuck [blank]

Now, to train our autoregressive language model we just need to get a bunch of example sentences or just chunks of text, hide the last word, and use these sentences with the missing word as our inputs and the last words as the target. This is essentially the whole idea behind GPT2 and many other autoregressive language models, where they learn how language works by using the context to infer the next word.

GPT2 as a chatbot

Great, so you may be asking yourself, "how do we use GPT2 as a chatbot?" To answer this question we need to turn our attention to another paper, "DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation". To see how we can repurpose this generator, GPT2, look at the following example:

Hi, how are you? [end_of_turn] I'm good, what about you? [end_of_turn] Not so good, lots of long nights at work. [end_of_turn] Darn, that sucks :( [end_of_conversation]

This is a sample conversation between two speakers. What's special about it is that there are special tokens that signify when one of the speakers has finished talking, which we in the biz call a turn. If we treat this example like our previous one with the autorgressive language mode, we can do some interesting things:

Hi, how are you? [end_of_turn] [blank]

If we use the same logic as we did previously, it is easy to see how we can now use GPT2 to guess the next word in this conversation.

Hi, how are you? [end_of_turn] I'm [blank]

We keep feeding back the prediction of our model and there ya have it! A chatting GPT2, where all we need to do is show the model a bunch of these example conversations and have it predict the next word in the conversation.

I think that is plenty of background, we will revisit exactly how we design a system where we actually hold a conversation with GPT2 once we have the model trained ;).

! pip -q install transformers==2.9.0 gdown
     |████████████████████████████████| 645kB 5.5MB/s 
     |████████████████████████████████| 1.0MB 44.6MB/s 
     |████████████████████████████████| 890kB 45.8MB/s 
     |████████████████████████████████| 3.8MB 34.9MB/s 
  Building wheel for sacremoses ( ... done

Let's define to configuration variables so we don't have a bunch of magic numbers and strings!

# Args to allow for easy convertion of python script to notebook
class Args():
    def __init__(self):
        self.output_dir = 'output'
        self.model_type = 'gpt2'
        self.model_name_or_path = 'microsoft/DialoGPT-small'
        self.config_name = 'microsoft/DialoGPT-small'
        self.tokenizer_name = 'microsoft/DialoGPT-small'
        self.cache_dir = 'cached'
        self.block_size = 512
        self.do_train = True
        self.do_eval = True
        self.evaluate_during_training = False
        self.per_gpu_train_batch_size = 4
        self.per_gpu_eval_batch_size = 4
        self.gradient_accumulation_steps = 1
        self.learning_rate = 5e-5
        self.weight_decay = 0.0
        self.adam_epsilon = 1e-8
        self.max_grad_norm = 1.0
        self.num_train_epochs = 3
        self.max_steps = -1
        self.warmup_steps = 0
        self.logging_steps = 1000
        self.save_steps = 3500
        self.save_total_limit = None
        self.eval_all_checkpoints = False
        self.no_cuda = False
        self.overwrite_output_dir = True
        self.overwrite_cache = True
        self.should_continue = False
        self.seed = 42
        self.local_rank = -1
        self.fp16 = False
        self.fp16_opt_level = 'O1'

args = Args()

The Data!

To train our chatbot we will be using conversations scraped from subtitles of Spanish TV shows and movies. I've gone ahead and formated the data for us already, however, if you would like to use a different language to train your chatbot you can use this script to generate a csv with the same format I am going to use in the rest of this tutorial.

! gdown
To: /content/final_es_conv.csv
20.3MB [00:00, 55.6MB/s]
df = pd.read_csv('final_es_conv.csv')
df = df.dropna()
trn_df, val_df = train_test_split(df, test_size = 0.2)
response context context/0 context/1 context/2 context/3 context/4 context/5 context/6 context/7 context/8 context/9
36917 Es tan simple. No se que te detiene. ¡Muy persuasiva! No es crimen librarse de una alimaña. ¿Por qué tener lastima de un hombre tan vil? Además, también ha puesto sus ojos en Okayo. Hace 4 años que soy victima de mi marido. Solo estoy siendo franca contigo. Calmate. ¡Eres peor que el diablo! ¿Comprendes? Okayo me recuerda constantemente mi fracaso.
5449 Muy torpe, Joyce. A la sala de interrogación rápido. ¡Muévanse! De pie, muchachos. A la sala de interrogación rápido. ¡Muévanse! De pie, muchachos. ¡Use su cuchillo, hombre! ¡Adelántese, Thomson! ¡Bien hecho, Jenkins! Gracias. Muy bien. El bungaló del mayor Warden está al final del ... Continúe, conductor.
37004 Pídemelo. Sólo lo que quieras tú. Ya no soy yo. Eres preciosa y maravillosa. ¿No? Así te gustaré. Haré y diré lo que quieras. Nunca. Así nunca querrás estar con otras, ¿verdad? Siempre diré lo que tú desees y haré lo que tú... Pero yo sí. Pero...
47077 ¡Boris! ¡Nicolás, que alegría a mi corazón, volviste! ¡Regresan los Vencedores! ¡Miren! ¡Ahí vienen! Está vivo. Boris está vivo. Dasha prometió avisarme cuando regrese. Pero, en la fábrica dicen que él está en una u... Tampoco hay noticias de Stepan. ¡Quién sabe! ¿Por qué entonces, no hay noticias de él?
41450 Entonces por qué no estamos en mejor situación... Dora Hartley era una buena prueba. Mire, lo que hace usted creer ¿Qué los indios ... Aleja esa arma. Buenas noches. Es hora de ir a la cama. Seguro. Sí. recuerde que es un secreto. Es bonita. Está bien. ¿Ann Martin? Hola, Bax.
len(trn_df), len(val_df)
(40374, 10094)

def get_counter_and_lens(data, tokenizer):
    flatten = lambda l: [item for sublist in l for item in sublist]
    toks = [tokenizer.tokenize(x) for x in data]
    return list(map(len, toks)), Counter(flatten(toks)), Counter(' '.join(data).split())
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
lens, tok_cnt, word_cnt = get_counter_and_lens(trn_df[df.columns].apply(lambda x: ' '.join(x.astype(str)), axis = 1), tokenizer)

def plot_counts(counts, top_k = 30):
    labels, values = zip(*counts.most_common()[:top_k])

    indexes = np.arange(len(labels))
    width = 1
    plt.figure(num=None, figsize=(22, 4), dpi=60, facecolor='w', edgecolor='k'), values, width)
    plt.xticks(indexes + width * 0.5, labels)
plot_counts(tok_cnt, top_k = 30)
plot_counts(word_cnt, top_k = 30)

def plot_hist(lens, n_bins = 50):
    n, bins, patches = plt.hist(lens, n_bins, facecolor='blue', alpha=0.9)
print(f'Mean: {mean(lens)}, Median: {median(lens)}, Standard Deviation: {stdev(lens)}, 90th Percentile: {np.percentile(lens, 100)}')
Mean: 150.01203744984397, Median: 141.0, Standard Deviation: 44.59412209701778, 90th Percentile: 513.0

Let's get our data into a format that we can feed into our model using Pytorch's Dataset and Dataloader API. All these methods do are convert our dataframes where we have multiple historical dialog, i.e., context, and a response, into a single conversation string that is separated a special token that tells our model when a person is finished speaking.

These conversation strings are then tokenized using HuggingFace's awesome tokenizers into their numerical representation that our model actual understands!

def construct_conv(row, tokenizer, eos = True):
    # from:
    flatten = lambda l: [item for sublist in l for item in sublist]
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
    conv = flatten(conv)
    return conv

class ConversationDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):

        block_size = block_size - (tokenizer.max_len - tokenizer.max_len_single_sentence)

        directory = args.cache_dir
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size)

        if os.path.exists(cached_features_file) and not args.overwrite_cache:
  "Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
  "Creating features from dataset file at %s", directory)

            self.examples = []
            for _, row in df.iterrows():
                conv = construct_conv(row, tokenizer)
                if len(conv) > block_size: continue

            # Note that we are loosing the last truncated example here for the sake of simplicity (no padding)
            # If your dataset is small, first you should loook for a bigger one :-) and second you
            # can change this behavior by adding (model specific) padding.

  "Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)

Training and Evaluating

Now that we have THE DATA we can finally create our model and start training it! The training and evaluation loop are quite simple. We simplely take a batch of examples from our dataloader and use it both as our inputs and labels. We do this because GPT2 is an auto-regressive model, meaning it uses some context to predict the next token. This prediction is then added to the original context and fed back in as the new context for generating the next token.

To evaluate our model, we use the metric perplexity, which is a simple, but powerful metric. Perplexity is a measure of how unsure the model is in its choice of the next token. The more unsure our model is, the higher its perplexity. One fascinating thing about perplexity is that it correlates very well with what humans think of when it comes to coherent and specific natural conversations, which was shown in the amazing paper "Towards a Human-like Open-Domain Chatbot" by Daniel Adiwardana, et. al.

# Training of model

def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
    """ Train the model """
    if args.local_rank in [-1, 0]:
        tb_writer = SummaryWriter()

    args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
    train_dataloader = DataLoader(
        train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last = True

    if args.max_steps > 0:
        t_total = args.max_steps
        args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
        t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs

    model = model.module if hasattr(model, "module") else model  # Take care of distributed/parallel training
    # add_special_tokens_(model, tokenizer)

    # Prepare optimizer and schedule (linear warmup and decay)
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": args.weight_decay,
        {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total

    # Check if saved optimizer or scheduler states exist
    if (
        and os.path.isfile(os.path.join(args.model_name_or_path, ""))
        and os.path.isfile(os.path.join(args.model_name_or_path, ""))
        # Load in optimizer and scheduler states
        optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "")))
        scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "")))

    if args.fp16:
            from apex import amp
        except ImportError:
            raise ImportError("Please install apex from to use fp16 training.")
        model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)

    # multi-gpu training (should be after apex fp16 initialization)
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Distributed training (should be after apex fp16 initialization)
    if args.local_rank != -1:
        model = torch.nn.parallel.DistributedDataParallel(
            model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True

    # Train!"***** Running training *****")"  Num examples = %d", len(train_dataset))"  Num Epochs = %d", args.num_train_epochs)"  Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
        "  Total train batch size (w. parallel, distributed & accumulation) = %d",
        * args.gradient_accumulation_steps
        * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
    )"  Gradient Accumulation steps = %d", args.gradient_accumulation_steps)"  Total optimization steps = %d", t_total)

    global_step = 0
    epochs_trained = 0
    steps_trained_in_current_epoch = 0
    # Check if continuing training from a checkpoint
    if args.model_name_or_path and os.path.exists(args.model_name_or_path):
            # set global_step to gobal_step of last saved checkpoint from model path
            checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
            global_step = int(checkpoint_suffix)
            epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
            steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)

  "  Continuing training from checkpoint, will skip to saved global_step")
  "  Continuing training from epoch %d", epochs_trained)
  "  Continuing training from global step %d", global_step)
  "  Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
        except ValueError:
  "  Starting fine-tuning.")

    tr_loss, logging_loss = 0.0, 0.0

    train_iterator = trange(
        epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
    set_seed(args)  # Added here for reproducibility
    for _ in train_iterator:
        epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
        for step, batch in enumerate(epoch_iterator):

            # Skip past any already trained steps if resuming training
            if steps_trained_in_current_epoch > 0:
                steps_trained_in_current_epoch -= 1

            inputs, labels = (batch, batch)
            if inputs.shape[1] > 1024: continue
            inputs =
            labels =
            outputs = model(inputs, labels=labels)
            loss = outputs[0]  # model outputs are always tuple in transformers (see doc)

            if args.n_gpu > 1:
                loss = loss.mean()  # mean() to average on multi-gpu parallel training
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            if args.fp16:
                with amp.scale_loss(loss, optimizer) as scaled_loss:

            tr_loss += loss.item()
            if (step + 1) % args.gradient_accumulation_steps == 0:
                if args.fp16:
                    torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
                    torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
                scheduler.step()  # Update learning rate schedule
                global_step += 1

                if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
                    # Log metrics
                    if (
                        args.local_rank == -1 and args.evaluate_during_training
                    ):  # Only evaluate when single GPU otherwise metrics may not average well
                        results = evaluate(args, model, tokenizer)
                        for key, value in results.items():
                            tb_writer.add_scalar("eval_{}".format(key), value, global_step)
                    tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
                    tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
                    logging_loss = tr_loss

                if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
                    checkpoint_prefix = "checkpoint"
                    # Save model checkpoint
                    output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
                    os.makedirs(output_dir, exist_ok=True)
                    model_to_save = (
                        model.module if hasattr(model, "module") else model
                    )  # Take care of distributed/parallel training

          , os.path.join(output_dir, "training_args.bin"))
          "Saving model checkpoint to %s", output_dir)

                    _rotate_checkpoints(args, checkpoint_prefix)

          , os.path.join(output_dir, ""))
          , os.path.join(output_dir, ""))
          "Saving optimizer and scheduler states to %s", output_dir)

            if args.max_steps > 0 and global_step > args.max_steps:
        if args.max_steps > 0 and global_step > args.max_steps:

    if args.local_rank in [-1, 0]:

    return global_step, tr_loss / global_step

# Evaluation of some model

def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    eval_output_dir = args.output_dir

    eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
    os.makedirs(eval_output_dir, exist_ok=True)
    args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
    # Note that DistributedSampler samples randomly

    def collate(examples: List[torch.Tensor]):
        if tokenizer._pad_token is None:
            return pad_sequence(examples, batch_first=True)
        return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)

    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(
        eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last = True

    # multi-gpu evaluate
    if args.n_gpu > 1:
        model = torch.nn.DataParallel(model)

    # Eval!"***** Running evaluation {} *****".format(prefix))"  Num examples = %d", len(eval_dataset))"  Batch size = %d", args.eval_batch_size)
    eval_loss = 0.0
    nb_eval_steps = 0

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = (batch, batch)
        inputs =
        labels =

        with torch.no_grad():
            outputs = model(inputs, labels=labels)
            lm_loss = outputs[0]
            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))

    result = {"perplexity": perplexity}

    output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
    with open(output_eval_file, "w") as writer:"***** Eval results {} *****".format(prefix))
        for key in sorted(result.keys()):
  "  %s = %s", key, str(result[key]))
            writer.write("%s = %s\n" % (key, str(result[key])))

    return result

Now let's put it all together into our runner function and let our baby cook away!

# Main show runner

def main(df_trn, df_val):
    args = Args()
    if args.should_continue:
        sorted_checkpoints = _sorted_checkpoints(args)
        if len(sorted_checkpoints) == 0:
            raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
            args.model_name_or_path = sorted_checkpoints[-1]

    if (
        and os.listdir(args.output_dir)
        and args.do_train
        and not args.overwrite_output_dir
        and not args.should_continue
        raise ValueError(
            "Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(

    # Setup CUDA, GPU & distributed training
    device = torch.device("cuda")
    args.n_gpu = torch.cuda.device_count()
    args.device = device

    # Setup logging
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
        "Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
        bool(args.local_rank != -1),

    # Set seed

    config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
    model = AutoModelWithLMHead.from_pretrained(
    )"Training/evaluation parameters %s", args)

    # Training
    if args.do_train:
        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)

        global_step, tr_loss = train(args, train_dataset, model, tokenizer)" global_step = %s, average loss = %s", global_step, tr_loss)

    # Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
    if args.do_train:
        # Create output directory if needed
        os.makedirs(args.output_dir, exist_ok=True)"Saving model checkpoint to %s", args.output_dir)
        # Save a trained model, configuration and tokenizer using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        model_to_save = (
            model.module if hasattr(model, "module") else model
        )  # Take care of distributed/parallel training

        # Good practice: save your training arguments together with the trained model, os.path.join(args.output_dir, "training_args.bin"))

        # Load a trained model and vocabulary that you have fine-tuned
        model = AutoModelWithLMHead.from_pretrained(args.output_dir)
        tokenizer = AutoTokenizer.from_pretrained(args.output_dir)

    # Evaluation
    results = {}
    if args.do_eval and args.local_rank in [-1, 0]:
        checkpoints = [args.output_dir]
        if args.eval_all_checkpoints:
            checkpoints = list(
                os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
            logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging"Evaluate the following checkpoints: %s", checkpoints)
        for checkpoint in checkpoints:
            global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
            prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""

            model = AutoModelWithLMHead.from_pretrained(checkpoint)
            result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
            result = dict((k + "_{}".format(global_step), v) for k, v in result.items())

    return results
%load_ext tensorboard
%tensorboard --logdir runs

Finally, we run our model! I found this can take anywhere from an hour to three hours depending on the GPU Google give to you to finish training a model that can sort of hold a coherent conversation for the Spanish language. If you are using a different language, you'll have to play around with how long to cook your model for.

main(trn_df, val_df)

Chatting with our Model

Now that we have our model trained, let's it out for a spin and have our first conversation with it!

In order to allow us to chitchat with our new bot we need to figure out when the model has finished its turn, i.e. when it has generated the [end_of_turn] token. When the model generates this token, we can switch back control of the conversation to the user so they can respond. Luckily, this is very easy to do with the Huggingface framework!

The below code is copied pretty much verbatim from the creators of the DialoGPT model, which you can find here.

tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelWithLMHead.from_pretrained('output')

# Let's chat for 5 lines
for step in range(6):
    # encode the new user input, add the eos_token and return a tensor in Pytorch
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
    # print(new_user_input_ids)

    # append the new user input tokens to the chat history
    bot_input_ids =[chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # generated a response while limiting the total chat history to 1000 tokens, 
    chat_history_ids = model.generate(
        bot_input_ids, max_length=1000,
        top_p=0.92, top_k = 50
    # pretty print last ouput tokens from bot
    print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
05/13/2020 00:27:10 - INFO - filelock -   Lock 139706162979168 acquired on /root/.cache/torch/transformers/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5.lock
05/13/2020 00:27:10 - INFO - transformers.file_utils - not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpkhif9g52
05/13/2020 00:27:10 - INFO - transformers.file_utils -   storing in cache at /root/.cache/torch/transformers/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
05/13/2020 00:27:10 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
05/13/2020 00:27:10 - INFO - filelock -   Lock 139706162979168 released on /root/.cache/torch/transformers/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5.lock
05/13/2020 00:27:10 - INFO - transformers.configuration_utils -   loading configuration file from cache at /root/.cache/torch/transformers/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
05/13/2020 00:27:10 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "vocab_size": 50257

05/13/2020 00:27:10 - INFO - transformers.tokenization_utils -   Model name 'microsoft/DialoGPT-small' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming 'microsoft/DialoGPT-small' is a path, a model identifier, or url to a directory containing tokenizer files.

05/13/2020 00:27:11 - INFO - filelock -   Lock 139706164883072 acquired on /root/.cache/torch/transformers/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71.lock
05/13/2020 00:27:11 - INFO - transformers.file_utils - not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpaeb7ikva
05/13/2020 00:27:12 - INFO - transformers.file_utils -   storing in cache at /root/.cache/torch/transformers/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
05/13/2020 00:27:12 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
05/13/2020 00:27:12 - INFO - filelock -   Lock 139706164883072 released on /root/.cache/torch/transformers/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71.lock

05/13/2020 00:27:12 - INFO - filelock -   Lock 139706162979168 acquired on /root/.cache/torch/transformers/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock
05/13/2020 00:27:12 - INFO - transformers.file_utils - not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmp4k0b0lt0
05/13/2020 00:27:13 - INFO - transformers.file_utils -   storing in cache at /root/.cache/torch/transformers/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
05/13/2020 00:27:13 - INFO - transformers.file_utils -   creating metadata file for /root/.cache/torch/transformers/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
05/13/2020 00:27:13 - INFO - filelock -   Lock 139706162979168 released on /root/.cache/torch/transformers/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock

05/13/2020 00:27:14 - INFO - transformers.tokenization_utils -   loading file from cache at /root/.cache/torch/transformers/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
05/13/2020 00:27:14 - INFO - transformers.tokenization_utils -   loading file from cache at /root/.cache/torch/transformers/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
05/13/2020 00:27:14 - INFO - transformers.tokenization_utils -   loading file from cache at None
05/13/2020 00:27:14 - INFO - transformers.tokenization_utils -   loading file from cache at None
05/13/2020 00:27:14 - INFO - transformers.tokenization_utils -   loading file from cache at None
05/13/2020 00:27:14 - INFO - transformers.configuration_utils -   loading configuration file output/config.json
05/13/2020 00:27:14 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "vocab_size": 50257

05/13/2020 00:27:14 - INFO - transformers.modeling_utils -   loading weights file output/pytorch_model.bin
>> User:Hola.
DialoGPT: ¿Qué estás haciendo?
>> User:Estoy leyendo un libro nuevo.
DialoGPT: ¿Qué leyendo?
>> User:Se llama "The Witcher."
DialoGPT: ¿Qué quieres decir?
>> User:Es un libro sobre magia y monstruos.
DialoGPT: ¿Qué quieres decir?
>> User:¿Te gusta libros?
DialoGPT: ¿Qué te pasa?
>> User:Nada mucho.
DialoGPT: ¿Por qué no me lo dijiste?

Now, it ain't the best, however, training it for longer or using the DialogGPT-medium instead of DialogGPT-small does improve results, at least in my experiments. I decided to only include the DialogGPT-small in this tutorial due to the limited (But still AMAZING) resources of Google Colab. I've went ahead and trained a bigger DialogGPT-medium model for longer and have uploaded it to Huggingface for anyone to try out! Find the model card here.


In this tutorial, you learned how to train an Open-Dialog chatbot in any language we want to practice with! This involved learning about the amazing transformers library by Huggingface that has seen a lot of popularity recently. You've also learned what an Open-Dialog chatbot is and some of the difficulties that come with training them such as constructing training examples and generating repetitive text.

This is just part one in what I am hoping will be a three part series! In the next part, we will take our model and integrate it into a web app using the awesome Streamlit library. Finally, part three will then be generating an Android application for chatting with your new language companion!


If you do train a new chatbot for a language of your interest, please share it! I'd love to hear about your progress with it and I'm sure others would be also interested in it as these models can be quite expensive to train.

If you want an ease way to share it, I suggest submitting your trained model to Huggingface's model zoo, where others can view and download your model to use as a starting point for their applications! Here is a simple way for taking the model trained in this tutorial and uploading it to Hugginface's website following the instructions on the Huggingface website:

First make sure you have a Huggingface account: Next Run the following code snippets and that's it!

! rm -rf output/checkpoint-*
! mv output <name_of_model>
! transformers-cli login
# log in using the same credentials as on
! transformers-cli upload <name_of_model>