Open-Dialog Chatbots for Learning New Languages [Part 1]
How to fine-tune the DialoGPT model on a new dataset or language for open-dialog conversational chatbots.
- This notebook was adapted from the following project:
- About
- Background
- The Data!
- Training and Evaluating
- Chatting with our Model
- Conclusion
- PS
Image by Mohamed Hassan from Pixabay
This notebook was adapted from the following project: https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py
Original license of the project this notebook was adapted from:
LICENSE
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
About
Hola! Today we will be creating a chatbot, but not just any chatbot. In this tutorial, you will create your own open-dialog chatbot, one that doesn't just have premade responses to very specific questions or commands!
The overall goal of this tutorial is to create a language learning companion with which you can practice simple conversations in a language you care about. We will focus on the beautiful Spanish language in this series, as I have been trying to learn it for the past five years; however, you should be able to adapt this tutorial to other languages as well.
First we are going to cover some of the background material for how all this works (if you are already familiar with the GPT2 model, go ahead and skip this background section). Let's get to it!
What is GPT2?
In this post, we are going to use the GPT2 model (Generative Pre-Training 2), from the amazing paper "Language Models are Unsupervised Multitask Learners" by Alec Radford et al. I will be giving a brief overview of this model. However, if you want a more in-depth explanation, I highly recommend the blog post "The Illustrated GPT-2" by Jay Alammar.
GPT2 is what is called an autoregressive language model. This may sound complicated, but it is actually quite simple, so let's break down what this means. Autoregressive means that the output of the model is fed back into the model as input. Here is a nice example of how that works:
Image from DeepMind
Now, a language model is usually some statistical model that gives the probability of some word given the context. So, take the following example:
An [blank] a day keeps the doctor away
A good language model would give a higher probability to the word "apple" occurring in the [blank] than, say, the word "crocodile", since encountering a crocodile daily would most likely have the opposite effect.
Putting the two together, we get an autoregressive language model. Given some context:
How much wood could a woodchuck chuck, if a woodchuck could [blank]
The statistical model then gives a probability for each possible next word, which we use to select one. Once we have the selection, we add it to our sentence and repeat the whole process again:
How much wood could a woodchuck chuck, if a woodchuck could chuck [blank]
Now, to train our autoregressive language model we just need a bunch of example sentences or chunks of text: hide the last word, use the sentence with the missing word as the input, and use the hidden word as the target. This is essentially the whole idea behind GPT2 and many other autoregressive language models; they learn how language works by using the context to infer the next word.
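To make this concrete, here is a tiny sketch of what those (context, next word) pairs look like once a sentence is tokenized. The "gpt2" tokenizer is used purely for illustration; during actual training the model scores every position of the sequence at once, which is why later on we pass the same batch as both inputs and labels.
from transformers import AutoTokenizer

# Illustrative only: any GPT-2-style tokenizer would do here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "An apple a day keeps the doctor away"
token_ids = tokenizer.encode(text)

# Every prefix of the sentence is a context, and the token that follows it is the target.
for i in range(1, len(token_ids)):
    context = tokenizer.decode(token_ids[:i])
    target = tokenizer.decode([token_ids[i]])
    print(f"context: {context!r:42} -> next token: {target!r}")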
GPT2 as a chatbot
Great, so you may be asking yourself, "how do we use GPT2 as a chatbot?" To answer this question we need to turn our attention to another paper, "DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation". To see how we can repurpose this generator, GPT2, look at the following example:
Hi, how are you? [end_of_turn] I'm good, what about you? [end_of_turn] Not so good, lots of long nights at work. [end_of_turn] Darn, that sucks :( [end_of_conversation]
This is a sample conversation between two speakers. What's special about it is that there are special tokens that signify when one of the speakers has finished talking, which we in the biz call a turn. If we treat this example like our previous one with the autoregressive language model, we can do some interesting things:
Hi, how are you? [end_of_turn] [blank]
If we use the same logic as we did previously, it is easy to see how we can now use GPT2 to guess the next word in this conversation.
Hi, how are you? [end_of_turn] I'm [blank]
We keep feeding back the prediction of our model and there ya have it! A chatting GPT2, where all we need to do is show the model a bunch of these example conversations and have it predict the next word in the conversation.
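To make that feedback loop concrete, here is a minimal sketch of greedy decoding with a GPT2-style model from the transformers library. The function name is just for illustration; the actual chat cell later in this tutorial uses model.generate(), which performs this same loop with smarter sampling.
import torch

def chat_turn(model, tokenizer, context_ids, max_new_tokens=40):
    # Keep predicting the next token and feeding it back in until the model
    # signals the end of its turn (or we hit the token budget).
    model.eval()
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(context_ids)[0]                        # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        context_ids = torch.cat([context_ids, next_id], dim=-1)   # feed it back as the new context
        if next_id.item() == tokenizer.eos_token_id:              # the speaker's turn is over
            break
    return context_ids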
I think that is plenty of background; we will revisit exactly how we design a system for actually holding a conversation with GPT2 once we have the model trained ;).
! pip -q install transformers==2.9.0 gdown
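The cells below also rely on a handful of standard imports plus a few small helpers from Huggingface's run_language_modeling.py example that aren't shown in this post. Here is a sketch of what that setup cell looks like; the helper bodies are minimal versions, written to be consistent with how they are called later on.
import glob
import logging
import os
import pickle
import random
import shutil
from collections import Counter
from statistics import mean, median, stdev
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
from sklearn.model_selection import train_test_split
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm, trange

from transformers import (
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModelWithLMHead,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)

logger = logging.getLogger(__name__)

def set_seed(args):
    # Reproducibility helper, mirroring the original example script
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)

def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
    # Wraps the ConversationDataset defined further down in this notebook
    return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)

def _sorted_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False):
    # Checkpoint directories sorted by their global-step suffix
    paths = glob.glob(os.path.join(args.output_dir, "{}-*".format(checkpoint_prefix)))
    return sorted(paths, key=lambda p: int(p.split("-")[-1]))

def _rotate_checkpoints(args, checkpoint_prefix="checkpoint", use_mtime=False):
    # Only delete old checkpoints when a save_total_limit is set
    if not args.save_total_limit:
        return
    checkpoints = _sorted_checkpoints(args, checkpoint_prefix, use_mtime)
    for checkpoint in checkpoints[: max(0, len(checkpoints) - args.save_total_limit)]:
        shutil.rmtree(checkpoint)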
Let's define some configuration variables so we don't have a bunch of magic numbers and strings!
# Args to allow for easy conversion of a Python script to a notebook
class Args():
def __init__(self):
self.output_dir = 'output'
self.model_type = 'gpt2'
self.model_name_or_path = 'microsoft/DialoGPT-small'
self.config_name = 'microsoft/DialoGPT-small'
self.tokenizer_name = 'microsoft/DialoGPT-small'
self.cache_dir = 'cached'
self.block_size = 512
self.do_train = True
self.do_eval = True
self.evaluate_during_training = False
self.per_gpu_train_batch_size = 4
self.per_gpu_eval_batch_size = 4
self.gradient_accumulation_steps = 1
self.learning_rate = 5e-5
self.weight_decay = 0.0
self.adam_epsilon = 1e-8
self.max_grad_norm = 1.0
self.num_train_epochs = 3
self.max_steps = -1
self.warmup_steps = 0
self.logging_steps = 1000
self.save_steps = 3500
self.save_total_limit = None
self.eval_all_checkpoints = False
self.no_cuda = False
self.overwrite_output_dir = True
self.overwrite_cache = True
self.should_continue = False
self.seed = 42
self.local_rank = -1
self.fp16 = False
self.fp16_opt_level = 'O1'
args = Args()
The Data!
To train our chatbot we will be using conversations scraped from subtitles of Spanish TV shows and movies. I've gone ahead and formatted the data for us already; however, if you would like to use a different language to train your chatbot, you can use this script to generate a CSV with the same format I am going to use in the rest of this tutorial.
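If you do build your own CSV, here is a rough sketch of the layout the rest of this tutorial assumes: one row per response, with the preceding turns stored in additional context columns, most recent first. The column names and file name here are illustrative; construct_conv() below simply reverses each row to restore chronological order.
import pandas as pd

# Hypothetical example rows; the real file is the final_es_conv.csv downloaded below.
example = pd.DataFrame([
    {"response": "No tan bien, muchas noches largas en el trabajo.",
     "context": "Estoy bien, ¿y tú?",
     "context/0": "Hola, ¿cómo estás?"},
])
example.to_csv("my_conversations.csv", index=False)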
! gdown https://drive.google.com/uc?id=1Lp-diuMohUTGyB9BSTFgeGZyY3dkNuEg
df = pd.read_csv('final_es_conv.csv')
df = df.dropna()
trn_df, val_df = train_test_split(df, test_size = 0.2)
trn_df.head()
len(trn_df), len(val_df)
def get_counter_and_lens(data, tokenizer):
flatten = lambda l: [item for sublist in l for item in sublist]
toks = [tokenizer.tokenize(x) for x in data]
return list(map(len, toks)), Counter(flatten(toks)), Counter(' '.join(data).split())
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
lens, tok_cnt, word_cnt = get_counter_and_lens(trn_df[df.columns].apply(lambda x: ' '.join(x.astype(str)), axis = 1), tokenizer)
def plot_counts(counts, top_k = 30):
labels, values = zip(*counts.most_common()[:top_k])
indexes = np.arange(len(labels))
width = 1
plt.figure(num=None, figsize=(22, 4), dpi=60, facecolor='w', edgecolor='k')
plt.bar(indexes, values, width)
plt.xticks(indexes + width * 0.5, labels)
plt.show()
plot_counts(tok_cnt, top_k = 30)
plot_counts(word_cnt, top_k = 30)
def plot_hist(lens, n_bins = 50):
n, bins, patches = plt.hist(lens, n_bins, facecolor='blue', alpha=0.9)
plt.show()
print(f'Mean: {mean(lens)}, Median: {median(lens)}, Standard Deviation: {stdev(lens)}, 90th Percentile: {np.percentile(lens, 90)}')
plot_hist(lens)
Let's get our data into a format that we can feed into our model using PyTorch's Dataset and DataLoader APIs. All these methods do is convert each row of our dataframe, which holds several turns of historical dialog (the context) plus a response, into a single conversation string separated by a special token that tells our model when a person has finished speaking.
These conversation strings are then tokenized using HuggingFace's awesome tokenizers into the numerical representation that our model actually understands!
def construct_conv(row, tokenizer, eos = True):
# from: https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists
flatten = lambda l: [item for sublist in l for item in sublist]
conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
conv = flatten(conv)
return conv
class ConversationDataset(Dataset):
def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):
block_size = block_size - (tokenizer.max_len - tokenizer.max_len_single_sentence)
directory = args.cache_dir
cached_features_file = os.path.join(
directory, args.model_type + "_cached_lm_" + str(block_size)
)
if os.path.exists(cached_features_file) and not args.overwrite_cache:
logger.info("Loading features from cached file %s", cached_features_file)
with open(cached_features_file, "rb") as handle:
self.examples = pickle.load(handle)
else:
logger.info("Creating features from dataset file at %s", directory)
self.examples = []
for _, row in df.iterrows():
conv = construct_conv(row, tokenizer)
if len(conv) > block_size: continue
self.examples.append(conv)
# Note that we are losing the last truncated example here for the sake of simplicity (no padding)
# If your dataset is small, first you should look for a bigger one :-) and second you
# can change this behavior by adding (model specific) padding.
logger.info("Saving features into cached file %s", cached_features_file)
with open(cached_features_file, "wb") as handle:
pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)
def __len__(self):
return len(self.examples)
def __getitem__(self, item):
return torch.tensor(self.examples[item], dtype=torch.long)
Training and Evaluating
Now that we have THE DATA we can finally create our model and start training it! The training and evaluation loops are quite simple. We simply take a batch of examples from our dataloader and use it as both our inputs and our labels. We do this because GPT2 is an autoregressive model, meaning it uses some context to predict the next token. This prediction is then added to the original context and fed back in as the new context for generating the next token.
To evaluate our model, we use perplexity, a simple but powerful metric. Perplexity is a measure of how unsure the model is in its choice of the next token: the more unsure the model is, the higher its perplexity. One fascinating thing about perplexity is that it correlates very well with how coherent and specific humans judge a conversation to be, as shown in the amazing paper "Towards a Human-like Open-Domain Chatbot" by Daniel Adiwardana et al.
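Concretely, perplexity is just the exponential of the average per-token cross-entropy loss, which is exactly how the evaluate() function below computes it. A quick sketch:
import torch

def perplexity_from_loss(mean_cross_entropy: float) -> torch.Tensor:
    # Perplexity = exp(average negative log-likelihood per token)
    return torch.exp(torch.tensor(mean_cross_entropy))

# For example, an average eval loss of 3.0 gives a perplexity of about 20,
# i.e. the model is roughly as uncertain as a uniform choice over ~20 tokens.
print(perplexity_from_loss(3.0))  # tensor(20.0855)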
# Training of model
def train(args, train_dataset, model: PreTrainedModel, tokenizer: PreTrainedTokenizer) -> Tuple[int, float]:
""" Train the model """
if args.local_rank in [-1, 0]:
tb_writer = SummaryWriter()
args.train_batch_size = args.per_gpu_train_batch_size * max(1, args.n_gpu)
def collate(examples: List[torch.Tensor]):
if tokenizer._pad_token is None:
return pad_sequence(examples, batch_first=True)
return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
train_dataloader = DataLoader(
train_dataset, sampler=train_sampler, batch_size=args.train_batch_size, collate_fn=collate, drop_last = True
)
if args.max_steps > 0:
t_total = args.max_steps
args.num_train_epochs = args.max_steps // (len(train_dataloader) // args.gradient_accumulation_steps) + 1
else:
t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
model = model.module if hasattr(model, "module") else model # Take care of distributed/parallel training
model.resize_token_embeddings(len(tokenizer))
# add_special_tokens_(model, tokenizer)
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
{
"params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
"weight_decay": args.weight_decay,
},
{"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(
optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)
# Check if saved optimizer or scheduler states exist
if (
args.model_name_or_path
and os.path.isfile(os.path.join(args.model_name_or_path, "optimizer.pt"))
and os.path.isfile(os.path.join(args.model_name_or_path, "scheduler.pt"))
):
# Load in optimizer and scheduler states
optimizer.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "optimizer.pt")))
scheduler.load_state_dict(torch.load(os.path.join(args.model_name_or_path, "scheduler.pt")))
if args.fp16:
try:
from apex import amp
except ImportError:
raise ImportError("Please install apex from https://www.github.com/nvidia/apex to use fp16 training.")
model, optimizer = amp.initialize(model, optimizer, opt_level=args.fp16_opt_level)
# multi-gpu training (should be after apex fp16 initialization)
if args.n_gpu > 1:
model = torch.nn.DataParallel(model)
# Distributed training (should be after apex fp16 initialization)
if args.local_rank != -1:
model = torch.nn.parallel.DistributedDataParallel(
model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
)
# Train!
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_dataset))
logger.info(" Num Epochs = %d", args.num_train_epochs)
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(
" Total train batch size (w. parallel, distributed & accumulation) = %d",
args.train_batch_size
* args.gradient_accumulation_steps
* (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
)
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", t_total)
global_step = 0
epochs_trained = 0
steps_trained_in_current_epoch = 0
# Check if continuing training from a checkpoint
if args.model_name_or_path and os.path.exists(args.model_name_or_path):
try:
# set global_step to global_step of last saved checkpoint from model path
checkpoint_suffix = args.model_name_or_path.split("-")[-1].split("/")[0]
global_step = int(checkpoint_suffix)
epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
logger.info(" Continuing training from checkpoint, will skip to saved global_step")
logger.info(" Continuing training from epoch %d", epochs_trained)
logger.info(" Continuing training from global step %d", global_step)
logger.info(" Will skip the first %d steps in the first epoch", steps_trained_in_current_epoch)
except ValueError:
logger.info(" Starting fine-tuning.")
tr_loss, logging_loss = 0.0, 0.0
model.zero_grad()
train_iterator = trange(
epochs_trained, int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]
)
set_seed(args) # Added here for reproducibility
for _ in train_iterator:
epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
for step, batch in enumerate(epoch_iterator):
# Skip past any already trained steps if resuming training
if steps_trained_in_current_epoch > 0:
steps_trained_in_current_epoch -= 1
continue
inputs, labels = (batch, batch)
if inputs.shape[1] > 1024: continue
inputs = inputs.to(args.device)
labels = labels.to(args.device)
model.train()
outputs = model(inputs, labels=labels)
loss = outputs[0] # model outputs are always tuple in transformers (see doc)
if args.n_gpu > 1:
loss = loss.mean() # mean() to average on multi-gpu parallel training
if args.gradient_accumulation_steps > 1:
loss = loss / args.gradient_accumulation_steps
if args.fp16:
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
else:
loss.backward()
tr_loss += loss.item()
if (step + 1) % args.gradient_accumulation_steps == 0:
if args.fp16:
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
else:
torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
optimizer.step()
scheduler.step() # Update learning rate schedule
model.zero_grad()
global_step += 1
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
# Log metrics
if (
args.local_rank == -1 and args.evaluate_during_training
): # Only evaluate when single GPU otherwise metrics may not average well
results = evaluate(args, model, tokenizer)
for key, value in results.items():
tb_writer.add_scalar("eval_{}".format(key), value, global_step)
tb_writer.add_scalar("lr", scheduler.get_lr()[0], global_step)
tb_writer.add_scalar("loss", (tr_loss - logging_loss) / args.logging_steps, global_step)
logging_loss = tr_loss
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
checkpoint_prefix = "checkpoint"
# Save model checkpoint
output_dir = os.path.join(args.output_dir, "{}-{}".format(checkpoint_prefix, global_step))
os.makedirs(output_dir, exist_ok=True)
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
torch.save(args, os.path.join(output_dir, "training_args.bin"))
logger.info("Saving model checkpoint to %s", output_dir)
_rotate_checkpoints(args, checkpoint_prefix)
torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
torch.save(scheduler.state_dict(), os.path.join(output_dir, "scheduler.pt"))
logger.info("Saving optimizer and scheduler states to %s", output_dir)
if args.max_steps > 0 and global_step > args.max_steps:
epoch_iterator.close()
break
if args.max_steps > 0 and global_step > args.max_steps:
train_iterator.close()
break
if args.local_rank in [-1, 0]:
tb_writer.close()
return global_step, tr_loss / global_step
# Evaluation of the model
def evaluate(args, model: PreTrainedModel, tokenizer: PreTrainedTokenizer, df_trn, df_val, prefix="") -> Dict:
# Loop to handle MNLI double evaluation (matched, mis-matched)
eval_output_dir = args.output_dir
eval_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=True)
os.makedirs(eval_output_dir, exist_ok=True)
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
# Note that DistributedSampler samples randomly
def collate(examples: List[torch.Tensor]):
if tokenizer._pad_token is None:
return pad_sequence(examples, batch_first=True)
return pad_sequence(examples, batch_first=True, padding_value=tokenizer.pad_token_id)
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(
eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate, drop_last = True
)
# multi-gpu evaluate
if args.n_gpu > 1:
model = torch.nn.DataParallel(model)
# Eval!
logger.info("***** Running evaluation {} *****".format(prefix))
logger.info(" Num examples = %d", len(eval_dataset))
logger.info(" Batch size = %d", args.eval_batch_size)
eval_loss = 0.0
nb_eval_steps = 0
model.eval()
for batch in tqdm(eval_dataloader, desc="Evaluating"):
inputs, labels = (batch, batch)
inputs = inputs.to(args.device)
labels = labels.to(args.device)
with torch.no_grad():
outputs = model(inputs, labels=labels)
lm_loss = outputs[0]
eval_loss += lm_loss.mean().item()
nb_eval_steps += 1
eval_loss = eval_loss / nb_eval_steps
perplexity = torch.exp(torch.tensor(eval_loss))
result = {"perplexity": perplexity}
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results {} *****".format(prefix))
for key in sorted(result.keys()):
logger.info(" %s = %s", key, str(result[key]))
writer.write("%s = %s\n" % (key, str(result[key])))
return result
Now let's put it all together into our runner function and let our baby cook away!
# Main show runner
def main(df_trn, df_val):
args = Args()
if args.should_continue:
sorted_checkpoints = _sorted_checkpoints(args)
if len(sorted_checkpoints) == 0:
raise ValueError("Used --should_continue but no checkpoint was found in --output_dir.")
else:
args.model_name_or_path = sorted_checkpoints[-1]
if (
os.path.exists(args.output_dir)
and os.listdir(args.output_dir)
and args.do_train
and not args.overwrite_output_dir
and not args.should_continue
):
raise ValueError(
"Output directory ({}) already exists and is not empty. Use --overwrite_output_dir to overcome.".format(
args.output_dir
)
)
# Setup CUDA, GPU & distributed training
device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
args.n_gpu = torch.cuda.device_count()
args.device = device
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
level=logging.INFO if args.local_rank in [-1, 0] else logging.WARN,
)
logger.warning(
"Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
args.local_rank,
device,
args.n_gpu,
bool(args.local_rank != -1),
args.fp16,
)
# Set seed
set_seed(args)
config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, cache_dir=args.cache_dir)
model = AutoModelWithLMHead.from_pretrained(
args.model_name_or_path,
from_tf=False,
config=config,
cache_dir=args.cache_dir,
)
model.to(args.device)
logger.info("Training/evaluation parameters %s", args)
# Training
if args.do_train:
train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
# Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()
if args.do_train:
# Create output directory if needed
os.makedirs(args.output_dir, exist_ok=True)
logger.info("Saving model checkpoint to %s", args.output_dir)
# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = (
model.module if hasattr(model, "module") else model
) # Take care of distributed/parallel training
model_to_save.save_pretrained(args.output_dir)
tokenizer.save_pretrained(args.output_dir)
# Good practice: save your training arguments together with the trained model
torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
# Load a trained model and vocabulary that you have fine-tuned
model = AutoModelWithLMHead.from_pretrained(args.output_dir)
tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
model.to(args.device)
# Evaluation
results = {}
if args.do_eval and args.local_rank in [-1, 0]:
checkpoints = [args.output_dir]
if args.eval_all_checkpoints:
checkpoints = list(
os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
)
logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN) # Reduce logging
logger.info("Evaluate the following checkpoints: %s", checkpoints)
for checkpoint in checkpoints:
global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
model = AutoModelWithLMHead.from_pretrained(checkpoint)
model.to(args.device)
result = evaluate(args, model, tokenizer, df_trn, df_val, prefix=prefix)
result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
results.update(result)
return results
%load_ext tensorboard
%tensorboard --logdir runs
Finally, we run our model! I found this can take anywhere from one to three hours, depending on the GPU Google Colab gives you, to finish training a model that can sort of hold a coherent conversation in Spanish. If you are using a different language, you'll have to play around with how long to cook your model for.
main(trn_df, val_df)
Chatting with our Model
Now that we have our model trained, let's take it out for a spin and have our first conversation with it!
In order to allow us to chitchat with our new bot we need to figure out when the model has finished its turn, i.e. when it has generated the [end_of_turn] token. When the model generates this token, we can switch back control of the conversation to the user so they can respond. Luckily, this is very easy to do with the Huggingface framework!
The below code is copied pretty much verbatim from the creators of the DialoGPT model, which you can find here.
tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
model = AutoModelWithLMHead.from_pretrained('output')
# Let's chat for 6 lines
for step in range(6):
# encode the new user input, add the eos_token and return a tensor in Pytorch
new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')
# print(new_user_input_ids)
# append the new user input tokens to the chat history
bot_input_ids = torch.cat([chat_history_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids
# generate a response while limiting the total chat history to 1000 tokens
chat_history_ids = model.generate(
bot_input_ids, max_length=1000,
pad_token_id=tokenizer.eos_token_id,
top_p=0.92, top_k=50, do_sample=True  # sampling must be enabled for top_p/top_k to take effect
)
# pretty print last output tokens from bot
print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))
Now, it ain't the best; however, training it for longer or using DialoGPT-medium instead of DialoGPT-small does improve results, at least in my experiments. I decided to only include DialoGPT-small in this tutorial due to the limited (but still AMAZING) resources of Google Colab. I've gone ahead and trained a bigger DialoGPT-medium model for longer and have uploaded it to Huggingface for anyone to try out! Find the model card here.
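If you want to try that model (or any other fine-tuned checkpoint on the Hub), loading it only takes a couple of lines. The model id below is a placeholder; substitute the one from the model card linked above.
from transformers import AutoModelWithLMHead, AutoTokenizer

model_id = "<username>/<name_of_model>"  # placeholder, replace with the id from the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelWithLMHead.from_pretrained(model_id)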
Conclusion
In this tutorial, you learned how to train an open-dialog chatbot in any language you want to practice! This involved learning about the amazing transformers library by Huggingface, which has seen a lot of popularity recently. You've also learned what an open-dialog chatbot is and some of the difficulties that come with training one, such as constructing training examples and dealing with repetitive generated text.
This is just part one in what I am hoping will be a three part series! In the next part, we will take our model and integrate it into a web app using the awesome Streamlit library. Finally, part three will then be generating an Android application for chatting with your new language companion!
PS
If you do train a new chatbot for a language of your interest, please share it! I'd love to hear about your progress with it, and I'm sure others would also be interested, as these models can be quite expensive to train.
If you want an easy way to share it, I suggest submitting your trained model to Huggingface's model zoo, where others can view and download your model to use as a starting point for their own applications! Here is a simple way to take the model trained in this tutorial and upload it to Huggingface, following the instructions on their website:
First make sure you have a Huggingface account: https://huggingface.co/join. Next, run the following code snippets and that's it!
! rm -rf output/checkpoint-*
! mv output <name_of_model>
! transformers-cli login
# log in using the same credentials as on huggingface.co
! transformers-cli upload <name_of_model>