<h1 id="deepspeed-investigation">DeepSpeed Investigation: What I Learned</h1>
<p><em>2021-05-03</em></p>
<p>Deep learning is awesome, but the large compute and data requirements can prevent a lot of amazing people from using the models and contributing to the field. So, when I read about the amazing <a href="https://www.deepspeed.ai/">DeepSpeed</a> library allowing people with just a single GPU (like myself) to train massive models that would normally require multiple GPUs to just fit in memory, I had to investigate further!</p>
<h2 id="what-is-deepspeed">What is DeepSpeed?</h2>
<p>Here is a brief blurb from the DeepSpeed website on what it is and what it can do:</p>
<blockquote>
<p>DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.</p>
<p><strong><em>10x Larger Models</em></strong></p>
<p><strong><em>10x Faster Training</em></strong></p>
<p><strong><em>Minimal Code Change</em></strong></p>
<p>DeepSpeed delivers extreme-scale model training for everyone, from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU.</p>
</blockquote>
<p>Some impressive statements, but are they true? Kind of. Let’s dig a bit deeper into how this works.</p>
<p><img src="https://www.microsoft.com/en-us/research/uploads/prod/2020/05/1400x788DeepSpeedslowed.gif" alt="Overview of the large improvement ZeRO-2 and the DeepSpeed library have over ZeRO-1 and previous approaches." />
<em>From <a href="https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/">https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/</a></em></p>
<p>DeepSpeed is a library that enables the awesome <a href="https://arxiv.org/abs/1910.02054">Zero Redundancy Optimizer (ZeRO)</a>, which is a highly optimized optimizer (oh how clever) that improves memory management and communication in data- or model-parallel workloads by removing redundancy. Now, this might bring up the question: “parallelized workloads? I thought we could use this on a single GPU, what’s the deal?” The deal is that ZeRO was made to solve the problem of communication between multiple devices by doing some nifty memory tricks that are beyond the scope of this blog post (and my understanding; see <a href="https://youtu.be/tC01FRB0M7w">here</a> for a full explanation). It just so happens that the ZeRO optimizer also supports CPU offloading, which moves some of the computation off your GPU and onto your CPU. With things being computed on the CPU, some of the model state is stored in RAM rather than the GPU’s VRAM. This significantly slows computation, since CPUs and RAM weren’t built with this in mind, but it means you can train bigger models and use bigger batch sizes 🤓.</p>
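<p>To make the CPU-offloading idea concrete, here is a minimal sketch of what a DeepSpeed configuration enabling ZeRO with optimizer offloading might look like. It is written as a Python dict for readability (DeepSpeed reads the config from a JSON file); the exact field names and schema are an assumption based on the DeepSpeed docs at the time, so check the current documentation before using them.</p>

```python
import json

# Hedged sketch of a DeepSpeed config: ZeRO stage 2 with the optimizer
# state offloaded to CPU/RAM. Field names are assumptions -- verify them
# against the DeepSpeed configuration reference.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                              # partition optimizer state + gradients
        "offload_optimizer": {"device": "cpu"},  # do optimizer work on the CPU
    },
}

# DeepSpeed expects the config as a JSON file, e.g. ds_config.json
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

<p>The resulting <code>ds_config.json</code> is what you would point a training script (or the HuggingFace Trainer) at.</p>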
<h2 id="putting-deepspeed-to-the-test">Putting DeepSpeed to the Test!</h2>
<p>To test out DeepSpeed, I used the awesome HuggingFace transformers library, which supports DeepSpeed on its non-stable branch (though support is coming to the stable branch in 4.6 🤓). I followed these awesome <a href="https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed">instructions</a> on HuggingFace’s website for getting started with DeepSpeed and HuggingFace. If you want to follow along at home, I created a GitHub <a href="https://github.com/ncoop57/deepspeed_testing">repository</a> with the Dockerfile (I’m addicted to Docker and will probably make a blog post on it too :)) and the test script I used to run my experiments. I tried training different versions of the awesome <a href="https://arxiv.org/abs/1910.10683">T5 model</a>, ranging from smallish (~60 million parameters) to humongous (3 billion parameters). And here are my results:</p>
<p><img src="/i-am-a-nerd/images/deepspeed_chart.png" alt="Bar chart showing DeepSpeed increases time to train, but allows training larger models compared to not using DeepSpeed." />
<em>This was run on a machine with Ubuntu 20.04, 32GBs of RAM, Ryzen 5600x, and NVIDIA RTX 3080 GPU.</em></p>
<p>This chart shows each model’s training time in seconds with and without DeepSpeed, using the biggest batch size I could fit in each case. As you can see, using DeepSpeed increases training time, except for t5-small where the times are nearly identical. However, you’ll notice that for t5-base (~220 million parameters) and t5-large (~770 million parameters) I am able to use a larger batch size; this is most noticeable for t5-large, where I can double the batch size. This is the important thing DeepSpeed gives you: it allows larger batch sizes, even at the cost of longer training time. Having large batch sizes is helpful for many deep learning models, as the model sees more examples per update, which can improve performance. That is the use case for DeepSpeed: big models with large batch sizes. If what you are doing doesn’t involve these two things, then you should probably skip DeepSpeed.</p>
<p><strong>Note:</strong> You’ll notice there is no bar for a 3 billion parameter model (t5-3b). This is because my PC cried out when I attempted to train the model even with DeepSpeed and a batch size of 1.</p>
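<p>The batch-size point above can be made concrete with a little arithmetic. The effective batch size a model is updated with is the per-GPU batch size times the gradient accumulation steps times the number of GPUs (the function name and numbers below are illustrative, not from my test scripts):</p>

```python
# Illustrative helper: effective batch size per optimizer update.
def effective_batch_size(per_gpu_batch: int, grad_accum_steps: int = 1, num_gpus: int = 1) -> int:
    return per_gpu_batch * grad_accum_steps * num_gpus

# Doubling the per-GPU batch (as DeepSpeed allowed for t5-large) doubles
# the effective batch size at the same accumulation setting.
print(effective_batch_size(4, grad_accum_steps=4))  # without DeepSpeed
print(effective_batch_size(8, grad_accum_steps=4))  # with DeepSpeed's memory savings
```

<p>Gradient accumulation can simulate a larger batch without DeepSpeed, but each accumulated step still costs a forward/backward pass, so fitting a genuinely larger batch in memory is still a win.</p>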
<h2 id="conclusion-time">Conclusion Time</h2>
<p>So, with all things considered, DeepSpeed is an awesome library and ZeRO is an amazing optimizer. However, if you were looking for super speed boosts on a single GPU like I was, it ain’t it, chief. ZeRO is designed to speed up multi-GPU setups by efficiently handling memory resources and communication, and in doing so it reduces the memory footprint on each GPU. It also does some awesome CPU offloading, which allows you to train huge models with large batch sizes on a single GPU that you normally would not be able to. The larger batch sizes are super important for many deep learning models, as they can improve performance. So, my takeaway from this investigation is this: if you are using a multi-GPU setup, DeepSpeed is the way to go. However, for single-GPU use, only use it if you need a larger model and larger batch sizes than what your GPU can normally handle.</p>
<p>Hope you’ve enjoyed this blog post and learned some information along the way. Comment down below with any questions you have, I’d be happy to help answer them!</p>
<p>Connect with me:</p>
<p>Website - <a href="https://nathancooper.io/#/">https://nathancooper.io/#/</a></p>
<p>YouTube - <a href="https://www.youtube.com/channel/UCKfOCnojK5YV7_hdPjAtY7Q">https://www.youtube.com/channel/UCKfOCnojK5YV7_hdPjAtY7Q</a></p>
<p>Github - <a href="https://github.com/ncoop57">https://github.com/ncoop57</a></p>
<p>Twitter - <a href="https://twitter.com/ncooper57">https://twitter.com/ncooper57</a></p>
<p>LinkedIn - <a href="https://www.linkedin.com/in/nathan-cooper-820292106/">https://www.linkedin.com/in/nathan-cooper-820292106/</a></p>
<h1 id="improved-code-summarization">Improved Code Summarization</h1>
<p><em>By Nathan Cooper, 2020-12-26</em></p>
<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2020-12-26-Improved_Code_Commenter.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="About">About<a class="anchor-link" href="#About"> </a></h1><p>Hi there! In this post you'll learn how to finetune a RoBERTa-based model that's been trained on code data to automatically generate comments for code!</p>
<p>We will be focusing on the Java programming language, but you can apply the same techniques in this post for any programming language that interests you. Additionally, you'll see how to incorporate this code commenter into a <a href="https://code.visualstudio.com/">VSCode</a> extension so that you can generate comments for code snippets you highlight:</p>
<p>(Insert GIF of tool working)</p>
<p>As always, we'll start with a bit of background of the data and model we are using, but feel free to skip if you want to get straight to the awesomeness ;). Alright, let's GO!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Background">Background<a class="anchor-link" href="#Background"> </a></h1><h2 id="Data">Data<a class="anchor-link" href="#Data"> </a></h2><p>We will be using the awesome <a href="https://github.com/github/codesearchnet">CodeSearchNet</a> Challenge dataset, which contains millions of pairs of methods and their docstrings for a large variety of programming languages. The dataset was initially constructed for evaluating how well different approaches perform at searching for code. However, we can easily repurpose it for our task, and lucky for us, the authors did an awesome job collecting, documenting, and cleaning the data.</p>
<p>We'll be performing a bit more cleaning and formatting of the data as well as adding some more examples. These examples won't be method/docstring pairs, but code snippet/inline comment pairs. This allows our model to generate comments for arbitrary code snippets that a developer may want to document instead of just generating the docstring of a method.</p>
<h2 id="CodeBERT">CodeBERT<a class="anchor-link" href="#CodeBERT"> </a></h2><p>The pretrained model we will be finetuning comes from the awesome paper from Microsoft's research division aptly named <a href="https://arxiv.org/abs/2002.08155">CodeBERT: A Pre-Trained Model for Programming and Natural Languages</a>. This model also used the CodeSearchNet Challenge dataset, but instead of using it to generate comments, it was used to teach a RoBERTa-based model to represent code and natural language in a useful way. Teaching these large language models to represent text in a useful way is common practice now, since these representations have been shown to be helpful when finetuning the models on other tasks. The CodeBERT paper showed these representations are helpful by finetuning them on the programming tasks of code search and comment generation, exactly what we will be doing! The difference between their comment generation task and ours is that we will do a bit more preprocessing, and our model will be able to generate inline comments for code snippets, not just method-level comments.</p>
<p>So, how does CodeBERT learn these representations? It combines two training objectives that have been shown to be useful for natural language: the Masked Language Modeling (MLM) objective, which is from the original <a href="https://arxiv.org/abs/1810.04805">BERT</a> paper, and the Replaced Token Detection (RTD) objective, which is from the <a href="https://arxiv.org/abs/2003.10555">ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators</a> paper. In the MLM objective, we randomly mask out parts of the text we feed into the model and ask the model to predict those masked-out pieces. In the RTD objective, random tokens in the text are replaced and the model has to determine which tokens were replaced. To make this harder for the model, the replaced tokens are meant to be plausible alternatives, not just random words. The CodeBERT model actually used an n-gram-based model to generate these alternatives, whereas the ELECTRA paper used a small BERT-based model.</p>
<p><img src="https://nathancooper.io/i-am-a-nerd/images/electra.png" alt="ELECTRA Pretraining Objective" /> (From ELECTRA Paper)</p>
<p>Instead of applying these training objectives only to natural language, CodeBERT used code and docstrings. This allowed the CodeBERT model to learn a useful representation of code that could be used for other tasks.</p>
<p>Alright, with that quick background knowledge down, let's get into actually finetuning our model!</p>
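<p>As a concrete illustration of the MLM objective described above, here is a toy sketch of masking (illustrative only: real tokenizers work on subwords with a learned vocabulary, not whitespace splits, and BERT's recipe has extra details like sometimes keeping or swapping the token instead of masking):</p>

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    """Randomly replace ~mask_prob of tokens with a mask symbol,
    keeping the originals as the prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the model must predict this token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

# A small Java snippet, naively whitespace-tokenized for illustration
masked, targets = mask_tokens("public int add ( int a , int b )".split())
print(masked)
print(targets)
```

<p>The model is then trained to recover each entry of <code>targets</code> from the masked sequence; CodeBERT applies this same idea to code/docstring pairs rather than plain text.</p>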
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span> nvidia-smi
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Thu Jan 14 20:43:12 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 42C P8 11W / 70W | 0MiB / 15079MiB | 0% Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Data">Data<a class="anchor-link" href="#Data"> </a></h1>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>First we'll install the necessary packages and download our data!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Download and install the necessary dependencies</span>
<span class="o">!</span> pip install -q <span class="nv">torch</span><span class="o">==</span><span class="m">1</span>.4.0 -f https://download.pytorch.org/whl/cu101/torch_stable.html
<span class="o">!</span> pip install -q <span class="nv">transformers</span><span class="o">==</span><span class="m">3</span>.5.0 fast-trees
<span class="o">!</span> git clone -q https://github.com/microsoft/CodeXGLUE.git
<span class="c1"># Download the CodeSearchNet Challenge dataset for the Java programming language</span>
<span class="o">!</span> wget -q https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip
<span class="o">!</span> unzip -qq java.zip
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre> |████████████████████████████████| 753.4MB 21kB/s
<span class="ansi-red-fg">ERROR: torchvision 0.8.1+cu101 has requirement torch==1.7.0, but you'll have torch 1.4.0 which is incompatible.</span>
|████████████████████████████████| 1.3MB 12.6MB/s
|████████████████████████████████| 890kB 51.7MB/s
|████████████████████████████████| 2.9MB 50.3MB/s
|████████████████████████████████| 1.1MB 61.3MB/s
|████████████████████████████████| 112kB 64.8MB/s
|████████████████████████████████| 163kB 60.4MB/s
|████████████████████████████████| 71kB 11.8MB/s
Building wheel for sacremoses (setup.py) ... done
Building wheel for tree-sitter (setup.py) ... done
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Next let's read in our data and since these models take a long time to train, we will only select a subset of the data.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Optional</span>
<span class="c1"># Code from CodeSearchNetChallenge: https://github.com/github/CodeSearchNet/blob/master/notebooks/ExploreData.ipynb</span>
<span class="k">def</span> <span class="nf">jsonl_list_to_dataframe</span><span class="p">(</span><span class="n">file_list</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">'code'</span><span class="p">,</span> <span class="s1">'docstring'</span><span class="p">]):</span>
<span class="sd">"""Load a list of jsonl.gz files into a pandas DataFrame."""</span>
<span class="k">return</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">pd</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="n">f</span><span class="p">,</span>
<span class="n">orient</span><span class="o">=</span><span class="s1">'records'</span><span class="p">,</span>
<span class="n">compression</span><span class="o">=</span><span class="s1">'gzip'</span><span class="p">,</span>
<span class="n">lines</span><span class="o">=</span><span class="kc">True</span><span class="p">)[</span><span class="n">columns</span><span class="p">]</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">file_list</span><span class="p">],</span> <span class="n">sort</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_dfs</span><span class="p">(</span><span class="n">path</span><span class="p">:</span> <span class="n">Path</span><span class="p">)</span> <span class="o">-></span> <span class="n">List</span><span class="p">[</span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">]:</span>
<span class="sd">"""Grabs the different data splits and converts them into dataframes"""</span>
<span class="n">dfs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">split</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">"train"</span><span class="p">,</span> <span class="s2">"valid"</span><span class="p">,</span> <span class="s2">"test"</span><span class="p">]:</span>
<span class="n">files</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="n">split</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"**/*.gz"</span><span class="p">))</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">jsonl_list_to_dataframe</span><span class="p">(</span><span class="n">files</span><span class="p">)</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'code'</span><span class="p">:</span> <span class="s1">'mthd'</span><span class="p">,</span> <span class="s1">'docstring'</span><span class="p">:</span> <span class="s1">'cmt'</span><span class="p">})</span>
<span class="n">dfs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="k">return</span> <span class="n">dfs</span>
<span class="n">path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="s1">'.'</span><span class="p">)</span>
<span class="n">df_trn</span><span class="p">,</span> <span class="n">df_val</span><span class="p">,</span> <span class="n">df_tst</span> <span class="o">=</span> <span class="n">get_dfs</span><span class="p">(</span><span class="n">path</span><span class="o">/</span><span class="s2">"java/final/jsonl"</span><span class="p">)</span>
<span class="n">sample</span> <span class="o">=</span> <span class="mf">0.01</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span> <span class="o">=</span> <span class="n">sample</span><span class="p">)</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span> <span class="o">=</span> <span class="n">sample</span><span class="p">)</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span> <span class="o">=</span> <span class="n">sample</span><span class="p">)</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df_trn</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_val</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_tst</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>(4545, 153, 269)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's see how the data looks. As shown, we have the data in a good format with one column all of the methods (input into the model) and the other all of the comments (output of the model).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df_trn</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mthd</th>
<th>cmt</th>
</tr>
</thead>
<tbody>
<tr>
<th>5360</th>
<td>@Override\n public GetLexiconResult getLexi...</td>
<td><p>\nReturns the content of the specified pron...</td>
</tr>
<tr>
<th>9365</th>
<td>public static void checkJavaInternalAccess(ILo...</td>
<td>Prints warning to given {@link ILogger} if Haz...</td>
</tr>
<tr>
<th>10145</th>
<td>private IAtom createAtom(Element element) {\n ...</td>
<td>Create a new atom for the provided symbol. The...</td>
</tr>
<tr>
<th>9008</th>
<td>public void marshall(Scte20PlusEmbeddedDestina...</td>
<td>Marshall the given parameter object.</td>
</tr>
<tr>
<th>24498</th>
<td>@Override\n public void prefetchToken(final F...</td>
<td>/*\nGets hadoop tokens for a user to run mapre...</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Data-Cleaning">Data Cleaning<a class="anchor-link" href="#Data-Cleaning"> </a></h2>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now that we have the data, let's clean it! First, we'll remove any non-ASCII characters to simplify the problem so that the model only has to think about generating English comments.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># From https://stackoverflow.com/a/27084708/5768407</span>
<span class="k">def</span> <span class="nf">is_ascii</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Determines if the given string contains only ascii characters</span>
<span class="sd"> :param s: the string to check</span>
<span class="sd"> :returns: whether or not the given string contains only ascii characters</span>
<span class="sd"> '''</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">s</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span><span class="p">)</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s1">'ascii'</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">UnicodeDecodeError</span><span class="p">:</span>
<span class="k">return</span> <span class="kc">False</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="kc">True</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="p">[</span><span class="n">df_trn</span><span class="p">[</span><span class="s1">'mthd'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">is_ascii</span><span class="p">(</span><span class="n">x</span><span class="p">))]</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="p">[</span><span class="n">df_val</span><span class="p">[</span><span class="s1">'mthd'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">is_ascii</span><span class="p">(</span><span class="n">x</span><span class="p">))]</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="p">[</span><span class="n">df_tst</span><span class="p">[</span><span class="s1">'mthd'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">is_ascii</span><span class="p">(</span><span class="n">x</span><span class="p">))]</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="p">[</span><span class="n">df_trn</span><span class="p">[</span><span class="s1">'cmt'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">is_ascii</span><span class="p">(</span><span class="n">x</span><span class="p">))]</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="p">[</span><span class="n">df_val</span><span class="p">[</span><span class="s1">'cmt'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">is_ascii</span><span class="p">(</span><span class="n">x</span><span class="p">))]</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="p">[</span><span class="n">df_tst</span><span class="p">[</span><span class="s1">'cmt'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">is_ascii</span><span class="p">(</span><span class="n">x</span><span class="p">))]</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df_trn</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_val</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_tst</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>(4402, 141, 264)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Next, we'll remove any outdated comments by checking whether the <a href="https://www.oracle.com/java/technologies/javase/javadoc.html">JavaDoc</a>'s parameter list differs from the method's parameter list. This will also remove pairs where the docstring doesn't actually document the parameters, which probably means the pairs are poor quality (you should always properly document your code :) ).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">fast_trees.core</span> <span class="kn">import</span> <span class="n">FastParser</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">FastParser</span><span class="p">(</span><span class="s1">'java'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_cmt_params</span><span class="p">(</span><span class="n">cmt</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
<span class="sd">'''</span>
<span class="sd"> Grabs the parameter identifier names from a JavaDoc comment</span>
<span class="sd"> :param cmt: the comment to extract the parameter identifier names from</span>
<span class="sd"> :returns: an array of the parameter identifier names found in the given comment</span>
<span class="sd"> '''</span>
<span class="n">params</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">findall</span><span class="p">(</span><span class="sa">r</span><span class="s1">'@param\s+\w+'</span><span class="p">,</span> <span class="n">cmt</span><span class="p">)</span>
<span class="n">param_names</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">param</span> <span class="ow">in</span> <span class="n">params</span><span class="p">:</span>
<span class="n">param_names</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">param</span><span class="o">.</span><span class="n">split</span><span class="p">()[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">return</span> <span class="n">param_names</span>
<span class="k">def</span> <span class="nf">is_outdated</span><span class="p">(</span><span class="n">mthd</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">cmt</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">parser</span><span class="p">:</span> <span class="n">FastParser</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bool</span><span class="p">:</span>
<span class="sd">'''</span>
<span class="sd"> Determines if a given method and comment are outdated by checking</span>
<span class="sd"> if the method's parameter identifier names match the comment's</span>
<span class="sd"> :param mthd: the method to compare against its corresponding comment</span>
<span class="sd"> :param cmt: the comment to compare against its corresponding method</span>
<span class="sd"> :param parser: parser for easily getting the parameter identifier names from a given method</span>
<span class="sd"> :returns: whether or not a given comment is outdated compared to its corresponding method</span>
<span class="sd"> '''</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">mthd_params</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">get_params</span><span class="p">(</span><span class="n">mthd</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">Exception</span><span class="p">:</span>
<span class="k">return</span> <span class="kc">False</span>
<span class="n">cmt_params</span> <span class="o">=</span> <span class="n">get_cmt_params</span><span class="p">(</span><span class="n">cmt</span><span class="p">)</span>
<span class="k">return</span> <span class="n">mthd_params</span> <span class="o">!=</span> <span class="n">cmt_params</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="p">[</span>
<span class="o">~</span><span class="n">df_trn</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span>
<span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">is_outdated</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">mthd</span><span class="p">,</span> <span class="n">x</span><span class="o">.</span><span class="n">cmt</span><span class="p">,</span> <span class="n">parser</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span>
<span class="p">)</span>
<span class="p">]</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="p">[</span>
<span class="o">~</span><span class="n">df_val</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span>
<span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">is_outdated</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">mthd</span><span class="p">,</span> <span class="n">x</span><span class="o">.</span><span class="n">cmt</span><span class="p">,</span> <span class="n">parser</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span>
<span class="p">)</span>
<span class="p">]</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="p">[</span>
<span class="o">~</span><span class="n">df_tst</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span>
<span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">is_outdated</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">mthd</span><span class="p">,</span> <span class="n">x</span><span class="o">.</span><span class="n">cmt</span><span class="p">,</span> <span class="n">parser</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span>
<span class="p">)</span>
<span class="p">]</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df_trn</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_val</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_tst</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Downloading repo https://github.com/tree-sitter/tree-sitter-java to /usr/local/lib/python3.6/dist-packages/fast_trees/tree-sitter-java.
</pre>
</div>
</div>
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>(4402, 141, 264)</pre>
</div>
</div>
</div>
</div>
</div>
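<p>As a quick aside, the comment side of this check can be sketched without any parser at all. Here is a hedged, self-contained version of the @param extraction (note: this variant uses a capturing group, so findall returns the identifier names directly; it is a simplified illustration, not the post's exact implementation):</p>

```python
import re

def get_cmt_params(cmt: str) -> list:
    # The capturing group makes findall return just the identifier names
    return re.findall(r'@param\s+(\w+)', cmt)

cmt = '''
 * Adds two numbers.
 * @param a the first operand
 * @param b the second operand
 * @return the sum
'''
print(get_cmt_params(cmt))  # ['a', 'b']
```

<p>A method whose parameter list is (a, b) would be kept as up to date; any other parameter list would mark the pair as outdated and drop it.</p>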
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now we'll add in the additional pairs of code snippets/inline comments.</p>
<p>P.S. One thing to note with adding these pairs is that each inline comment will appear twice in the dataset: first inside the method it came from, and second as the target for its code snippet. This is only a problem for the training set, since it lets the model cheat by simply remembering the inline comment from the example method it came from. However, in my testing this didn't seem to be an issue, and the model still works well despite it. Just thought ya should know :).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">tqdm.auto</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="k">def</span> <span class="nf">get_inline_pairs</span><span class="p">(</span><span class="n">mthd</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Get all pairs of inline comments and corresponding code snippets</span>
<span class="sd"> :param mthd: the method to retrieve the pairs of comments and corresponding</span>
<span class="sd"> code snippets from</span>
<span class="sd"> :returns: all pairs of comments and corresponding code snippets</span>
<span class="sd"> '''</span>
<span class="n">pairs</span> <span class="o">=</span> <span class="p">[[]]</span>
<span class="n">comment</span> <span class="o">=</span> <span class="kc">False</span>
<span class="n">bracket</span> <span class="o">=</span> <span class="kc">False</span>
<span class="n">indent_lvl</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="n">lines</span> <span class="o">=</span> <span class="n">mthd</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">lines</span><span class="p">:</span>
<span class="k">if</span> <span class="s2">"//"</span> <span class="ow">in</span> <span class="n">line</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">bracket</span> <span class="ow">and</span> <span class="ow">not</span> <span class="s2">"://"</span> <span class="ow">in</span> <span class="n">line</span><span class="p">:</span>
<span class="n">pairs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="k">if</span> <span class="s1">'</span><span class="se">\t</span><span class="s1">'</span> <span class="ow">in</span> <span class="n">line</span><span class="p">:</span>
<span class="n">indent_lvl</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">indent_lvl</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"//"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">' '</span><span class="p">)</span>
<span class="n">comment</span> <span class="o">=</span> <span class="kc">True</span>
<span class="n">bracket</span> <span class="o">=</span> <span class="kc">False</span>
<span class="k">elif</span> <span class="n">comment</span><span class="p">:</span>
<span class="k">if</span> <span class="s1">'{'</span> <span class="ow">in</span> <span class="n">line</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">bracket</span><span class="p">:</span>
<span class="n">bracket</span> <span class="o">=</span> <span class="kc">True</span>
<span class="n">pairs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="k">elif</span> <span class="s1">'}'</span> <span class="ow">in</span> <span class="n">line</span><span class="p">:</span>
<span class="n">line_indent</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="k">if</span> <span class="s1">'</span><span class="se">\t</span><span class="s1">'</span> <span class="ow">in</span> <span class="n">line</span><span class="p">:</span>
<span class="n">line_indent</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">'</span><span class="se">\t</span><span class="s1">'</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">line_indent</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"//"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s1">' '</span><span class="p">)</span>
<span class="k">if</span> <span class="n">indent_lvl</span> <span class="o">==</span> <span class="n">line_indent</span><span class="p">:</span>
<span class="n">pairs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">bracket</span><span class="p">:</span>
<span class="n">pairs</span><span class="o">.</span><span class="n">append</span><span class="p">([])</span>
<span class="n">comment</span> <span class="o">=</span> <span class="kc">False</span>
<span class="n">bracket</span> <span class="o">=</span> <span class="kc">False</span>
<span class="k">elif</span> <span class="n">line</span><span class="o">.</span><span class="n">isspace</span><span class="p">()</span> <span class="ow">or</span> <span class="n">line</span> <span class="o">==</span> <span class="s1">''</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">bracket</span><span class="p">:</span>
<span class="n">pairs</span><span class="o">.</span><span class="n">append</span><span class="p">([])</span>
<span class="n">comment</span> <span class="o">=</span> <span class="kc">False</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">pairs</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="c1"># Convert pairs into proper format of (code snippet, inline comment) dataframe</span>
<span class="n">code_snippets</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">comments</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">pair</span> <span class="ow">in</span> <span class="n">pairs</span><span class="p">:</span>
<span class="k">if</span> <span class="n">pair</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">pair</span><span class="p">)</span> <span class="o"><</span> <span class="mi">5</span><span class="p">:</span>
<span class="n">code</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">comment</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">skip</span> <span class="o">=</span> <span class="kc">False</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">pair</span><span class="p">:</span>
<span class="k">if</span> <span class="s2">"TODO"</span> <span class="ow">in</span> <span class="n">line</span><span class="p">:</span> <span class="k">break</span>
<span class="k">if</span> <span class="s2">"//"</span> <span class="ow">in</span> <span class="n">line</span><span class="p">:</span>
<span class="n">comment</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">'//'</span><span class="p">,</span> <span class="s1">''</span><span class="p">))</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">code</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">code</span><span class="p">)</span> <span class="o">></span> <span class="mi">1</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">comment</span><span class="p">)</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="n">code_snippets</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">code</span><span class="p">))</span>
<span class="n">comments</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">comment</span><span class="p">))</span>
<span class="n">pairs</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">code_snippets</span><span class="p">,</span> <span class="n">comments</span><span class="p">),</span> <span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"mthd"</span><span class="p">,</span> <span class="s2">"cmt"</span><span class="p">])</span>
<span class="k">return</span> <span class="n">pairs</span>
<span class="k">def</span> <span class="nf">add_inline</span><span class="p">(</span><span class="n">df</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">)</span> <span class="o">-></span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">:</span>
<span class="sd">'''</span>
<span class="sd"> Helper function to go through all methods in a given dataframe and add all</span>
<span class="sd"> pairs of inline comments and corresponding code snippets</span>
<span class="sd"> :param df: the dataframe to retrieve and add all pairs of inline comments</span>
<span class="sd"> and corresponding code snippets to</span>
<span class="sd"> :returns: a new dataframe with the newly added pairs of inline comments and</span>
<span class="sd"> corresponding code snippets</span>
<span class="sd"> '''</span>
<span class="n">new_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">'mthd'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">"//"</span><span class="p">)]</span>
<span class="n">all_pairs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">mthd</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">new_df</span><span class="o">.</span><span class="n">mthd</span><span class="o">.</span><span class="n">values</span><span class="p">):</span>
<span class="n">pairs</span> <span class="o">=</span> <span class="n">get_inline_pairs</span><span class="p">(</span><span class="n">mthd</span><span class="p">)</span>
<span class="n">all_pairs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">pairs</span><span class="p">)</span>
<span class="n">df_pairs</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">pairs</span> <span class="k">for</span> <span class="n">pairs</span> <span class="ow">in</span> <span class="n">all_pairs</span><span class="p">])</span>
<span class="k">return</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df</span><span class="p">,</span> <span class="n">df_pairs</span><span class="p">])</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">add_inline</span><span class="p">(</span><span class="n">df_trn</span><span class="p">)</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">add_inline</span><span class="p">(</span><span class="n">df_val</span><span class="p">)</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">add_inline</span><span class="p">(</span><span class="n">df_tst</span><span class="p">)</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df_trn</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_val</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_tst</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>
</pre>
</div>
</div>
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>(4584, 150, 271)</pre>
</div>
</div>
</div>
</div>
</div>
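<p>To make the pairing heuristic above concrete, here is a toy, self-contained sketch of the same idea (simplified on purpose: it pairs each // comment with the lines that follow it until a blank line, without the bracket and indentation tracking the real function does):</p>

```python
def simple_inline_pairs(mthd: str):
    # Pair each // comment with the code lines that follow it
    pairs, current = [], None
    for line in mthd.split('\n'):
        stripped = line.strip()
        if stripped.startswith('//'):
            current = (stripped.lstrip('/ '), [])
            pairs.append(current)
        elif stripped == '':
            current = None  # a blank line ends the current snippet
        elif current is not None:
            current[1].append(stripped)
    # Keep only comments that actually captured some code
    return [(cmt, '\n'.join(code)) for cmt, code in pairs if code]

java = '''int total = 0;
// sum the array
for (int x : xs) {
    total += x;
}

return total;'''
print(simple_inline_pairs(java))
```

<p>On this input the sketch yields a single (comment, snippet) pair for the loop; the real implementation additionally filters by pair length and skips TODO comments.</p>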
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We'll also remove pairs where the code is shorter than the comment. I found that in these cases the comments contain a bunch of extra information that the model won't have access to, such as how the method is used by other methods in the software system.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="p">[</span><span class="n">df_trn</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">mthd</span><span class="p">)</span> <span class="o">></span> <span class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">cmt</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)]</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="p">[</span><span class="n">df_val</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">mthd</span><span class="p">)</span> <span class="o">></span> <span class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">cmt</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)]</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="p">[</span><span class="n">df_tst</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">mthd</span><span class="p">)</span> <span class="o">></span> <span class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">cmt</span><span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)]</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df_trn</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_val</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_tst</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>(3713, 111, 228)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Next, we'll remove any examples whose comment contains the special <code> tag, since these also tend to contain extra information that the model has little hope of generating.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">has_code</span><span class="p">(</span><span class="n">cmt</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">bool</span><span class="p">:</span>
<span class="sd">'''</span>
<span class="sd"> Determine if the given comment contains the HTML <code> tag</span>
<span class="sd"> :param cmt: the comment to check whether it contains the HTML <code> tag</span>
<span class="sd"> :returns: whether or not the given comment contains the HTML <code> tag</span>
<span class="sd"> '''</span>
<span class="k">if</span> <span class="s1">'<code>'</span> <span class="ow">in</span> <span class="n">cmt</span><span class="p">:</span> <span class="k">return</span> <span class="kc">True</span>
<span class="k">else</span><span class="p">:</span> <span class="k">return</span> <span class="kc">False</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="p">[</span><span class="o">~</span><span class="n">df_trn</span><span class="p">[</span><span class="s1">'cmt'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">has_code</span><span class="p">(</span><span class="n">x</span><span class="p">))]</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="p">[</span><span class="o">~</span><span class="n">df_val</span><span class="p">[</span><span class="s1">'cmt'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">has_code</span><span class="p">(</span><span class="n">x</span><span class="p">))]</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="p">[</span><span class="o">~</span><span class="n">df_tst</span><span class="p">[</span><span class="s1">'cmt'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">has_code</span><span class="p">(</span><span class="n">x</span><span class="p">))]</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df_trn</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_val</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_tst</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>(3580, 104, 221)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Lastly, we're gonna remove the JavaDoc tags from the comments, leaving only the description, since that's really all we care about. The other pieces of information can usually be autogenerated or may require external knowledge to document.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">remove_jdocs</span><span class="p">(</span><span class="n">df</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">)</span> <span class="o">-></span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">:</span>
<span class="sd">'''</span>
<span class="sd"> Remove the JavaDocs leaving only the description of the comment</span>
<span class="sd"> :param df: the pandas dataframe to remove the JavaDocs from</span>
<span class="sd"> :returns: a new pandas dataframe with the JavaDocs removed</span>
<span class="sd"> '''</span>
<span class="n">methods</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">comments</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">iterrows</span><span class="p">())):</span>
<span class="n">comment</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s2">"cmt"</span><span class="p">]</span>
<span class="c1"># Remove {} text in comments from https://stackoverflow.com/questions/14596884/remove-text-between-and-in-python/14598135</span>
<span class="n">comment</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s2">"([\{\[]).*?([\]\}])"</span><span class="p">,</span> <span class="s1">''</span><span class="p">,</span> <span class="n">comment</span><span class="p">)</span>
<span class="n">cleaned</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">comment</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">):</span>
<span class="k">if</span> <span class="s2">"@"</span> <span class="ow">in</span> <span class="n">line</span><span class="p">:</span> <span class="k">break</span>
<span class="n">cleaned</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="n">comments</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">cleaned</span><span class="p">))</span>
<span class="n">methods</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s2">"mthd"</span><span class="p">])</span>
<span class="n">new_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">methods</span><span class="p">,</span> <span class="n">comments</span><span class="p">),</span> <span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"mthd"</span><span class="p">,</span> <span class="s2">"cmt"</span><span class="p">])</span>
<span class="k">return</span> <span class="n">new_df</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">remove_jdocs</span><span class="p">(</span><span class="n">df_trn</span><span class="p">);</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">remove_jdocs</span><span class="p">(</span><span class="n">df_val</span><span class="p">);</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">remove_jdocs</span><span class="p">(</span><span class="n">df_tst</span><span class="p">);</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Almost there! In this step, we'll remove any HTML tags from the comments so the model doesn't have to also learn HTML. Bless those that do...</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">clean_html</span><span class="p">(</span><span class="n">cmt</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="sd">'''</span>
<span class="sd"> Remove any HTML tags from a given comment</span>
<span class="sd"> :param cmt: the comment to remove any HTML tags from</span>
<span class="sd"> :returns: the comment with any HTML tags removed</span>
<span class="sd"> '''</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="sa">r</span><span class="s2">"<.?span[^>]*>|<.?code[^>]*>|<.?p[^>]*>|<.?hr[^>]*>|<.?h[1-3][^>]*>|<.?a[^>]*>|<.?b[^>]*>|<.?blockquote[^>]*>|<.?del[^>]*>|<.?dd[^>]*>|<.?dl[^>]*>|<.?dt[^>]*>|<.?em[^>]*>|<.?i[^>]*>|<.?img[^>]*>|<.?kbd[^>]*>|<.?li[^>]*>|<.?ol[^>]*>|<.?pre[^>]*>|<.?s[^>]*>|<.?sup[^>]*>|<.?sub[^>]*>|<.?strong[^>]*>|<.?strike[^>]*>|<.?ul[^>]*>|<.?br[^>]*>"</span><span class="p">,</span> <span class="s2">""</span><span class="p">,</span> <span class="n">cmt</span><span class="p">)</span>
<span class="k">return</span> <span class="n">result</span>
<span class="n">df_trn</span><span class="o">.</span><span class="n">cmt</span> <span class="o">=</span> <span class="n">df_trn</span><span class="o">.</span><span class="n">cmt</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">clean_html</span><span class="p">)</span>
<span class="n">df_val</span><span class="o">.</span><span class="n">cmt</span> <span class="o">=</span> <span class="n">df_val</span><span class="o">.</span><span class="n">cmt</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">clean_html</span><span class="p">)</span>
<span class="n">df_tst</span><span class="o">.</span><span class="n">cmt</span> <span class="o">=</span> <span class="n">df_tst</span><span class="o">.</span><span class="n">cmt</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">clean_html</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
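<p>The regex in <code>clean_html</code> only strips the tags it explicitly lists, so any tag outside that list survives. A more general alternative is to let the standard library's <code>html.parser</code> walk the markup and keep only the text content. This is a minimal sketch, not what the notebook above actually runs, and <code>strip_tags</code> is a name I'm introducing:</p>

```python
from html.parser import HTMLParser

class _TagStripper(HTMLParser):
    """Collects only the text content of a document, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)

def strip_tags(cmt: str) -> str:
    """Remove all HTML tags from a comment (hypothetical alternative to clean_html)."""
    parser = _TagStripper()
    parser.feed(cmt)
    return ''.join(parser.chunks)
```

<p>Note the behavioral difference: <code>strip_tags('&lt;code&gt;foo&lt;/code&gt; bar')</code> returns <code>'foo bar'</code>, and it would also remove tags that <code>clean_html</code>'s list doesn't cover.</p>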
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>FINALLY!! We'll make everything lower case, remove extra whitespace, remove empty comments, and remove duplicates.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="o">.</span><span class="n">applymap</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="s1">' '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">())</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="o">.</span><span class="n">applymap</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="s1">' '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">())</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="o">.</span><span class="n">applymap</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="s1">' '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">())</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="p">[</span><span class="o">~</span><span class="p">(</span><span class="n">df_trn</span><span class="p">[</span><span class="s1">'cmt'</span><span class="p">]</span> <span class="o">==</span> <span class="s1">''</span><span class="p">)]</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="p">[</span><span class="o">~</span><span class="p">(</span><span class="n">df_val</span><span class="p">[</span><span class="s1">'cmt'</span><span class="p">]</span> <span class="o">==</span> <span class="s1">''</span><span class="p">)]</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="p">[</span><span class="o">~</span><span class="p">(</span><span class="n">df_tst</span><span class="p">[</span><span class="s1">'cmt'</span><span class="p">]</span> <span class="o">==</span> <span class="s1">''</span><span class="p">)]</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="p">[</span><span class="o">~</span><span class="n">df_trn</span><span class="p">[</span><span class="s1">'cmt'</span><span class="p">]</span><span class="o">.</span><span class="n">duplicated</span><span class="p">()]</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="p">[</span><span class="o">~</span><span class="n">df_val</span><span class="p">[</span><span class="s1">'cmt'</span><span class="p">]</span><span class="o">.</span><span class="n">duplicated</span><span class="p">()]</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="p">[</span><span class="o">~</span><span class="n">df_tst</span><span class="p">[</span><span class="s1">'cmt'</span><span class="p">]</span><span class="o">.</span><span class="n">duplicated</span><span class="p">()]</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df_trn</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_val</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_tst</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>(3094, 94, 205)</pre>
</div>
</div>
</div>
</div>
</div>
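<p>The cell-by-cell normalization above boils down to two plain-string operations: collapse all whitespace runs (including newlines and tabs) to single spaces, then lowercase; afterwards, drop empties and duplicates. A standalone sketch of the same transform without pandas, with helper names of my own choosing:</p>

```python
def normalize(text: str) -> str:
    """Collapse whitespace runs (spaces, tabs, newlines) to single spaces and lowercase."""
    return ' '.join(text.split()).lower()

def dedupe_nonempty(comments):
    """Drop empty strings and keep only the first occurrence of each comment."""
    seen = set()
    result = []
    for c in comments:
        if c and c not in seen:
            seen.add(c)
            result.append(c)
    return result
```

<p>For example, <code>normalize("  Marshall   the\nGiven Parameter  ")</code> yields <code>"marshall the given parameter"</code>, matching what the <code>applymap</code> lambda does to each dataframe cell.</p>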
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now let's see what the data looks like.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df_trn</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>mthd</th>
<th>cmt</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>public static void checkjavainternalaccess(ilo...</td>
<td>prints warning to given if hazelcast is not pr...</td>
</tr>
<tr>
<th>1</th>
<td>public void marshall(scte20plusembeddeddestina...</td>
<td>marshall the given parameter object.</td>
</tr>
<tr>
<th>2</th>
<td>@override public void prefetchtoken(final file...</td>
<td>/* gets hadoop tokens for a user to run mapred...</td>
</tr>
<tr>
<th>3</th>
<td>@override public <y> singularattribute<x, y> g...</td>
<td>/* (non-javadoc)</td>
</tr>
<tr>
<th>4</th>
<td>public void sync(boolean syncallsegments) { co...</td>
<td>forces a disk flush on the commit log files th...</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Data-Exploring">Data Exploring<a class="anchor-link" href="#Data-Exploring"> </a></h2><p>As good Data Scientists, we will also explore our data to uncover any secrets. Data can be sneaky like that :).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>
<span class="kn">from</span> <span class="nn">statistics</span> <span class="kn">import</span> <span class="n">mean</span><span class="p">,</span> <span class="n">median</span><span class="p">,</span> <span class="n">stdev</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>
<span class="k">def</span> <span class="nf">get_counter</span><span class="p">(</span><span class="n">df</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">:</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">col</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="n">Counter</span><span class="p">:</span>
<span class="sd">'''</span>
<span class="sd"> Get the counts for each token in a given pandas dataframe column</span>
<span class="sd"> :param df: the pandas dataframe to get the counts of tokens from</span>
<span class="sd"> :param tokenizer: the tokenizer to use for tokenizing the rows in the pandas</span>
<span class="sd"> dataframe</span>
<span class="sd"> :param col: the column to grab rows from when tokenizing</span>
<span class="sd"> :returns: the counts of each token in the given pandas dataframe</span>
<span class="sd"> column</span>
<span class="sd"> '''</span>
<span class="n">toks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">iterrows</span><span class="p">():</span>
<span class="n">toks</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="n">col</span><span class="p">]))</span>
<span class="n">cnt</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">()</span>
<span class="k">for</span> <span class="n">tok</span> <span class="ow">in</span> <span class="n">toks</span><span class="p">:</span>
<span class="n">cnt</span><span class="p">[</span><span class="n">tok</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">cnt</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s1">'microsoft/codebert-base'</span><span class="p">)</span>
<span class="n">mthd_cnt</span> <span class="o">=</span> <span class="n">get_counter</span><span class="p">(</span><span class="n">df_trn</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="s1">'mthd'</span><span class="p">)</span>
<span class="n">cmt_cnt</span> <span class="o">=</span> <span class="n">get_counter</span><span class="p">(</span><span class="n">df_trn</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="s1">'cmt'</span><span class="p">)</span>
<span class="n">mthd_lens</span> <span class="o">=</span> <span class="n">df_trn</span><span class="o">.</span><span class="n">mthd</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">x</span><span class="p">)))</span><span class="o">.</span><span class="n">values</span>
<span class="n">cmt_lens</span> <span class="o">=</span> <span class="n">df_trn</span><span class="o">.</span><span class="n">cmt</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">x</span><span class="p">)))</span><span class="o">.</span><span class="n">values</span>
<span class="n">max_mthd_len</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">quantile</span><span class="p">(</span><span class="n">mthd_lens</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">))</span>
<span class="n">max_cmt_len</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">quantile</span><span class="p">(</span><span class="n">cmt_lens</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>
</pre>
</div>
</div>
</div>
</div>
</div>
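<p>Capping sequence lengths at the 95th percentile (the <code>np.quantile(..., 0.95)</code> calls above) trades truncating a small tail of long samples for much shorter padded batches. The idea can be sketched in pure Python with a simple nearest-rank percentile; <code>cap_at_quantile</code> is a name I'm introducing, and this is an approximation of (not identical to) numpy's interpolating quantile:</p>

```python
import math

def cap_at_quantile(lengths, q=0.95):
    """Return the smallest length L such that at least a fraction q of the
    samples have length <= L (nearest-rank percentile)."""
    s = sorted(lengths)
    rank = math.ceil(q * len(s))  # 1-indexed rank of the q-th percentile
    return s[rank - 1]
```

<p>With token lengths of 1 through 100, this picks 95: sequences longer than that get truncated, while 95% of the data fits unchanged.</p>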
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
<span class="k">def</span> <span class="nf">plot_counts</span><span class="p">(</span><span class="n">counts</span><span class="p">:</span><span class="n">Counter</span><span class="p">,</span> <span class="n">top_k</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="mi">30</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Plot a bar chart of the most common tokens</span>
<span class="sd"> :param counts: the counts of each token</span>
<span class="sd"> :param top_k: the number of tokens to display in the plot</span>
<span class="sd"> '''</span>
<span class="n">labels</span><span class="p">,</span> <span class="n">values</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">counts</span><span class="o">.</span><span class="n">most_common</span><span class="p">()[:</span><span class="n">top_k</span><span class="p">])</span>
<span class="n">indexes</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">labels</span><span class="p">))</span>
<span class="n">width</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">num</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">22</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span> <span class="n">facecolor</span><span class="o">=</span><span class="s1">'w'</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s1">'k'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">indexes</span><span class="p">,</span> <span class="n">values</span><span class="p">,</span> <span class="n">width</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">indexes</span> <span class="o">+</span> <span class="n">width</span> <span class="o">*</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's look at the most common tokens in our methods and comments.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">plot_counts</span><span class="p">(</span><span class="n">mthd_cnt</span><span class="p">,</span> <span class="n">top_k</span> <span class="o">=</span> <span class="mi">30</span><span class="p">)</span>
<span class="n">plot_counts</span><span class="p">(</span><span class="n">cmt_cnt</span><span class="p">,</span> <span class="n">top_k</span> <span class="o">=</span> <span class="mi">30</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABCsAAADQCAYAAAAu0euYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAAJOgAACToB8GSSSgAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3df1RUdf7H8ZeaYuoCEma0QFaSWpttq7kGDgwYSGhmpuOP8Ehn7Yfalluart+OpshZs5PZ1mZ6FLHSFFyz3TBUxFF+mKUn7ZeamRbiqZXGAM3k1/3+4WGCBBSdHxd8Ps7xHLnMfN6f98zlMrzmc++0MgzDEAAAAAAAgEm09vYEAAAAAAAAaiOsAAAAAAAApkJYAQAAAAAATIWwAgAAAAAAmMpV3p5AQ/r06aObb77Z29MAAAAAAABucvjwYe3Zs+e87aYNK26++Walp6d7exoAAAAAAMBNbDZbvds5DQQAAAAAAJgKYQUAAAAAADAVwgoAAAAAAGAqhBUAAAAAAMBUCCsAAAAAAICpEFYAAAAAAABTIawAAAAAAACmcpW3J9ASdZuR6bXaR+cP9lptAAAAAABcgZUVAAAAAADAVAgrAAAAAACAqRBWAAAAAAAAUyGsAAAAAAAApkJYAQAAAAAATIWwAgAAAAAAmAphBQAAAAAAMBXCCgAAAAAAYCqEFQAAAAAAwFQIKwAAAAAAgKk0GlZ89NFHuvvuuxUZGakxY8aooqJCYWFhslqtslqt2rJliyTpwIEDioyMVHh4uLZu3SpJOn36tIYPH64BAwZowYIFzjGnT58ui8WicePGqaKiwo2tAQAAAACA5qjRsCIkJEQ5OTnasWOHunXrpvfee09+fn6y2+2y2+2KjY2VJM2cOVPLly9XVlaWZs2aJUlatmyZEhISlJeXp5ycHBUVFWnfvn0qKipSbm6uevbsqXXr1rm/QwAAAAAA0Kw0GlYEBQXp6quvliS1a9dOrVu31qlTpxQVFaWxY8fK4XBIko4fP66wsDD5+voqICBAxcXFKigoUFxcnCQpNjZWO3furLMtPj5e+fn57uwNAAAAAAA0Qxd1zYpvv/1Wmzdv1n333af8/Hxt375d8fHxmj17tiSpurraeVs/Pz85HA6dPHlSvr6+F9xWW0ZGhmw2m2w2mwoLC13SIAAAAAAAaF4uGFaUlpZq3LhxSktLU9u2bXXNNddIkkaMGKF9+/adG6T1r8OUlJQoICBA/v7+Ki0tveC22kaOHKn09HSlp6crJCTENR0CAAAAAIBmpdGworKyUqNHj9bs2bPVo0cPlZeX6+zZs5Kk3Nxcde/eXdK500UOHz6ssrIyORwOBQYGKjw8XNnZ2ZKk7Oxs9e/fv862TZs2KSIiwp29AQAAAACAZuiqxr75zjvvaNeuXUpOTlZycrImTpyoBQsWqGPHjvLx8VFqaqokKSUlRUlJSaqqqtKcOXMkSRMmTFBiYqJSU1M1ZMgQBQcHKzg4WF27dpXFYlFoaKimTp3q/g4BAAAAAECz0sowDMPbk6iPzWZTenq6t6dxSbrNyPRa7aPzB3utNgAAAAAATdHQ3/4XdYFNAAAAAAAATyGsAAAAAAAApkJYAQAAAAAATIWwAgAAAAAAmAphBQAAAAAAMBXCCgAAAAAAYCqEFQAAAAAAwFQIKwAAAAAAgKkQVgAAAAAAAFMhrAAAAAAAAKZCWAEAAAAAAEyFsAIAAAAAAJgKYQUAAAAAADAVwgoAAAAAAGAqhBUAAAAAAMBUCCsAAAAAAICpEFYAAAAAAABTIawAAAAAAACmQlgBAAAAAABMhbACAAAAAACYCmEFAAAAAAAwFcIKAAAAAABgKoQVAAAAAADAVAgrAAAAAACAqTQaVnz00Ue6++67FRkZqTFjxqiiokIZGRkKDw/XwIEDdezYMUnSgQMHFBkZqfDwcG3dulWSdPr0
aQ0fPlwDBgzQggULnGNOnz5dFotF48aNU0VFhRtbAwAAAAAAzVGjYUVISIhycnK0Y8cOdevWTe+9954WLlwou92uuXPnKjk5WZI0c+ZMLV++XFlZWZo1a5YkadmyZUpISFBeXp5ycnJUVFSkffv2qaioSLm5uerZs6fWrVvn/g4BAAAAAECz0mhYERQUpKuvvlqS1K5dOx08eFC9evVSu3btFBERoU8//VSSdPz4cYWFhcnX11cBAQEqLi5WQUGB4uLiJEmxsbHauXNnnW3x8fHKz893Z28AAAAAAKAZuupibvTtt99q8+bNmj9/vk6cOOHcXlVVJUmqrq52bvPz85PD4dDJkyfl6+t73ragoKA622rLyMhQRkaGJKmwsPAy2gIAAAAAAM3VBcOK0tJSjRs3TmlpaaqqqlJpaanze23atJEktW796wKNkpISBQQEyN/fX6WlpfL391dJSYluuOEGVVZWOu9fc7vaRo4cqZEjR0qSbDbb5XcHAAAAAACanUZPA6msrNTo0aM1e/Zs9ejRQ2FhYdq/f7/Ky8tVUFCg3r17Szp3usjhw4dVVlYmh8OhwMBAhYeHKzs7W5KUnZ2t/v3719m2adMmRUREuLk9AAAAAADQ3DS6suKdd97Rrl27lJycrOTkZE2cOFFTpkyR1WpV+/bttXLlSklSSkqKkpKSVFVVpTlz5kiSJkyYoMTERKWmpmrIkCEKDg5WcHCwunbtKovFotDQUE2dOtX9HQIAAAAAgGallWEYhrcnUR+bzab09HRvT+OSdJuR6bXaR+cP9lptAAAAAACaoqG//Rs9DQQAAAAAAMDTCCsAAAAAAICpEFYAAAAAAABTIawAAAAAAACmQlgBAAAAAABMhbACAAAAAACYCmEFAAAAAAAwFcIKAAAAAABgKoQVAAAAAADAVAgrAAAAAACAqRBWAAAAAAAAUyGsAAAAAAAApnKVtycA1+o2I9NrtY/OH+y12gAAAACAloOVFQAAAAAAwFQIKwAAAAAAgKkQVgAAAAAAAFMhrAAAAAAAAKZCWAEAAAAAAEyFsAIAAAAAAJgKYQUAAAAAADAVwgoAAAAAAGAqhBUAAAAAAMBUCCsAAAAAAICpEFYAAAAAAABTaTSsKCkpUb9+/dSpUyd9/vnnkqSwsDBZrVZZrVZt2bJFknTgwAFFRkYqPDxcW7dulSSdPn1aw4cP14ABA7RgwQLnmNOnT5fFYtG4ceNUUVHhrr4AAAAAAEAz1WhY0aFDB2VmZmrEiBHObX5+frLb7bLb7YqNjZUkzZw5U8uXL1dWVpZmzZolSVq2bJkSEhKUl5ennJwcFRUVad++fSoqKlJubq569uypdevWubE1AAAAAADQHDUaVrRt21ZdunSps+3UqVOKiorS2LFj5XA4JEnHjx9XWFiYfH19FRAQoOLiYhUUFCguLk6SFBsbq507d9bZFh8fr/z8/DpjZ2RkyGazyWazqbCw0GVNAgAAAACA5qPJ16zIz8/X9u3bFR8fr9mzZ0uSqqurnd/38/OTw+HQyZMn5evre8FttY0cOVLp6elKT09XSEjIJTcFAAAAAACaryaHFddcc40kacSIEdq3b9+5QVr/OkxJSYkCAgLk7++v0tLSC24DAAAAAACorUlhRXl5uc6ePStJys3NVffu3SVJQUFBOnz4sMrKyuRwOBQYGKjw8HBlZ2dLkrKzs9W/f/862zZt2qSIiAhX9gIAAAAAAFqAqy50g4SEBO3du1cHDx7UsGHDlJ6ero4dO8rHx0epqamSpJSUFCUlJamqqkpz5syRJE2YMEGJiYlKTU3VkCFDFBwcrODgYHXt2lUWi0WhoaGaOnWqe7sDAAAAAADNTivDMAxvT6I+NptN6enp3p7GJek2I9PbU/CKo/MHe3sKAAAAAIBmpKG//Zt8zQoAAAAAAAB3IqwAAAAAAACmQlgBAAAAAABMhbACAAAAAACYCmEFAAAAAAAwFcIKAAAAAABgKoQVAAAA
AADAVAgrAAAAAACAqRBWAAAAAAAAUyGsAAAAAAAApkJYAQAAAAAATIWwAgAAAAAAmAphBQAAAAAAMBXCCgAAAAAAYCqEFQAAAAAAwFQIKwAAAAAAgKlc5e0JoOXoNiPTa7WPzh/stdoAAAAAANcirECLQFACAAAAAC0Hp4EAAAAAAABTIawAAAAAAACmQlgBAAAAAABMhbACAAAAAACYChfYBC4TF/cEAAAAANdqdGVFSUmJ+vXrp06dOunzzz+XJGVkZCg8PFwDBw7UsWPHJEkHDhxQZGSkwsPDtXXrVknS6dOnNXz4cA0YMEALFixwjjl9+nRZLBaNGzdOFRUV7uoLAAAAAAA0U42urOjQoYMyMzM1bdo0SVJlZaUWLlyo7du36+OPP1ZycrKWLFmimTNnavny5eratavuvfdeDRw4UMuWLVNCQoImTJig+Ph4PfTQQyouLlZRUZFyc3OVkpKidevWacyYMR5pFGiJWNUBAAAAoCVqdGVF27Zt1aVLF+fXhw4dUq9evdSuXTtFRETo008/lSQdP35cYWFh8vX1VUBAgIqLi1VQUKC4uDhJUmxsrHbu3FlnW3x8vPLz893VFwAAAAAAaKaadM2KkydPytfX1/l1VVWVJKm6utq5zc/PTw6Ho85ta28LCgqqs622jIwMZWRkSJIKCwsvoR0AAAAAANDcNSms8Pf3V2lpqfPrNm3aSJJat/51gUZJSYkCAgKct/X391dJSYluuOEGVVZWOu9fc7vaRo4cqZEjR0qSbDbbpXUEAAAAAACatSZ9dGlYWJj279+v8vJyFRQUqHfv3pKkoKAgHT58WGVlZXI4HAoMDFR4eLiys7MlSdnZ2erfv3+dbZs2bVJERISL2wEAAAAAAM3dBVdWJCQkaO/evTp48KAee+wxTZkyRVarVe3bt9fKlSslSSkpKUpKSlJVVZXmzJkjSZowYYISExOVmpqqIUOGKDg4WMHBweratassFotCQ0M1depU93YHAAAAAACanQuGFRs3bjxv26hRo+p8feuttyo3N7fOtk6dOmnDhg3n3ffFF19s6hwBAAAAAMAVpEmngQAAAAAAALhbky6wCQA1us3I9Frto/MHe602AAAAAPdjZQUAAAAAADAVwgoAAAAAAGAqnAYCoNnhFBQAAACgZSOsAIAmICgBAAAA3I/TQAAAAAAAgKkQVgAAAAAAAFMhrAAAAAAAAKZCWAEAAAAAAEyFsAIAAAAAAJgKYQUAAAAAADAVProUAJoJPjYVAAAAVwpWVgAAAAAAAFNhZQUA4IJY1QEAAABPYmUFAAAAAAAwFcIKAAAAAABgKpwGAgAwNU5BAQAAuPKwsgIAAAAAAJgKYQUAAAAAADAVwgoAAAAAAGAqXLMCAIAGcL0MAAAA72BlBQAAAAAAMBXCCgAAAAAAYCpNDiuOHj2qLl26yGq1ymq16sSJE8rIyFB4eLgGDhyoY8eOSZIOHDigyMhIhYeHa+vWrZKk06dPa/jw4RowYIAWLFjg2k4AAAAAAECLcEkrK6KiomS322W329W5c2ctXLhQdrtdc+fOVXJysiRp5syZWr58ubKysjRr1ixJ0rJly5SQkKC8vDzl5OSoqKjIdZ0AAAAAAIAW4ZLCivz8fFksFs2cOVOHDh1Sr1691K5dO0VEROjTTz+VJB0/flxhYWHy9fVVQECAiouLVVBQoLi4OElSbGysdu7c6bpOAAAAAABAi9DkTwMJCgrS119/rQ4dOuiRRx7R+vXr5evr6/x+VVWVJKm6utq5zc/PTw6HQydPnnTetmZbbRkZGcrIyJAkFRYWNr0bAAAAAADQ7DV5ZYWPj486duyoVq1aafjw4dq3b59KS0ud32/Tps25gVv/OnRJSYkCAgLk7+/vvG3NttpGjhyp9PR0paenKyQk5JIaAgAAAAAAzVuTw4qysjLn/3NzczV48GDt379f5eXlKigoUO/evSWdW4Fx+PBhlZWVyeFwKDAwUOHh4crO
zpYkZWdnq3///i5qAwAAAAAAtBRNPg0kLy9Pzz33nDp06KAbb7xRycnJat++vaxWq9q3b6+VK1dKklJSUpSUlKSqqirNmTNHkjRhwgQlJiYqNTVVQ4YMUXBwsGu7AQAAAAAAzV4rwzAMb0+iPjabTenp6d6exiXpNiPT21MAADRzR+cP9vYUAAAA3K6hv/0v6dNAAAAAAAAA3KXJp4EAAAD38+YqPVZ1AAAAbyOsAAAAdRCUAAAAb+M0EAAAAAAAYCqEFQAAAAAAwFQ4DQQAAJjGlfqJWpz+AgBAXYQVAAAAXsZ1QgAAqIuwAgAA4ApGUAIAMCPCCgAAAHgFQQkAoCGEFQAAALjiEJQAgLkRVgAAAAAeRFACABdGWAEAAABcIa7UT9y5UhFOoTlr7e0JAAAAAAAA1MbKCgAAAABoga7UlTSsKGkZCCsAAAAAAC0GIU3LwGkgAAAAAADAVAgrAAAAAACAqRBWAAAAAAAAUyGsAAAAAAAApkJYAQAAAAAATIWwAgAAAAAAmAphBQAAAAAAMBXCCgAAAAAAYCqEFQAAAAAAwFS8ElZMnz5dFotF48aNU0VFhTemAAAAAAAATMrjYcW+fftUVFSk3Nxc9ezZU+vWrfP0FAAAAAAAgIld5emCBQUFiouLkyTFx8drxYoVGjNmjCQpIyNDGRkZkqTdu3fLZrN5enou0c+LtQsLCxUSEkJtalOb2tSmNrWpTW1qU5va1L6Cat999zyv1b4chw8frv8bhoelpKQY7777rmEYhnHo0CFjzJgxnp5CizZy5EhqU5va1KY2talNbWpTm9rUpja1mzWPnwbi7++v0tJSSVJJSYkCAgI8PQUAAAAAAGBibZ5//vnnPVmwbdu2WrVqlR544AGlpaWpd+/euv322z05hRbvtttuoza1qU1talOb2tSmNrWpTW1qU7vZamUYhuHpotOmTdOHH36o0NBQrVixQu3atfP0FAAAAAAAgEl5JawAAAAAAABoiMevWQGgeXr44Ye9PQWvutL7h2e0tP2sOfTjiTl6+3Hwdn38iucC7sY+hpaEsAIus3HjRi1fvtxj9bx5ML6SfhHk5+crOjpaR44cUVRUlP797397dT6FhYX629/+5rF6DfU/evRoVVZWur2+p/tFwzZs2KD//e9/bhm7vv1sypQpOnPmjH7++WdZrVbdc889bqldo+YYPmrUKF3uostLOW789NNPSk9Pv6y6TdHQHJ977jnt27dPkjR+/Hg5HA6X1/CUy63ft29fSVJSUpI+//xzd0zRK1y5r1+s+p4Lq9Uqu90uT10+rqZvq9Wq559/Xna7/bLHbC6vh5YuXer8f82xtaXx9vHmSrZ3714tXry4zraa46er1d6Xa/v+++81e/Zst9T0Km9+FAlalqFDhxpnzpxxe528vDzDarUaUVFRRmRkpLFu3TrDMAyjurraGDRokDFq1CiP1l67dq1x5513Gnl5eW6r6y3FxcVG7969jePHjxuGYRjl5eVGQUGBl2dlGKNHjzZOnjzp9joN9b99+3Zj3rx5bq9fw1P9onHjx483PvvsM5ePe6Gfs4KCAuPJJ590ed3fqjmGL1q0yMjKyrrkcS7UT1VVVb33O3LkiPHggw9edJ2GxrncOQ4ZMsQ5/tChQ91SwxNcUb9Pnz6GYbhv3/cWV+3rF6uh5yIqKsrYtm2bMXv2bLfPwTB+7TsqKsqYPXu2sW3btkseq6HXYrV98sknxq5du+q9/z/+8Q/jm2++ueT6TVWzL7dUF/p5X7FixWU932g6d+1z9Y1bXV1tVFdXu6Wet7GyAi7x008/qaqqSu3bt3drnR9//FGTJk3S6tWrZbfblZ2dreuvv17SuUSxdevWWrNmjUdr22w2zZs3Tx988IFb6nrTxo0b9cADDygoKEjSuU/zufvuu708K8lisWjTpk1ur9NQ/xs2bFBsbKzb69eo6belvht0IY899phbx6+srNSIESN0zz33aPLkyUpKSlJW
VpYsFovCw8P1zjvv6MiRI8rKytLDDz+sZ5991qX1G9rPrFarTp06paeeekrr16/XpEmTXFq3ttrH8NjYWG3YsEGSlJaWpp07dzZprMb6efbZZzVo0CD98ssvSkxMVExMjIYOHarS0lItXrxY27dvl9Vq1ZdfflnnXama/z///PNKSkpSQkKCPv30U0VERGjUqFG6/fbblZOTc9lz/OGHH9S1a1dJ0u7du9WnT58m9X6xj8PTTz+tyMhIPfHEE5JU7+OxdOlSrV69WmfOnJGPj4++++47bd++/aLfOWtK/bS0NL322muSpPfff99j7/TXVl5eroqKCrfXaWhfd6eGnos333xT/fr1cz4P7lS77zfffFNPPPGE+vXrd0ljNfZarLa9e/fqo48+Om97dXW1ZsyYoRtvvLFJdQ3D0OTJk2WxWBQdHa1du3Zp0KBBslqtzhWIaWlpGjZsmBISEmSxWFRUVKTFixfr4MGDslqtysnJcR5bS0tLNXToUEVFRWn06NEqLy+X3W5XfHy8HnjgAd1xxx2XtaKooqJC5eXll3z/pjDD6zXDMPTXv/5V0dHRuueee5SXl6eEhAQZhqFZs2ZpxYoVHpuLu183/JbdbtfUqVP11ltvqW/fvhozZoxOnTrlkrE//PBD/fnPf1Z0dLTuuOMO5768evVqJSUlafLkyYqLi9Pu3bs1YsQISar3OP/TTz8pLi5O8fHxSkpK8spx/lIQVsAlvvrqK3Xr1s3tdRo7GJ89e1YdO3b0Su0OHTrol19+cVttbzl+/Liz35pf8EOHDtWLL74oq9Va59/8+fM9Nq+bbrpJX375pdvrNNT/gQMHdNNNN7m9fo2afhctWqSrr77aY3XNYsmSJW4df8OGDbrllluUnZ2tO+64Q4ZhKDk5WVu3blVubq5ee+01hYaGKj4+XitWrNCCBQtcWr+h/azGggULNGrUKL3++usurVtb7WN47Z+vpKSkJr/gbayfQYMGacuWLVq2bJliYmKUk5Ojhx56SEuXLtXEiRMVFRUlu92uW2+9tcHxQ0JCtHHjRvn7+6u4uFirVq1Senq684/ty5njpk2bNGjQIElSVlaW7r333ib1fjE1JGnYsGHasWOH9uzZo5KSknofD4vFotzcXO3atUsxMTHKzc1Vbm6uIiMjXV7fm7744gs9/fTTiomJ8chcGtrX3amh5yI0NFQdOnRQYGCg2+dQu+/Q0FAFBgaqQ4cOlzRWQ6+HHn74YVksFlmtVh09elSLFy/WK6+8ori4OB09elSRkZEaNWqUXnjhBeepRQ2FAytXrlTfvn01fvx45/Hgv//9r1q3bq3c3Fxt27ZNL730kl5//XXZ7Xb98ssv2r17t6Rzr8k2btyo//u//9MLL7ygiRMnqkePHrLb7YqJiXH2sXTpUiUkJGj79u267bbbnG92VVRU6N1339X8+fOVmpp6SY+RJJWUlCgmJkZPP/20vvjii0se52Jc6PeIJ2RmZqpz587atm2bUlJStGbNGkVFRemxxx7T559/7tFThtz9uqE+VVVVWrhwofLz8/Xqq6/q2LFjLhk3MzNTs2fP1rZt2/TJJ5849+WxY8dKkv70pz9py5Yt6tKlS5371fd7ZsSIEcrKyqo3XDSrq7w9AaApfnswnjt3rnx9ffWf//xHhYWF8vPz80ptPz8/lx2UzOT666/XoUOHJEkxMTGKiYlR3759NW3aNE2bNs3Ls3O/hvq/9tprvTwzuNLXX3/tfAe9T58+2rBhg7766ivFxcVJOvduxIkTJ9xWv6H9rFOnTm6r6U6N9XPXXXdJkr788kt9/PHHevPNN1VRUSGLxdLomEat6wrUjCFJf/jDH3TVVVcpJCREJ0+evOw5btmyRa+++qqkcysrnnvuuYse82JrdOrUSXfeeack6fe//71++umneh+Pnj17av/+/dqxY4dmzpyp1atXq7CwUE8//bTL67dq1cp5P8MD13CoqKhQWlqa1q1bpxtuuEEP
P/ywFi5c6Pa63tLQc9Fc1fd66Oqrr1ZJSYny8/PVqlUrVVdXa+LEiTp16pSeeOIJHT16VEVFRcrOzla7du2UlJTkHK+iokJZWVn64IMPlJqaqhdffFEvv/yydu3apdOnT+uGG26QJO3fv19RUVHO+x04cEB/+ctfJEllZWXOoLHmeH7XXXfplVdeabCPr7/+Wo888ojztvn5+QoNDdUf//hHSWryceW3AgMDlZeXp4KCAr3yyiv69ttvNWLECCUlJalt27aXPG59GtrH5s6dq5ycHH3//fdq3769/P39NXbsWD366KMurS+dO66/++672rFjhwzDUEhIiJKTkxUUFKTs7GyX1zObsrIyBQcHy8fHRz4+Pk1eOdSQyZMna968eVq1apUeeuih875f+3dibb89ztfe3/v06aPPPvvMJfNzN1ZWwCVuueUWHT161O11rr/+ehUVFUk6dzC22+06fvy4MjMz9cADDygxMdHjtSU5l2WNHj3abfXr4+6AJCEhQe+++66zz5oLSnp7ZcU333yjXr16ub1OQ/336NFD33zzjdvr16jpt6ioyGMXgzMTd+/n3bt31yeffCJJ+uSTTxQYGKiePXtq8+bNstvt2rt3r6677jq1bdtWVVVVLq/f0H7mSbWP4bV/vhwOh37++ecmjdVYP61bn3vZ0bNnTz355JOy2+3Kz89XcnLyeY9vmzZtVFZWprKysjo/bzVjSLrkP7Lrm2N1dbVKS0vl7+8vh8Ohzp0716nVVI09Dr+dd32PR6tWrRQQEKD8/HxZLBZ9//33Onv27EW/G96U+p07d3b+nNVcXNSdysrK9MYbbyg0NFQTJ0706HL1hvZ1dzLbz/jlqu/10IkTJzR58mSNGzdOTz31VL3HjTvuuEPt2rU7b/tvw4ETJ04oJCREPj4+CggIcK4I6dWrl3bs2OG8X48ePbRy5UrZ7Xbt3r1bQ4YMkSTn8Xz37t3q3r27pLr7fI3u3bs7T1P5+OOPFRYWdt5tXfE7Nzw8XBMnTlRoaKjeeOMNlZWVXfaYv9XQPjZr1izZ7XbNmDFDixYtkt1ud0tQIZ07rttsNtntdm3fvl0rVqzQs88+q5dffllz5851y+9PM/nd736nY8eOqby8XA6HQ0eOHHHJuH5+fsm0oqkAAAR7SURBVHrttde0YsUKTZ8+/bx9uaHfU7/dj3/7Wqe5IKxoQbx5FVh/f3+1bt3a7adCNHQwHjx4sNavX6+3337b47Wlc+dlhoWFue16GfWprKzUmDFj3Frjmmuu0RtvvKGxY8cqOjpagwYN0pQpUzRt2jTZ7fY6/2bMmOHWudSWm5vrfAfFnRrqf9iwYdqyZYvb69eo6TcxMdFj57+ahSf282HDhunAgQMaOHCgdu3aJR8fHz333HOKjY1VdHS0852Me++9V1OmTFFKSopL6ze0n3lS7WP4li1bdP/990uSFi5c2ORrVlxMP48++qi2bNnifAdw8+bNCgoK0pkzZzRixAgdOnTIeW76M8884/Ilq/XNcfjw4c53qDZv3nzZ16VpyvNa3+MhSQMGDHCe3njdddc53ylzdf177rlHBQUFSkhI0LffftvETpsuICBAe/bs0aRJk7R8+XJFR0dr0aJFOnv2rNtrN7Svu5PZfsYvV32vh6qqqmSz2fT222+ra9euWr9+/XkB5MX+UdWlSxfnH30nT550hiz33XefKisrNWDAAEVHR2vq1Kl6/PHHFR0drdjYWOd8ysvLFR8fr+TkZOc1hnr06KEHH3xQ+fn5zlqPPPKIMjMzFRUVpc8++8zlbzidPXtWixYtUnR0tJYvX65JkyZpz549CggIcGkdyRz72H333acff/xR0dHRiomJ0dKlS9W2bVtNnDhRw4YN04svvuixuXj6mhXSuYB9ypQpCg8P15NPPqnQ0FCXjLtkyRJFRkbKarUqKSlJ0dHRuv/++5t8vZ0JEyZo7dq1GjRokI4cOeLy1T1u442reqJlev/9941l
y5a5vU5eXp4RFRVlWK1WIyYmxnjrrbcMw2j6leRdWXvbtm3GM88849bav7Vr1y5j6dKlHq1pBt99953x1FNPeXsahs1mMyoqKtxep6bfyspK47HHHnN7PbPx1H5eXl5uGIZhLFmyxJg/f77b65lRzTHcZrM5P2nj8ccf98h+bgYffvih8fXXXxuGYRhZWVlGcXGxl2d05Thz5oyxatUq48cff/RIvfr29SuBK1+n1fd6KDIy0rmtqKjIOHz4sBEREWGMGTPmvNdoNZ8wU/v102effWaMHz/eMAzDSEtLM/r06WMkJiYat95660XPa8WKFcarr77qkh4v148//misWrXKI5+UB1xIVVWVUVlZaRiGYfz973831qxZ4+UZXZxWhnEFrilGi/TDDz/ooYce8sp5ce+995727NmjuXPnerw24C4lJSXnvev40ksvXdYnJJhVQkKCTp06JR8fH61du9Yt73wBAC5ORUWF2rZtK4fDofj4+Ho/VaQ+aWlpzutkAPjV6dOnFR8fL8MwdO2112rNmjX1npZlNoQVaFHGjx+vX375RWvXrvVYzfT0dP3rX//SK6+84jzvEgAAAJfmn//8p9avX6+ysjLNmzfvsj6ZB0DzRVgBAAAAAABMhQtsAgAAAAAAUyGsAAAAAAAApkJYAQAAAAAATIWwAgAAAAAAmMr/A1fKdro4K+wkAAAAAElFTkSuQmCC
" />
</div>
</div>
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABCUAAADQCAYAAAAwGNsrAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAAJOgAACToB8GSSSgAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3de1zUdb7H8TdXNT2AqLmotLbJ0bQwj5dVFJhQUBE9eQG8YdiW6Vqba26U69GU1dRt3babXRTxpKWMGbreQRwuYpqWprWaudnBW5tBoJZxm/OHD2YhuTPjT/T1/At+M/P9fL/f+f1+85v3/H4zTlar1SoAAAAAAIAbzNnoDgAAAAAAgNsToQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADCEq9EdkKSePXvqnnvuMbobAAAAAADAQU6dOqVDhw5VWHZThBL33HOPkpKSjO4GAAAAAABwkKioqOuWcfkGAAAAAAAwRLWhxDfffKOAgAAFBwcrJCRE58+fl8lkUmBgoEwmk9555x1J0oULFxQWFqb+/ftrzZo1kqSSkhI98sgjCgwM1IwZMxw/EgAAAAAA0KhUG0q0bt1aWVlZSk9P16RJk7Ry5UpJ0vbt22WxWBQTEyNJWrJkiZ555hmlp6frtdde09WrV7Vlyxa1a9dOmZmZunLlivbt2+f40QAAAAAAgEaj2lDCxcVFzs7X7nLp0iV169ZNzs7OCg8P14gRI/T1119Lkg4cOKCQkBC5urqqV69eOnbsmLKzsxUWFiZJGjJkiPbu3VuhbbPZrKioKEVFRSknJ8cRYwMAAAAAADexGr/o8vDhw3r88cf1/fffa9euXTKbzWrVqpXS09P15JNPavPmzSoqKrKFF56ensrNzVVeXp48PDwqLCsvMjJSkZGRkir/sgsAAAAAAHBrq/GLLh944AHt379f8fHxeuGFF9SqVStJUnBwsM6dOydJcnNzU2lpqSQpPz9f3t7e8vLyUkFBQYVlAAAAAAAAZaoNJQoLC21/e3p66o477rAFDZ9//rlatmwpSerdu7csFouKi4t16NAhdevWTQEBAUpNTZUk7dy5U/3793fUGAAAAAAAQCNU7eUbhw8f1qxZs+Ti4qKmTZsqISFBISEhatasmSTptddekyTFxcVp0qRJmjNnjqZOnapmzZopIiJCycnJCgwMVI8ePdSvXz/Hj+YG6/jsVsNqn148zLDaAAAAAADYQ7WhRJ8+fZSRkVFh2cGDB6+7n4+Pj1JSUio27OqqxMTEhvcQAAAAAADckmr8TgkAAAAAAABHIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGIJQAAAAAAACGqDaU+OabbxQQEKDg4GCFhITo/PnzysrKUkBAgAYMGKCjR49Kki5cuKCwsDD1799fa9askSSVlJTokUceUWBgoGbMmOH4kQAAAAAAgEal2lCidevWysrKUnp6uiZNmqSVK1fqj3/8o7Zu3ap3331XcXFxkqQlS5bomWeeUXp6ul577TVdvXpVW7ZsUbt27ZSZmakrV65o3759N2RAAAAAAACgcXCt7kYXFxfb35cuXdI999yjtLQ0tWzZUi1btlRubq4k6cCBA/rLX/4iZ2dn9erVS8eOHVN2draGDRsmSRoyZIj27t2rfv362dozm80ym82SpJycHLsP7FbX8dmthtU+vXiYYbUBAAAAALeOGr9T4vDhw/r1r3+tV199VQEBAfLw8LDd5urqqsLCQhUVFcnZ+VpTnp6eys3NVV5enu2+ZcvKi4yMVFJSkpKSkuTr62vPMQEAAAAA
gEag2jMlJOmBBx7Q/v37lZSUpIULF6qgoMB2W3Fxsdzd3eXm5qbS0lI5OzsrPz9f3t7e8vLyst23bBkAAAAAAECZas+UKCwstP3t6empFi1aqLi4WN9//71ycnJsQUPv3r1lsVhUXFysQ4cOqVu3bgoICFBqaqokaefOnerfv78DhwEAAAAAABqbas+UOHz4sGbNmiUXFxc1bdpUCQkJOnnypMLDw+Xk5KTXX39dkhQXF6dJkyZpzpw5mjp1qpo1a6aIiAglJycrMDBQPXr0qPB9EgAAAAAAANWGEn369FFGRkaFZT4+PsrOzr5uWUpKSsWGXV2VmJhon14CAAAAAIBbTo1fdAkAAAAAAOAIhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQ1YYSBw4cUL9+/RQUFKRx48apqKhIfn5+MplMMplMSklJkSQdP35cQUFBCggI0O7duyVJV65c0ahRozRgwAAtXbrU8SMBAAAAAACNSrWhhK+vr9LS0pSRkaGOHTtq06ZN8vT0lMVikcViUWhoqCRp9uzZWrlypXbs2KG5c+dKklasWKHw8HBlZWUpLS1NZ8+edfxoAAAAAABAo1FtKOHj46NmzZpJktzd3eXs7KzLly8rODhY48ePV25uriTp3Llz8vPzk4eHh7y9vXXx4kVlZ2crLCxMkhQaGqp9+/Y5eCgAAAAAAKAxqdV3Snz99dfatWuXhg8frr179yo9PV1DhgzRvHnzJEmlpaW2+3p6eio3N1d5eXny8PCosKw8s9msqKgoRUVFKScnx17jAQAAAAAAjUSNoURBQYFiYmKUmJgoNzc3tWrVSpI0ZswYHTly5Fojzv9uJj8/X97e3vLy8lJBQUGFZeVFRkYqKSlJSUlJ8vX1tduAAAAAAABA41BtKFFcXKyxY8dq3rx56ty5swoLC/XTTz9JkjIzM9WpUydJ1y7zOHXqlC5duqTc3Fy1bt1aAQEBSk1NlSSlpqaqb9++Dh4KAAAAAABoTFyru/G9997T/v37FR8fr/j4eE2bNk1Lly5V8+bN1aRJEyUkJEiSFi5cqNjYWJWUlGj+/PmSpEcffVQTJ05UQkKCIiIi1KFDB8ePBgAAAAAANBrVhhIxMTGKiYmpsCw6Ovq6+3Xt2lWZmZkVlrVo0ULJycl26CIAAAAAALgV1eqLLgEAAAAAAOyNUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABiCUAIAAAAAABii2lDiwIED6tevn4KCgjRu3DgVFRXJbDYrICBAAwcO1JkzZyRJx48fV1BQkAICArR7925J0pUrVzRq1CgNGDBAS5cudfxIAAAAAABAo1JtKOHr66u0tDRlZGSoY8eO2rRpk5YtWyaLxaIFCxYoPj5ekjR79mytXLlSO3bs0Ny5cyVJK1asUHh4uLKyspSWlqazZ886fjQAAAAAAKDRqDaU8PHxUbNmzSRJ7u7uOnHihO699165u7urf//++vTTTyVJ586dk5+fnzw8POTt7a2LFy8qOztbYWFhkqTQ0FDt27fPwUMBAAAAAACNiWtt7vT1119r165dWrx4sb799lvb8pKSEklSaWmpbZmnp6dyc3OVl5cnDw+PCsvKM5vNMpvNkqScnJyGjQIAAAAAADQ6NYYSBQUFiomJUWJiokpKSlRQUGC7zcXFRZLk7PzvEy7y8/Pl7e0tLy8v
FRQUyMvLS/n5+frlL39Zod3IyEhFRkZKkqKiouwyGAAAAAAA0HhUe/lGcXGxxo4dq3nz5qlz587y8/PTP/7xDxUWFio7O1v+/v6Srl3mcerUKV26dEm5ublq3bq1AgIClJqaKklKTU1V3759HT8aAAAAAADQaFR7psR7772n/fv3Kz4+XvHx8Zo2bZpmzJghk8mkpk2bavXq1ZKkhQsXKjY2ViUlJZo/f74k6dFHH9XEiROVkJCgiIgIdejQwfGjAQAAAAAAjUa1oURMTIxiYmKuWx4dHV3h/65duyozM7PCshYtWig5OdkOXQQAAAAAALeiai/fAAAAAAAAcBRCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYAhCCQAAAAAAYIhqQ4n8/Hz16dNHLVq00LFjxyRJfn5+MplMMplMSklJkSQdP35cQUFBCggI0O7duyVJV65c0ahRozRgwAAtXbrUwcMAAAAAAACNTbWhxB133KGtW7dqzJgxtmWenp6yWCyyWCwKDQ2VJM2ePVsrV67Ujh07NHfuXEnSihUrFB4erqysLKWlpens2bMOHAYAAAAAAGhsqg0l3Nzc1KZNmwrLLl++rODgYI0fP165ubmSpHPnzsnPz08eHh7y9vbWxYsXlZ2drbCwMElSaGio9u3b56AhAAAAAACAxqjO3ymxd+9epaena8iQIZo3b54kqbS01Ha7p6encnNzlZeXJw8PjwrLyjObzYqKilJUVJRycnIaMgYAAAAAANAI1TmUaNWqlSRpzJgxOnLkyLVGnP/dTH5+vry9veXl5aWCgoIKy8qLjIxUUlKSkpKS5OvrW+8BAAAAAACAxsm1LncuLCyU1WpVkyZNlJmZqU6dOkmSfHx8dOrUKd15553Kzc1V69atFRAQoNTUVD3yyCNKTU3V22+/7ZAB4Mbr+OxWw2qfXjzMsNoAAAAAAPuqMZQIDw/X4cOHdeLECT300ENKSkpS8+bN1aRJEyUkJEiSFi5cqNjYWJWUlGj+/PmSpEcffVQTJ05UQkKCIiIi1KFDB8eOBAAAAAAANCo1hhLbtm2r8H9cXNx19+natasyMzMrLGvRooWSk5Mb2D0AAAAAAHCrqvN3SgAAAAAAANgDoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADAEoQQAAAAAADCEq9EdAOqi47NbDat9evEww2oDAAAAwK2o2jMl8vPz1adPH7Vo0ULHjh2TJJnNZgUEBGjgwIE6c+aMJOn48eMKCgpSQECAdu/eLUm6cuWKRo0apQEDBmjp0qUOHgYAAAAAAGhsqg0l7rjjDm3dulVjxoyRJBUXF2vZsmWyWCxasGCB4uPjJUmzZ8/WypUrtWPHDs2dO1eStGLFCoWHhysrK0tpaWk6e/asg4cCAAAAAAAak2pDCTc3N7Vp08b2/8mTJ3XvvffK3d1d/fv316effipJOnfunPz8/OTh4SFvb29dvHhR2dnZCgsLkySFhoZq3759DhwGAAAAAABobOr0nRJ5eXny8PCw/V9SUiJJKi0ttS3z9PRUbm5uhfuWLSvPbDbLbDZLknJycurXewAAAAAA0GjVKZTw8vJSQUGB7X8XFxdJkrPzv0+4yM/Pl7e3t+2+Xl5eys/P1y9/+csKbUVGRioyMlKSFBUVVe8BAAAAAACAxqlOPwnq5+enf/zjHyosLFR2drb8/f0lST4+Pjp16pQuXbqk3Nxc
tW7dWgEBAUpNTZUkpaamqm/fvvbvPQAAAAAAaLRqPFMiPDxchw8f1okTJ/T4449rxowZMplMatq0qVavXi1JWrhwoWJjY1VSUqL58+dLkh599FFNnDhRCQkJioiIUIcOHRw7EgAAAAAA0Kg4Wa1Wq9GdiIqKUlJSktHdqLOOz241ugu4TZxePMzoLgAAAABAg1T23r9Ol28AAAAAAADYC6EEAAAAAAAwBKEEAAAAAAAwBKEEAAAAAAAwBKEEAAAAAAAwBKEEAAAAAAAwhKvRHQBQMyN/fpafIwUAAADgKJwpAQAAAAAADEEoAQAAAAAADMHlGwCqxaUjAAAAAByFMyUAAAAAAIAhCCUAAAAAAIAhCCUAAAAAAIAhCCUAAAAAAIAhCCUAAAAAAIAhCCUAAAAAAIAhCCUAAAAAAIAhXI3uAABUpeOzWw2rfXrxMMNqAwAAALeLOocSp0+fVu/evdWtWzdJktlslsVi0V//+lc1a9ZMq1evVocOHXT8+HFNmTJFxcXFio+P18CBA+3eeQBwFCMDkdsVQRAAAMDtp15nSgQHB2vDhg2SpOLiYi1btkzp6en66KOPFB8frzfffFOzZ8/WypUr1bZtWw0dOpRQAgAAAAAAVFCvUGLv3r0KDAxUYGCgYmJidO+998rd3V39+/fXrFmzJEnnzp2Tn5+fJMnb21sXL15U69atbW2YzWaZzWZJUk5OTkPHAQAAAAAAGpk6f9Glj4+PvvzyS2VkZOhf//qXNm7cKA8PD9vtJSUlkqTS0lLbMk9PT+Xm5lZoJzIyUklJSUpKSpKvr299+w8AAAAAABqpOocSTZo0UfPmzeXk5KRRo0bpyJEjKigosN3u4uJyrWHnfzedn58vb29vO3QXAAAAAADcKup8+calS5f0H//xH5KkzMxMDRs2TG+88YYKCwt18OBB+fv7S7p2RsWpU6d05513Kjc3t8KlGwAA/By/tgIAAHD7qXMokZWVpTlz5uiOO+7Q3Xffrfj4eDVt2lQmk0lNmzbV6tWrJUkLFy5UbGysSkpKNH/+fLt3HAAAeyEQAQAAMEadQ4mhQ4dq6NChFZZFR0crOjq6wrKuXbsqMzOzYb0DAAAAAAC3rHr9+gYAALAPztIAAAC3M0IJAABuUwQiAADAaIQSAADghiMQAQAAEqEEAAC4zRgZiBiJMAYAcDMilAAAALgN3K5hzO2KEApAY0EoAQAAANxiuEQKQGNBKAEAAADAbm7Xs3IIY4D6cTa6AwAAAAAA4PbEmRIAAAAA0ECcIQLUD6EEAAAAAKBebtcwxki3WhDE5RsAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQhBIAAAAAAMAQDg0l4uLiFBgYqJiYGBUVFTmyFAAAAAAAaGQcFkocOXJEZ8+eVWZmprp06aINGzY4qhQAAAAAAGiEXB3VcHZ2tsLCwiRJQ4YM0apVqzRu3Djb7WazWWazWZJ08OBBRUVFOaorDtPHwNo5OTny9fWlNrWpTW1qU5va1KY2talNbWrfRrX79fuTYbUb6tSpU9cvtDrIwoULrR988IHVarVaT548aR03bpyjSt2WIiMjqU1talOb2tSmNrWpTW1qU5va1G7UHHb5hpeXlwoKCiRJ+fn58vb2dlQpAAAAAADQCLk8//zzzzuiYTc3N61du1YjR45UYmKi/P39df/99zui1G2rW7du1KY2talNbWpTm9rUpja1qU1tajdaTlar1eqoxv/whz/oww8/1F133aVVq1bJ3d3dUaUAAAAAAEAj49BQAgAAAAAAoCoO+04J2MfkyZON7sJNwah5MHL+b9fn/kaO+2aY49utDzfDeG8WzIVj3M7z6sixN5Z55XX7Gkf35WYaa5mbsU+OdLuNt4wR475RNW/X51QilLhp7d27Vw8++KC++uorBQcH6/3331diYqIKCwslSc8//7y2bNlicC8dr7J5MKpu
+fm/0bUrc/jwYfXp00dPP/20w/t0I1Q27hkzZujHH3+8IbXq49ixY4qNjbVbH3bu3KnevXvrz3/+c73atEcfbuSc17RPu3DhgubNm2f3vhitunn/4YcfZDKZNGjQoHq3f7se1NwM25RR7HnM8NZbb9XYdn0cPnxYBw4ckCSdPn1aY8aMqXMbs2bNksViqfS2yvppMpl0+fLlevW3Low6VqlLX8aOHavi4mKHtV+Vxx9/XJJksVj0xRdf2P6eNWtWg/tSXZ/K9qnl1ztHSE5O1r/+9S+HtV+Z2j4HsbGxOnbsmN3nWzLmdcaI7awhNU+fPq1du3Y5vE7Z89zoGfvjH6jMxYsXrf7+/tZz585ZrVartbCw0JqdnW0NDg62Xrp0yWq1Wq3z5s2z/v3vfzeymw5X1TwYVbf8/N/o2pVZtGiRdePGjQ7tz41yI59re9Y6evSo9eGHH7ZbH6ZMmWL9+OOPa3x8aWmptbS0tM51a9MHR2Cf9m81zXt2drb1d7/7Xb3azsrKsppMJmtwcLA1KCjIumHDhkrvt3379ltm31GmLttUSUmJEV10GHtvXz179qyx7fpYtWqV9ZVXXrFarVbrV199ZR09enSd23j66aete/bsuW55Y3nddrSq+pKenm7905/+5LD2a6P8Orhnzx7r008/3eD+1KZP5de7hqps3/Hwww9bjx49Wu/H11VdnoOyvtlzvit7nblZt7OGHis1dNuu7bw3tE5d1sGbGWdK3IS2bdumkSNHysfHR9K1XzKRrn3KMHToUC1btkyStH79eoWHhys4ONj2yeaiRYsUHBysoKAgHT161JgB2Ell89CvXz/9+c9/lslk0n/9138pJSXlhtSVKs7/mTNnNGjQIAUFBemJJ55waO1+/fppz5496tu3r/r27av//d//1eeff64333xTc+fOve5TrcaoqnGXfcq1adMm9enTRw8++KCWL1/ukFqVrVexsbGaOnWqQkND9dBDD8lqtaq4uFhRUVEaNGiQ/vrXv9qtDz/++KM2bdqkKVOmaPPmzdc952X9mT59usLCwnTx4kWHzEPZnCcmJmr06NEaPny4evfurfPnz9u1llTzPq38p6mTJ09WYGCgTCaTTp8+3YCRX+/DDz/Ur3/9az344INy0A9S2dQ070899ZQ2btyo3/72t3Vq97vvvtNvf/tbvfvuu7JYLEpNTVW7du0qve+QIUM0cuTIBo+loco+QbWH2mxTJpNJzzzzjAYPHqyCggKNGDFCwcHBGjt2rAoLC2WxWDR48GCNHDlS3bt31/r16zV48GD16dNH3333nd36am/13b5KS0s1aNAgBQcHKzQ0VAUFBVq+fLlOnDghk8mktLS0atfXmTNnqm/fvnr++ef15JNPqlevXnrppZckSf/85z81ePBgmUwm/f73v5ckLV++XH/7298UFhYmSTp//ryio6N1//33Ky0tTZIq3e8dOXJEvXv3VkREhD799NNaz0G/fv0kSc8995yCgoL01FNPSZKuXr2qiRMnKiQkRCNGjLD9fL09578uryn2VFVfkpOTFRoa6pD2u3fvrmHDhkmSJk2apAULFkiSTCaTJKlXr1768ccflZiYqOeee06TJk2SdO1Mw7JtrSGf8ta0T/35eldXFotFw4cP18iRI/XGG28oMDBQAQEBeu+99/TVV19px44dmjx5sp555hklJibq1VdflSRt2bLF9nrStWtXTZ48WTNnzmzwOmDUsbFUt9cZe6vt8bFkv2OlmvZ/5d8DVPaasnz5cq1fv14mk0m5ubl1qvOrX/1KERERtvsMGjRI+fn51T7P5c+IKX8m744dOyqstzcrQomb0Llz52wrZlpamkwmk1544QU98MAD2r59u2bOnClJ8vPz07Zt29S3b1+lpKTo2LFjOnHihNLT07Vu3TrNmTPHyGE0WGXzMGLECE2fPl0Wi0U7duzQn/70pxtS9+fzv3jxYs2aNUsZGRn68ccflZGR4bDaI0aM
0HPPPactW7YoMzNTL7/8su6++27FxsbqhRde0JQpU+xS20hVjbvMhg0blJiYqD179jT4jUxd16uAgAClpKSoSZMmOnr0qJKTk9WpUyelpqaqd+/eduvDSy+9pCFDhmjVqlWVPudlwWPZC1GbNm0cMg/leXp66u9//7seeeQRmc1mu9aqzT6tTFFRkU6cOKGMjAxZLBbddddd9e5LZbZu3ap58+Zpz549mjt3rl3b/rma5n3p0qWKjo7W66+/Xqd2qzp42rVrl3r06KHIyEgFBQXp9OnTtgPnDRs2aMmSJZKky5cvKyQkRJKUmJhoO4Ape7NY2UFYQ7355pt2aUeq3TYlSYMHD1ZKSoreeusthYeHKz09Xd26ddO6deskSaWlpfrggw80ffp0rVu3Tjt37tSECRO0adMmu/XV3uq7fTk7O2vz5s1KT09XeHi41q9fr2nTpqlz586yWCwKCQmpdn0dPXq0srOztWLFCv3mN7/Rhx9+qHfeeUeS9Oyzz+r111+XxWLR1atXdfDgQU2bNk1PPfWU7XTmixcvau3atUpKSrK9katsvzdnzhytWbNGmzdvrvJSjOr6OXz4cGVkZOibb77Rxx9/rBUrVigkJERpaWmaMGFCg4P9hr6m2FNVfTl+/Lh+9atfOaT9sWPH6urVqyopKdFPP/2kzz77TGfOnKmwn27WrJntmKXsjWNRUZE++OADLV68WAkJCXbtU/l96s/Xu/rIz8/Xxo0btXbtWu3evVuZmZl69dVXddddd9n2MUuXLq3y8WfOnNGyZctsoV1D1gGjjo2luoV/o0aN0pkzZyRduyTs7bffblDt2h4f2/NYqbp166GHHlJGRoYOHTqk/Pz8Sl9Tpk2bpujoaFksFnl7e9epzmOPPaaSkhJ99913OnPmjDw8POTp6Vnn59lqtSo+Pr7CeltSUlLvOXEkQombULt27XT27FlJUkhIiCwWi86dO3fd/Xr06CFJ8vX1VV5enj7//HNlZ2fLZDJp/PjxN+Q6Skeqah7eeecdBQUFKSoqqkGf3ta1bnlffvml7Q1p7969dfLkSYfWLikpUevWreXm5qZOnTpVuj40ZjXN+f/8z//opZdeUkxMTIOvDa3revXz7ezLL79Uz549JaneoURt1rGqnh6/FJQAAAgaSURBVPP61qxPH34+dkfWqq6em5ubpk+frpiYGD311FP64Ycf6t2XykyfPl3btm3ThAkTtGPHDru2/XO1nYu6qurgae7cudq9e7fWrFmjnJycCo8ZNmyYtm3bJknavHmzRowYoe+++07r1q1TRkaGUlJSbJ96StcfhN1MajuvZdtPVftwf39/W3tlf7dv375B6395y5Ytk8lksut3XNR3+7p8+bIee+wxBQcHKyEhodLHVNe2v7+/nJ2d9Ytf/ELdu3eXq6ur7SyN48eP6ze/+Y1MJpMOHDhge3NS3n333SdXV9cK23tl+70LFy6oc+fOcnZ2tu1769LP8vvrkydP6vPPP9fy5ctlMpn08ssvN/iss4a+ptiTo/YvNbXfo0cPbdq0SR07dpSLi4vS0tIUGBhYbVsPPPCApBv3+tIQvXr10rfffqsvvvhCYWFhGjhwoL7//nt9++23Fe7n5ORk+7v8GRCdOnVSy5Ytbf83ZB0w6thYqlv4N3HiRL377ruSpPfff1+RkZENql3X42N7HCtVt26VPYft27fX999/36D3BVXVGT16tN5//32ZzWZFRUVJUrXPc2XrX23W25sFocRNKDw8XB988IFtxS/7YiI3N7cK6dbPV74uXbooODhYFovFlqI1ZlXNwyuvvKI9e/Zo/fr1dj/1sbq65ee/U6dOtjfHH330kfz8/Bxa29nZWRcvXlRRUZFOnjx5w06XK1PZAaU9VTXuMr6+vnrrrbe0ZMkSzZ492yG1qlqvfr6dderUSZ988okk6eDBg3btQ3lVPefOzvbZbdemD1UdYNmrVk37tDIlJSWKiorSmjVr1LZtW23c
uLHefamMp6enXn31Va1atUpxcXF2bfvnajPv9VHdAZu3t7eaNGmi++67r8JjmjVrprvuuktffPGFNmzYoKioKJ06dUqfffaZHnzwQQ0bNqzCwcvPD8JuJrWd17Ltp6p9ePl10F7rf3kzZ86UxWLRH/7wB7u0J9V/+9q5c6fuvvtupaenKzY21jbG8verbl6rmitJ6ty5s1avXi2LxaKDBw8qIiKiVtt7Zfu9tm3b6uTJk7Jarfr444/rNAeSKuyvO3XqpC5duuh3v/udLBaL9u7dq/j4+CpmtnYa+ppiT1X1pXPnzvrnP//psPYDAwO1cOFCBQYGqkePHvrb3/52XShR2/29vfpUVd36cHZ2VuvWrdWlSxft2rVLFotFhw8f1i9+8YsK7bds2dJ2vHTkyJEKjy+vIWM36thYqlv4FxERoe3bt+v//u//5OnpKS8vrwbVruvxsT2OlWq7/ys7Nvz5a0pt172q6owePVobN27U1q1bNXz4cEnVP8+VrX9Vrbc3I1ejO4DrtWrVSm+88YbGjx8vJycnOTs7a8aMGcrLy1NUVJRGjx5d6eP8/f3l5+en4OBgOTs7KzQ0tMFv4Cpz4cIFLV++XPPnz7d72+VVNQ9ZWVkaMGCA+vbtqxYtWtywuuXnPy4uTg8//LAWLVqk++67T0FBQQ6t7ePjo2HDhsnJyUlPPPGEmjVrZpd6tVFcXKxx48YpMzPTYTWqGveKFSskSfPnz9e+fftUWFioJ5980iG1artePfTQQ1q3bp0GDhyo//zP/7RrH1JTU233WbRokUOf85rm/EbUqmmfVubSpUv67//+bzk5OcnJyUlr1661a//efPNNbdy4UcXFxfX+NZXactS8h4eHy2QyaerUqWrXrp3toMbFxUV5eXlq3ry5Pvvss+seFx0drbfeeks//PCD2rVrJ3d3d/n7+2vLli1ycnJSUVGR7b72fjP1+OOP2+0SjtpsU+U99thjmjBhgtatW6e2bdsqLi5O2dnZdunLjVbf7atv375atGiRPvnkE7Vt29Z2un3nzp01evRozZw5U/3796/X+rpkyRJNnTpVV69elYuLixISEtSvXz9NmjRJ+/fv16JFiyp9XGX7vfj4eI0fP1533nlnhU+bazMHK1as0Pbt27VgwQJ1795dPXv2VLdu3TRlyhStWrVKkvT000/bvhOhPow6VqlLX3x9fZWSkqI+ffo4pP0BAwboyJEjGjBggNq0aaMXX3xRXbp0qfDYkJAQxcXFKS0tza7faVPTPrX8elf2yX19ODs7a86cOQoNDZWzs7PatGmjpKQkDR06VDNmzNCgQYP0+9//Xi+++KLCw8PVvn17tW/f3l7DtDFyfavqdUa6Fv4NGjRIBw8elMlkkru7u7p166a4uDhNmDChwbWNOD6uy+t1Za8pV69e1XPPPafIyEi9/fbbVQYzVdVp2bKlmjRpIm9vbzVv3lySNGDAgCqf5/vvv18//PCDQkNDbR9CVLXe3oycrI6K0wDcEg4cOKAjR47oscceM7orAKqxd+9e/fGPf7Qd1EyePFl33nmnnn32Wd199906f/68zGazUlJSdPnyZT3xxBMqKipS+/bttWDBAk2dOlXStdND3377bbm4uOj+++/Xyy+/LJPJpC1btqhFixYaM2aMXnzxRXXs2NHYAQOotejoaK1du1aurnweifqr7HVmxYoV6tmzpz766CN1795dr7zyiqRrZwwMGTJEFy5csF3aBVSFUAIAgFtUUVGR3Nzc9NNPP6l379765JNP5OLiYnS3AAC3uEOHDmnVqlW2L7IFqkNcCgDALSo5OVmvvfaaCgoKNGPGDAIJAIDDJScna/HixXa/3BK3Ls6UAAAAAAAAhuDXNwAAAAAAgCEIJQAAAAAAgCEIJQAAAAAAgCEIJQAAAAAAgCH+H+YrQ46jt0SiAAAAAElFTkSuQmCC
" />
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">plot_hist</span><span class="p">(</span><span class="n">lens</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">int</span><span class="p">],</span> <span class="n">n_bins</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="mi">50</span><span class="p">):</span>
<span class="sd">'''</span>
<span class="sd"> Plot a histogram of the given list of token counts</span>
<span class="sd"> :param lens: the list of token counts to plot</span>
<span class="sd"> :param n_bins: the number of bins to sort the token counts into</span>
<span class="sd"> '''</span>
<span class="n">n</span><span class="p">,</span> <span class="n">bins</span><span class="p">,</span> <span class="n">patches</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">lens</span><span class="p">,</span> <span class="n">n_bins</span><span class="p">,</span> <span class="n">facecolor</span><span class="o">=</span><span class="s1">'blue'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now, let's look at the distribution of method and comment lengths.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="n">mean</span><span class="p">(</span><span class="n">mthd_lens</span><span class="p">),</span> <span class="n">median</span><span class="p">(</span><span class="n">mthd_lens</span><span class="p">),</span> <span class="n">stdev</span><span class="p">(</span><span class="n">mthd_lens</span><span class="p">))</span>
<span class="n">plot_hist</span><span class="p">(</span><span class="n">mthd_lens</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">mean</span><span class="p">(</span><span class="n">cmt_lens</span><span class="p">),</span> <span class="n">median</span><span class="p">(</span><span class="n">cmt_lens</span><span class="p">),</span> <span class="n">stdev</span><span class="p">(</span><span class="n">cmt_lens</span><span class="p">))</span>
<span class="n">plot_hist</span><span class="p">(</span><span class="n">cmt_lens</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>177 102.0 283.76574846164925
</pre>
</div>
</div>
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAYQAAAD4CAYAAADsKpHdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAO2klEQVR4nO3db6yed13H8ffHlg0FwlpXm7ZrbDGNSX3gmCejBGKm065rjJsJIVuMq3OmRkcCamI6eTCFJ2gUZREHFSrDwMbkj2uW6ayVhPiAsVOdW/en9mww1z9bC8OhkhgGXx/c3wM3pYeennP33Ken71dy576u7/W7rvP73b+zfXb9uc9SVUiS9APj7oAkaXEwECRJgIEgSWoGgiQJMBAkSW35uDvw/Vx66aW1YcOGcXdDks4rBw4c+HJVrTrb/RZ1IGzYsIHJyclxd0OSzitJnp3Lfl4ykiQBBoIkqRkIkiTAQJAkNQNBkgQYCJKkZiBIkgADQZLUDARJErDIv6k8X2vXnr5+7NjC9kOSzgeeIUiSAANBktQMBEkSYCBIkpqBIEkCDARJUjMQJEmAgSBJagaCJAkwECRJzUCQJAEGgiSpGQiSJMBAkCQ1A0GSBBgIkqRmIEiSAANBktQMBEkSYCBIkpqBIEkCDARJUjMQJEmAgSBJagaCJAkwECRJzUCQJAEGgiSpnTEQkqxP8tkkTyR5PMnbu74yyb4kh/t9RdeT5I4kU0keTXLF0LF2dPvDSXacu2FJks7WbM4QXgZ+t6o2A1uAW5NsBnYB+6tqE7C/1wGuBTb1aydwJwwCBLgdeANwJXD7dIhIksbvjIFQVcer6l97+b+BJ4F1wHXAXd3sLuD6Xr4O+GgNfB64JMka4BpgX1W9WFVfBfYB20Y6GknSnJ3VPYQkG4DXAw8Bq6vqeG96Hljdy+uA54Z2O9K1meqn/oydSSaTTJ48efJsuidJmodZB0KSVwOfAt5RVV8b3lZVBdQoOlRVu6tqoqomVq1aNYpDSpJmYVaBkOQVDMLgY1X16S6/0JeC6PcTXT8KrB/a/bKuzVSXJC0Cs3nKKMCHgSer6r1Dm/YC008K7QDuG6rf1E8bbQFe6ktLDwJbk6zom8lbuyZJWgSWz6LNm4BfAR5L8kjXfh94D3BvkluAZ4G39rYHgO3AFPB14GaAqnoxybuBh7vdu6rqxZGMQpI0b2cMhKr6FyAzbL76NO0LuHWGY+0B9pxNByVJC8NvKkuSAANBktQMBEkSYCBIkpqBIEkCDARJUjMQJEmAgSBJagaCJAkwECRJzUCQJAEGgiSpGQiSJMBAkCQ1A0GSBBgIkqRmIEiSAANBktQMBEkSYCBIkpqBIEkCDARJUjMQJEmAgSBJagaCJAkwECRJzUCQJAEGgiSpGQiSJMBAkCQ1A0GSBBgIkqRmIEiSAANBktTOGAhJ9iQ5keTgUO0PkhxN8ki/tg9tuy3JVJJDSa4Zqm/r2lSSXaMfiiRpPmZzhvARYNtp6n9WVZf36wGAJJuBG4Cf6H3+MsmyJMuA9wPXApuBG7utJGmRWH6mBlX1uSQbZnm864B7qur/gC8mmQKu7G1TVfUMQJJ7uu0TZ91jSdI5MZ97CG9L8mhfUlrRtXXAc0NtjnRtprokaZGYayDcCfwYcDlwHPjTUXUoyc4kk0kmT548OarDSpLOYE6BUFUvVNU3q+pbwF/xnctCR4H1Q00v69pM9dMde3dVTVTVxKpVq+bSPUnSHMwpEJKsGVr9JWD6CaS9wA1JLk6yEdgEfAF4GNiUZGOSixjceN47925LkkbtjDeVk9wNXAVcmuQIcDtwVZLLgQK+BPwGQFU9nuReBjeLXwZurapv9nHeBjwILAP2VNXjIx+NJGnOUlXj7sOMJiYmanJycs77r117+vqxY3M+pCQtekkOVNXE2e7nN5UlSYCBIElqBoIkCTAQ
JEnNQJAkAQaCJKkZCJIkwECQJDUDQZIEGAiSpGYgSJIAA0GS1AwESRJgIEiSmoEgSQIMBElSMxAkSYCBIElqBoIkCTAQJEnNQJAkAQaCJKkZCJIkwECQJDUDQZIEGAiSpGYgSJIAA0GS1AwESRJgIEiSmoEgSQIMBElSMxAkSYCBIElqBoIkCZhFICTZk+REkoNDtZVJ9iU53O8rup4kdySZSvJokiuG9tnR7Q8n2XFuhiNJmqvZnCF8BNh2Sm0XsL+qNgH7ex3gWmBTv3YCd8IgQIDbgTcAVwK3T4eIJGlxOGMgVNXngBdPKV8H3NXLdwHXD9U/WgOfBy5Jsga4BthXVS9W1VeBfXxvyEiSxmiu9xBWV9XxXn4eWN3L64Dnhtod6dpM9e+RZGeSySSTJ0+enGP3JElna943lauqgBpBX6aPt7uqJqpqYtWqVaM6rCTpDOYaCC/0pSD6/UTXjwLrh9pd1rWZ6pKkRWKugbAXmH5SaAdw31D9pn7aaAvwUl9aehDYmmRF30ze2jVJ0iKx/EwNktwNXAVcmuQIg6eF3gPcm+QW4Fngrd38AWA7MAV8HbgZoKpeTPJu4OFu966qOvVGtSRpjM4YCFV14wybrj5N2wJuneE4e4A9Z9U7SdKC8ZvKkiTAQJAkNQNBkgQYCJKkZiBIkgADQZLUDARJEmAgSJKagSBJAgwESVIzECRJgIEgSWoGgiQJMBAkSc1AkCQBBoIkqRkIkiTAQJAkNQNBkgQYCJKkZiBIkgADQZLUDARJEmAgSJKagSBJAgwESVIzECRJgIEgSWoGgiQJMBAkSc1AkCQBBoIkqRkIkiTAQJAkNQNBkgTMMxCSfCnJY0keSTLZtZVJ9iU53O8rup4kdySZSvJokitGMQBJ0miM4gzhZ6rq8qqa6PVdwP6q2gTs73WAa4FN/doJ3DmCny1JGpFzccnoOuCuXr4LuH6o/tEa+DxwSZI15+DnS5LmYL6BUMA/JjmQZGfXVlfV8V5+Hljdy+uA54b2PdK175JkZ5LJJJMnT56cZ/ckSbO1fJ77v7mqjib5EWBfkqeGN1ZVJamzOWBV7QZ2A0xMTJzVvpKkuZvXGUJVHe33E8BngCuBF6YvBfX7iW5+FFg/tPtlXZMkLQJzDoQkr0rymullYCtwENgL7OhmO4D7enkvcFM/bbQFeGno0pIkaczmc8loNfCZJNPH+XhV/UOSh4F7k9wCPAu8tds/AGwHpoCvAzfP42fPy9q1p68fO7aw/ZCkxWTOgVBVzwA/eZr6V4CrT1Mv4Na5/jxJ0rnlN5UlSYCBIElqBoIkCTAQJEnNQJAkAQaCJKkZCJIkwECQJDUDQZIEGAiSpGYgSJIAA0GS1AwESRJgIEiSmoEgSQIMBElSMxAkSYCBIElqBoIkCTAQJEnNQJAkAQaCJKkZCJIkwECQJDUDQZIEGAiSpGYgSJIAWD7uDiwma9eevn7s2ML2Q5LGwTMESRJgIEiSmoEgSQIMBElS86byLHizWdKFwDMESRJgIEiS2oJfMkqyDXgfsAz4UFW9Z6H7MCpeSpK0lCzoGUKSZcD7gWuBzcCNSTYvZB8kSae30GcIVwJTVfUMQJJ7gOuAJxa4H+fUTGcOZ8szDUkLaaEDYR3w3ND6EeANww2S7AR29ur/JDk0x591KfDlOe67KCRz3vW8H/scOe4Li+Oe2Y/O5cCL7rHTqtoN7J7vcZJMVtXECLp03rlQx+64LyyOe/QW+imjo8D6ofXLuiZJGrOFDoSHgU1JNia5CLgB2LvAfZAkncaCXjKqqpeTvA14kMFjp3uq6vFz9OPmfdnpPHahjt1xX1gc94ilqs7VsSVJ5xG/qSxJAgwESVJbkoGQZFuSQ0mmkuwad3/mK8n6JJ9N8kSSx5O8vesrk+xLcrjfV3Q9Se7o8T+a5IqhY+3o9oeT7BjXmM5GkmVJ/i3J/b2+MclDPb5P9AMKJLm416d6+4ahY9zW9UNJrhnPSGYvySVJ
PpnkqSRPJnnjhTDfSX67f8cPJrk7ySuX6nwn2ZPkRJKDQ7WRzXGSn0ryWO9zRzKLbzZV1ZJ6MbhZ/TTwOuAi4N+BzePu1zzHtAa4opdfA/wHgz/98cfArq7vAv6ol7cDfw8E2AI81PWVwDP9vqKXV4x7fLMY/+8AHwfu7/V7gRt6+QPAb/bybwEf6OUbgE/08ub+PbgY2Ni/H8vGPa4zjPku4Nd7+SLgkqU+3wy+uPpF4AeH5vlXl+p8Az8NXAEcHKqNbI6BL3Tb9L7XnrFP4/5QzsGH/EbgwaH124Dbxt2vEY/xPuDngUPAmq6tAQ718geBG4faH+rtNwIfHKp/V7vF+GLwXZX9wM8C9/cv95eB5afON4On197Yy8u7XU79HRhutxhfwGv7X4w5pb6k55vv/CWDlT1/9wPXLOX5BjacEggjmePe9tRQ/bvazfRaipeMTvfnMdaNqS8j16fFrwceAlZX1fHe9Dywupdn+gzOx8/mz4HfA77V6z8M/FdVvdzrw2P49vh6+0vd/nwb90bgJPDXfansQ0lexRKf76o6CvwJ8J/AcQbzd4ClP9/DRjXH63r51Pr3tRQDYclK8mrgU8A7quprw9tq8J8BS+oZ4iS/AJyoqgPj7ssCW87gUsKdVfV64H8ZXD74tiU63ysY/LHLjcBa4FXAtrF2aozGMcdLMRCW5J/HSPIKBmHwsar6dJdfSLKmt68BTnR9ps/gfPts3gT8YpIvAfcwuGz0PuCSJNNfqhwew7fH19tfC3yF82/cR4AjVfVQr3+SQUAs9fn+OeCLVXWyqr4BfJrB78BSn+9ho5rjo718av37WoqBsOT+PEY/HfBh4Mmqeu/Qpr3A9FMFOxjcW5iu39RPJmwBXurT0AeBrUlW9H+Nbe3aolRVt1XVZVW1gcE8/nNV/TLwWeAt3ezUcU9/Hm/p9tX1G/qplI3AJgY33BalqnoeeC7Jj3fpagZ/In5JzzeDS0VbkvxQ/85Pj3tJz/cpRjLHve1rSbb0Z3nT0LFmNu6bKufoRs12Bk/iPA28c9z9GcF43szg1PFR4JF+bWdwvXQ/cBj4J2Bltw+D/xHR08BjwMTQsX4NmOrXzeMe21l8BlfxnaeMXsfgH/Ap4G+Bi7v+yl6f6u2vG9r/nf15HGIWT1uM+wVcDkz2nP8dgydIlvx8A38IPAUcBP6GwZNCS3K+gbsZ3Cv5BoOzwltGOcfARH+OTwN/wSkPKZzu5Z+ukCQBS/OSkSRpDgwESRJgIEiSmoEgSQIMBElSMxAkSYCBIElq/w+NoUIDZ56FBwAAAABJRU5ErkJggg==
" />
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>17 12.0 19.77371993328519
</pre>
</div>
</div>
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAO6klEQVR4nO3cf6zddX3H8edrVMEfGYVy09AfWTE0M2SZym6whmUxdHPAjOUPdBIjjWnSf9iGw0TLloxs+0eTRZRkIWsssyTGH0MTGmM0XcEs+0PkVhkCHePKxLYUetVStxmnne/9cT7FY7mXtvfce+6ln+cjOTmf7+fzOd/v53zSvs73fr7fc1JVSJL68GtLPQBJ0vgY+pLUEUNfkjpi6EtSRwx9SerIiqUewMu55JJLasOGDUs9DEl6Rdm/f/8PqmpitrZlHfobNmxgampqqYchSa8oSZ6Zq83lHUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6siy/kbuqNasmb3+2WfHOw5JWi4805ekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSPn9H36c/H+fUm98kxfkjpi6EtSRwx9SeqIoS9JHTH0Jakjpw39JPckOZrksaG6i5PsTfJUe76o1SfJXUmmkzya5Mqh12xt/Z9KsnVx3o4k6eWcyZn+p4FrT6nbAeyrqo3AvrYNcB2wsT22A3fD4EMCuAN4K3AVcMfJDwpJ0vicNvSr6l+AH51SvQXY3cq7gRuG6u+tgW8AK5NcCvwhsLeqflRVx4C9vPSDRJK0yOa7pr+6qo608nPA6lZeCxwc6neo1c1VL0kao5Ev5FZVAbUAYwEgyfYkU0mmZmZmFmq3kiTmH/rPt2Ub2vPRVn8YWD/Ub12rm6v+JapqZ1VNVtXkxMTEPIcnSZrNfEN/D3DyDpytwP1D9Te3u3g2AcfbMtDXgHckuahdwH1Hq5MkjdFpf3AtyWeBtwOXJDnE4C6cjwJfSLINeAZ4T+v+FeB6YBr4CfABgKr6UZK/BR5u/f6mqk69OCxJWmSnDf2qummOps2z9C3gljn2cw9wz1mNTpK0oPxGriR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSR0YK/SR/nuTxJI8l+WySC5JcluShJNNJPp/k1a3v+W17urVvWIg3IEk6c/MO/SRrgT8DJqvqt4DzgPcCHwPurKrLgWPAtvaSbcCxVn9n6ydJGqNRl3dWAK9JsgJ4LXAEuAa4r7XvBm5o5S1tm9a+OUlGPL4k6SzMO/Sr6jDwd8D3GYT9cWA/8EJVnWjdDgFrW3ktcLC99kTrv+rU/SbZnmQqydTMzMx8hydJmsUoyzsXMTh7vwxYA7wOuHbUAVXVzqqarKrJiYmJUXcnSRoyyvLO7wP/WVUzVfVz4EvA1cDKttwDsA443MqHgfUArf1C4IcjHF+SdJZGCf3vA5uSvLatzW8GngAeBG5sfbYC97fynrZNa3+gqmqE40uSztIoa/oPMbgg+y3gO21fO4GPALclmWawZr+rvWQXsKrV3wbsGGHckqR5WHH6LnOrqjuAO06pfhq4apa+PwXePcrxJEmj8Ru5ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdGSn0k6xMcl+Sf09yIMnb
klycZG+Sp9rzRa1vktyVZDrJo0muXJi3IEk6U6Oe6X8S+GpVvRF4E3AA2AHsq6qNwL62DXAdsLE9tgN3j3hsSdJZmnfoJ7kQ+D1gF0BV/ayqXgC2ALtbt93ADa28Bbi3Br4BrExy6bxHLkk6a6Oc6V8GzAD/mOTbST6V5HXA6qo60vo8B6xu5bXAwaHXH2p1vyLJ9iRTSaZmZmZGGJ4k6VSjhP4K4Erg7qp6C/A//HIpB4CqKqDOZqdVtbOqJqtqcmJiYoThSZJONUroHwIOVdVDbfs+Bh8Cz59ctmnPR1v7YWD90OvXtTpJ0pjMO/Sr6jngYJLfbFWbgSeAPcDWVrcVuL+V9wA3t7t4NgHHh5aBJEljsGLE1/8p8JkkrwaeBj7A4IPkC0m2Ac8A72l9vwJcD0wDP2l9JUljNFLoV9UjwOQsTZtn6VvALaMcT5I0Gr+RK0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUkRWj7iDJecAUcLiq3pnkMuBzwCpgP/D+qvpZkvOBe4HfAX4I/HFVfW/U4y+kNWtmr3/22fGOQ5IWy0Kc6d8KHBja/hhwZ1VdDhwDtrX6bcCxVn9n6ydJGqORQj/JOuCPgE+17QDXAPe1LruBG1p5S9umtW9u/SVJYzLqmf4ngA8Dv2jbq4AXqupE2z4ErG3ltcBBgNZ+vPX/FUm2J5lKMjUzMzPi8CRJw+Yd+kneCRytqv0LOB6qamdVTVbV5MTExELuWpK6N8qF3KuBdyW5HrgA+HXgk8DKJCva2fw64HDrfxhYDxxKsgK4kMEFXUnSmMz7TL+qbq+qdVW1AXgv8EBVvQ94ELixddsK3N/Ke9o2rf2Bqqr5Hl+SdPYW4z79jwC3JZlmsGa/q9XvAla1+tuAHYtwbEnSyxj5Pn2Aqvo68PVWfhq4apY+PwXevRDHkyTNj9/IlaSOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqyLxDP8n6JA8meSLJ40lubfUXJ9mb5Kn2fFGrT5K7kkwneTTJlQv1JiRJZ2aUM/0TwIeq6gpgE3BLkiuAHcC+qtoI7GvbANcBG9tjO3D3CMeWJM3DvEO/qo5U1bda+b+AA8BaYAuwu3XbDdzQyluAe2vgG8DKJJfOe+SSpLO2IGv6STYAbwEeAlZX1ZHW9BywupXXAgeHXnao1Z26r+1JppJMzczMLMTwJEnNilF3kOT1wBeBD1bVj5O82FZVlaTOZn9VtRPYCTA5OXlWr10sa9bMXv/ss+MdhySNaqQz/SSvYhD4n6mqL7Xq508u27Tno63+MLB+6OXrWp0kaUxGuXsnwC7gQFV9fKhpD7C1lbcC9w/V39zu4tkEHB9aBpIkjcEoyztXA+8HvpPkkVb3F8BHgS8k2QY8A7yntX0FuB6YBn4CfGCEY0uS5mHeoV9V/wpkjubNs/Qv4Jb5Hk+SNDq/kStJHTH0Jakjhr4kdcTQl6SOGPqS1JGRv5HbM7+pK+mVxjN9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjvgrm4vAX9+UtFx5pi9JHTH0Jakjhr4kdcTQl6SOeCF3jLzAK2mpeaYvSR0x9CWpIy7vLAMu+0gaF0N/GfPDQNJCG/vyTpJrkzyZZDrJjnEfX5J6NtYz/STnAX8P/AFwCHg4yZ6qemKc43il
m+svgJfjXweSYPzLO1cB01X1NECSzwFbAEN/kc3ng2IxLfaHkEtj0uzGHfprgYND24eAtw53SLId2N42/zvJk/M81iXAD+b52p4syTwl4z7iyMf139PpOUdnZhzz9BtzNSy7C7lVtRPYOep+kkxV1eQCDOmc5jydGefp9JyjM7PU8zTuC7mHgfVD2+tanSRpDMYd+g8DG5NcluTVwHuBPWMegyR1a6zLO1V1IsmfAF8DzgPuqarHF+lwIy8RdcJ5OjPO0+k5R2dmSecpVbWUx5ckjZG/vSNJHTH0Jakj51zo+zMPv5TkniRHkzw2VHdxkr1JnmrPF7X6JLmrzdujSa5cupGPV5L1SR5M8kSSx5Pc2uqdqyFJLkjyzST/1ubpr1v9ZUkeavPx+XaTBknOb9vTrX3DUo5/nJKcl+TbSb7ctpfNHJ1ToT/0Mw/XAVcANyW5YmlHtaQ+DVx7St0OYF9VbQT2tW0YzNnG9tgO3D2mMS4HJ4APVdUVwCbglvbvxrn6Vf8LXFNVbwLeDFybZBPwMeDOqrocOAZsa/23Acda/Z2tXy9uBQ4MbS+fOaqqc+YBvA342tD27cDtSz2uJZ6TDcBjQ9tPApe28qXAk638D8BNs/Xr7QHcz+D3oZyruefotcC3GHyj/gfAilb/4v9BBnfpva2VV7R+Weqxj2Fu1jE4SbgG+DKQ5TRH59SZPrP/zMPaJRrLcrW6qo608nPA6lZ27oD25/VbgIdwrl6iLVs8AhwF9gLfBV6oqhOty/BcvDhPrf04sGq8I14SnwA+DPyiba9iGc3RuRb6Ogs1OL3wnt0myeuBLwIfrKofD7c5VwNV9X9V9WYGZ7NXAW9c4iEtK0neCRytqv1LPZa5nGuh7888nN7zSS4FaM9HW33Xc5fkVQwC/zNV9aVW7VzNoapeAB5ksFSxMsnJL3oOz8WL89TaLwR+OOahjtvVwLuSfA/4HIMlnk+yjOboXAt9f+bh9PYAW1t5K4P165P1N7c7UzYBx4eWNs5pSQLsAg5U1ceHmpyrIUkmkqxs5dcwuO5xgEH439i6nTpPJ+fvRuCB9hfTOauqbq+qdVW1gUH+PFBV72M5zdFSX/RYhIso1wP/wWCt8S+XejxLPBefBY4AP2ewjriNwXrhPuAp4J+Bi1vfMLjz6bvAd4DJpR7/GOfpdxks3TwKPNIe1ztXL5mn3wa+3ebpMeCvWv0bgG8C08A/Aee3+gva9nRrf8NSv4cxz9fbgS8vtznyZxgkqSPn2vKOJOllGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI/8PZyVyMK7W8XgAAAAASUVORK5CYII=
" />
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Using this new information on the length distribution, we can remove outliers by filtering out methods and comments whose lengths fall outside the 95th percentile (chosen for completely arbitrary reasons)!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">filter_len</span><span class="p">(</span>
<span class="n">row</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">:</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">mthd_len</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">cmt_len</span><span class="p">:</span> <span class="nb">int</span>
<span class="p">)</span> <span class="o">-></span> <span class="nb">bool</span><span class="p">:</span>
<span class="sd">'''</span>
<span class="sd"> Determine whether a given pandas dataframe row's method and comment both</span>
<span class="sd"> fall under the given max token lengths</span>
<span class="sd"> :param row: the row to check if it has a method or comment that is too long</span>
<span class="sd"> :param tokenizer: the tokenizer to tokenize a method or comment</span>
<span class="sd"> :param mthd_len: the max number of tokens a method can have</span>
<span class="sd"> :param cmt_len: the max number of tokens a comment can have</span>
<span class="sd"> :returns: whether the given row's method and comment both have fewer</span>
<span class="sd"> tokens than the max lengths</span>
<span class="sd"> '''</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">mthd</span><span class="p">))</span> <span class="o"><</span> <span class="n">mthd_len</span> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">cmt</span><span class="p">))</span> <span class="o"><</span> <span class="n">cmt_len</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="p">[</span><span class="n">df_trn</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span>
<span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">filter_len</span><span class="p">(</span>
<span class="n">row</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">max_mthd_len</span><span class="p">,</span>
<span class="n">max_cmt_len</span>
<span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span>
<span class="p">)]</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="p">[</span><span class="n">df_val</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span>
<span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">filter_len</span><span class="p">(</span>
<span class="n">row</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">max_mthd_len</span><span class="p">,</span>
<span class="n">max_cmt_len</span>
<span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span>
<span class="p">)]</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="p">[</span><span class="n">df_tst</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span>
<span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="n">filter_len</span><span class="p">(</span>
<span class="n">row</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">max_mthd_len</span><span class="p">,</span>
<span class="n">max_cmt_len</span>
<span class="p">),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span>
<span class="p">)]</span>
<span class="nb">len</span><span class="p">(</span><span class="n">df_trn</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_val</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_tst</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>(2809, 88, 193)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">max_mthd_len</span><span class="p">,</span> <span class="n">max_cmt_len</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>(559, 48)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>We could explore our data a lot more; the above was just the bare minimum. As an exercise, I suggest you explore the data on your own using whatever means necessary!</p>
</div>
</div>
</div>
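One quick check worth doing, for example, is looking at token-length distributions. The toy dataframe below is a hypothetical stand-in for <code>df_trn</code> (same <code>mthd</code> and <code>cmt</code> columns), and a whitespace split is used as a rough proxy for the subword tokenizer above:

```python
import pandas as pd

# Hypothetical toy frame standing in for df_trn (columns `mthd`, `cmt` as above).
df = pd.DataFrame({
    'mthd': ['public void foo ( ) { }', 'int bar ( int x ) { return x ; }'],
    'cmt': ['does nothing', 'returns its argument unchanged'],
})

# Whitespace split as a quick proxy for the real subword tokenizer.
mthd_lens = df.mthd.str.split().str.len()
cmt_lens = df.cmt.str.split().str.len()

# Summary stats reveal outliers that the length filter above would drop.
print(mthd_lens.describe())
print(cmt_lens.describe())
```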
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Training">Training<a class="anchor-link" href="#Training"> </a></h1><p>Now that we have our data processed and in a format we like, let's go ahead and start training! To accomplish this we will be using code from the awesome <a href="https://github.com/microsoft/CodeXGLUE">CodeXGLUE</a> repository. This repository is similar to the NLP equivalent GLUE benchmarks where a ton of awesome code related benchmarks are standardized and put into one place for the community to use! They have a ton of interesting ones and I highly suggest looking through their repo if you are interested in other code related tasks.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">cd</span> <span class="o">./</span><span class="n">CodeXGLUE</span><span class="o">/</span><span class="n">Code</span><span class="o">-</span><span class="n">Text</span><span class="o">/</span><span class="n">code</span><span class="o">-</span><span class="n">to</span><span class="o">-</span><span class="n">text</span><span class="o">/</span><span class="n">code</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>/content/CodeXGLUE/Code-Text/code-to-text/code
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Okay, I lied, sorry :(. One last processing step is required: writing the data out in the structure that the awesome CodeXGLUE Code-Text benchmark expects.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">json</span>
<span class="n">df_trn</span><span class="p">[</span><span class="s1">'code_tokens'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_trn</span><span class="o">.</span><span class="n">mthd</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
<span class="n">df_trn</span><span class="p">[</span><span class="s1">'docstring_tokens'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_trn</span><span class="o">.</span><span class="n">cmt</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'java/train.jsonl'</span><span class="p">,</span><span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">df_trn</span><span class="o">.</span><span class="n">iterrows</span><span class="p">():</span>
<span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">to_dict</span><span class="p">())</span> <span class="o">+</span> <span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="n">df_val</span><span class="p">[</span><span class="s1">'code_tokens'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_val</span><span class="o">.</span><span class="n">mthd</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
<span class="n">df_val</span><span class="p">[</span><span class="s1">'docstring_tokens'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_val</span><span class="o">.</span><span class="n">cmt</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'java/valid.jsonl'</span><span class="p">,</span><span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">df_val</span><span class="o">.</span><span class="n">iterrows</span><span class="p">():</span>
<span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">to_dict</span><span class="p">())</span> <span class="o">+</span> <span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
<span class="n">df_tst</span><span class="p">[</span><span class="s1">'code_tokens'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_tst</span><span class="o">.</span><span class="n">mthd</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
<span class="n">df_tst</span><span class="p">[</span><span class="s1">'docstring_tokens'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_tst</span><span class="o">.</span><span class="n">cmt</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'java/test.jsonl'</span><span class="p">,</span><span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">df_tst</span><span class="o">.</span><span class="n">iterrows</span><span class="p">():</span>
<span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">to_dict</span><span class="p">())</span> <span class="o">+</span> <span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
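As a quick sanity check, each line of the resulting <code>.jsonl</code> files should round-trip through the <code>json</code> module. The row below is a hypothetical example mirroring the fields written above:

```python
import json

# Hypothetical row mirroring the fields written to train.jsonl above.
row = {
    'mthd': 'public void foo ( ) { }',
    'cmt': 'does nothing',
    'code_tokens': ['public', 'void', 'foo', '(', ')', '{', '}'],
    'docstring_tokens': ['does', 'nothing'],
}

# One JSON object per line: dumps on the way out, loads on the way back in.
line = json.dumps(row)
parsed = json.loads(line)
assert parsed['code_tokens'][0] == 'public'
```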
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">lang</span> <span class="o">=</span> <span class="s1">'java'</span> <span class="c1"># programming language</span>
<span class="n">lr</span> <span class="o">=</span> <span class="mf">5e-5</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">8</span> <span class="c1"># change depending on the GPU Colab gives you</span>
<span class="n">beam_size</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">source_length</span> <span class="o">=</span> <span class="mi">256</span>
<span class="n">target_length</span> <span class="o">=</span> <span class="n">max_cmt_len</span>
<span class="n">data_dir</span> <span class="o">=</span> <span class="s1">'.'</span>
<span class="n">output_dir</span> <span class="o">=</span> <span class="sa">f</span><span class="s1">'model/</span><span class="si">{</span><span class="n">lang</span><span class="si">}</span><span class="s1">'</span>
<span class="n">train_file</span> <span class="o">=</span> <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">data_dir</span><span class="si">}</span><span class="s1">/</span><span class="si">{</span><span class="n">lang</span><span class="si">}</span><span class="s1">/train.jsonl'</span>
<span class="n">dev_file</span> <span class="o">=</span> <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">data_dir</span><span class="si">}</span><span class="s1">/</span><span class="si">{</span><span class="n">lang</span><span class="si">}</span><span class="s1">/valid.jsonl'</span>
<span class="n">epochs</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">pretrained_model</span> <span class="o">=</span> <span class="s1">'microsoft/codebert-base'</span>
<span class="o">!</span> python run.py <span class="err">\</span>
<span class="o">--</span><span class="n">do_train</span> \
<span class="o">--</span><span class="n">do_eval</span> \
<span class="o">--</span><span class="n">do_lower_case</span> \
<span class="o">--</span><span class="n">model_type</span> <span class="n">roberta</span> \
<span class="o">--</span><span class="n">model_name_or_path</span> <span class="p">{</span><span class="n">pretrained_model</span><span class="p">}</span> \
<span class="o">--</span><span class="n">train_filename</span> <span class="p">{</span><span class="n">train_file</span><span class="p">}</span> \
<span class="o">--</span><span class="n">dev_filename</span> <span class="p">{</span><span class="n">dev_file</span><span class="p">}</span> \
<span class="o">--</span><span class="n">output_dir</span> <span class="p">{</span><span class="n">output_dir</span><span class="p">}</span> \
<span class="o">--</span><span class="n">max_source_length</span> <span class="p">{</span><span class="n">source_length</span><span class="p">}</span> \
<span class="o">--</span><span class="n">max_target_length</span> <span class="p">{</span><span class="n">target_length</span><span class="p">}</span> \
<span class="o">--</span><span class="n">beam_size</span> <span class="p">{</span><span class="n">beam_size</span><span class="p">}</span> \
<span class="o">--</span><span class="n">train_batch_size</span> <span class="p">{</span><span class="n">batch_size</span><span class="p">}</span> \
<span class="o">--</span><span class="n">eval_batch_size</span> <span class="p">{</span><span class="n">batch_size</span><span class="p">}</span> \
<span class="o">--</span><span class="n">learning_rate</span> <span class="p">{</span><span class="n">lr</span><span class="p">}</span> \
<span class="o">--</span><span class="n">num_train_epochs</span> <span class="p">{</span><span class="n">epochs</span><span class="p">}</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>2021-01-14 20:49:04.427229: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
01/14/2021 20:49:06 - INFO - __main__ - Namespace(adam_epsilon=1e-08, beam_size=10, config_name='', dev_filename='./java/valid.jsonl', do_eval=True, do_lower_case=True, do_test=False, do_train=True, eval_batch_size=8, eval_steps=-1, gradient_accumulation_steps=1, learning_rate=5e-05, load_model_path=None, local_rank=-1, max_grad_norm=1.0, max_source_length=256, max_steps=-1, max_target_length=48, model_name_or_path='microsoft/codebert-base', model_type='roberta', no_cuda=False, num_train_epochs=10, output_dir='model/java', seed=42, test_filename=None, tokenizer_name='', train_batch_size=8, train_filename='./java/train.jsonl', train_steps=-1, warmup_steps=0, weight_decay=0.0)
01/14/2021 20:49:06 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 1, distributed training: False
01/14/2021 20:49:06 - INFO - filelock - Lock 140293701425752 acquired on /root/.cache/torch/transformers/08477dcecf305af90229876aa01e4b0f3594dc8c638985a72277f39ea7d8d0c3.7fb14267817b1d26bb44a57cd5aa2fc003c25e87b75ef77e9c55c4804675b4cf.lock
Downloading: 100% 499M/499M [00:06<00:00, 73.5MB/s]
01/14/2021 20:49:13 - INFO - filelock - Lock 140293701425752 released on /root/.cache/torch/transformers/08477dcecf305af90229876aa01e4b0f3594dc8c638985a72277f39ea7d8d0c3.7fb14267817b1d26bb44a57cd5aa2fc003c25e87b75ef77e9c55c4804675b4cf.lock
01/14/2021 20:49:30 - INFO - __main__ - *** Example ***
01/14/2021 20:49:30 - INFO - __main__ - idx: 0
01/14/2021 20:49:30 - INFO - __main__ - source_tokens: ['<s>', 'public', '_static', '_void', '_check', 'j', 'av', 'ain', 'ternal', 'access', '(', 'il', 'og', 'ger', '_logger', ')', '_{', '_if', '_(', 'log', 'ger', '_==', '_null', '_||', '_!', 'java', 'version', '.', 'is', 'at', 'le', 'ast', '(', 'java', 'version', '.', 'java', '_', '9', '))', '_{', '_//', '_older', '_java', '_versions', '_are', '_fine', '_with', '_the', '_reflection', '_return', ';', '_}', '_map', '<', 'string', ',', '_package', 'access', 'requ', 'irement', '[]', '>', '_requirements', '_=', '_new', '_tre', 'em', 'ap', '<', 'string', ',', '_package', 'access', 'requ', 'irement', '[]', '>', '();', '_requirements', '.', 'put', '("', 'java', '.', 'base', '",', '_new', '_package', 'access', 'requ', 'irement', '[]', '_{', '_create', 'requ', 'irement', '(', 'false', ',', '_"', 'j', 'dk', '.', 'internal', '.', 'ref', '"),', '_create', 'requ', 'irement', '(', 'true', ',', '_"', 'java', '.', 'lang', '"),', '_create', 'requ', 'irement', '(', 'true', ',', '_"', 'java', '.', 'n', 'io', '"),', '_create', 'requ', 'irement', '(', 'true', ',', '_"', 'sun', '.', 'n', 'io', '.', 'ch', '")', '_});', '_requirements', '.', 'put', '("', 'j', 'dk', '.', 'management', '",', '_get', 'j', 'dk', 'management', 'requ', 'irements', '());', '_requirements', '.', 'put', '("', 'java', '.', 'management', '",', '_new', '_package', 'access', 'requ', 'irement', '[]', '_{', '_create', 'requ', 'irement', '(', 'true', ',', '_"', 'sun', '.', 'management', '")', '_});', '_check', 'package', 'requ', 'irements', '(', 'log', 'ger', ',', '_requirements', ');', '_}', '</s>']
01/14/2021 20:49:30 - INFO - __main__ - source_ids: 0 15110 25156 13842 1649 267 1469 1851 46378 28300 1640 718 2154 2403 37764 43 25522 114 36 12376 2403 45994 23796 45056 27785 43830 21747 4 354 415 459 1988 1640 43830 21747 4 43830 1215 466 35122 25522 21277 2530 46900 7952 32 2051 19 5 12456 671 131 35524 5456 41552 20951 6 3737 28300 42172 34074 48992 15698 3471 5457 92 6110 991 1115 41552 20951 6 3737 28300 42172 34074 48992 15698 47006 3471 4 9179 46469 43830 4 11070 1297 92 3737 28300 42172 34074 48992 25522 1045 42172 34074 1640 22303 6 22 267 43357 4 37559 4 13043 16844 1045 42172 34074 1640 29225 6 22 43830 4 32373 16844 1045 42172 34074 1640 29225 6 22 43830 4 282 1020 16844 1045 42172 34074 1640 29225 6 22 21381 4 282 1020 4 611 8070 47771 3471 4 9179 46469 267 43357 4 14668 1297 120 267 43357 14668 42172 48227 49291 3471 4 9179 46469 43830 4 14668 1297 92 3737 28300 42172 34074 48992 25522 1045 42172 34074 1640 29225 6 22 21381 4 14668 8070 47771 1649 46181 42172 48227 1640 12376 2403 6 3471 4397 35524 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/14/2021 20:49:30 - INFO - __main__ - source_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
01/14/2021 20:49:30 - INFO - __main__ - target_tokens: ['<s>', 'prints', '_warning', '_to', '_given', '_if', '_haz', 'el', 'cast', '_is', '_not', '_provided', '_a', '_sufficient', '_access', '_to', '_java', '_internal', '_packages', '_on', '_java', '_9', '_and', '_newer', '.', '</s>']
01/14/2021 20:49:30 - INFO - __main__ - target_ids: 0 31553 2892 7 576 114 32468 523 5182 16 45 1286 10 7719 899 7 46900 3425 8368 15 46900 361 8 13964 4 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/14/2021 20:49:30 - INFO - __main__ - target_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
01/14/2021 20:49:30 - INFO - __main__ - *** Example ***
01/14/2021 20:49:30 - INFO - __main__ - idx: 1
01/14/2021 20:49:30 - INFO - __main__ - source_tokens: ['<s>', 'public', '_void', '_marsh', 'all', '(', 's', 'ct', 'e', '20', 'pl', 'use', 'mb', 'edd', 'edd', 'est', 'inations', 'ettings', '_s', 'ct', 'e', '20', 'pl', 'use', 'mb', 'edd', 'edd', 'est', 'inations', 'ettings', ',', '_protocol', 'm', 'arsh', 'all', 'er', '_protocol', 'm', 'arsh', 'all', 'er', ')', '_{', '_if', '_(', 's', 'ct', 'e', '20', 'pl', 'use', 'mb', 'edd', 'edd', 'est', 'inations', 'ettings', '_==', '_null', ')', '_{', '_throw', '_new', '_s', 'dk', 'client', 'ex', 'ception', '("', 'in', 'valid', '_argument', '_passed', '_to', '_marsh', 'all', '(', '...)', '");', '_}', '_try', '_{', '_}', '_catch', '_(', 'ex', 'ception', '_e', ')', '_{', '_throw', '_new', '_s', 'dk', 'client', 'ex', 'ception', '("', 'un', 'able', '_to', '_marsh', 'all', '_request', '_to', '_json', ':', '_"', '_+', '_e', '.', 'get', 'message', '(),', '_e', ');', '_}', '_}', '</s>']
01/14/2021 20:49:30 - INFO - __main__ - source_ids: 0 15110 13842 16377 1250 1640 29 3894 242 844 2911 3698 6648 13093 13093 990 17808 48496 579 3894 242 844 2911 3698 6648 13093 13093 990 17808 48496 6 11883 119 14980 1250 254 11883 119 14980 1250 254 43 25522 114 36 29 3894 242 844 2911 3698 6648 13093 13093 990 17808 48496 45994 23796 43 25522 3211 92 579 43357 38557 3463 20900 46469 179 42679 4795 1595 7 16377 1250 1640 41137 45751 35524 860 25522 35524 2916 36 3463 20900 364 43 25522 3211 92 579 43357 38557 3463 20900 46469 879 868 7 16377 1250 2069 7 49133 35 22 2055 364 4 6460 44773 49196 364 4397 35524 35524 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/14/2021 20:49:30 - INFO - __main__ - source_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
01/14/2021 20:49:30 - INFO - __main__ - target_tokens: ['<s>', 'm', 'arsh', 'all', '_the', '_given', '_parameter', '_object', '.', '</s>']
01/14/2021 20:49:30 - INFO - __main__ - target_ids: 0 119 14980 1250 5 576 43797 7626 4 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/14/2021 20:49:30 - INFO - __main__ - target_mask: 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
01/14/2021 20:49:30 - INFO - __main__ - *** Example ***
01/14/2021 20:49:30 - INFO - __main__ - idx: 2
01/14/2021 20:49:30 - INFO - __main__ - source_tokens: ['<s>', '@', 'over', 'ride', '_public', '_void', '_pref', 'etch', 'token', '(', 'final', '_file', '_token', 'file', ',', '_final', '_props', '_props', ',', '_final', '_logger', '_logger', ')', '_throws', '_had', 'oop', 'security', 'man', 'age', 'rex', 'ception', '_{', '_final', '_string', '_us', 'ert', 'op', 'roxy', '_=', '_props', '.', 'get', 'string', '(', 'job', 'properties', '.', 'user', '_', 'to', '_', 'proxy', ');', '_logger', '.', 'info', '("', 'getting', '_had', 'oop', '_tokens', '_based', '_on', '_props', '_for', '_"', '_+', '_us', 'ert', 'op', 'roxy', ');', '_dop', 'ref', 'etch', '(', 'token', 'file', ',', '_props', ',', '_logger', ',', '_us', 'ert', 'op', 'roxy', ');', '_}', '</s>']
01/14/2021 20:49:30 - INFO - __main__ - source_ids: 0 1039 2137 23167 285 13842 33284 29094 46657 1640 6156 2870 19233 21710 6 507 26504 26504 6 507 37764 37764 43 6989 56 18042 15506 397 1580 19633 20900 25522 507 6755 201 2399 1517 46963 5457 26504 4 6460 20951 1640 30056 47276 4 12105 1215 560 1215 47315 4397 37764 4 23999 46469 31315 56 18042 22121 716 15 26504 13 22 2055 201 2399 1517 46963 4397 32331 13043 29094 1640 46657 21710 6 26504 6 37764 6 201 2399 1517 46963 4397 35524 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/14/2021 20:49:30 - INFO - __main__ - source_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
01/14/2021 20:49:30 - INFO - __main__ - target_tokens: ['<s>', '/*', '_gets', '_had', 'oop', '_tokens', '_for', '_a', '_user', '_to', '_run', '_map', 'red', '/', 'h', 'ive', '_jobs', '_on', '_a', '_secured', '_cluster', '</s>']
01/14/2021 20:49:30 - INFO - __main__ - target_ids: 0 49051 1516 56 18042 22121 13 10 3018 7 422 5456 2050 73 298 2088 1315 15 10 5288 18016 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/14/2021 20:49:30 - INFO - __main__ - target_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
01/14/2021 20:49:30 - INFO - __main__ - *** Example ***
01/14/2021 20:49:30 - INFO - __main__ - idx: 3
01/14/2021 20:49:30 - INFO - __main__ - source_tokens: ['<s>', '@', 'over', 'ride', '_public', '_<', 'y', '>', '_singular', 'attribute', '<', 'x', ',', '_y', '>', '_get', 'decl', 'ared', 'id', '(', 'class', '<', 'y', '>', '_param', 'class', ')', '_{', '_if', '_(', 'id', 'attribute', '_!=', '_null', ')', '_{', '_if', '_(', 'id', 'attribute', '.', 'get', 'j', 'av', 'at', 'ype', '().', 'equ', 'als', '(', 'param', 'class', ')', '_&&', '_!', 'is', 'id', 'class', ')', '_{', '_return', '_(', 'sing', 'ular', 'attribute', '<', 'x', ',', '_y', '>)', '_id', 'attribute', ';', '_}', '_}', '_on', 'error', '();', '_return', '_null', ';', '_}', '</s>']
01/14/2021 20:49:30 - INFO - __main__ - source_ids: 0 1039 2137 23167 285 28696 219 15698 23429 49202 41552 1178 6 1423 15698 120 32639 6537 808 1640 4684 41552 219 15698 40206 4684 43 25522 114 36 808 49202 49333 23796 43 25522 114 36 808 49202 4 6460 267 1469 415 37356 49123 8198 1536 1640 46669 4684 43 48200 27785 354 808 4684 43 25522 671 36 26058 8244 49202 41552 1178 6 1423 49798 13561 49202 131 35524 35524 15 44223 47006 671 23796 131 35524 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/14/2021 20:49:30 - INFO - __main__ - source_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
01/14/2021 20:49:30 - INFO - __main__ - target_tokens: ['<s>', '/*', '_(', 'non', '-', 'j', 'av', 'ad', 'oc', ')', '</s>']
01/14/2021 20:49:30 - INFO - __main__ - target_ids: 0 49051 36 13424 12 267 1469 625 1975 43 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/14/2021 20:49:30 - INFO - __main__ - target_mask: 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
01/14/2021 20:49:30 - INFO - __main__ - *** Example ***
01/14/2021 20:49:30 - INFO - __main__ - idx: 4
01/14/2021 20:49:30 - INFO - __main__ - source_tokens: ['<s>', 'public', '_void', '_sync', '(', 'bo', 'olean', '_syn', 'call', 'se', 'gments', ')', '_{', '_commit', 'log', 'se', 'gment', '_current', '_=', '_alloc', 'ator', '.', 'all', 'ocating', 'from', '();', '_for', '_(', 'commit', 'log', 'se', 'gment', '_segment', '_:', '_alloc', 'ator', '.', 'get', 'act', 'ives', 'eg', 'ments', '())', '_{', '_if', '_(!', 'sync', 'all', 'se', 'gments', '_&&', '_segment', '.', 'id', '_>', '_current', '.', 'id', ')', '_return', ';', '_segment', '.', 'sync', '();', '_}', '_}', '</s>']
01/14/2021 20:49:30 - INFO - __main__ - source_ids: 0 15110 13842 22785 1640 3983 48547 17796 16395 1090 30237 43 25522 6225 12376 1090 10757 595 5457 42793 2630 4 1250 18106 7761 47006 13 36 42721 12376 1090 10757 2835 4832 42793 2630 4 6460 7257 3699 3733 2963 49338 25522 114 48209 45176 1250 1090 30237 48200 2835 4 808 8061 595 4 808 43 671 131 2835 4 45176 47006 35524 35524 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/14/2021 20:49:30 - INFO - __main__ - source_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
01/14/2021 20:49:30 - INFO - __main__ - target_tokens: ['<s>', 'forces', '_a', '_disk', '_flush', '_on', '_the', '_commit', '_log', '_files', '_that', '_need', '_it', '.', '_blocking', '.', '</s>']
01/14/2021 20:49:30 - INFO - __main__ - target_ids: 0 34532 10 21675 24841 15 5 6225 7425 6773 14 240 24 4 8890 4 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
01/14/2021 20:49:30 - INFO - __main__ - target_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
01/14/2021 20:49:33 - INFO - __main__ - ***** Running training *****
01/14/2021 20:49:33 - INFO - __main__ - Num examples = 2809
01/14/2021 20:49:33 - INFO - __main__ - Batch size = 8
01/14/2021 20:49:33 - INFO - __main__ - Num epoch = 10
epoch 0 loss 6.8534: 100% 352/352 [02:53<00:00, 2.03it/s]
01/14/2021 20:52:27 - INFO - __main__ -
***** Running evaluation *****
01/14/2021 20:52:27 - INFO - __main__ - Num examples = 88
01/14/2021 20:52:27 - INFO - __main__ - Batch size = 8
01/14/2021 20:52:29 - INFO - __main__ - eval_ppl = 420.66683
01/14/2021 20:52:29 - INFO - __main__ - global_step = 353
01/14/2021 20:52:29 - INFO - __main__ - train_loss = 6.8534
01/14/2021 20:52:29 - INFO - __main__ - ********************
01/14/2021 20:52:31 - INFO - __main__ - Best ppl:420.66683
01/14/2021 20:52:31 - INFO - __main__ - ********************
Total: 88
01/14/2021 20:52:58 - INFO - __main__ - bleu-4 = 9.79
01/14/2021 20:52:58 - INFO - __main__ - ********************
01/14/2021 20:52:58 - INFO - __main__ - Best bleu:9.79
01/14/2021 20:52:58 - INFO - __main__ - ********************
epoch 1 loss 5.2249: 100% 352/352 [02:57<00:00, 1.98it/s]
01/14/2021 20:55:58 - INFO - __main__ -
***** Running evaluation *****
01/14/2021 20:55:58 - INFO - __main__ - Num examples = 88
01/14/2021 20:55:58 - INFO - __main__ - Batch size = 8
01/14/2021 20:56:00 - INFO - __main__ - eval_ppl = 223.30135
01/14/2021 20:56:00 - INFO - __main__ - global_step = 705
01/14/2021 20:56:00 - INFO - __main__ - train_loss = 5.2249
01/14/2021 20:56:00 - INFO - __main__ - ********************
01/14/2021 20:56:02 - INFO - __main__ - Best ppl:223.30135
01/14/2021 20:56:02 - INFO - __main__ - ********************
Total: 88
01/14/2021 20:56:30 - INFO - __main__ - bleu-4 = 10.3
01/14/2021 20:56:30 - INFO - __main__ - ********************
01/14/2021 20:56:30 - INFO - __main__ - Best bleu:10.3
01/14/2021 20:56:30 - INFO - __main__ - ********************
epoch 2 loss 4.4676: 100% 352/352 [02:57<00:00, 1.98it/s]
01/14/2021 20:59:31 - INFO - __main__ -
***** Running evaluation *****
01/14/2021 20:59:31 - INFO - __main__ - Num examples = 88
01/14/2021 20:59:31 - INFO - __main__ - Batch size = 8
01/14/2021 20:59:32 - INFO - __main__ - eval_ppl = 167.43889
01/14/2021 20:59:32 - INFO - __main__ - global_step = 1057
01/14/2021 20:59:32 - INFO - __main__ - train_loss = 4.4676
01/14/2021 20:59:32 - INFO - __main__ - ********************
01/14/2021 20:59:35 - INFO - __main__ - Best ppl:167.43889
01/14/2021 20:59:35 - INFO - __main__ - ********************
Total: 88
01/14/2021 21:00:05 - INFO - __main__ - bleu-4 = 10.68
01/14/2021 21:00:05 - INFO - __main__ - ********************
01/14/2021 21:00:05 - INFO - __main__ - Best bleu:10.68
01/14/2021 21:00:05 - INFO - __main__ - ********************
epoch 3 loss 3.8263: 100% 352/352 [02:57<00:00, 1.98it/s]
01/14/2021 21:03:05 - INFO - __main__ -
***** Running evaluation *****
01/14/2021 21:03:05 - INFO - __main__ - Num examples = 88
01/14/2021 21:03:05 - INFO - __main__ - Batch size = 8
01/14/2021 21:03:07 - INFO - __main__ - eval_ppl = 160.25635
01/14/2021 21:03:07 - INFO - __main__ - global_step = 1409
01/14/2021 21:03:07 - INFO - __main__ - train_loss = 3.8263
01/14/2021 21:03:07 - INFO - __main__ - ********************
01/14/2021 21:03:10 - INFO - __main__ - Best ppl:160.25635
01/14/2021 21:03:10 - INFO - __main__ - ********************
Total: 88
01/14/2021 21:03:38 - INFO - __main__ - bleu-4 = 11.04
01/14/2021 21:03:38 - INFO - __main__ - ********************
01/14/2021 21:03:38 - INFO - __main__ - Best bleu:11.04
01/14/2021 21:03:38 - INFO - __main__ - ********************
epoch 4 loss 3.2797: 100% 352/352 [02:57<00:00, 1.98it/s]
01/14/2021 21:06:38 - INFO - __main__ -
***** Running evaluation *****
01/14/2021 21:06:38 - INFO - __main__ - Num examples = 88
01/14/2021 21:06:38 - INFO - __main__ - Batch size = 8
01/14/2021 21:06:40 - INFO - __main__ - eval_ppl = 152.19858
01/14/2021 21:06:40 - INFO - __main__ - global_step = 1761
01/14/2021 21:06:40 - INFO - __main__ - train_loss = 3.2797
01/14/2021 21:06:40 - INFO - __main__ - ********************
01/14/2021 21:06:42 - INFO - __main__ - Best ppl:152.19858
01/14/2021 21:06:42 - INFO - __main__ - ********************
Total: 88
01/14/2021 21:07:14 - INFO - __main__ - bleu-4 = 10.36
01/14/2021 21:07:14 - INFO - __main__ - ********************
epoch 5 loss 2.8204: 100% 352/352 [02:57<00:00, 1.98it/s]
01/14/2021 21:10:12 - INFO - __main__ -
***** Running evaluation *****
01/14/2021 21:10:12 - INFO - __main__ - Num examples = 88
01/14/2021 21:10:12 - INFO - __main__ - Batch size = 8
01/14/2021 21:10:13 - INFO - __main__ - eval_ppl = 150.95443
01/14/2021 21:10:13 - INFO - __main__ - global_step = 2113
01/14/2021 21:10:13 - INFO - __main__ - train_loss = 2.8204
01/14/2021 21:10:13 - INFO - __main__ - ********************
01/14/2021 21:10:16 - INFO - __main__ - Best ppl:150.95443
01/14/2021 21:10:16 - INFO - __main__ - ********************
Total: 88
01/14/2021 21:10:45 - INFO - __main__ - bleu-4 = 11.57
01/14/2021 21:10:45 - INFO - __main__ - ********************
01/14/2021 21:10:45 - INFO - __main__ - Best bleu:11.57
01/14/2021 21:10:45 - INFO - __main__ - ********************
epoch 6 loss 2.4442: 100% 352/352 [02:57<00:00, 1.98it/s]
01/14/2021 21:13:46 - INFO - __main__ -
***** Running evaluation *****
01/14/2021 21:13:46 - INFO - __main__ - Num examples = 88
01/14/2021 21:13:46 - INFO - __main__ - Batch size = 8
01/14/2021 21:13:47 - INFO - __main__ - eval_ppl = 156.69898
01/14/2021 21:13:47 - INFO - __main__ - global_step = 2465
01/14/2021 21:13:47 - INFO - __main__ - train_loss = 2.4442
01/14/2021 21:13:47 - INFO - __main__ - ********************
Total: 88
01/14/2021 21:14:17 - INFO - __main__ - bleu-4 = 10.65
01/14/2021 21:14:17 - INFO - __main__ - ********************
epoch 7 loss 2.1565: 100% 352/352 [02:57<00:00, 1.98it/s]
01/14/2021 21:17:15 - INFO - __main__ -
***** Running evaluation *****
01/14/2021 21:17:15 - INFO - __main__ - Num examples = 88
01/14/2021 21:17:15 - INFO - __main__ - Batch size = 8
01/14/2021 21:17:16 - INFO - __main__ - eval_ppl = 163.34726
01/14/2021 21:17:16 - INFO - __main__ - global_step = 2817
01/14/2021 21:17:16 - INFO - __main__ - train_loss = 2.1565
01/14/2021 21:17:16 - INFO - __main__ - ********************
Total: 88
01/14/2021 21:17:50 - INFO - __main__ - bleu-4 = 10.56
01/14/2021 21:17:50 - INFO - __main__ - ********************
epoch 8 loss 1.9398: 100% 352/352 [02:57<00:00, 1.98it/s]
01/14/2021 21:20:47 - INFO - __main__ -
***** Running evaluation *****
01/14/2021 21:20:47 - INFO - __main__ - Num examples = 88
01/14/2021 21:20:47 - INFO - __main__ - Batch size = 8
01/14/2021 21:20:49 - INFO - __main__ - eval_ppl = 166.41823
01/14/2021 21:20:49 - INFO - __main__ - global_step = 3169
01/14/2021 21:20:49 - INFO - __main__ - train_loss = 1.9398
01/14/2021 21:20:49 - INFO - __main__ - ********************
Total: 88
01/14/2021 21:21:26 - INFO - __main__ - bleu-4 = 10.74
01/14/2021 21:21:26 - INFO - __main__ - ********************
epoch 9 loss 1.7877: 100% 352/352 [02:57<00:00, 1.98it/s]
01/14/2021 21:24:24 - INFO - __main__ -
***** Running evaluation *****
01/14/2021 21:24:24 - INFO - __main__ - Num examples = 88
01/14/2021 21:24:24 - INFO - __main__ - Batch size = 8
01/14/2021 21:24:25 - INFO - __main__ - eval_ppl = 169.37057
01/14/2021 21:24:25 - INFO - __main__ - global_step = 3521
01/14/2021 21:24:25 - INFO - __main__ - train_loss = 1.7877
01/14/2021 21:24:25 - INFO - __main__ - ********************
Total: 88
01/14/2021 21:24:59 - INFO - __main__ - bleu-4 = 10.28
01/14/2021 21:24:59 - INFO - __main__ - ********************
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Yay! Our model has finished baking! Looking at the logs, validation perplexity bottomed out at epoch 5 (150.95) and crept back up afterwards (a hint of overfitting), and the best validation BLEU-4 of 11.57 also came at epoch 5. Let's now see how well that best checkpoint turned out by evaluating it on the test set!</p>
</div>
</div>
</div>
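<p>As a quick aside, the BLEU-4 numbers reported throughout these logs measure overlapping 1- to 4-grams between the generated comment and the reference comment. The sketch below is a simplified, smoothed sentence-level BLEU-4 in pure Python to show the idea; it is not the exact (corpus-level) implementation that <code>run.py</code> uses, so its scores will differ from the logged ones.</p>

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count the n-grams appearing in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(reference, candidate):
    """Smoothed sentence-level BLEU-4: geometric mean of clipped
    1- to 4-gram precisions, scaled by a brevity penalty."""
    precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # clipped overlap: a candidate n-gram only counts up to the
        # number of times it appears in the reference
        overlap = sum((cand_counts & ref_counts).values())
        total = max(sum(cand_counts.values()), 1)
        # add-one smoothing so one missing n-gram order doesn't zero the score
        precisions.append((overlap + 1) / (total + 1))
    # brevity penalty discourages very short candidates
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

ref = "decodes the given base64-encoded string .".split()
cand = "decodes the given base64 string .".split()
print(round(bleu4(ref, cand), 4))
```

A perfect match scores 1.0, and scores fall off quickly as higher-order n-grams stop matching, which is why even decent code summaries land in the ~10 BLEU range seen above.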
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">batch_size</span><span class="o">=</span><span class="mi">64</span>
<span class="n">dev_file</span><span class="o">=</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">data_dir</span><span class="si">}</span><span class="s2">/</span><span class="si">{</span><span class="n">lang</span><span class="si">}</span><span class="s2">/valid.jsonl"</span>
<span class="n">test_file</span><span class="o">=</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">data_dir</span><span class="si">}</span><span class="s2">/</span><span class="si">{</span><span class="n">lang</span><span class="si">}</span><span class="s2">/test.jsonl"</span>
<span class="n">test_model</span><span class="o">=</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">output_dir</span><span class="si">}</span><span class="s2">/checkpoint-best-bleu/pytorch_model.bin"</span> <span class="c1">#checkpoint for test</span>
<span class="o">!</span> python run.py <span class="err">\</span>
<span class="o">--</span><span class="n">do_test</span> \
<span class="o">--</span><span class="n">model_type</span> <span class="n">roberta</span> \
<span class="o">--</span><span class="n">model_name_or_path</span> <span class="n">microsoft</span><span class="o">/</span><span class="n">codebert</span><span class="o">-</span><span class="n">base</span> \
<span class="o">--</span><span class="n">load_model_path</span> <span class="p">{</span><span class="n">test_model</span><span class="p">}</span> \
<span class="o">--</span><span class="n">dev_filename</span> <span class="p">{</span><span class="n">dev_file</span><span class="p">}</span> \
<span class="o">--</span><span class="n">test_filename</span> <span class="p">{</span><span class="n">test_file</span><span class="p">}</span> \
<span class="o">--</span><span class="n">output_dir</span> <span class="p">{</span><span class="n">output_dir</span><span class="p">}</span> \
<span class="o">--</span><span class="n">max_source_length</span> <span class="p">{</span><span class="n">source_length</span><span class="p">}</span> \
<span class="o">--</span><span class="n">max_target_length</span> <span class="p">{</span><span class="n">target_length</span><span class="p">}</span> \
<span class="o">--</span><span class="n">beam_size</span> <span class="p">{</span><span class="n">beam_size</span><span class="p">}</span> \
<span class="o">--</span><span class="n">eval_batch_size</span> <span class="p">{</span><span class="n">batch_size</span><span class="p">}</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>2021-01-14 21:25:04.498200: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
01/14/2021 21:25:07 - INFO - __main__ - Namespace(adam_epsilon=1e-08, beam_size=10, config_name='', dev_filename='./java/valid.jsonl', do_eval=False, do_lower_case=False, do_test=True, do_train=False, eval_batch_size=64, eval_steps=-1, gradient_accumulation_steps=1, learning_rate=5e-05, load_model_path='model/java/checkpoint-best-bleu/pytorch_model.bin', local_rank=-1, max_grad_norm=1.0, max_source_length=256, max_steps=-1, max_target_length=48, model_name_or_path='microsoft/codebert-base', model_type='roberta', no_cuda=False, num_train_epochs=3, output_dir='model/java', seed=42, test_filename='./java/test.jsonl', tokenizer_name='', train_batch_size=8, train_filename=None, train_steps=-1, warmup_steps=0, weight_decay=0.0)
01/14/2021 21:25:07 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 1, distributed training: False
01/14/2021 21:25:23 - INFO - __main__ - reload model from model/java/checkpoint-best-bleu/pytorch_model.bin
01/14/2021 21:25:48 - INFO - __main__ - Test file: ./java/valid.jsonl
100% 2/2 [00:26<00:00, 13.34s/it]
Total: 88
01/14/2021 21:26:15 - INFO - __main__ - bleu-4 = 11.57
01/14/2021 21:26:15 - INFO - __main__ - ********************
01/14/2021 21:26:15 - INFO - __main__ - Test file: ./java/test.jsonl
100% 4/4 [00:55<00:00, 13.95s/it]
Total: 193
01/14/2021 21:27:11 - INFO - __main__ - bleu-4 = 9.74
01/14/2021 21:27:11 - INFO - __main__ - ********************
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's now load up our model and take it for a spin!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="nn">nn</span>
<span class="kn">from</span> <span class="nn">model</span> <span class="kn">import</span> <span class="n">Seq2Seq</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">RobertaConfig</span><span class="p">,</span> <span class="n">RobertaModel</span>
<span class="n">config</span> <span class="o">=</span> <span class="n">RobertaConfig</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">pretrained_model</span><span class="p">)</span>
<span class="n">encoder</span> <span class="o">=</span> <span class="n">RobertaModel</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">pretrained_model</span><span class="p">,</span> <span class="n">config</span> <span class="o">=</span> <span class="n">config</span><span class="p">)</span>
<span class="n">decoder_layer</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">TransformerDecoderLayer</span><span class="p">(</span><span class="n">d_model</span><span class="o">=</span><span class="n">config</span><span class="o">.</span><span class="n">hidden_size</span><span class="p">,</span> <span class="n">nhead</span><span class="o">=</span><span class="n">config</span><span class="o">.</span><span class="n">num_attention_heads</span><span class="p">)</span>
<span class="n">decoder</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">TransformerDecoder</span><span class="p">(</span><span class="n">decoder_layer</span><span class="p">,</span> <span class="n">num_layers</span><span class="o">=</span><span class="mi">6</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Seq2Seq</span><span class="p">(</span><span class="n">encoder</span> <span class="o">=</span> <span class="n">encoder</span><span class="p">,</span><span class="n">decoder</span> <span class="o">=</span> <span class="n">decoder</span><span class="p">,</span><span class="n">config</span><span class="o">=</span><span class="n">config</span><span class="p">,</span>
<span class="n">beam_size</span><span class="o">=</span><span class="n">beam_size</span><span class="p">,</span><span class="n">max_length</span><span class="o">=</span><span class="n">target_length</span><span class="p">,</span>
<span class="n">sos_id</span><span class="o">=</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">cls_token_id</span><span class="p">,</span><span class="n">eos_id</span><span class="o">=</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">sep_token_id</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">load_state_dict</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="n">output_dir</span><span class="p">)</span><span class="o">/</span><span class="s2">"checkpoint-last/pytorch_model.bin"</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s1">'cuda'</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>Seq2Seq(
(encoder): RobertaModel(
(embeddings): RobertaEmbeddings(
(word_embeddings): Embedding(50265, 768, padding_idx=1)
(position_embeddings): Embedding(514, 768, padding_idx=1)
(token_type_embeddings): Embedding(1, 768)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): RobertaEncoder(
(layer): ModuleList(
(0): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(1): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(2): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(3): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(4): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(5): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(6): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(7): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(8): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(9): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(10): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(11): RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
(pooler): RobertaPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
(decoder): TransformerDecoder(
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(1): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(2): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(3): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(4): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
(5): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(multihead_attn): MultiheadAttention(
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(linear1): Linear(in_features=768, out_features=2048, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(linear2): Linear(in_features=2048, out_features=768, bias=True)
(norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.1, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
)
)
)
(dense): Linear(in_features=768, out_features=768, bias=True)
(lm_head): Linear(in_features=768, out_features=50265, bias=False)
(lsm): LogSoftmax()
)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">TEXT_TO_SUMMARIZE</span> <span class="o">=</span> <span class="n">df_val</span><span class="o">.</span><span class="n">mthd</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Code:'</span><span class="p">,</span> <span class="n">TEXT_TO_SUMMARIZE</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Original Comment:'</span><span class="p">,</span> <span class="n">df_val</span><span class="o">.</span><span class="n">cmt</span><span class="o">.</span><span class="n">values</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Code: public static byte[] decode(final string s) { int delta = s.endswith("==") ? 2 : s.endswith("=") ? 1 : 0; byte[] buffer = new byte[s.length() * bytes_per_unencoded_block / bytes_per_encoded_block - delta]; int mask = 0xff; int pos = 0; for (int i = 0; i < s.length(); i += bytes_per_encoded_block) { int c0 = decode_table[s.charat(i)]; int c1 = decode_table[s.charat(i + 1)]; buffer[pos++] = (byte) (((c0 << 2) | (c1 >> 4)) & mask); if (pos >= buffer.length) { return buffer; } int c2 = decode_table[s.charat(i + 2)]; buffer[pos++] = (byte) (((c1 << 4) | (c2 >> 2)) & mask); if (pos >= buffer.length) { return buffer; } int c3 = decode_table[s.charat(i + 3)]; buffer[pos++] = (byte) (((c2 << 6) | c3) & mask); } return buffer; }
Original Comment: decodes the given base64-encoded string.
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="kn">from</span> <span class="nn">run</span> <span class="kn">import</span> <span class="n">convert_examples_to_features</span><span class="p">,</span> <span class="n">Example</span>
<span class="k">class</span> <span class="nc">Args</span><span class="p">:</span>
<span class="n">max_source_length</span> <span class="o">=</span> <span class="n">source_length</span>
<span class="n">max_target_length</span> <span class="o">=</span> <span class="n">target_length</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">Args</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">get_preds</span><span class="p">(</span><span class="n">df</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">):</span>
<span class="n">ps</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">iterrows</span><span class="p">(),</span> <span class="n">total</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)):</span>
<span class="n">examples</span> <span class="o">=</span> <span class="p">[</span>
<span class="n">Example</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="n">source</span> <span class="o">=</span> <span class="n">row</span><span class="o">.</span><span class="n">mthd</span><span class="p">,</span> <span class="n">target</span> <span class="o">=</span> <span class="n">row</span><span class="o">.</span><span class="n">cmt</span><span class="p">)</span>
<span class="p">]</span>
<span class="n">eval_features</span> <span class="o">=</span> <span class="n">convert_examples_to_features</span><span class="p">(</span>
<span class="n">examples</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">stage</span><span class="o">=</span><span class="s1">'test'</span>
<span class="p">)</span>
<span class="n">source_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">eval_features</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">source_ids</span><span class="p">,</span> <span class="n">dtype</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">long</span><span class="p">)</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s1">'cuda'</span><span class="p">)</span>
<span class="n">source_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">eval_features</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">source_mask</span><span class="p">,</span> <span class="n">dtype</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">long</span><span class="p">)</span><span class="o">.</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s1">'cuda'</span><span class="p">)</span>
<span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">source_ids</span> <span class="o">=</span> <span class="n">source_ids</span><span class="p">,</span> <span class="n">source_mask</span> <span class="o">=</span> <span class="n">source_mask</span><span class="p">)</span>
<span class="k">for</span> <span class="n">pred</span> <span class="ow">in</span> <span class="n">preds</span><span class="p">:</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">pred</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">numpy</span><span class="p">()</span>
<span class="n">t</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>
<span class="k">if</span> <span class="mi">0</span> <span class="ow">in</span> <span class="n">t</span><span class="p">:</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="p">[:</span><span class="n">t</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="mi">0</span><span class="p">)]</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="n">clean_up_tokenization_spaces</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ps</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="k">return</span> <span class="n">ps</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">get_preds</span><span class="p">(</span><span class="n">df_val</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">))</span>
<span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">df_val</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="o">.</span><span class="n">iterrows</span><span class="p">():</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Code:'</span><span class="p">,</span> <span class="n">row</span><span class="o">.</span><span class="n">mthd</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Original Comment:'</span><span class="p">,</span> <span class="n">row</span><span class="o">.</span><span class="n">cmt</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Generated Comment:'</span><span class="p">,</span> <span class="n">preds</span><span class="p">[</span><span class="n">idx</span><span class="p">])</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'='</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>
Code: public static byte[] decode(final string s) { int delta = s.endswith("==") ? 2 : s.endswith("=") ? 1 : 0; byte[] buffer = new byte[s.length() * bytes_per_unencoded_block / bytes_per_encoded_block - delta]; int mask = 0xff; int pos = 0; for (int i = 0; i < s.length(); i += bytes_per_encoded_block) { int c0 = decode_table[s.charat(i)]; int c1 = decode_table[s.charat(i + 1)]; buffer[pos++] = (byte) (((c0 << 2) | (c1 >> 4)) & mask); if (pos >= buffer.length) { return buffer; } int c2 = decode_table[s.charat(i + 2)]; buffer[pos++] = (byte) (((c1 << 4) | (c2 >> 2)) & mask); if (pos >= buffer.length) { return buffer; } int c3 = decode_table[s.charat(i + 3)]; buffer[pos++] = (byte) (((c2 << 6) | c3) & mask); } return buffer; }
Original Comment: decodes the given base64-encoded string.
Generated Comment: decode encode a string representation of string
========================================
Code: private void extractapklib( artifact apklibartifact ) throws mojoexecutionexception { getunpackedlibhelper().extractapklib( apklibartifact ); // copy the assets to the the combinedassets folder. // add the apklib source and resource to the compile. // nb apklib sources are added to compilesourceroot because we may need to compile against them. // this means the apklib classes will be compiled into target/classes and packaged with this build. copyfolder( getunpackedlibassetsfolder( apklibartifact ), combinedassets ); final file apklibsourcefolder = getunpackedapklibsourcefolder( apklibartifact ); final list<string> resourceexclusions = arrays.aslist( "**/*.java", "**/*.aidl" ); projecthelper.addresource( project, apklibsourcefolder.getabsolutepath(), null, resourceexclusions ); project.addcompilesourceroot( apklibsourcefolder.getabsolutepath() ); }
Original Comment: extracts apklib and adds the assets and apklib sources and resources to the build.
Generated Comment: extracts the cp libraries from the given source library.
========================================
Code: static <t> t[] copy(object[] source, int from, int to, t[] arrayoftype) { t[] result = newarray(arrayoftype, to - from); system.arraycopy(source, from, result, 0, to - from); return result; }
Original Comment: equivalent to arrays.copyofrange(source, from, to, arrayoftype.getclass()).
Generated Comment: creates a new object from the array.
========================================
Code: private static runtimedelegate finddelegate() { runtimedelegate result=null; try { result=createruntimedelegatefromspi(); if(result==null) { result=createruntimedelegatefromconfigurationfile(); } if(result==null) { string delegateclassname = system.getproperty(application_engine_spi_property); if(delegateclassname!=null) { result=createruntimedelegateforclassname(delegateclassname); } } } catch (exception ex) { logger.warn("could not find application engine",ex); } return result; }
Original Comment: obtain an instance using the method described in }.
Generated Comment: /* package
========================================
Code: public static string getcategory(string eventsrcname) { if (eventsrcname == null) { return null; } int end = eventsrcname.lastindexof('.'); eventsrcname = eventsrcname.substring(0, end); if (checkstyle_package.equals(eventsrcname)) { return "misc"; } else if (!eventsrcname.startswith(checkstyle_package)) { return "extension"; } return eventsrcname.substring(eventsrcname.lastindexof('.') + 1); }
Original Comment: get the rule category from an audit event source name.
Generated Comment: returns the contents of the event name.
========================================
Code: private collection<artifact> getserverdependencies(final string servertype, final expressionevaluator expressionevaluator) throws componentconfigurationexception { try { final mavenproject project = (mavenproject) expressionevaluator.evaluate("${project}"); final string localrepo = (string) expressionevaluator.evaluate("${settings.localrepository}"); final artifactrepository localrepository = repositorysystem.createlocalrepository(new file(localrepo)); final repositoryrequest repositoryrequest = new defaultrepositoryrequest(); repositoryrequest.setremoterepositories(project.getremoteartifactrepositories()); repositoryrequest.setlocalrepository(localrepository); final artifactresolutionrequest request = new artifactresolutionrequest(repositoryrequest); request.setartifact(getserverartifact(servertype)); request.setresolvetransitively(true); final artifactresolutionresult result = repositorysystem.resolve(request); if (result.issuccess()) { return result.getartifacts(); } boolean first = true; final stringbuilder builder = new stringbuilder("cannot resolve dependencies: ["); for (final artifact artifact : result.getmissingartifacts()) { if (!first) { builder.append(','); } else { first = false; } builder.append(artifact.getgroupid()); builder.append(':'); builder.append(artifact.getartifactid()); builder.append(':'); builder.append(artifact.getversion()); } builder.append("]"); throw new componentconfigurationexception(builder.tostring()); } catch (final expressionevaluationexception e) { throw new componentconfigurationexception("error evaluating expression", e); } catch (final invalidrepositoryexception e) { throw new componentconfigurationexception("error resolving local repository", e); } }
Original Comment: resolve the ldap server type artifact and its dependencies.
Generated Comment: gets the repositories from the repository.
========================================
Code: private void frame4() { long currenttime = system.currenttimemillis(); // xxx: lots of dummy value // record trade information in trade table. // insert into trade (t_id, t_dts, t_st_id, t_tt_id, t_is_cash, // t_s_symb, t_qty, t_bid_price, t_ca_id, t_exec_name, t_trade_price, // t_chrg, t_comm, t_tax, t_lifo) values (...) string sql = string.format("insert into trade (t_id, t_dts, t_st_id, t_tt_id, " + "t_is_cash, t_s_symb, t_qty, t_bid_price, t_ca_id, t_exec_name, " + "t_trade_price, t_chrg, t_comm, t_tax, t_lifo) values (%d, %d, '%s', " + "'%s', %d, '%s', %d, %f, %d, '%s', %f, %f, %f, %f, %d)", paramhelper.gettradeid(), currenttime, statusid, paramhelper.gettradetypeid(), 1, paramhelper.getsymbol(), paramhelper.gettradeqty(), marketprice, paramhelper.getacctid(), "exec_name", paramhelper.gettradeprice(), 0.0, 0.0, 0.0, 1); executeupdate(sql); // todo: implement this (not in the simplified version) // record pending trade information in trade_request table // if this trade is a limit trade // insert into trade_request (tr_t_id, tr_tt_id, tr_s_symb, tr_qty, // tr_bid_price, tr_b_id) values (...) // record trade information in trade_history table // insert into trade_history (th_t_id, th_dts, th_st_id) values (...) sql = string.format("insert into trade_history (th_t_id, th_dts, th_st_id) values " + "(%d, %d, '%s')", paramhelper.gettradeid(), currenttime, statusid); executeupdate(sql); }
Original Comment: record the trade request by making all related updates
Generated Comment: this method is used to create the database.
========================================
Code: protected string getquery() { final stringbuilder ret = new stringbuilder(); try { final string clazzname; if (efapssystemconfiguration.get().containsattributevalue("org.efaps.kernel.index.querybuilder")) { clazzname = efapssystemconfiguration.get().getattributevalue("org.efaps.kernel.index.querybuilder"); } else { clazzname = "org.efaps.esjp.admin.index.lucencequerybuilder"; } final class<?> clazz = class.forname(clazzname, false, efapsclassloader.getinstance()); final object obj = clazz.newinstance(); final method method = clazz.getmethod("getquery4dimvalues", string.class, list.class, list.class); final object newquery = method.invoke(obj, getcurrentquery(), getincluded(), getexcluded()); ret.append(newquery); } catch (final efapsexception | classnotfoundexception | instantiationexception | illegalaccessexception | nosuchmethodexception | securityexception | illegalargumentexception | invocationtargetexception e) { indexsearch.log.error("catched", e); ret.append(getcurrentquery()); } return ret.tostring(); }
Original Comment: gets the query.
Generated Comment: get the query instance.
========================================
Code: private languagedata findlanguage(final string locale) { for (final languagedata languagedata : languagedatadao.getall()) { if (languagedata.getlanguagecode().equalsignorecase(locale)) { return languagedata; } } return null; }
Original Comment: find language.
Generated Comment: gets the specified locale.
========================================
Code: private standardintrospectionresponse callstandardintrospection(string parameters) { if (parameters == null) { // authlete returns different error codes for null and an empty string. // 'null' is regarded as a caller's error. an empty string is regarded // as a client application's error. parameters = ""; } // create a request for authlete's /api/auth/introspection/standard api. standardintrospectionrequest request = new standardintrospectionrequest() .setparameters(parameters); try { // call authlete's /api/auth/introspection/standard api. return mapi.standardintrospection(request); } catch (authleteapiexception e) { // the api call failed. throw apifailure("/api/auth/introspection/standard", e); } }
Original Comment: call authlete's api.
Generated Comment: returns a set of authentication object.
========================================
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The model seems to be doing a decent job, but if you play with it some more you'll notice it is mostly taking the method's name and using it to guide the comment. That makes sense, but it suggests the model isn't learning much beyond this name-to-comment association, at least at this small scale. Let's dig a bit deeper by looking at the validation examples it struggles with the most, i.e. the ones with the highest loss.</p>
</div>
</div>
</div>
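<p>One quick way to probe the "it's mostly copying the method name" hypothesis is to measure how often subtokens of the method name reappear in the generated comment. This is a sketch I'm adding for illustration, not part of the original notebook: <code>name_overlap</code> is a hypothetical helper, and it assumes the method source looks like the lowercased Java snippets above (name immediately before the first parenthesis).</p>

```python
import re

def name_overlap(method_src: str, comment: str) -> float:
    """Fraction of the method-name subtokens that reappear in the comment.

    method_src: raw method source, e.g. "public static byte[] decode(final string s) { ... }"
    comment:    the generated (or reference) comment.
    """
    # The identifier right before the first '(' is the method name.
    match = re.search(r"(\w+)\s*\(", method_src)
    if not match:
        return 0.0
    name = match.group(1)
    # Split snake_case and camelCase names into lowercase subtokens.
    subtokens = [t.lower() for t in re.split(r"_|(?<=[a-z])(?=[A-Z])", name) if t]
    comment_words = set(re.findall(r"\w+", comment.lower()))
    hits = sum(1 for t in subtokens if t in comment_words)
    return hits / len(subtokens)
```

<p>Averaging this score over the validation predictions (versus the same score for the reference comments) would give a rough sense of how much the model leans on the method name relative to human-written comments.</p>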
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">get_preds_losses</span><span class="p">(</span><span class="n">df</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">):</span>
<span class="n">ps</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">losses</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">iterrows</span><span class="p">(),</span> <span class="n">total</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">)):</span>
<span class="n">examples</span> <span class="o">=</span> <span class="p">[</span>
<span class="n">Example</span><span class="p">(</span><span class="n">idx</span><span class="p">,</span> <span class="n">source</span> <span class="o">=</span> <span class="n">row</span><span class="o">.</span><span class="n">mthd</span><span class="p">,</span> <span class="n">target</span> <span class="o">=</span> <span class="n">row</span><span class="o">.</span><span class="n">cmt</span><span class="p">)</span>
<span class="p">]</span>
<span class="n">eval_features</span> <span class="o">=</span> <span class="n">convert_examples_to_features</span><span class="p">(</span>
<span class="n">examples</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">stage</span><span class="o">=</span><span class="s1">'test'</span>
<span class="p">)</span>
<span class="n">source_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="n">f</span><span class="o">.</span><span class="n">source_ids</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">eval_features</span><span class="p">],</span> <span class="n">dtype</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">long</span><span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s1">'cuda'</span><span class="p">)</span>
<span class="n">source_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="n">f</span><span class="o">.</span><span class="n">source_mask</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">eval_features</span><span class="p">],</span> <span class="n">dtype</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">long</span><span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s1">'cuda'</span><span class="p">)</span>
<span class="n">target_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="n">f</span><span class="o">.</span><span class="n">target_ids</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">eval_features</span><span class="p">],</span> <span class="n">dtype</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">long</span><span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s1">'cuda'</span><span class="p">)</span>
<span class="n">target_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="n">f</span><span class="o">.</span><span class="n">target_mask</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">eval_features</span><span class="p">],</span> <span class="n">dtype</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">long</span><span class="p">)</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="s1">'cuda'</span><span class="p">)</span>
<span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">_</span><span class="p">,</span> <span class="n">loss</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span>
<span class="n">source_ids</span> <span class="o">=</span> <span class="n">source_ids</span><span class="p">,</span> <span class="n">source_mask</span> <span class="o">=</span> <span class="n">source_mask</span><span class="p">,</span>
<span class="n">target_ids</span> <span class="o">=</span> <span class="n">target_ids</span><span class="p">,</span> <span class="n">target_mask</span> <span class="o">=</span> <span class="n">target_mask</span>
<span class="p">)</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">source_ids</span> <span class="o">=</span> <span class="n">source_ids</span><span class="p">,</span> <span class="n">source_mask</span> <span class="o">=</span> <span class="n">source_mask</span><span class="p">)</span>
<span class="k">for</span> <span class="n">pred</span> <span class="ow">in</span> <span class="n">preds</span><span class="p">:</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">pred</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">numpy</span><span class="p">()</span>
<span class="n">t</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>
<span class="k">if</span> <span class="mi">0</span> <span class="ow">in</span> <span class="n">t</span><span class="p">:</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="p">[:</span><span class="n">t</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="mi">0</span><span class="p">)]</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="n">clean_up_tokenization_spaces</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">ps</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="n">losses</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">loss</span><span class="o">.</span><span class="n">item</span><span class="p">())</span>
<span class="k">return</span> <span class="n">ps</span><span class="p">,</span> <span class="n">losses</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df_head</span> <span class="o">=</span> <span class="n">df_val</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">ps</span><span class="p">,</span> <span class="n">losses</span> <span class="o">=</span> <span class="n">get_preds_losses</span><span class="p">(</span><span class="n">df_head</span><span class="p">)</span>
<span class="n">df_head</span><span class="p">[</span><span class="s1">'pred'</span><span class="p">]</span> <span class="o">=</span> <span class="n">ps</span>
<span class="n">df_head</span><span class="p">[</span><span class="s1">'loss'</span><span class="p">]</span> <span class="o">=</span> <span class="n">losses</span>
<span class="n">df_sorted_losses</span> <span class="o">=</span> <span class="n">df_head</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s1">'loss'</span><span class="p">,</span> <span class="n">ascending</span> <span class="o">=</span> <span class="kc">False</span><span class="p">)</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">df_sorted_losses</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="o">.</span><span class="n">iterrows</span><span class="p">():</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Code:'</span><span class="p">,</span> <span class="n">row</span><span class="o">.</span><span class="n">mthd</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Original Comment:'</span><span class="p">,</span> <span class="n">row</span><span class="o">.</span><span class="n">cmt</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Generated Comment:'</span><span class="p">,</span> <span class="n">row</span><span class="o">.</span><span class="n">pred</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">loss</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'='</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>
Code: private collection<artifact> getserverdependencies(final string servertype, final expressionevaluator expressionevaluator) throws componentconfigurationexception { try { final mavenproject project = (mavenproject) expressionevaluator.evaluate("${project}"); final string localrepo = (string) expressionevaluator.evaluate("${settings.localrepository}"); final artifactrepository localrepository = repositorysystem.createlocalrepository(new file(localrepo)); final repositoryrequest repositoryrequest = new defaultrepositoryrequest(); repositoryrequest.setremoterepositories(project.getremoteartifactrepositories()); repositoryrequest.setlocalrepository(localrepository); final artifactresolutionrequest request = new artifactresolutionrequest(repositoryrequest); request.setartifact(getserverartifact(servertype)); request.setresolvetransitively(true); final artifactresolutionresult result = repositorysystem.resolve(request); if (result.issuccess()) { return result.getartifacts(); } boolean first = true; final stringbuilder builder = new stringbuilder("cannot resolve dependencies: ["); for (final artifact artifact : result.getmissingartifacts()) { if (!first) { builder.append(','); } else { first = false; } builder.append(artifact.getgroupid()); builder.append(':'); builder.append(artifact.getartifactid()); builder.append(':'); builder.append(artifact.getversion()); } builder.append("]"); throw new componentconfigurationexception(builder.tostring()); } catch (final expressionevaluationexception e) { throw new componentconfigurationexception("error evaluating expression", e); } catch (final invalidrepositoryexception e) { throw new componentconfigurationexception("error resolving local repository", e); } }
Original Comment: resolve the ldap server type artifact and its dependencies.
Generated Comment: gets the repository from the repository.
24.875783920288086
========================================
Code: public static byte[] decode(final string s) { int delta = s.endswith("==") ? 2 : s.endswith("=") ? 1 : 0; byte[] buffer = new byte[s.length() * bytes_per_unencoded_block / bytes_per_encoded_block - delta]; int mask = 0xff; int pos = 0; for (int i = 0; i < s.length(); i += bytes_per_encoded_block) { int c0 = decode_table[s.charat(i)]; int c1 = decode_table[s.charat(i + 1)]; buffer[pos++] = (byte) (((c0 << 2) | (c1 >> 4)) & mask); if (pos >= buffer.length) { return buffer; } int c2 = decode_table[s.charat(i + 2)]; buffer[pos++] = (byte) (((c1 << 4) | (c2 >> 2)) & mask); if (pos >= buffer.length) { return buffer; } int c3 = decode_table[s.charat(i + 3)]; buffer[pos++] = (byte) (((c2 << 6) | c3) & mask); } return buffer; }
Original Comment: decodes the given base64-encoded string.
Generated Comment: encodes a string value from a string.
24.304515838623047
========================================
Code: @override public void init(configurationvalueprovider... configurationvalueproviders) { if (configurationvalueproviders != null) { for (configurationproperty property : getcontainer().properties.values()) { property.init(configurationvalueproviders); } } }
Original Comment: override default values for properties with the given configurationproviders.
Generated Comment: configures all the options in the given configuration.
24.276317596435547
========================================
Code: private static boolean validatepart(string part, boolean isfinalpart) { // these tests could be collapsed into one big boolean expression, but // they have been left as independent tests for clarity. if (part.length() < 1 || part.length() > max_domain_part_length) { return false; } /* * gwt claims to support java.lang.character's char-classification methods, but it actually only * works for ascii. so for now, assume any non-ascii characters are valid. the only place this * seems to be documented is here: * http://osdir.com/ml/googlewebtoolkitcontributors/2010-03/msg00178.html * * <p>ascii characters in the part are expected to be valid per rfc 1035, with underscore also * being allowed due to widespread practice. */ string asciichars = charmatcher.ascii().retainfrom(part); if (!part_char_matcher.matchesallof(asciichars)) { return false; } // no initial or final dashes or underscores. if (dash_matcher.matches(part.charat(0)) || dash_matcher.matches(part.charat(part.length() - 1))) { return false; } /* * note that we allow (in contravention of a strict interpretation of the relevant rfcs) domain * parts other than the last may begin with a digit (for example, "3com.com"). it's important to * disallow an initial digit in the last part; it's the only thing that stops an ipv4 numeric * address like 127.0.0.1 from looking like a valid domain name. */ if (isfinalpart && charmatcher.digit().matches(part.charat(0))) { return false; } return true; }
Original Comment: helper method for }. validates that one part of a domain name is valid.
Generated Comment: parses a string representation of the given string.
24.256574630737305
========================================
Code: private void extractapklib( artifact apklibartifact ) throws mojoexecutionexception { getunpackedlibhelper().extractapklib( apklibartifact ); // copy the assets to the the combinedassets folder. // add the apklib source and resource to the compile. // nb apklib sources are added to compilesourceroot because we may need to compile against them. // this means the apklib classes will be compiled into target/classes and packaged with this build. copyfolder( getunpackedlibassetsfolder( apklibartifact ), combinedassets ); final file apklibsourcefolder = getunpackedapklibsourcefolder( apklibartifact ); final list<string> resourceexclusions = arrays.aslist( "**/*.java", "**/*.aidl" ); projecthelper.addresource( project, apklibsourcefolder.getabsolutepath(), null, resourceexclusions ); project.addcompilesourceroot( apklibsourcefolder.getabsolutepath() ); }
Original Comment: extracts apklib and adds the assets and apklib sources and resources to the build.
Generated Comment: extracts compiled from the cp compiler.
23.989707946777344
========================================
Code: public void adddefaultheader(final string name, final string value) { validate.notempty(name, "header name cannot be empty"); validate.notnull(value, "header value cannot be null, use an empty string instead"); this.checkconfigurable(); this.defaultheaders.put(name, value); }
Original Comment: adds a default header to be added to every stub http response.
Generated Comment: adds the headers to the headers.
23.846609115600586
========================================
Code: public static schema getschema(final file xsd, final errorhandler errorhandler) throws saxexception { // create a new instance for an xsd-aware schemafactory final schemafactory schemafactory = schemafactory .newinstance(http_www_w3_org_2001_xml_schema); // set the errorhandler implementation. schemafactory.seterrorhandler(errorhandler); // get the custom xsd schema that describes // the required format for my xml files. return schemafactory.newschema(xsd); }
Original Comment: gets the schema.
Generated Comment: creates a xml object from the given namespace.
23.77509880065918
========================================
Code: @override protected formatwriter createwriter(final outputstream outputstream, final formatlogger logger) { try { return new dsmlformatwriter(outputstream); } catch (final ioexception e) { logger.logerror("could not create and intialise the dsml writer", e); } return null; }
Original Comment: create the ldap writer that will dump ldap entries to a dsml file.
Generated Comment: creates a new writer as a xml file.
23.688125610351562
========================================
Code: @override public volatileimage createcompatiblevolatileimage(int width, int height, imagecapabilities caps, int transparency) throws awtexception { if (img == null) { img = new bufferedimage(1, 1, bufferedimage.type_int_argb); gc = img.creategraphics().getdeviceconfiguration(); } return gc.createcompatiblevolatileimage(width, height, caps, transparency); }
Original Comment: returns a volatile image. this method is a workaround for a classcastexception that occurs on macosx when exporting a swing ui that uses the nimbus look and feel to svg.
Generated Comment: create a new image from a new image.
23.60519790649414
========================================
Code: private static void printstacktrace(printstream out, throwable err) { out.println(err.getclass().getname() + ": " + err.getmessage()); for (stacktraceelement ste : err.getstacktrace()) { out.println("\tat " + ste.tostring()); } if (err.getcause() != null) { out.print("caused by: "); printstacktrace(out, err.getcause()); } }
Original Comment: print a complete stack trace. this differs from throwable.printstacktrace() in that it always prints all of the trace.
Generated Comment: print out a message.
23.529924392700195
========================================
</pre>
</div>
</div>
</div>
</div>
</div>
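The listing above appears to show the test examples the model scored worst on, sorted by loss from highest to lowest. As a minimal sketch (not the notebook's actual code, and with hypothetical field names), here is how such a "worst predictions" report can be produced: score each example, sort by loss descending, and print the top-k hardest cases for manual inspection.

```python
def worst_examples(records, k=3):
    """Return the k records with the highest loss, hardest first.

    Each record is a dict with 'code', 'original', 'generated', and
    'loss' keys (field names chosen for this sketch).
    """
    return sorted(records, key=lambda r: r["loss"], reverse=True)[:k]


def format_report(records):
    """Render records in the same layout as the output above."""
    lines = []
    for r in records:
        lines.append(f"Code: {r['code']}")
        lines.append(f"Original Comment: {r['original']}")
        lines.append(f"Generated Comment: {r['generated']}")
        lines.append(str(r["loss"]))
        lines.append("=" * 40)
    return "\n".join(lines)


# Toy records standing in for (code, reference comment, model comment, loss).
records = [
    {"code": "def a(): pass", "original": "does a.", "generated": "runs a.", "loss": 12.5},
    {"code": "def b(): pass", "original": "does b.", "generated": "runs b.", "loss": 24.9},
    {"code": "def c(): pass", "original": "does c.", "generated": "runs c.", "loss": 18.1},
]

print(format_report(worst_examples(records, k=2)))
```

Inspecting the highest-loss cases like this is a cheap way to spot systematic failure modes, such as the generic "gets the X from the X" style comments visible above.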
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="What's-Next?">What's Next?<a class="anchor-link" href="#What's-Next?"> </a></h1><p>If you'd like to see how you can integrate this code comment summarizer model into the popular VSCode IDE, check out my video that goes over just that!</p>
<center>
<iframe width="560" height="315" src="https://www.youtube.com/embed/SYjgPjQ-vbc" frameborder="0" allowfullscreen=""></iframe>
</center>
</div>
</div>
</div>
</div>
<script type="application/vnd.jupyter.widget-state+json">
{"3b857331a30c40be8c2d72aa74f3dc92": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_c3c9713ede3a4ba58f7e28a2b507a74d", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_a9ce294b6d1847168505f7b7933e451c", "IPY_MODEL_430af23edb5d48c0a2645dc4c7ef8389"]}}, "c3c9713ede3a4ba58f7e28a2b507a74d": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "a9ce294b6d1847168505f7b7933e451c": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_2321ec56ad9b4d58a281aec3c3bac773", "_dom_classes": [], "description": "100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 994, 
"_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 994, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_b02957dfe8034b5fa03112b8470eb5ae"}}, "430af23edb5d48c0a2645dc4c7ef8389": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_b2ed68215d0243749552e3532870e518", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 994/994 [00:04<00:00, 200.77it/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_d9d3d49b33b1445192e1d128407e3d31"}}, "2321ec56ad9b4d58a281aec3c3bac773": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "b02957dfe8034b5fa03112b8470eb5ae": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": 
"LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "b2ed68215d0243749552e3532870e518": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "d9d3d49b33b1445192e1d128407e3d31": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "806159a1e8e7483cb77e9fab5e75d380": 
{"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_f4c2498ed19a4a46bd1af0fc3c9ee076", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_1ba28592d96747ffb04b06739074c16f", "IPY_MODEL_020cbd0367f94df3a37f9613845078bb"]}}, "f4c2498ed19a4a46bd1af0fc3c9ee076": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "1ba28592d96747ffb04b06739074c16f": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_44a84637344c4385a0015e70c7f5939f", "_dom_classes": [], "description": "100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 33, "_view_module": "@jupyter-widgets/controls", 
"_model_module_version": "1.5.0", "value": 33, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_11070535e6674643954971d874a630f3"}}, "020cbd0367f94df3a37f9613845078bb": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_d406d02f151f4a838ccb77f096f7ceb9", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 33/33 [00:00<00:00, 85.45it/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_b2a2367d814b41a8a0ec8b5c30e99715"}}, "44a84637344c4385a0015e70c7f5939f": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "11070535e6674643954971d874a630f3": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": 
null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "d406d02f151f4a838ccb77f096f7ceb9": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "b2a2367d814b41a8a0ec8b5c30e99715": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "f0bfb5447b374b45932afed6489a67e8": {"model_module": "@jupyter-widgets/controls", "model_name": 
"HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_d578ec34916f446abbb5aaf17817a426", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_f4fdb2a5559b482e84f9e7465071782c", "IPY_MODEL_681dbea82baf4eb082230c640d1b4ec9"]}}, "d578ec34916f446abbb5aaf17817a426": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "f4fdb2a5559b482e84f9e7465071782c": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_c4113096379f464b9bb2158ef74852fc", "_dom_classes": [], "description": "100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 64, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 64, "_view_count": 
null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_4459d6897c7b42bdadabc914a087653d"}}, "681dbea82baf4eb082230c640d1b4ec9": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_774fc47921074a809d8c82c9d814c5f9", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 64/64 [00:00<00:00, 238.14it/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_848f6db15b2344c58d64d4f87dd48cbf"}}, "c4113096379f464b9bb2158ef74852fc": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "4459d6897c7b42bdadabc914a087653d": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": 
null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "774fc47921074a809d8c82c9d814c5f9": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "848f6db15b2344c58d64d4f87dd48cbf": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "534f9f4fc88e4e81a72e9d81b9a5648d": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", 
"_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_29122410a67c4d729f961b5f5570e3e4", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_d873a9252af64730a799f561ada8677f", "IPY_MODEL_8e55ea779e304a8299bad7e052640a3c"]}}, "29122410a67c4d729f961b5f5570e3e4": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "d873a9252af64730a799f561ada8677f": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_45aac35c512948ea964d175cb7353fe9", "_dom_classes": [], "description": "100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 3580, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 3580, "_view_count": null, "_view_module_version": "1.5.0", 
"orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_37fea365dff1432e8c2982a10ee5307f"}}, "8e55ea779e304a8299bad7e052640a3c": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_e1af434e3b3644f99534e97cfea29533", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 3580/3580 [00:00<00:00, 41481.30it/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_2782a4ec534d41faad0a26227706f274"}}, "45aac35c512948ea964d175cb7353fe9": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "37fea365dff1432e8c2982a10ee5307f": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, 
"height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "e1af434e3b3644f99534e97cfea29533": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "2782a4ec534d41faad0a26227706f274": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "013a3c838d7c4571946d62882d6cf1a0": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": 
"HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_04879c3a956b4eca8eeec56a47d926a9", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_36bee4dd22ab4c0fbe18f7238d1a5919", "IPY_MODEL_5ce34677aab94e6da06c14b4629cf5b1"]}}, "04879c3a956b4eca8eeec56a47d926a9": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "36bee4dd22ab4c0fbe18f7238d1a5919": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_a6b4336146b442c99f43132e1f461c38", "_dom_classes": [], "description": "100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 104, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 104, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, 
"description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_67e1dbac03ef4db097f2b3b236c21ee8"}}, "5ce34677aab94e6da06c14b4629cf5b1": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_c7ed2d7677954b0ca6be318cfa64fcc5", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 104/104 [00:00<00:00, 2280.36it/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_25e7d3db62db4215a3ae03d0341f5f53"}}, "a6b4336146b442c99f43132e1f461c38": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "67e1dbac03ef4db097f2b3b236c21ee8": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": 
null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "c7ed2d7677954b0ca6be318cfa64fcc5": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "25e7d3db62db4215a3ae03d0341f5f53": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "34a1861eb41445db9993b8bda487527f": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": 
"@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_fa5d8c90ad004919b805b6e30ad54bb7", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_3507efb456534046b57f7ad7f8537259", "IPY_MODEL_62b0b88a2901493ab46161b2bf495392"]}}, "fa5d8c90ad004919b805b6e30ad54bb7": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "3507efb456534046b57f7ad7f8537259": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_ecc567a207b14690814d5e677d40f182", "_dom_classes": [], "description": "100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 221, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 221, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, 
"_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_c22314a270fe43669ec5bcb3dac1abd1"}}, "62b0b88a2901493ab46161b2bf495392": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_46e8a13792354f70bed47c145f5eccd8", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 221/221 [00:00<00:00, 1687.63it/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_b18d80cf221c498fa6a5497788dbba02"}}, "ecc567a207b14690814d5e677d40f182": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "c22314a270fe43669ec5bcb3dac1abd1": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, 
"grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "46e8a13792354f70bed47c145f5eccd8": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "b18d80cf221c498fa6a5497788dbba02": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "b9631737b6a540279c81cc78b76dfd71": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", 
"_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_c48ead07233c4533beae6c158bd5afbc", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_427c73b836c94b9c83c2b46ce30f6c2d", "IPY_MODEL_e657e15687b1482ea2135061ebe2b7fd"]}}, "c48ead07233c4533beae6c158bd5afbc": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "427c73b836c94b9c83c2b46ce30f6c2d": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_5b0c206bf22a43bfb448060ebe07fd9d", "_dom_classes": [], "description": "Downloading: 100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 498, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 498, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": 
"@jupyter-widgets/controls", "layout": "IPY_MODEL_1a5e76b1a41c4064bc7da2bf24959700"}}, "e657e15687b1482ea2135061ebe2b7fd": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_bbe077c87847482e9bbe09dbfb29e9d7", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 498/498 [00:00<00:00, 5.06kB/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_36431709c7ac4b1cb2a35d3a1098a3a2"}}, "5b0c206bf22a43bfb448060ebe07fd9d": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "1a5e76b1a41c4064bc7da2bf24959700": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, 
"max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "bbe077c87847482e9bbe09dbfb29e9d7": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "36431709c7ac4b1cb2a35d3a1098a3a2": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "729d79ff96cf4688883dd1c9c1049f2e": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", 
"_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_05fe2a012ec04b2a94ba43fc6e4518e3", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_ff9a426de0a741d8acb77619a5165e94", "IPY_MODEL_aa326754c1b7474baeaa26297b8f6f52"]}}, "05fe2a012ec04b2a94ba43fc6e4518e3": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "ff9a426de0a741d8acb77619a5165e94": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_2d16b90f3fef416ea1f14ed9d8b1dc66", "_dom_classes": [], "description": "Downloading: 100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 898822, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 898822, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", 
"layout": "IPY_MODEL_4aa997f4850646a8b958ef0e6a57e431"}}, "aa326754c1b7474baeaa26297b8f6f52": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_e156ccff41ee49049f495991d19ea15a", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 899k/899k [00:00<00:00, 2.20MB/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_e239fed6374d4a6b8c9f8eb89387985a"}}, "2d16b90f3fef416ea1f14ed9d8b1dc66": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "4aa997f4850646a8b958ef0e6a57e431": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": 
null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "e156ccff41ee49049f495991d19ea15a": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "e239fed6374d4a6b8c9f8eb89387985a": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "bbeb816f10db4168852773dc2ac0df89": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, 
"_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_d439aa8a05164bfca299c371f30de5c4", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_8abfea12c4844dc29cb0158038012925", "IPY_MODEL_1398640d480a4053abf7f88fb9e3a7a6"]}}, "d439aa8a05164bfca299c371f30de5c4": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "8abfea12c4844dc29cb0158038012925": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_63650038ce6f453cbf4ee5d093ab4948", "_dom_classes": [], "description": "Downloading: 100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 456318, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 456318, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": 
"IPY_MODEL_3badcaa03b3d4bfe828381b3a4428855"}}, "1398640d480a4053abf7f88fb9e3a7a6": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_492c9f5aead44f70a98ffe9492abcd00", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 456k/456k [00:00<00:00, 2.32MB/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_6ca4e136ecfe4360abedd73054ae7bb2"}}, "63650038ce6f453cbf4ee5d093ab4948": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "3badcaa03b3d4bfe828381b3a4428855": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, 
"_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "492c9f5aead44f70a98ffe9492abcd00": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "6ca4e136ecfe4360abedd73054ae7bb2": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "b5fbe640e22a46948e7855c10870a3f2": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, 
"_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_084398d996914a68ac917bfb525d4cd9", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_04de005ab4d44df8b06913d3119adebd", "IPY_MODEL_b2449d9d10964fa687de4b31836726e9"]}}, "084398d996914a68ac917bfb525d4cd9": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "04de005ab4d44df8b06913d3119adebd": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_550e86e254914838be33eb4bf5d6091c", "_dom_classes": [], "description": "Downloading: 100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 150, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 150, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": 
"IPY_MODEL_9c5aaf17dd644f808bc202db261158b5"}}, "b2449d9d10964fa687de4b31836726e9": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_f4bfc99507244ec893cd2e81e1fb1517", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 150/150 [00:00<00:00, 1.58kB/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_0839beac94ff49439df6ff4bc127c974"}}, "550e86e254914838be33eb4bf5d6091c": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "9c5aaf17dd644f808bc202db261158b5": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, 
"_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "f4bfc99507244ec893cd2e81e1fb1517": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "0839beac94ff49439df6ff4bc127c974": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "c1168c1790f24e5f96868a35c5faddc2": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, 
"_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_f09b0dbbfe3a4d608b9287701f4a6920", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_53b2811b57b146fa9bafdf31c3a970fc", "IPY_MODEL_f60133fa3c8f4dedafb18c5fea2f8237"]}}, "f09b0dbbfe3a4d608b9287701f4a6920": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "53b2811b57b146fa9bafdf31c3a970fc": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_fb0434cf2c054be3a435ab911b181bdb", "_dom_classes": [], "description": "Downloading: 100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 25, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 25, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": 
"IPY_MODEL_9dd437cdab68446e9bdeb0c823f80add"}}, "f60133fa3c8f4dedafb18c5fea2f8237": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_13b3ef9bb5d44e20a97388632a544c20", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 25.0/25.0 [00:00<00:00, 221B/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_45bbf7337d5d437696a46c290733d08f"}}, "fb0434cf2c054be3a435ab911b181bdb": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "9dd437cdab68446e9bdeb0c823f80add": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, 
"_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "13b3ef9bb5d44e20a97388632a544c20": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "45bbf7337d5d437696a46c290733d08f": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "6f8a5e3468d946ff9b081efe89dfeeaa": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, 
"_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_39ba681b1ed6459d92606fc654dcdba8", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_e529b42e3cbb4b4ba09acd901a0894f2", "IPY_MODEL_9edee620640c4e3e801b50db0c862d3c"]}}, "39ba681b1ed6459d92606fc654dcdba8": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "e529b42e3cbb4b4ba09acd901a0894f2": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_eaedec158070463bbd78ef17de3171b9", "_dom_classes": [], "description": "100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 10, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 10, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": 
"IPY_MODEL_e8e02586085d44db937e917c1413b5dd"}}, "9edee620640c4e3e801b50db0c862d3c": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_b63870f196d3457587b7dfe4616a61bb", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 10/10 [00:03<00:00, 2.77it/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_8d170d9a18e64415a6398a78c5c0f04c"}}, "eaedec158070463bbd78ef17de3171b9": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "e8e02586085d44db937e917c1413b5dd": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, 
"_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "b63870f196d3457587b7dfe4616a61bb": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "8d170d9a18e64415a6398a78c5c0f04c": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "1dbf5adf34924620b231767d3e79bb6b": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, 
"_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_333382f4328f4db5ac6e413c0e351aa7", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_5eb687dae6f548329203b55b39dce3c1", "IPY_MODEL_26ba299915aa4ee397abcd561eb36cd2"]}}, "333382f4328f4db5ac6e413c0e351aa7": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "5eb687dae6f548329203b55b39dce3c1": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_3783617f671643e0a06b0b96ecde4bd8", "_dom_classes": [], "description": "100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 88, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 88, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": 
"IPY_MODEL_8701d0581cca4b0095c67c5a99ec9b8c"}}, "26ba299915aa4ee397abcd561eb36cd2": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_644a8d4ce6204e4da65d82c9a22f6834", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 88/88 [00:33<00:00, 2.62it/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_eb55290567c54e64bd87bcccab451a20"}}, "3783617f671643e0a06b0b96ecde4bd8": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "8701d0581cca4b0095c67c5a99ec9b8c": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, 
"_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "644a8d4ce6204e4da65d82c9a22f6834": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "eb55290567c54e64bd87bcccab451a20": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}}
</script>
</code></p></div></div></div></div>Open-Dialog Chatbots for Learning New Languages [Part 1]2020-05-12T00:00:00-05:002020-05-12T00:00:00-05:00https://nathancooper.io/i-am-a-nerd/chatbot/deep-learning/gpt2/2020/05/12/chatbot-part-1<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2020-05-12-chatbot-part-1.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<blockquote><p>Image by <a href="https://pixabay.com/users/mohamed_hassan-5229782/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=3589528">mohamed Hassan</a> from <a href="https://pixabay.com/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=3589528">Pixabay</a></p>
</blockquote>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="This-notebook-was-adapted-from-the-following-project:">This notebook was adapted from the following project:<a class="anchor-link" href="#This-notebook-was-adapted-from-the-following-project:"> </a></h2><ol>
<li><a href="https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py">https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py</a></li>
</ol>
<p>Original license of the project this notebook was adapted from: <a href="https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py">https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py</a></p>
<h3 id="LICENSE">LICENSE<a class="anchor-link" href="#LICENSE"> </a></h3>
<pre><code># Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.</code></pre>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="About">About<a class="anchor-link" href="#About"> </a></h1><p>Hola! Today we will be creating a chatbot, but not just any chatbot. In this tutorial, you will create your own open-dialog chatbot, one that doesn't just have premade responses to very specific questions or commands!</p>
<p>The overall goal of this tutorial is to create a language learning companion where you can practice simple conversations in a language you care about. We will focus on the beautiful Spanish language in this series, as I have been trying to learn the language for the past 5 years; however, you should be able to adapt this tutorial to other languages as well.</p>
<p>First we are going to cover some of the background material for how all this works (if you are already familiar with the GPT2 model, go ahead and skip this background section). Let's get to it!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Background">Background<a class="anchor-link" href="#Background"> </a></h1>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="What-is-GPT2?">What is GPT2?<a class="anchor-link" href="#What-is-GPT2?"> </a></h2><p>In this post, we are going to use the GPT2 model (Generative Pre-Training 2), from the amazing paper <a href="https://openai.com/blog/better-language-models/">"Language Models are Unsupervised Multitask Learners"</a> by Alec Radford et al. I will be giving a brief overview of this model. However, if you want a more in-depth explanation I highly recommend the blog post <a href="http://jalammar.github.io/illustrated-gpt2/">"The Illustrated GPT-2"</a> by Jay Alammar.</p>
<p>GPT2 is what is called an autoregressive language model. This may sound complicated, but it is actually quite simple, so let's break down what this means. Autoregressive means that the output of the model is fed back into the model as input. Here is a nice example of how that works:</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><img src="https://github.com/ncoop57/i-am-a-nerd/blob/master/images/autoregressive.gif?raw=1" alt="" /></p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<blockquote><p>Image From <a href="https://deepmind.com/blog/article/wavenet-generative-model-raw-audio">Deepmind</a></p>
</blockquote>
<p>Now, a language model is usually some statistical model that gives the probability of some word given the context. So, take the following example:</p>
<blockquote><p><em>An [blank] a day keeps the doctor away</em></p>
</blockquote>
<p>A good language model would give a higher probability to the word "apple" occurring in the [blank] than to the word "crocodile", since encountering a crocodile daily would likely have the opposite effect.</p>
<p>Putting them together, we get an autoregressive language model where given some context</p>
<blockquote><p><em>How much wood could a woodchuck chuck, if a woodchuck could [blank]</em></p>
</blockquote>
<p>The statistical model then gives some probability to what the next word will be, which we will use in selecting the word. Once we have the selection we add it to our sentence and repeat the whole process again!</p>
<blockquote><p><em>How much wood could a woodchuck chuck, if a woodchuck could chuck [blank]</em></p>
</blockquote>
<p>Now, to train our autoregressive language model we just need to get a bunch of example sentences or just chunks of text, hide the last word, and use these sentences with the missing word as our inputs and the last words as the target. This is essentially the whole idea behind GPT2 and many other autoregressive language models, where they learn how language works by using the context to infer the next word.</p>
</div>
</div>
</div>
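<p>The loop described above can be sketched in a few lines of plain Python. To be clear, this is a toy illustration, not GPT2: the bigram table and its probabilities are invented for the apple-a-day example, the model only conditions on the single previous word instead of the whole context, and it uses greedy decoding (always pick the most likely next word) for determinism.</p>

```python
# Toy "language model": next-word probabilities keyed only on the previous
# word. These numbers are made up for illustration -- a real model like
# GPT2 conditions on the entire context, not just one word.
bigram_probs = {
    "an":     {"apple": 0.6, "egg": 0.3, "crocodile": 0.1},
    "apple":  {"a": 0.9, "pie": 0.1},
    "a":      {"day": 1.0},
    "day":    {"keeps": 1.0},
    "keeps":  {"the": 1.0},
    "the":    {"doctor": 1.0},
    "doctor": {"away": 1.0},
}

def generate(start, max_words=10):
    """Autoregressive loop: predict the next word, append it, repeat."""
    words = [start]
    for _ in range(max_words):
        dist = bigram_probs.get(words[-1])
        if dist is None:                     # no known continuation -> stop
            break
        next_word = max(dist, key=dist.get)  # greedy decoding
        words.append(next_word)              # feed the output back in
    return " ".join(words)

print(generate("an"))  # -> an apple a day keeps the doctor away
```

<p>Each pass through the loop plays the role of one forward pass of the model: the text so far goes in, one more word comes out, and the extended text becomes the next input.</p>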
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="GPT2-as-a-chatbot">GPT2 as a chatbot<a class="anchor-link" href="#GPT2-as-a-chatbot"> </a></h2><p>Great, so you may be asking yourself, "how do we use GPT2 as a chatbot?" To answer this question we need to turn our attention to another paper, <a href="https://arxiv.org/abs/1911.00536">"DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation"</a>. To see how we can repurpose this generator, GPT2, look at the following example:</p>
<blockquote><p><em>Hi, how are you? [end_of_turn] I'm good, what about you? [end_of_turn] Not so good, lots of long nights at work. [end_of_turn] Darn, that sucks :( [end_of_conversation]</em></p>
</blockquote>
<p>This is a sample conversation between two speakers. What's special about it is that there are special tokens that signify when one of the speakers has finished talking, which we in the biz call a turn. If we treat this example like our previous one with the autoregressive language model, we can do some interesting things:</p>
<blockquote><p><em>Hi, how are you? [end_of_turn] [blank]</em></p>
</blockquote>
<p>If we use the same logic as we did previously, it is easy to see how we can now use GPT2 to guess the next word in this conversation.</p>
<blockquote><p><em>Hi, how are you? [end_of_turn] I'm [blank]</em></p>
</blockquote>
<p>We keep feeding back the prediction of our model and there ya have it! A chatting GPT2, where all we need to do is show the model a bunch of these example conversations and have it predict the next word in the conversation.</p>
<p>I think that is plenty of background, we will revisit exactly how we design a system where we actually hold a conversation with GPT2 once we have the model trained ;).</p>
</div>
</div>
</div>
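<p>The turn-separated format above can be sketched with a small helper. DialoGPT reuses GPT-2's end-of-text token as the turn separator; the token string below matches GPT-2's vocabulary, but treat the exact formatting as an assumption and double-check it against the tokenizer you actually load.</p>

```python
# End-of-text token from GPT-2's vocabulary, reused by DialoGPT as an
# end-of-turn marker (an assumption to verify against your tokenizer).
EOS = "<|endoftext|>"

def flatten_conversation(turns):
    """Flatten a list of speaker turns into one training string,
    with the end-of-turn token after every turn."""
    return EOS.join(turns) + EOS

def build_prompt(history):
    """At chat time: flatten the conversation so far; the model then
    generates the next turn word by word until it emits EOS."""
    return flatten_conversation(history)

turns = ["Hi, how are you?", "I'm good, what about you?"]
print(flatten_conversation(turns))
# -> Hi, how are you?<|endoftext|>I'm good, what about you?<|endoftext|>
```

<p>Training examples are just these flattened strings, so the same next-word objective from before teaches the model both what to say and when a turn ends.</p>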
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span> pip -q install <span class="nv">transformers</span><span class="o">==</span><span class="m">2</span>.9.0 gdown
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre> |████████████████████████████████| 645kB 5.5MB/s
|████████████████████████████████| 1.0MB 44.6MB/s
|████████████████████████████████| 890kB 45.8MB/s
|████████████████████████████████| 3.8MB 34.9MB/s
Building wheel for sacremoses (setup.py) ... done
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's define some configuration variables so we don't have a bunch of magic numbers and strings!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Args to allow for easy conversion of python script to notebook</span>
<span class="k">class</span> <span class="nc">Args</span><span class="p">():</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">output_dir</span> <span class="o">=</span> <span class="s1">'output'</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model_type</span> <span class="o">=</span> <span class="s1">'gpt2'</span>
<span class="bp">self</span><span class="o">.</span><span class="n">model_name_or_path</span> <span class="o">=</span> <span class="s1">'microsoft/DialoGPT-small'</span>
<span class="bp">self</span><span class="o">.</span><span class="n">config_name</span> <span class="o">=</span> <span class="s1">'microsoft/DialoGPT-small'</span>
<span class="bp">self</span><span class="o">.</span><span class="n">tokenizer_name</span> <span class="o">=</span> <span class="s1">'microsoft/DialoGPT-small'</span>
<span class="bp">self</span><span class="o">.</span><span class="n">cache_dir</span> <span class="o">=</span> <span class="s1">'cached'</span>
<span class="bp">self</span><span class="o">.</span><span class="n">block_size</span> <span class="o">=</span> <span class="mi">512</span>
<span class="bp">self</span><span class="o">.</span><span class="n">do_train</span> <span class="o">=</span> <span class="kc">True</span>
<span class="bp">self</span><span class="o">.</span><span class="n">do_eval</span> <span class="o">=</span> <span class="kc">True</span>
<span class="bp">self</span><span class="o">.</span><span class="n">evaluate_during_training</span> <span class="o">=</span> <span class="kc">False</span>
<span class="bp">self</span><span class="o">.</span><span class="n">per_gpu_train_batch_size</span> <span class="o">=</span> <span class="mi">4</span>
<span class="bp">self</span><span class="o">.</span><span class="n">per_gpu_eval_batch_size</span> <span class="o">=</span> <span class="mi">4</span>
<span class="bp">self</span><span class="o">.</span><span class="n">gradient_accumulation_steps</span> <span class="o">=</span> <span class="mi">1</span>
<span class="bp">self</span><span class="o">.</span><span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">5e-5</span>
<span class="bp">self</span><span class="o">.</span><span class="n">weight_decay</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="bp">self</span><span class="o">.</span><span class="n">adam_epsilon</span> <span class="o">=</span> <span class="mf">1e-8</span>
<span class="bp">self</span><span class="o">.</span><span class="n">max_grad_norm</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="bp">self</span><span class="o">.</span><span class="n">num_train_epochs</span> <span class="o">=</span> <span class="mi">3</span>
<span class="bp">self</span><span class="o">.</span><span class="n">max_steps</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="bp">self</span><span class="o">.</span><span class="n">warmup_steps</span> <span class="o">=</span> <span class="mi">0</span>
<span class="bp">self</span><span class="o">.</span><span class="n">logging_steps</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="bp">self</span><span class="o">.</span><span class="n">save_steps</span> <span class="o">=</span> <span class="mi">3500</span>
<span class="bp">self</span><span class="o">.</span><span class="n">save_total_limit</span> <span class="o">=</span> <span class="kc">None</span>
<span class="bp">self</span><span class="o">.</span><span class="n">eval_all_checkpoints</span> <span class="o">=</span> <span class="kc">False</span>
<span class="bp">self</span><span class="o">.</span><span class="n">no_cuda</span> <span class="o">=</span> <span class="kc">False</span>
<span class="bp">self</span><span class="o">.</span><span class="n">overwrite_output_dir</span> <span class="o">=</span> <span class="kc">True</span>
<span class="bp">self</span><span class="o">.</span><span class="n">overwrite_cache</span> <span class="o">=</span> <span class="kc">True</span>
<span class="bp">self</span><span class="o">.</span><span class="n">should_continue</span> <span class="o">=</span> <span class="kc">False</span>
<span class="bp">self</span><span class="o">.</span><span class="n">seed</span> <span class="o">=</span> <span class="mi">42</span>
<span class="bp">self</span><span class="o">.</span><span class="n">local_rank</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span>
<span class="bp">self</span><span class="o">.</span><span class="n">fp16</span> <span class="o">=</span> <span class="kc">False</span>
<span class="bp">self</span><span class="o">.</span><span class="n">fp16_opt_level</span> <span class="o">=</span> <span class="s1">'O1'</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">Args</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="The-Data!">The Data!<a class="anchor-link" href="#The-Data!"> </a></h1><p>To train our chatbot we will be using conversations scraped from subtitles of Spanish TV shows and movies. I've gone ahead and formatted the data for us already; however, if you would like to train your chatbot in a different language, you can use <a href="https://colab.research.google.com/drive/1kKErlSSpewQbWexFPEj1rPWsYpMx69ZS?usp=sharing">this script</a> to generate a CSV with the same format I use in the rest of this tutorial.</p>
</div>
</div>
</div>
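If you're curious what that format looks like under the hood, here's a rough sketch of how one conversation could become a single CSV row: a <code>response</code> column holding the last utterance, a <code>context</code> column holding the one before it, and <code>context/0</code> through <code>context/9</code> walking further back in time. The helper name and sample utterances here are my own illustration, not taken from the script itself.

```python
# Hypothetical helper sketching the CSV row layout used in this tutorial:
# "response" is the final utterance, "context" the one just before it,
# and "context/0" ... "context/9" step backwards through earlier turns.
def conversation_to_row(utterances):
    """Map a list of utterances (oldest first) to one training row."""
    row = {"response": utterances[-1], "context": utterances[-2]}
    for i, utt in enumerate(reversed(utterances[:-2])):
        row[f"context/{i}"] = utt
    return row

row = conversation_to_row([f"line {i}" for i in range(12)])
print(sorted(row.keys()))
```

A 12-utterance window therefore yields exactly the 12 columns you can see in the dataframe preview below.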
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span> gdown https://drive.google.com/uc?id<span class="o">=</span>1Lp-diuMohUTGyB9BSTFgeGZyY3dkNuEg
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Downloading...
From: https://drive.google.com/uc?id=1Lp-diuMohUTGyB9BSTFgeGZyY3dkNuEg
To: /content/final_es_conv.csv
20.3MB [00:00, 55.6MB/s]
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'final_es_conv.csv'</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">()</span>
<span class="n">trn_df</span><span class="p">,</span> <span class="n">val_df</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">test_size</span> <span class="o">=</span> <span class="mf">0.2</span><span class="p">)</span>
<span class="n">trn_df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>response</th>
<th>context</th>
<th>context/0</th>
<th>context/1</th>
<th>context/2</th>
<th>context/3</th>
<th>context/4</th>
<th>context/5</th>
<th>context/6</th>
<th>context/7</th>
<th>context/8</th>
<th>context/9</th>
</tr>
</thead>
<tbody>
<tr>
<th>36917</th>
<td>Es tan simple.</td>
<td>No se que te detiene.</td>
<td>¡Muy persuasiva!</td>
<td>No es crimen librarse de una alimaña.</td>
<td>¿Por qué tener lastima de un hombre tan vil?</td>
<td>Además, también ha puesto sus ojos en Okayo.</td>
<td>Hace 4 años que soy victima de mi marido.</td>
<td>Solo estoy siendo franca contigo.</td>
<td>Calmate.</td>
<td>¡Eres peor que el diablo!</td>
<td>¿Comprendes?</td>
<td>Okayo me recuerda constantemente mi fracaso.</td>
</tr>
<tr>
<th>5449</th>
<td>Muy torpe, Joyce.</td>
<td>A la sala de interrogación rápido. ¡Muévanse!</td>
<td>De pie, muchachos.</td>
<td>A la sala de interrogación rápido. ¡Muévanse!</td>
<td>De pie, muchachos.</td>
<td>¡Use su cuchillo, hombre!</td>
<td>¡Adelántese, Thomson!</td>
<td>¡Bien hecho, Jenkins!</td>
<td>Gracias.</td>
<td>Muy bien.</td>
<td>El bungaló del mayor Warden está al final del ...</td>
<td>Continúe, conductor.</td>
</tr>
<tr>
<th>37004</th>
<td>Pídemelo.</td>
<td>Sólo lo que quieras tú.</td>
<td>Ya no soy yo.</td>
<td>Eres preciosa y maravillosa.</td>
<td>¿No?</td>
<td>Así te gustaré.</td>
<td>Haré y diré lo que quieras.</td>
<td>Nunca.</td>
<td>Así nunca querrás estar con otras, ¿verdad?</td>
<td>Siempre diré lo que tú desees y haré lo que tú...</td>
<td>Pero yo sí.</td>
<td>Pero...</td>
</tr>
<tr>
<th>47077</th>
<td>¡Boris!</td>
<td>¡Nicolás, que alegría a mi corazón, volviste!</td>
<td>¡Regresan los Vencedores!</td>
<td>¡Miren!</td>
<td>¡Ahí vienen!</td>
<td>Está vivo.</td>
<td>Boris está vivo.</td>
<td>Dasha prometió avisarme cuando regrese.</td>
<td>Pero, en la fábrica dicen que él está en una u...</td>
<td>Tampoco hay noticias de Stepan.</td>
<td>¡Quién sabe!</td>
<td>¿Por qué entonces, no hay noticias de él?</td>
</tr>
<tr>
<th>41450</th>
<td>Entonces por qué no estamos en mejor situación...</td>
<td>Dora Hartley era una buena prueba.</td>
<td>Mire, lo que hace usted creer ¿Qué los indios ...</td>
<td>Aleja esa arma.</td>
<td>Buenas noches.</td>
<td>Es hora de ir a la cama.</td>
<td>Seguro.</td>
<td>Sí. recuerde que es un secreto.</td>
<td>Es bonita.</td>
<td>Está bien.</td>
<td>¿Ann Martin?</td>
<td>Hola, Bax.</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">len</span><span class="p">(</span><span class="n">trn_df</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">val_df</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>(40374, 10094)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">get_counter_and_lens</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">):</span>
<span class="n">flatten</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="p">[</span><span class="n">item</span> <span class="k">for</span> <span class="n">sublist</span> <span class="ow">in</span> <span class="n">l</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">sublist</span><span class="p">]</span>
<span class="n">toks</span> <span class="o">=</span> <span class="p">[</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">tokenize</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">data</span><span class="p">]</span>
<span class="k">return</span> <span class="nb">list</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">len</span><span class="p">,</span> <span class="n">toks</span><span class="p">)),</span> <span class="n">Counter</span><span class="p">(</span><span class="n">flatten</span><span class="p">(</span><span class="n">toks</span><span class="p">)),</span> <span class="n">Counter</span><span class="p">(</span><span class="s1">' '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">data</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">model_name_or_path</span><span class="p">,</span> <span class="n">cache_dir</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">cache_dir</span><span class="p">)</span>
<span class="n">lens</span><span class="p">,</span> <span class="n">tok_cnt</span><span class="p">,</span> <span class="n">word_cnt</span> <span class="o">=</span> <span class="n">get_counter_and_lens</span><span class="p">(</span><span class="n">trn_df</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="s1">' '</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)),</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">),</span> <span class="n">tokenizer</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">plot_counts</span><span class="p">(</span><span class="n">counts</span><span class="p">,</span> <span class="n">top_k</span> <span class="o">=</span> <span class="mi">30</span><span class="p">):</span>
<span class="n">labels</span><span class="p">,</span> <span class="n">values</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">counts</span><span class="o">.</span><span class="n">most_common</span><span class="p">()[:</span><span class="n">top_k</span><span class="p">])</span>
<span class="n">indexes</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">labels</span><span class="p">))</span>
<span class="n">width</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">num</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">22</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span> <span class="n">facecolor</span><span class="o">=</span><span class="s1">'w'</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s1">'k'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">indexes</span><span class="p">,</span> <span class="n">values</span><span class="p">,</span> <span class="n">width</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">indexes</span> <span class="o">+</span> <span class="n">width</span> <span class="o">*</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">plot_counts</span><span class="p">(</span><span class="n">tok_cnt</span><span class="p">,</span> <span class="n">top_k</span> <span class="o">=</span> <span class="mi">30</span><span class="p">)</span>
<span class="n">plot_counts</span><span class="p">(</span><span class="n">word_cnt</span><span class="p">,</span> <span class="n">top_k</span> <span class="o">=</span> <span class="mi">30</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABDAAAADRCAYAAAA6w0IiAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAAJOgAACToB8GSSSgAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nO3de1iVZb7/8TcgeAwULaPQ3Htkwqsry9JKEEFNY+OhMqGD4ribDtpMkzM5ObXLUvLKrMumPdPY3qnplB3AjlsbD4R4ALdmc2lZ2ZgzFmInBwMzJwT5/dEPRtlCkQvW0t6vv+Th4f7e93Nyrc+6n2eF1dTU1CBJkiRJkhTCwoPdAUmSJEmSpG9jgCFJkiRJkkKeAYYkSZIkSQp5BhiSJEmSJCnkGWBIkiRJkqSQ16qxX3766adceeWVREZGEhERweLFi7n22muprq4mIiKCn/70p2RnZ/PJJ58wfvx4Dhw4wKRJkxg3bhzV1dXceOON7NixgwsvvJDf/va3ADz66KPk5ubSuXNnnn76aaKjo1m/fj133HEH4eHhzJ07l3PPPfeY/bnwwgv50Y9+FPitIEmSJEmSQsLOnTt58803/8/ysMa+RrW6upqwsDDCw8NZuHAhu3fvJj8/n6VLl9KhQ4e69X75y18yfPhw0tLSSElJYfXq1axYsYI33niD+++/nxtvvJHrr7+ehIQErr76avLz83nmmWf46KOPuPPOO0lNTeXll19m//79TJw4kddee+2Y/cnKyiI3NzcAm0OSJEmSJIWiht77N3oLSUREBOHh36yyf/9+zjnnHMLDw8nIyGDUqFF8+OGHAGzatInBgwfTqlUr+vbty7Zt2yguLmbYsGEApKenU1RUxBtvvEFqaiphYWF1yw4ePEhERASdOnWie/fulJWVBXrskiRJkiTpBNfoLSQAW7Zs4eabb+aLL75g5cqV5OXl0blzZ9asWcOtt97Kq6++yqFDh+qCjpiYGMrKyti3bx/R0dFNWgbQqlUrKisriYqKAiAvL4+8vDwASkpKAjt6SZIkSZJ0QvjWh3ief/75bNy4kZycHB544AE6d+4MQGpqKnv27AEgMjKSw4cPA1BeXk5sbCwdO3akoqKiScsAqqqq6sILgMzMTHJzc8nNzaVbt24BGrYkSZIkSTqRNBpgVFZW1v07JiaGdu3a1YUN7777Lp06dQKgX79+FBYWUlVVxZtvvsk555xDUlIS+fn5AKxYsYLk5GT69evH2rVrj1rWrl07qqqq+OKLLygpKSE2NrZZBipJkiRJkk5cjd5CsmXLFqZMmUJERARt2rRhwYIFDB48mLZt2wLw2GOPATB16lTGjx/P3XffzcSJE2nbti0jRozg5ZdfJiUlhT59+tC/f38Ahg8fTnJyMp06dWLx4sUA3H///WRkZBAWFsYf/vCH5hyvJEmSJEk6ATX6LSShxm8hkSRJkiTp5Pa9voVEkiRJkiQpFHzrt5AoMHr8ZlnQau+aNTxotSVJkiRJCgRnYEiSJEmSpJBngCFJkiRJkkKeAYYkSZIkSQp5BhiSJEmSJCnkGWBIkiRJkqSQZ4AhSZIkSZJCngGGJEmSJEkKeQYYkiRJkiQp5BlgSJIkSZKkkGeAIUmSJEmSQp4BhiRJkiRJCnkGGJIkSZIkKeQZYEiSJEmSpJBngCFJkiRJkkKeAYYkSZIkSQp5jQYYn376KUlJSaSmpjJ48GA+/vhj1q9fT1JSEgMGDODtt98G4JNPPmHYsGEkJyfz9NNPA1BdXc31119PSkoKkydPrmvz0UcfJTk5mVGjRlFRUQFwzDYlSZIkSZJqNRpgdOnShfXr17NmzRrGjx/P/Pnz+Y//+A+WLVvGM888w9SpUwF48MEHueOOO1izZg2PPfYY//jHP1i6dClnnHEG69at48CBA2zYsIG9e/fy6quvsn79eq6++moee+wx
gGO2KUmSJEmSVKvRACMiIoLw8G9W2b9/Pz/60Y+IiIigU6dOdO/enbKyMgA2bdrE4MGDadWqFX379mXbtm0UFxczbNgwANLT0ykqKuKNN94gNTWVsLCwumUHDx48ZpuSJEmSJEm1vvUZGFu2bOHiiy/m97//PUlJSURHR9f9rlWrVlRWVnLo0KG6oCMmJoaysjL27dtXt+53XXZkm5IkSZIkSbVafdsK559/Phs3biQ3N5eZM2fWPbcCoKqqiqioKCIjIzl8+DDh4eGUl5cTGxtLx44d69Y9ctkHH3zwf5Ydq81aeXl55OXlAVBSUhKYUUuSJEmSpBNKozMwjpwJERMTQ4cOHaiqquKLL76gpKSE2NhYAPr160dhYSFVVVW8+eabnHPOOSQlJZGfnw/AihUrSE5Opl+/fqxdu/aoZe3atTtmm7UyMzPJzc0lNzeXbt26BXTwkiRJkiTpxNDoDIwtW7YwZcoUIiIiaNOmDQsWLGDHjh1kZGQQFhbGH/7wBwCmTp3K+PHjufvuu5k4cSJt27ZlxIgRvPzyy6SkpNCnTx/69+8PwPDhw0lOTqZTp04sXrwYgPvvv///tClJkiRJklQrrKampibYnfiusrKyyM3NDXY3vpcev1kWtNq7Zg0PWm1JkiRJkpqioff+3/oQT0mSJEmSpGAzwJAkSZIkSSHPAEOSJEmSJIU8AwxJkiRJkhTyDDAkSZIkSVLIM8CQJEmSJEkhzwBDkiRJkiSFPAMMSZIkSZIU8gwwJEmSJElSyDPAkCRJkiRJIc8AQ5IkSZIkhTwDDEmSJEmSFPIMMCRJkiRJUsgzwJAkSZIkSSHPAEOSJEmSJIU8AwxJkiRJkhTyDDAkSZIkSVLIM8CQJEmSJEkhr9EAY9OmTfTv35+BAwdy7bXXcujQIRISEkhLSyMtLY1Vq1YBsH37dgYOHEhSUhKvv/46AAcOHGD06NEMGDCA2bNn17U5depUUlJSyM7O5tChQwDk5eWRlJTEkCFD2L17d3ONVZIkSZIknaAaDTC6detGQUEBa9eupUePHrzyyivExMRQWFhIYWEhQ4cOBeCuu+5i/vz5LF++nGnTpgEwb948MjIyWL9+PQUFBZSWlrJ161ZKS0tZt24diYmJLFmyhKqqKubMmUNhYSEzZswgJyen+UctSZIkSZJOKI0GGHFxcbRt2xaAqKgowsPD+fLLL0lNTeW6666jrKwMgD179pCQkEB0dDSxsbHs3buX4uJihg0bBsDQoUPZsGHDUcvS09MpKipix44d9OrVi6ioKJKTk3nrrbeac7ySJEmSJOkE9J2egfHhhx+ycuVKRo4cSVFREWvWrCE9PZ17770XgMOHD9etGxMTQ1lZGfv27SM6OrpJywCqq6uPqp2Xl0dWVhZZWVmUlJQc32glSZIkSdIJ6VsDjIqKCrKzs1m4cCGRkZF07twZgDFjxrB169ZvGgn/ZzPl5eXExsbSsWNHKioqmrQMICIi4qj6mZmZ5ObmkpubS7du3Y5zuJIkSZIk6UTUaIBRVVXFNddcw7333svZZ59NZWUlX3/9NQDr1q2jZ8+ewDe3muzcuZP9+/dTVlZGly5dSEpKIj8/H4D8/HwuueSSo5atWLGC5ORkEhISeO+996isrKS4uJjevXs353glSZIkSdIJqFVjv3z22WfZuHEjOTk55OTkMGnSJGbPnk379u1p3bo1CxYsAGDmzJlMmDCB6upqpk+fDsANN9zAuHHjWLBgASNGjCA+Pp74+Hi6du1KSkoK3bt3Z8qUKURGRjJ58mTS0tJo06YNixYtav5RS5IkSZKkE0pYTU1NTbA78V1lZWWRm5sb7G58Lz1+syxotXfNGh602pIkSZIkNUVD7/2/00M8JUmSJEmSgskAQ5IkSZIkhTwDDEmSJEmSFPIMMCRJkiRJUsgzwJAkSZIkSSHPAEOSJEmSJIU8AwxJkiRJkhTyDDAkSZIkSVLIM8CQJEmSJEkhzwBDkiRJkiSF
PAMMSZIkSZIU8gwwJEmSJElSyDPAkCRJkiRJIc8AQ5IkSZIkhTwDDEmSJEmSFPIMMCRJkiRJUsgzwJAkSZIkSSGv0QBj06ZN9O/fn4EDB3Lttddy6NAh8vLySEpKYsiQIezevRuA7du3M3DgQJKSknj99dcBOHDgAKNHj2bAgAHMnj27rs2pU6eSkpJCdnY2hw4dAjhmm5IkSZIkSbUaDTC6detGQUEBa9eupUePHrzyyivMmTOHwsJCZsyYQU5ODgB33XUX8+fPZ/ny5UybNg2AefPmkZGRwfr16ykoKKC0tJStW7dSWlrKunXrSExMZMmSJVRVVR2zTUmSJEmSpFqNBhhxcXG0bdsWgKioKN5//3169epFVFQUycnJvPXWWwDs2bOHhIQEoqOjiY2NZe/evRQXFzNs2DAAhg4dyoYNG45alp6eTlFRETt27Dhmm5IkSZIkSbVafZeVPvzwQ1auXMmsWbP4/PPP65ZXV1cDcPjw4bplMTExlJWVsW/fPqKjo//Psri4uAbXO7LNWnl5eeTl5QFQUlLyfcYoSZIkSZJOcN8aYFRUVJCdnc3ChQuprq6moqKi7ncREREAhIf/cyJHeXk5sbGxdOzYkYqKCjp27Eh5eTlnnXUWVVVVdX9ff736bdbKzMwkMzMTgKysrOMYqiRJkiRJOlE1egtJVVUV11xzDffeey9nn302CQkJvPfee1RWVlJcXEzv3r2Bb2412blzJ/v376esrIwuXbqQlJREfn4+APn5+VxyySVHLVuxYgXJyckNtilJkiRJklSr0RkYzz77LBs3biQnJ4ecnBwmTZrE5MmTSUtLo02bNixatAiAmTNnMmHCBKqrq5k+fToAN9xwA+PGjWPBggWMGDGC+Ph44uPj6dq1KykpKXTv3p0pU6YQGRl5zDYlSZIkSZJqhdXU1NQEuxPfVVZWFrm5ucHuxvfS4zfLglZ716zhQastSZIkSVJTNPTev9FbSCRJkiRJkkKBAYYkSZIkSQp5BhiSJEmSJCnkGWBIkiRJkqSQZ4AhSZIkSZJCngGGJEmSJEkKeQYYkiRJkiQp5BlgSJIkSZKkkGeAIUmSJEmSQp4BhiRJkiRJCnkGGJIkSZIkKeQZYEiSJEmSpJBngCFJkiRJkkKeAYYkSZIkSQp5BhiSJEmSJCnkGWBIkiRJkqSQZ4AhSZIkSZJCngGGJEmSJEkKeY0GGOXl5Vx00UV06NCBbdu2AZCQkEBaWhppaWmsWrUKgO3btzNw4ECSkpJ4/fXXAThw4ACjR49mwIABzJ49u67NqVOnkpKSQnZ2NocOHQIgLy+PpKQkhgwZwu7du5tloJIkSZIk6cTVaIDRrl07li1bxpgxY+qWxcTEUFhYSGFhIUOHDgXgrrvuYv78+Sxfvpxp06YBMG/ePDIyMli/fj0FBQWUlpaydetWSktLWbduHYmJiSxZsoSqqirmzJlDYWEhM2bMICcnpxmHK0mSJEmSTkSNBhiRkZGceuqpRy378ssvSU1N5brrrqOsrAyAPXv2kJCQQHR0NLGxsezdu5fi4mKGDRsGwNChQ9mwYcNRy9LT0ykqKmLHjh306tWLqKgokpOTeeutt5pjnJIkSZIk6QTW5GdgFBUVsWbNGtLT07n33nsBOHz4cN3vY2JiKCsrY9++fURHRzdpGUB1dfVR9fLy8sjKyiIrK4uSkpKmj1CSJEmSJJ3wWjX1Dzp37gzAmDFjmDdvHgDh4f/MQcrLy4mNjaVjx45UVFTQsWNHysvLOeuss6iqqqKiouKY69WKiIg4ql5mZiaZmZkAZGVlNbW7Anr8ZlnQau+aNTxotSVJkiRJJ48mzcCorKzk66+/BmDdunX07NkTgLi4OHbu3Mn+/fspKyujS5cuJCUlkZ+fD0B+fj6XXHLJUctWrFhBcnIyCQkJvPfee1RWVlJcXEzv3r0DOT5JkiRJknQS+NYZGBkZGWzZsoX333+fK664gtzcXNq3b0/r1q1ZsGAB
ADNnzmTChAlUV1czffp0AG644QbGjRvHggULGDFiBPHx8cTHx9O1a1dSUlLo3r07U6ZMITIyksmTJ5OWlkabNm1YtGhR845YkiRJkiSdcMJqampqgt2J7yorK4vc3Nxgd+N7CeZtHMHkLSSSJEmSpKZo6L1/kx/iKUmSJEmS1NIMMCRJkiRJUshr8reQSE3hN6BIkiRJkgLBGRiSJEmSJCnkGWBIkiRJkqSQZ4AhSZIkSZJCngGGJEmSJEkKeQYYkiRJkiQp5BlgSJIkSZKkkGeAIUmSJEmSQp4BhiRJkiRJCnkGGJIkSZIkKeQZYEiSJEmSpJBngCFJkiRJkkJeq2B3QGouPX6zLGi1d80aHrTakiRJknQycgaGJEmSJEkKeQYYkiRJkiQp5DUaYJSXl3PRRRfRoUMHtm3bBkBeXh5JSUkMGTKE3bt3A7B9+3YGDhxIUlISr7/+OgAHDhxg9OjRDBgwgNmzZ9e1OXXqVFJSUsjOzubQoUMNtilJkiRJklSr0QCjXbt2LFu2jDFjxgBQVVXFnDlzKCwsZMaMGeTk5ABw1113MX/+fJYvX860adMAmDdvHhkZGaxfv56CggJKS0vZunUrpaWlrFu3jsTERJYsWdJgm5IkSZIkSbUaDTAiIyM59dRT637esWMHvXr1IioqiuTkZN566y0A9uzZQ0JCAtHR0cTGxrJ3716Ki4sZNmwYAEOHDmXDhg1HLUtPT6eoqKjBNiVJkiRJkmo16VtI9u3bR3R0dN3P1dXVABw+fLhuWUxMDGVlZUete+SyuLi4Btc7ss1aeXl55OXlAVBSUtKU7kqSJEmSpJNEkwKMjh07UlFRUfdzREQEAOHh/5zIUV5eTmxsbN26HTt2pLy8nLPOOouqqqq6v6+/Xv02a2VmZpKZmQlAVlZWE4cnSZIkSZJOBk0KMBISEnjvvfeorKxk8+bN9O7dG4C4uDh27tzJaaedRllZGV26dCEpKYn8/Hyuv/568vPzeeKJJ9i7dy9z5sxh/PjxrFixguTk5AbblE5kPX6zLGi1d80aHrTakiRJktRcvjXAyMjIYMuWLbz//vvcfPPNTJ48mbS0NNq0acOiRYsAmDlzJhMmTKC6uprp06cDcMMNNzBu3DgWLFjAiBEjiI+PJz4+nq5du5KSkkL37t2ZMmUKkZGRx2xTkiRJkiSpVlhNTU1NsDvxXWVlZZGbmxvsbnwvwfxEXj8szsCQJEmSdCJr6L1/o99CIkmSJEmSFAqa9AwMSaHP529IkiRJOhk5A0OSJEmSJIU8AwxJkiRJkhTyvIVEUsB4+4okSZKk5mKAIemkYHgiSZIkndwMMCTpOBmeSJIkSc3PZ2BIkiRJkqSQZ4AhSZIkSZJCnreQSNIJzNtXJEmS9EPhDAxJkiRJkhTyDDAkSZIkSVLIM8CQJEmSJEkhz2dgSJK+l2A+fyOYfPaHJElScDgDQ5IkSZIkhTwDDEmSJEmSFPIMMCRJkiRJUsjzGRiSJDVBMJ/94fM3JEnSD1mTA4xdu3bRr18/zjnnHADy8vIoLCzkkUceoW3btixatIj4+Hi2b9/OTTfdRFVVFTk5OQwZMoQDBw6QnZ3NZ599xqhRo7jjjjsAmDp1KsXFxfTo0YMFCxYQGRkZ2FFKknQS8MGpkiTph+x73UKSmppKYWEhhYWFdOrUiTlz5lBYWMiMGTPIyckB4K677mL+/PksX76cadOmATBv3jwyMjJYv349BQUFlJaWsnXrVkpLS1m3bh2JiYksWbIkcKOTJEmSJEknhe91C0lRUREpKSmkpKSQnZ1Nr169iIqKIjk5mSlTpgCwZ88eEhISAIiNjWXv3r0UFxfz0EMPATB06FA2bNjA559/zrBhwwBIT0/nySef5Nprrw3E2CRJ0knA23YkSRJ8jwAjLi6ODz74gHbt2nHjjTfy4osvEh0dXff76upqAA4fPly3LCYmhrKyMvbt21e37pHL4uLijlp2pLy8PPLy8gAoKSlp
anclSZIkSdJJoMm3kLRu3Zr27dsTFhbG6NGj2bp1KxUVFXW/j4iI+Kbh8H82XV5eTmxsLB07dqxbt7FlR8rMzCQ3N5fc3Fy6devW9BFKkiRJkqQTXpNnYOzfv59TTjkFgHXr1jF8+HAef/xxKisr2bx5M7179wa+mamxc+dOTjvtNMrKyujSpQtJSUnk5+dz/fXXk5+fzxNPPMHevXuZM2cO48ePZ8WKFSQnJwd2hJIkSd/TD/XBqT9U3jIkSaGtyQHG+vXrufvuu2nXrh3/8i//Qk5ODm3atCEtLY02bdqwaNEiAGbOnMmECROorq5m+vTpANxwww2MGzeOBQsWMGLECOLj44mPj6dr166kpKTQvXv3umdoSJIkSZIk1QqrqampCXYnvqusrCxyc3OD3Y3vxU9wJEmSpNDhjBspdDX03v97fQuJJEmSJJ3I/IYj6cRjgCFJkiRJLeiHOjvb4EbHq8nfQiJJkiRJktTSDDAkSZIkSVLI8xYSSZIkSVKz+6HeOhNMJ9ttO87AkCRJkiRJIc8AQ5IkSZIkhTwDDEmSJEmSFPIMMCRJkiRJUsgzwJAkSZIkSSHPAEOSJEmSJIU8AwxJkiRJkhTyDDAkSZIkSVLIM8CQJEmSJEkhzwBDkiRJkiSFPAMMSZIkSZIU8gwwJEmSJElSyAuZAGPq1KmkpKSQnZ3NoUOHgt0dSZIkSZIUQkIiwNi6dSulpaWsW7eOxMRElixZEuwuSZIkSZKkEBISAUZxcTHDhg0DID09naKioiD3SJIkSZIkhZJWwe4AwL59+4iLiwMgJiaGsrKyut/l5eWRl5cHwObNm8nKygpKH4/XRUGsXVJSQrdu3axtbWtb29rWtra1rW1ta1vb2j+g2v373x+02sdj586dx/5FTQh47LHHahYtWlRTU1NTs3nz5pqf/exnQe7RySUzM9Pa1ra2ta1tbWtb29rWtra1rW3tE1pI3EKSlJREfn4+ACtWrCA5OTnIPZIkSZIkSaEk4r777rsv2J04/fTTKS4uJicnh8rKSu68804iIiKC3a2TyjnnnGNta1vb2ta2trWtbW1rW9va1rb2CSuspqamJtidkCRJkiRJakxI3EIinYz27t3LI488EuxuSGoGH3/8cd2tj5IkSWoZBhhSgPz7v//7UT/PmTOHDh068OKLLwapR82r/nhPZj+ksYa6lt4XW7duZdKkSdx7770cOHCgbvlDDz1EYmJis9YO9nEXjPoNbe+WEuxt/kMW7H3f0jzWgiMY2z3Y+zrY9UPh3A72NmhpJ/t4DTCk41RUVMSgQYP429/+RmpqKi+88AIAqamp/OpXv+Lf/u3fgtzDwGpovLXGjBnDrl27Wqw/GzduJDk5mQsuuICnn346oG1/21jVcoK1L8477zy2bt1KREQE7du3Z+HChXTp0oX777+f+Ph4pkyZQmFhYUBrBvu4O1b9tLQ0vvzyy2avXX97A9TU1DBr1iwGDBjAoEGDePXVVwNeN9jbPBS11D6vdax9fzLyWAuO77rd77vvPpYuXdqiNev74osvyM3NDVr9QAvmuR0q26Cl/FDG2yrYHZBOZH//+9+55ZZbWL58OXFxcRw6dIjNmzcDkJeXR3Z2Nq+99hpXXXVVkHsaGI2NN1ji4uIoKCigpqaGlJQUxo0bF5B2Q3GsP1TB3BclJSXEx8dTWFjItGnTgG8ePD1//nxuvfXWgNcL9nEX7PrH2t6PPvoo1dXVtGvXjpqaGlavXk2XLl1ISkoKSM1gj1nfONa+D7bDhw8THh64z/o81oIjGNv9eGrWBhhZWVlBqR9o9c/tP/3pT/z5z3/mlFNO4Re/+EWz1Q2lbdASfkjjdQaGTjrV1dWMGzeO1NRUhg8fzr59+5qt1muvvcaVV15JXFwcAJGRkfTv35+qqip27drFPffc02K3kHz66acMGjSIlJQUxowZQ3V1dcBrNDTe/Px8LrjgAkaPHk1paSkA//jHPxg3bhyDBw9m1KhR
VFRUBLw/AN27d6d169Zs2rSJs88+O2Dtfpex9u/fn127drFw4UJ+//vfA7B06VJqv9xp4cKFpKSkkJSUREFBQcD6dqSbb765Wdo9lvrn1ptvvklSUhKDBg1q1n40tC8eeugh0tLSuOCCC1i1alWz1F6yZAljx44lMTGR7du3AzBhwgSeeuopqqqqAl6vobEe+Wl47SynhQsXctVVVzFy5Ej69evHxx9/3Gz1a7399tukpqbSv39/fv7znx93vfrqb+/a54y/9957LFq0iNdee42HHnqI//3f/w1YzYbGvHnz5rpr6sMPPwzA448/zkUXXcTgwYN56aWXAtaHWvWv4zt37myRcywUHOtca4kxH2ubDxw4kKuvvpoHH3wwoLWacqzdd999ZGdnk5GRQWpqKgcPHgxoX+qbPHlys9eor6amhltvvZVBgwZx6aWXsnv37map05TtHoya9a8rc+fOZc2aNaSlpfHuu+8GtH5aWhq/+tWvGDhwYN01vKKiglGjRpGamso111xDZWVlALbAP9U/t/v3709ZWRkdO3YMaJ36jrUNevbsyfDhw+vWGTJkSLO9Pq1/fH/00UekpaUxaNAgLr/88oDX+7Z9fskll3Dfffdx66230rdvX377298GvA8txRkYOum89NJLxMfH8/TTT/PUU0/xu9/9rtk+zdmzZ0/dhaKgoIAZM2YQHR3Nz3/+cy699FLi4uL48ssvOXjwIG3btm2WPtTq1KkTq1atolWrVtx2220UFBQwdOjQgNZoaLyfffYZ+fn5tG/fnh//+McAzJs3j8GDB3P99dfz/PPP89///d9MmTIloP2p9fnnn/PrX/86YNM+oWljPZa///3vPPfcc6xdu5avvvqK4cOHM3jw4ID1r1bfvn0D3mZD6p9bS5cuZdy4cdxyyy0cPny42eo2tC+ee+45fv3rX/PZZ5+RmZkZ8OMdYOXKlbz88svExsaSl5dHt27daNOmDaNGjeL5558PeL2GxtqQmJgYFixYwNy5c8nLyzvuT7O+rX7Pnj0pLCwkLCyMyy+/nB07dpCQkHBcNY9Uf3tPnDiRnj17smXLlrp+wTfXu0BpaMxfffUVL774Ip06dWLkyJFkZ2eTm5tLfn4+0dHRzXLM17+O5+fnt8g5FjAZTx4AAAgHSURBVArq7/t77rmH//qv/2r2usf6v7O0tJT8/HyioqICWqspxxpAQkICTz31FFOnTmXVqlWMGjUqoP05UjDezCxbtoxOnTqxevVqNm7cyKxZs+o+DAikpm73lq5Z/7rSp08fdu7cyZIlSwJeH+CKK65gzpw59O/fn/Lycp544gkyMjKYOHEiOTk5PPfcc4wfP/74N8L/V//cTkxM5IwzzmjWDxih4W0QFRXFxx9/zMGDBznttNMa/T/2eNQ/vi+77DJGjhzJ7Nmzm+V63tg+v+qqq3j44Yfp3r07S5cu5ZFHHuHiiy9m8uTJAe9HSzDA0Enngw8+oF+/fgD069ePlStXNlutM844gx07dgAwePBgBg8eTN++fVmyZAnvv/8+hYWFlJaW8qc//YnRo0c3Wz/gmzfMkyZNYt++fezZs4cLLrgg4DUaGm9YWBixsbEA9O7dG4B3332XN954gz/+8Y8cOnSIlJSUgPen1urVqxk7diynnnpqwNpsyljDwsLq/q72E+OdO3fyzjvvMGjQIOCbkCXQampq+OMf/8iNN94Y8LaPpf659eKLL9KzZ0/Gjh3LZZddFtAXPEdqaF889dRTLF68mPDw8IDMPqhv9+7dbNu2jcsvv5yamhrKy8uZOHEiAD/72c8YMWLEUbMTAqGhsZ5yyil16xz57ed9+vQBoFu3brz55pvNVr9Dhw4A/O1vf+P222/nq6++4q9//St79uwJWIBxrO199913s2PHDqqrq/nkk0/o0qULQEBf+DY05o8++ogrr7yyrl5JSQmzZs3itttuo6amhjvvvDOgs77g2Nfxbdu2Nfs5diyBfrZLY461
7++55x769u3bIlP9j9zmsbGxnHfeeQEPL6BpxxocfX4395u9tLQ0li5dWneut4R3332Xl156ibVr11JTU0O3bt2apU5Tt3tL16x/XWndunWz1e/QoUPdcXXmmWfyxRdf8MEHH9S9jujXrx9FRUXHXb9WQ+d2S2hoG9x55508++yzHDhwgLFjxzZb/frH94UXXkj79u0ZO3Ysffr0CfiHeo3t8969exMeHs7pp5/OeeedR1hYGJGRkQGt35K8hUTNqrmmAzamZ8+ebNq0CYA33ngjoJ8O1peRkcFLL73Enj17AOqmlP/lL39hzZo1LF++nPz8/ONK0b+rZ555hhEjRrBmzRrS09OPepMTKA2NNyIign379vH111/z9ttvA5CYmMgvfvELCgsLKSoqIicnJ+D9qZWQkEBycnJA22zKWDt16lR3rG/duhWAf/3Xf6V3796sXr2awsJCtmzZEtD+wTffhNEcszoaUv/cOv/883nooYdYvHgxDz74YLN9QtzQvvjd737H6tWref7555vleF+yZAmPPPIIy5cvZ8WKFVxwwQV8/fXXwDf7/OKLL2bFihUBrdnQWGuPsaqqKt5555269Y8VnjVH/Vpz587l9ttvZ82aNfTp0yeg2/1Y2/svf/kLAD/+8Y8ZP348GRkZ3H777Vx88cUBq9vQmM877zxeeeUVCgsL+fOf/8yFF17Iueeey5NPPslNN90U8NsL4P9ex7/66qsWOceC7Vj7/v3332+R2vW3+VlnnRXQ514cqSnHGgT+/A41iYmJZGVlUVhYyJo1a3jyySebpU5Tt3tL16x/XYmMjDzu24Abu5bXP66a83VzMM/thrbByJEjWbZsGatWrSI9Pb3Z6tc/vufOncu9997L4sWLWblyJR999FFA633XfX7kv09UzsD4Afjkk0+YO3cu06dPb9G6VVVVXHvttaxbt65F615xxRW8+OKLDBw4kA4dOgT8mymO1LlzZx5//HGuu+46wsLCCA8P57bbbjvqE6OuXbvy17/+tdlvIxkyZAjZ2dn8z//8T7PVOdZ4J0+ezGmnncaQIUPo0aMH3bt3B+Cmm27ipptuqntBcvvttx9132Egffrppxw8eDCgLz6aMtZLL72Uhx9+mIyMDM4880zOPPNMunTpwjXXXENqaioRERGce+65/Od//mfA+vfuu++Sn5/P8uXLA9bmt6l/bv3kJz+pm1lz2WWXNduL/ob2xfr16xkwYACXXHJJs3xq+MILL/Dyyy/X/Txo0CC2b99ed/vCL3/5y4BPd25orKeffjqZmZn07t2brl27BrTmd6k/b9484JsXfrfddhuJiYkBfzN9rO2dm5vL3XffzYMPPshXX31FZGQkQ4YMCWhg2dCYe/XqxejRozl8+DCtW7fmpZdeYtKkSezatYuvv/6amTNnBqwPtepfxysqKlrkHDuWyZMn88ADDzT77Y/Q8L5vCS3xf2etphxrPwQjR46koKCAQYMGERYWxtixY/npT38a8DrB2O7Hc12Ji4vj4MGDjBkzhgceeOB7BQrfdi0/0o033sjYsWN57rnn6Nq1K1OnTg3EJgAaPrdbYhZGQ9sgKiqKxMREwsPDadWq+d4K1z++U1NTyc/PJzw8nPj4eOLj4wNaryn7/EQXVnMyRroKCZs2bWLr1q0tNr1damljxozh4YcfpkePHi1e+4UXXuCdd95h2rRpfPjhh5x11lkt3gdJak4tcQuJvhGMW0ikYLn11lv5yU9+0qLPEVPgGGBI0vcUzADjyy+/5MorryQsLIzu3buflAm7JElSIN1yyy2Ul5ezePHiYHdF35MBhiRJkiRJCnk+xFOSJEmSJIU8AwxJkiRJkhTyDDAkSZIkSVLIM8CQJEmSJEkhzwBDkiRJkiSFvP8H9zZh4Get+0kAAAAASUVORK5CYII=
" />
</div>
</div>
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABCsAAADQCAYAAAAu0euYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAAJOgAACToB8GSSSgAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAgAElEQVR4nO3de3xU5Z3H8e/kJqDNrbAIGy7VpQ2iVC5BmMkwgZgQQqCAzVAXA5TFFott2RUJQipgpOXS0rK74OoCFRewZngpInSBBgwkJAjEgmEJclljQwAlJiRpFJKZnP2DzTSREFAzl+Dn/dfkmTPze57JkzlnvjnnGZNhGIYAAAAAAAD8RICvOwAAAAAAANAUYQUAAAAAAPArhBUAAAAAAMCvEFYAAAAAAAC/EuTrDtzIoEGDdO+99/q6GwAAAAAAwEPOnj2rwsLC69r9Nqy49957lZWV5etuAAAAAAAAD7Hb7S22cxkIAAAAAADwK4QVAAAAAADArxBWAAAAAAAAv0JYAQAAAAAA/AphBQAAAAAA8CuEFQAAAAAAwK8QVgAAAAAAAL8S5OsO3I56z9vhs9olS8f4rDYAAAAAAG2BMysAAAAAAIBfaTWsaGho0LRp02S1WhUbG6uTJ08qLy9PZrNZsbGxKioqkiRdvHhRiYmJslgs2rhxoyTJ5XJp+vTpslqtmj17tvs5V61aJYvFonHjxqm6utqDQwMAAAAAAO1Rq2HF0aNHdfXqVeXm5upXv/qVVq5cqQULFmjHjh3avHmz0tPTJUnLli3T3LlztW/fPq1evVpXrlzR9u3b1b17d+Xm5qq2tlYFBQUqLy/Xtm3blJeXp0mTJmn16tVeGSQAAAAAAGg/Wl2zIioqSoZhyDAMVVZW6s4771RgYKAiIiIUERGhiooKSdKhQ4f0m9/8RgEBARo8eLCOHz+u/Px8jRlzbf2EpKQkHThwQJcvX5bNZpPJZFJSUpKmTp3arJ7D4ZDD4ZAklZaWemK8tz3WywAAAAAAtHethhWdO3dWcHCwoqOjdeXKFeXm5upnP/vZ3x4cFKS6ujrV19crIODaSRphYWGqqKhQZWWlQkNDb9rWVGpqqlJTUyVJdru97UYJAAAAAADajVbDit27dysoKEjvv/++jhw5oqeeeqrZOhNOp1MhISEKDg5WQ0ODAgICVFVVpcjISIWHh7u3bdp25syZZm0AAAAAAABNtbpmhWEY+uY3vynp2lkWNTU1cjqdunz5skpLS91hQ0xMjHJycuR0OlVYWKh+/frJbDYrOztbkrRr1y5ZLBbFxMRo//79zdoAAAAAAACaavXMioSEBL388suy2Wy6evWqVq5cKafTqeTkZJlMJq1Zs0aSlJ6erilTpigjI0MzZ85Ux44dlZKSoq1bt8pqtWrAgAEaNmyYJGnMmDGyWCyKiIjQpk2bPD9CAAAAAADQrpgMwzB83YmW2O12ZWVl+bobX4ovF7n0JRbYBAAAAAB8ETf67N/qZSAAAAAAAADeRlgBAAAAAAD8CmEFAAAAAADwK4QVAAAAAADArxBWAAAAAAAAv0JYAQAAAAAA/AphBQAAAAAA8CuEFQAAAAAAwK8QVgAAAAAAAL9CWAEAAAAAAPwKYQUAAAAAAPArhBUAAAAAAMCvEFYAAAAAAAC/0mpYUVBQoLi4OMXFxenb3/62/vmf/1l5eXkym82KjY1VUVGRJOnixYtKTEyUxWLRxo0bJUkul0vTp0+X1WrV7Nmz3c+5atUqWSwWjRs3TtXV1R4cGgAAAAAAaI9aDSuGDRumnJwc5eTkyGw2a/z48VqwYIF27NihzZs3Kz09XZK0bNkyzZ07V/v27dPq1at15coVbd++Xd27d1dubq5qa2tVUFCg8vJybdu2TXl5eZo0aZJWr17tlUECAAAAAID245YuA6mrq9OhQ4c0ePBgBQYGKiIiQj179lRF
RYUk6dChQxo5cqSCgoI0ePBgHT9+XPn5+UpMTJQkJSUl6cCBAzp8+LBsNptMJpO7DQAAAAAAoKmgW9koOztb8fHxqqqqUmho6N8eHBSkuro61dfXKyDgWu4RFhamiooKVVZWurdtra0ph8Mhh8MhSSotLf3qowMAAAAAAO3OLZ1Z4XA4lJqaqvDw8GbrTDidToWEhCg4OFgNDQ2SpKqqKkVGRjbbtrW2plJTU5WVlaWsrCz16NGjTQYIAAAAAADal5uGFfX19Tp8+LBiY2PVqVMnOZ1OXb58WaWlpe6wISYmRjk5OXI6nSosLFS/fv1kNpuVnZ0tSdq1a5csFotiYmK0f//+Zm0AAAAAAABN3fQykOzsbI0cOdJ9mcfzzz+v5ORkmUwmrVmzRpKUnp6uKVOmKCMjQzNnzlTHjh2VkpKirVu3ymq1asCAARo2bJgkacyYMbJYLIqIiNCmTZs8ODQAAAAAANAemQzDMHzdiZbY7XZlZWX5uhtfSu95O3zdBZ8oWTrG110AAAAAALQjN/rsf0trVgAAAAAAAHgLYQUAAAAAAPArhBUAAAAAAMCvEFYAAAAAAAC/QlgBAAAAAAD8CmEFAAAAAADwK4QVAAAAAADArxBWAAAAAAAAv0JYAQAAAAAA/AphBQAAAAAA8CuEFQAAAAAAwK8QVgAAAAAAAL9CWAEAAAAAAPzKTcOKnJwcxcfHa8SIEXrjjTeUl5cns9ms2NhYFRUVSZIuXryoxMREWSwWbdy4UZLkcrk0ffp0Wa1WzZ492/18q1atksVi0bhx41RdXe2hYQEAAAAAgPaq1bDis88+029+8xv993//t95++21NmDBBCxYs0I4dO7R582alp6dLkpYtW6a5c+dq3759Wr16ta5cuaLt27ere/fuys3NVW1trQoKClReXq5t27YpLy9PkyZN0urVq70ySAAAAAAA0H60GlYUFBSoY8eOGjt2rCZMmKALFy4oMDBQERER6tmzpyoqKiRJhw4d0siRIxUUFKTBgwfr+PHjys/PV2JioiQpKSlJBw4c0OHDh2Wz2WQymdxtAAAAAAAATQW1dudHH32kM2fO6ODBg8rOztbChQsVGhr6twcHBamurk719fUKCLiWe4SFhamiokKVlZXubVtra8rhcMjhcEiSSktL226UAAAAAACg3Wj1zIrw8HBZLBaFhIQoPj5ef/7zn5utM+F0OhUSEqLg4GA1NDRIkqqqqhQZGanw8HD3tq21NZWamqqsrCxlZWWpR48ebTpQAAAAAADQPrQaVsTExKi4uFiGYejo0aO677775HQ6dfnyZZWWlrrDhpiYGOXk5MjpdKqwsFD9+vWT2WxWdna2JGnXrl2yWCyKiYnR/v37m7UBAAAAAAA01eplIJ07d9aECRPc60ysX79eZWVlSk5Olslk0po1ayRJ6enpmjJlijIyMjRz5kx17NhRKSkp2rp1q6xWqwYMGKBhw4ZJksaMGSOLxaKIiAht2rTJ8yMEAAAAAADtiskwDMPXnWiJ3W5XVlaWr7vxpfSet8PXXfCJkqVjfN0FAAAAAEA7cqPP/q1eBgIAAAAAAOBthBUAAAAAAMCvEFYAAAAAAAC/QlgBAAAAAAD8SqvfBgJ8ESwsCgAAAABoC5xZAQAAAAAA/AphBQAAAAAA8CuEFQAAAAAAwK8QVgAAAAAAAL9CWAEAAAAAAPwKYQUAAAAAAPArhBUAAAAAAMCvEFYAAAAAAAC/0mpYUVJSoi5duiguLk5xcXG6dOmSHA6HzGaz4uPjde7cOUnSyZMnNXz4cJnNZu3Zs0eSVFtbq4kTJyo2NlbLly93P2d6erqsVqvS0tJUX1/vwaEBAAAAAID26KZnVthsNuXk5CgnJ0cRERFauXKlcnJy9NxzzykzM1OSNH/+fK1bt047d+7Us88+K0lau3atkpOTlZeXp71796qsrEzHjh1TWVmZcnNzFR0drS1btnh2dAAAAAAAoN25aVhx4MABWa1W
zZ8/X6dPn1bfvn0VEhIii8Wi9957T5J0/vx59enTR6GhoYqMjFR5ebny8/OVmJgoSUpISFBBQUGztqSkJB04cMCDQwMAAAAAAO1RUGt3duvWTWfOnFGnTp30+OOP6/XXX1doaKj7fpfLJUlqaGhwt4WFhamiokKVlZXubZu2devWrVlbUw6HQw6HQ5JUWlraBsMDAAAAAADtTatnVtxxxx268847ZTKZNHHiRB07dkzV1dXu+wMDA689ScDfnqaqqkqRkZEKDw93b9taW1OpqanKyspSVlaWevTo0TYjBAAAAAAA7UqrYUVNTY37dm5ursaMGaPi4mLV1dUpPz9f/fv3l3TtDIyzZ8+qpqZGFRUV6ty5s8xms7KzsyVJ2dnZGjp0aLO2Xbt2yWKxeGpcAAAAAACgnWr1MpC8vDxlZGSoU6dO+ta3vqXMzEx16NBBcXFx6tChgzZs2CBJWrJkiaZNmyaXy6XFixdLkmbMmKHHHntM69evV0pKiqKiohQVFaWuXbvKarWqZ8+emjNnjudHCAAAAAAA2hWTYRiGrzvRErvdrqysLF9340vpPW+Hr7sALypZOsbXXQAAAACAdulGn/1v+m0gAAAAAAAA3kRYAQAAAAAA/AphBQAAAAAA8CuEFQAAAAAAwK8QVgAAAAAAAL9CWAEAAAAAAPwKYQUAAAAAAPArhBUAAAAAAMCvEFYAAAAAAAC/QlgBAAAAAAD8CmEFAAAAAADwK4QVAAAAAADArxBWAAAAAAAAv3JLYcWrr76qLl26SJIcDofMZrPi4+N17tw5SdLJkyc1fPhwmc1m7dmzR5JUW1uriRMnKjY2VsuXL3c/V3p6uqxWq9LS0lRfX9/W4wEAAAAAAO1c0M02cLlccjgc6tGjh5xOp1auXKl9+/bp8OHDyszM1Isvvqj58+dr3bp16tq1q0aPHq34+HitXbtWycnJmjFjhpKSkjR58mSVl5errKxMubm5WrJkibZs2aJHH33UG+MEPKb3vB0+q12ydIzPagMAAACAp9z0zIpXX31VqampCggI0OnTp9W3b1+FhITIYrHovffekySdP39effr0UWhoqCIjI1VeXq78/HwlJiZKkhISElRQUNCsLSkpSQcOHPDg0AAAAAAAQHvUaljhcrmUlZWlSZMmSZIqKysVGhra7H5JamhocLeFhYWpoqKi2battTXlcDhkt9tlt9tVWlraBsMDAAAAAADtTauXgWzcuFF2u10BAdcyjfDwcFVXV7vvDwwMlCT3/ZJUVVWlyMhI97bh4eGqqqpSr1695HQ63Y9v3K6p1NRUpaamSpLsdnsbDA8AAAAAALQ3rZ5ZceLECb3yyitKSkrS6dOn9W//9m8qLi5WXV2d8vPz1b9/f0lSt27ddPbsWdXU1KiiokKdO3eW2WxWdna2JCk7O1tDhw5t1rZr1y5ZLBYPDw8AAAAAALQ3rZ5ZsWzZMvftwYMH64UXXtBrr72muLg4dejQQRs2bJAkLVmyRNOmTZPL5dLixYslSTNmzNBjjz2m9evXKyUlRVFRUYqKilLXrl1ltVrVs2dPzZkzx4NDAwAAAAAA7ZHJMAzD151oid1uV1ZWlq+78aX48tsh8PXCt4EAAAAAaM9u9Nn/pt8GAgAAAAAA4E2EFQAAAAAAwK+0umYFAP/my0uOuAQFAAAAgKdwZgUAAAAAAPArhBUAAAAAAMCvEFYAAAAAAAC/QlgBAAAAAAD8CgtsAvhSWNwTAAAAgKdwZgUAAAAAAPArhBUAAAAAAMCvEFYAAAAAAAC/wpoVANod1ssAAAAAbm+EFQDwBRCUAAAAAJ7X6mUgH330kcxms2w2m0aOHKkLFy4oLy9PZrNZsbGxKioqkiRdvHhRiYmJslgs2rhxoyTJ5XJp+vTpslqtmj17tvs5V61aJYvFonHjxqm6utqDQwMAAAAAAO1Rq2FF586dlZeXp3379mnKlClat26dFixYoB07dmjz5s1KT0+XJC1btkxz
587Vvn37tHr1al25ckXbt29X9+7dlZubq9raWhUUFKi8vFzbtm1TXl6eJk2apNWrV3tlkAAAAAAAoP1oNawIDAxUQMC1TWpqanTvvfcqMDBQERER6tmzpyoqKiRJhw4d0siRIxUUFKTBgwfr+PHjys/PV2JioiQpKSlJBw4c0OHDh2Wz2WQymdxtAAAAAAAATd10zYqjR4/qxz/+sS5fvqzdu3frtdde+9uDg4JUV1en+vp6d6gRFhamiooKVVZWKjQ09KZtTTkcDjkcDklSaWlp24wQAAAAAAC0KzcNKx588EG98847ysrK0pIlS5qtM+F0OhUSEqLg4GA1NDQoICBAVVVVioyMVHh4uHvbpm1nzpxp1tZUamqqUlNTJUl2u73NBgkAAAAAANqPVi8Dqaurc98OCwvTXXfdJafTqcuXL6u0tNQdNsTExCgnJ0dOp1OFhYXq16+fzGazsrOzJUm7du2SxWJRTEyM9u/f36wNAAAAAACgqVbPrDh69KjmzJmjwMBAdejQQevXr9fp06eVnJwsk8mkNWvWSJLS09M1ZcoUZWRkaObMmerYsaNSUlK0detWWa1WDRgwQMOGDZMkjRkzRhaLRREREdq0aZPnRwgAtwm+NhUAAABfFybDMAxfd6IldrtdWVlZvu7Gl+LLDxQA4AmEFQAAAPCEG332v+maFQAAcFYHAAAAvKnVNSsAAAAAAAC8jbACAAAAAAD4FcIKAAAAAADgVwgrAAAAAACAX2GBTQCAX2NxTwAAgK8fzqwAAAAAAAB+hbACAAAAAAD4FcIKAAAAAADgVwgrAAAAAACAXyGsAAAAAAAAfoWwAgAAAAAA+BXCCgAAAAAA4FdaDSsOHTqkYcOGafjw4Xr00UdVX18vh8Mhs9ms+Ph4nTt3TpJ08uRJDR8+XGazWXv27JEk1dbWauLEiYqNjdXy5cvdz5meni6r1aq0tDTV19d7cGgAAAAAAKA9ajWs6NGjh/bu3av9+/erd+/eevPNN7Vy5Url5OToueeeU2ZmpiRp/vz5WrdunXbu3Klnn31WkrR27VolJycrLy9Pe/fuVVlZmY4dO6aysjLl5uYqOjpaW7Zs8fwIAQAAAABAuxLU2p3dunVz3w4JCdH777+vvn37KiQkRBaLRXPmzJEknT9/Xn369JEkRUZGqry8XPn5+VqxYoUkKSEhQQUFBbp06ZISExMlSUlJSfr973+vRx991CMDAwDgq+o9b4fPapcsHeOz2gAAAL7WaljR6MMPP9Tu3bu1dOlSXbp0yd3ucrkkSQ0NDe62sLAwVVRUqLKyUqGhode1NQYgjW1NORwOORwOSVJpaelXGBYAAO0bQQkAAPg6u2lYUV1drbS0NL388styuVyqrq523xcYGChJCgj429UkVVVVioyMVHh4uKqrqxUeHq6qqir16tVLTqfT/fjG7ZpKTU1VamqqJMlut3/10QEAgC+MoAQAAPhaq2tWOJ1O/eAHP9DChQv1ne98R3369FFxcbHq6uqUn5+v/v37S7p2ucjZs2dVU1OjiooKde7cWWazWdnZ2ZKk7OxsDR06tFnbrl27ZLFYPDw8AAAAAADQ3rR6ZsWrr76qd955R5mZmcrMzNQTTzyh2bNnKy4uTh06dNCGDRskSUuWLNG0adPkcrm0ePFiSdKMGTP02GOPaf369UpJSVFUVJSioqLUtWtXWa1W9ezZ073mBQAAAAAAQCOTYRiGrzvRErvdrqysLF9340vx5emzAAC0Z1wGAgDA18uNPvvf0gKbAAAA3sB6GQAAQCKsAAAAkERQAgCAP2l1gU0AAAAAAABv48wKAAAAH+OsDgAAmuPMCgAAAAAA4FcIKwAAAAAAgF8hrAAAAAAAAH6FsAIAAAAAAPgVFtgEAAD4GmNxTwCAPyKsAAAAgE/4MijxJUIaALg5LgMBAAAAAAB+hbACAAAAAAD4FS4DAQAAALyIdUIA4OZaDSuqqqqUkJCgEydO6ODBg7r//vvl
cDj029/+Vh07dtSGDRsUFRWlkydP6kc/+pGcTqcyMzMVHx+v2tpapaWl6eOPP9a4ceM0d+5cSVJ6erry8/PVu3dvrV+/XsHBwV4ZKAAAAPB1R1ACoL1oNazo1KmTduzYoaefflqS5HQ6tXLlSu3bt0+HDx9WZmamXnzxRc2fP1/r1q1T165dNXr0aMXHx2vt2rVKTk7WjBkzlJSUpMmTJ6u8vFxlZWXKzc3VkiVLtGXLFj366KNeGSgAAAAA3yEoAfBFtBpWBAcHq0uXLu6fT58+rb59+yokJEQWi0Vz5syRJJ0/f159+vSRJEVGRqq8vFz5+flasWKFJCkhIUEFBQW6dOmSEhMTJUlJSUn6/e9/T1gBAAAAwKMISoD25wutWVFZWanQ0FD3zy6XS5LU0NDgbgsLC1NFRUWzbZu2devWrVlbUw6HQw6HQ5JUWlr6JYYDAAAAAADauy8UVoSHh6u6utr9c2BgoCQpIOBvXypSVVWlyMhI97bh4eGqqqpSr1695HQ63Y9v3K6p1NRUpaamSpLsdvuXGxEAAAAA+AlfntUB7+NMmrbzhcKKPn36qLi4WHV1dTpy5Ij69+8vSerWrZvOnj2rv/u7v1NFRYU6d+4ss9ms7OxsTZ8+XdnZ2frP//xPlZeXa+XKlZoyZYp27doli8XikUEBAAAAAOBtXHLUdm4aViQnJ+vo0aN6//339eMf/1izZ89WXFycOnTooA0bNkiSlixZomnTpsnlcmnx4sWSpBkzZuixxx7T+vXrlZKSoqioKEVFRalr166yWq3q2bOne80LAAAAAACARjcNK/74xz9e1zZp0qRmP993333Kzc1t1nbXXXdp69at1z22cdFNAAAAAACAlgTcfBMAAAAAAADvIawAAAAAAAB+hbACAAAAAAD4FcIKAAAAAADgVwgrAAAAAACAXyGsAAAAAAAAfoWwAgAAAAAA+BXCCgAAAAAA4FcIKwAAAAAAgF8hrAAAAAAAAH6FsAIAAAAAAPgVwgoAAAAAAOBXCCsAAAAAAIBf8UlYkZ6eLqvVqrS0NNXX1/uiCwAAAAAAwE95Paw4duyYysrKlJubq+joaG3ZssXbXQAAAAAAAH7M62FFfn6+EhMTJUlJSUk6cOCAt7sAAAAAAAD8WJC3C1ZWVqpbt26SpLCwMFVUVLjvczgccjgckqQjR47Ibrd7u3ttYogPa5eWlqpHjx7Upja1qU1talOb2tSmNrWpTe2vUe1hw573We2v4uzZsy3fYXjZ6tWrjQ0bNhiGYRhHjhwxZs2a5e0u3NZSU1OpTW1qU5va1KY2talNbWpTm9rUbte8fhmI2WxWdna2JGnXrl2yWCze7gIAAAAAAPBjgYsWLVrkzYJ333238vPzlZmZqbq6Oj3zzDMKDAz0Zhdue/369aM2talNbWpTm9rUpja1qU1talO73TIZhmH4uhMAAAAAAACNvH4ZCAAAAAAAQGsIK24Tx48f17Rp03zdja+NnJwczZkzx9fdkCQNHjzY113Abcqf5rk3lJSUyGQy6dChQ5Kk7du3y8tXSsKLvm7zG75VUlKi3bt3e72uP8zz1o5Rp02bpuPHj3u1PxcvXtTChQu9WrOkpETf//73vVqz0datW/Xxxx83a3v++ee1b9++Nq9VUlKiLl26KC4uTnFxcXrmmWfavMYXre/t4+ScnBwtWrRIcXFxXq17K44cOaJ3333X1934QggrAAD4f/fdd5+WL1/u6274pYaGBl93Abe523mO+SqswPXuvvtuLV682Nfd8JrPhxWffvqpYmJiZLPZPFLPZrMpJydHOTk5+tWvfuWRGv5c3xuuXLmimTNnymazyWKxyOFw3PQxDQ0NWrZsmfr27euFHrYdwop2zOl0ym636+GHH9Zvf/tbSdLOnTtltVplNpv16quveqzeP/3TP2natGnN0srG2+Xl5Ro/frxGjhypyZMny+VytWk/mvroo480YsQIWa1Wff/73/dorZb8y7/8
i2w2m4YMGaKjR496vJ5hGPrpT3+qESNG6OGHH9a5c+c8XvNGdfv27aupU6fqwQcf1KZNmzxWOycnR0lJSZowYYK++93v6vjx4/rDH/6ghx56SEOHDtWuXbs8WrvxP1KN/xkaOHCgnnzyST300ENatmyZx2o3+vnPf678/HxJ0u7du7VgwQKP1/w8b73e0vVz7S9/+Yvi4uI0YsQIfe973/NobUnq27evnE6nTp065W7z5PhvdY796U9/ks1mU0xMjJYuXdpmtRMTEzV27FjFxMSoqKioxbHGxcVp7ty5GjVqVJvUbVp/1KhR7r/t1157TaNGjdKQIUP0ySef6Je//KVsNpuGDx+uoqKiNq3dVEtj/uEPfyir1aq4uDiVlJS0ec2DBw/qoYce0ogRI7Ro0SKP7rsbGYahWbNmyWq1asSIEcrNzVVsbKwsFov7gH7RokVKS0tTcnKybDabPvvss69Us6U51tJ+s+kc89R+/fOveUvHL570wgsv6LXXXlNcXJxWrlzp/n3v3bvX47Wllo9XPDnPvX2M+nmtvb8UFhZ6/CwHbx2f3myf+cEHH2jnzp364Q9/qLlz56qoqEijR4/WwoUL9eSTT3qkT01VVFR4dR/uD4YMGaInn3xSr7zyisdqZGZm6p577tG+ffu0e/durVixQidPnmz1MX/5y180f/58dezY0WP98ggffm0qviKHw2E888wzhmEYxgsvvGBMmTLFMJvNxtWrVw2n02mYzWbD6XR6rN7UqVONQYMGue9vvP3UU08Ze/bsMQzDMJYuXWo4HI4268PnXb161aivrzcMwzB+9rOfGbt37/ZYrabefvtt46mnnjJqa2sNwzCMd9991/jHf/xHj9d96623jF/84heGYRjGwYMHjVmzZjX7HXizbnh4uFFVVWVUVVUZQ4YM8Vjtt99+2xg5cqRhGIbxxz/+0Zg9e7bRv39/47PPPjOqqqo8Ov7G37NhGEZRUZExdepU41vf+pZRUlJiOJ1Oo1+/fh6r3aiwsNB44oknDMMwjClTphjFxcUer9no7bff9urrbRjXz7Xo6Gjj6aefNgzDMFwul0drf/DBB8Yjjzxi5ObmGjNmzDDeeustIyMjw6Pjv9U51vhe43K5jMGDBxuffvppm9S2WLHG5FQAAAlRSURBVCxGQ0ODceLECSMlJaXFsdpsNiM7O/sr12up/sMPP2wYhmG8+OKLxvjx4w3DMIzf/e53xqpVq4wpU6YYhmEYZWVlxrhx4zxSv6X5XVdXZwwbNsxoaGgwDMMz8y4jI8PYsWOHYRiGe3/tqX13ozfffNN48skn3T+npKQYJ06cMBoaGoyEhATjgw8+MBYuXGgsXrzYMAzDmDt3rvHmm29+pZqfn2Njx45tcb/ZdI55ar/e9DV3uVwtHr94UuPfenl5uTFq1CijoaHB+Otf/2rYbDav1P386+7peX6rx6hTp041ioqK2rS2YbT+/rJu3TrjkUceafOaTX1+Hr/00kseqXkr+8ymr3Ftba37dz5mzBjj1KlTbdqfDz74wOjcubNhs9kMm81m/O53v/PaPvxG9b3x9+0tP/rRjwzDMIzvfOc7xpUrV9zta9euNRYvXuw+jjEMw6ipqXG/vxw+fNiIi4szYmNjjRUrVni9319FkK/DEnx5Z86c0aBBgyRJMTEx2rZtm06dOqXExERJ0uXLl3Xp0iXdfffdHql38ODBZvcb///FMidOnNA777yj5557Tp999pnS0tLapH5LPvnkEz3xxBOqrKzU+fPnNXDgQI/VasmKFSuUnZ0tSQoK8vyf04kTJ/TGG29o//79MgxDPXr08HjNG9W95557FBoaKkkeP6PlwQcflCT16NFDly9fVs+ePdWhQwd16NBBwcHBcjqdHnn9TSaT+3bj/I6IiFCvXr0kSR06dGjzmp83cOBAnThxQlVVVSotLVV0dLTHazZVU1Pjtddbun6uDRo0SHfeeacmT56sAQMGeOXa69jYWD37
7LO6cOGCLl265NHx3+ocKyws1OLFi1VfX6+SkhJ9/PHH7m2+igEDBshkMqlv3746efKkoqOjrxurdO093xP69+8vSerevbv79t///d/r7Nmzys/Pd1/z66mvOG9pfptMJs2aNUtpaWn65je/qSVLluiuu+5q07qzZs3S888/r02bNmnUqFEe3Xc3Ki4ubnba98WLF92nAw8cOFBnz56VdG1OSNfebysrK79y3aZz7MKFCzfcbzbOMU/t15u+5pMnT252n+HFL8Y7e/as/ud//kcjRoyQJF26dMkrdT//ugcHB3t0nt/qMaon3ej95cMPP/RoXen6eRwZGemROl90n3nhwgVlZmbKMAyVlJTo/Pnz6tOnT5v2yWazacuWLZKunWGzZMkSr+7Dm9aXpP/6r//yeE1vaTwLrK6uTnfccYe7PSoqSocPH77h4+bNm6fXX39dERERGjt2rNLS0tS1a1eP97ctEFa0Y//wD/+gP//5z3rkkUd05MgRde7cWdHR0dq9e7dCQkJUX1+v4OBgj9WTrh1A1tTUSJL+93//V5IUHR2tCRMmyGq1SpLq6+vbrA+ft3nzZqWkpGjGjBn66U9/6tUDjk8++UQHDx5UXl6eCgsL9dRTT3m8ZnR0tOx2u37xi19IuvbaDhs27Lat26jpBzqn06kPP/xQV65cUV1dnerq6jz2wTkiIsJ9qc2xY8eu64u3pKSkaObMmT45hfIb3/iGDh065JXXW7p+rtXU1Ogb3/iGJCkxMVF2u109e/b0WP1Gs2fP1oIFCzR+/Hjl5+d7bPy3OseWL1+u//iP/9A999yjgQMHttl73dGjR2UYhk6dOqXo6Ogb/m0FBHjmqtGmY216++rVq7LZbFq7dq0kz+1HWprfJpNJdrtdkydP1i9/+Uu9/vrrmjJlSpvWDQsL07//+7+rrq5OgwYN8ui+u1Hfvn2VnZ3tPv29S5cuKi4uVnR0tN59913NnDlTubm5LQZoX0XTOXb33XfrT3/6U4v7zcY55qn9+udf806dOl13/OJJwcHBcrlcuueee9S/f39t375dJpPJo8dIjVo6XnG5XB6d594+Rm3Jjd5fvHGs+Pl53KtXL49cznYr+8zGuSdJq1at0tSpUzVixAglJyd7/LWor693L2bqzX347cgwDL3yyit6/PHHFRISoqtXr7oDi3Pnzql79+43nOfvvfeeJkyYIEmqrKxUaWkpYQU8b/z48frDH/6g+Ph4ffvb31ZAQIAyMjKUkJCggIAAdenSRVlZWR6rJ8l9/euQIUPUvXt3SdKCBQv0+OOPu9+cli9f7rHrQePj45WWlqa33nrL69dgRUREKDIyUnFxcRo6dKhXao4dO1Z79+7ViBEjZDKZrvvv0O1WtyWBgYGaN2+ehg8froCAAD3//PMeq/XAAw/o008/VUJCgu6//36P1bmZyZMnKyMjQ6tWrfJ6bW++3tL1c81msyk7O1sBAQGKiopSVFSUR+s37ce8efM8Pv5bnWOPPPKIJkyYoAceeMB9INoWwsLCNHbsWH300Udat26djh8/7rXfdWs6deqkPn36yGazKSAgQAkJCZo/f36b12np91tTU6Pvfe97MplMMplMHlmT58UXX9Trr78up9OpadOm6f777/fYvrvR2LFjtXPnTsXGxio4OFiLFi3SjBkzZBiGxowZo969e7d5Ten6OZaRkdHqftNT+/XPv+ZdunS57vjFkx544AE988wzeuKJJ/SDH/xANptNgYGBeuCBB/Sv//qvHq3d0vGKp+e5t49R/Y23jk9vZZ85evRozZ49Ww8//LBGjx6tn/zkJ+rbt6/HQuh9+/a5z4oLCAhQfX29V/fhTevfd999Hq/nLStWrNDIkSMlSRMnTtSqVas0d+5c1dbWau3atdqwYYPCw8NVVlYm6W//AJGk7373u9qyZYvCwsLkcrk89rv3BJPhzX9F47Zx/Phx/frXv9bLL7/s664At72LFy9q5syZ2rp1q6+7
gttITk6Otm/frl//+te+7gpuU8wxAPjqTpw4odmzZ2vnzp0KCAjQZ599pp///OcqLi7WqVOn9MILL2jixImSpJ/85CcqKiqSzWZTXl6ecnJyVFhYqLlz56qhoUF33HGH3njjjXaz0CZnVgCAHztw4ICefvppDvYBAAC+hoqLixUbG6uAgAB9+OGH6tWrl1566SVJ0ksvvaRt27a5w4o1a9Zc9/hBgwZpz549Xu1zW+HMCgAAAAAA/NBf//pXTZgwQSaTST179nSv5/R1QFgBAAAAAAD8SvtZXQMAAAAAAHwtEFYAAAAAAAC/QlgBAAAAAAD8CmEFAAAAAADwK/8Hcc0DBrnBrr8AAAAASUVORK5CYII=
" />
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">plot_hist</span><span class="p">(</span><span class="n">lens</span><span class="p">,</span> <span class="n">n_bins</span> <span class="o">=</span> <span class="mi">50</span><span class="p">):</span>
<span class="n">n</span><span class="p">,</span> <span class="n">bins</span><span class="p">,</span> <span class="n">patches</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">lens</span><span class="p">,</span> <span class="n">n_bins</span><span class="p">,</span> <span class="n">facecolor</span><span class="o">=</span><span class="s1">'blue'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Mean: </span><span class="si">{</span><span class="n">mean</span><span class="p">(</span><span class="n">lens</span><span class="p">)</span><span class="si">}</span><span class="s1">, Median: </span><span class="si">{</span><span class="n">median</span><span class="p">(</span><span class="n">lens</span><span class="p">)</span><span class="si">}</span><span class="s1">, Standard Deviation: </span><span class="si">{</span><span class="n">stdev</span><span class="p">(</span><span class="n">lens</span><span class="p">)</span><span class="si">}</span><span class="s1">, Max: </span><span class="si">{</span><span class="n">np</span><span class="o">.</span><span class="n">percentile</span><span class="p">(</span><span class="n">lens</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="n">plot_hist</span><span class="p">(</span><span class="n">lens</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Mean: 150.01203744984397, Median: 141.0, Standard Deviation: 44.59412209701778, Max: 513.0
</pre>
</div>
</div>
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAPGklEQVR4nO3db6jcV53H8ffHVFt33W36525pk7C3YkAqrFUuNaIP3Bbb2BXTB0UqsgYJ5EkXKgiu3YUt/nmgT6wKq2yxwShi7apLQxG62bSw7APb3tham2ZLr6vSJNVEk9YVoWzqdx/MSRnivbn3JnNnbu55v2CY3+/8zsw9v0PmMydnzvwmVYUkqQ+vmXQDJEnjY+hLUkcMfUnqiKEvSR0x9CWpIxdMugFncvnll9f09PSkmyFJ55X9+/f/uqqm5ju2qkN/enqa2dnZSTdDks4rSX6x0DGndySpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdMfQlqSOr+hu5vbvqqvnLjxwZbzskrR2G/iqwULhL0qg5vSNJHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRr7J5HvKSy5LOliN9SeqIoS9JHTH0Jakjhr4kdcTQl6SOLDn0k6xL8kSSB9v+1UkeTTKX5DtJXtfKL2z7c+349NBz3NnKn01y06hPRpJ0ZssZ6d8BHBza/zxwd1W9CTgB7GjlO4ATrfzuVo8k1wC3AW8BtgJfSbLu3JovSVqOJYV+ko3A3wBfa/sBrge+26rsBm5p29vaPu34Da3+NuC+qnq5qn4GzAHXjeIkJElLs9SR/heBTwB/aPuXAS9W1cm2fwjY0LY3AM8DtOMvtfqvls/zmFcl2ZlkNsnssWPHlnEqkqTFLBr6Sd4PHK2q/WNoD1V1T1XNVNXM1NTUOP6kJHVjKZdheBfwgSQ3AxcBfw58CVif5II2mt8IHG71DwObgENJLgAuBn4zVH7K8GMkSWOw6Ei/qu6sqo1VNc3gg9iHq+rDwCPAra3aduCBtr2n7dOOP1xV1cpva6t7rgY2A4+N7EwkSYs6lwuu/T1wX5LPAk8A97bye4FvJpkDjjN4o6CqDiS5H3gGOAncXlWvnMPflyQtUwaD8NVpZmamZmdnJ92MFbfQVTOXy6tsSgJIsr+qZuY75jdyJakjhr4kdcTQl6SOGPqS1BFDX5I64m/kriFnWgXkyh5J4Ehfkrpi6EtSRwx9SeqIoS9JHTH0Jakjrt7pxEIre1zVI/XFkb4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHvMrmGJ3pN2wlaRwc6UtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR1ZNPSTXJTksSQ/TnIgyada+dVJHk0yl+Q7SV7Xyi9s+3Pt+PTQc93Zyp9NctNKnZQkaX5LGem/DFxfVW8FrgW2JtkCfB64u6reBJwAdrT6O4ATrfzuVo8k1wC3AW8BtgJfSbJulCcjSTqzRUO/Bn7Xdl/bbgVcD3y3le8Gbmnb29o+7fgNSdLK76uql6vqZ8AccN1IzkKStCRLmtNPsi7Jk8BRYC/wU+DFqjrZqhwCNrTtDcDzAO34S8Blw+XzPGb4b+1MMptk9tixY8s/I0nSgpYU+lX1SlVdC2xkMDp/80o1qKruqaqZqpqZmppaqT8jSV1a1uqdqnoReAR4J7A+yakfYdkIHG7bh4FNAO34xcBvhsvneYwkaQyWsnpnKsn6tv164L3AQQbhf2urth14oG3vafu04w9XVbXy29rqnquBzcBjozoRSdLilvJziVcCu9tKm9cA91fVg0meAe5L8lngCeDe
Vv9e4JtJ5oDjDFbsUFUHktwPPAOcBG6vqldGezqSpDPJYBC+Os3MzNTs7OykmzEyq/E3co8cmXQLJI1akv1VNTPfMb+RK0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjpi6EtSRwx9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI0v5jVytYQv9hKM/oyitTY70Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXEyzCsgIUubSBJk+ZIX5I6YuhLUkcMfUnqiKEvSR0x9CWpI67e0bz8cRVpbVp0pJ9kU5JHkjyT5ECSO1r5pUn2Jnmu3V/SypPky0nmkjyV5O1Dz7W91X8uyfaVOy1J0nyWMr1zEvh4VV0DbAFuT3IN8ElgX1VtBva1fYD3AZvbbSfwVRi8SQB3Ae8ArgPuOvVGIUkaj0VDv6peqKofte3/BQ4CG4BtwO5WbTdwS9veBnyjBn4IrE9yJXATsLeqjlfVCWAvsHWkZyNJOqNlfZCbZBp4G/AocEVVvdAO/RK4om1vAJ4fetihVrZQ+el/Y2eS2SSzx44dW07zJEmLWHLoJ3kD8D3gY1X12+FjVVVAjaJBVXVPVc1U1czU1NQonlKS1Cwp9JO8lkHgf6uqvt+Kf9WmbWj3R1v5YWDT0MM3trKFyiVJY7KU1TsB7gUOVtUXhg7tAU6twNkOPDBU/pG2imcL8FKbBnoIuDHJJe0D3BtbmSRpTJayTv9dwN8CP0nyZCv7B+BzwP1JdgC/AD7Yjv0AuBmYA34PfBSgqo4n+QzweKv36ao6PpKzkCQtyaKhX1X/BWSBwzfMU7+A2xd4rl3AruU0UJI0Ol6GQZI6YuhLUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjhj6ktQRQ1+SOmLoS1JHDH1J6oihL0kdWcpVNqVXXXXV/OVHjoy3HZLOjiN9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I64jp9jYTr96XzgyN9SeqIoS9JHTH0Jakjhr4kdcTQl6SOGPqS1BFDX5I6YuhLUkcMfUnqiN/IPQcLfQtVklYrR/qS1BFDX5I6YuhLUkcMfUnqiKEvSR0x9CWpI4uGfpJdSY4meXqo7NIke5M81+4vaeVJ8uUkc0meSvL2ocdsb/WfS7J9ZU5HknQmSxnpfx3YelrZJ4F9VbUZ2Nf2Ad4HbG63ncBXYfAmAdwFvAO4Drjr1BuFJGl8Fg39qvpP4PhpxduA3W17N3DLUPk3auCHwPokVwI3AXur6nhVnQD28sdvJJKkFXa2c/pXVNULbfuXwBVtewPw/FC9Q61sofI/kmRnktkks8eOHTvL5kmS5nPOH+RWVQE1gracer57qmqmqmampqZG9bSSJM4+9H/Vpm1o90db+WFg01C9ja1soXJJ0hidbejvAU6twNkOPDBU/pG2imcL8FKbBnoIuDHJJe0D3Btbmda4q66a/yZpMha9ymaSbwPvAS5PcojBKpzPAfcn2QH8Avhgq/4D4GZgDvg98FGAqjqe5DPA463ep6vq9A+HJUkrbNHQr6oPLXDohnnqFnD7As+zC9i1rNZJkkbKb+RKUkcMfUnqiKEvSR0x9CWpI4a+JHXE0Jekjiy6ZFNaCQt9QevIkfG2Q+qNI31J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUEUNfkjriOv0l8Ec/xsf1+9LKcqQvSR0x9CWpI4a+JHXE0JekjvhBrs4LfsArjYYjfUnqiKEvSR0x9CWpI87p67zmXL+0PI70Jakjhr4kdcTQl6SOOKevNcm5fml+hr66cqYrpvqGoB44vSNJHXGkLzVOCakHjvQlqSOGviR1xOmdIf4soqS1ztCXFjGqwYCfDWg1MPSlMfGDYq0GzulLUkfGPtJPshX4ErAO+FpVfW7cbZBWE6ePNE5jDf0k64B/Bt4LHAIeT7Kn
qp4ZZzuktWi500d+O7lP4x7pXwfMVdX/ACS5D9gGrEjouxpHOrvXwfny2vHNafnGHfobgOeH9g8B7xiukGQnsLPt/i7Js2Nq20q4HPj1pBuxCtgPA/bDwMj6IRnFs0zESv9b+MuFDqy61TtVdQ9wz6TbMQpJZqtqZtLtmDT7YcB+GLAfJtsH4169cxjYNLS/sZVJksZg3KH/OLA5ydVJXgfcBuwZcxskqVtjnd6pqpNJ/g54iMGSzV1VdWCcbRizNTFNNQL2w4D9MGA/TLAPUlWT+tuSpDHzG7mS1BFDX5I6YuifgyS7khxN8vRQ2aVJ9iZ5rt1f0sqT5MtJ5pI8leTtk2v56CTZlOSRJM8kOZDkjlbeWz9clOSxJD9u/fCpVn51kkfb+X6nLWAgyYVtf64dn55k+0ctybokTyR5sO131w9Jfp7kJ0meTDLbyib+ujD0z83Xga2nlX0S2FdVm4F9bR/gfcDmdtsJfHVMbVxpJ4GPV9U1wBbg9iTX0F8/vAxcX1VvBa4FtibZAnweuLuq3gScAHa0+juAE6387lZvLbkDODi032s//HVVXTu0Jn/yr4uq8nYON2AaeHpo/1ngyrZ9JfBs2/4X4EPz1VtLN+ABBtdW6rYfgD8BfsTg2+a/Bi5o5e8EHmrbDwHvbNsXtHqZdNtHdP4bGQTa9cCDQDrth58Dl59WNvHXhSP90buiql5o278Ermjb812CYsM4G7bS2n/N3wY8Sof90KY0ngSOAnuBnwIvVtXJVmX4XF/th3b8JeCy8bZ4xXwR+ATwh7Z/GX32QwH/nmR/u7wMrILXxaq7DMNaUlWVpIs1sUneAHwP+FhV/TZDF0XppR+q6hXg2iTrgX8D3jzhJo1dkvcDR6tqf5L3TLo9E/buqjqc5C+AvUn+e/jgpF4XjvRH71dJrgRo90db+Zq9BEWS1zII/G9V1fdbcXf9cEpVvQg8wmAaY32SU4Or4XN9tR/a8YuB34y5qSvhXcAHkvwcuI/BFM+X6K8fqKrD7f4og0HAdayC14WhP3p7gO1tezuDOe5T5R9pn9JvAV4a+m/eeSuDIf29wMGq+sLQod76YaqN8EnyegafaxxkEP63tmqn98Op/rkVeLjaZO75rKrurKqNVTXN4DIrD1fVh+msH5L8aZI/O7UN3Ag8zWp4XUz6w47z+QZ8G3gB+D8Gc3A7GMxH7gOeA/4DuLTVDYMfkPkp8BNgZtLtH1EfvJvB3OVTwJPtdnOH/fBXwBOtH54G/qmVvxF4DJgD/hW4sJVf1Pbn2vE3TvocVqBP3gM82GM/tPP9cbsdAP6xlU/8deFlGCSpI07vSFJHDH1J6oihL0kdMfQlqSOGviR1xNCXpI4Y+pLUkf8H9haDbpr4hOQAAAAASUVORK5CYII=
" />
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let's get our data into a format that we can feed into our model using PyTorch's Dataset and DataLoader API. All these methods do is convert our dataframes, where each row holds multiple turns of historical dialog (the context) plus a response, into a single conversation string whose turns are separated by a special token that tells our model when a person has finished speaking.</p>
<p>These conversation strings are then tokenized using HuggingFace's awesome tokenizers into the numerical representation that our model actually understands!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">construct_conv</span><span class="p">(</span><span class="n">row</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">eos</span> <span class="o">=</span> <span class="kc">True</span><span class="p">):</span>
<span class="c1"># from: https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists</span>
<span class="n">flatten</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">l</span><span class="p">:</span> <span class="p">[</span><span class="n">item</span> <span class="k">for</span> <span class="n">sublist</span> <span class="ow">in</span> <span class="n">l</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">sublist</span><span class="p">]</span>
<span class="n">conv</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">reversed</span><span class="p">([</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="p">[</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">eos_token_id</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">row</span><span class="p">]))</span>
<span class="n">conv</span> <span class="o">=</span> <span class="n">flatten</span><span class="p">(</span><span class="n">conv</span><span class="p">)</span>
<span class="k">return</span> <span class="n">conv</span>
<span class="k">class</span> <span class="nc">ConversationDataset</span><span class="p">(</span><span class="n">Dataset</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">:</span> <span class="n">PreTrainedTokenizer</span><span class="p">,</span> <span class="n">args</span><span class="p">,</span> <span class="n">df</span><span class="p">,</span> <span class="n">block_size</span><span class="o">=</span><span class="mi">512</span><span class="p">):</span>
<span class="n">block_size</span> <span class="o">=</span> <span class="n">block_size</span> <span class="o">-</span> <span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">max_len</span> <span class="o">-</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">max_len_single_sentence</span><span class="p">)</span>
<span class="n">directory</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">cache_dir</span>
<span class="n">cached_features_file</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span>
<span class="n">directory</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">model_type</span> <span class="o">+</span> <span class="s2">"_cached_lm_"</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">block_size</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">if</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">cached_features_file</span><span class="p">)</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">args</span><span class="o">.</span><span class="n">overwrite_cache</span><span class="p">:</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Loading features from cached file </span><span class="si">%s</span><span class="s2">"</span><span class="p">,</span> <span class="n">cached_features_file</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">cached_features_file</span><span class="p">,</span> <span class="s2">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">handle</span><span class="p">:</span>
<span class="bp">self</span><span class="o">.</span><span class="n">examples</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">handle</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Creating features from dataset file at </span><span class="si">%s</span><span class="s2">"</span><span class="p">,</span> <span class="n">directory</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">examples</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">iterrows</span><span class="p">():</span>
<span class="n">conv</span> <span class="o">=</span> <span class="n">construct_conv</span><span class="p">(</span><span class="n">row</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">conv</span><span class="p">)</span> <span class="o">></span> <span class="n">block_size</span><span class="p">:</span> <span class="k">continue</span>
<span class="bp">self</span><span class="o">.</span><span class="n">examples</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">conv</span><span class="p">)</span>
            <span class="c1"># Note that we are losing the last truncated example here for the sake of simplicity (no padding)</span>
            <span class="c1"># If your dataset is small, first you should look for a bigger one :-) and second you</span>
<span class="c1"># can change this behavior by adding (model specific) padding.</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Saving features into cached file </span><span class="si">%s</span><span class="s2">"</span><span class="p">,</span> <span class="n">cached_features_file</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">cached_features_file</span><span class="p">,</span> <span class="s2">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">handle</span><span class="p">:</span>
<span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">examples</span><span class="p">,</span> <span class="n">handle</span><span class="p">,</span> <span class="n">protocol</span><span class="o">=</span><span class="n">pickle</span><span class="o">.</span><span class="n">HIGHEST_PROTOCOL</span><span class="p">)</span>
<span class="k">def</span> <span class="fm">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">examples</span><span class="p">)</span>
<span class="k">def</span> <span class="fm">__getitem__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">item</span><span class="p">):</span>
<span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">examples</span><span class="p">[</span><span class="n">item</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">long</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
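<p>To make the preprocessing concrete, here is a small sketch of what <code>construct_conv</code> produces, using a toy stand-in tokenizer (the real code uses a HuggingFace <code>PreTrainedTokenizer</code>, and <code>ToyTokenizer</code> and its character-level encoding are purely illustrative assumptions):</p>

```python
# Minimal sketch of construct_conv, with a toy stand-in tokenizer.
# The real notebook uses a HuggingFace PreTrainedTokenizer with subword ids.
class ToyTokenizer:
    eos_token_id = 0  # hypothetical id marking the end of a speaker's turn

    def encode(self, text):
        # Map each character to a fake token id; real tokenizers use subwords.
        return [ord(c) for c in text]

def construct_conv(row, tokenizer, eos=True):
    flatten = lambda l: [item for sublist in l for item in sublist]
    # Encode each turn, append the EOS token after it, then reverse so the
    # oldest context turn comes first and the response comes last.
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
    return flatten(conv)

tok = ToyTokenizer()
# A row is ordered response-first (context columns follow), hence the reversal.
ids = construct_conv(["hi", "yo"], tok)
print(ids)  # [121, 111, 0, 104, 105, 0] -> "yo"<eos>"hi"<eos>
```

<p>The key point is that every turn ends in the EOS token, which is how the model later learns where one speaker stops and the next begins.</p>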
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Training-and-Evaluating">Training and Evaluating<a class="anchor-link" href="#Training-and-Evaluating"> </a></h1><p>Now that we have THE DATA we can finally create our model and start training it! The training and evaluation loops are quite simple. We simply take a batch of examples from our dataloader and use it as both our inputs and labels. We do this because GPT2 is an auto-regressive model, meaning it uses some context to predict the next token. This prediction is then added to the original context and fed back in as the new context for generating the next token.</p>
<p>To evaluate our model, we use perplexity, a simple but powerful metric. Perplexity measures how unsure the model is in its choice of the next token: the more unsure the model, the higher its perplexity. One fascinating thing about perplexity is that it correlates very well with human judgments of how coherent and specific a conversation is, as shown in the amazing paper <a href="https://arxiv.org/abs/2001.09977">"Towards a Human-like Open-Domain Chatbot"</a> by Daniel Adiwardana, et al.</p>
</div>
</div>
</div>
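<p>Concretely, perplexity is just the exponential of the average per-token cross-entropy loss, so it falls straight out of the loss an evaluation loop already reports. A minimal sketch (not the exact evaluation code used here, just the relationship between loss and perplexity):</p>

```python
import math

def perplexity(avg_cross_entropy_loss):
    # Perplexity is the exponentiated average per-token cross-entropy (in nats).
    return math.exp(avg_cross_entropy_loss)

# A model that is perfectly certain (loss 0) has perplexity 1;
# a higher loss means the model is "more surprised" by each next token.
print(perplexity(0.0))            # 1.0
print(round(perplexity(3.0), 2))  # 20.09
```

<p>Intuitively, a perplexity of 20 means the model is, on average, about as unsure as if it were choosing uniformly among 20 equally likely next tokens.</p>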
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Training of model</span>
<span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">train_dataset</span><span class="p">,</span> <span class="n">model</span><span class="p">:</span> <span class="n">PreTrainedModel</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">:</span> <span class="n">PreTrainedTokenizer</span><span class="p">)</span> <span class="o">-></span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">int</span><span class="p">,</span> <span class="nb">float</span><span class="p">]:</span>
<span class="sd">""" Train the model """</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="ow">in</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]:</span>
<span class="n">tb_writer</span> <span class="o">=</span> <span class="n">SummaryWriter</span><span class="p">()</span>
<span class="n">args</span><span class="o">.</span><span class="n">train_batch_size</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">per_gpu_train_batch_size</span> <span class="o">*</span> <span class="nb">max</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">n_gpu</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">collate</span><span class="p">(</span><span class="n">examples</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">]):</span>
<span class="k">if</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">_pad_token</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">return</span> <span class="n">pad_sequence</span><span class="p">(</span><span class="n">examples</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pad_sequence</span><span class="p">(</span><span class="n">examples</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">padding_value</span><span class="o">=</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">pad_token_id</span><span class="p">)</span>
<span class="n">train_sampler</span> <span class="o">=</span> <span class="n">RandomSampler</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">)</span> <span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span> <span class="k">else</span> <span class="n">DistributedSampler</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">)</span>
<span class="n">train_dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span>
<span class="n">train_dataset</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">train_sampler</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">train_batch_size</span><span class="p">,</span> <span class="n">collate_fn</span><span class="o">=</span><span class="n">collate</span><span class="p">,</span> <span class="n">drop_last</span> <span class="o">=</span> <span class="kc">True</span>
<span class="p">)</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">max_steps</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="n">t_total</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">max_steps</span>
<span class="n">args</span><span class="o">.</span><span class="n">num_train_epochs</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">max_steps</span> <span class="o">//</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">train_dataloader</span><span class="p">)</span> <span class="o">//</span> <span class="n">args</span><span class="o">.</span><span class="n">gradient_accumulation_steps</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">t_total</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">train_dataloader</span><span class="p">)</span> <span class="o">//</span> <span class="n">args</span><span class="o">.</span><span class="n">gradient_accumulation_steps</span> <span class="o">*</span> <span class="n">args</span><span class="o">.</span><span class="n">num_train_epochs</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">module</span> <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="s2">"module"</span><span class="p">)</span> <span class="k">else</span> <span class="n">model</span> <span class="c1"># Take care of distributed/parallel training</span>
<span class="n">model</span><span class="o">.</span><span class="n">resize_token_embeddings</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">tokenizer</span><span class="p">))</span>
<span class="c1"># add_special_tokens_(model, tokenizer)</span>
<span class="c1"># Prepare optimizer and schedule (linear warmup and decay)</span>
<span class="n">no_decay</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"bias"</span><span class="p">,</span> <span class="s2">"LayerNorm.weight"</span><span class="p">]</span>
<span class="n">optimizer_grouped_parameters</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span>
<span class="s2">"params"</span><span class="p">:</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">model</span><span class="o">.</span><span class="n">named_parameters</span><span class="p">()</span> <span class="k">if</span> <span class="ow">not</span> <span class="nb">any</span><span class="p">(</span><span class="n">nd</span> <span class="ow">in</span> <span class="n">n</span> <span class="k">for</span> <span class="n">nd</span> <span class="ow">in</span> <span class="n">no_decay</span><span class="p">)],</span>
<span class="s2">"weight_decay"</span><span class="p">:</span> <span class="n">args</span><span class="o">.</span><span class="n">weight_decay</span><span class="p">,</span>
<span class="p">},</span>
<span class="p">{</span><span class="s2">"params"</span><span class="p">:</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">model</span><span class="o">.</span><span class="n">named_parameters</span><span class="p">()</span> <span class="k">if</span> <span class="nb">any</span><span class="p">(</span><span class="n">nd</span> <span class="ow">in</span> <span class="n">n</span> <span class="k">for</span> <span class="n">nd</span> <span class="ow">in</span> <span class="n">no_decay</span><span class="p">)],</span> <span class="s2">"weight_decay"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">},</span>
<span class="p">]</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">AdamW</span><span class="p">(</span><span class="n">optimizer_grouped_parameters</span><span class="p">,</span> <span class="n">lr</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">learning_rate</span><span class="p">,</span> <span class="n">eps</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">adam_epsilon</span><span class="p">)</span>
<span class="n">scheduler</span> <span class="o">=</span> <span class="n">get_linear_schedule_with_warmup</span><span class="p">(</span>
<span class="n">optimizer</span><span class="p">,</span> <span class="n">num_warmup_steps</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">warmup_steps</span><span class="p">,</span> <span class="n">num_training_steps</span><span class="o">=</span><span class="n">t_total</span>
<span class="p">)</span>
<span class="c1"># Check if saved optimizer or scheduler states exist</span>
<span class="k">if</span> <span class="p">(</span>
<span class="n">args</span><span class="o">.</span><span class="n">model_name_or_path</span>
<span class="ow">and</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">model_name_or_path</span><span class="p">,</span> <span class="s2">"optimizer.pt"</span><span class="p">))</span>
<span class="ow">and</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">model_name_or_path</span><span class="p">,</span> <span class="s2">"scheduler.pt"</span><span class="p">))</span>
<span class="p">):</span>
<span class="c1"># Load in optimizer and scheduler states</span>
<span class="n">optimizer</span><span class="o">.</span><span class="n">load_state_dict</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">model_name_or_path</span><span class="p">,</span> <span class="s2">"optimizer.pt"</span><span class="p">)))</span>
<span class="n">scheduler</span><span class="o">.</span><span class="n">load_state_dict</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">model_name_or_path</span><span class="p">,</span> <span class="s2">"scheduler.pt"</span><span class="p">)))</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">fp16</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="kn">from</span> <span class="nn">apex</span> <span class="kn">import</span> <span class="n">amp</span>
<span class="k">except</span> <span class="ne">ImportError</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ImportError</span><span class="p">(</span><span class="s2">"Please install apex from https://www.github.com/nvidia/apex to use fp16 training."</span><span class="p">)</span>
<span class="n">model</span><span class="p">,</span> <span class="n">optimizer</span> <span class="o">=</span> <span class="n">amp</span><span class="o">.</span><span class="n">initialize</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">,</span> <span class="n">opt_level</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">fp16_opt_level</span><span class="p">)</span>
<span class="c1"># multi-gpu training (should be after apex fp16 initialization)</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">n_gpu</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">DataParallel</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="c1"># Distributed training (should be after apex fp16 initialization)</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">parallel</span><span class="o">.</span><span class="n">DistributedDataParallel</span><span class="p">(</span>
<span class="n">model</span><span class="p">,</span> <span class="n">device_ids</span><span class="o">=</span><span class="p">[</span><span class="n">args</span><span class="o">.</span><span class="n">local_rank</span><span class="p">],</span> <span class="n">output_device</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">local_rank</span><span class="p">,</span> <span class="n">find_unused_parameters</span><span class="o">=</span><span class="kc">True</span>
<span class="p">)</span>
<span class="c1"># Train!</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"***** Running training *****"</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" Num examples = </span><span class="si">%d</span><span class="s2">"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">))</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" Num Epochs = </span><span class="si">%d</span><span class="s2">"</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">num_train_epochs</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" Instantaneous batch size per GPU = </span><span class="si">%d</span><span class="s2">"</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">per_gpu_train_batch_size</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span>
<span class="s2">" Total train batch size (w. parallel, distributed & accumulation) = </span><span class="si">%d</span><span class="s2">"</span><span class="p">,</span>
<span class="n">args</span><span class="o">.</span><span class="n">train_batch_size</span>
<span class="o">*</span> <span class="n">args</span><span class="o">.</span><span class="n">gradient_accumulation_steps</span>
<span class="o">*</span> <span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">distributed</span><span class="o">.</span><span class="n">get_world_size</span><span class="p">()</span> <span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span> <span class="k">else</span> <span class="mi">1</span><span class="p">),</span>
<span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" Gradient Accumulation steps = </span><span class="si">%d</span><span class="s2">"</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">gradient_accumulation_steps</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" Total optimization steps = </span><span class="si">%d</span><span class="s2">"</span><span class="p">,</span> <span class="n">t_total</span><span class="p">)</span>
<span class="n">global_step</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">epochs_trained</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">steps_trained_in_current_epoch</span> <span class="o">=</span> <span class="mi">0</span>
<span class="c1"># Check if continuing training from a checkpoint</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">model_name_or_path</span> <span class="ow">and</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">model_name_or_path</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="c1"># set global_step to gobal_step of last saved checkpoint from model path</span>
<span class="n">checkpoint_suffix</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">model_name_or_path</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"-"</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"/"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">global_step</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">checkpoint_suffix</span><span class="p">)</span>
<span class="n">epochs_trained</span> <span class="o">=</span> <span class="n">global_step</span> <span class="o">//</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">train_dataloader</span><span class="p">)</span> <span class="o">//</span> <span class="n">args</span><span class="o">.</span><span class="n">gradient_accumulation_steps</span><span class="p">)</span>
<span class="n">steps_trained_in_current_epoch</span> <span class="o">=</span> <span class="n">global_step</span> <span class="o">%</span> <span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">train_dataloader</span><span class="p">)</span> <span class="o">//</span> <span class="n">args</span><span class="o">.</span><span class="n">gradient_accumulation_steps</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" Continuing training from checkpoint, will skip to saved global_step"</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" Continuing training from epoch </span><span class="si">%d</span><span class="s2">"</span><span class="p">,</span> <span class="n">epochs_trained</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" Continuing training from global step </span><span class="si">%d</span><span class="s2">"</span><span class="p">,</span> <span class="n">global_step</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" Will skip the first </span><span class="si">%d</span><span class="s2"> steps in the first epoch"</span><span class="p">,</span> <span class="n">steps_trained_in_current_epoch</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">ValueError</span><span class="p">:</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" Starting fine-tuning."</span><span class="p">)</span>
<span class="n">tr_loss</span><span class="p">,</span> <span class="n">logging_loss</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span>
<span class="n">model</span><span class="o">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="n">train_iterator</span> <span class="o">=</span> <span class="n">trange</span><span class="p">(</span>
<span class="n">epochs_trained</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">num_train_epochs</span><span class="p">),</span> <span class="n">desc</span><span class="o">=</span><span class="s2">"Epoch"</span><span class="p">,</span> <span class="n">disable</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="ow">not</span> <span class="ow">in</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">set_seed</span><span class="p">(</span><span class="n">args</span><span class="p">)</span> <span class="c1"># Added here for reproducibility</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">train_iterator</span><span class="p">:</span>
<span class="n">epoch_iterator</span> <span class="o">=</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">train_dataloader</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s2">"Iteration"</span><span class="p">,</span> <span class="n">disable</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="ow">not</span> <span class="ow">in</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">])</span>
<span class="k">for</span> <span class="n">step</span><span class="p">,</span> <span class="n">batch</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">epoch_iterator</span><span class="p">):</span>
<span class="c1"># Skip past any already trained steps if resuming training</span>
<span class="k">if</span> <span class="n">steps_trained_in_current_epoch</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
<span class="n">steps_trained_in_current_epoch</span> <span class="o">-=</span> <span class="mi">1</span>
<span class="k">continue</span>
<span class="n">inputs</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span>
<span class="k">if</span> <span class="n">inputs</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">></span> <span class="mi">1024</span><span class="p">:</span> <span class="k">continue</span>
<span class="n">inputs</span> <span class="o">=</span> <span class="n">inputs</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">device</span><span class="p">)</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">labels</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">device</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">train</span><span class="p">()</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">labels</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="c1"># model outputs are always tuple in transformers (see doc)</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">n_gpu</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">loss</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> <span class="c1"># mean() to average on multi-gpu parallel training</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">gradient_accumulation_steps</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">loss</span> <span class="o">/</span> <span class="n">args</span><span class="o">.</span><span class="n">gradient_accumulation_steps</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">fp16</span><span class="p">:</span>
<span class="k">with</span> <span class="n">amp</span><span class="o">.</span><span class="n">scale_loss</span><span class="p">(</span><span class="n">loss</span><span class="p">,</span> <span class="n">optimizer</span><span class="p">)</span> <span class="k">as</span> <span class="n">scaled_loss</span><span class="p">:</span>
<span class="n">scaled_loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
<span class="n">tr_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="o">.</span><span class="n">item</span><span class="p">()</span>
<span class="k">if</span> <span class="p">(</span><span class="n">step</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">args</span><span class="o">.</span><span class="n">gradient_accumulation_steps</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">fp16</span><span class="p">:</span>
<span class="n">torch</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">clip_grad_norm_</span><span class="p">(</span><span class="n">amp</span><span class="o">.</span><span class="n">master_params</span><span class="p">(</span><span class="n">optimizer</span><span class="p">),</span> <span class="n">args</span><span class="o">.</span><span class="n">max_grad_norm</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">torch</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">clip_grad_norm_</span><span class="p">(</span><span class="n">model</span><span class="o">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">args</span><span class="o">.</span><span class="n">max_grad_norm</span><span class="p">)</span>
<span class="n">optimizer</span><span class="o">.</span><span class="n">step</span><span class="p">()</span>
<span class="n">scheduler</span><span class="o">.</span><span class="n">step</span><span class="p">()</span> <span class="c1"># Update learning rate schedule</span>
<span class="n">model</span><span class="o">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="n">global_step</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="ow">in</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="ow">and</span> <span class="n">args</span><span class="o">.</span><span class="n">logging_steps</span> <span class="o">></span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">global_step</span> <span class="o">%</span> <span class="n">args</span><span class="o">.</span><span class="n">logging_steps</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="c1"># Log metrics</span>
<span class="k">if</span> <span class="p">(</span>
<span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span> <span class="ow">and</span> <span class="n">args</span><span class="o">.</span><span class="n">evaluate_during_training</span>
<span class="p">):</span> <span class="c1"># Only evaluate when single GPU otherwise metrics may not average well</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">)</span>
<span class="k">for</span> <span class="n">key</span><span class="p">,</span> <span class="n">value</span> <span class="ow">in</span> <span class="n">results</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
<span class="n">tb_writer</span><span class="o">.</span><span class="n">add_scalar</span><span class="p">(</span><span class="s2">"eval_</span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">key</span><span class="p">),</span> <span class="n">value</span><span class="p">,</span> <span class="n">global_step</span><span class="p">)</span>
<span class="n">tb_writer</span><span class="o">.</span><span class="n">add_scalar</span><span class="p">(</span><span class="s2">"lr"</span><span class="p">,</span> <span class="n">scheduler</span><span class="o">.</span><span class="n">get_lr</span><span class="p">()[</span><span class="mi">0</span><span class="p">],</span> <span class="n">global_step</span><span class="p">)</span>
<span class="n">tb_writer</span><span class="o">.</span><span class="n">add_scalar</span><span class="p">(</span><span class="s2">"loss"</span><span class="p">,</span> <span class="p">(</span><span class="n">tr_loss</span> <span class="o">-</span> <span class="n">logging_loss</span><span class="p">)</span> <span class="o">/</span> <span class="n">args</span><span class="o">.</span><span class="n">logging_steps</span><span class="p">,</span> <span class="n">global_step</span><span class="p">)</span>
<span class="n">logging_loss</span> <span class="o">=</span> <span class="n">tr_loss</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="ow">in</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="ow">and</span> <span class="n">args</span><span class="o">.</span><span class="n">save_steps</span> <span class="o">></span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">global_step</span> <span class="o">%</span> <span class="n">args</span><span class="o">.</span><span class="n">save_steps</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">checkpoint_prefix</span> <span class="o">=</span> <span class="s2">"checkpoint"</span>
<span class="c1"># Save model checkpoint</span>
<span class="n">output_dir</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">output_dir</span><span class="p">,</span> <span class="s2">"</span><span class="si">{}</span><span class="s2">-</span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">checkpoint_prefix</span><span class="p">,</span> <span class="n">global_step</span><span class="p">))</span>
<span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">output_dir</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">model_to_save</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">model</span><span class="o">.</span><span class="n">module</span> <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="s2">"module"</span><span class="p">)</span> <span class="k">else</span> <span class="n">model</span>
<span class="p">)</span> <span class="c1"># Take care of distributed/parallel training</span>
<span class="n">model_to_save</span><span class="o">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="n">output_dir</span><span class="p">)</span>
<span class="n">tokenizer</span><span class="o">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="n">output_dir</span><span class="p">)</span>
<span class="n">torch</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">output_dir</span><span class="p">,</span> <span class="s2">"training_args.bin"</span><span class="p">))</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Saving model checkpoint to </span><span class="si">%s</span><span class="s2">"</span><span class="p">,</span> <span class="n">output_dir</span><span class="p">)</span>
<span class="n">_rotate_checkpoints</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">checkpoint_prefix</span><span class="p">)</span>
<span class="n">torch</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">optimizer</span><span class="o">.</span><span class="n">state_dict</span><span class="p">(),</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">output_dir</span><span class="p">,</span> <span class="s2">"optimizer.pt"</span><span class="p">))</span>
<span class="n">torch</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">scheduler</span><span class="o">.</span><span class="n">state_dict</span><span class="p">(),</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">output_dir</span><span class="p">,</span> <span class="s2">"scheduler.pt"</span><span class="p">))</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Saving optimizer and scheduler states to </span><span class="si">%s</span><span class="s2">"</span><span class="p">,</span> <span class="n">output_dir</span><span class="p">)</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">max_steps</span> <span class="o">></span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">global_step</span> <span class="o">></span> <span class="n">args</span><span class="o">.</span><span class="n">max_steps</span><span class="p">:</span>
<span class="n">epoch_iterator</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="k">break</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">max_steps</span> <span class="o">></span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">global_step</span> <span class="o">></span> <span class="n">args</span><span class="o">.</span><span class="n">max_steps</span><span class="p">:</span>
<span class="n">train_iterator</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="k">break</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="ow">in</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]:</span>
<span class="n">tb_writer</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="k">return</span> <span class="n">global_step</span><span class="p">,</span> <span class="n">tr_loss</span> <span class="o">/</span> <span class="n">global_step</span>
<span class="c1"># Evaluation of the fine-tuned model</span>
<span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">model</span><span class="p">:</span> <span class="n">PreTrainedModel</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">:</span> <span class="n">PreTrainedTokenizer</span><span class="p">,</span> <span class="n">df_trn</span><span class="p">,</span> <span class="n">df_val</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="s2">""</span><span class="p">)</span> <span class="o">-></span> <span class="n">Dict</span><span class="p">:</span>
<span class="c1"># Set up the evaluation output directory and dataset</span>
<span class="n">eval_output_dir</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">output_dir</span>
<span class="n">eval_dataset</span> <span class="o">=</span> <span class="n">load_and_cache_examples</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">df_trn</span><span class="p">,</span> <span class="n">df_val</span><span class="p">,</span> <span class="n">evaluate</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">eval_output_dir</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">args</span><span class="o">.</span><span class="n">eval_batch_size</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">per_gpu_eval_batch_size</span> <span class="o">*</span> <span class="nb">max</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">n_gpu</span><span class="p">)</span>
<span class="c1"># Pad each batch to the length of its longest example</span>
<span class="k">def</span> <span class="nf">collate</span><span class="p">(</span><span class="n">examples</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">torch</span><span class="o">.</span><span class="n">Tensor</span><span class="p">]):</span>
<span class="k">if</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">_pad_token</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">return</span> <span class="n">pad_sequence</span><span class="p">(</span><span class="n">examples</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pad_sequence</span><span class="p">(</span><span class="n">examples</span><span class="p">,</span> <span class="n">batch_first</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">padding_value</span><span class="o">=</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">pad_token_id</span><span class="p">)</span>
<span class="n">eval_sampler</span> <span class="o">=</span> <span class="n">SequentialSampler</span><span class="p">(</span><span class="n">eval_dataset</span><span class="p">)</span>
<span class="n">eval_dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span>
<span class="n">eval_dataset</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">eval_sampler</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">eval_batch_size</span><span class="p">,</span> <span class="n">collate_fn</span><span class="o">=</span><span class="n">collate</span><span class="p">,</span> <span class="n">drop_last</span><span class="o">=</span><span class="kc">True</span>
<span class="p">)</span>
<span class="c1"># multi-gpu evaluate</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">n_gpu</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">nn</span><span class="o">.</span><span class="n">DataParallel</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="c1"># Eval!</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"***** Running evaluation </span><span class="si">{}</span><span class="s2"> *****"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">prefix</span><span class="p">))</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" Num examples = </span><span class="si">%d</span><span class="s2">"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">eval_dataset</span><span class="p">))</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" Batch size = </span><span class="si">%d</span><span class="s2">"</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">eval_batch_size</span><span class="p">)</span>
<span class="n">eval_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="n">nb_eval_steps</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">model</span><span class="o">.</span><span class="n">eval</span><span class="p">()</span>
<span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">eval_dataloader</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s2">"Evaluating"</span><span class="p">):</span>
<span class="n">inputs</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="p">(</span><span class="n">batch</span><span class="p">,</span> <span class="n">batch</span><span class="p">)</span>
<span class="n">inputs</span> <span class="o">=</span> <span class="n">inputs</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">device</span><span class="p">)</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">labels</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">device</span><span class="p">)</span>
<span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">labels</span><span class="p">)</span>
<span class="n">lm_loss</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">eval_loss</span> <span class="o">+=</span> <span class="n">lm_loss</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">item</span><span class="p">()</span>
<span class="n">nb_eval_steps</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">eval_loss</span> <span class="o">=</span> <span class="n">eval_loss</span> <span class="o">/</span> <span class="n">nb_eval_steps</span>
<span class="n">perplexity</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">eval_loss</span><span class="p">))</span>
<span class="n">result</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"perplexity"</span><span class="p">:</span> <span class="n">perplexity</span><span class="p">}</span>
<span class="n">output_eval_file</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">eval_output_dir</span><span class="p">,</span> <span class="n">prefix</span><span class="p">,</span> <span class="s2">"eval_results.txt"</span><span class="p">)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_eval_file</span><span class="p">,</span> <span class="s2">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">writer</span><span class="p">:</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"***** Eval results </span><span class="si">{}</span><span class="s2"> *****"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">prefix</span><span class="p">))</span>
<span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">result</span><span class="o">.</span><span class="n">keys</span><span class="p">()):</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" </span><span class="si">%s</span><span class="s2"> = </span><span class="si">%s</span><span class="s2">"</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="n">key</span><span class="p">]))</span>
<span class="n">writer</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s2">"</span><span class="si">%s</span><span class="s2"> = </span><span class="si">%s</span><span class="se">\n</span><span class="s2">"</span> <span class="o">%</span> <span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">result</span><span class="p">[</span><span class="n">key</span><span class="p">])))</span>
<span class="k">return</span> <span class="n">result</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
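<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Quick aside: the perplexity that <code>evaluate</code> reports is nothing exotic — it's just the exponential of the mean per-batch language-modeling loss. A minimal standalone sketch (the helper name here is my own, not from the script above):</p>

```python
import math

def perplexity_from_losses(batch_losses):
    """Perplexity is exp(mean cross-entropy loss), exactly what
    evaluate() computes with torch.exp(torch.tensor(eval_loss))."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)
```

<p>So a mean loss of 0 gives a perplexity of 1 (a perfect model), and lower perplexity is better.</p>
</div>
</div>
</div>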
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now let's put it all together into our runner function and let our baby cook away!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># Main show runner</span>
<span class="k">def</span> <span class="nf">main</span><span class="p">(</span><span class="n">df_trn</span><span class="p">,</span> <span class="n">df_val</span><span class="p">):</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">Args</span><span class="p">()</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">should_continue</span><span class="p">:</span>
<span class="n">sorted_checkpoints</span> <span class="o">=</span> <span class="n">_sorted_checkpoints</span><span class="p">(</span><span class="n">args</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">sorted_checkpoints</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s2">"Used --should_continue but no checkpoint was found in --output_dir."</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">args</span><span class="o">.</span><span class="n">model_name_or_path</span> <span class="o">=</span> <span class="n">sorted_checkpoints</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="k">if</span> <span class="p">(</span>
<span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">output_dir</span><span class="p">)</span>
<span class="ow">and</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">output_dir</span><span class="p">)</span>
<span class="ow">and</span> <span class="n">args</span><span class="o">.</span><span class="n">do_train</span>
<span class="ow">and</span> <span class="ow">not</span> <span class="n">args</span><span class="o">.</span><span class="n">overwrite_output_dir</span>
<span class="ow">and</span> <span class="ow">not</span> <span class="n">args</span><span class="o">.</span><span class="n">should_continue</span>
<span class="p">):</span>
<span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span>
<span class="s2">"Output directory (</span><span class="si">{}</span><span class="s2">) already exists and is not empty. Use --overwrite_output_dir to overcome."</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
<span class="n">args</span><span class="o">.</span><span class="n">output_dir</span>
<span class="p">)</span>
<span class="p">)</span>
<span class="c1"># Setup CUDA, GPU & distributed training</span>
<span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">device</span><span class="p">(</span><span class="s2">"cuda"</span><span class="p">)</span>
<span class="n">args</span><span class="o">.</span><span class="n">n_gpu</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cuda</span><span class="o">.</span><span class="n">device_count</span><span class="p">()</span>
<span class="n">args</span><span class="o">.</span><span class="n">device</span> <span class="o">=</span> <span class="n">device</span>
<span class="c1"># Setup logging</span>
<span class="n">logging</span><span class="o">.</span><span class="n">basicConfig</span><span class="p">(</span>
<span class="nb">format</span><span class="o">=</span><span class="s2">"</span><span class="si">%(asctime)s</span><span class="s2"> - </span><span class="si">%(levelname)s</span><span class="s2"> - </span><span class="si">%(name)s</span><span class="s2"> - </span><span class="si">%(message)s</span><span class="s2">"</span><span class="p">,</span>
<span class="n">datefmt</span><span class="o">=</span><span class="s2">"%m/</span><span class="si">%d</span><span class="s2">/%Y %H:%M:%S"</span><span class="p">,</span>
<span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="o">.</span><span class="n">INFO</span> <span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="ow">in</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="k">else</span> <span class="n">logging</span><span class="o">.</span><span class="n">WARN</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">warning</span><span class="p">(</span>
<span class="s2">"Process rank: </span><span class="si">%s</span><span class="s2">, device: </span><span class="si">%s</span><span class="s2">, n_gpu: </span><span class="si">%s</span><span class="s2">, distributed training: </span><span class="si">%s</span><span class="s2">, 16-bits training: </span><span class="si">%s</span><span class="s2">"</span><span class="p">,</span>
<span class="n">args</span><span class="o">.</span><span class="n">local_rank</span><span class="p">,</span>
<span class="n">device</span><span class="p">,</span>
<span class="n">args</span><span class="o">.</span><span class="n">n_gpu</span><span class="p">,</span>
<span class="nb">bool</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">),</span>
<span class="n">args</span><span class="o">.</span><span class="n">fp16</span><span class="p">,</span>
<span class="p">)</span>
<span class="c1"># Set seed</span>
<span class="n">set_seed</span><span class="p">(</span><span class="n">args</span><span class="p">)</span>
<span class="n">config</span> <span class="o">=</span> <span class="n">AutoConfig</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">config_name</span><span class="p">,</span> <span class="n">cache_dir</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">cache_dir</span><span class="p">)</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">tokenizer_name</span><span class="p">,</span> <span class="n">cache_dir</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">cache_dir</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelWithLMHead</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span>
<span class="n">args</span><span class="o">.</span><span class="n">model_name_or_path</span><span class="p">,</span>
<span class="n">from_tf</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">config</span><span class="o">=</span><span class="n">config</span><span class="p">,</span>
<span class="n">cache_dir</span><span class="o">=</span><span class="n">args</span><span class="o">.</span><span class="n">cache_dir</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">device</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Training/evaluation parameters </span><span class="si">%s</span><span class="s2">"</span><span class="p">,</span> <span class="n">args</span><span class="p">)</span>
<span class="c1"># Training</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">do_train</span><span class="p">:</span>
<span class="n">train_dataset</span> <span class="o">=</span> <span class="n">load_and_cache_examples</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">df_trn</span><span class="p">,</span> <span class="n">df_val</span><span class="p">,</span> <span class="n">evaluate</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">global_step</span><span class="p">,</span> <span class="n">tr_loss</span> <span class="o">=</span> <span class="n">train</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">train_dataset</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">" global_step = </span><span class="si">%s</span><span class="s2">, average loss = </span><span class="si">%s</span><span class="s2">"</span><span class="p">,</span> <span class="n">global_step</span><span class="p">,</span> <span class="n">tr_loss</span><span class="p">)</span>
<span class="c1"># Saving best-practices: if you use save_pretrained for the model and tokenizer, you can reload them using from_pretrained()</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">do_train</span><span class="p">:</span>
<span class="c1"># Create output directory if needed</span>
<span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">output_dir</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Saving model checkpoint to </span><span class="si">%s</span><span class="s2">"</span><span class="p">,</span> <span class="n">args</span><span class="o">.</span><span class="n">output_dir</span><span class="p">)</span>
<span class="c1"># Save a trained model, configuration and tokenizer using `save_pretrained()`.</span>
<span class="c1"># They can then be reloaded using `from_pretrained()`</span>
<span class="n">model_to_save</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">model</span><span class="o">.</span><span class="n">module</span> <span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="s2">"module"</span><span class="p">)</span> <span class="k">else</span> <span class="n">model</span>
<span class="p">)</span> <span class="c1"># Take care of distributed/parallel training</span>
<span class="n">model_to_save</span><span class="o">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">output_dir</span><span class="p">)</span>
<span class="n">tokenizer</span><span class="o">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">output_dir</span><span class="p">)</span>
<span class="c1"># Good practice: save your training arguments together with the trained model</span>
<span class="n">torch</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">output_dir</span><span class="p">,</span> <span class="s2">"training_args.bin"</span><span class="p">))</span>
<span class="c1"># Load a trained model and vocabulary that you have fine-tuned</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelWithLMHead</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">output_dir</span><span class="p">)</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">output_dir</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">device</span><span class="p">)</span>
<span class="c1"># Evaluation</span>
<span class="n">results</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">do_eval</span> <span class="ow">and</span> <span class="n">args</span><span class="o">.</span><span class="n">local_rank</span> <span class="ow">in</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]:</span>
<span class="n">checkpoints</span> <span class="o">=</span> <span class="p">[</span><span class="n">args</span><span class="o">.</span><span class="n">output_dir</span><span class="p">]</span>
<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">eval_all_checkpoints</span><span class="p">:</span>
<span class="n">checkpoints</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span>
<span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">output_dir</span> <span class="o">+</span> <span class="s2">"/**/"</span> <span class="o">+</span> <span class="n">WEIGHTS_NAME</span><span class="p">,</span> <span class="n">recursive</span><span class="o">=</span><span class="kc">True</span><span class="p">))</span>
<span class="p">)</span>
<span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="s2">"transformers.modeling_utils"</span><span class="p">)</span><span class="o">.</span><span class="n">setLevel</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">WARN</span><span class="p">)</span> <span class="c1"># Reduce logging</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Evaluate the following checkpoints: </span><span class="si">%s</span><span class="s2">"</span><span class="p">,</span> <span class="n">checkpoints</span><span class="p">)</span>
<span class="k">for</span> <span class="n">checkpoint</span> <span class="ow">in</span> <span class="n">checkpoints</span><span class="p">:</span>
<span class="n">global_step</span> <span class="o">=</span> <span class="n">checkpoint</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"-"</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">checkpoints</span><span class="p">)</span> <span class="o">></span> <span class="mi">1</span> <span class="k">else</span> <span class="s2">""</span>
<span class="n">prefix</span> <span class="o">=</span> <span class="n">checkpoint</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">"/"</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="k">if</span> <span class="n">checkpoint</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s2">"checkpoint"</span><span class="p">)</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span> <span class="k">else</span> <span class="s2">""</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelWithLMHead</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">checkpoint</span><span class="p">)</span>
<span class="n">model</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">device</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">df_trn</span><span class="p">,</span> <span class="n">df_val</span><span class="p">,</span> <span class="n">prefix</span><span class="o">=</span><span class="n">prefix</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">((</span><span class="n">k</span> <span class="o">+</span> <span class="s2">"_</span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">global_step</span><span class="p">),</span> <span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">result</span><span class="o">.</span><span class="n">items</span><span class="p">())</span>
<span class="n">results</span><span class="o">.</span><span class="n">update</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
<span class="k">return</span> <span class="n">results</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">%</span><span class="k">load_ext</span> tensorboard
<span class="o">%</span><span class="k">tensorboard</span> --logdir runs
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div id="b8751277-b0dc-4b41-b8c2-c1e4c4da9141"></div>
<div class="output_subarea output_javascript ">
<script type="text/javascript">
var element = $('#b8751277-b0dc-4b41-b8c2-c1e4c4da9141');
(async () => {
const url = await google.colab.kernel.proxyPort(6006, {"cache": true});
const iframe = document.createElement('iframe');
iframe.src = url;
iframe.setAttribute('width', '100%');
iframe.setAttribute('height', '800');
iframe.setAttribute('frameborder', 0);
document.body.appendChild(iframe);
})();
</script>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Finally, we run our model! I found this can take anywhere from one to three hours, depending on the GPU Google gives you, to finish training a model that can hold a somewhat coherent conversation in Spanish. If you are using a different language, you'll have to play around with how long to cook your model for.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">main</span><span class="p">(</span><span class="n">trn_df</span><span class="p">,</span> <span class="n">val_df</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Chatting-with-our-Model">Chatting with our Model<a class="anchor-link" href="#Chatting-with-our-Model"> </a></h1><p>Now that we have our model trained, let's it out for a spin and have our first conversation with it!</p>
<p>In order to allow us to chitchat with our new bot we need to figure out when the model has finished its turn, i.e. when it has generated the [end_of_turn] token. When the model generates this token, we can switch back control of the conversation to the user so they can respond. Luckily, this is very easy to do with the Huggingface framework!</p>
<p>The below code is copied pretty much verbatim from the creators of the DialoGPT model, which you can find <a href="https://huggingface.co/microsoft/DialoGPT-small">here</a>.</p>
</div>
</div>
</div>
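<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The turn-taking logic itself is independent of any particular model: generate until the end-of-turn marker appears, then hand control back to the user. Here is a toy sketch of that control flow with a stand-in <code>fake_generate</code> in place of <code>model.generate</code> (all names in this sketch are hypothetical, not part of the tutorial code):</p>

```python
# Toy sketch of the chatbot turn-taking loop. The generator appends tokens
# until it emits the end-of-turn marker; everything after the user's own
# marker is the bot's reply, and control then returns to the user.
END_OF_TURN = "<|endofturn|>"

def fake_generate(history):
    # Stand-in for model.generate: always answers with one fixed token
    # followed by the end-of-turn marker.
    return history + ["respuesta", END_OF_TURN]

def chat_turn(history, user_text):
    # Append the user's turn, terminated by the end-of-turn marker.
    history = history + [user_text, END_OF_TURN]
    history = fake_generate(history)
    # The bot's reply is whatever came after the user's marker.
    reply = history[history.index(END_OF_TURN) + 1:]
    return history, [tok for tok in reply if tok != END_OF_TURN]

history, reply = chat_turn([], "hola")
print(reply)  # ['respuesta']
```

<p>The real loop below does the same thing with token ids: the model stops generating at <code>eos_token_id</code>, and the slice past the user's input recovers the bot's reply.</p>
</div>
</div>
</div>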
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s1">'microsoft/DialoGPT-small'</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelWithLMHead</span><span class="o">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="s1">'output'</span><span class="p">)</span>
<span class="c1"># Let's chat for 5 lines</span>
<span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">6</span><span class="p">):</span>
<span class="c1"># encode the new user input, add the eos_token and return a tensor in Pytorch</span>
<span class="n">new_user_input_ids</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="nb">input</span><span class="p">(</span><span class="s2">">> User:"</span><span class="p">)</span> <span class="o">+</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">eos_token</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="s1">'pt'</span><span class="p">)</span>
<span class="c1"># print(new_user_input_ids)</span>
<span class="c1"># append the new user input tokens to the chat history</span>
<span class="n">bot_input_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">([</span><span class="n">chat_history_ids</span><span class="p">,</span> <span class="n">new_user_input_ids</span><span class="p">],</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span> <span class="k">if</span> <span class="n">step</span> <span class="o">></span> <span class="mi">0</span> <span class="k">else</span> <span class="n">new_user_input_ids</span>
<span class="c1"># generated a response while limiting the total chat history to 1000 tokens, </span>
<span class="n">chat_history_ids</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">generate</span><span class="p">(</span>
<span class="n">bot_input_ids</span><span class="p">,</span> <span class="n">max_length</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span>
<span class="n">pad_token_id</span><span class="o">=</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">eos_token_id</span><span class="p">,</span>
<span class="n">top_p</span><span class="o">=</span><span class="mf">0.92</span><span class="p">,</span> <span class="n">top_k</span> <span class="o">=</span> <span class="mi">50</span>
<span class="p">)</span>
<span class="c1"># pretty print last ouput tokens from bot</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"DialoGPT: </span><span class="si">{}</span><span class="s2">"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">tokenizer</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">chat_history_ids</span><span class="p">[:,</span> <span class="n">bot_input_ids</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]:][</span><span class="mi">0</span><span class="p">],</span> <span class="n">skip_special_tokens</span><span class="o">=</span><span class="kc">True</span><span class="p">)))</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stderr output_text">
<pre>05/13/2020 00:27:10 - INFO - filelock - Lock 139706162979168 acquired on /root/.cache/torch/transformers/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5.lock
05/13/2020 00:27:10 - INFO - transformers.file_utils - https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/config.json not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpkhif9g52
05/13/2020 00:27:10 - INFO - transformers.file_utils - storing https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/config.json in cache at /root/.cache/torch/transformers/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
05/13/2020 00:27:10 - INFO - transformers.file_utils - creating metadata file for /root/.cache/torch/transformers/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
05/13/2020 00:27:10 - INFO - filelock - Lock 139706162979168 released on /root/.cache/torch/transformers/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5.lock
05/13/2020 00:27:10 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/config.json from cache at /root/.cache/torch/transformers/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
05/13/2020 00:27:10 - INFO - transformers.configuration_utils - Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"vocab_size": 50257
}
05/13/2020 00:27:10 - INFO - transformers.tokenization_utils - Model name 'microsoft/DialoGPT-small' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming 'microsoft/DialoGPT-small' is a path, a model identifier, or url to a directory containing tokenizer files.
</pre>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>
</pre>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stderr output_text">
<pre>05/13/2020 00:27:11 - INFO - filelock - Lock 139706164883072 acquired on /root/.cache/torch/transformers/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71.lock
05/13/2020 00:27:11 - INFO - transformers.file_utils - https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/vocab.json not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpaeb7ikva
05/13/2020 00:27:12 - INFO - transformers.file_utils - storing https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/vocab.json in cache at /root/.cache/torch/transformers/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
05/13/2020 00:27:12 - INFO - transformers.file_utils - creating metadata file for /root/.cache/torch/transformers/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
05/13/2020 00:27:12 - INFO - filelock - Lock 139706164883072 released on /root/.cache/torch/transformers/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71.lock
</pre>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>
</pre>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stderr output_text">
<pre>05/13/2020 00:27:12 - INFO - filelock - Lock 139706162979168 acquired on /root/.cache/torch/transformers/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock
05/13/2020 00:27:12 - INFO - transformers.file_utils - https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/merges.txt not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmp4k0b0lt0
05/13/2020 00:27:13 - INFO - transformers.file_utils - storing https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/merges.txt in cache at /root/.cache/torch/transformers/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
05/13/2020 00:27:13 - INFO - transformers.file_utils - creating metadata file for /root/.cache/torch/transformers/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
05/13/2020 00:27:13 - INFO - filelock - Lock 139706162979168 released on /root/.cache/torch/transformers/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock
</pre>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>
</pre>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stderr output_text">
<pre>05/13/2020 00:27:14 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/vocab.json from cache at /root/.cache/torch/transformers/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
05/13/2020 00:27:14 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/merges.txt from cache at /root/.cache/torch/transformers/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
05/13/2020 00:27:14 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/added_tokens.json from cache at None
05/13/2020 00:27:14 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/special_tokens_map.json from cache at None
05/13/2020 00:27:14 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/tokenizer_config.json from cache at None
05/13/2020 00:27:14 - INFO - transformers.configuration_utils - loading configuration file output/config.json
05/13/2020 00:27:14 - INFO - transformers.configuration_utils - Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"vocab_size": 50257
}
05/13/2020 00:27:14 - INFO - transformers.modeling_utils - loading weights file output/pytorch_model.bin
</pre>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>>> User:Hola.
DialoGPT: ¿Qué estás haciendo?
>> User:Estoy leyendo un libro nuevo.
DialoGPT: ¿Qué leyendo?
>> User:Se llama "The Witcher."
DialoGPT: ¿Qué quieres decir?
>> User:Es un libro sobre magia y monstruos.
DialoGPT: ¿Qué quieres decir?
>> User:¿Te gusta libros?
DialoGPT: ¿Qué te pasa?
>> User:Nada mucho.
DialoGPT: ¿Por qué no me lo dijiste?
</pre>
</div>
</div>
</div>
</div>
</div>
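<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>One practical note: <code>max_length=1000</code> caps the total sequence (history plus reply), so in long conversations the history itself needs trimming before each turn or generation will have no room left. A minimal sketch of one way to do that, shown with plain lists for clarity; <code>truncate_history</code> is a hypothetical helper, not part of the tutorial code, and with a PyTorch tensor of shape <code>(1, seq_len)</code> the equivalent slice would be <code>chat_history_ids[:, -max_tokens:]</code>:</p>

```python
def truncate_history(history_ids, max_tokens=1000):
    """Keep only the most recent max_tokens token ids.

    Hypothetical helper: sliced from the end so the newest turns survive.
    With a (1, seq_len) tensor this would be history_ids[:, -max_tokens:].
    """
    return history_ids[-max_tokens:]

ids = list(range(1500))            # fake history of 1500 token ids
trimmed = truncate_history(ids)
print(len(trimmed), trimmed[0])    # 1000 500
```

<p>Dropping the oldest tokens loses early context, but it keeps the input inside the model's 1024-token window (<code>n_ctx</code> in the config above) so each new turn still has room to generate.</p>
</div>
</div>
</div>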
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now, it ain't the best; however, training it for longer or using DialoGPT-medium instead of DialoGPT-small does improve results, at least in my experiments. I decided to only include DialoGPT-small in this tutorial due to the limited (but still AMAZING) resources of Google Colab. I went ahead and trained a bigger DialoGPT-medium model for longer and uploaded it to Huggingface for anyone to try out! Find the <a href="https://huggingface.co/ncoop57/DiGPTame-medium">model card here</a>.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Conclusion">Conclusion<a class="anchor-link" href="#Conclusion"> </a></h1><p>In this tutorial, you learned how to train an Open-Dialog chatbot in any language we want to practice with! This involved learning about the amazing <code>transformers</code> library by Huggingface that has seen a lot of popularity recently. You've also learned what an Open-Dialog chatbot is and some of the difficulties that come with training them such as constructing training examples and generating repetitive text.</p>
<p>This is just part one in what I am hoping will be a three part series! In the next part, we will take our model and integrate it into a web app using the awesome <a href="https://www.streamlit.io/">Streamlit</a> library. Finally, part three will then be generating an Android application for chatting with your new language companion!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="PS">PS<a class="anchor-link" href="#PS"> </a></h1><p>If you do train a new chatbot for a language of your interest, please share it! I'd love to hear about your progress with it and I'm sure others would be also interested in it as these models can be quite expensive to train.</p>
<p>If you want an ease way to share it, I suggest submitting your trained model to Huggingface's model zoo, where others can view and download your model to use as a starting point for their applications! Here is a simple way for taking the model trained in this tutorial and uploading it to Hugginface's website following the instructions on the Huggingface <a href="https://huggingface.co/transformers/model_sharing.html">website</a>:</p>
<p>First make sure you have a Huggingface account: <a href="https://huggingface.co/join">https://huggingface.co/join</a>. Next Run the following code snippets and that's it!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span> rm -rf output/checkpoint-*
<span class="o">!</span> mv output <name_of_model>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span> transformers-cli login
<span class="c1"># log in using the same credentials as on huggingface.co</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span> transformers-cli upload <name_of_model>
</pre></div>
</div>
</div>
</div>
</div>
</div>
<script type="application/vnd.jupyter.widget-state+json">
{"5a3aefdbfa4140aeaa8eacef6e05d527": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_42ab64db40464775abbd563d2d398130", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_ad54c5a7b2e04c4d8c0e74563bf68445", "IPY_MODEL_5b37bf4feaae4f0a8b259bd1e23df457"]}}, "42ab64db40464775abbd563d2d398130": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "ad54c5a7b2e04c4d8c0e74563bf68445": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_21147020f28c4941896a983e7d3dee58", "_dom_classes": [], "description": "Downloading: 100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 554, 
"_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 554, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_78bcb3bb3aa8439aaebbd68e0d5f2d8a"}}, "5b37bf4feaae4f0a8b259bd1e23df457": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_99dbd6be5282455abe545fec10ad8dfa", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 554/554 [00:02<00:00, 259B/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_5ffee55ffa394080a07609226086b803"}}, "21147020f28c4941896a983e7d3dee58": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "78bcb3bb3aa8439aaebbd68e0d5f2d8a": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": 
"LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "99dbd6be5282455abe545fec10ad8dfa": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "5ffee55ffa394080a07609226086b803": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "aded096bbf744fb0a1b05880b3522cae": 
{"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_71986f23b3c348f395365d0da817c90a", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_d8bd8ad1181344c49ccd46db1c7ae3ad", "IPY_MODEL_4f90af41548e4eda94999d1241b52414"]}}, "71986f23b3c348f395365d0da817c90a": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "d8bd8ad1181344c49ccd46db1c7ae3ad": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_a6cc8cfe61d84d0a9e4904869371b753", "_dom_classes": [], "description": "Downloading: 100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 1042301, "_view_module": 
"@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 1042301, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_10805dbd211940c2ba320e17c1df1555"}}, "4f90af41548e4eda94999d1241b52414": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_231b27853c0547f3bc6b5600ee2fde9b", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 1.04M/1.04M [00:01<00:00, 851kB/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_6588e145be214b74a3daba8c33d668b0"}}, "a6cc8cfe61d84d0a9e4904869371b753": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "10805dbd211940c2ba320e17c1df1555": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", 
"justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "231b27853c0547f3bc6b5600ee2fde9b": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "6588e145be214b74a3daba8c33d668b0": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "bd9a26c919cd43e68af659f08c840ed7": {"model_module": 
"@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_d776fe4027a24ff2b7a05a4bf93725f2", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_2ce62470d0b845bb877bd8e284af1783", "IPY_MODEL_4db46f7fa885490db935bb3ed0533e8f"]}}, "d776fe4027a24ff2b7a05a4bf93725f2": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "2ce62470d0b845bb877bd8e284af1783": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_31e22b9fd09a4aa99b496944d9ff5aa8", "_dom_classes": [], "description": "Downloading: 100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 456318, "_view_module": "@jupyter-widgets/controls", 
"_model_module_version": "1.5.0", "value": 456318, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_e6d3aff091e34a8fb23a61568c31cd6b"}}, "4db46f7fa885490db935bb3ed0533e8f": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_c0f3a2716bec4dca9827ff0fbafd43ea", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 456k/456k [00:01<00:00, 286kB/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_daf5041e797e40c3b02e8649e1bbba4f"}}, "31e22b9fd09a4aa99b496944d9ff5aa8": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "e6d3aff091e34a8fb23a61568c31cd6b": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, 
"grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "c0f3a2716bec4dca9827ff0fbafd43ea": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "daf5041e797e40c3b02e8649e1bbba4f": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "e45254bb4f794a88a929c6c857afbb83": {"model_module": "@jupyter-widgets/controls", 
"model_name": "HBoxModel", "state": {"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_2218bf1642b04d5ba0765f0f734ceb7c", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_8d47d84a9ad8463f879072489a607e86", "IPY_MODEL_ba2b3e9b59b3445ba84382e0ca5e2be8"]}}, "2218bf1642b04d5ba0765f0f734ceb7c": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "8d47d84a9ad8463f879072489a607e86": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_64594d21ea31486592fb7faf9f236f95", "_dom_classes": [], "description": "Downloading: 100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 554, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", 
"value": 554, "_view_count": null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_3e98ca9215b747068270fa7d64f3899f"}}, "ba2b3e9b59b3445ba84382e0ca5e2be8": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_0e022de8639340b7a26a6ce701961a57", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 554/554 [00:00<00:00, 639B/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_e2d2843ff50e4b448beca35526f523c3"}}, "64594d21ea31486592fb7faf9f236f95": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "3e98ca9215b747068270fa7d64f3899f": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, 
"align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "0e022de8639340b7a26a6ce701961a57": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "e2d2843ff50e4b448beca35526f523c3": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "3281f9ba3a2440c0aa28c42745a03df2": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": 
{"_view_name": "HBoxView", "_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_f8672f80090940d18519e4684c900703", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_f232c70932c64df5b7cfb1e4096a0129", "IPY_MODEL_3bdca6d24d80427e9b4ef8f30eb622df"]}}, "f8672f80090940d18519e4684c900703": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "f232c70932c64df5b7cfb1e4096a0129": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_4a4ab84ce1fe44e396c90dc87f877613", "_dom_classes": [], "description": "Downloading: 100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 1042301, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 1042301, "_view_count": 
null, "_view_module_version": "1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_9e66001b01e24173b3d062c7857bfbc9"}}, "3bdca6d24d80427e9b4ef8f30eb622df": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_7f46e9511a9a46a4b2184b486a3e73be", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 1.04M/1.04M [00:02<00:00, 356kB/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_e718f9dbf16e41508d4e139f898feaa0"}}, "4a4ab84ce1fe44e396c90dc87f877613": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "9e66001b01e24173b3d062c7857bfbc9": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, 
"visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "7f46e9511a9a46a4b2184b486a3e73be": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "e718f9dbf16e41508d4e139f898feaa0": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "05b0d970c6de4ccea7bf268e390974a4": {"model_module": "@jupyter-widgets/controls", "model_name": "HBoxModel", "state": {"_view_name": "HBoxView", 
"_dom_classes": [], "_model_name": "HBoxModel", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.5.0", "box_style": "", "layout": "IPY_MODEL_c4ab36adf4c24640aade28cd6cbccf2c", "_model_module": "@jupyter-widgets/controls", "children": ["IPY_MODEL_90009b2929e34639bef82bdfdcb99d79", "IPY_MODEL_e2652b36d9dc4cdda2a9b79ccf42eb7c"]}}, "c4ab36adf4c24640aade28cd6cbccf2c": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "90009b2929e34639bef82bdfdcb99d79": {"model_module": "@jupyter-widgets/controls", "model_name": "FloatProgressModel", "state": {"_view_name": "ProgressView", "style": "IPY_MODEL_b502bcff4fb5479fba3b4acd5c155614", "_dom_classes": [], "description": "Downloading: 100%", "_model_name": "FloatProgressModel", "bar_style": "success", "max": 456318, "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": 456318, "_view_count": null, "_view_module_version": 
"1.5.0", "orientation": "horizontal", "min": 0, "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_c958b278aa7b47e084b3f65d07a51d99"}}, "e2652b36d9dc4cdda2a9b79ccf42eb7c": {"model_module": "@jupyter-widgets/controls", "model_name": "HTMLModel", "state": {"_view_name": "HTMLView", "style": "IPY_MODEL_063970ce61cb4156a7925955b7e2518d", "_dom_classes": [], "description": "", "_model_name": "HTMLModel", "placeholder": "\u200b", "_view_module": "@jupyter-widgets/controls", "_model_module_version": "1.5.0", "value": " 456k/456k [00:01<00:00, 303kB/s]", "_view_count": null, "_view_module_version": "1.5.0", "description_tooltip": null, "_model_module": "@jupyter-widgets/controls", "layout": "IPY_MODEL_ec3bce1043cb47d19d99b9201f9a132c"}}, "b502bcff4fb5479fba3b4acd5c155614": {"model_module": "@jupyter-widgets/controls", "model_name": "ProgressStyleModel", "state": {"_view_name": "StyleView", "_model_name": "ProgressStyleModel", "description_width": "initial", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "bar_color": null, "_model_module": "@jupyter-widgets/controls"}}, "c958b278aa7b47e084b3f65d07a51d99": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, 
"height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}, "063970ce61cb4156a7925955b7e2518d": {"model_module": "@jupyter-widgets/controls", "model_name": "DescriptionStyleModel", "state": {"_view_name": "StyleView", "_model_name": "DescriptionStyleModel", "description_width": "", "_view_module": "@jupyter-widgets/base", "_model_module_version": "1.5.0", "_view_count": null, "_view_module_version": "1.2.0", "_model_module": "@jupyter-widgets/controls"}}, "ec3bce1043cb47d19d99b9201f9a132c": {"model_module": "@jupyter-widgets/base", "model_name": "LayoutModel", "state": {"_view_name": "LayoutView", "grid_template_rows": null, "right": null, "justify_content": null, "_view_module": "@jupyter-widgets/base", "overflow": null, "_model_module_version": "1.2.0", "_view_count": null, "flex_flow": null, "width": null, "min_width": null, "border": null, "align_items": null, "bottom": null, "_model_module": "@jupyter-widgets/base", "top": null, "grid_column": null, "overflow_y": null, "overflow_x": null, "grid_auto_flow": null, "grid_area": null, "grid_template_columns": null, "flex": null, "_model_name": "LayoutModel", "justify_items": null, "grid_row": null, "max_height": null, "align_content": null, "visibility": null, "align_self": null, "height": null, "min_height": null, "padding": null, "grid_auto_rows": null, "grid_gap": null, "max_width": null, "order": null, "_view_module_version": "1.2.0", "grid_template_areas": null, "object_position": null, "object_fit": null, "grid_auto_columns": null, "margin": null, "display": null, "left": null}}}
</script>What I Learned (WIL) Neuroscience Month [Part 1]2020-04-07T00:00:00-05:002020-04-07T00:00:00-05:00https://nathancooper.io/i-am-a-nerd/wil/neuroscience/2020/04/07/neuroscience-1<p>If you are like me, you may have thought there was quite a lot of mystery going on in your brain that the scientific community has yet to really figure out and understand. However, through my month-long study I found that neuroscience has some powerful computational models and experiments that are able to explain many of the processes going on in the brain, such as the visual system, how we store memories, and the process we call intuition. Of course, there are still unanswered questions, such as what consciousness is and how we are able to learn things with just a few examples, but I was shocked by how much is known.</p>
<p>I will be going through some of the amazing things that are happening in that brain of yours through this series. This first part is devoted to the “<strong>High Level</strong>” stuff, which involves the more complicated behavior such as how different parts of your brain contribute to your conscious experience, how you are able to solve seemingly difficult problems with ease, and how this intuition of yours probably contributes to many of the logical errors you make. Let’s get right into it!</p>
<h1 id="high-level-stuff">High Level Stuff</h1>
<h2 id="system-1-and-system-2">System 1 and System 2</h2>
<p>In the amazing book “<strong><a href="https://www.amazon.com/Thinking-Fast-Slow-Daniel-Kahneman/dp/0374533555">Thinking Fast and Slow</a></strong>” by Daniel Kahneman, a world-renowned psychologist and Nobel laureate in economics, the high level ways in which we “think” are examined. The book explores things like decisions, problem solving, emotional states, and why we make errors in our logic, both consciously and subconsciously. I highly recommend it; I personally have been enjoying the <a href="https://www.audible.com/pd/Thinking-Fast-and-Slow-Audiobook/B005TKKCWC">Audiobook</a> version!</p>
<p><img src="/i-am-a-nerd/images/neuroscience/systems.png" alt="Overview of the two different Systems that make up your thinking." /></p>
<p><em>Overview of the two different Systems that make up your thinking.</em></p>
<p>Daniel Kahneman introduced the idea of two systems of thought, which he named very creatively System 1 and System 2. System 1 is more associated with intuition or <em>fast thinking</em> that processes things without much mental strain, such as looking at a picture of a cat and understanding that what you are currently looking at is in fact a cat. System 1 is also automatic in the sense that you have no control over coming to the realization that a cat is in the picture. System 2, on the other hand, is involved in more deliberate and difficult processing that requires you to put in work to solve some task, such as calculating that 18 multiplied by 32 equals 576 (don’t worry, I’ll wait :)).</p>
<p><img src="/i-am-a-nerd/images/neuroscience/image_0.jpg" alt="A Cat" /></p>
<p><em>A Cat: Image by <strong><a href="https://pixabay.com/photos/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=984097">Free-Photos</a></strong> from <strong><a href="https://pixabay.com/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=984097">Pixabay</a></strong></em></p>
<p>Yoshua Bengio, who is a pioneer of artificial intelligence and a recent recipient of the Turing Award along with other amazing scientists in the field, put it quite well in describing how these two systems work together. System 1 is great at generating representations of things and associating them with high level objects such as cats, words, and concepts. These representations are then exploited by System 2 to avoid all of the nitty gritty details of what it means for a cat to be a cat. It instead uses these concepts to do interesting things such as finding relationships between multiple objects or performing complex calculations by planning out a series of operations to perform, i.e. generating and following an algorithm for multiplying two numbers.</p>
<p>Daniel Kahneman also describes many fun experiments showing how System 1 runs most of your day-to-day experience and how this leads to a lot of logical errors. One famous example is the bat-and-ball problem. Try to solve the following:</p>
<blockquote>
<p>The cost of a bat and ball comes to a total of 1 dollar and 10 cents. If the bat costs 1 dollar more than the ball, how much does the ball cost?</p>
</blockquote>
<p>If you guessed the ball costs 10 cents like I did, you were wrong, and you were wrong because your System 2 is very, very lazy. A simple calculation will show that if the ball cost 10 cents and the bat is 1 dollar more, the bat would be 1 dollar and 10 cents, bringing the total up to 1 dollar and 20 cents. The correct answer is that the ball costs 5 cents.</p>
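For the skeptics, the arithmetic behind the right answer can be sketched in a couple of lines (a quick illustrative check I wrote up, not something from the book):

```python
# The two constraints: bat + ball = 1.10 and bat = ball + 1.00
# Substituting: (ball + 1.00) + ball = 1.10  =>  2 * ball = 0.10  =>  ball = 0.05
ball = (1.10 - 1.00) / 2
bat = ball + 1.00
print(f"ball = ${ball:.2f}, bat = ${bat:.2f}")  # ball = $0.05, bat = $1.05

assert abs((bat + ball) - 1.10) < 1e-9  # the total checks out
assert abs((bat - ball) - 1.00) < 1e-9  # and so does the difference
```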
<p>The above example and many more show a common theme in how our minds work. System 1 is at fault here: it recognized some similarity in the problem and automatically offered up what seemed like a reasonable answer. However, it never even tried to check whether the answer was correct. This checking is done by System 2, but System 1 was so sure of itself that System 2 did not even bother to verify the answer, because verification takes work. System 2 is involved in slow and deliberate thinking such as performing calculations or having to consider multiple pieces of information when making a decision.</p>
<p>One interesting effect that occurs when people are actively using their System 2 to solve a task is that it loosens their inhibitions. A study described in “<strong>Thinking Fast and Slow</strong>” showed that people who were asked during an experiment whether they would like a sweet treat such as a slice of chocolate cake or a salad were more likely to choose the chocolate cake if they were also given the task of keeping 7 digits in their mind for a few minutes. It has also been shown that performing System 2 style tasks has an impact on the way people behave, such as increased selfishness and even increased use of sexist language. This is all due to System 2 style tasks requiring focus and attention on the task at hand, allowing System 1, which is known to make quick judgement calls without much thought, to take over any additional tasks you are not focusing on.</p>
<p>Our daily lives and thought processes revolve around when to use System 2 for tasks that System 1 has no way of generating a <em>good</em> answer for. You may ask why we do not use System 2 for more tasks, but as discussed previously, System 2 is <em>slow</em>. It would be quite dangerous if we relied on System 2 to figure out how to avoid crashing into a vehicle on the highway that unexpectedly started merging into your lane without a turn signal, or to duck to avoid being hit by a fastball that’s outside of the batter’s box. An interesting feature of System 2 style tasks, however, is that if they are seen often enough, they can be upgraded to System 1 style tasks by having System 1 learn to recognize the expected answer. This is a common occurrence when driving an <em>unfamiliar</em> path. The first few times you drive it, your System 2 is more attentive to make sure you get to where you are going without getting lost. Eventually, if you take the same path enough, it becomes <em>familiar</em> and your System 1 is able to take over, allowing you to reach your destination without much attention being paid to the path you are taking.</p>
<h2 id="brain-structure">Brain Structure</h2>
<p>Now, your brain does not actually have these System 1 and System 2 structures physically; they are just a great way of discussing the way in which your brain works. However, the real way your brain is structured is a lot messier, but still beautiful :). I found the explanation of how your brain is wired in “<strong><a href="https://grey.colorado.edu/mediawiki/sites/CompCogNeuro/images/0/0e/ccnbook_08_2016.pdf">Computational Cognitive Neuroscience</a></strong>” by Munakata et al. to be a great resource for learning, so I will be using it as my main source for the rest of this article. While it is a lot denser than “<strong>Thinking Fast and Slow,</strong>” it has great visualizations and goes quite in depth.</p>
<p><img src="/i-am-a-nerd/images/neuroscience/image_1.png" alt="Different lobes of the human brain." /></p>
<p><em>Image by <strong><a href="https://pixabay.com/users/ArtsyBee-462611/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=1007686">Oberholster Venita</a></strong> from <strong><a href="https://pixabay.com/?utm_source=link-attribution&utm_medium=referral&utm_campaign=image&utm_content=1007686">Pixabay</a></strong> (Modified to have labels)</em></p>
<p>Your brain can be organized into two main parts: the Neocortex, which is what most people think of when they imagine a brain, and the Cerebellum, which is not as well known but is believed to play a big part in how we think. The Neocortex is the coloured part in the above image, and it can be roughly broken down into 4 main <em>lobes</em>; each lobe is heavily dependent on the others, though, so don’t think of them as truly distinct sections. You can see the responsibilities of each lobe in the table below.</p>
<p><img src="/i-am-a-nerd/images/neuroscience/image_3.png" alt="Overall responsibilities of the different lobes of the brain." /></p>
<p><em>Overall responsibilities of the different lobes of the brain.</em></p>
<p>One of the most interesting parts of your brain’s structure is how representations of your senses, like sight and sound, are built up in a hierarchical fashion as your brain moves the information across the different lobes. Take your ability to easily understand what your eyes are currently seeing. This information is first processed hierarchically by the Occipital lobe at the very back of the brain, starting with identifying simple edges and then moving on to groups of edges that form basic shapes. This representation grows in complexity as the information is sent towards other parts of the brain like the Temporal lobe, where these representations are given semantic meaning in the form of words such as Cat or Human, or even something more specific such as Garfield the Cat. If we track the representation your brain is building of what you are seeing into the Parietal lobe, you will find your brain generating representations for relations between the different objects in the scene, such as the Cat hanging from a tree (hang in there buddy!). Lastly, the Frontal lobe takes all these high level representations to perform any number of high level decisions and motor control movements, such as petting the kitty :).</p>
<p>As you can see, even this very simple example of just processing what your eyes see involves multiple parts of your brain, even if a lot of the initial work is done by the Occipital lobe. Each lobe does its part in helping generate an understanding of the sensory input you are receiving. This is not limited to vision; the same applies to your other senses such as smell, touch, and hearing. It’s all connected…</p>
<p><img src="/i-am-a-nerd/images/neuroscience/image_2.png" alt="Pepe Silvia Meme from It’s Always Sunny in Philadelphia. It's all connected" /></p>
<p><em><a href="https://knowyourmeme.com/memes/pepe-silvia">Pepe Silvia</a> from It’s Always Sunny in Philadelphia</em></p>
<h1 id="conclusions">Conclusions</h1>
<p>Well, sadly that is all the time we are going to spend on the high level view of the brain. There are tons of additional things I could discuss, such as how the Cerebellum actually contains half of all the neurons in your entire brain, and how many of its functions are not well understood because it seems to have its hands in processing everything! I could also discuss the different ways parts of your brain represent input, such as clusterization, hashing, and composing different representations into new representations. However, we have already covered a lot of the super cool stuff! I hope you have enjoyed this first part of the Neuroscience series and have come away with a better, or at least more confused :), sense of how your brain works!</p>
<p>I will be working to get the next <em>two parts</em> (<strong>Mid Level</strong> and <strong>Low Level</strong>) out soon as there is not much else for me to do during this COVID-19 quarantine (don’t tell my Ph.D. advisor I said that). If you have any questions or have any comments about any interesting stuff you know about the brain please comment down below, I’d love to hear it!</p>If you are like me, you may have thought there was quite a lot of mystery going on in your brain that the scientific community has yet to really figure out and understand. However, through my month long study I found that neuroscience has some powerful computational models and experiments that are able to explain many of the processes going on in the brain such as the visual system, how we store memories, and the process we call intuition. Of course there are still unanswered questions such as what consciousness is and how are we able to learn things with just a few examples, but I was shocked how much is known.How to Create an Automatic Code Comment Generator using Deep Learning!2020-03-07T00:00:00-06:002020-03-07T00:00:00-06:00https://nathancooper.io/i-am-a-nerd/deep_learning/software_engineering/2020/03/07/How_to_Create_an_Automatic_Code_Comment_Generator_using_Deep_Learning<!--
#################################################
### THIS FILE WAS AUTOGENERATED! DO NOT EDIT! ###
#################################################
# file to edit: _notebooks/2020-03-07-How_to_Create_an_Automatic_Code_Comment_Generator_using_Deep_Learning.ipynb
-->
<div class="container" id="notebook-container">
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="About">About<a class="anchor-link" href="#About"> </a></h1>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>In this post, you will create a Deep Learning model that, given a piece of code, will automatically generate a comment describing (hopefully 🤞) what the piece of code does. This post focuses on Java code; however, the same approach should apply to other programming languages such as Python or JavaScript.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Collecting,-preparing-and-exploring-the-data">Collecting, preparing and exploring the data<a class="anchor-link" href="#Collecting,-preparing-and-exploring-the-data"> </a></h1><p>You will be using the <a href="https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/">CodeSearchNet Challenge</a> dataset from GitHub as it provides a large collection of clean code in multiple different languages. They have a really nice <a href="https://github.com/github/CodeSearchNet/blob/master/notebooks/ExploreData.ipynb">example</a> on how to download and read in the data in their repo that you'll use to get started.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="o">!</span> wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip
<span class="o">!</span> unzip java.zip
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>--2020-03-07 16:37:37-- https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.179.149
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.179.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1060569153 (1011M) [application/zip]
Saving to: ‘java.zip’
java.zip 100%[===================>] 1011M 88.0MB/s in 12s
2020-03-07 16:37:49 (84.9 MB/s) - ‘java.zip’ saved [1060569153/1060569153]
Archive: java.zip
creating: java/
creating: java/final/
creating: java/final/jsonl/
creating: java/final/jsonl/train/
inflating: java/final/jsonl/train/java_train_12.jsonl.gz
inflating: java/final/jsonl/train/java_train_9.jsonl.gz
inflating: java/final/jsonl/train/java_train_3.jsonl.gz
inflating: java/final/jsonl/train/java_train_5.jsonl.gz
inflating: java/final/jsonl/train/java_train_7.jsonl.gz
inflating: java/final/jsonl/train/java_train_1.jsonl.gz
inflating: java/final/jsonl/train/java_train_10.jsonl.gz
inflating: java/final/jsonl/train/java_train_14.jsonl.gz
inflating: java/final/jsonl/train/java_train_0.jsonl.gz
inflating: java/final/jsonl/train/java_train_6.jsonl.gz
inflating: java/final/jsonl/train/java_train_8.jsonl.gz
inflating: java/final/jsonl/train/java_train_15.jsonl.gz
inflating: java/final/jsonl/train/java_train_2.jsonl.gz
inflating: java/final/jsonl/train/java_train_4.jsonl.gz
inflating: java/final/jsonl/train/java_train_13.jsonl.gz
inflating: java/final/jsonl/train/java_train_11.jsonl.gz
creating: java/final/jsonl/test/
inflating: java/final/jsonl/test/java_test_0.jsonl.gz
creating: java/final/jsonl/valid/
inflating: java/final/jsonl/valid/java_valid_0.jsonl.gz
inflating: java_dedupe_definitions_v2.pkl
inflating: java_licenses.pkl
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The <code>jsonl_list_to_dataframe</code> method comes directly from the CodeSearchNet Challenge example code, and <code>get_dfs</code> is just a helper to read the data into the correct training, validation, and testing splits. Let's see what your data looks like :D!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">jsonl_list_to_dataframe</span><span class="p">(</span><span class="n">file_list</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
<span class="sd">"""Load a list of jsonl.gz files into a pandas DataFrame."""</span>
<span class="k">return</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">pd</span><span class="o">.</span><span class="n">read_json</span><span class="p">(</span><span class="n">f</span><span class="p">,</span>
<span class="n">orient</span><span class="o">=</span><span class="s1">'records'</span><span class="p">,</span>
<span class="n">compression</span><span class="o">=</span><span class="s1">'gzip'</span><span class="p">,</span>
<span class="n">lines</span><span class="o">=</span><span class="kc">True</span><span class="p">)[</span><span class="n">columns</span><span class="p">]</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">file_list</span><span class="p">],</span> <span class="n">sort</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_dfs</span><span class="p">(</span><span class="n">path</span><span class="p">):</span>
<span class="sd">"""Grabs the different data splits and converts them into dataframes"""</span>
<span class="n">dfs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">split</span> <span class="ow">in</span> <span class="p">[</span><span class="s2">"train"</span><span class="p">,</span> <span class="s2">"valid"</span><span class="p">,</span> <span class="s2">"test"</span><span class="p">]:</span>
<span class="n">files</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">((</span><span class="n">path</span><span class="o">/</span><span class="n">split</span><span class="p">)</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s2">"**/*.gz"</span><span class="p">))</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">jsonl_list_to_dataframe</span><span class="p">(</span><span class="n">files</span><span class="p">,</span> <span class="p">[</span><span class="s2">"code"</span><span class="p">,</span> <span class="s2">"docstring"</span><span class="p">])</span>
<span class="n">dfs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="k">return</span> <span class="n">dfs</span>
<span class="n">df_trn</span><span class="p">,</span> <span class="n">df_val</span><span class="p">,</span> <span class="n">df_tst</span> <span class="o">=</span> <span class="n">get_dfs</span><span class="p">(</span><span class="n">path</span><span class="o">/</span><span class="s2">"java/final/jsonl"</span><span class="p">)</span>
<span class="n">df_trn</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea output_execute_result">
<div>
<style scoped="">
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>code</th>
<th>docstring</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>protected final void bindIndexed(Configuration...</td>
<td>Bind indexed elements to the supplied collecti...</td>
</tr>
<tr>
<th>1</th>
<td>public void setServletRegistrationBeans(\n\t\t...</td>
<td>Set {@link ServletRegistrationBean}s that the ...</td>
</tr>
<tr>
<th>2</th>
<td>public void addServletRegistrationBeans(\n\t\t...</td>
<td>Add {@link ServletRegistrationBean}s for the f...</td>
</tr>
<tr>
<th>3</th>
<td>public void setServletNames(Collection<String>...</td>
<td>Set servlet names that the filter will be regi...</td>
</tr>
<tr>
<th>4</th>
<td>public void addServletNames(String... servletN...</td>
<td>Add servlet names for the filter.\n@param serv...</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>You are only going to use a small subset of the data in order to train your model in a reasonable time. If you want to adjust the amount of data used, just change the sample fraction.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">sample</span> <span class="o">=</span> <span class="mf">0.2</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span> <span class="o">=</span> <span class="n">sample</span><span class="p">)</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span> <span class="o">=</span> <span class="n">sample</span><span class="p">)</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span> <span class="o">=</span> <span class="n">sample</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Awesome! Now that you have the data, there are a few other preprocessing steps you need to perform. First, you will remove any non-English comments. Next, you will also remove the JavaDoc tags, i.e., any line with an @ symbol or text in curly braces, as that will significantly lessen the amount of learning your model has to do. This also works out well since the JavaDoc syntax can usually be autogenerated from the method's signature.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="c1"># From https://stackoverflow.com/a/27084708/5768407</span>
<span class="k">def</span> <span class="nf">isASCII</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">s</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span><span class="p">)</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s1">'ascii'</span><span class="p">)</span>
<span class="k">except</span> <span class="ne">UnicodeDecodeError</span><span class="p">:</span>
<span class="k">return</span> <span class="kc">False</span>
<span class="k">else</span><span class="p">:</span>
<span class="k">return</span> <span class="kc">True</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="p">[</span><span class="n">df_trn</span><span class="p">[</span><span class="s1">'docstring'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">isASCII</span><span class="p">(</span><span class="n">x</span><span class="p">))]</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="p">[</span><span class="n">df_val</span><span class="p">[</span><span class="s1">'docstring'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">isASCII</span><span class="p">(</span><span class="n">x</span><span class="p">))]</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="p">[</span><span class="n">df_tst</span><span class="p">[</span><span class="s1">'docstring'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">isASCII</span><span class="p">(</span><span class="n">x</span><span class="p">))]</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">filter_jdocs</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
<span class="n">methods</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">comments</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">progress_bar</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">iterrows</span><span class="p">())):</span>
<span class="n">comment</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s2">"docstring"</span><span class="p">]</span>
<span class="c1"># Remove {} text in comments from https://stackoverflow.com/questions/14596884/remove-text-between-and-in-python/14598135</span>
<span class="n">comment</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s2">"([\{\[]).*?([\)\}])"</span><span class="p">,</span> <span class="s1">''</span><span class="p">,</span> <span class="n">comment</span><span class="p">)</span>
<span class="n">cleaned</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">comment</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">):</span>
<span class="k">if</span> <span class="s2">"@"</span> <span class="ow">in</span> <span class="n">line</span><span class="p">:</span> <span class="k">break</span>
<span class="n">cleaned</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="n">comments</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">cleaned</span><span class="p">))</span>
<span class="n">methods</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s2">"code"</span><span class="p">])</span>
<span class="n">new_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">methods</span><span class="p">,</span> <span class="n">comments</span><span class="p">),</span> <span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"code"</span><span class="p">,</span> <span class="s2">"docstring"</span><span class="p">])</span>
<span class="k">return</span> <span class="n">new_df</span>
<span class="n">df_trn</span> <span class="o">=</span> <span class="n">filter_jdocs</span><span class="p">(</span><span class="n">df_trn</span><span class="p">);</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">filter_jdocs</span><span class="p">(</span><span class="n">df_val</span><span class="p">);</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">filter_jdocs</span><span class="p">(</span><span class="n">df_tst</span><span class="p">);</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now you are going to remove any empty or duplicate comments from your datasets.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="p">[</span><span class="o">~</span><span class="p">(</span><span class="n">df_trn</span><span class="p">[</span><span class="s1">'docstring'</span><span class="p">]</span> <span class="o">==</span> <span class="s1">''</span><span class="p">)]</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="p">[</span><span class="o">~</span><span class="p">(</span><span class="n">df_val</span><span class="p">[</span><span class="s1">'docstring'</span><span class="p">]</span> <span class="o">==</span> <span class="s1">''</span><span class="p">)]</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="p">[</span><span class="o">~</span><span class="p">(</span><span class="n">df_tst</span><span class="p">[</span><span class="s1">'docstring'</span><span class="p">]</span> <span class="o">==</span> <span class="s1">''</span><span class="p">)]</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">df_trn</span> <span class="o">=</span> <span class="n">df_trn</span><span class="p">[</span><span class="o">~</span><span class="n">df_trn</span><span class="p">[</span><span class="s1">'docstring'</span><span class="p">]</span><span class="o">.</span><span class="n">duplicated</span><span class="p">()]</span>
<span class="n">df_val</span> <span class="o">=</span> <span class="n">df_val</span><span class="p">[</span><span class="o">~</span><span class="n">df_val</span><span class="p">[</span><span class="s1">'docstring'</span><span class="p">]</span><span class="o">.</span><span class="n">duplicated</span><span class="p">()]</span>
<span class="n">df_tst</span> <span class="o">=</span> <span class="n">df_tst</span><span class="p">[</span><span class="o">~</span><span class="n">df_tst</span><span class="p">[</span><span class="s1">'docstring'</span><span class="p">]</span><span class="o">.</span><span class="n">duplicated</span><span class="p">()]</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Not bad, still leaves you with plenty of data to learn with!</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="nb">len</span><span class="p">(</span><span class="n">df_trn</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_val</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_tst</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_text output_subarea output_execute_result">
<pre>(73755, 2427, 4615)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Exploring-your-data!">Exploring your data!<a class="anchor-link" href="#Exploring-your-data!"> </a></h2><p>As a good machine learning practitioner, it is extremely important to be careful with your data. This includes checking for biases and duplicates, and also describing the data that you have. Not doing so is setting yourself up for disaster. I have personally experienced such a travesty when working on one of my own research projects, where I forgot to check for duplicates before splitting my data. Sadly for me and all my restless nights working on the project, the data was full of duplicates, so my test set was contaminated with data points from my training set, which led to inflated evaluation metrics :(.</p>
<p><strong>Always explore your data!</strong></p>
</div>
</div>
</div>
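<p>The contamination described above is cheap to check for before any modeling. Below is a minimal sketch, using made-up toy splits standing in for this post's <code>df_trn</code> and <code>df_tst</code> (the docstrings are hypothetical), that finds docstrings shared between train and test and drops the leaked test rows:</p>

```python
import pandas as pd

# Toy splits standing in for df_trn / df_tst (hypothetical docstrings).
df_trn = pd.DataFrame({"docstring": ["Adds two ints.", "Closes the file."]})
df_tst = pd.DataFrame({"docstring": ["Closes the file.", "Parses the config."]})

# Any docstring appearing in both splits is potential leakage: the model
# could memorize it during training and look artificially good at test time.
leaked = set(df_trn["docstring"]) & set(df_tst["docstring"])
print(f"{len(leaked)} leaked docstring(s): {leaked}")

# Drop test rows whose docstring was already seen in training.
df_tst_clean = df_tst[~df_tst["docstring"].isin(df_trn["docstring"])]
```

<p>Running a check like this on both the <code>code</code> and <code>docstring</code> columns before training is a few seconds of work that can save you weeks of misleading results.</p>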
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>You'll be doing some basic descriptive statistics for this Exploratory Data Analysis (EDA), which just means calculating some means, medians, and standard deviations for different views of your data. The first view you will be exploring is the tokens that make up your code and comments. To split your data into tokens, you will use something called Byte Pair Encoding (BPE), which has shown great results for tokenizing both natural language and code, as shown in Karampatsis and Sutton's paper <a href="https://arxiv.org/abs/1903.05734">"Maybe Deep Neural Networks are the Best Choice for Modeling Source Code."</a></p>
<p>Great resources for learning more about how Byte Pair Encoding works are this <a href="https://towardsdatascience.com/byte-pair-encoding-the-dark-horse-of-modern-nlp-eb36c7df4f10">blog post</a> by Akashdeep Singh Jaswal and this <a href="https://youtu.be/9oTHFx0Gg3Q">YouTube video</a> by Christopher Manning. Specifically, you will be using the awesome library by Google called <a href="https://github.com/google/sentencepiece">sentencepiece</a>.</p>
</div>
</div>
</div>
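<p>The sentencepiece library handles all of this for you, but the core merge rule behind BPE is easy to sketch. The toy example below (just an illustration of one merge step, not sentencepiece's actual implementation, which layers much more on top) counts adjacent symbol pairs in a tiny word-frequency vocabulary and merges the most frequent pair into a new token; repeating this builds up subword units like <code>lo</code> that can cover words never seen in training:</p>

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters with a frequency.
vocab = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
         ("l", "o", "g"): 3, ("n", "e", "w"): 4}
pair = most_frequent_pair(vocab)  # ("l", "o") appears 10 times, so it wins
vocab = merge_pair(vocab, pair)   # "l"+"o" becomes the single token "lo"
```

<p>Each merge adds one entry to the subword vocabulary, so the number of merges directly controls the vocabulary size.</p>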
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">df_to_txt_file</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">col</span><span class="p">):</span>
<span class="sd">"""Converts a dataframe and converts it into a text file that SentencePiece can use to train a BPE model"""</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">output</span><span class="o">/</span><span class="s1">'text.txt'</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="n">f</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="n">col</span><span class="p">])))</span>
<span class="k">return</span> <span class="n">output</span><span class="o">/</span><span class="s1">'text.txt'</span>
<span class="k">def</span> <span class="nf">gen_sp_model</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">tokenizer_name</span><span class="p">,</span> <span class="n">col</span><span class="p">):</span>
<span class="sd">"""Trains a SentencePiece BPE model from a pandas dataframe"""</span>
<span class="n">fname</span> <span class="o">=</span> <span class="n">df_to_txt_file</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">output</span><span class="p">,</span> <span class="n">col</span><span class="p">)</span>
<span class="n">sp</span><span class="o">.</span><span class="n">SentencePieceTrainer</span><span class="o">.</span><span class="n">train</span><span class="p">(</span><span class="sa">f</span><span class="s1">'--input=</span><span class="si">{</span><span class="n">fname</span><span class="si">}</span><span class="s1"> --model_prefix=</span><span class="si">{</span><span class="n">output</span> <span class="o">/</span> <span class="n">tokenizer_name</span><span class="si">}</span><span class="s1"> --hard_vocab_limit=false'</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To use Byte Pair Encoding, you first have to train the tokenizer on your data. There is no need to train it on all of your data, though, so you will use just a subset (10%) of the training set. Training the BPE model only on the training set avoids inadvertent data snooping: otherwise the model would be biased toward tokenizing words that are common in your validation or testing sets. It also helps demonstrate that you are indeed solving the out-of-vocabulary problem, since your testing set will almost certainly contain words that never appeared in your training set.</p>
</div>
</div>
</div>
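<p>To see why this solves the out-of-vocabulary problem, here is a toy sketch (not the actual SentencePiece algorithm, and with a hypothetical hand-picked vocabulary): once a subword vocabulary has been learned, an unseen word can still be segmented into known pieces instead of collapsing to a single unknown token.</p>

```python
def segment(word, vocab):
    """Greedy longest-match segmentation of `word` into subword pieces."""
    pieces = []
    i = 0
    while i < len(word):
        # Take the longest substring starting at i that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character (BPE vocabularies contain these).
            pieces.append(word[i])
            i += 1
    return pieces

# Hypothetical subword vocabulary for illustration only.
vocab = {"get", "set", "token", "izer", "count", "er", "s"}
print(segment("tokenizers", vocab))   # ['token', 'izer', 's']
print(segment("getcounter", vocab))   # ['get', 'count', 'er']
```

<p>Neither "tokenizers" nor "getcounter" is in the vocabulary, yet both are still representable as sequences of known pieces.</p>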
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">p_bpe</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="n">method_tokenizer</span> <span class="o">=</span> <span class="s2">"method_bpe"</span>
<span class="n">gen_sp_model</span><span class="p">(</span><span class="n">df_trn</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span> <span class="o">=</span> <span class="n">p_bpe</span><span class="p">),</span> <span class="n">path</span><span class="p">,</span> <span class="n">method_tokenizer</span><span class="p">,</span> <span class="n">col</span> <span class="o">=</span> <span class="s2">"code"</span><span class="p">)</span>
<span class="n">comment_tokenizer</span> <span class="o">=</span> <span class="s2">"comment_bpe"</span>
<span class="n">gen_sp_model</span><span class="p">(</span><span class="n">df_trn</span><span class="o">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span> <span class="o">=</span> <span class="n">p_bpe</span><span class="p">),</span> <span class="n">path</span><span class="p">,</span> <span class="n">comment_tokenizer</span><span class="p">,</span> <span class="n">col</span> <span class="o">=</span> <span class="s2">"docstring"</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now that you have the ability to tokenize your text, let's explore! First you will generate the frequency of each token and, while you are at it, collect how long your methods are via the common software metric Lines of Code (LOC).</p>
</div>
</div>
</div>
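<p>Stripped of the dataframe and the trained SentencePiece model, the computation boils down to the following minimal sketch (whitespace tokenization and two hypothetical method bodies stand in for the real tokenizer and data):</p>

```python
from collections import Counter

# Hypothetical method bodies standing in for the `code` column of the dataframe.
methods = [
    "def add(a, b):\n    return a + b",
    "def sub(a, b):\n    return a - b",
]

# Token frequencies (whitespace split stands in for SentencePiece here).
cnt = Counter(tok for m in methods for tok in m.split())

# Lines of Code (LOC) per method: count newline-separated lines.
locs = [len(m.split("\n")) for m in methods]

print(cnt["def"])  # 2
print(locs)        # [2, 2]
```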
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">get_counter_and_lens</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">spm</span><span class="p">,</span> <span class="n">col</span><span class="p">):</span>
<span class="n">toks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">locs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">progress_bar</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">iterrows</span><span class="p">())):</span>
<span class="n">toks</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">spm</span><span class="o">.</span><span class="n">EncodeAsPieces</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="n">col</span><span class="p">]))</span>
<span class="n">locs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="n">col</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)))</span>
<span class="n">cnt</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">()</span>
<span class="k">for</span> <span class="n">tok</span> <span class="ow">in</span> <span class="n">progress_bar</span><span class="p">(</span><span class="n">toks</span><span class="p">):</span>
<span class="n">cnt</span><span class="p">[</span><span class="n">tok</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">return</span> <span class="nb">list</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">len</span><span class="p">,</span> <span class="n">toks</span><span class="p">)),</span> <span class="n">cnt</span><span class="p">,</span> <span class="n">locs</span>
<span class="n">code_lens</span><span class="p">,</span> <span class="n">code_cnt</span><span class="p">,</span> <span class="n">locs</span> <span class="o">=</span> <span class="n">get_counter_and_lens</span><span class="p">(</span><span class="n">df_trn</span><span class="p">,</span> <span class="n">method_spm</span><span class="p">,</span> <span class="s1">'code'</span><span class="p">)</span>
<span class="n">comment_lens</span><span class="p">,</span> <span class="n">comment_cnt</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">get_counter_and_lens</span><span class="p">(</span><span class="n">df_trn</span><span class="p">,</span> <span class="n">comment_spm</span><span class="p">,</span> <span class="s1">'docstring'</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">plot_counts</span><span class="p">(</span><span class="n">counts</span><span class="p">,</span> <span class="n">top_k</span> <span class="o">=</span> <span class="mi">30</span><span class="p">):</span>
<span class="n">labels</span><span class="p">,</span> <span class="n">values</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">counts</span><span class="o">.</span><span class="n">most_common</span><span class="p">()[:</span><span class="n">top_k</span><span class="p">])</span>
<span class="n">indexes</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">labels</span><span class="p">))</span>
<span class="n">width</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">num</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">22</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">60</span><span class="p">,</span> <span class="n">facecolor</span><span class="o">=</span><span class="s1">'w'</span><span class="p">,</span> <span class="n">edgecolor</span><span class="o">=</span><span class="s1">'k'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">indexes</span><span class="p">,</span> <span class="n">values</span><span class="p">,</span> <span class="n">width</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">indexes</span> <span class="o">+</span> <span class="n">width</span> <span class="o">*</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="n">plot_counts</span><span class="p">(</span><span class="n">code_cnt</span><span class="p">,</span> <span class="n">top_k</span> <span class="o">=</span> <span class="mi">30</span><span class="p">)</span>
<span class="n">plot_counts</span><span class="p">(</span><span class="n">comment_cnt</span><span class="p">,</span> <span class="n">top_k</span> <span class="o">=</span> <span class="mi">30</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABDAAAADQCAYAAADxn5GHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAAJOgAACToB8GSSSgAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0
dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3dfXBU9b3H8U8SEgOXydONg3iTQKvR
UKdYBSzssmEJEGMAr6BZsAKm1taits110CjlghBokY4MjuLTADdwBdpdSmk1mkDANQ9LRVLAUkGB
FptEsdA8IRdMNtn7B7PbAEnkIbvnJLxfM5mBsye/7/dsds/ufvZ3zgnz+Xw+AQAAAAAAmFi40Q0A
AAAAAAB8HQIMAAAAAABgegQYAAAAAADA9AgwAAAAAACA6RFgAAAAAAAA0+tjdAOXYtiwYbrhhhuM
bgMAAAAAAATJkSNHVFVVdcHyHhVg3HDDDXI6nUa3AQAAAAAAgsThcHS4nENIAAAAAACA6RFgAAAA
AAAA0yPAAAAAAAAApkeAAQAAAAAATI8AAwAAAAAAmB4BBgAAAAAAMD0CDAAAAAAAYHp9jG7gajH4
6SLDah9dOtGw2gAAAAAAdAdmYAAAAAAAANMjwAAAAAAAAKZHgAEAAAAAAEyPAAMAAAAAAJje1wYY
brdb48aN09ixY/W73/1OFRUVslgsGj16tP785z9Lko4dO6bMzExZrVa98cYbkqTW1lY99NBDstls
ysvLC4z3wgsvyGq16u6771ZTU5MkdTgmAAAAAACAX5cBxunTp/X888/rnXfe0bvvvqspU6bo5z//
uYqKirRhwwbl5+dLkp577jk99dRTeu+997Ry5UqdOXNGb731lq6//nqVl5fr1KlT2rlzp06cOKE/
/OEPqqio0LRp07Ry5UpJ6nBMAAAAAAAAvy4DjJ07d6pv376aPHmypkyZos8//1wRERGKj49XSkqK
6urqJEm7du1SRkaG+vTpo+HDh2v//v3yeDzKzMyUJGVlZamyslIffPCBxowZo7CwsMCy06dPdzgm
AAAAAACAX5+ubvziiy90+PBh/fGPf1RpaakWLFigmJiYf/1ynz5qbm5WS0uLwsPPZiGxsbGqq6tT
fX19YN2LXdZ+zKioqG7fWAAAAAAA0DN1OQMjLi5OVqtVUVFRGjdunPbs2RM4b4Ukeb1eRUVFKTIy
Um1tbZKkxsZGJSQkKC4uLrDuxS5rP6afy+WSw+GQw+FQdXV19205AAAAAADoMboMMEaMGKEDBw7I
5/Np7969+ta3viWv16uGhgZVV1crISEhsJ7b7ZbX61VVVZVuueUWWSwWlZaWSpJKSkpktVo1YsQI
lZWVnbOsX79+HY7pl5OTI6fTKafTqeTk5GDcBwAAAAAAwOS6PIQkMTFRU6ZMCZy3Ys2aNaqtrVV2
drbCwsL08ssvS5Ly8/M1a9YszZs3Tz/+8Y/Vt29fTZo0SVu2bJHNZtNtt92mUaNGSZImTpwoq9Wq
+Ph4rV+/XpK0ePHiC8YEAAAAAADwC/P5fD6jm7hYDodDTqfT6DYuy+CniwyrfXTpRMNqAwAAAABw
KTr77N/lISQAAAAAAABmQIABAAAAAABMjwADAAAAAACYHgEGAAAAAAAwPQIMAAAAAABgegQYAAAA
AADA9AgwAAAAAACA6RFgAAAAAAAA0yPAAAAAAAAApkeAAQAAAAAATI8AAwAAAAAAmB4BBgAAAAAA
MD0CDAAAAAAAYHoEGAAAAAAAwPQIMAAAAAAAgOkRYAAAAAAAANMjwAAAAAAAAKZHgAEAAAAAAEyP
AAMAAAAAAJhelwHG0aNHde2118put8tut+v48eNyuVyyWCwaN26campqJEkHDx5Uenq6LBaLtm/f
Lkk6deqUpk6dqtGjR2vZsmWBMfPz82Wz2TRz5ky1tLRIUodjAgAAAAAA+H3tDIwxY8bI7XbL7XYr
Pj5ey5cvl9vt1qJFi1RQUCBJmjt3rlavXq3i4mLNnz9fkrRq1SplZ2eroqJCO3bsUG1trfbt26fa
2lqVl5crLS1NmzZtktfr7XBMAAAAAAAAv68NMCorK2Wz2TR37lwdOnRIQ4YMUVRUlKxWqz788ENJ
0meffabU1FTFxMQoISFBJ06ckMfjUWZmpiRpwoQJ2rlz5znLsrKyVFlZ2emYAAAAAAAAfl0GGAMH
DtThw4dVVlamf/zjH9q8ebNiYmICt7e2tkqS2traAstiY2NVV1en+vr6wLoXu6z9mAAAAAAAAH59
urrxmmuu0TXXXCNJmjp1qgoLC9W/f//A7REREZKk8PB/5SCNjY1KSEhQXFycmpqaFBcXp8bGRg0a
NEher1dNTU0drnf+mH4ul0sul0uSVF1dfSXbCgAAAAAAeqguZ2CcPHky8O/y8nJNnDhRBw4cUHNz
szwej4YOHSrp7EyNI0eO6OTJk6qrq1NiYqIsFotKS0slSaWlpRo5cuQ5y0pKSmS1WpWamtrhmH45
OTlyOp1yOp1KTk7u1o0HAAAAAAA9Q5czMCoqKjRv3jz169dP3/jGN1RQUKDo6GjZ7XZFR0dr7dq1
kqQlS5YoNzdXra2tWrhwoSTp4Ycf1owZM7RmzRpNmjRJSUlJSkpK0oABA2Sz2ZSSkqI5c+YoMjJS
eXl5F4wJAAAAAADgF+bz+XxGN3GxHA6HnE6n0W1clsFPFxlW++jSiYbVBgAAAADgUnT22f9rr0IC
AAAAAABgNAIMAAAAAABgegQYAAAAAADA9AgwAAAAAACA6RFgAAAAAAAA0yPAAAAAAAAApkeAAQAA
AAAATI8AAwAAAAAAmB4BBgAAAAAAMD0CDAAAAAAAYHoEGAAAAAAAwPQIMAAAAAAAgOkRYAAAAAAA
ANMjwAAAAAAAAKZHgAEAAAAAAEyPAAMAAAAAAJgeAQYAAAAAADA9AgwAAAAAAGB6fYxuAME3+Oki
w2ofXTrRsNoAAAAAgN6DGRgAAAAAAMD0LirA2Lhxo6699lpJksvlksVi0bhx41RTUyNJOnjwoNLT
02WxWLR9+3ZJ0qlTpzR16lSNHj1ay5YtC4yVn58vm82mmTNnqqWlpdMxAQAAAAAA/L42wGhtbZXL
5VJycrK8Xq+WL18ut9utRYsWqaCgQJI0d+5crV69WsXFxZo/f74kadWqVcrOzlZFRYV27Nih2tpa
7du3T7W1tSovL1daWpo2bdrU6ZgAAAAAAAB+XxtgbNy4UTk5OQoPD9ehQ4c0ZMgQRUVFyWq16sMP
P5QkffbZZ0pNTVVMTIwSEhJ04sQJeTweZWZmSpImTJignTt3nrMsKytLlZWVnY4JAAAAAADg12WA
0draKqfTqWnTpkmS6uvrFRMTc87tktTW1hZYFhsbq7q6unPWvdhl7cf0c7lccjgccjgcqq6uvpJt
BQAAAAAAPVSXVyF544035HA4FB5+NueIi4tTU1NT4PaIiAhJCtwuSY2NjUpISAisGxcXp8bGRg0a
NEherzfw++evd/6Yfjk5OcrJyZEkORyOK9lWAAAAAADQQ3U5A+Ojjz7SunXrlJWVpUOHDunFF1/U
gQMH1NzcLI/Ho6FDh0qSBg4cqCNHjujkyZOqq6tTYmKiLBaLSktLJUmlpaUaOXLkOctKSkpktVqV
mpra4ZgAAAAAAAB+Xc7AeO655wL/Hj58uF555RX95je/kd1uV3R0tNauXStJWrJkiXJzc9Xa2qqF
CxdKkh5++GHNmDFDa9as0aRJk5SUlKSkpCQNGDBANptNKSkpmjNnjiIjI5WXl3fBmAAAAAAAAH5h
Pp/PZ3QTF8vhcMjpdBrdxmUZ/HSR0S0Y4ujSiUa3AAAAAADoQTr77P+1VyEBAAAAAAAwGgEGAAAA
AAAwPQIMAAAAAABgegQYAAAAAADA9AgwAAAAAACA6RFgAAAAAAAA0yPAAAAAAAAApkeAAQAAAAAA
TK+P0Q2gdxv8dJFhtY8unWhYbQAAAABA92IGBgAAAAAAMD0CDAAAAAAAYHoEGAAAAAAAwPQIMAAA
AAAAgOkRYAAAAAAAANMjwAAAAAAAAKZHgAEAAAAAAEyPAAMAAAAAAJgeAQYAAAAAADA9AgwAAAAA
AGB6XQYYX3zxhSwWi8aMGaOMjAx9/vnnqqiokMVi0ejRo/XnP/9ZknTs2DFlZmbKarXqjTfekCS1
trbqoYceks1mU15eXmDMF154QVarVXfffbeampokqcMxAQAAAAAA/LoMMBITE1VRUaH33ntPs2bN
0urVq/Xzn/9cRUVF2rBhg/Lz8yVJzz33nJ566im99957Wrlypc6cOaO33npL119/vcrLy3Xq1Cnt
3LlTJ06c0B/+8AdVVFRo2rRpWrlypSR1OCYAAAAAAIBflwFGRESEwsPPrnLy5EndcMMNioiIUHx8
vFJSUlRXVydJ2rVrlzIyMtSnTx8NHz5c+/fvl8fjUWZmpiQpKytLlZWV+uCDDzRmzBiFhYUFlp0+
fbrDMQEAAAAAAPz6fN0Ke/fu1SOPPKKGhgZt3bpVv/nNb/71y336qLm5WS0tLYGgIzY2VnV1daqv
r1dMTMwlLWs/ZlRUlCTJ5XLJ5XJJkqqrq7tpswEAAAAAQE/ytQHGd77zHb3//vtyOp1asmRJ4LwV
kuT1ehUVFaXIyEi1tbUpPDxcjY2NSkhIUFxcXGDd9ssOHz58wbKOxvTLyclRTk6OJMnhcHTPVgMA
AAAAgB6ly0NImpubA/+OjY1V//795fV61dDQoOrqaiUkJEiSRowYIbfbLa/Xq6qqKt1yyy2yWCwq
LS2VJJWUlMhqtWrEiBEqKys7Z1m/fv06HBMAAAAAAMCvyxkYe/fu1Zw5cxQREaHo6GitWbNGhw4d
UnZ2tsLCwvTyyy9LkvLz8zVr1izNmzdPP/7xj9W3b19NmjRJW7Zskc1m02233aZRo0ZJkiZOnCir
1ar4+HitX79ekrR48eILxgQAAAAAAPAL8/l8PqObuFgOh0NOp9PoNi7L4KeLjG7hqnN06USjWwAA
AAAAXKLOPvt3eQgJAAAAAACAGRBgAAAAAAAA0yPAAAAAAAAApkeAAQAAAAAATK/Lq5AAPZmRJ07l
BKIAAAAA0L0IMIAguFqvOkNwAwAAACBYOIQEAAAAAACYHgEGAAAAAAAwPQIMAAAAAABgegQYAAAA
AADA9AgwAAAAAACA6RFgAAAAAAAA0yPAAAAAAAAApkeAAQAAAAAATI8AAwAAAAAAmF4foxsA0HsM
frrI6BYMcXTpRKNbAAAAAHo9ZmAAAAAAAADTI8AAAAAAAACmR4ABAAAAAABMr8sAY9euXRo1apTS
09N1//33q6WlRS6XSxaLRePGjVNNTY0k6eDBg0pPT5fFYtH27dslSadOndLUqVM1evRoLVu2LDBm
fn6+bDabZs6cqZaWFknqcEwAAAAAAAC/LgOM5ORk7dixQ2VlZRo8eLB+//vfa/ny5XK73Vq0aJEK
CgokSXPnztXq1atVXFys+fPnS5JWrVql7OxsVVRUaMeOHaqtrdW+fftUW1ur8vJypaWladOmTfJ6
vR2OCQAAAAAA4NdlgDFw4ED17dtXkhQVFaWPP/5YQ4YMUVRUlKxWqz788ENJ0meffabU1FTFxMQo
ISFBJ06ckMfjUWZmpiRpwoQJ2rlz5znLsrKyVFlZqUOHDnU4JgAAAAAAgN9FXUb1008/1datW7V0
6VIdP348sLy1tVWS1NbWFlgWGxururo61dfXKyYm5oJlAwcO7HS99mMCQE9h5OVjuYQrAAAArhZf
G2A0NTVp5syZKiwsVGtrq5qamgK3RURESJLCw/81kaOxsVEJCQmKi4tTU1OT4uLi1NjYqEGDBsnr
9QZ+//z1zh/Tz+VyyeVySZKqq6uvYFMBAAAAAEBP1eUhJF6vV9OnT9eCBQt08803KzU1VQcOHFBz
c7M8Ho+GDh0q6eyhJkeOHNHJkydVV1enxMREWSwWlZaWSpJKS0s1cuTIc5aVlJTIarV2OqZfTk6O
nE6nnE6nkpOTg3EfAAAAAAAAk+tyBsbGjRv1/vvvq6CgQAUFBZo9e7by8vJkt9sVHR2ttWvXSpKW
LFmi3Nxctba2auHChZKkhx9+WDNmzNCaNWs0adIkJSUlKSkpSQMGDJDNZlNKSormzJmjyMjIDscE
AAAAAADwC/P5fD6jm7hYDodDTqfT6DYui5HHyAPovTgHBgAAAHqbzj77X9RJPAEA5sQJRAEAAHC1
6PIcGAAAAAAAAGZAgAEAAAAAAEyPAAMAAAAAAJge58AAAFwWzr8BAACAUCLAAAD0OIQnAAAAVx8O
IQEAAAAAAKZHgAEAAAAAAEyPAAMAAAAAAJgeAQYAAAAAADA9AgwAAAAAAGB6BBgAAAAAAMD0uIwq
AACXgEu4AgAAGIMZGAAAAAAAwPSYgQEAQA/B7A8AAHA1YwYGAAAAAAAwPQIMAAAAAABgegQYAAAA
AADA9AgwAAAAAACA6XUZYDQ2NuqOO+5Q//79tX//fkmSy+WSxWLRuHHjVFNTI0k6ePCg0tPTZbFY
tH37dknSqVOnNHXqVI0ePVrLli0LjJmfny+bzaaZM2eqpaWl0zEBAAAAAAD8ugww+vXrp6KiIt13
332SJK/Xq+XLl8vtdmvRokUqKCiQJM2dO1erV69WcXGx5s+fL0latWqVsrOzVVFRoR07dqi2tlb7
9u1TbW2tysvLlZaWpk2bNnU6JgAAAAAAgF+XAUZkZKSuvfbawP8PHTqkIUOGKCoqSlarVR9++KEk
6bPPPlNqaqpiYmKUkJCgEydOyOPxKDMzU5I0YcIE7dy585xlWVlZqqys7HRMAAAAAAAAv0s6B0Z9
fb1iYmIC/29tbZUktbW1BZbFxsaqrq7unHUvdln7MQEAAAAAAPz6XMrKcXFxampqCvw/IiJCkhQe
/q8cpLGxUQkJCYF14+Li1NjYqEGDBsnr9QZ+//z1zh/Tz+VyyeVySZKqq6svcfMAAAAAAEBvcEkz
MFJTU3XgwAE1NzfL4/Fo6NChkqSBAwfqyJEjOnnypOrq6pSYmCiLxaLS0lJJUmlpqUaOHHnOspKS
Elmt1k7H9MvJyZHT6ZTT6VRycnJ3bDMAAAAAAOhhvnYGRnZ2tvbu3auPP/5YjzzyiPLy8mS32xUd
Ha21a9dKkpYsWaLc3Fy1trZq4cKFkqSHH35YM2bM0Jo1azRp0iQlJSUpKSlJAwYMkM1mU0pKiubM
maPIyMgOxwQAAAAAAPAL8/l8PqObuFgOh0NOp9PoNi7L4KeLjG4BAIAe6ejSiUa3AAAAQqizz/6X
dAgJAAAAAACAES7pJJ4AAAChxizG0GPWCwDAjJiBAQAAAAAATI8ZGAAAADiHkbNemP0BAOgMAQYA
AABMg/AEANAZDiEBAAAAAACmxwwMAAAAQMz+AACzI8AAAAAADMbVdkKP0AjoeQgwAAAAAFx1mHED
9DwEGAAAAAAQQoQnwOUhwAAAAACAq8TVergSwU3vQIABAAAAAOjVCG56By6jCgAAAAAATI8AAwAA
AAAAmB4BBgAAAAAAMD0CDAAAAAAAYHoEGAAAAAAAwPQIMAAAAAAAgOkRYAAAAAAAANMjwAAAAAAA
AKZnmgAjPz9fNptNM2fOVEtLi9HtAAAAAAAAEzFFgLFv3z7V1taqvLxcaWlp2rRpk9EtAQAAAAAA
EzFFgOHxeJSZmSlJysrKUmVlpcEdAQAAAAAAM+ljdAOSVF9fr4EDB0qSYmNjVVdXF7jN5XLJ5XJJ
knbv3i2Hw2FIj1fqDgNrV1dXKzk5mdrUpja1qU1talOb2tSmNrWpfRXVHjVqsWG1r8SRI0c6vsFn
AitXrvStXbvW5/P5fLt37/Y99thjBnfUu+Tk5FCb2tSmNrWpTW1qU5va1KY2tando5niEBKLxaLS
0lJJUklJiaxWq8EdAQAAAAAAM4l49tlnnzW6ieuuu04ej0cFBQVqbm7WM888o4iICKPb6lVuueUW
alOb2tSmNrWpTW1qU5va1KY2tXusMJ/P5zO6CQAAAAAAgK6Y4hASAAAAAACArhBgIKjefvttrV69
2ug2DOF0OnX77bdflZcF9vl8ysrK0vTp041uJeiqq6v1X//1X0a3IUmaPn26vF5v0OuYaZsRWnl5
eTp9+rT+7//+T3a7XePHjw96zddffz3oNS7Fli1b9I9//COkNefNm6d9+/ZJkh588MFzrtbWnfyv
2dOmTRMTdC80fPhwSVJubq72799vcDfBYbfb5Xa7ZYIjzEPK/7dtz/933rt3r1555ZVuq+V/ntnt
dj377LNyu93dNrbRfv3rX2vkyJFKT0/XlClTJElut1uffPJJh+v7X1OCzQz7tu5+HF3NCDAQVK+9
9poeeOABo9swhMPh0OLFi/XOO+8Y3UrIHTt2TOHh4fr1r39tdCtBl5ycrGPHjqmhocHQPsrKyvTt
b39bffoE/+rYZtlmhN6KFSvUt29f7du3T7feemvgBNxXqq2trdPbLiXA6Gqc7mJEgOG/v9va2tTQ
0KCEhISg1PG/ZlssFm3dujUoNYDOFBYWmvLD/He+8x3Nnj2728brze+Nly5dqrKyMpWVlWnNmjWS
Og8w2traAq8pwWaGfVt3P46uZgQYCJqGhga1trYqOjra6FYM069fP505c8boNkLuq6++0r/9278Z
3UbI2Gw2lZSUGNrDli1bNGHChJDV829zqL49MYM//vGP+u53v6uxY8ca9u3kI488YkhdP7vdri+/
/FI/+9nPtHnzZj366KNXPN5TTz2lO++8Uz6fTz/5yU80duxYjR8/XjU1NXrllVf08ccfy263a8eO
HYH6knTffffp6NGjKiws1PTp0zV58mQVFxfr9ttv1+OPP67vfve7eu655y66F6/Xq/vuu0/jx4/X
Y489ptzcXBUXF8tms8lisWjjxo3629/+puLiYn3/+9/XU089dUXbfrG++OILDRgwQJK0e/duDRs2
LCh12r9mT5gwQVu2bJF09kPlzp07g1JTOvsYeOKJJ5Senq7HH39cknTmzBnNmDFDGRkZuvvuu9XU
1KTXX39dGzZs0OnTp3XNNdfo73//u9577z0tWLCg2+sXFhbqpZdekiS99dZbIX++Nzc3q6WlJaQ1
/datW6c77rgjcF/0ZHa7XT/5yU+Unp6un/3sZ5I6/9t++eWXuv/++zV8+HBt2LDhnHHcbrfmzJkj
6ew3+SNHjpTdbtf//u//XnJP7Z9n69at0+OPP6477rjjCrby4rW0tKi5uTmoNU6fPi2Px6PW1lbF
x8fr9OnTKiws1DPPPKNZs2bJ7XZr8uTJmjJligoLCwP79MLCQt17772aPHmyRowYoc8//1yS9Itf
/EKjRo3ST3/6U91+++2X1VNn+7ZQa/84wpUhwEDQfPLJJxo8eLDRbRgqNjZWNTU1RrcRctXV1YqN
jQ1pzV/96ley2+3n/CxdujQktb/5zW/qo48+Ckmtzhw8eFDf/OY3Q1bPv82h+vbEDIqKirRgwQK9
++67mj9/viE9vPbaa4bUPd+yZcs0bdo0vfzyy1c81p133qlt27apqKhI8fHxevfdd7VkyRItXbpU
s2fP1s033yy3262MjIxOx4iMjNSbb76p7OxsNTQ06Mknn5TH47mkDxhbtmzRTTfdpNLSUt16663y
+XwqKCjQ9u3bVV5erpdeekkpKSnKysrS//zP/2jZsmVXvO0Xo6SkRHfeeackqbi4WHfddVdQ6rR/
zW6/T8vNzdWoUaOCUtPvnnvuUVlZmaqqqtTY2KhVq1YpIyNDO3bs0AMPPKDXX39dNptN5eXlev/9
95WRkaHy8nKVl5crPT292+sb5S9/+YueeOIJZWRkGNZHSkqK+vXrp8TEREPqd7fJkyerrKxMX3zx
hf70pz91ul5NTY1WrlypyspKLVu2TK2trRes09bWpmeeeUZbt26V2+2+rFkU7Z9nKSkpSkxMVL9+
/S55nMvR2NiojIwMPfHEE/rLX/4SlBrr16/Xiy++qBtvvFELFy5U3759lZubq1/+8pdat25doI/N
mzfroYceOud3Y2Nj9eabb+qhhx6Sy+XSsWPHVFJSIo/Ho8cff1z19fWX1VNn+7berqv3xUa+Z+4O
wZ9rDFzFbr31Vn388ceaPn36VXE4hXT2Q96DDz6oTZs2hbTuk08+qSeffDKkNXF1eeyxx7R48WKt
X79eDzzwgLKzs41uqVcYMWKEJOmjjz7S7373O5WVlcnn8yk5OfmCdcPCwgL/bn8cs38MSYqPj9eg
QYMk6ZJmAB4+fDgwu2HYsGHasmWLPvnkE2VmZko6+y3e8ePHL2HLuse2bdv04osvSjo7A2PevHkh
7yHYbrvtNknSf/zHf6ihoUEfffSRPvjgA61bt04tLS2y2WxKS0vTgQMHVFZWprlz52rDhg2qrq7W
E0880e31O3ucBUNLS4sKCwu1adMmDRo0SN///ve1fPnyoNY0m0WLFmnHjh06duyYoqOjFRcXp+99
73v60Y9+dMVj+5/TI0aM0KFDhzr9237jG98IHJqVnJysEydOXDDW8ePHlZycrJiYGElSeHjP+h44
MTFRFRUV8ng8euGFF/Tpp5/qvvvuU25uriIjI7ulxvDhw/Xb3/5Wzc3NysrK0sGDBztcp/3fwc//
PExOTlZVVZWOHj2qoUOHKiwsTDfddJP69+/fLT1eLbp6X9zT3zMTYCBobrrpJh09etToNgy1d+9e
paamhjS8qKmpUVJSUsjqnW/ixInavHmz1q1bJ7vdHrK6v/rVr1RUVHTOsqysLD399NNBr/3Xv/5V
Q4YMCXqdrtx8883661//GrJvzfzbXFtbq+uvv77DNyO9TWxsrF566SU1Nzdr2LBhhgQYRj+/g8H/
ISAtLU0Oh0P//d//LUmBKfTtH1vx8fGqqanRjTfeeM43iO0/SFzuY/HGG2/Unj17dO+992rPnj1K
TExUWlqatm7dqqioKLW0tCgyMlKRkZEdfjsbDG1tbWpqalJcXJzq6uoUHx8ftA9N7V+z2+/T6urq
FB0dHdRvic//UJmWlqZRo0Zp5syZks4+FsLCwpSQkKDKykrNnz9fy5cv11dffdUtfZ1fPz4+XgcO
HJCkwMlTg+XkyZN69dVXdS2onNYAAASPSURBVPvtt2v27NmBD3FXk/nz52v+/PkqLCzU4MGDu/W9
w549ezR+/Hjt3r1bdrtdtbW1Hf5tjx49qvr6evXr10/V1dUdvpZee+21qqmp0Zdffqn+/furra3t
kp+PZnhvbLFY1LdvX7388st69dVXde+993bbeXUOHTqk1NRURUVFKTY2Vj6f74J9Zmf32fnPw8GD
B2v//v3y+Xw6fPhw4PDBS9XZvq236+p9sZHvmbtDz4oOcVmOHTt2xceIXo64uDiFh4dfleeA8Gtq
alJKSkrI6nm9Xt1///0hq9eZlJSUkJ/g8cknn5Tb7T7nJ1Q74vLy8sAUb6Pcc8892rZtW8jq+bd5
xowZQT+m1ixee+01paeny263Kzc3N+T1zfL8DpbJkyfrn//8p8aOHauMjIzAdOObb75Z9957ryor
K/Xoo48qJydHDz74YOC8EN3lnnvu0cGDBzVu3Di9//77uuaaazRv3jxNmDBBY8eODUwXv+uuu5SX
l6clS5Z0a/2O7Nq1KzC7ZOvWrUE9z0371+xt27bpP//zPyVJy5cvD+o5MDryox/9SNu2bVNGRoYy
MjICJ90bPXp04PxK1113XdA+7I8fP14ej0fZ2dn69NNPg1LDLyEhQVVVVXr00Ue1evVqjR07VitW
rNBXX30V1LpXi3feeUfp6elKTEzUsGHDOv3bJicn66c//amsVqvmzJmjiIiIC8YKDw/XkiVLNG7c
OI0dO1br16+/5H6MfG/81VdfacWKFRo7dqxWr16tRx99VFVVVd16UuA5c+bIYrFo9OjRSktL05Ah
Q5SRkaHnn38+cB6Si3XddddpwoQJGjVqlFasWHHZfXa2b+vtunpfbOR75u4Q5uM6WQiioqIiHTt2
TD/4wQ9CUq/9JSRDcTWG82u2Fx4erjfffFNVVVVatGhRSHrZtWuX9u3bpx/+8IdBr9XVdh8/flwP
PPBAt12h4GLrGjGds7q6Ws8//7xWrFgRknpdbf/999+v9evXB/2x79/m559/Xo899pheffXVoNbD
WWZ5fl/O88xsz9vO+GdZvP7666qvr1d+fn7Iand0H1VVVSkxMVE33HCDSkpKNHz4cP37v/970Hrw
v2Zv3bpVGzduVHh4uGbPnq0XX3zxivcrRj8GjK5/sc6cOaPNmzcrKysraFebkYy5P4JVs7NxMzIy
9Pbbb5vu0INQvzf2q6urU3FxsaZOnXrFJ9gP1ePHv0/+5JNPlJeXp7fffvuyxulo39Zb9JR9W7fy
Ab2IpMBPKFRXV59Ts/3PkCFDfOnp6b49e/aEpJdQ6mq7FyxY4PP5fL5Zs2b5HA5HyOv2Zlf79vt8
Pl9DQ4NvzJgx5/zs3r3b6LZ6le5+nPWkx+1dd93ls9lsvvHjx/v++c9/hqxuT7qPLofR22d0fbMx
4v4IVs2uxh0zZozv5MmT3bcR8Pl8oX38PPPMM7709HTfiBEjfB988EG3jt0bXK37NmZgAAAAAAAA
0+ul80oAAAAAAEBvQoABAAAAAABMjwADAAAAAACYHgEGAAAAAAAwPQIMAAAAAABgev8PFLEHrpOd
occAAAAASUVORK5CYII=
" />
</div>
</div>
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAABDAAAADQCAYAAADxn5GHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAAJOgAACToB8GSSSgAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0
dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3de1zUdb7H8Tcgph0fgKytWYi2xabH
smOGAsMA3ogQ3dUEq9Ui85RmblYeKbc1lWNlF9pOma3rtdVtY3xUtvlIWkKSi23pEdS8HHPVRcxO
HAjIMpjhe/7owSyYN3AuP+z1fDx6JN+Z+X0+37n8ZuY9399MgDHGCAAAAAAAwMIC/d0AAAAAAADA
uRBgAAAAAAAAyyPAAAAAAAAAlkeAAQAAAAAALI8AAwAAAAAAWF4nfzfQFoMHD9bVV1/t7zYAAAAA
AICXHDx4UNu3b//BeIcKMK6++mrl5ub6uw0AAAAAAOAlGRkZpx3nEBIAAAAAAGB5BBgAAAAAAMDy
CDAAAAAAAIDlnTXAqK2t1ZAhQ9StWzft3r3bPX7kyBFdcskl7rF9+/YpISFBcXFx+uCDDyRJJ06c
0Pjx4xUfH69nnnnGfdmsrCzZ7XZNnjxZjY2NkiSHw6G4uDiNGDFCR48e9fgkAQAAAABAx3bWAOPS
Sy/Vxo0bNWHChFbjzzzzjGw2m/vvuXPnasWKFdq0aZPmzZsnSVq+fLlSU1NVXFysgoICVVZWqry8
XJWVlSoqKlK/fv20fv16OZ1O5eTkqLCwUAsXLlR2drYXpgkAAAAAADqyswYYwcHBuuyyy1qNHTp0
SAEBAYqMjHSPHTt2TFFRUQoJCVF4eLiqqqpUWlqq5ORkSdKoUaO0devWVmMpKSkqKSnRgQMH1L9/
f3Xu3Fk2m007d+709BwBAAAAAEAH1+bvwFi8eLFmz57daqypqcn979DQUFVXV6umpkYhISFtGpMk
l8vVatsOh0MZGRnKyMhQRUVFW9sFAAAAAAAXgU5tOfPBgwclSX379m01Hhj4zxyktrZW4eHhCgsL
U11dncLCwlRbW6s+ffrI6XSqrq7utOdrFhQU1Grb6enpSk9Pl3Tm34LtCPo+utFvtQ8/PdpvtQEA
AAAA8IQ2rcAoLy/Xp59+qpSUFP31r3/VtGnTdPLkSfXq1UsHDx5UfX29qqur1aNHD8XFxSk/P1+S
lJ+fr5iYmFZjeXl5stlsioqK0t69e9XQ0KDS0lINHDjQ87MEAAAAAAAd2jlXYKSmpqqsrEz79+/X
fffdp6KiIklSZmamZs+erS5dumjRokXKzMyUy+XSggULJElTp07VpEmTtHLlSqWlpSkiIkIRERHq
2bOn7Ha7IiMjNXv2bAUHB2vWrFlKSkpSly5dtGbNGu/OGAAAAAAAdDgBxhjj7ybOV0ZGhnJzc/3d
RrtwCAkAAAAAAOd2pvf+bf4STwAAAAAAAF8jwAAAAAAAAJZHgAEAAAAAACyPAAMAAAAAAFgeAQYA
AAAAALA8AgwAAAAAAGB5BBgAAAAAAMDyCDAAAAAAAIDlEWAAAAAAAADLI8AAAAAAAACWR4ABAAAA
AAAsjwADAAAAAABYHgEGAAAAAACwPAIMAAAAAABgeQQYAAAAAADA8ggwAAAAAACA5Z01wKitrdWQ
IUPUrVs37d69W/X19Ro+fLgSEhI0fPhwHTlyRJK0b98+JSQkKC4uTh988IEk6cSJExo/frzi4+P1
zDPPuLeZlZUlu92uyZMnq7GxUZLkcDgUFxenESNG6OjRo96aKwAAAAAA6KDOGmBceuml2rhxoyZM
mCBJCg4O1tq1a7VlyxZlZWXp2WeflSTNnTtXK1as0KZNmzRv3jxJ0vLly5Wamqri4mIVFBSosrJS
5eXlqqysVFFRkfr166f169fL6XQqJydHhYWFWrhwobKzs708ZQAAAAAA0NGcNcAIDg7WZZdd5v67
S5cuuuKKKyRJnTt3VmDg9xc/duyYoqKiFBISovDwcFVVVam0tFTJycmSpFGjRmnr1q2txlJSUlRS
UqIDBw6of//+6ty5s2w2m3bu3OmViQIAAAAAgI6rXd+B0dDQoPnz52vmzJmSpKamJvdpoaGhqq6u
Vk1NjUJCQto0Jkkul6vdkwEAAAAAABenTu250L333qv7779fUVFRkuReiSF9/70Z4eHhCgsLU11d
ncLCwlRbW6s+ffrI6XSqrq7utOdrFhQU1KqWw+GQw+GQJFVUVLSnXQAAAAAA0MG1eQXGggUL9LOf
/UwTJ050j/Xq1UsHDx5UfX29qqur1aNHD8XFxSk/P1+SlJ+fr5iYmFZjeXl5stlsioqK0t69e9XQ
0KDS0lINHDiwVb309HTl5uYqNzdXvXv3vpC5AgAAAACADuqcKzBSU1NVVlam/fv3KzU1VdnZ2YqP
j1dBQYFiY2P11FNPadGiRcrMzJTL5dKCBQskSVOnTtWkSZO0cuVKpaWlKSIiQhEREerZs6fsdrsi
IyM1e/ZsBQcHa9asWUpKSlKXLl20Zs0ar08aAAAAAAB0LAHGGOPvJs5XRkaGcnNz/d1Gu/R9dKPf
ah9+erTfagMAAAAA0BZneu/fri/xBAAAAAAA8CUCDAAAAAAAYHkEGAAAAAAAwPIIMAAAAAAAgOWd
81dI0PHxBaIAAAAAgI6OAANeRXgCAAAAAPAEDiEBAAAAAACWR4ABAAAAAAAsjwADAAAAAABYHgEG
AAAAAACwPAIMAAAAAABgeQQYAAAAAADA8ggwAAAAAACA5RFgAAAAAAAAyyPAAAAAAAAAlkeAAQAA
AAAALI8AAwAAAAAAWN5ZA4za2loNGTJE3bp10+7duyVJDodDcXFxGjFihI4ePSpJ2rdvnxISEhQX
F6cPPvhAknTixAmNHz9e8fHxeuaZZ9zbzMrKkt1u1+TJk9XY2HjGbQIAAAAAADQ7a4Bx6aWXauPG
jZowYYIkyel0KicnR4WFhVq4cKGys7MlSXPnztWKFSu0adMmzZs3T5K0fPlypaamqri4WAUFBaqs
rFR5ebkqKytVVFSkfv36af369WfcJgAAAAAAQLOzBhjBwcG67LLL3H8fOHBA/fv3V+fOnWWz2bRz
505J0rFjxxQVFaWQkBCFh4erqqpKpaWlSk5OliSNGjVKW7dubTWWkpKikpKSM24TAAAAAACgWZu+
A6OmpkYhISHuv10ulySpqanJPRYaGqrq6upW5z3fsZbbBAAAAAAAaNapLWcOCwtTXV2d+++goCBJ
UmDgP3OQ2tpahYeHu88bFham2tpa9enTR06n0335U8936jabORwOORwOSVJFRUUbpwcAAAAAAC4G
bVqBERUVpb1796qhoUGlpaUaOHCgJKlXr146ePCg6uvrVV1drR49eiguLk75+fmSpPz8fMXExLQa
y8vLk81mO+M2m6Wnpys3N1e5ubnq3bu3J+YMAAAAAAA6mHOuwEhNTVVZWZn279+v++67T7NmzVJS
UpK6dOmiNWvWSJIWLVqkzMxMuVwuLViwQJI0depUTZo0SStXrlRaWpoiIiIUERGhnj17ym63KzIy
UrNnz1ZwcPBptwkAAAAAANAswBhj/N3E+crIyFBubq6/22iXvo9u9HcLPzqHnx7t7xYAAAAAAG10
pvf+bTqEBAAAAAAAwB8IMAAAAAAAgOURYAAAAAAAAMsjwAAAAAAAAJZHgAEAAAAAACyPAAMAAAAA
AFgeAQYAAAAAALA8AgwAAAAAAGB5BBgAAAAAAMDyCDAAAAAAAIDlEWAAAAAAAADL6+TvBgBv6fvo
Rr/VPvz0aL/VBgAAAICLESswAAAAAACA5RFgAAAAAAAAyyPAAAAAAAAAlkeAAQAAAAAALI8AAwAA
AAAAWF6bA4ympiZlZmbKbrcrPj5e+/btU3FxseLi4hQfH69du3ZJko4fP67k5GTZbDatXbtWkuRy
uTRlyhTZ7XbNmjXLvc0XX3xRNptNY8eOVV1dnYemBgAAAAAALhZtDjDKysr03XffqaioSE899ZRy
cnL0m9/8Rhs3btSf/vQnZWVlSZIWL16sOXPm6MMPP9SSJUt08uRJvfvuu7riiitUVFSkEydOaOvW
raqqqtI777yj4uJiTZw4UUuWLPH4JAEAAAAAQMfW5gAjIiJCxhgZY1RTU6N/+Zd/UVBQkLp3767I
yEhVV1dLkj7++GMNHz5cnTp10k033aTdu3ertLRUycnJkqSUlBSVlJTok08+UWJiogICAtxjAAAA
AAAALXVq6wV69Oih4OBg9evXTydPnlRRUZF+/etf/3ODnTqpoaFBjY2NCgz8Ph8JDQ1VdXW1ampq
FBIScs6xlhwOhxwOhySpoqKifbMEAAAAAAAdWpsDjPfff1+dOnXS/v37tW3bNj3yyCOtvrfC6XSq
c+fOCg4OVlNTkwIDA1VbW6vw8HCFhYW5z9ty7LPPPms11lJ6errS09MlSRkZGe2eKAAAAAAA6Lja
fAiJMUY/+clPJH2/GqO+vl5Op1NfffWVKioq3AFEdHS0CgsL5XQ6tX37dg0YMEBxcXHKz8+XJOXl
5clmsyk6OlpbtmxpNQYAAAAAANBSm1dgjBo1SqtXr1ZiYqK+++475eTkyOl0KjU1VQEBAXrllVck
SVlZWbrzzjv1+OOPa9q0aeratavS0tL09ttvy263a9CgQYqNjZUkjR49WjabTd27d9e6des8O0MA
AAAAANDhBRhjjL+bOF8ZGRnKzc31dxvt0vfRjf5uAT50+OnR/m4BAAAAADqkM733b/MhJAAAAAAA
AL5GgAEAAAAAACyPAAMAAAAAAFgeAQYAAAAAALA8AgwAAAAAAGB5BBgAAAAAAMDyCDAAAAAAAIDl
EWAAAAAAAADLI8AAAAAAAACW18nfDQAXo76PbvRb7cNPj/ZbbQAAAADwFlZgAAAAAAAAyyPAAAAA
AAAAlkeAAQAAAAAALI8AAwAAAAAAWB4BBgAAAAAAsDwCDAAAAAAAYHkEGAAAAAAAwPLaFWAUFhZq
xIgRGjZsmN566y0VFxcrLi5O8fHx2rVrlyTp+PHjSk5Ols1m09q1ayVJLpdLU6ZMkd1u16xZs9zb
e/HFF2Wz2TR27FjV1dV5YFoAAAAAAOBi0uYA49tvv9Xzzz+v9957T5s3b9a4ceP0m9/8Rhs3btSf
/vQnZWVlSZIWL16sOXPm6MMPP9SSJUt08uRJvfvuu7riiitUVFSkEydOaOvWraqqqtI777yj4uJi
TZw4UUuWLPH4JAEAAAAAQMfW5gBj69at6tq1q8aMGaNx48bp888/V1BQkLp3767IyEhVV1dLkj7+
+GMNHz5cnTp10k033aTdu3ertLRUycnJkqSUlBSVlJTok08+UWJiogICAtxjAAAAAAAALXVq6wW+
+OILffbZZ/roo4+Un5+vJ554QiEhIf/cYKdOamhoUGNjowIDv89HQkNDVV1drZqaGvd5zzbWksPh
kMPhkCRVVFS0b5YAAAAAAKBDa/MKjLCwMNlsNnXu3FkjRozQjh07Wn1vhdPpVOfOnRUcHKympiZJ
Um1trcLDwxUWFuY+79nGWkpPT1dubq5yc3PVu3fvdk8UAAAAAAB0XG0OMKKjo7V3714ZY1RWVqZ/
/dd/ldPp1FdffaWKigp3ABEdHa3CwkI5nU5t375dAwYMUFxcnPLz8yVJeXl5stlsio6O1pYtW1qN
AQAAAAAAtNTmQ0h69OihcePGub+3YuXKlaqsrFRqaqoCAgL0yiuvSJKysrJ055136vHHH9e0adPU
tWtXpaWl6e2335bdbtegQYMUGxsrSRo9erRsNpu6d++udevWeXaGAAAAAACgwwswxhh/N3G+MjIy
lJub6+822qXvoxv93QJ+JA4/PdrfLQAAAABAu53pvX+bDyEBAAAAAADwNQIMAAAAAABgeQQYAAAA
AADA8ggwAAAAAACA5RFgAAAAAAAAyyPAAAAAAAAAlkeAAQAAAAAALI8AAwAAAAAAWB4BBgAAAAAA
sDwCDAAAAAAAYHkEGAAAAAAAwPI6+bsBAJ7V99GNfqt9+OnRfqsNAAAA4OLGCgwAAAAAAGB5rMAA
4DGs/gAAAADgLazAAAAAAAAAlkeAAQAAAAAALK/dAcbrr7+uyy67TJLkcDgUFxenESNG6OjRo5Kk
ffv2KSEhQXFxcfrggw8kSSdOnND48eMVHx+vZ555xr2trKws2e12TZ48WY2NjRcyHwAAAAAAcBFq
V4DhcrnkcDjUu3dvOZ1O5eTkqLCwUAsXLlR2drYkae7cuVqxYoU2bdqkefPmSZKWL1+u1NRUFRcX
q6CgQJWVlSovL1dlZaWKiorUr18/rV+/3nOzAwAAAAAAF4V2BRivv/660tPTFRgYqAMHDqh///7q
3LmzbDabdu7cKUk6duyYoqKiFBISovDwcFVVVam0tFTJycmSpFGjRmnr1q2txlJSUlRSUuKhqQEA
AAAAgItFmwMMl8ul3NxcTZw4UZJUU1OjkJCQVqdLUlNTk3ssNDRU1dXVrc57tjEAAAAAAICW2vwz
qmvXrlVGRoYCA7/PPsLCwlRXV+c+PSgoSJLcp0tSbW2twsPD3ecNCwtTbW2t+vTpI6fT6b588/la
cjgccjgckqSKioq2tgsAAAAAAC4CbV6BsWfPHr322mtKSUnRgQMH9NJLL2nv3r1qaGhQaWmpBg4c
KEnq1auXDh48qPr6elVXV6tHjx6Ki4tTfn6+JCk/P18xMTGtxvLy8mSz2VrVS09PV25urnJzc9W7
d+8LnS8AAAAAAOiA2rwCY/Hixe5/33TTTVq6dKneeOMNJSUlqUuXLlqzZo0kadGiRcrMzJTL5dKC
BQskSVOnTtWkSZO0cuVKpaWlKSIiQhEREerZs6fsdrsiIyM1e/ZsD00NAAAAAABcLAKMMcbfTZyv
jIwM5ebm+ruNdun76EZ/twDASw4/PdrfLQAAAAAXjTO992/Xr5AAAAAAAAD4UpsPIQEAtObPFVas
/gAAAMCPBQEGAHRghCcAAAD4seAQEgAAAAAAYHkEGAAAAAAAwPIIMAAAAAAAgOURYAAAAAAAAMsj
wAAAAAAAAJbHr5AAANqFX0ABAACAL7ECAwAAAAAAWB4BBgAAAAAAsDwOIQEAdDgcvgIAAPDjwwoM
AAAAAABgeQQYAAAAAADA8jiEBACANuDwFQAAAP9gBQYAAAAAALA8AgwAAAAAAGB5bT6E5OOPP9aD
Dz6o4OBgXXnllXrttdf09ttv64UXXlDXrl21Zs0aRUREaN++fbr33nvldDqVnZ2tESNG6MSJE5o8
ebL+93//V2PHjtWcOXMkSVlZWSotLVXfvn21cuVKBQcHe3yiAAB0dBy+AgAAfszavAKjd+/eKigo
0JYtW9S3b19t2LBBOTk5Kiws1MKFC5WdnS1Jmjt3rlasWKFNmzZp3rx5kqTly5crNTVVxcXFKigo
UGVlpcrLy1VZWamioiL169dP69ev9+wMAQAAAABAh9fmAKNXr17q2rWrJKlz587av3+/+vfvr86d
O8tms2nnzp2SpGPHjikqKkohISEKDw9XVVWVSktLlZycLEkaNWqUtm7d2mosJSVFJSUlnpobAAAA
AAC4SLT7V0iOHDmi999/X08//bS+/PJL97jL5ZIkNTU1ucdCQ0NVXV2tmpoahYSE/GCsV69ercZa
cjgccjgckqSKior2tgsAAC4Ah68AAAB/a1eAUVdXp8mTJ2v16tVyuVyqq6tznxYUFCRJCgz85+KO
2tpahYeHKywsTHV1dQoLC1Ntba369Okjp9Ppvnzz+VpKT09Xenq6JCkjI6M97QIAAAAAgA6uzQGG
0+nUbbfdpieeeELXXnutGhsbtXfvXjU0NGjbtm0aOHCgpO8PNTl48KB++tOfqrq6Wj169FBcXJzy
8/M1ZcoU5efn6w9/+IOqqqqUk5OjO++8U3l5ebLZbB6fJAAA6LhY/QEAAKR2BBivv/66/va3vyk7
O1vZ2dmaPn26Zs2apaSkJHXp0kVr1qyRJC1atEiZmZlyuVxasGCBJGnq1KmaNGmSVq5cqbS0NEVE
RCgiIkI9e/aU3W5XZGSkZs+e7dkZAgAAtBPhCQAA1hFgjDH+buJ8ZWRkKDc3199ttIs/XwABAAC0
BeEJAMCfzvTev91f4gkAAICLEytPAABW1OafUQUAAAAAAPA1VmAAAADAMlj9AQA4E1ZgAAAAAAAA
y2MFBgAAAKAf75eus/IEQEdBgAEAAAD8iP1Ygxt/IjQC2ocAAwAAAAB86McaGhHc4EIRYAAAAAAA
vO7HGtz408UWGvElngAAAAAAwPIIMAAAAAAAgOURYAAAAAAAAMsjwAAAAAAAAJZHgAEAAAAAACyP
AAMAAAAAAFgeAQYAAAAAALA8AgwAAAAAAGB5BBgAAAAAAMDyLBNgZGVlyW63a/LkyWpsbPR3OwAA
AAAAwEIsEWCUl5ersrJSRUVF6tevn9avX+/vlgAAAAAAgIVYIsAoLS1VcnKyJCklJUUlJSV+7ggA
AAAAAFhJJ383IEk1NTXq1auXJCk0NFTV1dXu0xwOhxwOhyRp27ZtysjI8EuPF2qIH2tXVFSod+/e
1KY2talNbWpTm9rUpja1qU3tH1Ht2Nj/9FvtC3Hw4MHTn2AsYMmSJWbNmjXGGGO2bdtmZsyY4eeO
Li7p6enUpja1qU1talOb2tSmNrWpTW1qd2iWOIQkLi5O+fn5kqS8vDzZbDY/dwQAAAAAAKwkaP78
+fP93cTll1+u0tJSZWdnq6GhQY899piCgoL83dZFZcCAAdSmNrWpTW1qU5va1KY2talNbWp3WAHG
GOPvJgAAAAAAAM7GEoeQAAAAAAAAnA0BBjxu9erVamhokCTNnz9f7777rp87Ai5OLR9rP3ZlZWUa
MmSIHnnkEX+34lWzZs3St99+6+82LGn37t3KzMz02vZffvllrV692mvbPx1/3N55eXmKjo7Ws88+
6/VaZWVlWrp0qdfrtNdNN93k1e2f6/XS8ePH9cQTT3i1h6SkJH399dderXEmhw8f1vvvv++X2lbh
7f2Wv7X3PcGyZcs81sOmTZv01ltveWx77VVWVqaPP/5Y0vf3/QkTJrR5G7Nnz1ZhYWG76jc/n3zz
zTdKSkrSyJEj27UdKyDAgMfxpgpWceLECZ/UuO2227xe53R4rP3Te++9p8cee0zPP/+8v1vxqt/9
7nfq2rWrv9uAj/jj9n7zzTe1bNky/cd//MdZz2eM0YUehfxv//Zvmj59+gVtoyM71z788ssv14IF
C3zYkW8RYFz82vs6xZMBRkpKisaNG+ex7bVXywDDH5qfT8rLy3XDDTe4f0CjIyLAgEdt3bpVZWVl
uuWWW5STkyNJeuONN5SamqrExET3J0lPPvmkEhMTlZCQoF27dvmzZY/66KOPNHToUA0bNky+/H7c
xsZGy7yRffbZZ5WUlKQbb7xRf/3rX31e/9tvv9Uf//hHpaSk6NVXX/V6vYKCAg0fPtzrdU516mPt
6NGjGjlypBISEvTAAw/4vB9f2rx5s2JiYhQTE6PXXntNe/bs0e9//3vNmzfPoy96TlVYWKjk5GSN
GTNG0dHRftl3NX9aumHDBg0ZMkTDhg3z+SfYDQ0Namxs9GqN0+1HMjMzNW3aNI0aNUq//OUvZYyR
0+lURkaGRo4cqRdeeKFNNYwxmjFjhux2u4YNG6aioiLFx8fLZrPpqaeekiRVVFTIbrfrlltuafVi
z1fPYc239+rVq3Xrrbe673uff/65V+oVFBRow4YNuvfee/XOO+/84LEmfX87zJgxQ8nJyaqqqrqg
eoWFhZo9e7ZuvPFGPfDAAxo6dKgWL14sSfrHP/4hm82m1NRU3XbbbV5Z/WKM0cyZMzVs2DCNHDlS
R48e1dNPP63Y2Fjde++9ampq8njNZufzeqnlp7R333237Ha7kpKSdPjwYY/28thjjykhIUEPPvig
JOnkyZOaNGmShg8frrFjx6qurs6j9ZotXbpUb7zxhpKSklRdXe2VGlZ0IfutC3W++1ZPOJ/7eFNT
k0aOHKnExESNGjVKdXV1Wrp0qfbv36+kpCQVFBS0qeb777+vQYMGKT09XQkJCTp8+LBWr16tl19+
WevXr3fvX77++mv3a7fVq1fLbrcrLi7OXS8pKUkPP/zwGV9TNZ8eExOj+fPna+bMmbrpppv0u9/9
TpL097//XTfffLOSkpL00EMPSfr+/v7iiy8qOTlZkvT5559r4sSJuv766911T7fPLS8vV3R0tNLS
0rRz5842XR+n9vz111/rwQcf1Jtvvqn777+/3dvyO//9gisuVomJiaa+vt4YY8wTTzxhFixYYIwx
Zs6cOWbDhg1m165d5s477zTGGFNZWWnGjh3rt1497fHHHzcbN240xhjjcrl8VvfLL780NpvNPPTQ
Q2b37t0+q3s6J06cMMYY88UXX5iEhASf1d21a5eZPn26GTlypFmyZImprq72Sd3777/fHDlyxCe1
TtXysTZjxgzz3nvvGWOMmTJlivnwww/90pMvDB061Hz55ZemoaHBDB482HzzzTfmiSeeMH/5y1+8
Wnfz5s3GZrOZpqYms2fPHjNmzBiv1jud5tt80qRJ5tNPPzXG+G5fs3v3bvPQQw8Zm81mvvzyS6/W
Ot1+5K677jJr1qwxxhiTkZFhysvLjcPhMI899pgxxpilS5eau+6667xrbNiwwTzwwAPuv9PS0sye
PXtMU1OTGTVqlDl06JCZMWOGycvLM8YYM3HiRLNq1SqfPoc1396rVq0yd999tzHGmFdeecW8+OKL
Xqt51113mV27dhljTv9Yu+uuu8zy5cs9Umvz5s3mkUceMVdddZU5fPiwcTqdZsCAAcYY0+q6v/32
282qVas8UrOlv/zlL+a3v/2tMcaYjz76yEybNs0kJCS4H+N9+/b1eM2WzvV66dChQ+bWW281DQ0N
JjY21jQ1NRljPPuYT0xMbHUf3759u3nppZfMihUrjDHG/PnPfzbPPvusx+q11Hz7/9hcyH7rQp3v
vtVTznUfb9lTTk6OWbZsmTHGmMGDB7er3tChQ83//d//mZMnT5q+ffuaQ4cOmVWrVpmXXnrJfPPN
N+45r1u3zrzwwgumqqrK3J254DUAAAj1SURBVHzzzaapqcl8/fXXJjEx0d138+uomJgY89VXX/1g
XsXFxcblcpkrr7zS7NixwzQ2Npobb7zRGGNMenq6+eyzz4wxxkybNs188skn7j6MMebQoUPm5z//
uWlsbDR79uwx48aNc/d/6j43LS3N7Nu3z7hcLhMbG2s2b97cruum+ba4GB53nfwdoODiN2jQIElS
7969VVNToz179qi0tFRJSUmSdFH9ZO6MGTP0n//5n1q3bp1+9atfKTU11Sd1e/TooeLiYpWWlurF
F1/UkSNHNGHCBGVmZio4ONgnPTT74x//qHXr1ikwMNBrnxKezubNm1VSUqJZs2YpPT1d3bp180nd
f/zjH4qMjPRJrbP57LPPFB0dLUmKjo7WgQMHlJCQ4OeuvMPlcqlHjx6SpGuuuUbHjh3zWe1BgwYp
ICBA/fv39+n9+1S//e1v9dxzz+nbb7/VjBkzFBMT45U6jY2NWr16tdavX68+ffro7rvvdn+S5k1n
2o+c+nzy2WefafDgwZK+v99/9NFH511j7969SkxMdP99/Phx9e/fX5J044036uDBgz/YviS/PYe1
nPv27dt9UvNMj7Xm68JTunfvrj59+kiSunTpIkmtrvvm/3vanj179NZbb2nLli0yxig4OFgDBw50
P8Z99TzS7NT7d7Pg4GDNmDFDkydP1k9+8hMtWrTIo721vI8fOHBAe/bs0SeffKLXXntNjY2Nstvt
HqtlJTk5OXrnnXc0evTocx4y5UkXst+6UOe7b/WWU+t8/fXXuu+++3T06FFVV1e363shWnK5XAoP
D5ckXXfdda1O69q1qyIjI/U///M/Wr9+vV5++WUdPHhQn376qYYNGyZJ+vLLL3/Q65VXXqmvvvpK
oaGhrbY3cOBABQYG6vLLL9cNN9yggIAA92vuffv26Z577pEk1dfX6+abb/5Br9ddd506derU6jo/
3T73+PHjuvbaayV5b1/Y0XAICTwuODhYLpfL/XdAQID738YY9evXT4mJiSosLFRhYaE2bdrkjza9
IjQ0VC+//LJWrVqlrKwsn9ePi4vT9OnTFRkZqVdffVX19fU+7+Gll17S5s2b9cYbb3hsGeL5mDlz
pkpKStTU1KRx48YpMzPT6y/y9+/fr5///OderXE2LR9r11xzjfvYyk8++URRUVFer3/06FGv1zid
wMBAVVVVqbGxUQcOHNAVV1zhs9plZWUyxmj//v3q1auXz+qeqnfv3lq2bJkWL16suXPneq1OfX29
Xn31VUVGRmr69OmKjY31Wq2WzrQfOfX55JprrtGOHTskSdu2bWtTjf79+2vLli3uvy+77DLt3btX
xhj993//t66++urTbt9fz2Gnzt0XzvRYCwz07MvHlnNr1vK6b/6/p/Xr108ZGRkqLCzUhx9+qFWr
VmnXrl3ux7i3v9zyXK+XmrlcLmVkZGjt2rXq2bOn3nzzTY/20fI+fs0116hfv3769a9/rcLCQpWU
lCg7O9uj9ZqdOn9fe/jhh1VYWOjT8ELSBe23LtT57ls95Vz38by8PF111VX68MMPlZmZ6a59un3C
+QgKClJNTY0aGhr06aef/uD0iRMnatmyZfrmm290xRVX6Gc/+5kGDhyozZs3q7CwUGVlZWfs9VQt
Tz+132uvvVZr1qxRYWGhtm3bprS0tPN6vJ9un9uzZ08dOHDA/dwEiRUY8LixY8cqIyNDt95662lP
HzhwoKKiopSYmKjAwECNGjXK4y/Ajx8/rqVLl/r8y69+//vf680335TT6fTpt0p/9913Wrp0qTZs
2KABAwbo/vvvdyfHvhYfH6/4+HjFxMT4/NOrbt266Z577tE999yjvXv36u9//7tX623atEm33HKL
V2ucTcvHWlZWlu666y49+eSTuu6667y++sLpdOr2229XUVGRV+uczpNPPqnRo0crICBADzzwgE+/
5DA0NFRjxozRF198oRUrVvis7qkWLFigrVu3qqGhQTNnzvRanfDwcG3fvl07duzQihUr9Omnn+oX
v/iFpk+frksuucRrdc93P/LLX/5Sf/7znzVixIg2h4ljxozRpk2bFB8fr+DgYM2fP19Tp06VMUaj
R49W3759NWfOHN1xxx167rnnFBISIsk3z2FW4c/H2pw5c3T77bfr+eefV9euXb2ymnDMmDEqKCjQ
sGHDFBAQoF/96ldKTk5WbGysBg8erO7du3u8Zkvner3UrL6+Xr/4xS8UEBCggIAArVu3zqN9vPfe
e1q4cKFuuOEGDR48WAMGDNC9996rVatWSZIeeeQRjR492qM1Jen666/XY489pvT0dP3hD39QWFiY
x2tY0YXsty6Ur1+jnes+HhMToyeffFI7duxQz5493Star732Wt166616+OGHZbPZzrvewoULNWLE
CF111VW6/PLLf7DfuPnmmzVlyhQtXLhQ0vcrmG+77TYlJiYqKChI119/vf7rv/6rnbP9p8WLF2va
tGk6efKkgoKCtHLlSsXGxurOO+/U3/72Nz355JOnvdzp9rnZ2dm644479NOf/tTr+6SOIsD48iNS
+IzT6TzteGBgoMc+OfFFDZyf6upqbdq0SePHj3cvv/U2f93+VrrfTZgwQevWrfPqGznJWnNu9vHH
H6u8vFz//u//7rUaVpt3YWGh3n33XT333HNer2W1uTc7efKk3nzzTaWkpLiX6V4Iq87T1/xxPVj9
unc6nerU6fvP2e644w49+OCDGjp0qJ+7ah8rXNdW6AG+4+vb2yr1XC6XLrnkEn333XeKjo7Wjh07
Lugwv474uOmIPbfVxTELtHL06FEFBwef9r/mxLEj1MD5Cw8P1x133OGz8MJft7/V7nfjx4/3enhh
tTk3GzJkiFfDC6vO2xesPPcuXbrojjvu8Eh4YeV5+pI/roeOcN0fOXJEdrtdsbGxCgkJ6bDhhRWu
ayv0AN/x9e1tpXo33nijkpKSFBsbq1mzZl1QeNERHzcdsef2YAUGAAAAAACwPFZgAAAAAAAAyyPA
AAAAAAAAlkeAAQAAAAAALI8AAwAAAAAAWB4BBgAAAAAAsLz/Byf5uN9NkeDaAAAAAElFTkSuQmCC
" />
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Plotting the token frequencies as a bar chart gives you a clearer picture of your data. Not too surprisingly, the most common token turns out to be the period, followed by other common syntactic tokens like curly braces and keywords such as <em>if</em> and <em>return</em>.</p>
</div>
</div>
</div>
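<p>To make the idea concrete, here is a minimal sketch of how such token frequencies could be computed before plotting. This is not the notebook's actual code; the function and variable names are illustrative, and it assumes you already have your samples tokenized into lists of strings.</p>

```python
from collections import Counter

def token_frequencies(token_lists, top_n=10):
    """Count tokens across tokenized samples and return the top_n most common."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(top_n)

# Tiny illustration on two made-up "code" samples
samples = [["if", "(", "x", ")", "{", "return", "x", ";", "}"],
           ["return", ";", "."]]
print(token_frequencies(samples, top_n=3))
```

<p>The resulting list of (token, count) pairs is exactly what you would feed into a bar chart like the one above.</p>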
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">plot_hist</span><span class="p">(</span><span class="n">lens</span><span class="p">,</span> <span class="n">n_bins</span> <span class="o">=</span> <span class="mi">50</span><span class="p">):</span>
<span class="n">n</span><span class="p">,</span> <span class="n">bins</span><span class="p">,</span> <span class="n">patches</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">lens</span><span class="p">,</span> <span class="n">n_bins</span><span class="p">,</span> <span class="n">facecolor</span><span class="o">=</span><span class="s1">'blue'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.9</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">mean</span><span class="p">(</span><span class="n">code_lens</span><span class="p">),</span> <span class="n">median</span><span class="p">(</span><span class="n">code_lens</span><span class="p">),</span> <span class="n">stdev</span><span class="p">(</span><span class="n">code_lens</span><span class="p">))</span>
<span class="n">plot_hist</span><span class="p">(</span><span class="n">code_lens</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">mean</span><span class="p">(</span><span class="n">locs</span><span class="p">),</span> <span class="n">median</span><span class="p">(</span><span class="n">locs</span><span class="p">),</span> <span class="n">stdev</span><span class="p">(</span><span class="n">locs</span><span class="p">))</span>
<span class="n">plot_hist</span><span class="p">(</span><span class="n">locs</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">mean</span><span class="p">(</span><span class="n">comment_lens</span><span class="p">),</span> <span class="n">median</span><span class="p">(</span><span class="n">comment_lens</span><span class="p">),</span> <span class="n">stdev</span><span class="p">(</span><span class="n">comment_lens</span><span class="p">))</span>
<span class="n">plot_hist</span><span class="p">(</span><span class="n">comment_lens</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>3.4662519750610943 3.0 2.6490695431339177
</pre>
</div>
</div>
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAD4CAYAAADCb7BPAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0
dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAUDUlEQVR4nO3dbYyd5X3n8e+vOCQsbbAJXgvbaM0q
ViIqNQRG4ChR1RIVDFvFvIgioqpYkRW/CFklUqUWdqVFTfqCvCkNUoqEAsVU2RCWNouFkrheB2ml
lXgYB8KTw3qaBGEb8ATzsC1SsqT/fXGuWQ7TsX3msjlzxv5+pKNz3//7us/1n9Gxfr4fzplUFZIk
LdZvLHUDkqTlyQCRJHUxQCRJXQwQSVIXA0SS1GXFUjcwLuedd15t2LBhqduQpGVl7969v6iq1Qtt
O20CZMOGDUxPTy91G5K0rCR5/mjbPIUlSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaI
JKmLASJJ6nLafBL9RKxdu3D90KHx9iFJk2SkI5AkK5Pcn+QnSfYl+ViSc5PsTrK/Pa9qY5PktiQz
SZ5McsnQ62xt4/cn2TpUvzTJU22f25Kk1Rc9hyRpPEY9hfV14AdV9WHgI8A+4EZgT1VtBPa0dYCr
gY3tsR24HQZhANwMXA5cBtw8FwhtzOeH9tvc6ouaQ5I0PscNkCTnAL8L3AlQVb+qqteALcCONmwH
cG1b3gLcUwMPAyuTnA9cBeyuqiNV9SqwG9jctr2/qh6uwR9ov2feay1mDknSmIxyBHIhMAv8TZLH
k3wzydnAmqp6sY15CVjTltcBLwztf6DVjlU/sECdjjneIcn2JNNJpmdnZ0f4USVJoxolQFYAlwC3
V9VHgX/m7VNJALQjhzr57Z3YHFV1R1VNVdXU6tULfp29JKnTKAFyADhQVY+09fsZBMrLc6eN2vPh
tv0gcMHQ/utb7Vj19QvU6ZhDkjQmxw2QqnoJeCHJh1rpk8CzwE5g7k6qrcADbXkncH27U2oT8Ho7
DbULuDLJqnbx/EpgV9v2RpJN7e6r6+e91mLmkCSNyaifA/mPwLeSnAn8FPgcg/C5L8k24HngM23s
94BrgBngzTaWqjqS5KvAY23cV6rqSFv+AnA3cBbw/fYAuGUxc0iSxieDSwunvqmpqer9k7Z+kFDS
6SrJ3qqaWmibX2UiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6
GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6
GCCSpC4GiCSpiwEiSeoyUoAk+XmSp5I8kWS61c5NsjvJ/va8qtWT5LYkM0meTHLJ0OtsbeP3J9k6
VL+0vf5M2ze9c0iSxmMxRyC/X1UXV9VUW78R2FNVG4E9bR3gamBje2wHbodBGAA3A5cDlwE3zwVC
G/P5of0298whSRqfEzmFtQXY0ZZ3ANcO1e+pgYeBlUnOB64CdlfVkap6FdgNbG7b3l9VD1dVAffM
e63FzCFJGpNRA6SAf0iyN8n2VltTVS+25ZeANW15HfDC0L4HWu1Y9QML1HvmeIck25NMJ5menZ0d
6QeVJI1mxYjjPlFVB5P8W2B3kp8Mb6yqSlInv70Tm6Oq7gDuAJiamnpX+5Ok081IRyBVdbA9Hwa+
y+Aaxstzp43a8+E2/CBwwdDu61vtWPX1C9TpmEOSNCbHDZAkZyf5rbll4ErgaWAnMHcn1Vbggba8
E7i+3Sm1CXi9nYbaBVyZZFW7eH4lsKtteyPJpnb31fXzXmsxc0iSxmSUU1hrgO+2O2tXAP+1qn6Q
5DHgviTbgOeBz7Tx3wOuAWaAN4HPAVTVkSRfBR5r475SVUfa8heAu4GzgO+3B8Ati5lDkjQ+Gdz4
dOqbmpqq6enprn3Xrl24fujQCTQkSctAkr1DH994Bz+JLknqYoBIkroYIJKkLgaIJKmLASJJ6mKA
SJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKA
SJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqMnKAJDkjyeNJHmzrFyZ5JMlMku8k
ObPV39vWZ9r2DUOvcVOrP5fkqqH65labSXLjUH3Rc0iSxmMxRyBfAvYNrX8NuLWqPgi8Cmxr9W3A
q61+axtHkouA64DfBjYDf91C6QzgG8DVwEXAZ9vYRc8hSRqfkQIkyXrgPwDfbOsBrgDub0N2ANe2
5S1tnbb9k238FuDeqvplVf0MmAEua4+ZqvppVf0KuBfY0jmHJGlMRj0C+SvgT4F/aesfAF6rqrfa
+gFgXVteB7wA0La/3sb///q8fY5W75njHZJsTzKdZHp2dnbEH1WSNIrjBkiSPwQOV9XeMfRzUlXV
HVU1VVVTq1evXup2JOmUsmKEMR8HPpXkGuB9wPuBrwMrk6xoRwDrgYNt/EHgAuBAkhXAOcArQ/U5
w/ssVH+lYw5J0pgc9wikqm6qqvVVtYHBRfAfVtUfAQ8Bn27DtgIPtOWdbZ22/YdVVa1+XbuD6kJg
I/Ao8Biwsd1xdWabY2fbZ7FzSJLGZJQjkKP5M+DeJH8BPA7c2ep3An+bZAY4wiAQqKpnktwHPAu8
BdxQVb8GSPJFYBdwBnBXVT3TM4ckaXxyuvzHfWpqqqanp7v2Xbt24fqhQyfQkCQtA0n2VtXUQtv8
JLokqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQu
BogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQu
xw2QJO9L8miSHyd5Jsmft/qFSR5JMpPkO0nObPX3tvWZtn3D0Gvd1OrPJblqqL651WaS3DhUX/Qc
kqTxGOUI5JfAFVX1EeBiYHOSTcDXgFur6oPAq8C2Nn4b8Gqr39rGkeQi4Drgt4HNwF8nOSPJGcA3
gKuBi4DPtrEsdg5J0vgcN0Bq4J/a6nvao4ArgPtbfQdwbVve0tZp2z+ZJK1+b1X9sqp+BswAl7XH
TFX9tKp+BdwLbGn7LHYOSdKYjHQNpB0pPAEcBnYD/wi8VlVvtSEHgHVteR3wAkDb/jrwgeH6vH2O
Vv9AxxySpDEZKUCq6tdVdTGwnsERw4ff1a5OkiTbk0wnmZ6dnV3qdiTplLKou7Cq6jXgIeBjwMok
K9qm9cDBtnwQuACgbT8HeGW4Pm+fo9Vf6Zhjfr93VNVUVU2tXr16MT+qJOk4RrkLa3WSlW35LOAP
gH0MguTTbdhW4IG2vLOt07b/sKqq1a9rd1BdCGwEHgUeAza2O67OZHChfWfbZ7FzSJLGZMXxh3A+
sKPdLfUbwH1V9WCSZ4F7k/wF8DhwZxt/J/C3SWaAIwwCgap6Jsl9wLPAW8ANVfVrgCRfBHYBZwB3
VdUz7bX+bDFzSJLGJ6fLf9ynpqZqenq6a9+1axeuHzp0Ag1J0jKQZG9VTS20zU+iS5K6GCCSpC4G
iCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4G
iCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKnLiqVu4FS0du3Rtx06NL4+
JOnd5BGIJKnLcQMkyQVJHkrybJJnknyp1c9NsjvJ/va8qtWT5LYkM0meTHLJ0GttbeP3J9k6VL80
yVNtn9uSpHcOSdJ4jHIE8hbwJ1V1EbAJuCHJRcCNwJ6q2gjsaesAVwMb22M7cDsMwgC4GbgcuAy4
eS4Q2pjPD+23udUXNYckaXyOGyBV9WJV/agt/x9gH7AO2ALsaMN2ANe25S3APTXwMLAyyfnAVcDu
qjpSVa8Cu4HNbdv7q+rhqirgnnmvtZg5JEljsqhrIEk2AB8FHgHWVNWLbdNLwJq2vA54YWi3A612
rPqBBep0zDG/3+1JppNMz87OjvZDSpJGMnKAJPlN4O+AL1fVG8Pb2pFDneTe3qFnjqq6o6qmqmpq
9erV71JnknR6GilAkryHQXh8q6r+vpVfnjtt1J4Pt/pB4IKh3de32rHq6xeo98whSRqTUe7CCnAn
sK+q/nJo005g7k6qrcADQ/Xr251Sm4DX22moXcCVSVa1i+dXArvatjeSbGpzXT/vtRYzhyRpTEb5
IOHHgT8GnkryRKv9J+AW4L4k24Dngc+0bd8DrgFmgDeBzwFU1ZEkXwUea+O+UlVH2vIXgLuBs4Dv
tweLnUOSND4ZXFo49U1NTdX09HTXvkf7ZPnRPlXuJ9ElnSqS7K2qqYW2+Ul0SVIXA0SS1MUAkSR1
MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUZ5csUdRTH+s4rSTrVeQQiSepi
gEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSepy
3ABJcleSw0meHqqdm2R3kv3teVWrJ8ltSWaSPJnkkqF9trbx+5NsHapfmuSpts9tSdI7hyRpfEY5
Arkb2DyvdiOwp6o2AnvaOsDVwMb22A7cDoMwAG4GLgcuA26eC4Q25vND+23umUOSNF7HDZCq+p/A
kXnlLcCOtrwDuHaofk8NPAysTHI+cBWwu6qOVNWrwG5gc9v2/qp6uKoKuGfeay1mDknSGPVeA1lT
VS+25ZeANW15HfDC0LgDrXas+oEF6j1z/CtJtieZTjI9Ozs74o8mSRrFCV9Eb0cOdRJ6OelzVNUd
VTVVVVOrV69+FzqTpNNXb4C8PHfaqD0fbvWDwAVD49a32rHq6xeo98whSRqj3gDZCczdSbUVeGCo
fn27U2oT8Ho7DbULuDLJqnbx/EpgV9v2RpJN7e6r6+e91mLmkCSN0YrjDUjybeD3gPOSHGBwN9Ut
wH1JtgHPA59pw78HXAPMAG8CnwOoqiNJvgo81sZ9parmLsx/gcGdXmcB328PFjuHJGm8Mri8cOqb
mpqq6enprn3Xrj15fRw6dPJeS5LebUn2VtXUQtv8JLokqctxT2Hp5Frs0YxHLJImlUcgkqQuBogk
qYsBIknqYoBIkrp4EX3CHe2iuxfXJS01j0AkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLU
xQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFvweyTPl3QiQtNY9AJEldDBBJ
UpdleworyWbg68AZwDer6pYlbmkiHO3U1tF4yktSr2V5BJLkDOAbwNXARcBnk1y0tF1J0ulluR6B
XAbMVNVPAZLcC2wBnl3SrpahxR6xHI1HMtLpZ7kGyDrghaH1A8Dl8wcl2Q5sb6v/lOS5EV//POAX
J9Th+C1pz8mid/F3PB72/O5bbv3C4nr+d0fbsFwDZCRVdQdwx2L3SzJdVVPvQkvvmuXW83LrF+x5
XJZbz8utXzh5PS/LayDAQeCCofX1rSZJGpPlGiCPARuTXJjkTOA6YOcS9yRJp5VleQqrqt5K8kVg
F4PbeO+qqmdO4hSLPu01AZZbz8utX7DncVluPS+3fuEk9ZyqOhmvI0k6zSzXU1iSpCVmgEiSuhgg
Q5JsTvJckpkkNy51PwtJcleSw0meHqqdm2R3kv3tedVS9jhfkguSPJTk2STPJPlSq09s30nel+TR
JD9uPf95q1+Y5JH2HvlOu4ljYiQ5I8njSR5s65Pe78+TPJXkiSTTrTax7wuAJCuT3J/kJ0n2JfnY
JPec5EPt9zv3eCPJl09GzwZIs4y+HuVuYPO82o3AnqraCOxp65PkLeBPquoiYBNwQ/vdTnLfvwSu
qKqPABcDm5NsAr4G3FpVHwReBbYtYY8L+RKwb2h90vsF+P2qunjocwmT/L6AwXfw/aCqPgx8hMHv
e2J7rqrn2u/3YuBS4E3gu5yMnqvKx+BGgo8Bu4bWbwJuWuq+jtLrBuDpofXngPPb8vnAc0vd43H6
fwD4g+XSN/BvgB8x+LaDXwArFnrPLPWDweeh9gBXAA8CmeR+W08/B86bV5vY9wVwDvAz2g1Iy6Hn
eX1eCfyvk9WzRyBvW+jrUdYtUS+LtaaqXmzLLwFrlrKZY0myAfgo8AgT3nc7HfQEcBjYDfwj8FpV
vdWGTNp75K+APwX+pa1/gMnuF6CAf0iyt331EEz2++JCYBb4m3aq8JtJzmayex52HfDttnzCPRsg
p5ga/HdiIu/NTvKbwN8BX66qN4a3TWLfVfXrGhz2r2fwBZ4fXuKWjirJHwKHq2rvUveySJ+oqksY
nDq+IcnvDm+cwPfFCuAS4Paq+ijwz8w79TOBPQPQrn99Cvhv87f19myAvG05fz3Ky0nOB2jPh5e4
n38lyXsYhMe3qurvW3ni+waoqteAhxicAlqZZO4DuJP0Hvk48KkkPwfuZXAa6+tMbr8AVNXB9nyY
wXn5y5js98UB4EBVPdLW72cQKJPc85yrgR9V1ctt/YR7NkDetpy/HmUnsLUtb2VwjWFiJAlwJ7Cv
qv5yaNPE9p1kdZKVbfksBtds9jEIkk+3YRPTc1XdVFXrq2oDg/fuD6vqj5jQfgGSnJ3kt+aWGZyf
f5oJfl9U1UvAC0k+1EqfZPBnJCa25yGf5e3TV3Ayel7qizqT9ACuAf43g3Pd/3mp+zlKj98GXgT+
L4P/DW1jcK57D7Af+B/AuUvd57yeP8Hg8PhJ4In2uGaS+wZ+B3i89fw08F9a/d8DjwIzDE4FvHep
e12g998DHpz0fltvP26PZ+b+zU3y+6L1dzEw3d4b/x1YtQx6Pht4BThnqHbCPftVJpKkLp7CkiR1
MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUpf/BxrCEguPRLjZAAAAAElFTkSuQmCC
" />
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>18.54957629991187 10 50.99032748692644
</pre>
</div>
</div>
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAYQAAAD4CAYAAADsKpHdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0
dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAVGElEQVR4nO3db4ydZ3nn8e+vMYEsbbBNvJb/RGuj
WlTpi4bkKHFEVXVhcZy0wnmBUKJq7c1m8WoDK9hdqZtsX0SFvoDVqhRradqIUBxECWkKGytK6vWa
SPsqIeNC85esJ6HZ2E7iAefPtkjQ0GtfnMtwcMb2sT2eGc98P9LRuZ/rvp/n3Pc8Jr85z3nOkKpC
kqRfmOsJSJLmBwNBkgQYCJKkZiBIkgADQZLUlsz1BE7XRRddVOvWrZvraUjSOWPfvn3fr6oVx+s/
ZwNh3bp1TExMzPU0JOmckeT5E/V7yUiSBBgIkqRmIEiSAANBktQMBEkSYCBIkpqBIEkCDARJUjMQ
JEnAOfxN5TOxevX09UOHZncekjSf+A5BkgQYCJKkZiBIkgADQZLUDARJEjBGICR5d5LvjDxeT/KJ
JMuT7Emyv5+X9fgk2ZFkMsljSS4bOda2Hr8/ybaR+uVJHu99diTJ2VmuJOl4ThoIVfVMVV1aVZcC
lwM/BL4B3ALsraoNwN7eBrgG2NCP7cDtAEmWA7cBVwJXALcdDZEe85GR/TbPyOokSWM71UtG7wee
rarngS3Azq7vBK7r9hbgrhp6GFiaZBVwNbCnqo5U1SvAHmBz911YVQ9XVQF3jRxLkjRLTjUQrge+
2u2VVfVit18CVnZ7DfDCyD4Hunai+oFp6pKkWTR2ICQ5H/gg8BfH9vVv9jWD8zreHLYnmUgyMTU1
dbZfTpIWlVN5h3AN8NdV9XJvv9yXe+jnw10/CFw8st/arp2ovnaa+ptU1R1VNaiqwYoVK05h6pKk
kzmVQLiBn10uAtgFHL1TaBtw30h9a99ttBF4rS8t7QY2JVnWHyZvAnZ33+tJNvbdRVtHjiVJmiVj
/XG7JG8HPgD825Hyp4F7ktwEPA98uOsPANcCkwzvSLoRoKqOJPkU8GiP+2RVHen2zcCXgAuAB/sh
SZpFGV7+P/cMBoOamJg4rX39a6eSFqMk+6pqcLx+v6ksSQIMBElSMxAkSYCBIElqBoIkCTAQJEnN
QJAkAQaCJKkZCJIkwECQJDUDQZIEGAiSpGYgSJIAA0GS1AwESRJgIEiSmoEgSQIMBElSMxAkSYCB
IElqYwVCkqVJ7k3y3SRPJ7kqyfIke5Ls7+dlPTZJdiSZTPJYkstGjrOtx+9Psm2kfnmSx3ufHUky
80uVJJ3IuO8QPgf8VVX9CvBrwNPALcDeqtoA7O1tgGuADf3YDtwOkGQ5cBtwJXAFcNvREOkxHxnZ
b/OZLUuSdKpOGghJ3gH8BnAnQFX9uKpeBbYAO3vYTuC6bm8B7qqhh4GlSVYBVwN7qupIVb0C7AE2
d9+FVfVwVRVw18ixJEmzZJx3COuBKeDPknw7yReSvB1YWVUv9piXgJXdXgO8MLL/ga6dqH5gmvqb
JNmeZCLJxNTU1BhTlySNa5xAWAJcBtxeVe8B/p6fXR4CoH+zr5mf3s+rqjuqalBVgxUrVpztl5Ok
RWWcQDgAHKiqR3r7XoYB8XJf7qGfD3f/QeDikf3Xdu1E9bXT1CVJs+ikgVBVLwEvJHl3l94PPAXs
Ao7eKbQNuK/bu4CtfbfRRuC1vrS0G9iUZFl/mLwJ2N19ryfZ2HcXbR05liRpliwZc9y/B76S5Hzg
OeBGhmFyT5KbgOeBD/fYB4BrgUnghz2WqjqS5FPAoz3uk1V1pNs3A18CLgAe7IckaRZlePn/3DMY
DGpiYuK09l29evr6oUNnMCFJmueS7KuqwfH6/aayJAkwECRJzUCQJAEGgiSpGQiSJMBAkCQ1A0GS
BBgIkqRmIEiSAANBktQMBEkSYCBIkpqBIEkCDARJUjMQJEmAgSBJagaCJAkwECRJzUCQJAFjBkKS
v03yeJLvJJno2vIke5Ls7+dlXU+SHUkmkzyW5LKR42zr8fuTbBupX97Hn+x9M9MLlSSd2Km8Q/jn
VXXpyP9B8y3A3qraAOztbYBrgA392A7cDsMAAW4DrgSuAG47GiI95iMj+20+7RVJkk7LmVwy2gLs
7PZO4LqR+l019DCwNMkq4GpgT1UdqapXgD3A5u67sKoerqoC7ho5liRplowbCAX8zyT7kmzv2sqq
erHbLwEru70GeGFk3wNdO1H9wDT1N0myPclEkompqakxpy5JGseSMcf9elUdTPJPgT1JvjvaWVWV
pGZ+ej+vqu4A7gAYDAZn/fUkaTEZ6x1CVR3s58PANxh+BvByX+6hnw/38IPAxSO7r+3aieprp6lL
kmbRSQMhyduT/NLRNrAJeALYBRy9U2gbcF+3dwFb+26jjcBrfWlpN7ApybL+MHkTsLv7Xk+yse8u
2jpyLEnSLBnnktFK4Bt9J+gS4M+r6q+SPArck+Qm4Hngwz3+AeBaYBL4IXAjQFUdSfIp4NEe98mq
OtLtm4EvARcAD/ZDkjSLMryx59wzGAxqYmLitPZdvXr6+qFDZzAhSZrnkuwb+erAm/hNZUkSYCBI
kpqBIEkCDARJUjMQJEmAgSBJagaCJAkwECRJzUCQJAEGgiSpGQiSJMBAkCQ1A0GSBBgIkqRmIEiS
AANBktQMBEkSYCBIkpqBIEkCTiEQkpyX5NtJ7u/t9UkeSTKZ5GtJzu/6W3t7svvXjRzj1q4/k+Tq
kfrmrk0muWXmlidJGtepvEP4OPD0yPZngM9W1S8DrwA3df0m4JWuf7bHkeQS4HrgV4HNwB93yJwH
fB64BrgEuKHHSpJm0ViBkGQt8FvAF3o7wPuAe3vITuC6bm/pbbr//T1+C3B3Vf2oqr4HTAJX9GOy
qp6rqh8Dd/dYSdIsGvcdwh8Bvwv8Y2+/E3i1qt7o7QPAmm6vAV4A6P7XevxP68fsc7z6myTZnmQi
ycTU1NSYU5ckjeOkgZDkt4HDVbVvFuZzQlV1R1UNqmqwYsWKuZ6OJC0oS8YY817gg0muBd4GXAh8
DliaZEm/C1gLHOzxB4GLgQNJlgDvAH4wUj9qdJ/j1SVJs+Sk7xCq6taqWltV6xh+KPzNqvod4CHg
Qz1sG3Bft3f1Nt3/zaqqrl/fdyGtBzYA3wIeBTb0XUvn92vsmpHVSZLGNs47hOP5z8DdSf4A+DZw
Z9fvBL6cZBI4wvA/8FTVk0nuAZ4C3gA+WlU/AUjyMWA3cB7wxap68gzmJUk6DRn+8n7uGQwGNTEx
cVr7rl49ff3QoTOYkCTNc0n2VdXgeP1+U1mSBBgIkqRmIEiSAANBktQMBEkSYCBIkpqBIEkCDARJ
UjMQJEmAgSBJagaCJAkwECRJzUCQJAEGgiSpGQiSJMBAkCQ1A0GSBBgIkqRmIEiSgDECIcnbknwr
yd8keTLJ73d9fZJHkkwm+VqS87v+1t6e7P51I8e6tevPJLl6pL65a5NJbpn5ZUqSTmacdwg/At5X
Vb8GXApsTrIR+Azw2ar6ZeAV4KYefxPwStc/2+NIcglwPfCrwGbgj5Ocl+Q84PPANcAlwA09VpI0
i04aCDX0d735ln4U8D7g3q7vBK7r9pbepvvfnyRdv7uqflRV3wMmgSv6MVlVz1XVj4G7e6wkaRaN
9RlC/yb/HeAwsAd4Fni1qt7oIQeANd1eA7wA0P2vAe8crR+zz/Hq081je5KJJBNTU1PjTF2SNKax
AqGqflJVlwJrGf5G/ytndVbHn8cdVTWoqsGKFSvmYgqStGCd0l1GVfUq8BBwFbA0yZLuWgsc7PZB
4GKA7n8H8IPR+jH7HK8uSZpF49xltCLJ0m5fAHwAeJphMHyoh20D7uv2rt6m+79ZVdX16/supPXA
BuBbwKPAhr5r6XyGHzzvmonFSZLGt+TkQ1gF7Oy7gX4BuKeq7k/yFHB3kj8Avg3c2ePvBL6cZBI4
wvA/8FTVk0nuAZ4C3gA+WlU/AUjyMWA3cB7wxap6csZWKEkaS4a/vJ97BoNBTUxMnNa+q1dPXz90
6AwmJEnzXJJ9VTU4Xr/fVJYkAQaCJKkZCJIkwECQJDUDQZIEGAiSpGYgSJIAA0GS1AwESRJgIEiS
moEgSQIMBElSMxAkSYCBIElqBoIkCTAQJEnNQJAkAQaCJKkZCJIkYIxASHJxkoeSPJXkySQf7/ry
JHuS7O/nZV1Pkh1JJpM8luSykWNt6/H7k2wbqV+e5PHeZ0eSnI3FSpKOb5x3CG8A/6mqLgE2Ah9N
cglwC7C3qjYAe3sb4BpgQz+2A7fDMECA24ArgSuA246GSI/5yMh+m898aZKkU3HSQKiqF6vqr7v9
/4CngTXAFmBnD9sJXNftLcBdNfQwsDTJKuBqYE9VHamqV4A9wObuu7CqHq6qAu4aOZYkaZac0mcI
SdYB7wEeAVZW1Yvd9RKwsttrgBdGdjvQtRPVD0xTn+71tyeZSDIxNTV1KlOXJJ3E2IGQ5BeBvwQ+
UVWvj/b1b/Y1w3N7k6q6o6oGVTVYsWLF2X45SVpUxgqEJG9hGAZfqaqvd/nlvtxDPx/u+kHg4pHd
13btRPW109QlSbNonLuMAtwJPF1VfzjStQs4eqfQNuC+kfrWvttoI/BaX1raDWxKsqw/TN4E7O6+
15Ns7NfaOnIsSdIsWTLGmPcC/xJ4PMl3uvZfgE8D9yS5CXge+HD3PQBcC0wCPwRuBKiqI0k+BTza
4z5ZVUe6fTPwJeAC4MF+SJJmUYaX/889g8GgJiYmTmvf1aunrx86dAYTkqR5Lsm+qhocr99vKkuS
AANBktQMBEkSYCBIkpqBIEkCDARJUjMQJEmAgSBJagaCJAkwECRJzUCQJAEGgiSpGQiSJMBAkCQ1
A0GSBBgIkqRmIEiSAANBktQMBEkSMEYgJPliksNJnhipLU+yJ8n+fl7W9STZkWQyyWNJLhvZZ1uP
359k20j98iSP9z47kmSmFylJOrlx3iF8Cdh8TO0WYG9VbQD29jbANcCGfmwHbodhgAC3AVcCVwC3
HQ2RHvORkf2OfS1J0iw4aSBU1f8GjhxT3gLs7PZO4LqR+l019DCwNMkq4GpgT1UdqapXgD3A5u67
sKoerqoC7ho5liRpFp3uZwgrq+rFbr8ErOz2GuCFkXEHunai+oFp6pKkWXbGHyr3b/Y1A3M5qSTb
k0wkmZiampqNl5SkReN0A+HlvtxDPx/u+kHg4pFxa7t2ovraaerTqqo7qmpQVYMVK1ac5tQlSdM5
3UDYBRy9U2gbcN9IfWvfbbQReK0vLe0GNiVZ1h8mbwJ2d9/rSTb23UVbR44lSZpFS042IMlXgd8E
LkpygOHdQp8G7klyE/A88OEe/gBwLTAJ/BC4EaCqjiT5FPBoj/tkVR39oPpmhncyXQA82A9J0izL
8COAc89gMKiJiYnT2nf16unrhw6dwYQkaZ5Lsq+qBsfr95vKkiTAQJAkNQNBkgQYCJKkZiBIkgAD
QZLUDARJEmAgSJKagSBJAgwESVIzECRJgIEgSWoGgiQJMBAkSc1AkCQBBoIkqRkIkiTAQJAkNQNB
kgQYCJKkNm8CIcnmJM8kmUxyy1zPR5IWm3kRCEnOAz4PXANcAtyQ5JK5nZUkLS5L5noC7Qpgsqqe
A0hyN7AFeGo2J7F69fT1Q4dmcxaSNDfmSyCsAV4Y2T4AXHnsoCTbge29+XdJnjnN17sI+P64g5PT
fJX55ZTWvEC45sXBNY/vn52oc74Ewliq6g7gjjM9TpKJqhrMwJTOGa55cXDNi8PZWvO8+AwBOAhc
PLK9tmuSpFkyXwLhUWBDkvVJzgeuB3bN8ZwkaVGZF5eMquqNJB8DdgPnAV+sqifP4kue8WWnc5Br
Xhxc8+JwVtacqjobx5UknWPmyyUjSdIcMxAkScAiC4SF9Ocxklyc5KEkTyV5MsnHu748yZ4k+/t5
WdeTZEev/bEkl40ca1uP359k21ytaVxJzkvy7ST39/b6JI/02r7WNyaQ5K29Pdn960aOcWvXn0ly
9dysZDxJlia5N8l3kzyd5KqFfp6T/If+d/1Ekq8medtCO89JvpjkcJInRmozdl6TXJ7k8d5nRzLG
N6qqalE8GH5Y/SzwLuB84G+AS+Z6XmewnlXAZd3+JeD/MPyzH/8VuKXrtwCf6fa1wINAgI3AI11f
DjzXz8u6vWyu13eStf9H4M+B+3v7HuD6bv8J8O+6fTPwJ92+Hvhaty/p8/9WYH3/uzhvrtd1gvXu
BP5Nt88Hli7k88zwi6rfAy4YOb//aqGdZ+A3gMuAJ0ZqM3ZegW/12PS+15x0TnP9Q5nFH/5VwO6R
7VuBW+d6XjO4vvuADwDPAKu6tgp4ptt/CtwwMv6Z7r8B+NOR+s+Nm28Pht9R2Qu8D7i//7F/H1hy
7HlmeNfaVd1e0uNy7LkfHTffHsA7+j+OOaa+YM8zP/vLBcv7vN0PXL0QzzOw7phAmJHz2n3fHan/
3LjjPRbTJaPp/jzGmjmay4zqt8jvAR4BVlbVi931ErCy28db/7n2c/kj4HeBf+ztdwKvVtUbvT06
/5+urftf6/Hn0prXA1PAn/Vlsi8keTsL+DxX1UHgvwH/F3iR4Xnbx8I+z0fN1Hld0+1j6ye0mAJh
QUryi8BfAp+oqtdH+2r4q8GCua84yW8Dh6tq31zPZRYtYXhZ4faqeg/w9wwvJfzUAjzPyxj+ccv1
wGrg7cDmOZ3UHJiL87qYAmHB/XmMJG9hGAZfqaqvd/nlJKu6fxVwuOvHW/+59HN5L/DBJH8L3M3w
stHngKVJjn7JcnT+P11b978D+AHn1poPAAeq6pHevpdhQCzk8/wvgO9V1VRV/QPwdYbnfiGf56Nm
6rwe7Pax9RNaTIGwoP48Rt8xcCfwdFX94UjXLuDonQbbGH62cLS+te9W2Ai81m9NdwObkizr38w2
dW3eqapbq2ptVa1jeP6+WVW/AzwEfKiHHbvmoz+LD/X46vr1fXfKemADww/g5p2qegl4Icm7u/R+
hn8WfsGeZ4aXijYm+Sf97/zomhfseR4xI+e1+15PsrF/hltHjnV8c/2hyix/gHMtw7txngV+b67n
c4Zr+XWGbycfA77Tj2sZXjvdC+wH/hewvMeH4f8J0bPA48Bg5Fj/Gpjsx41zvbYx1/+b/Owuo3cx
/B/6JPAXwFu7/rbenuz+d43s/3v9s3iGMe6+mOO1XgpM9Ln+HwzvJlnQ5xn4feC7wBPAlxneKbSg
zjPwVYafkfwDw3eCN83keQUG/fN7FvjvHHNjwnQP/3SFJAlYXJeMJEknYCBIkgADQZLUDARJEmAg
SJKagSBJAgwESVL7/27i0eAlIZFDAAAAAElFTkSuQmCC
" />
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>3.57512896650546 3.0 2.605938655784157
</pre>
</div>
</div>
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAD4CAYAAADCb7BPAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0
dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAASW0lEQVR4nO3df6zddX3H8edrVBTZpAVuCG3J2sVG
05mpeIM1mmWTDQozlj+MwZjRmMb+IW46l2xlS0am/2iyyCRREiIoJEZk6EZj1K6rJMuWgNyqkx+V
ceeP0RbolfJjm4kOfe+P86k7XO+90M8t55zbPh/Jyfl+39/P93ze9x701e+Pc26qCkmSjtevjLsB
SdLKZIBIkroYIJKkLgaIJKmLASJJ6rJq3A2MyrnnnlsbNmwYdxuStKLs37//R1U1tdC2UyZANmzY
wMzMzLjbkKQVJckPF9vmKSxJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElS
l1Pmk+jLsXbtwvXDh0fbhyRNEo9AJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1
MUAkSV0MEElSFwNEktTFAJEkdTFAJEldDBBJUpfnDZAkNyc5kuT+odrZSfYmebg9r2n1JLk+yWyS
7yS5cGif7W38w0m2D9XfkOS+ts/1SdI7hyRpdF7IEchnga3zaruAfVW1CdjX1gEuAza1x07gBhiE
AXAt8EbgIuDaY4HQxrx3aL+tPXNIkkbreQOkqv4ZODqvvA24pS3fAlwxVL+1Bu4GVic5H7gU2FtV
R6vqSWAvsLVte0VV3V1VBdw677WOZw5J0gj1XgM5r6oebcuPAee15XXAI0PjDrbaUvWDC9R75vgl
SXYmmUkyMzc39wJ/NEnSC7Hsi+jtyKFOQC8nfI6qurGqpqtqempq6kXoTJJOXb0B8vix00bt+Uir
HwIuGBq3vtWWqq9foN4zhyRphHoDZDdw7E6q7cCdQ/Wr2p1SW4Cn22moPcAlSda0i+eXAHvatmeS
bGl3X10177WOZw5J0giter4BST4P/A5wbpKDDO6m+ihwe5IdwA+Bd7bhXwEuB2aBHwPvAaiqo0k+
Atzbxn24qo5dmH8fgzu9zgC+2h4c7xySpNHK4PLCyW96erpmZma69l27duH64cPLaEiSVoAk+6tq
eqFtfhJdktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0M
EElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0M
EElSFwNEktTFAJEkdTFAJEldDBBJUhcDRJLUZVkBkuRPkjyQ5P4kn0/ysiQbk9yTZDbJF5Kc3sa+
tK3Ptu0bhl7nmlZ/KMmlQ/WtrTabZNdQfcE5JEmj0x0gSdYBfwxMV9VrgNOAK4GPAddV1SuBJ4Ed
bZcdwJOtfl0bR5LNbb/fBLYCn0pyWpLTgE8ClwGbgXe1sSwxhyRpRJZ7CmsVcEaSVcDLgUeBtwJ3
tO23AFe05W1tnbb94iRp9duq6idV9X1gFrioPWar6ntV9VPgNmBb22exOSRJI9IdIFV1CPgb4D8Z
BMfTwH7gqap6tg07CKxry+uAR9q+z7bx5wzX5+2zWP2cJeZ4jiQ7k8wkmZmbm+v9USVJC1jOKaw1
DI4eNgJrgTMZnIKaGFV1Y1VNV9X01NTUuNuRpJPKck5h/R7w/aqaq6r/Bb4EvBlY3U5pAawHDrXl
Q8AFAG37WcATw/V5+yxWf2KJOSRJI7KcAPlPYEuSl7frEhcDDwJ3Ae9oY7YDd7bl3W2dtv3rVVWt
fmW7S2sjsAn4BnAvsKndcXU6gwvtu9s+i80hSRqR5VwDuYfBhexvAve117oR+HPgQ0lmGVyvuKnt
chNwTqt/CNjVXucB4HYG4fM14Oqq+lm7xvF+YA9wALi9jWWJOSRJI5LBP+hPftPT0zUzM9O179q1
C9cPH15GQ5K0AiTZX1XTC23zk+iSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroY
IJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroY
IJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuiwrQJKsTnJHku8mOZDkTUnOTrI3
ycPteU0bmyTXJ5lN8p0kFw69zvY2/uEk24fqb0hyX9vn+iRp9QXnkCSNznKPQD4BfK2qXg28FjgA
7AL2VdUmYF9bB7gM2NQeO4EbYBAGwLXAG4GLgGuHAuEG4L1D+21t9cXmkCSNSHeAJDkL+G3gJoCq
+mlVPQVsA25pw24BrmjL24Bba+BuYHWS84FLgb1VdbSqngT2AlvbtldU1d1VVcCt815roTkkSSOy
nCOQjcAc8Jkk30ry6SRnAudV1aNtzGPAeW15HfDI0P4HW22p+sEF6iwxhyRpRJYTIKuAC4Ebqur1
wP8w71RSO3KoZczxvJaaI8nOJDNJZubm5l7MNiTplLOcADkIHKyqe9r6HQwC5fF2+on2fKRtPwRc
MLT/+lZbqr5+gTpLzPEcVXVjVU1X1fTU1FTXDylJWlh3gFTVY8AjSV7VShcDDwK7gWN3Um0H7mzL
u4Gr2t1YW4Cn22moPcAlSda0i+eXAHvatmeSbGl3X10177UWmkOSNCKrlrn/HwGfS3I68D3gPQxC
6fYkO4AfAu9sY78CXA7MAj9uY6mqo0k+Atzbxn24qo625fcBnwXOAL7aHgAfXWSOibB27eLbDh8e
XR+S9GLK4BLCyW96erpmZma69l0sEBYLAwNE0skiyf6qml5om59ElyR1MUAkSV0MEElSFwNEktTF
AJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTF
AJEkdTFAJEldDBBJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTF
AJEkdVl2gCQ5Lcm3kny5rW9Mck+S2SRfSHJ6q7+0rc+27RuGXuOaVn8oyaVD9a2tNptk11B9wTkk
SaNzIo5APgAcGFr/GHBdVb0SeBLY0eo7gCdb/bo2jiSbgSuB3wS2Ap9qoXQa8EngMmAz8K42dqk5
JEkjsqwASbIe+APg0209wFuBO9qQW4Ar2vK2tk7bfnEbvw24rap+UlXfB2aBi9pjtqq+V1U/BW4D
tj3PHJKkEVnuEcjfAn8G/LytnwM8VVXPtvWDwLq2vA54BKBtf7qN/0V93j6L1Zea4zmS7Ewyk2Rm
bm6u92eUJC2gO0CSvA04UlX7T2A/J1RV3VhV01U1PTU1Ne52JOmksmoZ+74ZeHuSy4GXAa8APgGs
TrKqHSGsBw618YeAC4CDSVYBZwFPDNWPGd5nofoTS8whSRqR7iOQqrqmqtZX1QYGF8G/XlXvBu4C
3tGGbQfubMu72zpt+9erqlr9ynaX1kZgE/AN4F5gU7vj6vQ2x+62z2JzSJJG5MX4HMifAx9KMsvg
esVNrX4TcE6rfwjYBVBVDwC3Aw8CXwOurqqftaOL9wN7GNzldXsbu9QckqQRyeAf9Ce/6enpmpmZ
6dp37dqF64cPH9/4pfaRpEmUZH9VTS+0zU+iS5K6GCCSpC4GiCSpiwEiSepigEiSuhggkqQuBogk
qYsBIknqYoBIkros58sUT3lLfeJckk52HoFIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC4GiCSp
iwEiSepigEiSuhggkqQuBogkqYsBIknqYoBIkroYIJKkLgaIJKmLfw9kQiz2t0UOHx5tH5L0QnkE
IknqYoBIkroYIJKkLt0BkuSCJHcleTDJA0k+0OpnJ9mb5OH2vKbVk+T6JLNJvpPkwqHX2t7GP5xk
+1D9DUnua/tcnyRLzSFJGp3lHIE8C/xpVW0GtgBXJ9kM7AL2VdUmYF9bB7gM2NQeO4EbYBAGwLXA
G4GLgGuHAuEG4L1D+21t9cXmkCSNSHeAVNWjVfXNtvxfwAFgHbANuKUNuwW4oi1vA26tgbuB1UnO
By4F9lbV0ap6EtgLbG3bXlFVd1dVAbfOe62F5pAkjcgJuQaSZAPweuAe4LyqerRtegw4ry2vAx4Z
2u1gqy1VP7hAnSXmmN/XziQzSWbm5uaO/weTJC1q2QGS5FeBLwIfrKpnhre1I4da7hxLWWqOqrqx
qqaranpqaurFbEOSTjnLCpAkL2EQHp+rqi+18uPt9BPt+UirHwIuGNp9fastVV+/QH2pOSRJI7Kc
u7AC3AQcqKqPD23aDRy7k2o7cOdQ/ap2N9YW4Ol2GmoPcEmSNe3i+SXAnrbtmSRb2lxXzXutheaQ
JI3Icr7K5M3AHwL3Jfl2q/0F8FHg9iQ7gB8C72zbvgJcDswCPwbeA1BVR5N8BLi3jftwVR1ty+8D
PgucAXy1PVhiDknSiHQHSFX9C5BFNl+8wPgCrl7ktW4Gbl6gPgO8ZoH6EwvNIUkaHT+JLknqYoBI
kroYIJKkLv49kBFb7O9+SNJK4xGIJKmLASJJ6mKASJK6GCCSpC4GiCSpiwEiSepigEiSuhggkqQu
BogkqYsBIknqYoBIkroYIJKkLgaIJKmLASJJ6mKASJK6GCCSpC7+QakJd7x/gOrw4RenD0mazyMQ
SVIXA0SS1MUAkSR1MUAkSV0MEElSFwNEktTFAJEkdTFAJEldVmyAJNma5KEks0l2jbsfSTrVrMhP
oic5Dfgk8PvAQeDeJLur6sHxdjZ+i31y3U+oSzrRVmSAABcBs1X1PYAktwHbgFM+QBZzvF+JshiD
SNIxKzVA1gGPDK0fBN44f1CSncDOtvrfSR56ga9/LvCjZXU4GiPvM+nabSX8PldCj2CfJ9JK6BHG
3+evL7ZhpQbIC1JVNwI3Hu9+SWaqavpFaOmEss8TZyX0CPZ5Iq2EHmGy+1ypF9EPARcMra9vNUnS
iKzUALkX2JRkY5LTgSuB3WPuSZJOKSvyFFZVPZvk/cAe4DTg5qp64AROcdynvcbEPk+cldAj2OeJ
tBJ6hAnuM1U17h4kSSvQSj2FJUkaMwNEktTFAJlnUr8iJcnNSY4kuX+odnaSvUkebs9rxtzjBUnu
SvJgkgeSfGBC+3xZkm8k+bfW51+3+sYk97T3/gvtBo2xSnJakm8l+fIE9/iDJPcl+XaSmVabqPe8
9bQ6yR1JvpvkQJI3TVKfSV7VfofHHs8k+eAk9TifATJk6CtSLgM2A+9Ksnm8Xf3CZ4Gt82q7gH1V
tQnY19bH6VngT6tqM7AFuLr9/iatz58Ab62q1wKvA7Ym2QJ8DLiuql4JPAnsGGOPx3wAODC0Pok9
AvxuVb1u6PMKk/aeA3wC+FpVvRp4LYPf68T0WVUPtd/h64A3AD8G/n6SevwlVeWjPYA3AXuG1q8B
rhl3X0P9bADuH1p/CDi/LZ8PPDTuHuf1eyeD7yub2D6BlwPfZPBNBj8CVi3038KYelvP4P8w3gp8
Gcik9dj6+AFw7rzaRL3nwFnA92k3Dk1qn0N9XQL86yT3WFUegcyz0FekrBtTLy/EeVX1aFt+DDhv
nM0MS7IBeD1wDxPYZzs19G3gCLAX+A/gqap6tg2ZhPf+b4E/A37e1s9h8noEKOAfk+xvXx8Ek/ee
bwTmgM+0U4KfTnImk9fnMVcCn2/Lk9qjAXKyqME/Tybinuwkvwp8EfhgVT0zvG1S+qyqn9XgVMF6
Bl/O+eoxt/QcSd4GHKmq/ePu5QV4S1VdyODU79VJfnt444S856uAC4Ebqur1wP8w71TQhPRJu671
duDv5m+blB6PMUCea6V9RcrjSc4HaM9HxtwPSV7CIDw+V1VfauWJ6/OYqnoKuIvB6aDVSY59uHbc
7/2bgbcn+QFwG4PTWJ9gsnoEoKoOtecjDM7ZX8TkvecHgYNVdU9bv4NBoExanzAI4m9W1eNtfRJ7
BAyQ+VbaV6TsBra35e0MrjmMTZIANwEHqurjQ5smrc+pJKvb8hkMrtMcYBAk72jDxtpnVV1TVeur
agOD/w6/XlXvZoJ6BEhyZpJfO7bM4Nz9/UzYe15VjwGPJHlVK13M4M8/TFSfzbv4/9NXMJk9Doz7
IsykPYDLgX9ncE78L8fdz1BfnwceBf6Xwb+mdjA4J74PeBj4J+DsMff4FgaH198Bvt0el09gn78F
fKv1eT/wV63+G8A3gFkGpw9eOu73vfX1O8CXJ7HH1s+/tccDx/43M2nveevpdcBMe9//AVgzaX0C
ZwJPAGcN1Saqx+GHX2UiSeriKSxJUhcDRJLUxQCRJHUxQCRJXQwQSVIXA0SS1MUAkSR1+T+VmtYx
RwH1zAAAAABJRU5ErkJggg==
" />
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>As you can see, there are leftover HTML elements, with &lt; and &gt; occurring quite often in the comments dataset, which may make it harder for your model to learn to generate comments containing those elements. Luckily, this won't hurt your model's accuracy much, but exploring your data like this lets you see how it may be influencing your model.</p>
<p><strong>TODO For You:</strong> Perform some further cleaning steps to remove HTML and any other cleaning you deem necessary and see how your performance changes.</p>
</div>
</div>
</div>
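<p>If you want a starting point for the cleaning step, here is one possible sketch using only the standard library. It is an assumption about what "remove HTML" might look like, not the notebook's method: a regex tag-stripper plus entity decoding, which is crude but often good enough for docstrings.</p>

```python
import re
from html import unescape

def strip_html(comment):
    """Drop HTML tags, decode entities, and normalize whitespace in a comment."""
    no_tags = re.sub(r"<[^>]+>", " ", comment)  # remove <tag ...> spans
    text = unescape(no_tags)                    # decode entities like &amp;
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace

print(strip_html("Returns the <code>id</code> of the user &amp; logs it."))
```

<p>A fuller solution might use an HTML parser instead of a regex, since regexes can mishandle nested or malformed markup; for short docstring comments, though, this kind of pass is usually sufficient.</p>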
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h2 id="Loading-the-data-using-FastAI">Loading the data using FastAI<a class="anchor-link" href="#Loading-the-data-using-FastAI"> </a></h2><p>Now that the data is processed and cleaned, you need a way to get it into the format FastAI uses. For that you will borrow some code from Rachel Thomas' awesome <a href="https://github.com/fastai/course-nlp/blob/master/7-seq2seq-translation.ipynb">course on NLP</a>, which creates a sequence-to-sequence DataBunch, the container FastAI uses to load data into memory for training and evaluation. It is sequence-to-sequence because you are mapping from the sequence of code tokens to the sequence of the code's docstring tokens.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">seq2seq_collate</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span> <span class="n">pad_idx</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">pad_first</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">backwards</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span>
<span class="s2">"Function that collects samples and adds padding. Flips token order if needed."</span>
<span class="n">samples</span> <span class="o">=</span> <span class="n">to_data</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span>
<span class="n">max_len_x</span><span class="p">,</span><span class="n">max_len_y</span> <span class="o">=</span> <span class="nb">max</span><span class="p">([</span><span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">samples</span><span class="p">]),</span><span class="nb">max</span><span class="p">([</span><span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="n">samples</span><span class="p">])</span>
<span class="n">res_x</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">),</span> <span class="n">max_len_x</span><span class="p">)</span><span class="o">.</span><span class="n">long</span><span class="p">()</span> <span class="o">+</span> <span class="n">pad_idx</span>
<span class="n">res_y</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">samples</span><span class="p">),</span> <span class="n">max_len_y</span><span class="p">)</span><span class="o">.</span><span class="n">long</span><span class="p">()</span> <span class="o">+</span> <span class="n">pad_idx</span>
<span class="k">if</span> <span class="n">backwards</span><span class="p">:</span> <span class="n">pad_first</span> <span class="o">=</span> <span class="ow">not</span> <span class="n">pad_first</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">s</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">samples</span><span class="p">):</span>
<span class="k">if</span> <span class="n">pad_first</span><span class="p">:</span>
<span class="n">res_x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="o">-</span><span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]):],</span><span class="n">res_y</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="o">-</span><span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">]):]</span> <span class="o">=</span> <span class="n">LongTensor</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">res_x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">])],</span><span class="n">res_y</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])]</span> <span class="o">=</span> <span class="n">LongTensor</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span><span class="n">LongTensor</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="k">if</span> <span class="n">backwards</span><span class="p">:</span> <span class="n">res_x</span><span class="p">,</span><span class="n">res_y</span> <span class="o">=</span> <span class="n">res_x</span><span class="o">.</span><span class="n">flip</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span><span class="n">res_y</span><span class="o">.</span><span class="n">flip</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">res_x</span><span class="p">,</span> <span class="n">res_y</span>
<span class="k">class</span> <span class="nc">Seq2SeqDataBunch</span><span class="p">(</span><span class="n">TextDataBunch</span><span class="p">):</span>
<span class="s2">"Create a `TextDataBunch` suitable for training a seq2seq model."</span>
<span class="nd">@classmethod</span>
<span class="k">def</span> <span class="nf">create</span><span class="p">(</span><span class="bp">cls</span><span class="p">,</span> <span class="n">train_ds</span><span class="p">,</span> <span class="n">valid_ds</span><span class="p">,</span> <span class="n">test_ds</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">path</span><span class="o">=</span><span class="s1">'.'</span><span class="p">,</span> <span class="n">bs</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">val_bs</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">pad_idx</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">dl_tfms</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">pad_first</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">no_check</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">backwards</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="o">**</span><span class="n">dl_kwargs</span><span class="p">):</span>
<span class="s2">"Function that transforms the `datasets` into a `DataBunch` for seq2seq training. Passes `**dl_kwargs` on to `DataLoader()`"</span>
<span class="n">datasets</span> <span class="o">=</span> <span class="bp">cls</span><span class="o">.</span><span class="n">_init_ds</span><span class="p">(</span><span class="n">train_ds</span><span class="p">,</span> <span class="n">valid_ds</span><span class="p">,</span> <span class="n">test_ds</span><span class="p">)</span>
<span class="n">val_bs</span> <span class="o">=</span> <span class="n">ifnone</span><span class="p">(</span><span class="n">val_bs</span><span class="p">,</span> <span class="n">bs</span><span class="p">)</span>
<span class="n">collate_fn</span> <span class="o">=</span> <span class="n">partial</span><span class="p">(</span><span class="n">seq2seq_collate</span><span class="p">,</span> <span class="n">pad_idx</span><span class="o">=</span><span class="n">pad_idx</span><span class="p">,</span> <span class="n">pad_first</span><span class="o">=</span><span class="n">pad_first</span><span class="p">,</span> <span class="n">backwards</span><span class="o">=</span><span class="n">backwards</span><span class="p">)</span>
<span class="n">train_sampler</span> <span class="o">=</span> <span class="n">SortishSampler</span><span class="p">(</span><span class="n">datasets</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">x</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">datasets</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">t</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">data</span><span class="p">),</span> <span class="n">bs</span><span class="o">=</span><span class="n">bs</span><span class="o">//</span><span class="mi">2</span><span class="p">)</span>
<span class="n">train_dl</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">datasets</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">bs</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">train_sampler</span><span class="p">,</span> <span class="n">drop_last</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="o">**</span><span class="n">dl_kwargs</span><span class="p">)</span>
<span class="n">dataloaders</span> <span class="o">=</span> <span class="p">[</span><span class="n">train_dl</span><span class="p">]</span>
<span class="k">for</span> <span class="n">ds</span> <span class="ow">in</span> <span class="n">datasets</span><span class="p">[</span><span class="mi">1</span><span class="p">:]:</span>
<span class="n">lengths</span> <span class="o">=</span> <span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">ds</span><span class="o">.</span><span class="n">x</span><span class="o">.</span><span class="n">items</span><span class="p">]</span>
<span class="n">sampler</span> <span class="o">=</span> <span class="n">SortSampler</span><span class="p">(</span><span class="n">ds</span><span class="o">.</span><span class="n">x</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">lengths</span><span class="o">.</span><span class="fm">__getitem__</span><span class="p">)</span>
<span class="n">dataloaders</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">DataLoader</span><span class="p">(</span><span class="n">ds</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">val_bs</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">sampler</span><span class="p">,</span> <span class="o">**</span><span class="n">dl_kwargs</span><span class="p">))</span>
<span class="k">return</span> <span class="bp">cls</span><span class="p">(</span><span class="o">*</span><span class="n">dataloaders</span><span class="p">,</span> <span class="n">path</span><span class="o">=</span><span class="n">path</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">device</span><span class="p">,</span> <span class="n">collate_fn</span><span class="o">=</span><span class="n">collate_fn</span><span class="p">,</span> <span class="n">no_check</span><span class="o">=</span><span class="n">no_check</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">Seq2SeqTextList</span><span class="p">(</span><span class="n">TextList</span><span class="p">):</span>
<span class="n">_bunch</span> <span class="o">=</span> <span class="n">Seq2SeqDataBunch</span>
<span class="n">_label_cls</span> <span class="o">=</span> <span class="n">TextList</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
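To see the core idea of the collate function in isolation, here is a plain-PyTorch illustration of what it does for one side of the batch: pad every sequence up to the longest one, either on the left (`pad_first=True`) or the right (the function name and toy token values are mine, not the fastai code above):

```python
import torch

def pad_batch(seqs, pad_idx=1, pad_first=True):
    # Pad every sequence to the length of the longest one in the batch.
    max_len = max(len(s) for s in seqs)
    out = torch.full((len(seqs), max_len), pad_idx, dtype=torch.long)
    for i, s in enumerate(seqs):
        if pad_first:
            out[i, -len(s):] = torch.tensor(s)  # padding goes on the left
        else:
            out[i, :len(s)] = torch.tensor(s)   # padding goes on the right
    return out

print(pad_batch([[5, 6, 7], [8, 9]]))
# tensor([[5, 6, 7],
#         [1, 8, 9]])
```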
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Here is where you tell FastAI to use your trained BPE models for tokenizing your data. FastAI's tokenizers will also do some additional processing of your text, such as lowercasing words, marking repeated characters, etc. You can find a full list of the processing FastAI applies <a href="https://docs.fast.ai/text.transform.html#Tokenizer">here</a>.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">method_processor</span> <span class="o">=</span> <span class="n">SPProcessor</span><span class="p">(</span>
<span class="n">sp_model</span> <span class="o">=</span> <span class="n">path</span> <span class="o">/</span> <span class="p">(</span><span class="n">method_tokenizer</span> <span class="o">+</span> <span class="s2">".model"</span><span class="p">),</span>
<span class="n">sp_vocab</span> <span class="o">=</span> <span class="n">path</span> <span class="o">/</span> <span class="p">(</span><span class="n">method_tokenizer</span> <span class="o">+</span> <span class="s2">".vocab"</span><span class="p">),</span>
<span class="n">include_eos</span> <span class="o">=</span> <span class="kc">True</span><span class="p">)</span>
<span class="n">comment_processor</span> <span class="o">=</span> <span class="n">SPProcessor</span><span class="p">(</span>
<span class="n">sp_model</span> <span class="o">=</span> <span class="n">path</span> <span class="o">/</span> <span class="p">(</span><span class="n">comment_tokenizer</span> <span class="o">+</span> <span class="s2">".model"</span><span class="p">),</span>
<span class="n">sp_vocab</span> <span class="o">=</span> <span class="n">path</span> <span class="o">/</span> <span class="p">(</span><span class="n">comment_tokenizer</span> <span class="o">+</span> <span class="s2">".vocab"</span><span class="p">),</span>
<span class="n">include_eos</span> <span class="o">=</span> <span class="kc">True</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now that you have your BPE model, you will generate the DataBunches suitable for your task, which will be the Seq2Seq DataBunch. You will also filter out sequences that are too long, so that everything fits onto a Google Colab GPU and your training doesn't take too long.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">gen_dbs</span><span class="p">(</span><span class="n">df_trn</span><span class="p">,</span> <span class="n">df_val</span><span class="p">,</span> <span class="n">df_tst</span><span class="p">,</span> <span class="n">method_processor</span><span class="p">,</span> <span class="n">comment_processor</span><span class="p">,</span> <span class="n">bs</span> <span class="o">=</span> <span class="mi">96</span><span class="p">,</span> <span class="n">max_seq</span> <span class="o">=</span> <span class="mi">128</span><span class="p">):</span>
<span class="n">is_valid</span> <span class="o">=</span> <span class="p">[</span><span class="kc">False</span><span class="p">]</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_trn</span><span class="p">)</span> <span class="o">+</span> <span class="p">[</span><span class="kc">True</span><span class="p">]</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">df_val</span><span class="p">)</span>
<span class="n">df_merged</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">df_trn</span><span class="p">,</span> <span class="n">df_val</span><span class="p">])</span>
<span class="n">df_merged</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">df_merged</span><span class="p">[</span><span class="s2">"code"</span><span class="p">]</span><span class="o">.</span><span class="n">to_list</span><span class="p">(),</span> <span class="n">df_merged</span><span class="p">[</span><span class="s2">"docstring"</span><span class="p">]</span><span class="o">.</span><span class="n">to_list</span><span class="p">(),</span> <span class="n">is_valid</span><span class="p">),</span>
<span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"code"</span><span class="p">,</span> <span class="s2">"docstring"</span><span class="p">,</span> <span class="s2">"valid"</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">db_trn</span> <span class="o">=</span> <span class="p">(</span><span class="n">Seq2SeqTextList</span>
<span class="o">.</span><span class="n">from_df</span><span class="p">(</span><span class="n">df_merged</span><span class="p">,</span> <span class="n">path</span> <span class="o">=</span> <span class="n">path</span><span class="p">,</span> <span class="n">cols</span><span class="o">=</span><span class="s1">'code'</span><span class="p">,</span> <span class="n">processor</span> <span class="o">=</span> <span class="n">method_processor</span><span class="p">)</span>
<span class="o">.</span><span class="n">split_from_df</span><span class="p">(</span><span class="n">col</span><span class="o">=</span><span class="s1">'valid'</span><span class="p">)</span>
<span class="o">.</span><span class="n">label_from_df</span><span class="p">(</span><span class="n">cols</span><span class="o">=</span><span class="s1">'docstring'</span><span class="p">,</span> <span class="n">label_cls</span><span class="o">=</span><span class="n">TextList</span><span class="p">,</span> <span class="n">processor</span> <span class="o">=</span> <span class="n">comment_processor</span><span class="p">)</span>
<span class="o">.</span><span class="n">filter_by_func</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">></span> <span class="n">max_seq</span> <span class="ow">or</span> <span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">></span> <span class="n">max_seq</span><span class="p">)</span>
<span class="o">.</span><span class="n">databunch</span><span class="p">(</span><span class="n">bs</span> <span class="o">=</span> <span class="n">bs</span><span class="p">))</span>
<span class="n">db_tst</span> <span class="o">=</span> <span class="p">(</span><span class="n">Seq2SeqTextList</span>
<span class="o">.</span><span class="n">from_df</span><span class="p">(</span><span class="n">df_tst</span><span class="p">,</span> <span class="n">path</span> <span class="o">=</span> <span class="n">path</span><span class="p">,</span> <span class="n">cols</span><span class="o">=</span><span class="s1">'code'</span><span class="p">,</span> <span class="n">processor</span> <span class="o">=</span> <span class="n">method_processor</span><span class="p">)</span>
<span class="o">.</span><span class="n">split_by_rand_pct</span><span class="p">(</span><span class="n">valid_pct</span> <span class="o">=</span> <span class="mf">0.01</span><span class="p">)</span>
<span class="o">.</span><span class="n">label_from_df</span><span class="p">(</span><span class="n">cols</span><span class="o">=</span><span class="s1">'docstring'</span><span class="p">,</span> <span class="n">label_cls</span><span class="o">=</span><span class="n">TextList</span><span class="p">,</span> <span class="n">processor</span> <span class="o">=</span> <span class="n">comment_processor</span><span class="p">)</span>
<span class="o">.</span><span class="n">filter_by_func</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">></span> <span class="n">max_seq</span> <span class="ow">or</span> <span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="o">></span> <span class="n">max_seq</span><span class="p">)</span>
<span class="o">.</span><span class="n">databunch</span><span class="p">(</span><span class="n">bs</span> <span class="o">=</span> <span class="mi">16</span><span class="p">))</span>
<span class="k">return</span> <span class="n">db_trn</span><span class="p">,</span> <span class="n">db_tst</span>
<span class="n">db_trn</span><span class="p">,</span> <span class="n">db_tst</span> <span class="o">=</span> <span class="n">gen_dbs</span><span class="p">(</span><span class="n">df_trn</span><span class="p">,</span> <span class="n">df_val</span><span class="p">,</span> <span class="n">df_tst</span><span class="p">,</span> <span class="n">method_processor</span><span class="p">,</span> <span class="n">comment_processor</span><span class="p">,</span> <span class="n">bs</span> <span class="o">=</span> <span class="mi">96</span><span class="p">,</span> <span class="n">max_seq</span> <span class="o">=</span> <span class="mi">128</span><span class="p">)</span>
<span class="n">db_trn</span><span class="o">.</span><span class="n">show_batch</span><span class="p">()</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>text</th>
<th>target</th>
</tr>
</thead>
<tbody>
<tr>
<td>▁ xx b os ▁boolean ▁res er ve ( int ▁column , ▁int ▁size ) ▁{ ▁if ▁( ( column ▁< ▁0) ▁|| ▁( ( column ▁+ ▁size ) ▁> ▁columns )) ▁throw ▁new ▁index out of bound s exception (" res er ve ▁- ▁ inc or rec t ▁column ▁/ ▁size "); ▁for ( int ▁i = column ; ▁i ▁< ▁column ▁+ ▁size ; ▁i ++) ▁{</td>
<td>▁ xx b os ▁ xx ma j ▁re s er ve s ▁a ▁< code > ce ll < ▁/ ▁ xx up ▁code > ▁in ▁the ▁< code > ro w < ▁/ ▁ xx up ▁code > . ▁ xx e os</td>
</tr>
<tr>
<td>▁ xx b os ▁@ help ( ▁ help ▁= ▁" get ▁all ▁the ▁ virtual network function descriptor ▁ xx m a j ▁dependency ▁of ▁a ▁network service descriptor ▁with ▁specific ▁id " ▁) ▁public ▁list < v n f d ep end en cy > ▁get v n f dependencies ( final ▁ xx m a j ▁string ▁id ns d ) ▁throws ▁s d k exception ▁{</td>
<td>▁ xx b os ▁ xx ma j ▁return ▁a ▁ xx ma j ▁list ▁with ▁all ▁the ▁v n f de p end en c ies ▁that ▁are ▁contain ed ▁in ▁a ▁specific ▁network service descriptor . ▁ xx e os</td>
</tr>
<tr>
<td>▁ xx b os ▁@ override ▁public ▁void ▁delete as set and attachment s ( final ▁ xx m a j ▁string ▁as set id ) ▁throws ▁ io exception , ▁request failure exception ▁{ ▁ xx m a j ▁as set ▁as s ▁= ▁get un v er ified as set ( as set id ); ▁list < attachment > ▁attachment s ▁= ▁as s . get attachment s</td>
<td>▁ xx b os ▁ xx ma j ▁this ▁will ▁delete ▁an ▁asset ▁and ▁all ▁its ▁attachments ▁ xx e os</td>
</tr>
<tr>
<td>▁ xx b os ▁public ▁list < character book mark fold er s response > ▁get character s character id book mark s fold er s ( integer ▁character id , ▁ xx m a j ▁string ▁data source , ▁ xx m a j ▁string ▁if n one match , ▁ xx m a j ▁ integer ▁page , ▁ xx m a j ▁string ▁token ) ▁throws ▁api</td>
<td>▁ xx b os ▁ xx ma j ▁list ▁ bookmark ▁folders ▁a ▁list ▁of ▁your ▁character & ' s ▁personal ▁ bookmark ▁folders ▁--- ▁ xx ma j ▁this ▁route ▁is ▁cached ▁for ▁up ▁to ▁36 00 ▁seconds ▁ xx up ▁ s so ▁ xx ma j ▁scope : ▁ esi - bookmark s . read _ character _ bookmark s . v 1 ▁ xx e os</td>
</tr>
<tr>
<td>▁ xx b os ▁@ de pre c ated ▁protected ▁final ▁map < db id , ▁k n n list > ▁batch n n ( n ▁node , ▁db id s ▁ids , ▁int ▁k max ) ▁{ ▁map < db id , ▁k n n list > ▁res ▁= ▁new ▁hash map <>( id s . size ()); ▁for ( db id iter ▁iter ▁= ▁ids . iter ();</td>
<td>▁ xx b os ▁ xx ma j ▁perform s ▁a ▁batch ▁k - ne a rest ▁neighbor ▁query ▁for ▁a ▁list ▁of ▁query ▁objects . ▁ xx e os</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</div>
</div>
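The batch above shows fastai's special tokens, e.g. <code>xxbos</code> (beginning of sequence) and <code>xxmaj</code> (the next word was capitalized). As a rough illustration of how such markers can be undone when reading model output, here is a tiny helper for just the <code>xxmaj</code> rule (the function name is mine, and fastai's real de-tokenization handles many more rules):

```python
def undo_fastai_caps(tokens):
    # Rough inverse of fastai's `xxmaj` marker: capitalize the token that
    # follows it and drop the marker itself (illustrative only).
    out = []
    cap = False
    for t in tokens:
        if t == "xxmaj":
            cap = True
        else:
            out.append(t.capitalize() if cap else t)
            cap = False
    return out

print(undo_fastai_caps(["xxmaj", "this", "will", "delete", "an", "asset"]))
# ['This', 'will', 'delete', 'an', 'asset']
```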
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">shift_tfm</span><span class="p">(</span><span class="n">b</span><span class="p">):</span>
<span class="n">x</span><span class="p">,</span><span class="n">y</span> <span class="o">=</span> <span class="n">b</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">pad</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="n">value</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="p">[</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">[:,:</span><span class="o">-</span><span class="mi">1</span><span class="p">]],</span> <span class="n">y</span><span class="p">[:,</span><span class="mi">1</span><span class="p">:]</span>
<span class="c1"># Add the necessary shift transformation for training your Transformer model</span>
<span class="n">db_trn</span><span class="o">.</span><span class="n">add_tfm</span><span class="p">(</span><span class="n">shift_tfm</span><span class="p">)</span>
<span class="n">db_tst</span><span class="o">.</span><span class="n">add_tfm</span><span class="p">(</span><span class="n">shift_tfm</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
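To see concretely what this shift transform does, here is the same function run on a toy batch (the token values are made up; index 1 plays the pad/bos role, as in the notebook): the decoder is fed the target shifted right by one token, so at each position it predicts the next token of the original sequence, which is teacher forcing.

```python
import torch
import torch.nn.functional as F

def shift_tfm(b):
    # Prepend token 1 to the target, feed the shifted target to the decoder,
    # and predict the unshifted target.
    x, y = b
    y = F.pad(y, (1, 0), value=1)
    return [x, y[:, :-1]], y[:, 1:]

x = torch.tensor([[5, 6, 7]])
y = torch.tensor([[2, 3, 4]])
(enc_in, dec_in), target = shift_tfm((x, y))
print(dec_in)   # tensor([[1, 2, 3]])  decoder input starts with token 1
print(target)   # tensor([[2, 3, 4]])  target is the original sequence
```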
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Defining-your-model">Defining your model<a class="anchor-link" href="#Defining-your-model"> </a></h1><p>In this example, you will be using the Transformer architecture that was developed by <a href="https://arxiv.org/abs/1706.03762">Vaswani et al.</a>. If you want a better understanding of this model, I highly suggest <a href="http://nlp.seas.harvard.edu/2018/04/03/attention.html#applications-of-attention-in-our-model">The Annotated Transformer</a> blog post and the <a href="https://www.youtube.com/playlist?list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9">NLP course</a> by Rachel Thomas, from which this model code is copied.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">class</span> <span class="nc">PositionalEncoding</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="s2">"Encode the position with a sinusoid."</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">d</span><span class="p">):</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">register_buffer</span><span class="p">(</span><span class="s1">'freq'</span><span class="p">,</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="mi">10000</span> <span class="o">**</span> <span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mf">0.</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="mf">2.</span><span class="p">)</span><span class="o">/</span><span class="n">d</span><span class="p">)))</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pos</span><span class="p">):</span>
<span class="n">inp</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">ger</span><span class="p">(</span><span class="n">pos</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">freq</span><span class="p">)</span>
<span class="n">enc</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">cat</span><span class="p">([</span><span class="n">inp</span><span class="o">.</span><span class="n">sin</span><span class="p">(),</span> <span class="n">inp</span><span class="o">.</span><span class="n">cos</span><span class="p">()],</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">enc</span>
<span class="k">class</span> <span class="nc">TransformerEmbedding</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="s2">"Embedding + positional encoding + dropout"</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">vocab_sz</span><span class="p">,</span> <span class="n">emb_sz</span><span class="p">,</span> <span class="n">inp_p</span><span class="o">=</span><span class="mf">0.</span><span class="p">):</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">emb_sz</span> <span class="o">=</span> <span class="n">emb_sz</span>
<span class="bp">self</span><span class="o">.</span><span class="n">embed</span> <span class="o">=</span> <span class="n">embedding</span><span class="p">(</span><span class="n">vocab_sz</span><span class="p">,</span> <span class="n">emb_sz</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">pos_enc</span> <span class="o">=</span> <span class="n">PositionalEncoding</span><span class="p">(</span><span class="n">emb_sz</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">drop</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">inp_p</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inp</span><span class="p">):</span>
<span class="n">pos</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">inp</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">device</span><span class="o">=</span><span class="n">inp</span><span class="o">.</span><span class="n">device</span><span class="p">)</span><span class="o">.</span><span class="n">float</span><span class="p">()</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">embed</span><span class="p">(</span><span class="n">inp</span><span class="p">)</span> <span class="o">*</span> <span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">emb_sz</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">pos_enc</span><span class="p">(</span><span class="n">pos</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">feed_forward</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_ff</span><span class="p">,</span> <span class="n">ff_p</span><span class="o">=</span><span class="mf">0.</span><span class="p">,</span> <span class="n">double_drop</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
<span class="n">layers</span> <span class="o">=</span> <span class="p">[</span><span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_ff</span><span class="p">),</span> <span class="n">nn</span><span class="o">.</span><span class="n">ReLU</span><span class="p">()]</span>
<span class="k">if</span> <span class="n">double_drop</span><span class="p">:</span> <span class="n">layers</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">ff_p</span><span class="p">))</span>
<span class="k">return</span> <span class="n">SequentialEx</span><span class="p">(</span><span class="o">*</span><span class="n">layers</span><span class="p">,</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_ff</span><span class="p">,</span> <span class="n">d_model</span><span class="p">),</span> <span class="n">nn</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">ff_p</span><span class="p">),</span> <span class="n">MergeLayer</span><span class="p">(),</span> <span class="n">nn</span><span class="o">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">d_model</span><span class="p">))</span>
<span class="k">class</span> <span class="nc">MultiHeadAttention</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_heads</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="n">d_head</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
<span class="n">d_head</span> <span class="o">=</span> <span class="n">ifnone</span><span class="p">(</span><span class="n">d_head</span><span class="p">,</span> <span class="n">d_model</span><span class="o">//</span><span class="n">n_heads</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">n_heads</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">d_head</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">scale</span> <span class="o">=</span> <span class="n">n_heads</span><span class="p">,</span><span class="n">d_head</span><span class="p">,</span><span class="n">scale</span>
<span class="bp">self</span><span class="o">.</span><span class="n">q_wgt</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">k_wgt</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">v_wgt</span> <span class="o">=</span> <span class="p">[</span><span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span>
<span class="n">d_model</span><span class="p">,</span> <span class="n">n_heads</span> <span class="o">*</span> <span class="n">d_head</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">bias</span><span class="p">)</span> <span class="k">for</span> <span class="n">o</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">)]</span>
<span class="bp">self</span><span class="o">.</span><span class="n">out</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_heads</span> <span class="o">*</span> <span class="n">d_head</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">bias</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">drop_att</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">drop_res</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">p</span><span class="p">),</span><span class="n">nn</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">p</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">ln</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">LayerNorm</span><span class="p">(</span><span class="n">d_model</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">q</span><span class="p">,</span> <span class="n">kv</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">ln</span><span class="p">(</span><span class="n">q</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">drop_res</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_apply_attention</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">kv</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">))))</span>
<span class="k">def</span> <span class="nf">create_attn_mat</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">layer</span><span class="p">,</span> <span class="n">bs</span><span class="p">):</span>
<span class="k">return</span> <span class="n">layer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">bs</span><span class="p">,</span> <span class="n">x</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_heads</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">d_head</span>
<span class="p">)</span><span class="o">.</span><span class="n">permute</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">_apply_attention</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">q</span><span class="p">,</span> <span class="n">kv</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
<span class="n">bs</span><span class="p">,</span><span class="n">seq_len</span> <span class="o">=</span> <span class="n">q</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span><span class="n">q</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="n">wq</span><span class="p">,</span><span class="n">wk</span><span class="p">,</span><span class="n">wv</span> <span class="o">=</span> <span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">o</span><span class="p">:</span> <span class="bp">self</span><span class="o">.</span><span class="n">create_attn_mat</span><span class="p">(</span><span class="o">*</span><span class="n">o</span><span class="p">,</span><span class="n">bs</span><span class="p">),</span>
<span class="nb">zip</span><span class="p">((</span><span class="n">q</span><span class="p">,</span><span class="n">kv</span><span class="p">,</span><span class="n">kv</span><span class="p">),(</span><span class="bp">self</span><span class="o">.</span><span class="n">q_wgt</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">k_wgt</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">v_wgt</span><span class="p">)))</span>
<span class="n">attn_score</span> <span class="o">=</span> <span class="n">wq</span> <span class="o">@</span> <span class="n">wk</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">)</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">scale</span><span class="p">:</span> <span class="n">attn_score</span> <span class="o">/=</span> <span class="n">math</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">d_head</span><span class="p">)</span>
<span class="k">if</span> <span class="n">mask</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">attn_score</span> <span class="o">=</span> <span class="n">attn_score</span><span class="o">.</span><span class="n">float</span><span class="p">()</span><span class="o">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="o">-</span><span class="nb">float</span><span class="p">(</span><span class="s1">'inf'</span><span class="p">))</span><span class="o">.</span><span class="n">type_as</span><span class="p">(</span><span class="n">attn_score</span><span class="p">)</span>
<span class="n">attn_prob</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">drop_att</span><span class="p">(</span><span class="n">F</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">attn_score</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">))</span>
<span class="n">attn_vec</span> <span class="o">=</span> <span class="n">attn_prob</span> <span class="o">@</span> <span class="n">wv</span>
<span class="k">return</span> <span class="n">attn_vec</span><span class="o">.</span><span class="n">permute</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span><span class="o">.</span><span class="n">contiguous</span><span class="p">()</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">bs</span><span class="p">,</span> <span class="n">seq_len</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_output_mask</span><span class="p">(</span><span class="n">inp</span><span class="p">,</span> <span class="n">pad_idx</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
<span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">triu</span><span class="p">(</span><span class="n">inp</span><span class="o">.</span><span class="n">new_ones</span><span class="p">(</span><span class="n">inp</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span><span class="n">inp</span><span class="o">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)),</span> <span class="n">diagonal</span><span class="o">=</span><span class="mi">1</span><span class="p">)[</span><span class="kc">None</span><span class="p">,</span><span class="kc">None</span><span class="p">]</span><span class="o">.</span><span class="n">byte</span><span class="p">()</span>
<span class="k">class</span> <span class="nc">EncoderBlock</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="s2">"Encoder block of a Transformer model."</span>
<span class="c1">#Can't use Sequential directly cause more than one input...</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_heads</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="n">d_head</span><span class="p">,</span> <span class="n">d_inner</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">double_drop</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">mha</span> <span class="o">=</span> <span class="n">MultiHeadAttention</span><span class="p">(</span><span class="n">n_heads</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="n">d_head</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">bias</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">scale</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">ff</span> <span class="o">=</span> <span class="n">feed_forward</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_inner</span><span class="p">,</span> <span class="n">ff_p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">double_drop</span><span class="o">=</span><span class="n">double_drop</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">ff</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">mha</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">))</span>
<span class="k">class</span> <span class="nc">DecoderBlock</span><span class="p">(</span><span class="n">nn</span><span class="o">.</span><span class="n">Module</span><span class="p">):</span>
<span class="s2">"Decoder block of a Transformer model."</span>
<span class="c1">#Can't use Sequential directly cause more than one input...</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_heads</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="n">d_head</span><span class="p">,</span> <span class="n">d_inner</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">double_drop</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">mha1</span> <span class="o">=</span> <span class="n">MultiHeadAttention</span><span class="p">(</span><span class="n">n_heads</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="n">d_head</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">bias</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">scale</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">mha2</span> <span class="o">=</span> <span class="n">MultiHeadAttention</span><span class="p">(</span><span class="n">n_heads</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="n">d_head</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="n">bias</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">scale</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">ff</span> <span class="o">=</span> <span class="n">feed_forward</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">d_inner</span><span class="p">,</span> <span class="n">ff_p</span><span class="o">=</span><span class="n">p</span><span class="p">,</span> <span class="n">double_drop</span><span class="o">=</span><span class="n">double_drop</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">enc</span><span class="p">,</span> <span class="n">mask_out</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">ff</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">mha2</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">mha1</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">mask_out</span><span class="p">),</span> <span class="n">enc</span><span class="p">))</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">class</span> <span class="nc">Transformer</span><span class="p">(</span><span class="n">Module</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inp_vsz</span><span class="p">,</span> <span class="n">out_vsz</span><span class="p">,</span> <span class="n">n_layers</span><span class="o">=</span><span class="mi">6</span><span class="p">,</span> <span class="n">n_heads</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span> <span class="n">d_model</span><span class="o">=</span><span class="mi">256</span><span class="p">,</span> <span class="n">d_head</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
<span class="n">d_inner</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.1</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">double_drop</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">pad_idx</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">enc_emb</span> <span class="o">=</span> <span class="n">TransformerEmbedding</span><span class="p">(</span><span class="n">inp_vsz</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">dec_emb</span> <span class="o">=</span> <span class="n">TransformerEmbedding</span><span class="p">(</span><span class="n">out_vsz</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="mf">0.</span><span class="p">)</span>
<span class="n">args</span> <span class="o">=</span> <span class="p">(</span><span class="n">n_heads</span><span class="p">,</span> <span class="n">d_model</span><span class="p">,</span> <span class="n">d_head</span><span class="p">,</span> <span class="n">d_inner</span><span class="p">,</span> <span class="n">p</span><span class="p">,</span> <span class="n">bias</span><span class="p">,</span> <span class="n">scale</span><span class="p">,</span> <span class="n">double_drop</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">encoder</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span><span class="n">EncoderBlock</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_layers</span><span class="p">)])</span>
<span class="bp">self</span><span class="o">.</span><span class="n">decoder</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">ModuleList</span><span class="p">([</span><span class="n">DecoderBlock</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_layers</span><span class="p">)])</span>
<span class="bp">self</span><span class="o">.</span><span class="n">out</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">d_model</span><span class="p">,</span> <span class="n">out_vsz</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">out</span><span class="o">.</span><span class="n">weight</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">dec_emb</span><span class="o">.</span><span class="n">embed</span><span class="o">.</span><span class="n">weight</span>
<span class="bp">self</span><span class="o">.</span><span class="n">pad_idx</span> <span class="o">=</span> <span class="n">pad_idx</span>
<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inp</span><span class="p">,</span> <span class="n">out</span><span class="p">):</span>
<span class="n">mask_out</span> <span class="o">=</span> <span class="n">get_output_mask</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">pad_idx</span><span class="p">)</span>
<span class="n">enc</span><span class="p">,</span><span class="n">out</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">enc_emb</span><span class="p">(</span><span class="n">inp</span><span class="p">),</span><span class="bp">self</span><span class="o">.</span><span class="n">dec_emb</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>
<span class="n">enc</span> <span class="o">=</span> <span class="n">compose</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">encoder</span><span class="p">)(</span><span class="n">enc</span><span class="p">)</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">compose</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">decoder</span><span class="p">)(</span><span class="n">out</span><span class="p">,</span> <span class="n">enc</span><span class="p">,</span> <span class="n">mask_out</span><span class="p">)</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">out</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>To evaluate your model you will use the BLEU score, a standard metric that measures how closely your model's generated comment matches the real comment of a method. (This code is also adapted from Rachel Thomas's NLP tutorial.)</p>
</div>
</div>
</div>
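<p>In brief, BLEU combines clipped n-gram precisions (how many of the prediction's n-grams also appear in the reference, counts capped at the reference's counts) with a brevity penalty that punishes predictions shorter than the reference. As a rough single-reference illustration (a simplified sketch, not the tutorial's implementation — the <code>bleu</code> helper below is hypothetical):</p>

```python
import math
from collections import Counter

def bleu(pred, targ, max_n=4):
    """Sentence-level BLEU sketch: geometric mean of clipped n-gram
    precisions times a brevity penalty. pred/targ are token lists."""
    precisions = []
    for n in range(1, max_n + 1):
        pred_grams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
        targ_grams = Counter(tuple(targ[i:i + n]) for i in range(len(targ) - n + 1))
        overlap = sum((pred_grams & targ_grams).values())  # clipped counts
        total = max(sum(pred_grams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:  # no smoothing in this sketch
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(targ) / len(pred)))  # brevity penalty
    return bp * math.exp(log_avg)

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the mat".split()))  # → 1.0
```

<p>Real implementations (e.g. corpus-level BLEU) aggregate counts over the whole test set and apply smoothing so a single missing n-gram order does not zero out the score.</p>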
<div class="cell border-box-sizing code_cell rendered">
<details class="description">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">class</span> <span class="nc">NGram</span><span class="p">():</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">ngram</span><span class="p">,</span> <span class="n">max_n</span><span class="o">=</span><span class="mi">5000</span><span class="p">):</span> <span class="bp">self</span><span class="o">.</span><span class="n">ngram</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">max_n</span> <span class="o">=</span> <span class="n">ngram</span><span class="p">,</span><span class="n">max_n</span>
<span class="k">def</span> <span class="fm">__eq__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other</span><span class="p">):</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">ngram</span><span class="p">)</span> <span class="o">!=</span> <span class="nb">len</span><span class="p">(</span><span class="n">other</span><span class="o">.</span><span class="n">ngram</span><span class="p">):</span> <span class="k">return</span> <span class="kc">False</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">all</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">ngram</span><span class="p">)</span> <span class="o">==</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">other</span><span class="o">.</span><span class="n">ngram</span><span class="p">))</span>
<span class="k">def</span> <span class="fm">__hash__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="k">return</span> <span class="nb">int</span><span class="p">(</span><span class="nb">sum</span><span class="p">([</span><span class="n">o</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">max_n</span><span class="o">**</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">o</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">ngram</span><span class="p">)]))</span>
<span class="k">def</span> <span class="nf">get_grams</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">max_n</span><span class="o">=</span><span class="mi">5000</span><span class="p">):</span>
<span class="k">return</span> <span class="n">x</span> <span class="k">if</span> <span class="n">n</span><span class="o">==</span><span class="mi">1</span> <span class="k">else</span> <span class="p">[</span><span class="n">NGram</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="n">n</span><span class="p">],</span> <span class="n">max_n</span><span class="o">=</span><span class="n">max_n</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">-</span><span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">)]</span>
<span class="k">def</span> <span class="nf">get_correct_ngrams</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span> <span class="n">targ</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">max_n</span><span class="o">=</span><span class="mi">5000</span><span class="p">):</span>
<span class="n">pred_grams</span><span class="p">,</span><span class="n">targ_grams</span> <span class="o">=</span> <span class="n">get_grams</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">max_n</span><span class="o">=</span><span class="n">max_n</span><span class="p">),</span><span class="n">get_grams</span><span class="p">(</span><span class="n">targ</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">max_n</span><span class="o">=</span><span class="n">max_n</span><span class="p">)</span>
<span class="n">pred_cnt</span><span class="p">,</span><span class="n">targ_cnt</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">(</span><span class="n">pred_grams</span><span class="p">),</span><span class="n">Counter</span><span class="p">(</span><span class="n">targ_grams</span><span class="p">)</span>
<span class="k">return</span> <span class="nb">sum</span><span class="p">([</span><span class="nb">min</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">targ_cnt</span><span class="p">[</span><span class="n">g</span><span class="p">])</span> <span class="k">for</span> <span class="n">g</span><span class="p">,</span><span class="n">c</span> <span class="ow">in</span> <span class="n">pred_cnt</span><span class="o">.</span><span class="n">items</span><span class="p">()]),</span><span class="nb">len</span><span class="p">(</span><span class="n">pred_grams</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">CorpusBLEU</span><span class="p">(</span><span class="n">Callback</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">vocab_sz</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">vocab_sz</span> <span class="o">=</span> <span class="n">vocab_sz</span>
<span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s1">'bleu'</span>
<span class="k">def</span> <span class="nf">on_epoch_begin</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">pred_len</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">targ_len</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">corrects</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">counts</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="mi">4</span><span class="p">,[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="mi">4</span>
<span class="k">def</span> <span class="nf">on_batch_end</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">last_output</span><span class="p">,</span> <span class="n">last_target</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="n">last_output</span> <span class="o">=</span> <span class="n">last_output</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="k">for</span> <span class="n">pred</span><span class="p">,</span><span class="n">targ</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">last_output</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">numpy</span><span class="p">(),</span><span class="n">last_target</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">numpy</span><span class="p">()):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">pred_len</span> <span class="o">+=</span> <span class="nb">len</span><span class="p">(</span><span class="n">pred</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">targ_len</span> <span class="o">+=</span> <span class="nb">len</span><span class="p">(</span><span class="n">targ</span><span class="p">)</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">):</span>
<span class="n">c</span><span class="p">,</span><span class="n">t</span> <span class="o">=</span> <span class="n">get_correct_ngrams</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span> <span class="n">targ</span><span class="p">,</span> <span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">,</span> <span class="n">max_n</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">vocab_sz</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">corrects</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">c</span>
<span class="bp">self</span><span class="o">.</span><span class="n">counts</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">t</span>
<span class="k">def</span> <span class="nf">on_epoch_end</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">last_metrics</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="n">precs</span> <span class="o">=</span> <span class="p">[</span><span class="n">c</span><span class="o">/</span><span class="n">t</span> <span class="k">for</span> <span class="n">c</span><span class="p">,</span><span class="n">t</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">corrects</span><span class="p">,</span><span class="bp">self</span><span class="o">.</span><span class="n">counts</span><span class="p">)]</span>
<span class="n">len_penalty</span> <span class="o">=</span> <span class="n">exp</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">targ_len</span><span class="o">/</span><span class="bp">self</span><span class="o">.</span><span class="n">pred_len</span><span class="p">)</span> <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">pred_len</span> <span class="o"><</span> <span class="bp">self</span><span class="o">.</span><span class="n">targ_len</span> <span class="k">else</span> <span class="mi">1</span>
<span class="n">bleu</span> <span class="o">=</span> <span class="n">len_penalty</span> <span class="o">*</span> <span class="p">((</span><span class="n">precs</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">precs</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="n">precs</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span><span class="o">*</span><span class="n">precs</span><span class="p">[</span><span class="mi">3</span><span class="p">])</span> <span class="o">**</span> <span class="mf">0.25</span><span class="p">)</span>
<span class="k">return</span> <span class="n">add_metrics</span><span class="p">(</span><span class="n">last_metrics</span><span class="p">,</span> <span class="n">bleu</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
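The clipped n-gram counting above is the heart of corpus BLEU. As a standalone sanity check, independent of the fastai `Callback` machinery, here is a minimal sketch of the same modified n-gram precision using plain tuples and `collections.Counter` (the function name below is illustrative, not from the notebook):

```python
from collections import Counter

def ngram_precision(pred, targ, n):
    """Clipped n-gram precision: matched n-grams / predicted n-grams.

    Each predicted n-gram counts as correct at most as many times as it
    appears in the target (the "clipping" that BLEU uses to stop a model
    from being rewarded for repeating one good n-gram).
    """
    pred_grams = [tuple(pred[i:i + n]) for i in range(len(pred) - n + 1)]
    targ_grams = [tuple(targ[i:i + n]) for i in range(len(targ) - n + 1)]
    pred_cnt, targ_cnt = Counter(pred_grams), Counter(targ_grams)
    correct = sum(min(c, targ_cnt[g]) for g, c in pred_cnt.items())
    return correct, len(pred_grams)

pred = [1, 2, 3, 4]
targ = [1, 2, 3, 5]
# Bigrams (1,2) and (2,3) match; (3,4) does not.
correct, total = ngram_precision(pred, targ, 2)
print(correct, total)  # → 2 3
```

`CorpusBLEU` above does the same counting per batch for n = 1..4, then combines the four precisions with a geometric mean and a brevity penalty at the end of the epoch.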
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">n_x_vocab</span><span class="p">,</span> <span class="n">n_y_vocab</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">db_trn</span><span class="o">.</span><span class="n">train_ds</span><span class="o">.</span><span class="n">x</span><span class="o">.</span><span class="n">vocab</span><span class="o">.</span><span class="n">itos</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">db_trn</span><span class="o">.</span><span class="n">train_ds</span><span class="o">.</span><span class="n">y</span><span class="o">.</span><span class="n">vocab</span><span class="o">.</span><span class="n">itos</span><span class="p">)</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">Transformer</span><span class="p">(</span><span class="n">n_x_vocab</span><span class="p">,</span> <span class="n">n_y_vocab</span><span class="p">,</span> <span class="n">d_model</span><span class="o">=</span><span class="mi">256</span><span class="p">)</span>
<span class="n">learn</span> <span class="o">=</span> <span class="n">Learner</span><span class="p">(</span><span class="n">db_trn</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="n">accuracy</span><span class="p">,</span> <span class="n">CorpusBLEU</span><span class="p">(</span><span class="n">n_y_vocab</span><span class="p">)],</span> <span class="n">loss_func</span> <span class="o">=</span> <span class="n">CrossEntropyFlat</span><span class="p">())</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Now you are going to use the awesome Learning Rate finder provided by FastAI, which is based on Leslie N. Smith's paper <a href="https://arxiv.org/abs/1506.01186">"Cyclical Learning Rates for Training Neural Networks"</a>. It runs a short mock training pass while sweeping the learning rate and recording the loss, so you don't have to do a lengthy hyperparameter search to find a good learning rate.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">learn</span><span class="o">.</span><span class="n">lr_find</span><span class="p">()</span>
<span class="n">learn</span><span class="o">.</span><span class="n">recorder</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">suggestion</span> <span class="o">=</span> <span class="kc">True</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
<div>
<style>
/* Turns off some styling */
progress {
/* gets rid of default border in Firefox and Opera. */
border: none;
/* Needs to be in here for Safari polyfill so background images work as expected. */
background-size: auto;
}
.progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {
background: #F44336;
}
</style>
<progress value='0' class='' max='1', style='width:300px; height:20px; vertical-align: middle;'></progress>
0.00% [0/1 00:00<00:00]
</div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: left;">
<th>epoch</th>
<th>train_loss</th>
<th>valid_loss</th>
<th>accuracy</th>
<th>bleu</th>
<th>time</th>
</tr>
</thead>
<tbody>
</tbody>
</table><p>
<div>
<style>
/* Turns off some styling */
progress {
/* gets rid of default border in Firefox and Opera. */
border: none;
/* Needs to be in here for Safari polyfill so background images work as expected. */
background-size: auto;
}
.progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {
background: #F44336;
}
</style>
<progress value='92' class='' max='423', style='width:300px; height:20px; vertical-align: middle;'></progress>
21.75% [92/423 01:21<04:53 19.5654]
</div>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.
Min numerical gradient: 3.02E-03
Min loss divided by 10: 1.74E-02
</pre>
</div>
</div>
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAYUAAAEGCAYAAACKB4k+AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0
dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3deXxU1fn48c8z2XdCEkIgyA4CsggR
BRVExbpVaqUqdQFti1ttq239+v22P9tqrXaxWqUuuNC6VNsiWrVudRcVaNg3RXYStiyQfc/z+2Nu
YoyBhDB37kzyvF+veWXm3jv3Pocheeacc885oqoYY4wxAD6vAzDGGBM6LCkYY4xpZknBGGNMM0sK
xhhjmllSMMYY0yzS6wCOVHp6ug4YMMDrMIwxJqwsX768UFUz2jsu7JLCgAEDyM3N9ToMY4wJKyKy
oyPHWfORMcaYZpYUjDHGNLOkYIwxppklBWOMMc0sKRhjjGlmScEYY0wzSwrGGGOahd04hc7atK+M
f6/ZA4ACqIIIsVE+YiMjiI2KIC7aR1xUJHHREcRHRxAbGYGiNM0uHuET4qMjSIyNJDEmkrioCETE
szIZY0ygdauk8Ke3Pw/oOUUgITqShJgI52ckSU7CSIqNIik2kpS4KFLiokiOiyI5NpLE2EiSYqJI
jI0k2dkfGWEVNmNMaOg2SeG80Vmcd1fWl77Zqyo19Y1U1zVQXddIZW09VXUNVNU2UFnbQE19I4L/
j78I1DcoFbX1lNc0UFFT7zyc57X1lNfUU15dz86KSsqq6ymtqqOspr7d2JJiI0mNjyYjKYbeKbH0
To4lKyWWrJQ4+qbG0bdHHOmJ0VYrMca4rtskhbb+oIoIsVH+piO3NDQqZdV1lFTVUVb9ReIoq6mj
tKqeA5W1HKys40BlLQVlNWzcXcrbG/dRXdf4pfPERPrITo2jX894jukZT7/UeAamJzA0M5Hs1Hgi
fJYwjDFHr9skBa9E+IQe8dH0iI/u8HtUlZKqOnYfrCb/YBX5ByrJP1jFruIqdh2oZPmOA5RVf1ED
iY70MSg9gRFZyYzqk8zIrGRG9UkhJT7KjSIZY7owSwohSOSLRDKyT3KbxxysrGVLQQVb9pezuaCc
TfvK+GRLES+szG8+Jjs1jtF9UzjOeYzNTjmi5GSM6X4sKYSpHvHRTOgfzYT+qV/aXlhew4bdpazf
Xcq63SWsyy/htXV7m/cPTE9gXL8eHH9MD04alMbQXonWV2GMaWZJoYtJT4xhyrAMpgz7Ytr0kqo6
1ueXsCrvICt3HmTx5sLmGkWvpBhOGZLOyUPSOf3YXqQmWE3CmO7MkkI3kBIXxeQh6Uwekg74+yzy
DlTx8ZZCFm8u4r1NBSxamU+ETzhpUE/OPi6Lr43KpFdSrMeRG2OCTbRpZJYbJxe5Cfgu/vFia4Gr
VLW6xf45wO+Bpobwear62OHOmZOTo7bITmA1Nirrd5fy+vo9vLZ2L1sLKxCBcf16MH1kJtNHZDLE
mpmMCWsislxVc9o9zq2kICJ9gcXASFWtEpF/AK+q6l9aHDMHyFHV73f0vJYU3KWqbNpXzuvr9vLW
xn2szS8BYEBaPN84vi8Xjc+mX894j6M0xhypjiYFt5uPIoE4EakD4oHdLl/PHCURYXjvJIb3TuKH
Zw5lT0kVb2/cz6tr9/Cntz/nvrc+Z9KgNGZOyObc0VnERbs3xsMYE3xuNx/9ELgTqALeVNXLWu2f
A9wFFACbgJtUdVcb55kLzAU45phjJuzY0aGlRk2A5R2o5IUV+SxckceOokqSYyP55vhsZk08huG9
k7wOzxhzGKHQfJQKPA9cAhwE/gksVNWnWxyTBpSrao2IXANcoqqnH+681nzkPVVlydZinvvvTl5b
u5fahkbGH9OD2ZMHcO7oLKJsLidjQk4oJIVvAWer6nec11cCJ6nq9Yc4PgIoVtWUw53XkkJoKa6o
ZdGKPJ5ZupNthRVkJsdwxUn9mTXxGNISY7wOzxjj6GhScPMr3U7gJBGJF/9tK2cAG1seICJZLV5e
0Hq/CX09E6L57qmDePvmqTwxJ4dhmUn84c1NTLr7HW58diWLPy+ksdG9JkpjTGC51tGsqktFZCGw
AqgHVgLzReR2IFdVXwJ+ICIXOPuLgTluxWPc5fMJpx+byenHZvL5vjKeWbqTF1bm8/Lq3WSnxnFJ
Tj8umdjPxj4YE+Jc7Wh2gzUfhY/qugbeWL+Xf+Tu4qPNRURFCOeOzuLKSQMYf0wPG/dgTBB53qfg
FksK4WlrQTlPLdnBwtw8ymrqGd03he+eOtA6po0JEksKJiSV19Tzwoo8Fny0na2FFWSlxDJn8gBm
nXgMybE21bcxbrGkYEJaY6Py7mf7efTDrSzZWkxslI+zR/XmwvHZnDIk3RYNMibAQmVEszFt8vmE
M0ZkcsaITNbll/Dssp28vHo3L67aTa+kGC6akM3VJw8kI8luazUmmKymYEJGTX0D72zcz/Mr8nnn
031ER/r49sT+XDN1EJnJdteSMUfDmo9MWNtaUM6f393Ci6v8U3pfktOPa6YOIjvVJuMzpjMsKZgu
YWdRJQ+9v5mFy/NQhQuP78t1pw1mUEai16EZE1YsKZguZffBKuZ/sJVnl+2krqGRc0dnccO0IYzI
ansNa2PMl1lSMF1SQVkNjy3eytOf7KCitoEzR/Ti+mlDGH9MavtvNqYbs6RgurSSyjr+8vF2Fny8
jYOVdUwenMYN04YweXCajZQ2pg2WFEy3UFFTz9+W7uTRD7eyv6yGsf16cMNpgzlzRCY+G+tgTLNQ
mCXVGNclxETyvSmD+OCWadx54XEUV9Qw96nlfH3eYj7ZUuR1eMYEzMebC9lVXOn6dSwpmC4hNiqC
y07sz7s/Po0/XjyWg5V1zHp0Cdc8lcv2wgqvwzPmqKgqsxcs4+ml7q86aUnBdCmRET6+OT6bt388
lZ9+bTgffl7I9Hvf51cvr2d/WbXX4RnTKeU19dQ1KGkJ0a5fy5KC6ZJioyK4YdoQ3vvJaVw0Ppsn
P9nBlN+9y53/3kBheY3X4RlzRIoragHomeD+tC+WFEyX1is5lrsvGsPbN0/l3NFZPL54G6f+9l3+
8MZnVNTUex2eMR1S1JwU3J9J2JKC6RYGpCfwx4vH8dbNU5k+MpN5727m9Hve44WVeYTbHXim+zlg
NQVj3DEoI5H7Zx3P89dNIjM5lpv+vpqLHvqYNXkHvQ7NmENqqilYn4IxLpnQvycvXn8yv5s5hp3F
Vcz480f8/MW1lFTWeR2aMV/RVFNItaRgjHt8PuHinH6885OpzJ40gL8t3cm0e97jH7m7aGy0JiUT
OooraomO9JEQHeH6tVxNCiJyk4isF5F1IvKsiMS22h8jIn8Xkc0islREBrgZjzFtSY6N4pcXjOLl
G09hQFo8tyxcw6xHl7DNxjeYEFFUUUtaQnRQpnBxLSmISF/gB0COqh4HRACXtjrsO8ABVR0C3Av8
1q14jGnPqD4pLLx2Mnd/czQb9pRy9n0fMP+DLdQ3NHodmunmDlTUkhrvftMRuN98FAnEiUgkEA/s
brV/BvBX5/lC4Ayx2cyMh3w+4dKJx/DWzVM5dWgGv3n1Uy566GM+21vmdWimGyuqqCUtMcyTgqrm
A38AdgJ7gBJVfbPVYX2BXc7x9UAJkNb6XCIyV0RyRSS3oKDArZCNaZaZHMujV07ggVnHs+tAFV9/
YDF/fnez1RqMJ4q7Qk1BRFLx1wQGAn2ABBG5vDPnUtX5qpqjqjkZGRmBDNOYQxIRvj62D/+5aQrT
R2by+zc+46KHPubzfVZrMMF1oKKWnkG48wjcbT46E9imqgWqWgcsAia3OiYf6AfgNDGlADa1pQkp
aYkx/Pmy8cz79vHsLK7kvAcW89Qn223QmwmKmvoGymrqgzJGAdxNCjuBk0Qk3uknOAPY2OqYl4DZ
zvOZwDtqv2kmRJ0/pg//uXkqpwxJ5//9az0//udqqusavA7LdHEHKvxjZ4IxRgHc7VNYir/zeAWw
1rnWfBG5XUQucA57HEgTkc3AzcCtbsVjTCCkJ8bw2JU53HTmMF5Ymc9FD30clDnuTfdVHMTRzOC/
O8g1qvoL4BetNt/WYn818C03YzAm0Hw+4YdnDmV0djI/fG4VX5+3mHmzxnPK0HSvQzNd0BczpIZ5
TcGYru70YzN5+funkJkUy5VPLOXxxdusn8EEXHGlJQVjwsaA9ASev34yZ47I5I5XNvDThWuoqbd+
BhM4xc76H5YUjAkTiTGRPHz5BH5wxlAWLs/j0vlLbCEfEzDFFbWIQI9wH6dgTHfi8wk3Tx/GQ5eN
Z+OeUi5/bGlzW7AxR6O4spYecVFE+IIz2YMlBWMC6JzRWTw++wS2FVbw7UeXNE95bExnFQdx4BpY
UjAm4E4eks6jV+awtbCCyx5bysFKSwym84rKa0kLwoprTSwpGOOCKcMymH/FBDbvL+eKx5dRWm2L
95jOOVBZS2oQ1mZuYknBGJecNrwXj1wxgY17SrnhmRU2mZ7pFH/zkdUUjOkSph3bizsvPI4PPy/k
Vy9vsHEM5og0NioHKuvoGcSagqsjmo0xcMkJx7C1oIJHPtjK4IwE5pw80OuQTJgora6joVGDWlOw
pGBMENxy9rFsLazg9lc20D8tgWnH9vI6JBMGioI87xFY85ExQRHhE+67ZBzH9k7m+39bweb9tiaD
aV/TWJdgzZAKlhSMCZqEmEgen5NDbFQE1z+zgqpamw7DHF6wZ0gFSwrGBFVWShz3XjKOz/eX84uX
1nkdjglxwZ4hFSwpGBN0U4ZlcMNpQ/hHbh7PL8/zOhwTwiwpGNNN/OjMoZw4sCc/f3Gd9S+YQyqu
qCU+OoLYqIigXdOSgjEeiIzwcf+s44mPtv4Fc2jBnvcILCkY45nM5FjuvWQcm/aVc/drrZcvN8aS
gjHdzpRhGVx98kD++skO3t9U4HU4JsR0qaQgIsNFZFWLR6mI/KjVMaeJSEmLY2471PmM6apuOXs4
Q3sl8tN/rrapts2XFFfU0jNIi+s0cS0pqOpnqjpOVccBE4BK4IU2Dv2w6ThVvd2teIwJVbFREdx3
6TgOVNbysxfX2vxIplmXqim0cgawRVV3BOl6xoSVUX1SuHn6cF5du5dFK/K9DseEgKraBqrqGuiZ
2DWTwqXAs4fYN0lEVovIayIyqq0DRGSuiOSKSG5BgbW7mq5p7pRBTBzQk1+8tJ49JVVeh2M8VlTh
X+e7yzQfNRGRaOAC4J9t7F4B9FfVscADwIttnUNV56tqjqrmZGRkuBesMR6K8Al/+NZY6hoaueOV
DV6HYzx2oMK/MFNXbD46B1ihqvta71DVUlUtd56/CkSJSHoQYjImJB2TFs+Npw/h1bV7ee+z/V6H
YzzUVFNI64LNR7M4RNORiPQWEXGeT3TiKQpCTMaErO9NGcSg9AR+8dJ6qutsUFt31TxDaldqPhKR
BGA6sKjFtmtF5Frn5UxgnYisBu4HLlW79cJ0czGREdw+4zh2FFXy8PtbvA7HeOSLGVKDt8AOuLzI
jqpWAGmttj3c4vk8YJ6bMRgTjk4Zms7Xx/bhwfe28I1xfRmQnuB1SCbIiitqifAJyXHBXQvNRjQb
E6J+ft4IoiN83PbSehu70A0dqKwlNT4ap4U9aCwpGBOiMpNjuWn6MD7YVMA7n1qnc3dTVF4b1MV1
mlhSMCaEXTmpPwPTE/jNqxupb2j0OhwTRF6MZgZLCsaEtKgIH7eecyxbCip47r+7vA7HBFFxpSUF
Y0wbzhqZycQBPbnvrU2UVdd5HY4JEqspGGPaJCL87LwRFJbX8sj7W70OxwRBXUMjByvrgj5wDSwp
GBMWxvbrwYxxfXj0w602L1I30DSFelpicMcogCUFY8LGT84ajgJ/eGOT16EYlxWW+5NCujUfGWMO
pV/PeK6aPIBFK/PYtK/M63CMi4qtpmCM6Yhrpw4mLiqCee9s9joU4yKvJsMDSwrGhJXUhGiunDSA
l9fsZvP+cq/DMS5paj6ywWvGmHZ979SBDC3dx97Lr4bkZPD5/D+vvx622AR6XUFReQ2RPiE5Niro
17akYEyYSfvwHV55/AYmvr0IyspA1f/zscdgzBh47TWvQzRHqajcP0bB5wvuvEdgScGY8LJlC8yc
SXRNNdGNrdZaqKuDykqYOdNqDGGuqKLGk05msKRgTHi55x7/H//DqauDe+8NTjzGFYXltaR70MkM
lhSMCS9PP92xpPDUU8GJx7iiqKLGk05msKRgTHgp7+AdRx09zoSkovLa0G4+EpHBIhLjPD9NRH4g
Ij3cDc0Y8xWJiYE9zoScytp6KmsbPBmjAB2vKTwPNIjIEGA+0A/4m2tRGWPadvnlENXObYpRUXDF
FcGJxwRcUfMUFyFcUwAaVbUeuBB4QFV/CmS5F5Yxpk0//nHHksJNNwUnHhNwRc1TXIR2TaFORGYB
s4FXnG2H/Z8pIsNFZFWLR6mI/KjVMSIi94vIZhFZIyLjj7wIxnQjgwfDwoUQH/+V5FDri6AxLt6/
f/BgjwI0R6u4eYqL0K4pXAVMAu5U1W0iMhA47O0NqvqZqo5T1XHABKASeKHVYecAQ53HXOChIwne
mG7pnHNgzRqYO7d5RHNjUjL/OP4c7r777/79Jmx5OcUFdDApqOoGVf2Bqj4rIqlAkqr+9giucwaw
RVV3tNo+A3hS/ZYAPUTEmqWMac/gwTBvHpSUQEMDvtIStv3ytzy+18f2wgqvozNHoalPIaSbj0Tk
PRFJFpGewArgURH54xFc51Lg2Ta29wVaLjyb52xrff25IpIrIrkFBQVHcFljuo9rpg4iKkK4/53P
vQ7FHIWi8hrioiKIj4705PodbT5KUdVS4Jv4v9mfCJzZkTeKSDRwAfDPzoUIqjpfVXNUNScjI6Oz
pzGmS+uVFMvlJ/bnxZX5bLPaQtgqqqj1rJYAHU8KkU6zzsV80dHcUecAK1R1Xxv78vHf3tok29lm
jOmEa6YOJjrSxwNWWwhbheXezXsEHU8KtwNv4O8X+K+IDAI6+r9uFm03HQG8BFzp3IV0ElCiqns6
eF5jTCsZSTFccZK/trC1wEY1h6Oi8lpPluFs0tGO5n+q6hhVvc55vVVVL2rvfSKSAEwHFrXYdq2I
XOu8fBXYCmwGHgWuP8L4jTGtzJ3iry3Y6mzhyT9DaognBRHJFpEXRGS/83heRLLbe5+qVqhqmqqW
tNj2sKo+7DxXVb1BVQer6mhVze18UYwx4K8tXDlpAC+ustpCuFFVT+c9go43Hy3A39TTx3m87Gwz
xoSguVMGOX0LVlsIJ6VV9dQ3qmdjFKDjSSFDVReoar3z+AtgtwEZE6LSE/21hX+tymeL1RbCRqEz
mjk9DGoKRSJyuYhEOI/LgSI3AzPGHJ25UwYRExnBA2/bnUjhwuuBa9DxpHA1/ttR9wJ7gJnAHJdi
MsYEQHpiDLMnD+Bfq3ezLr+k/TcYzzXNe9Qz1JuPVHWHql6gqhmq2ktVvwG0e/eRMcZb108bTGp8
NLe/sgFV9Toc046meY/CofmoLTcHLApjjCuSY6O4afowlm0r5o31e70Ox7SjqfkoNT7EawqHIAGL
whjjmlkn9GNYZiK/efVTauobvA7HHEZRRQ0pcVFER3q3UvLRXNnqosaEgcgIHz8/byQ7iyv5y0fb
vQ7HHIZ/jIJ3tQRoJymISJmzOE7rRxn+8QrGmDAwZVgG04ZnMO+dzRSW13gdjjmEwvIaz5bhbHLY
pKCqSaqa3MYjSVW9mdfVGNMpPztvJJV1Ddzz5iavQzGH4PUMqXB0zUfGmDAypFciV5zUn+f+u9Nu
UQ1RReXeznsElhSM6VZuOnMYqfHR/PKl9XaLaoipb2jkQGUdaaHcfGSM6VpS4qO45WvDyd1xgBdX
2dIloaS4smmMgtUUjDFBdHFOP8Zkp3DXq59SXlPvdTjG8cUUF1ZTMMYEkc8n/OqCUewvq7F5kUJI
U1LwcooLsKRgTLd0/DGpzJyQzRMfbbNZVENEUfMMqZYUjDEe+J+zjyU2MsI6nUNEc/ORdTQbY7yQ
kRTDj6YP48PPC3lj/T6vw+n2iipqiPAJKXFRnsZhScGYbmz2pP4Mz0zijlc2UFVr8yJ5qai8lp4J
0fh83k4r52pSEJEeIrJQRD4VkY0iMqnV/tNEpEREVjmP29yMxxjzZZERPm6fMYr8g1U8+J4t3eml
wvJaT5fhbOL2VBV/Al5X1ZkiEg3Et3HMh6p6vstxGGMO4cRBacwY14dH3t/KN8dnMzA9weuQuqWi
ihpP11Fo4lpNQURSgCnA4wCqWquqB926njGm8/7v3BFER/r41cvW6eyFovIathVWkJHUhZMCMBAo
ABaIyEoReUxE2voKMklEVovIayIyqq0TichcEckVkdyCggIXQzame8pMjuVHZw7lvc8K+M8G63QO
ppr6Bq55ajlVtQ3MmTzA63BcTQqRwHjgIVU9HqgAbm11zAqgv6qOBR4AXmzrRKo6X1VzVDUnIyPD
xZCN6b5mTx7AsMxEfvHSekqr67wOp1tQVf530VpydxzgnovHMrZfD69DcjUp5AF5qrrUeb0Qf5Jo
pqqlqlruPH8ViBKRdBdjMsYcQlSEj7svGsO+0mrufGWj1+F0Cw++t4VFK/K56cxhnD8mNJaocS0p
qOpeYJeIDHc2nQFsaHmMiPQWEXGeT3TiKXIrJmPM4Y0/JpW5Uwbz99xdvPvZfq/D6dJeX7eH37/x
GReM7cMPzhjidTjN3B6ncCPwjIisAcYBvxGRa0XkWmf/TGCdiKwG7gcuVevlMsZTN00fyrDMRG59
fg0lldaM5Ja7XvuUUX2S+d3MMTjfjUOCq0lBVVc5fQFjVPUbqnpAVR9W1Yed/fNUdZSqjlXVk1T1
YzfjMca0LyYygj98ayyF5bXc/sqG9t9gjtiBilp2FFXy9bF9iI2K8DqcL7ERzcaYrxiT3YPrTxvM
8yvyeMvuRgq4tc7Kd2P6pngcyVdZUjDGtOnG04dybO8kbnl+DfkHq7wOp0tpSgqjLCkYY8JFdKSP
By8bT119I9c9vZzqOpsbKVDW5B1kYHqC55PftcWSgjHmkAZlJHLPxWNZk1fCL19a73U4XcbavBKO
C8FaAlhSMMa046xRvblh2mCe++8unlu20+twwl5heQ27S6pDsj8BLCkYYzrg5unDOXVoOrf9az2r
d9kUZkejqT9hdLYlBWNMmIrwCfdfejwZSTHc/I9V1NY3eh1S2FqbV4IIjOqT7HUobbKkYIzpkNSE
aG6fMYotBRX89ePtXocTttbklTAoPYGk2NDrZAZLCsaYI3DGiEymDc/gT29/zv7Saq/DCUtr8w8y
Jtv7ie8OxZKCMeaI3Pb1UdTWN3L3a596HUrY2V9azb7SGkaHaCczWFIwxhyhgekJfPfUgSxamU/u
9mKvwwkrzSOZQ7STGSwpGGM64funDyErJZbb/rWehkabw7Kj1uSV4BMYGaKdzGBJwRjTCfHRkfzs
vBFs2FPKk59s9zqcsLE2v4ShvZKIj470OpRDsqRgjOmU80ZnMW14Bnf+eyPvb7JlctujqqwJ4ZHM
TSwpGGM6RUS4f9bxDMtM4rqnl7POaS83bdtbWk1heU1I9yeAJQVjzFFIio1iwVUnkBofzZwF/2VX
caXXIYWsNXmhPZK5iSUFY8xRyUyO5a9Xn0BdQyOzFyzjQEWt1yGFpLV5JUT4hJFZodvJDJYUjDEB
MKRXEo/NziHvQBXfezLXptluw9r8EoZlJoXcSmutWVIwxgTECQN68seLx5K74wC3LFyDLbf+ZTuK
KhickeB1GO2ypGCMCZjzx/Thp18bzkurd3PvfzZ5HU7IUFX2lFTTp0ec16G0y9WkICI9RGShiHwq
IhtFZFKr/SIi94vIZhFZIyLj3YzHGOO+608bzMU52dz/zmaeX57ndTghobiilpr6RrJSYr0OpV1u
j6D4E/C6qs4UkWggvtX+c4ChzuNE4CHnpzEmTIkIv/7GaHYVV3HrojX06RHHpMFpXoflqT0l/skD
s1K6cU1BRFKAKcDjAKpaq6qtV+eYATypfkuAHiKS5VZMxpjgiI708fDlE+iflsDcp3LZtK/M65A8
tftgFQB9eoR+TcHN5qOBQAGwQERWishjItK6l6UvsKvF6zxn25eIyFwRyRWR3IICGzlpTDhIiY9i
wZwTiI2KYM4Ty9hb0n2n2raagl8kMB54SFWPByqAWztzIlWdr6o5qpqTkZERyBiNMS7q1zOeBXNO
oKSqjjkLllFWXed1SJ7YXVJFdISPtIRor0Npl5tJIQ/IU9WlzuuF+JNES/lAvxavs51txpgu4ri+
KTx0+QQ27y/nuqdXdMulPPccrKZ3Siw+n3gdSrtcSwqquhfYJSLDnU1nABtaHfYScKVzF9JJQImq
7nErJmOMN6YMy+Cub45m8eZCbv7Hqm433fbekuqwuPMI3L/76EbgGefOo63AVSJyLYCqPgy8CpwL
bAYqgatcjscY45Fv5fSjuKKWu177lNioCH530Ziw+OYcCLtLqjhhQE+vw+gQV5OCqq4CclptfrjF
fgVucDMGY0zouGbqYKrqGrjvrc+JjfJxx4zjEOnaiaGxUdlXajUFY4xp0w/PGEp1XSMPv7+FmMgI
fn7eiC6dGArLa6hrUEsKxhjTFhHhf84eTnVdA48v3kZCdAQ3nzW8/TeGqd1hdDsqWFIwxnhARPjF
10dSWVvP/e9sJiU+mu+cMtDrsFyxxxm4lhUGA9fAkoIxxiMiwl3fHENZdT13vLKBHnFRXDQh2+uw
Aq6pptAnTGoKNkuqMcYzET7hvkvHccqQdG55fg1vrt/rdUgBt+dgFbFRPnrER3kdSodYUjDGeCom
MoJHrpjA6L4pfP/Zlby9cZ/XIQXUnpJq+qTEhU1nuiUFY4znEmIi+ctVJzA8M4nvPZnLYx9u7TKL
9OwuqQqb/gSwpGCMCRE94qP5+zUncdbI3vz63xv5vxfWUdcQ/lNi7DlYHTZ3HoElBWNMCImPjuTB
y8Zz/WmDeXbZTuYsWEZJVfhOolff0Mj+smr6hMkYBbCkYIwJMT6fcMvZx3LPt8aybFsxlz22hAMV
tV6H1Sn7ympoVMgKg2U4m14zjf0AAA3eSURBVFhSMMaEpIsmZDP/ihw27Svn0vlLKCir8TqkI9Y8
RsFqCsYYc/SmHduLBXNOYGdxJZfM/yTsFuppHqNgNQVjjAmMk4ek8+R3JrK/tIaLH/mEHUUVXofU
YVZTMMYYF5wwoCdPf/dESqvruPDBj1m+44DXIXXInpJqEmMiSYoNj4FrYEnBGBMmxvXrwaLrJpMU
G8msR5fw7zWhvx7XnpKqsKolgCUFY0wYGZSRyKLrJjO6bwo3/G0FD7+/JaQHue0pqQ6rO4/AkoIx
JsykJcbwzHdP5LwxWdz92qf86uUNNIbo8p67D4bXGAWwWVKNMWEoNiqCBy49nl5JMSz4aDulVXX8
duYYoiJC53tuTX0DheU1YTWaGSwpGGPClM8n3Hb+SFLjo/njfzZRWl3HvG+PJzYqwuvQANhX4h9X
EU7zHoHLzUcisl1E1orIKhHJbWP/aSJS4uxfJSK3uRmPMaZrERF+cMZQ7pgxirc/3c/sJ0JnWozd
Jf7bUcNlHYUmwagpTFPVwsPs/1BVzw9CHMaYLuqKSQNIjoviJ/9czYUPfsQTs09gQHqCpzHtKQmv
FdeahE4DnDHGHIUZ4/ry9HdO5EBFLTP+/BGfbCnyNJ7dB8NrxbUmbicFBd4UkeUiMvcQx0wSkdUi
8pqIjGrrABGZKyK5IpJbUFDgXrTGmLB24qA0XrzhZDKSYrji8aU8t2ynZ7HsKamiR3wUcdGh0cfR
UW4nhVNUdTxwDnCDiExptX8F0F9VxwIPAC+2dRJVna+qOaqak5GR4W7Expiw1j8tgUXXT2bykHRu
XbSWn7+4ltr64K/LsKu4KuzuPAKXk4Kq5js/9wMvABNb7S9V1XLn+atAlIikuxmTMabrS46N4onZ
OVwzdRBPL9nJpUGeTC//YBUfbS5k8uC0oF0zUFxLCiKSICJJTc+Bs4B1rY7pLc7CpSIy0YnH24ZA
Y0yXEBnh43/PGcGfvz2eT/eWcf4Di1m6NTh/Xp5YvA0Frjp5QFCuF0hu1hQygcUishpYBvxbVV8X
kWtF5FrnmJnAOueY+4FLNZTHrBtjws55Y7J48YaTSY6N5LLHlvLy6t2uXq+kqo7nlu3k/DFZZKfG
u3otN7h2S6qqbgXGtrH94RbP5wHz3IrBGGMAhmUm8eL3T+Y7f/kvP3xuJVV1DVyc08+Vaz2zdAcV
tQ3MnTLIlfO7zW5JNcZ0C8mxUfz16omcPCSdWxau4S8fbQv4NWrqG1jw0XZOHZrOqD4pAT9/MFhS
MMZ0G/HRkTw2O4fpIzP55csb+PO7mwN6/n+t3E1BWU3Y1hLAkoIxppuJiYzgwcvGM2NcH37/xmfc
99amgJy3sVGZ/+FWRmYlc8qQ8L2J0ibEM8Z0O1ERPv548TiiInzc99bnNDYqN00fhnMzZKe8+9l+
Nu8v575Lxh3VebxmScEY0y1F+ITfXTQGn8D972ymQZWfnDW8U3/QK2rqufu1T+nbI47zxmS5EG3w
WFIwxnRbPp9w9zfHEOET/vzuFuoblVvPPvaIEoOqcuuitWwpKOfJq08MqTUdOsOSgjGmW/P5hDu/
MZoIn/DI+1spq67njhnHEeHrWGL4y8fbeXn1bn76teGcMjR8+xKaWFIwxnR7Pp9wx4zjSIqN4qH3
tnCwspZ7LxlHTOThJ7PL3V7Mnf/eyJkjMrlu6uAgResuSwrGGIN/wZ7/OftYesZHc+erGymtyuWR
KyZQW9/IZ/vK2LSvjIOVdaQlRpORGENSbBQ/fG4lfVPjuOfisfg6WLMIdZYUjDGmhe9NGURqQjT/
8/wacn79FlV1DYc8NjbKx1+vnkhKXFQQI3SXJQVjjGll5oRsMpJieH3dHgamJzC8dzLDM5PomRBN
cUUtBWU1FJRX0z8tgcEZiV6HG1CWFIwxpg1Th2UwddhX12/pnRJL75RYIDynsWhPeN87ZYwxJqAs
KRhjjGlmScEYY0wzSwrGGGOaWVIwxhjTzJKCMcaYZpYUjDHGNLOkYIwxppmoqtcxHBERKQB2tLEr
BSjp5Oum500/04HCTobY+jpHckxb2zsSd8vnLbe5WQ43y9DyeXf/LLwuQ8vnofJZ2O9258rRX1W/
OhqvNVXtEg9gfmdfNz1v8TM3UHEcyTFtbe9I3G2Vwe1yuFkG+yxCpwyh+FnY7/bRlaO9R1dqPnr5
KF6/fIhjAhHHkRzT1vaOxN3yeSDK0JHzuFmGjly/I7rCZ+F1GToaQ3sCWQ773XZR2DUfBYOI5Kpq
jtdxHK2uUI6uUAboGuWwMoQON8vRlWoKgTTf6wACpCuUoyuUAbpGOawMocO1clhNwRhjTDOrKRhj
jGlmScEYY0yzLp8UROQJEdkvIus68d4JIrJWRDaLyP0iIi323Sgin4rIehH5XWCj/kocAS+DiPxS
RPJFZJXzODfwkX8lFlc+C2f/j0VERSQ9cBG3GYcbn8UdIrLG+RzeFJE+gY/8K7G4UY7fO78Ta0Tk
BRHpEfjIvxSHG2X4lvM73SgirnVIH03shzjfbBH53HnMbrH9sL83bXLrXtdQeQBTgPHAuk68dxlw
EiDAa8A5zvZpwFtAjPO6VxiW4ZfAT8L9s3D29QPewD+oMT3cygAktzjmB8DD4fhZAGcBkc7z3wK/
DcMyjACGA+8BOaEWuxPXgFbbegJbnZ+pzvPUw5XzcI8uX1NQ1Q+A4pbbRGSwiLwuIstF5EMRObb1
+0QkC/8v6xL1/+s+CXzD2X0dcLeq1jjX2B+GZQg6F8txL3AL4PpdE26UQVVLWxyaQPiW401VrXcO
XQJkh2EZNqrqZ27GfTSxH8LXgP+oarGqHgD+A5zd2d//Lp8UDmE+cKOqTgB+AjzYxjF9gbwWr/Oc
bQDDgFNFZKmIvC8iJ7gabduOtgwA33eq+k+ISKp7oR7WUZVDRGYA+aq62u1AD+OoPwsRuVNEdgGX
Abe5GOvhBOL/VJOr8X8zDbZAliHYOhJ7W/oCu1q8bipPp8oZ2cGLdhkikghMBv7Zonkt5ghPE4m/
qnYScALwDxEZ5GRj1wWoDA8Bd+D/VnoHcA/+X+SgOdpyiEg88H/4my08EaDPAlX9GfAzEflf4PvA
LwIWZAcEqhzOuX4G1APPBCa6Dl83YGUItsPFLiJXAT90tg0BXhWRWmCbql4Y6Fi6XVLAXzs6qKrj
Wm4UkQhgufPyJfx/NFtWf7OBfOd5HrDISQLLRKQR/wRVBW4G3sJRl0FV97V436PAK24GfAhHW47B
wEBgtfOLlA2sEJGJqrrX5dibBOL/U0vPAK8S5KRAgMohInOA84EzgvUlqYVAfxbB1GbsAKq6AFgA
ICLvAXNUdXuLQ/KB01q8zsbf95BPZ8rpVkdKKD2AAbTo0AE+Br7lPBdg7CHe17qT5lxn+7XA7c7z
YfirbhJmZchqccxNwHPh+Fm0OmY7Lnc0u/RZDG1xzI3AwnD8LICzgQ1ARjDid/P/Ey53NHc2dg7d
0bwNfydzqvO8Z0fK2WZcwfrwvHoAzwJ7gDr83/C/g//b5evAauc/8W2HeG8OsA7YAszjixHg0cDT
zr4VwOlhWIangLXAGvzfnrLcLINb5Wh1zHbcv/vIjc/ieWf7GvyTnvUNx88C2Iz/C9Iq5+HqXVQu
leFC51w1wD7gjVCKnTaSgrP9aufffzNw1ZH83rR+2DQXxhhjmnXXu4+MMca0wZKCMcaYZpYUjDHG
NLOkYIwxppklBWOMMc0sKZguQUTKg3y9x0RkZIDO1SD+GVLXicjL7c0uKiI9ROT6QFzbmNbsllTT
JYhIuaomBvB8kfrF5G6uahm7iPwV2KSqdx7m+AHAK6p6XDDiM92L1RRMlyUiGSLyvIj813mc7Gyf
KCKfiMhKEflYRIY72+eIyEsi8g7wtoicJiLvichC8a8T8EzTfPTO9hznebkzod1qEVkiIpnO9sHO
67Ui8usO1mY+4YvJ/hJF5G0RWeGcY4ZzzN3AYKd28Xvn2J86ZVwjIr8K4D+j6WYsKZiu7E/Avap6
AnAR8Jiz/VPgVFU9Hv+MpL9p8Z7xwExVneq8Ph74ETASGASc3MZ1EoAlqjoW+AD4Xovr/0lVR/Pl
2Srb5MzRcwb+EeYA1cCFqjoe/xoe9zhJ6VZgi6qOU9WfishZwFBgIjAOmCAiU9q7njFt6Y4T4pnu
40xgZItZJ5Od2ShTgL+KyFD8s8RGtXjPf1S15Tz3y1Q1D0BEVuGfr2Zxq+vU8sWEgsuB6c7zSXwx
f/3fgD8cIs4459x9gY3458MH/3w1v3H+wDc6+zPbeP9ZzmOl8zoRf5L44BDXM+aQLCmYrswHnKSq
1S03isg84F1VvdBpn3+vxe6KVueoafG8gbZ/Z+r0i865Qx1zOFWqOs6ZCvwN4AbgfvxrK2QAE1S1
TkS2A7FtvF+Au1T1kSO8rjFfYc1Hpit7E/+sowCISNO0xCl8MYXwHBevvwR/sxXApe0drKqV+Jfj
/LGIROKPc7+TEKYB/Z1Dy4CkFm99A7jaqQUhIn1FpFeAymC6GUsKpquIF5G8Fo+b8f+BzXE6Xzfg
n/Ic4HfAXSKyEndryz8CbhaRNfgXRylp7w2quhL/bKmz8K+tkCMia4Er8feFoKpFwEfOLay/V9U3
8TdPfeIcu5AvJw1jOsxuSTXGJU5zUJWqqohcCsxS1Rntvc8YL1mfgjHumQDMc+4YOkiQlzs1pjOs
pmCMMaaZ9SkYY4xpZknBGGNMM0sKxhhjmllSMMYY08ySgjHGmGb/HyRou9sexXjkAAAAAElFTkSu
QmCC
" />
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>It is common to pick a point a bit before the suggested one, since a slightly smaller learning rate gives some safety margin against the loss diverging.</p>
</div>
</div>
</div>
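As a toy illustration only (not fastai's actual implementation), the "Min numerical gradient" suggestion printed above amounts to: sweep the learning rate on a log scale, record the loss, suggest the LR where the loss is falling fastest, and then back off a bit before training. With a synthetic loss curve:

```python
import numpy as np

# Synthetic LR sweep: 51 learning rates spaced evenly on a log scale.
lrs = np.logspace(-5, 0, 51)

# Toy S-shaped loss curve: flat, then falling fastest near lr ≈ 3e-3, then flat.
losses = -np.tanh(np.log10(lrs) + 2.5)

# Suggest the LR at the steepest descent (most negative numerical gradient),
# analogous to the "Min numerical gradient: 3.02E-03" line in the output above.
suggested = lrs[np.argmin(np.gradient(losses))]

# Pick a point a bit before the suggestion as a safety margin.
max_lr = suggested / 10
```

The division by 10 here is just one common rule of thumb; the notebook simply eyeballs the plot and picks <code>5e-4</code>.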
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">max_lr</span> <span class="o">=</span> <span class="mf">5e-4</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p><strong>DRUM ROLL PLEASE!!!!!</strong> You are now finally going to start training your model! Specifically for 8 epochs, because that is what the original code in the NLP course used and it also happened to work best during my training. You are also adding a few callbacks: automatically saving the best-performing model, early stopping, and showing the training and validation loss graph. Since you are using early stopping, feel free to try a higher epoch count; training will stop once the validation loss stops improving.</p>
</div>
</div>
</div>
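To make the early-stopping behaviour concrete, here is a plain-Python sketch of the patience/min_delta logic. This is an approximation of what fastai's EarlyStoppingCallback does with the settings used below (min_delta=0.01, patience=3), not the library code itself:

```python
def early_stop_epoch(valid_losses, min_delta=0.01, patience=3):
    """Return the (0-indexed) epoch at which early stopping would trigger,
    or None if training runs to completion. A sketch of the usual
    patience/min_delta rule, not fastai's actual implementation."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(valid_losses):
        if best - loss > min_delta:  # improved by more than min_delta
            best = loss
            wait = 0
        else:
            wait += 1
            if wait > patience:  # too many epochs without real improvement
                return epoch
    return None

# Five epochs with no improvement at all: stops after patience runs out
print(early_stop_epoch([1.0, 1.0, 1.0, 1.0, 1.0]))  # 4
```

Note that improvements smaller than min_delta (like the last couple of epochs in the run below) count as "no improvement," which is why a generous epoch count is safe to use here.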
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">train_model</span><span class="p">(</span><span class="n">learn</span><span class="p">,</span> <span class="n">epochs</span><span class="p">,</span> <span class="n">model_name</span><span class="p">,</span> <span class="n">max_lr</span> <span class="o">=</span> <span class="mf">5e-4</span><span class="p">):</span>
<span class="sd">"""Trains a model using save-model, early-stopping, and show-graph callbacks."""</span>
<span class="n">callback_fns</span> <span class="o">=</span> <span class="p">[</span>
<span class="n">callbacks</span><span class="o">.</span><span class="n">SaveModelCallback</span><span class="p">(</span>
<span class="n">learn</span><span class="p">,</span> <span class="n">every</span><span class="o">=</span><span class="s1">'improvement'</span><span class="p">,</span>
<span class="n">monitor</span><span class="o">=</span><span class="s1">'valid_loss'</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">model_name</span><span class="si">}</span><span class="s1">_save_model'</span>
<span class="p">),</span>
<span class="n">callbacks</span><span class="o">.</span><span class="n">EarlyStoppingCallback</span><span class="p">(</span>
<span class="n">learn</span><span class="p">,</span> <span class="n">monitor</span><span class="o">=</span><span class="s1">'valid_loss'</span><span class="p">,</span> <span class="n">min_delta</span> <span class="o">=</span> <span class="mf">0.01</span><span class="p">,</span>
<span class="n">patience</span> <span class="o">=</span> <span class="mi">3</span>
<span class="p">),</span>
<span class="n">ShowGraph</span><span class="p">(</span><span class="n">learn</span><span class="p">)</span>
<span class="p">]</span>
<span class="n">learn</span><span class="o">.</span><span class="n">fit_one_cycle</span><span class="p">(</span><span class="n">epochs</span><span class="p">,</span> <span class="n">max_lr</span><span class="p">,</span> <span class="n">div_factor</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">callbacks</span> <span class="o">=</span> <span class="n">callback_fns</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">epochs</span> <span class="o">=</span> <span class="mi">8</span>
<span class="n">model_name</span> <span class="o">=</span> <span class="s1">'comment_gen'</span>
</pre></div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Training on Google Colab can take anywhere from ~20 to 60 minutes depending on the type of GPU they give you. So, relax, get an IV caffeine drip going, and let your model cook in peace :).</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="n">train_model</span><span class="p">(</span><span class="n">learn</span><span class="p">,</span> <span class="n">epochs</span><span class="p">,</span> <span class="n">model_name</span><span class="p">,</span> <span class="n">max_lr</span> <span class="o">=</span> <span class="n">max_lr</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
<table border="1" class="dataframe">
<thead>
<tr style="text-align: left;">
<th>epoch</th>
<th>train_loss</th>
<th>valid_loss</th>
<th>accuracy</th>
<th>bleu</th>
<th>time</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1.182219</td>
<td>1.133453</td>
<td>0.828182</td>
<td>0.791774</td>
<td>06:46</td>
</tr>
<tr>
<td>1</td>
<td>0.920205</td>
<td>0.954264</td>
<td>0.841556</td>
<td>0.799681</td>
<td>06:47</td>
</tr>
<tr>
<td>2</td>
<td>0.812330</td>
<td>0.875513</td>
<td>0.849487</td>
<td>0.804000</td>
<td>06:44</td>
</tr>
<tr>
<td>3</td>
<td>0.752023</td>
<td>0.828835</td>
<td>0.853668</td>
<td>0.807183</td>
<td>06:45</td>
</tr>
<tr>
<td>4</td>
<td>0.679716</td>
<td>0.794862</td>
<td>0.856593</td>
<td>0.809325</td>
<td>06:43</td>
</tr>
<tr>
<td>5</td>
<td>0.653454</td>
<td>0.777795</td>
<td>0.859418</td>
<td>0.811010</td>
<td>06:42</td>
</tr>
<tr>
<td>6</td>
<td>0.611860</td>
<td>0.770059</td>
<td>0.860419</td>
<td>0.812164</td>
<td>06:49</td>
</tr>
<tr>
<td>7</td>
<td>0.605370</td>
<td>0.769881</td>
<td>0.860601</td>
<td>0.812119</td>
<td>06:45</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Better model found at epoch 0 with valid_loss value: 1.133453130722046.
</pre>
</div>
</div>
<div class="output_area">
<div class="output_png output_subarea ">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAWoAAAD4CAYAAADFAawfAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0
dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAgAElEQVR4nO3de3Cd9X3n8ff3nOdcdb/ZkiXjGwkY
jHGwwpLAUBqaFEjIZQKBTtJmaXfcTdMFMruzQ6ezTTLTnUk723Y3TZqUNPSym0ATJwzZNGkuLZTJ
hkBkMMZgAtjYkWRblmTdde7nt3+cR7Ysy9KRfI7Okfx5zWjOo+f6fXzkz3nO7/k9z2POOUREpHoF
Kl2AiIgsTEEtIlLlFNQiIlVOQS0iUuUU1CIiVc4rx0obmprdW7ZtLceqRUTWpH379g0559rmm1aW
oG7bsJGenp5yrFpEZE0ys2MXmlaepg/1zRYRKZmyBLVyWkSkdMoT1OVYqYjIJaosbdQ6ohaRpchk
MvT19ZFMJitdStlFo1G6uroIhUJFL1OeoNYxtYgsQV9fH3V1dWzevBkzq3Q5ZeOcY3h4mL6+PrZs
2VL0cmqjFpGKSyaTtLS0rOmQBjAzWlpalvzNQW3UIlIV1npIz1jOfpbpiFpRLSJSKrqEXEQueaOj
o/zVX/3Vkpe74447GB0dLUNF51IbtYhc8i4U1NlsdsHlvve979HY2Fiuss4oU68PEZHV46GHHuLw
4cPs2rWLUChENBqlqamJV199lddee40PfvCD9Pb2kkwmeeCBB9izZw8Amzdvpqenh8nJSW6//XZu
uukmfvrTn9LZ2ckTTzxBLBYrSX1l6ketqBaR5fns/32ZV46Pl3SdV22o59N3Xn3B6Z/73Oc4ePAg
+/fv56mnnuK9730vBw8ePNOF7pFHHqG5uZlEIsHb3/52PvzhD9PS0nLOOl5//XUeffRRvvKVr/CR
j3yEb33rW3zsYx8rSf06ohYRmeP6668/p5/z5z//eR5//HEAent7ef31188L6i1btrBr1y4Adu/e
zdGjR0tWT1FBbWafAv4DhQx+CbjPOXfBjoA6oBaR5VroyHel1NTUnBl+6qmn+PGPf8wzzzxDPB7n
lltumbcfdCQSOTMcDAZJJBIlq2fRk4lm1gncD3Q753YAQeDehZZR04eIrCZ1dXVMTEzMO21sbIym
pibi8TivvvoqP/vZz1a4uuKbPjwgZmYZIA4cX2jmvHJaRFaRlpYWbrzxRnbs2EEsFmP9+vVnpt12
2218+ctfZvv27VxxxRXccMMNK16fFXP0a2YPAP8dSAA/dM59dJ559gB7AGo7tu6eOH64xKWKyFp1
6NAhtm/fXukyVsx8+2tm+5xz3fPNX0zTRxPwAWALsAGoMbPzTmU65x52znU757qDXvF3hRIRkYUV
c8HLrwFvOucGnXMZ4NvAOxdaIK82ahGRkikmqH8J3GBmcSvcTeRW4NBCC6iNWkSkdBYNaufcs8Be
4HkKXfMCwMMLLZNTUouIlExRvT6cc58GPl3sSvPOkcnlCQV1zycRkYtVtiQdmUqXa9UiIpeUsgX1
sIJaRNaw2tpaAI4fP85dd9017zy33HILPT09F72tsgX1yfG1/5BKEZENGzawd+/esm6jbEHdP1K6
69xFRMrtoYce4otf/OKZ3z/zmc/wx3/8x9x6661cd911XHPNNTzxxBPnLXf06FF27NgBQCKR4N57
72X79u186EMfKtn9Pspy9zwD+kcV1CKyDN9/CE6+VNp1tl8Dt39uwVnuueceHnzwQT75yU8C8I1v
fIMf/OAH3H///dTX1zM0NMQNN9zA+9///gs+9/BLX/oS8XicQ4cOceDAAa677rqSlF+WoA4FAxxX
UIvIKvK2t72NU6dOcfz4cQYHB2lqaqK9vZ1PfepTPP300wQCAfr7+xkYGKC9vX3edTz99NPcf//9
AOzcuZOdO3eWpLayBbWaPkRkWRY58i2nu+++m71793Ly5Enuuecevva1rzE4OMi+ffsIhUJs3rx5
3luclltZ2qjDXoDekelyrFpEpGzuueceHnvsMfbu3cvdd9/N2NgY69atIxQK8eSTT3Ls2LEFl7/5
5pv5+te/DsDBgwc5cOBASeoqyxF12AswMJ4imckRDQXLsQkRkZK7+uqrmZiYoLOzk46ODj760Y9y
5513cs0119Dd3c2VV1654PKf+MQnuO+++9i+fTvbt29n9+7dJamrPEEdNLJA30iCy9fVlmMTIiJl
8dJLZ09ktra28swzz8w73+TkJFB4wO3BgwcBiMViPPbYYyWvqWxNHwC9p9X8ISJyscoU1IXmjmPD
U+VYvYjIJaUsQe0FjLqIx5EhBbWIFOdSedbqcvazbFcmbl1Xy+HByXKtXkTWkGg0yvDw8JoPa+cc
w8PDRKPRJS1XlpOJANtaa3jmyHC5Vi8ia0hXVxd9fX0MDg5WupSyi0ajdHV1LWmZ8gX1ulq+/UI/
U6ksNZGybUZE1oBQKMSWLVsqXUbVKubhtleY2f5ZP+Nm9uBiy21trQHgTbVTi4hclGIexfUL59wu
59wuYDcwDTy+2HLb/P7TaqcWEbk4Sz2ZeCtw2Dm38HWUwKaWOAGDw6cU1CIiF2OpQX0v8Oh8E8xs
j5n1mFnP4OAgES9IV1NcXfRERC5S0UFtZmHg/cA355vunHvYOdftnOtua2sDYHNrDUd10YuIyEVZ
yhH17cDzzrmBYhfY0hLn6ND0mu8bKSJSTksJ6t/gAs0eF7K5tYbJVJahST3oVkRkuYoKajOrAd4N
fHspK9/sd9FT84eIyPIVFdTOuSnnXItzbmwpK9/Sor7UIiIXq2z3+gDoaorhBYyjCmoRkWUra1B7
wQAbm+Nq+hARuQhlDWqAy5rj/FIPEBARWbYVCepjw+qiJyKyXGUP6k0tcSaSWcYSmXJvSkRkTSp7
UG9sjgOo+UNEZJlWpOkDFNQiIsu1YkF9bFhBLSKyHGUP6pqIR2ttmF4dUYuILEvZgxoK7dRq+hAR
WZ4VCepNfhc9ERFZuhUJ6sua45wYS5DO5ldicyIia8qKNX3kHRwfTazE5kRE1pSVafrw76J3TO3U
IiJLtmJNH6C+1CIiy7EiQb2uLkLYC6iLnojIMhT7hJdGM9trZq+a2SEze8eSNhKwwl301PNDRGTJ
vCLn+1/APzvn7vKfRh5f6oYua46rjVpEZBkWPaI2swbgZuCrAM65tHNudKkbuqw5Tu9p3e5URGSp
imn62AIMAn9rZi+Y2d/4D7s9h5ntMbMeM+sZHBw8byWXNceZTGUZmdbtTkVElqKYoPaA64AvOefe
BkwBD82dyTn3sHOu2znX3dbWdt5Kzt6cSY/lEhFZimKCug/oc8496/++l0JwL8llLeqiJyKyHIsG
tXPuJNBrZlf4o24FXlnqhjY2FYJaXfRERJam2F4f/wn4mt/j4whw31I3FAsHWVcX0c2ZRESWqKig
ds7tB7ovdmN6IrmIyNKtyJWJMy5riavpQ0RkiVY2qJvjnBhPksrmVnKzIiKr2ooHtXPQN6LbnYqI
FGtFg3qTuuiJiCzZigb1xmZ10RMRWaoVDeq22gixUFBd9ERElmBFg9rM1EVPRGSJVjSoodD8oaYP
EZHirXhQzxxR63anIiLFqUBQx5hO5xiaTK/0pkVEVqWVD2q/i17viJo/RESKUZGmD1AXPRGRYq14
UHf5tzvVg25FRIqz4kEdDRVud6oueiIixVnxoAbd7lREZCkqEtSdTTH6R3VjJhGRYhQV1GZ21Mxe
MrP9ZtZzsRvtaopxcixJNpe/2FWJiKx5xT6KC+BXnXNDpdhoZ2OcbN4xMJGiszFWilWKiKxZFWn6
6GoqhHO/7kstIrKoYoPaAT80s31mtme+Gcxsj5n1mFnP4ODggivr9IO6Txe9iIgsqtigvsk5dx1w
O/BJM7t57gzOuYedc93Oue62trYFVzbT3KEjahGRxRUV1M65fv/1FPA4cP3FbDQaCtJaG9EjuURE
irBoUJtZjZnVzQwD7wEOXuyGu9RFT0SkKMX0+lgPPG5mM/N/3Tn3zxe74c6mGC/3j13sakRE1rxF
g9o5dwS4ttQb7mqK8aOXB8jnHYGAlXr1IiJrRkW65wF0NcZI5/IMTaYqVYKIyKpQsaCe6aLXqxOK
IiILqtwRtX+7U51QFBFZWOWOqBt10YuISDEqFtQ1EY+meEh9qUVEFlGxoIZC84eCWkRkYRUN6o3N
MTV9iIgsouJH1P0jCZxzlSxDRKSqVTioY6SyeQYn1JdaRORCKtv04XfRU19qEZELq/gRNaiLnojI
QireRg2o54eIyAIqGtSxcJDW2rCOqEVEFlDRoAbobIrTe1pH1CIiF1LxoN7YpL7UIiILqXhQdzXF
6R9NkM+rL7WIyHyKDmozC5rZC2b23VIWsLE5RibnGJhIlnK1IiJrxlKOqB8ADpW6APX8EBFZWFFB
bWZdwHuBvyl1ATN9qXtPq51aRGQ+xR5R/0/gvwL5C81gZnvMrMfMegYHB4su4Ox9qXVELSIyn0WD
2szeB5xyzu1baD7n3MPOuW7nXHdbW1vRBURDQdbVRdTzQ0TkAoo5or4ReL+ZHQUeA95lZv+nlEVs
bFZfahGRC1k0qJ1zf+Cc63LObQbuBf7VOfexUhaxqTnOseGpUq5SRGTNqHg/aoAtrTUcH0uSSOcq
XYqISNVZUlA7555yzr2v1EVsaasB4KiOqkVEzlM1R9QARwYV1CIic1VZUE9WuBIRkepTFUEdD3ts
aIhyZEhH1CIic1VFUANsW1fLYR1Ri4icp3qCuq2Ww6cm9URyEZE5qiioa5hK5zilJ5KLiJyjaoJ6
a1stAIdPqflDRGS2qgnqbTNBrXZqEZFzVE1Qr6+PUBMOclh9qUVEzlE1QW1m6vkhIjKPqglqgK2t
Nbo6UURkjqoK6m1ttfSPJphOZytdiohI1aiuoF5XOKGoo2oRkbOqKqjfur4Q1K8NTFS4EhGR6lFV
Qb25pYawF+DVkwpqEZEZVRXUXjDAFevrOHRivNKliIhUjWIebhs1s+fM7EUze9nMPlvOgq5sr+PQ
CR1Ri4jMKOaIOgW8yzl3LbALuM3MbihXQds76hmaTDGoe36IiADFPdzWOedmrkIJ+T9lu8Xd9o56
AA72j5VrEyIiq0pRbdRmFjSz/cAp4EfOuWfnmWePmfWYWc/g4OCyC7p2YwPBgLHv2Miy1yEispYU
FdTOuZxzbhfQBVxvZjvmmedh51y3c667ra1t2QXFwx5XddTTc+z0stchIrKWLPUp5KPAk8Bt5Smn
YPemJl7sHSOby5dzMyIiq0IxvT7azKzRH44B7wZeLWdR125sIJHJ6U56IiIUd0TdATxpZgeAn1No
o/5uOYu6prMRgAN9o+XcjIjIquAtNoNz7gDwthWo5YytrTXURjwO9I1xd/fGldy0iEjVqaorE2cE
AsaOznoOqIueiEh1BjXAzq5GDp0YJ53VCUURubRVbVBf09lAOpvXnfRE5JJXtUG9s6sBgP29OqEo
Ipe2qg3qy5rjrKuL8OybuvBFRC5tVRvUZsY7t7XwzOEhnCvbrUVERKpe1QY1wDu3tTI0mea1AT2Z
XEQuXdUd1Je3APDTw0MVrkREpHKqOqi7muJc1hznp4eHK12KiEjFVHVQA7xzWws/OzJMLq92ahG5
NFV9UL9jWwsTyay66YnIJavqg/qWK9YRChrff+lEpUsREamIqg/qhliIm9/SxvdeOqFueiJySar6
oAZ4784Ojo8leUHNHyJyCVoVQf1rV60n4gX4Zk9vpUsREVlxqyKo66Mh7rx2A9998QTJTK7S5YiI
rKhiHsW10cyeNLNXzOxlM3tgJQqb6wO7NjCRyvLUL05VYvMiIhVTzBF1FvjPzrmrgBuAT5rZVeUt
63zv2NpCa22EJ/YfX+lNi4hU1KJB7Zw74Zx73h+eAA4BneUubC4vGOB9Ozv4l1dPMZbIrPTmRUQq
Zklt1Ga2mcLzE5+dZ9oeM+sxs57BwcHSVDfHXbu7SGfzPPrcL8uyfhGRalR0UJtZLfAt4EHn3Pjc
6c65h51z3c657ra2tlLWeMaOzgZuuryVR37yJqmsTiqKyKWhqKA2sxCFkP6ac+7b5S1pYf/xV7Zx
aiLF48/3V7IMEZEVU0yvDwO+Chxyzv15+Uta2I2Xt7Cjs56/fvqIbtQkIpeEYo6obwR+E3iXme33
f+4oc10XZGZ84lcu582hKX748slKlSEismK8xWZwzv0EsBWopWi37Whnc0ucL//bYW7b0U7hoF9E
ZG1aFVcmzhUMGL/7K9t4sW+Mb6mtWkTWuFUZ1AAf6d7I7k1NfO77h5hIql+1iKxdqzaogwHj03de
xdBkmi88+UalyxERKZtVG9QAO7sauWt3F3/9b0d48lXdA0RE1qZVHdQAf3jHdtrro9z3dz/nYP9Y
pcsRESm5VR/UTTVhvvJb3US8AP/+b39O/2ii0iWJiJTUqg9qgGu6Gvin+28ilcnxm199luMKaxFZ
Q9ZEUANcvq6OP/vItRwbnubOv/wJx4anKl2SiEhJrJmgBnjP1e08/nvvZDyZ4Xf+voeTY8lKlyQi
ctHWVFBDoSfIlz66m76Rae78wk949shwpUsSEbkoay6oofAw3O/8/k3EQkHuefhn/MG3XyKby1e6
LBGRZVmTQQ3w1vV1/NP9N/HbN27h0ed+ycf/9jlOT6UrXZaIyJKt2aAGqIuG+KM7r+JPP7yTnx8d
4c6//AkvH1dfaxFZXdZ0UM/4yNs38s3ffQe5vOMDX/h/PPDYC+zd10cyo6fEiEj1M+dKf/P97u5u
19PTU/L1XqzBiRR/8s+v8m/7X6Mjf4LJyDruuOEaLl/fyO3XtBPxgpUuUUQuUWa2zznXPd+0Re9H
vZa01UX4H3dfS/Kth4k+/t8AyD4T4BSNvPpEC8GGTrzGDUSaN7Kucws1rRuhfgPUbYBQtMLVi8il
atGgNrNHgPcBp5xzO8pfUvlFt90Ev/EYjPeTGOwlMNRLoP8I0dHXWTf6LPXHEvDCucvkIo1Q30mw
YQPUd0B9J9R1FIK8fkNhONYEeoiBiJRYMUfUfwd8AfiH8paygmrXwRW3A1Dn/7QD2Vyek+NJXjo+
QM9Lr/DG4deIJk7S5k7Tnh2hY/o0m06/SYc9T232NMacZiMv6od3ZyHMzxn2A712PQQvqS8yInKR
inkU19Nmtrn8pVSeFwzQ1RSnq2kLN169BXgvE8kML/aOMTCe5ODpaf7ilQFePzWBy2VYxyjtdpp2
O817NuZ5a2ycVnea+swgkd7nYOIElpvTJdACULPu3CPx+YbDNRX5NxCR6lPUyUQ/qL+7UNOHme0B
9gBcdtllu48dO1aiEqvTqYkkzx8bYXAixfcPnuTF3lGm0md7kQQDRi6fp4kJOuw03S0proxPcEV8
gg2BEZpzwzB+nODUCbz0+Hnrz0fqoW4Dgbr1EKkrBPeZn9rih4NhNceIrAILnUwsWVDPVq29Pspp
KpVlcCLFm8NT/PzN05hBwIypVCG8jwxNcrB/nKHJ1HnLxkjSbiO022nWM0KHnWa9nabdRtjgTdDo
pYmTJEaCcC5BKL+Ee5gEvEUCfRnTvBgELomenSIrRr0+VkBNxKMm4rG5tYZfvWLdvPM45xgYT/Fi
3yhvDk3R0RAlm3OEvQBhL0Dv6WkAxhMZjiQyjMTDfLN/jDeHp0hn85wYS5LLOwLkiZGiyUvT6KXp
qsmzqQ421uRJJSbYXOeoC6SI5hNEXZKoS1AXSOFlpyE9RSSTIJw6AempWT+T4Iq9zN7mhHkNhOKF
o3cvAsEIeOE5r5ElTovMWp//6kXPDgc8fVOQS4aCegWZGe0NUdob2pe1fDqb59REkkMnJjgyOMnx
0QTZvGNoMsV3+8fpP5ogHu5iOr34hTxtdRFSmRzpXJ7L19WCc4RJ41KF4J6eHKOGJJc3Gt0bwmyq
ddQFU7jUFF5ummBmiriliLkEwew02eQUwXyCWGCCEBksm4ZcCrL+Ty5deHWlusjI5g/3+QJ+Zh4v
CsEQBEL+q1d4DYb9cd7ZacXOF/AK42fPF/DnnTufPlhkmYrpnvcocAvQamZ9wKedc18td2FyvrA3
c7IzDqw/b3oykyMaCjKZyjKdypLO5UlmckylchwenCTvoDbi8capCV4+Pk5jPMTAeIp0Nk86m6d/
NMvVGzoJeQE2t8Spi4b4l0MDfOe1MZKZ4m9q1RQPsbWtlvaWKKPTaQbGU1zZXkdd1KOrIUx7bYB6
L0/QZQjk0gRdipGxSaYT06SSCeLBHB01AcJk6KwP0Bx2RCxLIDcT/rM+BGY+ALKp88fl0jA95f+e
9JdLQz4Duaz/mim8roSZUC/mQyHggQUL4R4IFk5CW3DWsP8TCPrzBc6dNnuZhaYtZ33nLGOAFV4t
cHb4nNfAPONY4vyz18/ytrmKXVJXJsryJDM5jo8mGE9mqQkHMTOy+TzjiSzjiQw552iMhUhkchwZ
nOLQiXGODk9xZHCK4ak0V7bXMZnKkszk522jny0aCpDO5snP7fkYMDqbYqQyebygkcnlecu6OtbV
RUjn8gyMJ3ltYJKwFyCZyVET9qiLeoSCAYIBoyEWoiEeIpXJkc07aiMe6+uj1EdDhILw1rYYW5rC
dNR5xINuVqBnIJ8tvObSZ4fPCfv0ufPNLHfefBdYx7wfHllwrvANxOUhn5s1nC+8upw/Pj9n2uxl
ZubLz5kvV8JvN1IK9tnxizuZuFQKarmQ0el0IeCTGVLZHLk8TCQz7OxqpCEWIuwFmEplOTmeJJHO
8drABIMTKU5Pp/nl8DSxUJDxZJaAwRunJknn8phBa22Ey5rjxMNBIl6QsUSGZCZHIpMjl3cMTaaZ
TGWoCXvk8o6TY0kmUtl5a9zaWkNnUwwvYNRFQwxPFb51xMMeU6nsmXMKBmxqqSEaChL2AkS8AO31
UdrqItREPJKZHGEvQH00RF3UI5PL0xgP0xALndmWc450Ls9EMktrbWSF3oVZnJvzIVBEuJ/zgeAA
N+s1P884N/+4BefPzxnHAvMvtA5/H6uew669R0EtMptzjlS28B85l3e8cmKcvpFpfjmc4ODxMU6O
JUllC81GjfEQES/AWCJDW12ETM6RzhaalXpHpsnmHDnnis6D1tqzYX1seJqs//VhXV2E5powyUyO
6XSOSCjAzq5GOhtjtNSESWfzTKazTCaz1EVDbGyOARD1Ch8UoWDhQ+65N08znszQEAvRUhsmHvYY
GE9iQDQUpK2u8IGQyuZJZfNEvAChYOFbx9a2WjoaoiQzOSJekLqoR94V/r3qYyFGptJk8o6pVJbj
owma4mF2djVgan+/aOr1ITKHmRENnb0J19s3N/P2zc1LXk8mlydgRjKTI+8cfSMJBidS5J3DCxSa
YYanCs09wUCAgfEkfSPTDIynMOCGrS3Ew0FqIyF6R6YZnU4TD3vEw0EmklkO9I3y41cGznyohL0A
8XCQqVSWTG7+T4baiEdHQ5ShyRRjiQx5V1guEgyQzuXPrKtU6qMeHQ0x/8Oq8A2hqzFOLBxkU0sc
L2AEzGiIh8jlHJOpLJ1NsTPnT5LZHDMX+U6nc/SNTBOPeCTTOeIRjw2NUd4YmMTMqI8VmqzCwQAB
MxrjISaSGbxg4RtN2AsQ9YIEA0ZtxKM+FiKbzxMOBoiHPdK5PE3xENFQkHQ2T9grrKdw3YMjk8uT
yzsa4iFqwh6j02lOT6VJZfMMTqSIh4N0NMRYVx8hGgoWvu1NpRlPZsjnCz24YuEgrbURJpNZhqdS
tNVGz8w/H+ccubltfXMoqEUuQihY6E9eEyn8V9reEWJ7R2m34Vwh3KKh4JntJTM5xhKFk6BTqSzZ
fOEoPxYO0tkYOycURqbS1MdCBAOFo95T40kiXhAvaMTDwcI3hFye4ckUR4enOT6aIB4uBNmk3zyU
zTnGk4VvFBEvQE3Eoz4a4sRYgv29o5waT+EFC4GXz8PxsQQj02mePTJM3kF+1jcYL2BnvkVA4eIw
z68tGgqyvj5CNu/86xCyDIwn2dpWixcwRvrTDE2mFw22lTAT7sVqiodo8M/lAEwks2c+5LOLrEdN
HyKyIqZSWbygEQoEGJ5KEw0VTvTGwwsfL2Zzebxg4LxxGf/ovC5aOFLO+R84iXSOnHNMJrOMJgpH
uoGAMZ3KEgwY48ksqWyOUCBAMps7880kHDTCXoBMzjGWyDCdztIUDxeO4L0ALTVhJvwL206MJklk
crTWhmlvKJyUDgaMdDbPdDrHwHiS2ohHa12Yock0A2NJTo4nOT2VPtOc1BgL4YCAQW0kxIPvfqua
PkSksma+dQBn2smLMTekZ8Z5QYiFC98cLtSssJo8uMA0XQcsIlLlFNQiIlVOQS0iUuUU1CIiVU5B
LSJS5RTUIiJVTkEtIlLlFNQiIlVOQS0iUuUU1CIiVa6ooDaz28zsF2b2hpk9VO6iRETkrEWD2syC
wBeB24GrgN8ws6vKXZiIiBQUc0R9PfCGc+6Icy4NPAZ8oLxliYjIjGLuntcJ9M76vQ/4d3NnMrM9
wB7/15SZHbz48iqqFRiqdBElsBb2Q/tQPdbCflTrPmy60ISS3ebUOfcw8DCAmfVc6L6qq8Va2AdY
G/uhfagea2E/VuM+FNP00Q9snPV7lz9ORERWQDFB/XPgLWa2xczCwL3Ad8pbloiIzFi06cM5lzWz
3wd+AASBR5xzLy+y2MOlKK7C1sI+wNrYD+1D9VgL+7Hq9qEsz0wUEZHS0ZWJIiJVTkEtIlLlShrU
q+lSczM7amYvmdl+M+vxxzWb2Y/M7HX/tckfb2b2eX+/DpjZdRWs+xEzOzW7n/py6jazj/vzv25m
H6+CffiMmfX778d+M7tj1rQ/8PfhF2b267PGV/Tvzcw2mtmTZvaKmb1sZg/441fN+7HAPqyq98PM
omb2nJm96O/HZ/3xW8zsWb+mf/Q7RGBmEf/3N/zpmxfbv4pyzpXkh8KJxsPAViAMvAhcVar1l/oH
OAq0zhn3p8BD/vBDwJ/4w3cA3wcMuAF4toJ13wxcBxxcbt1AM3DEf23yh5sqvA+fAf7LPPNe5f8t
RYAt/t9YsBr+3oAO4Dp/uOhCkVIAAAMqSURBVA54za931bwfC+zDqno//H/TWn84BDzr/xt/A7jX
H/9l4BP+8O8BX/aH7wX+caH9W8m/q/l+SnlEvRYuNf8A8Pf+8N8DH5w1/h9cwc+ARjPrqESBzrmn
gdNzRi+17l8HfuScO+2cGwF+BNxW/uoLLrAPF/IB4DHnXMo59ybwBoW/tYr/vTnnTjjnnveHJ4BD
FK7kXTXvxwL7cCFV+X74/6aT/q8h/8cB7wL2+uPnvhcz79Fe4FYzMy68fxVVyqCe71Lzhd7wSnPA
D81snxUufwdY75w74Q+fBNb7w9W+b0utu1r35/f9JoFHZpoLWCX74H91fhuFI7lV+X7M2QdYZe+H
mQXNbD9wisKH3WFg1DmXnaemM/X608eAFqpgP+ZzKZ9MvMk5dx2FuwJ+0sxunj3RFb4Hrbq+i6u1
buBLwDZgF3AC+LPKllM8M6sFvgU86Jwbnz1ttbwf8+zDqns/nHM559wuCldPXw9cWeGSSqaUQb2q
LjV3zvX7r6eAxym8sQMzTRr+6yl/9mrft6XWXXX745wb8P+j5YGvcPbrZlXvg5mFKATc15xz3/ZH
r6r3Y759WK3vB4BzbhR4EngHhealmQv7Ztd0pl5/egMwTBXtx2ylDOpVc6m5mdWYWd3MMPAe4CCF
emfOuH8ceMIf/g7wW/5Z+xuAsVlfbavBUuv+AfAeM2vyv9K+xx9XMXPa/D9E4f2Awj7c65+l3wK8
BXiOKvh789s0vwoccs79+axJq+b9uNA+rLb3w8zazKzRH44B76bQ3v4kcJc/29z3YuY9ugv4V//b
z4X2r7JKeWaSwlnt1yi0Df1hJc+SLlLnVgpndl8EXp6plUIb1b8ArwM/Bprd2TPKX/T36yWgu4K1
P0rhq2iGQvvZ7yynbuC3KZwoeQO4rwr24X/7NR6g8J+lY9b8f+jvwy+A26vl7w24iUKzxgFgv/9z
x2p6PxbYh1X1fgA7gRf8eg8Cf+SP30ohaN8AvglE/PFR//c3/OlbF9u/Sv7oEnIRkSp3KZ9MFBFZ
FRTUIiJVTkEtIlLlFNQiIlVOQS0iUuUU1CIiVU5BLSJS5f4/Sljlo1+EreoAAAAASUVORK5CYII=
" />
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Better model found at epoch 1 with valid_loss value: 0.9542644023895264.
Better model found at epoch 2 with valid_loss value: 0.8755126595497131.
Better model found at epoch 3 with valid_loss value: 0.8288350701332092.
Better model found at epoch 4 with valid_loss value: 0.7948615550994873.
Better model found at epoch 5 with valid_loss value: 0.7777946591377258.
Better model found at epoch 6 with valid_loss value: 0.7700592279434204.
Better model found at epoch 7 with valid_loss value: 0.7698812484741211.
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Evaluate-your-model">Evaluate your model<a class="anchor-link" href="#Evaluate-your-model"> </a></h1>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Let us now evaluate your trained model on some of your validation set to see how well it generates comments.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">get_predictions</span><span class="p">(</span><span class="n">learn</span><span class="p">,</span> <span class="n">ds_type</span><span class="o">=</span><span class="n">DatasetType</span><span class="o">.</span><span class="n">Valid</span><span class="p">):</span>
<span class="n">learn</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">eval</span><span class="p">()</span>
<span class="n">inputs</span><span class="p">,</span> <span class="n">targets</span><span class="p">,</span> <span class="n">outputs</span> <span class="o">=</span> <span class="p">[],[],[]</span>
<span class="k">with</span> <span class="n">torch</span><span class="o">.</span><span class="n">no_grad</span><span class="p">():</span>
<span class="k">for</span> <span class="n">xb</span><span class="p">,</span><span class="n">yb</span> <span class="ow">in</span> <span class="n">progress_bar</span><span class="p">(</span><span class="n">learn</span><span class="o">.</span><span class="n">dl</span><span class="p">(</span><span class="n">ds_type</span><span class="p">)):</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">learn</span><span class="o">.</span><span class="n">model</span><span class="p">(</span><span class="o">*</span><span class="n">xb</span><span class="p">)</span>
<span class="k">for</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">z</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">xb</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="n">xb</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span><span class="n">out</span><span class="p">):</span>
<span class="n">inputs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">learn</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">train_ds</span><span class="o">.</span><span class="n">x</span><span class="o">.</span><span class="n">reconstruct</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">cpu</span><span class="p">()))</span>
<span class="n">targets</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">learn</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">train_ds</span><span class="o">.</span><span class="n">y</span><span class="o">.</span><span class="n">reconstruct</span><span class="p">(</span><span class="n">y</span><span class="o">.</span><span class="n">cpu</span><span class="p">()))</span>
<span class="n">outputs</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">learn</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">train_ds</span><span class="o">.</span><span class="n">y</span><span class="o">.</span><span class="n">reconstruct</span><span class="p">(</span><span class="n">z</span><span class="o">.</span><span class="n">cpu</span><span class="p">()</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="mi">1</span><span class="p">)))</span>
<span class="k">return</span> <span class="n">inputs</span><span class="p">,</span> <span class="n">targets</span><span class="p">,</span> <span class="n">outputs</span>
<span class="n">inputs</span><span class="p">,</span> <span class="n">targets</span><span class="p">,</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">get_predictions</span><span class="p">(</span><span class="n">learn</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">print_results</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">targets</span><span class="p">,</span> <span class="n">outputs</span><span class="p">,</span> <span class="n">method_spm</span><span class="p">,</span> <span class="n">comment_spm</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">10</span><span class="p">):</span>
<span class="sd">"""Just a little helper function for printing out the results from your model."""</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Input:"</span><span class="p">,</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">decode_spec_tokens</span><span class="p">(</span><span class="n">method_spm</span><span class="o">.</span><span class="n">DecodePieces</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">inputs</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">" "</span><span class="p">))</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">" "</span><span class="p">))),</span> <span class="s2">"</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Target:"</span><span class="p">,</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">decode_spec_tokens</span><span class="p">(</span><span class="n">comment_spm</span><span class="o">.</span><span class="n">DecodePieces</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">targets</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">" "</span><span class="p">))</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">" "</span><span class="p">))),</span> <span class="s2">"</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Predicted:"</span><span class="p">,</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">decode_spec_tokens</span><span class="p">(</span><span class="n">comment_spm</span><span class="o">.</span><span class="n">DecodePieces</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">outputs</span><span class="p">[</span><span class="n">i</span><span class="p">])</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">" "</span><span class="p">))</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s2">" "</span><span class="p">))),</span> <span class="s2">"</span><span class="se">\n</span><span class="s2">"</span><span class="p">)</span>
<span class="n">print_results</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">targets</span><span class="p">,</span> <span class="n">outputs</span><span class="p">,</span> <span class="n">method_spm</span><span class="p">,</span> <span class="n">comment_spm</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Input: xxbos @doesservicerequest private void putrangeinternal(final filerange range, final filerangeoperationtype operationtype, final byte[] data, final long length, final String md5, final accesscondition accesscondition, final filerequestoptions options, final operationcontext opcontext) throws storageexception { executionengine.executewithretry(this.fileserviceclient, this, putrangeimpl(range, operationtype, data, length, md5, accesscondition, options, opcontext), options.getretrypolicyfactory(), opcontext); } xxeos
Target: xxbos Used for both uploadrange and clearrange. xxeos
Predicted: xxbos Put to creating thes( s( xxeos
Input: xxbos public static byte[] encodesequence(byte[]... encodedvalues) { int length = 0; for (byte[] encodedvalue : encodedvalues) { length += encodedvalue.length; } byte[] lengthencoded = encodelength(length); bytearraydataoutput out = bytestreams.newdataoutput(1 + lengthencoded.length + length); out.write(sequence_tag); out.write(lengthencoded); for (byte[] entry : encodedvalues) { out.write(entry); } return out.tobytearray(); } xxeos
Target: xxbos Encodes a sequence of encoded values. xxeos
Predicted: xxbos Encodes a byte of bytes bytes into xxeos
Input: xxbos @override public String dnsresolveex(string host) { stringbuilder result = new stringbuilder(); try { inetaddress[] list = inetaddress.getallbyname(host); for (inetaddress inetaddress : list) { result.append(inetaddress.gethostaddress()); result.append("; "); } } catch (unknownhostexception e) { log.log(level.fine, "DNS name not resolvable {0}.", host); } return result.tostring(); } xxeos
Target: xxbos *********************************************************************** dnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveexdnsresolveex xxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeosxxeos
Predicted: xxbos xxmap 51 * namepp =pxxeos
Input: xxbos protected void removeallfromattributevalueset() { final collection<abstracthtml5sharedobject> sharedobjects = getsharedobjects(); boolean listenerinvoked = false; final collection<writelock> writelocks = lockandgetwritelocks(); try { getattributevalueset().clear(); setmodified(true); invokevaluechangelisteners(sharedobjects); listenerinvoked = true; } finally { for (final Lock lock : writelocks) { lock.unlock(); } } pushqueues(sharedobjects, listenerinvoked); } xxeos
Target: xxbos clears all values from the value set. xxeos
Predicted: xxbos s all attribute from the xxeos
Input: xxbos public void registercheckwithnotes(string checkid, String name, String script, long interval, @suppresswarnings("sameparametervalue") String notes) { Check check = new Check(); check.setid(checkid); check.setname(name); check.setscript(script); check.setinterval(string.format(" ⁇ ss", interval)); check.setnotes(notes); registercheck(check); } xxeos
Target: xxbos Registers a Health Check with the Agent. xxeos
Predicted: xxbos Registers a xxupj ealth checkxxmaj check a givenxxmaj name xxeos
Input: xxbos public void assertequalsignoringcase(@nullable Description description, @nullable String actual, @nullable String expected) { if (!areequalignoringcase(actual, expected)) { String format = "expecting:< ⁇ s> to be equal to:< ⁇ s>, ignoring case considerations"; throw failures.failure(description, new basicerrormessagefactory(format, actual, expected)); } } xxeos
Target: xxbos Verifies that two s are equal, ignoring case considerations. xxeos
Predicted: xxbos Assert that the stringsxx are equal, oring the... xxeos
Input: xxbos protected cronschedulebuilder createcronschedulebuilder(string cronexpr) { int i = cronexpr.indexof("["); int j = cronexpr.indexof("]"); timezone timezone = defaulttimezone; if (i > -1 && j > -1) { timezone = timezone.gettimezone(cronexpr.substring(i+1, j)); cronexpr = cronexpr.substring(0, i).trim(); } return cronschedulebuilder.cronschedule(cronexpr).intimezone(timezone); } xxeos
Target: xxbos Allow timezone to be configured on a per-cron basis with [timezonename] appended to the cron format xxeos
Predicted: xxbos Create to to create used to the mtimei-s of a0],]. to the givenath.xxeos
Input: xxbos private <T> fakeencodeditem readnextitem(class<t> clazz) { fakeencodeditem item = data[dataposition]; if (item == null) { / / While Parcel will treat these as zeros, in tests, this is almost always an error. throw new unreliablebehaviorerror("reading uninitialized data at position " + dataposition); } checkconsistentreadandincrementposition(clazz, item); return item; } xxeos
Target: xxbos Reads a complete item in the byte buffer. xxeos
Predicted: xxbos Read the from the given array. xxeos
Input: xxbos private void hidesuggestionsifnecessary(final @nonnull querytoken currentquery, final @nonnull tokensource source) { String queryts = currentquery.gettokenstring(); String currentts = source.getcurrenttokenstring(); if (!iswaitingforresults(currentquery) && queryts != null && queryts.equals(currentts)) { msuggestionsvisibilitymanager.displaysuggestions(false); } } xxeos
Target: xxbos Hides the suggestions if there are no more incoming queries. xxeos
Predicted: xxbos Check the givenion of of the is no more . xxeos
Input: xxbos public list<uirow> getvalues() throws efapsexception { list<uirow> ret = new arraylist<>(); if (isfiltered()) { for (final uirow row : this.values) { boolean filtered = false; for (final tablefilter filter : this.filters.values()) { filtered = filter.filterrow(row); if (filtered) { break; } } if (!filtered) { ret.add(row); } } } else { ret = this.values; } setsize(ret.size()); return ret; } xxeos
Target: xxbos This is the getter method for the instance variable . xxeos
Predicted: xxbos Returns method a first method for the row of. xxeos
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>This is great and all. However, you can see the text looks a bit off: the model sometimes starts generating one word and then switches halfway through. This is because you are currently generating tokens using <a href="https://towardsdatascience.com/what-is-teacher-forcing-3da6217fed1c">Teacher Forcing</a>, which means you are giving the model the ground truth for what it should have produced even when it did not. This is very helpful during training; however, it requires having both the x and the y of an input. In a real-world setting, you obviously aren't going to be given the y!</p>
<p>Therefore, I found a hacky way of bypassing the need for the y. It uses a fake y: an array filled with ones that is updated every time the model makes a prediction and is then fed back into the model, so that the model knows what it has generated so far.</p>
<p><strong>Heads Up:</strong> The way I coded this is extremely inefficient, so generating predictions will take a long time. Therefore, I recommend only generating a few comments (I set it up to only do 10).</p>
<p><strong>TODO For You:</strong> Come up with a more efficient solution that performs similarly to the Teacher Forcing approach of the above code.</p>
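<p>To make that hack concrete, here is a minimal, framework-free sketch of the decoding loop. Everything in it (the <code>toy_model</code> function and the token ids) is made up for illustration; in the notebook, the real Sequence to Sequence model plays the role of <code>toy_model</code> and the fake y lives on the GPU as a tensor.</p>

```python
# Hypothetical sketch (not the notebook's exact code): decode without
# teacher forcing by starting from a fake "y" filled with ones and
# feeding the model's own predictions back in, one position at a time.

def toy_model(x, y_so_far):
    # Stand-in for the real Seq2Seq model: at each position i it
    # "predicts" x[i] + 1. A real model would attend over y_so_far.
    return [tok + 1 for tok in x]

def greedy_decode(x, max_seq=8):
    res = [1] * max_seq              # fake y: all ones, like torch.zeros(...) + 1
    for i in range(max_seq - 1):
        outs = toy_model(x, res)     # model sees its own partial output
        res[i + 1] = outs[i]         # keep only the prediction for step i
    return res

print(greedy_decode([5, 6, 7, 8, 9, 10, 11, 12]))  # → [1, 6, 7, 8, 9, 10, 11, 12]
```

<p>The key point is that position <code>i + 1</code> of the fake y is filled in only after the model has produced positions up to <code>i</code> of its own output.</p>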
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>P.S.</p>
<p>For other <a href="https://docs.fast.ai/text.learner.html#LanguageLearner.predict">language learners</a> provided by FastAI, you can simply use the <code>predict</code> function: pass in some text and ask the model to predict the next set of tokens. However, I have been unsuccessful in implementing this <code>predict</code> function for Sequence to Sequence models. So, another <strong>TODO For You</strong> is to see if you can implement a <code>predict</code> function for Sequence to Sequence models so that you can easily generate comments for any method you pass to it!</p>
<p>If you do figure out a way to do this, I would be extremely interested! So, feel free to leave a comment about it.</p>
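<p>As a starting point for that TODO, here is a toy sketch of what such a <code>predict</code> function could look like: numericalize the text, run the same fill-in-the-fake-y loop, then turn the ids back into text. The vocabulary and <code>model_step</code> below are made-up stand-ins, not the FastAI API.</p>

```python
# Hypothetical predict() sketch for a Seq2Seq model. `vocab` and
# `model_step` are toy stand-ins; a real version would use the learner's
# vocabulary and model instead.

vocab = ["xxpad", "xxbos", "returns", "the", "value", "xxeos"]
stoi = {tok: i for i, tok in enumerate(vocab)}

def model_step(src_ids, out_ids):
    # Toy "model": preds[i] is the token predicted to follow position i.
    # Here it always emits the same canned comment, for illustration.
    canned = [stoi[t] for t in ["returns", "the", "value", "xxeos"]]
    return canned + [0] * (len(out_ids) - len(canned))

def predict(text, max_seq=6):
    src_ids = [stoi.get(tok, 0) for tok in text.split()]
    out = [stoi["xxbos"]] * max_seq          # fake y, seeded with xxbos
    for i in range(max_seq - 1):
        preds = model_step(src_ids, out)
        out[i + 1] = preds[i]
        if out[i + 1] == stoi["xxeos"]:      # stop once the model emits eos
            break
    return " ".join(vocab[t] for t in out[: i + 2])

print(predict("public int getvalue ( )"))   # → xxbos returns the value xxeos
```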
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell rendered">
<details class="description" open="">
<summary class="btn btn-sm" data-open="Hide Code" data-close="Show Code"></summary>
<p><div class="input">
<div class="inner_cell">
<div class="input_area">
<div class=" highlight hl-ipython3"><pre><span></span><span class="k">def</span> <span class="nf">get_preds</span><span class="p">(</span><span class="n">learn</span><span class="p">,</span> <span class="n">db_tst</span><span class="p">,</span> <span class="n">max_seq</span> <span class="o">=</span> <span class="mi">128</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">10</span><span class="p">):</span>
<span class="n">learn</span><span class="o">.</span><span class="n">model</span><span class="o">.</span><span class="n">eval</span><span class="p">()</span>
<span class="n">inpts</span><span class="p">,</span> <span class="n">trgts</span><span class="p">,</span> <span class="n">preds</span> <span class="o">=</span> <span class="p">[],</span> <span class="p">[],</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">xb</span><span class="p">,</span><span class="n">yb</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">progress_bar</span><span class="p">(</span><span class="n">db_tst</span><span class="o">.</span><span class="n">dl</span><span class="p">(</span><span class="n">DatasetType</span><span class="o">.</span><span class="n">Train</span><span class="p">))):</span>
<span class="k">if</span> <span class="n">i</span> <span class="o">>=</span> <span class="n">n</span><span class="p">:</span> <span class="k">break</span>
<span class="n">res</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">xb</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">max_seq</span><span class="p">,</span> <span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">device</span><span class="p">(</span><span class="s1">'cuda'</span><span class="p">))</span><span class="o">.</span><span class="n">long</span><span class="p">()</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_seq</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span>
<span class="n">outs</span> <span class="o">=</span> <span class="n">learn</span><span class="o">.</span><span class="n">model</span><span class="p">(</span><span class="n">xb</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">res</span><span class="p">)</span>
<span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">out</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">outs</span><span class="p">):</span>
<span class="n">res</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">out</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="mi">1</span><span class="p">)[</span><span class="n">i</span><span class="p">]</span>
<span class="k">for</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">xb</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">yb</span><span class="p">,</span> <span class="n">res</span><span class="p">):</span>
<span class="n">inpts</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">learn</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">train_ds</span><span class="o">.</span><span class="n">x</span><span class="o">.</span><span class="n">reconstruct</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">cpu</span><span class="p">())))</span>
<span class="n">trgts</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">db_tst</span><span class="o">.</span><span class="n">train_ds</span><span class="o">.</span><span class="n">y</span><span class="o">.</span><span class="n">reconstruct</span><span class="p">(</span><span class="n">y</span><span class="o">.</span><span class="n">cpu</span><span class="p">())))</span>
<span class="n">preds</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">learn</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">train_ds</span><span class="o">.</span><span class="n">y</span><span class="o">.</span><span class="n">reconstruct</span><span class="p">(</span><span class="n">z</span><span class="o">.</span><span class="n">cpu</span><span class="p">())))</span>
<span class="k">return</span> <span class="n">inpts</span><span class="p">,</span> <span class="n">trgts</span><span class="p">,</span> <span class="n">preds</span>
<span class="n">inputs</span><span class="p">,</span> <span class="n">targets</span><span class="p">,</span> <span class="n">outputs</span> <span class="o">=</span> <span class="n">get_preds</span><span class="p">(</span><span class="n">learn</span><span class="p">,</span> <span class="n">db_tst</span><span class="p">)</span>
<span class="n">print_results</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">targets</span><span class="p">,</span> <span class="n">outputs</span><span class="p">,</span> <span class="n">method_spm</span><span class="p">,</span> <span class="n">comment_spm</span><span class="p">)</span>
</pre></div>
</div>
</div>
</div>
</p>
</details>
<div class="output_wrapper">
<div class="output">
<div class="output_area">
<div class="output_html rendered_html output_subarea ">
<div>
<style>
/* Turns off some styling */
progress {
/* gets rid of default border in Firefox and Opera. */
border: none;
/* Needs to be in here for Safari polyfill so background images work as expected. */
background-size: auto;
}
.progress-bar-interrupted, .progress-bar-interrupted::-webkit-progress-bar {
background: #F44336;
}
</style>
<progress value='10' class='' max='149' style='width:300px; height:20px; vertical-align: middle;'></progress>
6.71% [10/149 01:14<17:21]
</div>
</div>
</div>
<div class="output_area">
<div class="output_subarea output_stream output_stdout output_text">
<pre>Input: xxbos protected String parseunquotedstringcontent() { final int startndx = ndx; while (true) { final char c = input[ndx]; if (c <= ' ' || charutil.equalsone(c, UNQUOTED_DELIMETERS)) { final int currentndx = ndx; / / done skipwhitespaces(); return new String(input, startndx, currentndx - startndx); } ndx++; } } xxeos
Target: xxbos Parses un-quoted string content. xxeos
Predicted: xxbos Parses the text from the HTML text. xxeos
Input: xxbos private static void checkfilecopy(final File srcfile, final File destfile) throws ioexception { checkexists(srcfile); checkisfile(srcfile); if (equals(srcfile, destfile)) { throw new ioexception("files '" + srcfile + "' and '" + destfile + "' are equal"); } File destparent = destfile.getparentfile(); if (destparent != null && !destparent.exists()) { checkcreatedirectory(destparent); } } xxeos
Target: xxbos Checks that file copy can occur. xxeos
Predicted: xxbos Checks if the file is a file. xxeos
Input: xxbos long analyze() { Arc a; Arc aa; if (pre.outs == null) { return flags.reg_uimpossible; } for (a = pre.outs; a != null; a = a.outchain) { for (aa = a.to.outs; aa != null; aa = aa.outchain) { if (aa.to == post) { return flags.reg_uemptymatch; } } } return 0; } xxeos
Target: xxbos analyze - ascertain potentially-useful facts about an optimized NFA xxeos
Predicted: xxbos Returns the Syoooo Syna Syna Syna Sa Sa Sa Sa Sa Sa Sa Sa syna Syna xxeos
Input: xxbos @suppresswarnings("unchecked") public REC next() { checkdirection(true); orecord record; / / ITERATE UNTIL THE NEXT GOOD RECORD while (hasnext()) { / / FOUND if (currentrecord != null) { try { return (REC) currentrecord; } finally { currentrecord = null; } } record = gettransactionentry(); if (record != null) return (REC) record; } return null; } xxeos
Target: xxbos Return the element at the current position and move forward the cursor to the next position available. xxeos
Predicted: xxbos Returns the next record in the queue. xxeos
Input: xxbos public static void addtransitivematches(hollowreadstateengine stateengine, map<string, bitset> matches) { list<hollowschema> schemalist = hollowschemasorter.dependencyorderedschemalist(stateengine); collections.reverse(schemalist); for(hollowschema schema : schemalist) { bitset currentmatches = matches.get(schema.getname()); if(currentmatches != null) { addtransitivematches(stateengine, schema.getname(), matches); } } } xxeos
Target: xxbos Augment the given selection by adding the references, and the <i>transitive< / i> references, of our selection. xxeos
Predicted: xxbos Add a variable to the list of s. xxeos
Input: xxbos protected void resolvenestedproperties(final beanproperty bp) { String name = bp.name; int dotndx; while ((dotndx = indexofdot(name)) != -1) { bp.last = false; bp.setname(name.substring(0, dotndx)); bp.updatebean(getindexproperty(bp)); name = name.substring(dotndx + 1); } bp.last = true; bp.setname(name); } xxeos
Target: xxbos Resolves nested property name to the very last indexed property. If forced, <code>null< / code> or non-existing properties will be created. xxeos
Predicted: xxbos Resolve the property name. xxeos
Input: xxbos public xmlconfig declarenamespace(string prefix, String namespaceuri) { validate.notempty(prefix, "prefix cannot be empty"); validate.notempty(namespaceuri, "namespace URI cannot be empty"); map<string, String> updatednamespaces = new hashmap<string, string>(declarednamespaces); updatednamespaces.put(prefix, namespaceuri); return new xmlconfig(features, updatednamespaces, properties, validating, true, allowdoctypedeclaration, true); } xxeos
Target: xxbos Declares a namespace and also sets } to <code>true< / code>. <p / > <p>note that you cannot use this to add namespaces for the matcher. This has to be done by providing a to the matcher instance.< / p> xxeos
Predicted: xxbos Creates a new instance of the given name and the given name. xxeos
Input: xxbos protected static int getshadowradius(drawable shadow, Drawable circle) { int radius = 0; if (shadow != null && circle != null) { Rect rect = new Rect(); radius = (circle.getintrinsicwidth() + (shadow.getpadding(rect) ? rect.left + rect.right : 0)) / 2; } return Math.max(1, radius); } xxeos
Target: xxbos Calculates required radius of shadow. xxeos
Predicted: xxbos Get the Syyyy Syyy Syyy Syna Syna Syna Syna Syna Sna Syna Syna Syna xxeos
Input: xxbos void addadviceclinitmethod(final String name) { if (adviceclinits == null) { adviceclinits = new arraylist<>(); } adviceclinits.add(name); } xxeos
Target: xxbos Saves used static initialization blocks (clinit) of advices. xxeos
Predicted: xxbos Adds a Java class to the Saa Sa Sa Syyetch Bean. xxeos
Input: xxbos public static String padleft(string s, int desiredlength, String padstring) { while (s.length() < desiredlength) { s = padstring + s; } return s; } xxeos
Target: xxbos Pad the given string with padstring on the left up to the given length. xxeos
Predicted: xxbos Compares two strings, and returns the first character in the string. xxeos
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>Not too shabby if I do say so myself! The model seems to actually be learning what a comment is supposed to contain when documenting what a method does. Of course, there are a lot of tweaks you could make, such as adding the ability to generate inline comments instead of just method-level ones, using more data, using different sampling schemes for generating the comments such as <a href="https://towardsdatascience.com/how-to-sample-from-language-models-682bceb97277">top-k or nucleus sampling</a>, and any other awesome things you can think of! If you do, feel free to leave a comment about your adventure.</p>
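<p>For a taste of what that would involve, here is a hedged sketch of top-k sampling (this is not the notebook's code): instead of always taking the argmax at each decoding step, keep only the k most probable tokens and sample among them, which helps avoid the repetitive loops visible in some of the predictions above.</p>

```python
import random

# Rough top-k sampling sketch. `probs` is a made-up (token, probability)
# list for a single decoding step; a real decoder would get these from
# the model's softmax output.

def top_k_sample(probs, k=3, rng=random):
    top = sorted(probs, key=lambda p: p[1], reverse=True)[:k]
    total = sum(p for _, p in top)           # renormalize over the top k
    r = rng.random() * total
    for tok, p in top:
        r -= p
        if r <= 0:
            return tok
    return top[-1][0]                        # guard against float rounding

probs = [("the", 0.5), ("a", 0.3), ("returns", 0.15), ("xxeos", 0.05)]
print(top_k_sample(probs, k=2))              # always "the" or "a", never the tail
```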
<p><strong>Tip:</strong>
I did a lot of fiddling to get this to work, and most of my models ended up overfitting. The way I fixed this was being more careful about how I cleaned the data and increasing the dataset size. I know this seems simple, but it is quite effective.</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="Conclusion">Conclusion<a class="anchor-link" href="#Conclusion"> </a></h1><p>In this tutorial, you created an Automatic Code Comment Generator! You learned how to clean, explore, and process data, and how to use the awesome PyTorch and FastAI libraries to define and train the Transformer architecture. The use of Deep Learning in the field of Software Engineering is what I am studying for my Ph.D., so I hope I have inspired you to think about some other ways you could use Deep Learning to help Software Engineering!</p>
<p>I hope you enjoyed this tutorial. Look out for future blog posts from me about all kinds of topics, as I have set myself a challenge to learn new things this year!
<center>
<div class="jekyll-twitter-plugin"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">Okay, I'm posing a <a href="https://twitter.com/hashtag/challenge?src=hash&ref_src=twsrc%5Etfw">#challenge</a> to myself: Each month devote an hour a day to learning about a subject I am weak at. At the end of the month, post a blog summarizing what you've learned.<br /><br />March is devoted to <a href="https://twitter.com/hashtag/neuroscience?src=hash&ref_src=twsrc%5Etfw">#neuroscience</a>! Already signed up for <a href="https://twitter.com/hashtag/edX?src=hash&ref_src=twsrc%5Etfw">#edX</a> course!<a href="https://twitter.com/hashtag/ChallengeAccepted?src=hash&ref_src=twsrc%5Etfw">#ChallengeAccepted</a> 😎</p>— Nathan Cooper (@ncooper57) <a href="https://twitter.com/ncooper57/status/1235408134904086529?ref_src=twsrc%5Etfw">March 5, 2020</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</div>
</center>
</p>
</div>
</div>
</div>
<div class="cell border-box-sizing text_cell rendered"><div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="CodeSearchNet-citation">CodeSearchNet citation<a class="anchor-link" href="#CodeSearchNet-citation"> </a></h1>
<pre><code>
@article{husain_codesearchnet_2019,
title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
shorttitle = {{CodeSearchNet} {Challenge}},
url = {http://arxiv.org/abs/1909.09436},
urldate = {2020-03-12},
journal = {arXiv:1909.09436 [cs, stat]},
author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
month = sep,
year = {2019},
note = {arXiv: 1909.09436},
}
</code></pre>
</div>
</div>
</div>
</div>
</p></div></div></div></div></div></div>Awesome Things I Learned Creating My Own Website2020-02-03T00:00:00-06:002020-02-03T00:00:00-06:00https://nathancooper.io/i-am-a-nerd/website/awesome/2020/02/03/Awesome-Things-I-Learned-Creating-My-Own-Website<p>Hello, Solar System! (As a space faring civilization, I feel it only customary we update our greetings to reflect such awesome accomplishments 🤓) I am a nerd and hopefully you are as well.</p>
<p>This post goes over many of the awesome technologies, resources, and overall tips and tricks I learned while creating my own personal website! This post is <strong>NOT</strong> a tutorial, mostly because there are tons of existing ones on how to create a website, and you don’t want to create <em>a website</em> or even <em>my website</em> (though mine is pretty awesome); you want to create <em>your own website</em>. For me, this came from a lot of trial and error and tons of random google searches to figure out some niche feature I wanted to add. So, this post centralizes all of the niche features that went into my website in case any of you out there want to personalize some for your own website. It is not really made to be gone through end to end, but rather for you to pick out the pieces that resonate best with you and give you inspiration for your own website.</p>
<p>Let’s get some of the boring stuff out of the way first, namely these are the main components that my website is made out of:</p>
<ul>
<li>
<p><a href="https://reactjs.org/">ReactJS</a> - using <a href="https://github.com/facebook/create-react-app">create-react-app</a> and</p>
</li>
<li>
<p><a href="https://material-ui.com/">Material-UI</a> - for the style points 😎 (Sadly not as delicious as brownie points)</p>
</li>
<li>
<p><a href="https://pages.github.com/">GitHub Pages</a> - for hosting the static site (ty GitHub <3), requires you to use gh-pages and set up your create-react-app correctly. Here is some <a href="https://create-react-app.dev/docs/deployment/#github-pages">documentation</a> on how that is done</p>
</li>
<li>
<p><a href="https://github.com/">GitHub</a> and GitHub Pages - for storing my projects and for storing the web demos of my projects (Sounds epic, right?!?!? More on this later)</p>
</li>
<li>
<p><a href="https://redux.js.org/">Redux</a> - for storing and updating state of ReactJS thingies like lists of current projects and blog posts (More on this later)</p>
</li>
<li>
<p><a href="https://github.com/rexxars/react-markdown">ReactMarkdown</a> - for rendering my blog posts, which, you guessed it, are markdown files!</p>
</li>
<li>
<p>Paperclips - I couldn’t afford duct tape :(</p>
</li>
</ul>
<h1 id="niche-features-and-tips">Niche Features and Tips</h1>
<h2 id="resume-of-coding-projects">Resume of Coding Projects:</h2>
<p>The core motivation behind my website was that I wanted a cool way to show off some of the projects I’ve created over the years. I’ve seen how others create theirs and actually got inspired in part by <a href="https://nmarch213.github.io/Portfolio/#/projects">https://nmarch213.github.io/Portfolio/#/projects</a>. However, the issue is that these displays of projects need to be manually created, and that was a problem for me because, like most developers, I am lazy. I wanted a way where I could just create new projects and they would be automatically added in the correct format, including images, titles, descriptions, etc. I could just redirect users to my GitHub page, where I place all of my coding projects, but that seemed like a cop out and did not allow for any customization. So, like any good programmer, I made a scraper that took the output from GitHub and converted it into a format for my own usage 😊. This gave my website the ability to update the list of projects automatically as I add new repositories, with the title of each project determined by the repository’s name. To add an image to be displayed as the project’s logo, I just add an <code class="language-plaintext highlighter-rouge">icon.png</code> file to the root of the project and grab the icon from there when displaying the list.</p>
<p>To scrape all this information, I use the amazing <a href="https://developer.github.com/v3/">GitHub API v3</a> that GitHub provides. This API offers a ton of useful features, but for scraping my projects I specifically used the <a href="https://developer.github.com/v3/repos/#list-user-repositories">Repositories API</a>. It also returns information like the repository’s description, so you could include that in your list of projects automatically if you so choose. The GitHub API v3 has a bunch of awesome functionality; another API I use is the <a href="https://developer.github.com/v3/repos/contents/#get-contents">Contents API</a> for listing out the different posts I have created (more on this later)! For integrating these APIs using ReactJS, I suggest using <a href="https://redux.js.org/">Redux</a> for storing the state, i.e., the projects and blog posts once they have returned from the GitHub API, and <a href="https://github.com/axios/axios">Axios</a> for actually making the HTTPS requests.</p>
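<p>On the website itself this happens in React with Axios, but the transformation is easy to sketch in Python against a sample of the JSON the Repositories API returns. The payload below is made up; the field names (<code>name</code>, <code>description</code>, <code>html_url</code>) match the real API response, while the <code>icon.png</code>-at-the-repo-root convention and the <code>master</code> branch in the raw URL are this site's own assumptions.</p>

```python
import json

# Rough sketch of turning a GitHub "list user repositories" response
# into project entries. The payload is a made-up sample; the field
# names (name, description, html_url) match the real API response.

sample_response = json.dumps([
    {"name": "cool-project", "description": "Does cool things",
     "html_url": "https://github.com/someuser/cool-project"},
    {"name": "tiny-tool", "description": None,
     "html_url": "https://github.com/someuser/tiny-tool"},
])

def to_project_entries(payload, user="someuser"):
    entries = []
    for repo in json.loads(payload):
        entries.append({
            "title": repo["name"],                 # title = repo name
            "description": repo["description"] or "",
            "url": repo["html_url"],
            # Assumed convention: icon.png at the repo root on master.
            "icon": f"https://raw.githubusercontent.com/{user}/{repo['name']}/master/icon.png",
        })
    return entries

print(to_project_entries(sample_response)[0]["title"])  # → cool-project
```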
<p>Besides just hosting one website using GitHub Pages, you can have one per repository. Incorporating this with the dynamic list of my coding projects, I am able to now create their own page that can serve as documentation or can even be a web demo! Currently I am not using this feature to the best of its ability, but I plan on overhauling my more put together projects so that they have at the very least some documentation using this awesome feature.</p>
<p>Overall, using very simple components offered by GitHub I am able to have my custom resume of projects that dynamically updates when I create a new one and links to documentation or a web demo of the project. Any updates to the actual project are reflected on the website without additional changes to some configuration file that contains all of the projects I have.</p>
<h2 id="blog">Blog:</h2>
<p>I now do all of my blogging using the new awesome <a href="https://github.com/fastai/fastpages">fastpages</a> library built by Hamel Husain and Jeremy Howard!</p>
<h3 id="deprecated">[Deprecated]</h3>
<p>Aside from being able to highlight my accomplishments, I wanted to be able to express myself on a multitude of topics. I never thought I would be a blogger, but a blog post by the awesome Rachel Thomas, <a href="https://medium.com/@racheltho/why-you-yes-you-should-blog-7d2544ac1045">Why you (yes, you) should blog</a>, inspired me to take it seriously. I tried Medium, but I wanted something of my own. When I saw the equally awesome Jeremy Howard discussing his work on <a href="https://github.com/fastai/fast_template">Fast Template</a>, which allows you to easily create your own personal blog using GitHub Pages and simple Markdown files, I knew the time was now to commit. While I do not directly use Fast Template, because it is a bit too rigid for the amount of customizability I like, I drew a lot of inspiration from it, namely writing blog posts as Markdown files and storing them statically on my website instead of in some weird MongoDB. To integrate this concept of using Markdown files, I needed a way to render them easily using React, which is where I found an awesomely customizable library for doing just that called <a href="https://github.com/rexxars/react-markdown">React Markdown</a>. What’s nice about React Markdown is that each of the formatting components, such as code snippets and headings, is separated into its own rendering engine, allowing you to swap out or customize them very easily. Since I quite enjoy dark theme, I found this awesome post by <a href="https://medium.com/young-developer/react-markdown-code-and-syntax-highlighting-632d2f9b4ada">Bexultan A. Myrzatayev</a> showing how you can use a custom React highlighting engine (I use <a href="https://github.com/conorhastings/react-syntax-highlighter">React Syntax Highlighter</a>) and swap out the default theme for code syntax highlighting for something like “atomOneDark”! (Obviously I chose a dark theme, because dark theme is the only theme.)</p>
<p>To write my posts, I don’t just write directly in Markdown, as that would be extremely painful. I took the advice of Jeremy Howard again from his post on <a href="https://www.fast.ai/2020/01/18/gitblog/">Syncing your blog with your PC, and using your word processor</a> and used a word processor, in my case <a href="https://www.google.com/docs/about/">Google Docs</a>. Sadly, Google Docs does not allow you to directly export your document as a Markdown file, but thankfully an awesome person named Mangini created <a href="https://github.com/mangini/gdocs2md">gdocs2md</a>, which does the conversion, handles grabbing the images, and emails the result to you!</p>
<h2 id="material-design">Material Design:</h2>
<p>As a nerd, and mostly a computer nerd, my artistic skills are not the best. However, I wanted my website to look stylish, clean, modern, cool, hip…. (words I searched google for when trying to create a pretty website). This brought me to Google; they always create such chic-looking websites and mobile apps, in my opinion obviously, so it wasn’t too surprising to learn that Google wrote the bo… well, website for designing GUI components, which they called <a href="https://material.io/">Material Design</a>. This is where Material-UI comes in. It is a complete React component implementation of the Material Design language, and it looks smooootthhhhh. Using it is quite simple, as laid out on their website, but I’m including a code snippet because I want to flex how my website is able to render code snippets, courtesy of <a href="https://github.com/rexxars/react-markdown">React Markdown</a> and <a href="https://medium.com/young-developer/react-markdown-code-and-syntax-highlighting-632d2f9b4ada">Bexultan A. Myrzatayev</a>’s awesome post on how to change the theme used in code snippets:</p>
<p>Code snippet from what a project entry looks like:</p>
<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">import</span> <span class="p">{</span>
<span class="nx">Button</span><span class="p">,</span>
<span class="nx">Card</span><span class="p">,</span>
<span class="nx">CardActions</span><span class="p">,</span>
<span class="nx">CardContent</span><span class="p">,</span>
<span class="nx">CardMedia</span><span class="p">,</span>
<span class="nx">Typography</span>
<span class="p">}</span> <span class="k">from</span> <span class="dl">"</span><span class="s2">@material-ui/core</span><span class="dl">"</span><span class="p">;</span>
<span class="err">…</span><span class="p">.</span>
<span class="kd">const</span> <span class="nx">site</span> <span class="o">=</span> <span class="nx">has_pages</span> <span class="p">?</span> <span class="p">(</span>
<span class="o"><</span><span class="nx">Button</span> <span class="nx">variant</span><span class="o">=</span><span class="dl">"</span><span class="s2">contained</span><span class="dl">"</span> <span class="nx">color</span><span class="o">=</span><span class="dl">"</span><span class="s2">secondary</span><span class="dl">"</span> <span class="nx">href</span><span class="o">=</span><span class="p">{</span><span class="nx">pages_url</span><span class="p">}</span><span class="o">></span>
<span class="nx">View</span> <span class="nx">Site</span>
<span class="o"><</span><span class="sr">/Button</span><span class="err">>
</span>
<span class="p">)</span> <span class="p">:</span> <span class="kc">null</span><span class="p">;</span>
<span class="k">return</span> <span class="p">(</span>
<span class="o"><</span><span class="nx">Card</span> <span class="nx">className</span><span class="o">=</span><span class="p">{</span><span class="nx">classes</span><span class="p">.</span><span class="nx">card</span><span class="p">}</span> <span class="nx">fullWidth</span><span class="o">></span>
<span class="o"><</span><span class="nx">CardMedia</span>
<span class="nx">className</span><span class="o">=</span><span class="p">{</span><span class="nx">classes</span><span class="p">.</span><span class="nx">media</span><span class="p">}</span>
<span class="nx">image</span><span class="o">=</span><span class="p">{</span><span class="nx">icon_src</span><span class="p">}</span>
<span class="nx">onError</span><span class="o">=</span><span class="p">{</span><span class="nx">e</span> <span class="o">=></span> <span class="p">{</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="dl">"</span><span class="s2">cannot find icon</span><span class="dl">"</span><span class="p">);</span>
<span class="p">}}</span>
<span class="sr">/</span><span class="err">>
</span>
<span class="o"><</span><span class="nx">CardContent</span><span class="o">></span>
<span class="o"><</span><span class="nx">Typography</span> <span class="nx">gutterBottom</span> <span class="nx">variant</span><span class="o">=</span><span class="dl">"</span><span class="s2">headline</span><span class="dl">"</span> <span class="nx">component</span><span class="o">=</span><span class="dl">"</span><span class="s2">h4</span><span class="dl">"</span><span class="o">></span>
<span class="p">{</span><span class="nx">_</span><span class="p">.</span><span class="nx">startCase</span><span class="p">(</span><span class="nx">_</span><span class="p">.</span><span class="nx">camelCase</span><span class="p">(</span><span class="nx">name</span><span class="p">))}</span>
<span class="o"><</span><span class="sr">/Typography</span><span class="err">>
</span>
<span class="o"><</span><span class="sr">/CardContent</span><span class="err">>
</span>
<span class="o"><</span><span class="nx">CardActions</span><span class="o">></span>
<span class="p">{</span><span class="nx">site</span><span class="p">}</span>
<span class="o"><</span><span class="nx">Button</span> <span class="nx">variant</span><span class="o">=</span><span class="dl">"</span><span class="s2">contained</span><span class="dl">"</span> <span class="nx">color</span><span class="o">=</span><span class="dl">"</span><span class="s2">primary</span><span class="dl">"</span> <span class="nx">href</span><span class="o">=</span><span class="p">{</span><span class="nx">html_url</span><span class="p">}</span><span class="o">></span>
<span class="nx">View</span> <span class="nx">Repo</span>
<span class="o"><</span><span class="sr">/Button</span><span class="err">>
</span>
<span class="o"><</span><span class="sr">/CardActions</span><span class="err">>
</span>
<span class="o"><</span><span class="sr">/Card</span><span class="err">>
</span>
<span class="p">);</span>
<span class="err">…</span><span class="p">.</span>
</code></pre></div></div>
<h2 id="development">Development:</h2>
<p>I think programming is an invaluable skill that I am constantly working to improve. So, I highly recommend that those creating their own website at least try to program it themselves. It is an adventure of pain and misery that I wish to inflict onto others, hahaha, haha, ha… But it is extremely rewarding when you finally see that beautiful glowing (please don’t make your website glow, it’s annoying) website plastered on your web browser (please let it be Chromium-based, or just basically not Internet Explorer). All programmers need some system in which to develop whatever it is they care about creating. This is where things like Integrated Development Environments (IDEs), debuggers, and testing frameworks come in handy.</p>
<p>To keep myself sane, I spent a long, arduous, and tedious time experimenting with different workflows for developing systems. And I have found the holy grail that has answered all of <em>my</em> questions and that allows for 10X greater productivity for <em>me</em> (your results will most certainly differ if you decide to use the same development setup). For me, the combination of <a href="https://code.visualstudio.com/">Visual Studio Code</a> and <a href="https://www.docker.com/">Docker</a> is the textbook definition of perfection. In particular, the <a href="https://code.visualstudio.com/docs/remote/containers">Remote Container</a> extension that some genius made. This extension allows you to connect your vscode editor to a Docker container’s file system. So, why is this so important to me? Well, Docker allows you to spin up pretty much any environment you want, such as a node server for hosting a ReactJS website :D, but most importantly it allows you to version and share these environments through Dockerfiles. This lets me specify an environment per project, so I don’t have to install and maintain all of the dependencies, which may conflict with each other, on my local machine. This is why I use Docker for pretty much everything I do, and I also quite enjoy using vscode, so being able to marry the two is absolute perfection!</p>
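<p>For the curious, a minimal Remote Container setup only takes a single file. Here is a sketch of what a <code>.devcontainer/devcontainer.json</code> might look like for a node-based site (the image tag, port, and extension ID are just assumptions for illustration, not my actual config):</p>

```json
{
  // Name shown in the vscode window when attached to the container.
  "name": "react-site",
  // Any Docker image (or a "build" entry pointing at your own Dockerfile).
  "image": "node:14",
  // Forward the dev server port so the site is reachable from the host browser.
  "forwardPorts": [3000],
  // Extensions to install inside the container.
  "extensions": ["esbenp.prettier-vscode"]
}
```

<p>With that file in place, vscode offers to reopen the folder inside the container, and everything you install lives there instead of on your machine.</p>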
<h1 id="conclusion">Conclusion</h1>
<p>So, that concludes my first blog post! I hope you are able to use some of these awesome things I learned while creating my own website. Keep a lookout for my future posts; I am planning on creating posts centered around the following topics:</p>
<ul>
<li>
<p>Machine Language Processing (MLP) - like Natural Language Processing (NLP), but for computer nerds like us 🤓.</p>
</li>
<li>
<p>Automatic Code Comment Generation using Deep Learning.</p>
</li>
</ul>
<p>Also, feel free to contact me through my custom “Contact” system integrated into my website, which uses this awesome <a href="https://github.com/dwyl/learn-to-send-email-via-google-script-html-no-server">Google Script</a> for sending emails without the need to manually set up a backend server, or on Twitter <a href="https://twitter.com/ncooper57">@ncooper57</a> :).</p>
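<p>As a rough sketch of how the client side of such a no-backend contact form works (the endpoint URL below is a placeholder, and <code>encodeForm</code>/<code>sendContactForm</code> are hypothetical names of my own, not part of the linked project): the form fields just get POSTed to your deployed Apps Script’s <code>/exec</code> URL, and the script takes care of the email.</p>

```javascript
// Placeholder: replace with the /exec URL of your own deployed Google Apps Script.
const SCRIPT_URL = "https://script.google.com/macros/s/YOUR_DEPLOYMENT_ID/exec";

// Encode form fields as a URL-encoded query string, the format the
// Apps Script reads from its request parameters on the server side.
function encodeForm(fields) {
  return Object.entries(fields)
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join("&");
}

// POST the contact form to the script endpoint; returns the fetch promise.
function sendContactForm(fields) {
  return fetch(SCRIPT_URL, {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: encodeForm(fields),
  });
}
```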