Paper results for the performance of code models on different code constructs!
Dataset Preparation
To construct our dataset, we use the GPL-3.0 subset of the codeparrot/github-code dataset: the models we are evaluating explicitly exclude GPL-licensed code from their training data, so there is less chance of data leakage. We also filter out code longer than 4096 characters, since the AST parser we use can be slow on very long snippets, and we keep only Python files, as we are evaluating Python code models.
6080
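As a rough sketch of the loading and filtering step described above: the snippet below is not the exact preprocessing used here; the field names ("code", "language", "license"), the "gpl-3.0" license label, and the use of streaming are assumptions about the dataset schema.

from datasets import load_dataset

MAX_CHARS = 4096  # long snippets make the AST parser slow

# Hedged sketch: field names and license label are assumed, not taken
# from the actual preprocessing code.
raw = load_dataset(
    "codeparrot/github-code",
    split="train",
    streaming=True,
    trust_remote_code=True,  # the dataset ships a custom loading script
)

ds = raw.filter(
    lambda ex: ex["language"] == "Python"
    and ex["license"] == "gpl-3.0"
    and len(ex["code"]) <= MAX_CHARS
)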
Next, we filter out repositories that do not have multiple files in the dataset, since we are primarily concerned with internal vs. external method call prediction performance. We then extract all method definitions and invocations and drop any examples that have no internal method invocations. We also filter out internal methods that are very common, such as get and set (a hedged sketch of this extraction step follows the repository-filtering code below). Lastly, we ensure an even representation of internal and external method calls in the dataset.
def find_duplicates(items):
    # Create an empty set to store the items that we have already seen
    seen = set()
    # Create an empty list to store the duplicates that we find
    duplicates = []
    # Loop through each item in the list
    for item in items:
        # If the item is already in the "seen" set, then it must be a duplicate
        if item in seen:
            # Add the duplicate to the list
            duplicates.append(item)
        # If the item is not in the "seen" set, then add it to the set
        else:
            seen.add(item)
    # Return the list of duplicates
    return duplicates


repo_names = find_duplicates(filtered_ds["repository_name"])
repo_files = {}
for repo_name in repo_names:
    rows_w_repo = filtered_ds.filter(
        lambda example: example["repository_name"] == repo_name
    )
    if len(rows_w_repo) > 1:
        repo_files[repo_name] = [row["content"] for row in rows_w_repo]
    if len(repo_files) > 400:
        break

# filter out repos with only one file
filtered_ds = filtered_ds.filter(
    lambda example: example["repository_name"] in repo_files
)
len(filtered_ds)
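The method extraction itself is not shown above. As a hedged sketch of what that step involves, the snippet below uses Python's built-in ast module to collect defined function names and call sites per repository and label each call as internal or external. The real pipeline uses a different AST parser and token annotation (the <call ...> / <argument_list ...> labels seen later), so the helper names here (extract_defs_and_calls, label_calls) and the COMMON_NAMES set are illustrative assumptions.

import ast

COMMON_NAMES = {"get", "set"}  # very common method names we drop


def extract_defs_and_calls(source):
    # Hypothetical helper: returns (defined function names, called names) for one file.
    tree = ast.parse(source)
    defined, called = set(), []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Call):
            func = node.func
            # plain calls (foo()) have a Name node, method calls (obj.foo()) an Attribute node
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name is not None:
                called.append(name)
    return defined, called


def label_calls(repo_sources):
    # Hypothetical helper: labels every call in a repository as internal or external.
    defined, all_calls = set(), []
    for src in repo_sources:
        try:
            d, c = extract_defs_and_calls(src)
        except SyntaxError:
            continue  # skip files that do not parse as Python 3
        defined |= d
        all_calls.extend(c)
    return [
        (name, "internal" if name in defined else "external")
        for name in all_calls
        if name not in COMMON_NAMES
    ]

Applied to one of the repositories collected above (for example label_calls(repo_files[repo_name])), this yields labeled calls; an even representation of internal and external calls can then be enforced by downsampling whichever class is larger.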
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set_theme(style="whitegrid")


def visualize_perplexities(perplexities, tokens, title, filename):
    # Box plot of one perplexity distribution per token label.
    # Note: `filename` is accepted but the figure is only shown, not saved.
    fig, ax = plt.subplots(figsize=(10, 6))
    ax = sns.boxplot(data=perplexities, palette="Set2")
    ax.set_xticklabels(tokens)
    ax.set_title(title)
    plt.xticks(rotation=45, ha="right")
    plt.show()
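To sanity-check the helper, it can be called with synthetic data. The numbers below are made up purely to show the expected input shape (one sequence of perplexity values per token label); they are not results from the paper.

rng = np.random.default_rng(0)
fake_perplexities = [rng.lognormal(mean=1.0, sigma=0.5, size=50) for _ in range(3)]
fake_tokens = ["<call internal>", "<call external>", "print"]
visualize_perplexities(
    fake_perplexities, fake_tokens, "Synthetic example", "synthetic.png"
)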
# token_cnt counts token labels; most_common() sorts them by frequency
most_common = token_cnt.most_common()
# AST tokens that mark a method invocation
method_invocations = [
    t
    for t in most_common
    if t[0].startswith("<argument_list") or t[0].startswith("<call")
]
# split invocations into internal and external calls based on their annotation
internals = [t for t in method_invocations if "internal" in t[0]]
externals = [t for t in method_invocations if "internal" not in t[0]]
# ten most common AST tokens (labels wrapped in "<...>" that contain "->")
ast_tokens = [
    t
    for t in most_common
    if (t[0].startswith("<") or t[0].endswith(">")) and "->" in t[0]
][:10]
# ten most common plain BPE tokens
bpe_tokens = [
    t
    for t in most_common
    if (not t[0].startswith("<") and not t[0].endswith(">")) and "->" not in t[0]
][:10]
# same selection as above, but reversed to get the ten *least* common tokens
ast_tokens_least = [
    t
    for t in most_common
    if (t[0].startswith("<") or t[0].endswith(">")) and "->" in t[0]
][::-1][:10]
bpe_tokens_least = [
    t
    for t in most_common
    if (not t[0].startswith("<") and not t[0].endswith(">")) and "->" not in t[0]
][::-1][:10]
ast_crosses_least = [cross_dist[token] for token, _ in ast_tokens_least]
bpe_crosses_least = [cross_dist[token] for token, _ in bpe_tokens_least]
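The least-common tokens and their cross-entropy distributions can then be handed to the plotting helper defined earlier. How the published figures were actually generated is not shown in this section, so the titles and filenames below are assumptions.

visualize_perplexities(
    ast_crosses_least,
    [token for token, _ in ast_tokens_least],
    "Cross-entropy of the 10 least common AST tokens",
    "ast_least_common.png",
)
visualize_perplexities(
    bpe_crosses_least,
    [token for token, _ in bpe_tokens_least],
    "Cross-entropy of the 10 least common BPE tokens",
    "bpe_least_common.png",
)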