Code LM
Walkthrough
Build a GPT-style language model and an adapted byte-level BPE tokenizer, pre-train the model on custom HuggingFace datasets, and use it to generate Python code. How fun!
What you'll do
In rough order:
- Build and train a byte-level BPE tokenizer from scratch, custom-fit to avoid splitting common keywords.
- Build a pre- and post-processing pipeline that handles special tokens such as indents, dedents, and comments, and adapts training samples for fill-in-the-middle (FIM).
- Create a custom dataset specifically designed for technical comprehension and code completion.
- Train and sample from a GPT-2-style language model on the custom dataset.
- Circle back and wrap it all up to ship on HuggingFace.
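To make the pre-processing step above more concrete, here is a minimal sketch of two of the ideas it names: turning leading whitespace into explicit indent/dedent markers, and rearranging a sample for fill-in-the-middle training. It is an illustration only, not the walkthrough's actual pipeline. The token strings (`<INDENT>`, `<DEDENT>`, `<FIM_PREFIX>`, `<FIM_SUFFIX>`, `<FIM_MIDDLE>`), the function names, the fixed 4-space indent width, and the prefix-suffix-middle ordering are all assumptions made for this example.

```python
import random

# Hypothetical special-token strings; the walkthrough's real tokens may differ.
INDENT, DEDENT = "<INDENT>", "<DEDENT>"
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<FIM_PREFIX>", "<FIM_SUFFIX>", "<FIM_MIDDLE>"


def encode_indentation(source: str, indent_width: int = 4) -> str:
    """Replace leading spaces with explicit INDENT/DEDENT markers.

    Assumes space-only indentation in multiples of `indent_width`.
    """
    out_lines = []
    depth = 0
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:
            out_lines.append("")  # blank lines do not change the depth
            continue
        new_depth = (len(line) - len(stripped)) // indent_width
        if new_depth > depth:
            marker = INDENT * (new_depth - depth)
        else:
            marker = DEDENT * (depth - new_depth)
        depth = new_depth
        out_lines.append(marker + stripped)
    if depth:
        # Close any blocks that are still open at the end of the file.
        out_lines.append(DEDENT * depth)
    return "\n".join(out_lines)


def to_fim_sample(text: str, rng: random.Random) -> str:
    """Cut text at two random points and emit prefix, suffix, then middle."""
    a, b = sorted(rng.randint(0, len(text)) for _ in range(2))
    prefix, middle, suffix = text[:a], text[a:b], text[b:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


code = "def f(x):\n    if x:\n        return 1\n    return 0\n"
encoded = encode_indentation(code)
print(encoded)
print(to_fim_sample(encoded, random.Random(0)))
```

With this layout the model sees the prefix and suffix first and learns to produce the middle last. Note that the sketch cuts at character positions, so a cut can land inside a marker such as `<INDENT>`; a real pipeline would more likely choose split points at the token level.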
Check it out
Get started here.