Code LM

Walkthrough · PyTorch

Build a GPT-style language model and an adapted byte-level BPE tokenizer, pre-train the model on custom HuggingFace datasets, and use it to generate Python code. How fun!

What you'll do

In rough order:

  • Build and train a byte-level BPE tokenizer from scratch, custom-fit to avoid splitting common keywords.
  • Write a pre- and post-processing pipeline that handles special tokens for indents, dedents, and comments, and adapts training samples for fill-in-the-middle (FIM).
  • Create a custom dataset specifically designed for technical comprehension and code completion.
  • Train a GPT-2-style language model on the custom dataset and sample from it.
  • Circle back and wrap it all up to ship on HuggingFace.
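
To give a feel for the pre-processing step in the list above, here is a minimal sketch of two of its ideas: marking indentation changes with explicit tokens, and rearranging a sample into prefix, suffix, middle order for fill-in-the-middle training. The token names (`<indent>`, `<fim_prefix>`, and so on), the function names, and the 4-space indent width are all assumptions made for illustration; the walkthrough's own pipeline may look quite different.

```python
import random

# Hypothetical special tokens; the walkthrough's actual token names may differ.
INDENT, DEDENT = "<indent>", "<dedent>"
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def mark_indentation(source: str, width: int = 4) -> str:
    """Replace leading spaces with explicit indent/dedent markers.

    Assumes space-based indentation in multiples of `width`.
    """
    out, depth = [], 0
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:  # blank lines are kept and do not change depth
            out.append("")
            continue
        new_depth = (len(line) - len(stripped)) // width
        if new_depth > depth:
            marker = INDENT * (new_depth - depth)
        else:
            marker = DEDENT * (depth - new_depth)
        out.append(marker + stripped)
        depth = new_depth
    return "\n".join(out)


def to_fim_sample(text: str, rng: random.Random) -> str:
    """Cut text at two random points and emit it as prefix, suffix, middle."""
    lo, hi = sorted(rng.randint(0, len(text)) for _ in range(2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


code = "def f(x):\n    if x:\n        return 1\n    return 0"
marked = mark_indentation(code)
sample = to_fim_sample(code, random.Random(0))
print(marked)
print(sample)
```

With explicit markers the tokenizer does not have to spend tokens on runs of spaces, and with the middle span moved to the end, an ordinary left-to-right model can learn to fill a gap given the code on both sides of it.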

Check it out

Get started here.

Code LM · Tanner O'Rourke