Code LM

Walkthrough · PyTorch

Build a GPT-style language model and an adapted byte-level BPE tokenizer, pre-train the model on custom HuggingFace datasets, and use it to generate Python code. How fun!

What you'll do

In rough order:

Build and train a byte-level BPE tokenizer from scratch, custom-fit to avoid splitting common keywords.
Build a pre- and post-processing pipeline that targets special tokens such as indents, dedents, and comments, and adapts training samples for fill-in-the-middle.
Create a custom dataset specifically designed for technical comprehension and code completion.
Train a GPT-2-style language model on the custom dataset and sample from it.
Circle back and wrap it all up to ship on HuggingFace.
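
To make the pre-processing step above more concrete, here is a minimal sketch of two of the ideas it mentions: marking indentation changes with explicit tokens, and rearranging a sample for fill-in-the-middle training. The token names (`<INDENT>`, `<DEDENT>`, `<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`), the function names, the fixed 4-space indent width, and the prefix-suffix-middle ordering are all assumptions made for illustration; the walkthrough itself may do this differently.

```python
import random

# Hypothetical special-token names; the walkthrough may use different ones.
INDENT, DEDENT = "<INDENT>", "<DEDENT>"
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def mark_indentation(source: str, width: int = 4) -> str:
    """Replace leading spaces with explicit INDENT/DEDENT markers.

    Assumes space-only indentation in multiples of `width`.
    """
    out, depth = [], 0
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:  # keep blank lines; they do not change depth
            out.append("")
            continue
        new_depth = (len(line) - len(stripped)) // width
        if new_depth > depth:
            marks = INDENT * (new_depth - depth)
        else:
            marks = DEDENT * (depth - new_depth)
        out.append(marks + stripped)
        depth = new_depth
    if depth:  # close any blocks still open at end of input
        out.append(DEDENT * depth)
    return "\n".join(out)


def to_fim_sample(text: str, rng: random.Random) -> str:
    """Rearrange a document into prefix-suffix-middle order."""
    # Two random cut points split the text into prefix / middle / suffix.
    a, b = sorted(rng.randint(0, len(text)) for _ in range(2))
    prefix, middle, suffix = text[:a], text[a:b], text[b:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


snippet = "def f(x):\n    if x:\n        return 1\n    return 0"
marked = mark_indentation(snippet)
fim = to_fim_sample(snippet, random.Random(0))
```

With the middle span moved to the end, a left-to-right model sees both the prefix and the suffix before it has to predict the missing code. In practice these markers would also need to be registered as special tokens in the tokenizer so they are never split.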

Check it out

Get started here.

