An Idiot's Guide to Lead Optimisation for Proteins: Understanding the Cradle Pipeline

I saved this blog post because it explains a real industrial ML system without dumbing it down or assuming you already work in pharma. Magnus Ross wrote it after bugging his friends who actually do this for a living, and it shows. David Miller helped review it. The result is one of those pieces where you can feel the author recently went through the “I have no idea what I’m doing” phase and bothered to take notes.

What Is Lead Optimisation, Actually

Drug discovery has this step where you’ve got a molecule that sort of works and you need to make it actually work. That is lead optimisation. Most real-world design campaigns live or die here.

In the protein version: you have a chain of amino acids folded into a 3D shape that does something useful, just not well enough. Maybe it binds to a disease target but not tightly. Maybe it falls apart too fast. You need to make it better.

Problem: proteins fold the way they do because evolution spent a billion years tuning them. Change too many amino acids and the thing stops folding. You get a useless blob. So the whole game is: change the protein to improve it, but not so much that you break it.

The old approach is directed evolution. Make random mutations, test them in the lab, keep whatever works, repeat. Frances Arnold won half a Nobel Prize for figuring out how to make this practical. The new approach uses machine learning to make smarter bets about what to change.
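
The directed-evolution loop is simple enough to sketch in a few lines. This is a toy, not a lab protocol: `toy_fitness` is a made-up stand-in for the wet-lab assay (it just counts matches to an arbitrary target string), and the function names, library size, and round count are my own choices for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def mutate(seq, rng):
    """Return a copy of seq with one randomly chosen position substituted."""
    pos = rng.randrange(len(seq))
    new_aa = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return seq[:pos] + new_aa + seq[pos + 1:]

def toy_fitness(seq, target="MKSAYGLSERN"):
    """Hypothetical stand-in for an assay: count positions matching a target."""
    return sum(a == b for a, b in zip(seq, target))

def directed_evolution(start, rounds=20, library_size=50, seed=0):
    rng = random.Random(seed)
    best = start
    for _ in range(rounds):
        # Make random single mutants of the current best sequence.
        library = [mutate(best, rng) for _ in range(library_size)]
        # "Test in the lab", then keep the winner only if it beats the parent.
        champion = max(library, key=toy_fitness)
        if toy_fitness(champion) > toy_fitness(best):
            best = champion
    return best

start = "MKTAYGLTERN"
evolved = directed_evolution(start)
print(start, toy_fitness(start))
print(evolved, toy_fitness(evolved))
```

The real thing differs in the one place that matters: every call to the fitness function is an experiment that costs time and money, which is why smarter guesses about what to mutate are worth paying for.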

The Pipeline, Briefly

Cradle is a biotech startup that sells an ML system for this. Unusual thing about them: they run their own wet lab. So the loop between “model says try this” and “lab says it scored X” stays tight. They work with Novo Nordisk, Bayer, J&J.

They published a white paper in 2026 describing how the system works. The diagram is intimidating - coloured boxes, arrows everywhere. But it breaks into five pieces:

  1. Base model: learn what natural proteins look like
  2. Evotuning: focus on your protein’s family
  3. g-DPO: learn from actual lab results
  4. Predictor: estimate how good a candidate is before testing
  5. Generation: actually propose new sequences (covered in part 2, not out yet)

The Base Model: It’s Just a Language Model

The foundation is a transformer trained on protein sequences. If you get how ChatGPT works, you get this. Instead of predicting a token from a vocabulary of roughly 50,000, it predicts an amino acid from an alphabet of 20.

Proteins are strings. Twenty standard amino acids, each represented by a letter. Here’s myoglobin (carries oxygen in your muscles) as a string:

MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNKALELFRKDMASNYKELGFQG

The model trains with masked language modelling: hide an amino acid, make the model guess what goes there. Do this millions of times and the model develops a feel for what natural proteins look like.

M K T A [?] G L S E R ...
         |
   L  ██████ 0.42
   V  ███    0.28
   I  █      0.15
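
To make the training setup concrete, here is a minimal sketch of how one masked-language-modelling example could be built from a sequence. The 15% masking rate and the `<mask>` token name are borrowed from common practice in text models; the source doesn't say what Cradle uses, so treat both as assumptions.

```python
import random

MASK = "<mask>"  # assumed token name

def make_mlm_example(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of residues.

    Returns (tokens with masks applied, {position: true residue}).
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_masked))
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model should predict
        tokens[pos] = MASK          # what the model actually sees
    return tokens, targets

tokens, targets = make_mlm_example("MKTAYGLSERN")
print(" ".join(tokens))
print(targets)
```

Training then amounts to asking the model for a distribution at each masked position and penalising it when the true residue gets low probability.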

If the model says tryptophan at that slot has a 0.001% chance but leucine has 42%, that tells you something. A W there would look unnatural. The model has never seen a protein like that. And since evolution has been filtering for function for billions of years, unnatural usually means non-functional.
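
One common way to turn these probabilities into a single number is a log-odds ratio between the proposed residue and the one it replaces. The sketch below uses invented probabilities for a single position (a real model would supply them), and the function name is mine.

```python
import math

# Hypothetical model output for one masked position (not from a real model).
# The remaining probability mass is spread over the residues not listed.
probs = {"L": 0.42, "V": 0.28, "I": 0.15, "W": 0.00001}

def log_odds(probs, wild_type, mutant):
    """log p(mutant) - log p(wild type).

    Negative values mean the model finds the mutant less natural
    than the residue it replaces.
    """
    return math.log(probs[mutant]) - math.log(probs[wild_type])

print(round(log_odds(probs, "L", "V"), 2))  # mild penalty
print(round(log_odds(probs, "L", "W"), 2))  # severe penalty
```

Swapping leucine for valine costs a fraction of a log unit; swapping it for tryptophan costs about ten. That gap is the "unnatural" signal in numeric form.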

None of this is guaranteed. There could be functional proteins that look nothing like natural ones. But we have almost no data on those, so the model can’t help with them.

Evotuning: Zoom In on the Right Neighbourhood

A model trained on all proteins is too generic. It knows what a protein is, but not what your protein needs.

The trick: find proteins evolutionarily related to yours. Align them residue by residue using something called a Multiple Sequence Alignment (MSA):

Query    : M K T A Y G L S E R N
Hit 1    : M K S A Y G L T E R N    (82%)
Hit 2    : L K T A Y G L S D R N    (82%)
Hit 3    : M R T A Y G I S E K N    (73%)
           ─────────────────────
Conserv. : x x x ✔ ✔ ✔ x x x x ✔

Positions 4, 5, 6, and 11 are the same across the whole family. You probably shouldn’t touch those.
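
The conservation row can be computed directly from the alignment. A small sketch, assuming gap-free rows of equal length (real MSAs contain gaps and need more care); the helper names are mine.

```python
query = "MKTAYGLSERN"
hits = ["MKSAYGLTERN", "LKTAYGLSDRN", "MRTAYGISEKN"]

def percent_identity(a, b):
    """Share of aligned positions where two equal-length sequences agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100 * matches / len(a)

def conserved_positions(seqs):
    """1-based positions where every sequence in the alignment has the same residue."""
    return [i + 1 for i, column in enumerate(zip(*seqs)) if len(set(column)) == 1]

for hit in hits:
    print(hit, f"{percent_identity(query, hit):.0f}%")

print(conserved_positions([query] + hits))  # → [4, 5, 6, 11]
```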

Fine-tune the model on these related sequences. That’s evotuning (evolutionary fine-tuning). The model now knows: forget about bacterial enzymes and weird deep-sea proteins. Focus on this neighbourhood.

g-DPO: Put the Lab Data to Work

Now you start running experiments. Test proteins in the lab. Get measurements like:

Sequence                          Activity   Stability
M K T A Y G L S E R N ...          0.82       54.1
M K T A Y G L T E R N ...          0.79       53.8
M K S A Y G L S E R N ...          0.91       52.4

You want the model to suggest more sequences like the high-scoring ones and fewer like the low-scoring ones.

This is a preference optimisation problem. The standard LLM approach is Direct Preference Optimisation (DPO): give the model pairs of good and bad responses, push it toward the good ones. Clean, works great.
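
For reference, the standard DPO loss for one preference pair looks like this in code. The log-probabilities below are invented numbers, and `beta=0.1` is just a commonly used value, not something the source specifies.

```python
import math

def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_*     : sequence log-probabilities under the model being trained
    ref_logp_* : the same quantities under the frozen reference model
    """
    margin = beta * ((logp_good - ref_logp_good) - (logp_bad - ref_logp_bad))
    # The loss is -log(sigmoid(margin)) = log(1 + exp(-margin)).
    if margin < -30:
        return -margin  # avoids overflow; accurate to many decimals here
    return math.log1p(math.exp(-margin))

# Model identical to the reference: no preference learned yet.
print(dpo_loss(-20.0, -20.0, -20.0, -20.0))
# Model has moved toward the good sequence and away from the bad one.
print(dpo_loss(-18.0, -25.0, -20.0, -20.0))
```

The loss starts at log 2 when the model equals the reference and drops as the model raises the preferred sequence's likelihood relative to the rejected one.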

Except DPO needs clean good/bad pairs to work well. Assay data gives you continuous measurements, not binary labels. Cradle’s fix is grouped DPO (g-DPO): cluster similar sequences, form preference pairs within clusters. Comparing sequences that differ by a few positions teaches the model about subtle effects. Comparing very different sequences teaches it about obvious ones. Both are useful.
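
Here is one way the "cluster, then pair within clusters" idea could look. This is my guess at the general shape, not Cradle's algorithm: the greedy Hamming-distance clustering, the distance threshold, and the minimum score gap are all assumptions for illustration, and the fourth sequence is invented.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def greedy_clusters(seqs, max_dist=2):
    """Assign each sequence to the first cluster whose seed is within max_dist."""
    clusters = []  # each cluster is a list; its first element is the seed
    for seq in seqs:
        for cluster in clusters:
            if hamming(seq, cluster[0]) <= max_dist:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

def preference_pairs(scores, max_dist=2, min_gap=0.05):
    """Return (better, worse) pairs drawn only from within the same cluster."""
    pairs = []
    for cluster in greedy_clusters(list(scores), max_dist):
        for a, b in combinations(cluster, 2):
            # Skip pairs whose scores are too close to call.
            if abs(scores[a] - scores[b]) >= min_gap:
                pairs.append((a, b) if scores[a] > scores[b] else (b, a))
    return pairs

activity = {
    "MKTAYGLSERN": 0.82,
    "MKTAYGLTERN": 0.79,
    "MKSAYGLSERN": 0.91,
    "LRTAYGISDKN": 0.40,  # made-up distant variant
}
for better, worse in preference_pairs(activity):
    print(better, ">", worse)
```

The minimum gap is one way to turn continuous measurements into binary preferences without pretending that 0.82 versus 0.79 is a meaningful win.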

After this, the model is called the logiter, because it outputs logits: raw scores for each amino acid, before they are normalised into probabilities. This is the thing that actually proposes amino acid changes.
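
If "logits" is a new word: a softmax is the usual way to turn them into probabilities. A minimal sketch with invented logits for four residues at one position:

```python
import math

def softmax(logits):
    """Convert a dict of raw scores into probabilities that sum to 1."""
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {aa: math.exp(score - top) for aa, score in logits.items()}
    total = sum(exps.values())
    return {aa: e / total for aa, e in exps.items()}

logits = {"L": 2.1, "V": 1.7, "I": 1.1, "W": -8.0}  # hypothetical values
probs = softmax(logits)
print({aa: round(p, 3) for aa, p in probs.items()})
```

Only the differences between logits matter, so adding a constant to every score leaves the probabilities unchanged.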

The Predictor: How Good Will This Be?

The logiter makes suggestions. But you need to evaluate them without running a lab test every time.

The predictor sits on top of the evotuned model’s learned representations. Inside a transformer, each position in a protein gets turned into a vector that captures its role in context. Those vectors encode useful information; the model learned them during evotuning.

You take those vectors and feed them into a simple regression model (called a “head”) that predicts the assay values. Most of the heavy lifting is already done by this point. The evotuned model already understands protein structure. You’re just pointing that understanding at a different target.
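
A rough sketch of what such a head could look like: average the per-residue vectors into one vector per protein, then fit a linear model on top. Everything here is illustrative. The two-dimensional embeddings are made up, mean pooling is my assumption about how vectors get combined, and plain gradient descent stands in for whatever fitting procedure is actually used.

```python
def mean_pool(residue_vectors):
    """Average per-residue vectors into one fixed-size vector per protein."""
    n = len(residue_vectors)
    return [sum(col) / n for col in zip(*residue_vectors)]

def predict(weights, bias, features):
    """Linear head: weighted sum of the pooled features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def fit_linear_head(X, y, lr=0.1, steps=2000):
    """Fit weights and bias by plain gradient descent on mean squared error."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        errors = [predict(weights, bias, x) - t for x, t in zip(X, y)]
        for j in range(len(weights)):
            grad = sum(e * x[j] for e, x in zip(errors, X)) / len(X)
            weights[j] -= lr * grad
        bias -= lr * sum(errors) / len(X)
    return weights, bias

# Made-up 2-dimensional embeddings for three proteins, three residues each.
embeddings = [
    [[0.1, 0.9], [0.3, 0.7], [0.2, 0.8]],
    [[0.6, 0.4], [0.5, 0.5], [0.7, 0.3]],
    [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]],
]
activity = [0.79, 0.82, 0.91]  # assay values to regress against

X = [mean_pool(e) for e in embeddings]
weights, bias = fit_linear_head(X, activity)
print([round(predict(weights, bias, x), 2) for x in X])
```

The big model stays frozen and only the head's handful of parameters are fitted, which is why this can work with the small number of measurements a lab round produces.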

The predictor lets you filter the logiter’s suggestions and decide what to test first.

The Missing Bits

There were two pieces I couldn’t cover because I only clipped part 1. First, the masking model: the logiter tells you what amino acid to put at a position, but it doesn’t tell you which positions to change. That’s its own problem. Second, the actual generation strategy, which is tricky because you’re optimising for several properties at once and they often trade off against each other.

Will cover these when part 2 comes out.

What Stuck With Me

The modularity stands out. Each piece has one job. The base model learns protein grammar. Evotuning focuses on the right family. g-DPO injects experimental feedback. The predictor gates what gets tested. It’s designed, not improvised.

The whole pipeline depends on tight iteration between predictions and wet lab. The model suggests, the lab tests, the results feed back. That loop matters more than any single model improvement. Most applied ML projects I’ve seen follow the same pattern: the infrastructure around the model ends up mattering as much as the model itself.

Crepi il lupo! 🐺