Lance Martin

Preface: I have a background in computational biology, but spent the past 5 years working on computer vision for self-driving cars. I’ve been fascinated by the emerging intersection of AI and the biosciences. Feel free to comment and / or contact me!

What biological problems can machine learning solve?

Successful applications of machine learning (ML) tend to share a few general properties. John Jumper, the lead author of the AlphaFold work at DeepMind, mentions several of these in recent talks about AlphaFold here and here. In particular, I like this observation of his:

“I would argue that training data is the second most important thing in ML. Evaluation is most important; it specifies what you want and measures progress towards the goal.”

In addition to public data, clear problem formulation and evaluation via open competition seem to be hallmarks of success in machine learning because they encourage rapid collaboration and knowledge sharing. Protein structure prediction is one of the canonical applications of ML in the biosciences with these features, paralleling other areas of ML (e.g., language, computer vision).

| | Computer Vision | Protein Structure Prediction |
|---|---|---|
| Public datasets | ImageNet (14M images), LAION (400M images) | PDB (200k structures), UniParc (~200M sequences) |
| Evaluation criteria | ImageNet | CASP |
| Problem formulation | Rich academic lineage in feature extraction | Anfinsen hypothesis: sequence determines structure |
| Architecture | CNN: multi-scale learned feature generation | Transformer: model long-range 3D interactions between residues |
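To make the "evaluation" row concrete, here is a minimal sketch of a GDT_TS-style score, the kind of metric used to rank predictions in CASP. It is a simplification under stated assumptions: the predicted and reference structures are taken to be already superimposed, and only one point per residue (e.g., the C-alpha atom) is compared. The real CASP score also searches over many superpositions, which this sketch skips. The function name `gdt_ts` and the toy coordinates are hypothetical and are not taken from any CASP tooling.

```python
import math


def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS-style score for two already-superimposed traces.

    predicted, reference: equal-length lists of (x, y, z) coordinates in
    angstroms, one point per residue (assumed to be C-alpha atoms).
    Returns a score in [0, 100]; higher means a closer match.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty coordinate lists of equal length")

    # Per-residue distance between the prediction and the reference.
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]

    # Fraction of residues that fall within each distance cutoff.
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]

    # Average the fractions over the cutoffs and express as a percentage.
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy example: four residues with errors of 0, 1.5, 3.0, and 9.0 angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]

print(gdt_ts(reference, reference))  # identical structures score 100.0
print(gdt_ts(predicted, reference))  # 56.25
```

Averaging over several cutoffs means one badly placed residue lowers the score without dominating it, which is part of why a metric like this is a useful target to optimize against.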

Open questions:


Protein Structure Prediction

Predicting protein structure (3D coordinates of all atoms) from the amino acid sequence is a central challenge in computational biology. Historically, models were small (hundreds of parameters) and relied on knowledge of biophysics (e.g., the Baker lab’s pioneering work on Rosetta). Learning this relationship directly from data is appealing because deep learning provides a powerful toolkit for general function approximation. Two approaches have emerged:

  1. Learn from protein structures
  2. Learn from protein sequences