AI for Protein Engineering

Lance Martin

Preface: I have a background in computational biology, but have spent the past 5 years working on computer vision for self-driving cars. I’ve been fascinated by the emerging intersection of AI and the biosciences. Feel free to comment and/or contact me!

What biological problems can machine learning solve?

Successful applications of machine learning (ML) tend to share a few general properties. John Jumper, the lead author on the AlphaFold work at DeepMind, mentions several of these in recent talks about AlphaFold here and here. In particular, I like this observation that he makes:

“I would argue that training data is the second most important thing in ML. Evaluation is most important; it specifies what you want and measures progress towards the goal.”

In addition to public data, clear problem formulation and evaluation via open competition seem to be hallmarks of success in machine learning because they contribute to rapid collaboration and knowledge sharing. Protein structure prediction is one of the canonical applications of ML in the biosciences with these features, paralleling other areas of ML (e.g., language, computer vision).

	Computer Vision	Protein Structure Prediction
Public datasets	ImageNet (14M images); LAION (400M images)	PDB (200k structures); UniParc (~200M sequences)
Evaluation criteria	ImageNet	CASP
Problem formulation	Rich academic lineage in feature extraction	Anfinsen hypothesis: sequence determines structure
Architecture	CNN: multi-scale learned feature generation	Transformer: model long-range 3D interactions between residues

Open questions:

What other open data competitions exist (or should exist) in the biosciences?

Protein Structure Prediction

Predicting protein structure (3D coordinates of all atoms) from the amino acid sequence is a central computational biology challenge. Historically, models were small (hundreds of parameters) and relied on knowledge of biophysics for this task (e.g., the Baker lab’s pioneering work on Rosetta). Learning the sequence-to-structure relationship directly from data was appealing because deep learning provides a powerful, general toolkit for function approximation. Two approaches have emerged:

Learn from protein structures
Learn from protein sequences
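
To make the problem formulation concrete, here is a minimal Python sketch of the input, the output, and one possible way to score a prediction. It is an illustration under stated assumptions, not how CASP or AlphaFold evaluate structures: it simplifies "all atoms" to one 3D point per residue, and it uses a distance-based RMSD (dRMSD), a simple metric chosen here only because it needs no structural superposition. CASP uses its own scores (e.g., GDT_TS). The sequence, coordinates, and function names are all hypothetical.

```python
import math
from itertools import combinations
from typing import List, Tuple

Coord = Tuple[float, float, float]


def pairwise_distances(coords: List[Coord]) -> List[float]:
    """Distances between every pair of residues, in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def drmsd(predicted: List[Coord], reference: List[Coord]) -> float:
    """Root-mean-square difference between the two pairwise-distance lists.

    Comparing internal distances (rather than raw coordinates) makes the
    score unaffected by rotating or translating either structure.
    """
    if len(predicted) != len(reference):
        raise ValueError("structures must have the same number of residues")
    if len(predicted) < 2:
        raise ValueError("need at least two residues")
    d_pred = pairwise_distances(predicted)
    d_ref = pairwise_distances(reference)
    squared = sum((p - r) ** 2 for p, r in zip(d_pred, d_ref))
    return math.sqrt(squared / len(d_ref))


# Hypothetical toy example: a 3-residue "protein", one point per residue.
sequence = "MKV"
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]

# The same shape moved elsewhere in space: should score (near) zero.
shifted = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in reference]

# A different shape (fully extended chain): should score clearly above zero.
stretched = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]

print("shifted:  ", drmsd(shifted, reference))
print("stretched:", drmsd(stretched, reference))
```

One known limitation of this choice: because it only looks at internal distances, dRMSD cannot tell a structure from its mirror image, which is one reason superposition-based scores are also used in practice.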