Preface: I have a background in computational biology, but spent the past 5 years working on computer vision for self-driving cars. I’ve been fascinated by the emerging intersection of AI and the biosciences. Feel free to comment and / or contact me!
Successful applications of machine learning (ML) tend to share a few general properties. John Jumper, the lead author on the AlphaFold work at DeepMind, mentions several of these in recent talks about AlphaFold here and here. I particularly like this observation of his:
“I would argue that training data is the second most important thing in ML. Evaluation is most important; it specifies what you want and measures progress towards the goal.”
In addition to public data, clear problem formulation and evaluation via open competition seem to be hallmarks of success in machine learning because they contribute to rapid collaboration and knowledge sharing. Protein structure prediction is one of the canonical applications of ML in the biosciences with these features, paralleling other areas of ML (e.g., language, computer vision).
| | Computer Vision | Protein Structure Prediction |
|---|---|---|
| Public datasets | ImageNet (14M images), LAION (400M images) | PDB (200k structures), UniParc (~200M sequences) |
| Evaluation criteria | ImageNet | CASP |
| Problem formulation | Rich academic lineage in feature extraction | Anfinsen hypothesis: sequence determines structure |
| Architecture | CNN: multi-scale learned feature generation | Transformer: model long-range 3D interactions between residues |
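To make the "Evaluation criteria" row concrete, here is a minimal sketch of how a predicted structure might be scored against an experimental one. It is loosely modeled on GDT_TS, the headline metric used in CASP, which averages the percentage of residues that fall within 1, 2, 4, and 8 Å of their reference positions. This is a simplification and not the official implementation: it uses a single Kabsch superposition, whereas the real metric searches over many superpositions, so it will tend to underestimate the true score. The function names and the choice of one coordinate per residue (e.g., the C-alpha atom) are assumptions made for illustration.

```python
import numpy as np


def kabsch_superpose(pred, ref):
    """Rigidly align pred onto ref (both N x 3 arrays) using the Kabsch algorithm."""
    pred_centroid = pred.mean(axis=0)
    ref_centroid = ref.mean(axis=0)
    pred_c = pred - pred_centroid
    ref_c = ref - ref_centroid

    # Covariance between the two centered point sets.
    h = pred_c.T @ ref_c
    u, _, vt = np.linalg.svd(h)

    # Guard against an improper rotation (a reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    # Rotate the centered prediction, then move it onto the reference centroid.
    return pred_c @ rot.T + ref_centroid


def gdt_ts_like(pred, ref, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS-style score in [0, 100] from one superposition.

    Assumption: pred and ref hold one coordinate per residue (in angstroms),
    in the same residue order.
    """
    aligned = kabsch_superpose(pred, ref)
    dist = np.linalg.norm(aligned - ref, axis=1)
    fractions = [(dist <= c).mean() for c in cutoffs]
    return 100.0 * float(np.mean(fractions))
```

Because the score is computed after superposition, a prediction that differs from the reference only by a rotation and translation scores 100, while one that gets the fold wrong scores low regardless of how it is oriented.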
Open questions:
Predicting protein structure (the 3D coordinates of all atoms) from the amino acid sequence is a central challenge in computational biology. Historically, models were small (hundreds of parameters) and relied on knowledge of biophysics (e.g., the Baker lab’s pioneering work on Rosetta). Learning this relationship directly from data was appealing because deep learning provides a powerful, general toolkit for function approximation. Two approaches have emerged: