Introduction

Recent years have witnessed a surge in the development of protein foundation models, significantly improving performance on protein prediction and generative tasks ranging from 3D structure prediction and protein design to conformational dynamics. However, the capabilities and limitations of these models remain poorly understood due to the absence of a unified evaluation framework. To fill this gap, we introduce ProteinBench, a holistic evaluation framework designed to enhance the transparency of protein foundation models. Our approach consists of three key components: (i) a taxonomic classification of tasks that broadly encompasses the main challenges in the protein domain, based on the relationships between different protein modalities; (ii) a multi-metric evaluation approach that assesses performance across four key dimensions: quality, novelty, diversity, and robustness; and (iii) in-depth analyses from the perspective of various user objectives, providing a holistic view of model performance. Our comprehensive evaluation of protein foundation models reveals several key findings that shed light on their current capabilities and limitations. To promote transparency and facilitate further research, we publicly release the evaluation datasets, code, and a public leaderboard, together with a general modular toolkit for further analysis. We intend for ProteinBench to be a living benchmark that establishes a standardized, in-depth evaluation framework for protein foundation models, driving their development and application while fostering collaboration within the field.

Inverse Folding

We evaluate the performance of various inverse-folding models for structure-based sequence design, focusing on two distinct objectives: natural evolutionary fitness (in-distribution proteins) and sequence design for de novo designed backbones. The latter is an out-of-distribution problem that tests the robustness of the methods, as these backbones typically contain noise absent from the high-resolution structures deposited in the PDB.

The following table shows the performance of structure-based sequence design models on inverse folding tasks. The reported results are the median over repeated runs. 'N/A' stands for not applicable. ESM-IF1 and ESM3 use all native structures and sequences for model training and are therefore not evaluated on the evolution-distribution-fitting objective.

CASP/CAMEO AAR cover the evolution-distribution-fitting objective; the per-length scTM/pLDDT columns cover de novo backbone-based sequence design.

| Model | CASP AAR ↑ | CAMEO AAR ↑ | scTM ↑ (L=100) | pLDDT ↑ (L=100) | scTM ↑ (L=200) | pLDDT ↑ (L=200) | scTM ↑ (L=300) | pLDDT ↑ (L=300) | scTM ↑ (L=400) | pLDDT ↑ (L=400) | scTM ↑ (L=500) | pLDDT ↑ (L=500) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ProteinMPNN | 0.450 | 0.468 | 0.962 | 94.14 | 0.945 | 89.34 | 0.962 | 90.28 | 0.875 | 83.76 | 0.568 | 67.09 |
| ESM-IF1 | N/A | N/A | 0.810 | 88.83 | 0.635 | 69.67 | 0.336 | 74.36 | 0.449 | 64.59 | 0.462 | 58.97 |
| LM-Design | 0.516 | 0.570 | 0.834 | 78.45 | 0.373 | 58.41 | 0.481 | 69.86 | 0.565 | 59.87 | 0.397 | 56.35 |
| ESM3 | N/A | N/A | 0.942 | 86.60 | 0.486 | 60.69 | 0.632 | 70.78 | 0.564 | 62.63 | 0.452 | 59.37 |
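The AAR (amino-acid recovery) metric above is the fraction of positions at which the designed sequence reproduces the native residue. A minimal sketch, assuming pre-aligned sequences of equal length (the function name `amino_acid_recovery` is ours, not from the benchmark code):

```python
def amino_acid_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed sequence recovers
    the native residue (AAR). Assumes equal-length, aligned sequences."""
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    return sum(a == b for a, b in zip(native, designed)) / len(native)

# Toy example: 3 of 4 residues recovered.
print(amino_acid_recovery("MKVL", "MKAL"))  # 0.75
```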

Structure Design

We evaluate the performance of protein foundation models for backbone design. Our analysis focuses on the quality, novelty, and diversity of the generated structures across various chain lengths.

The following table shows the performance of backbone design models across chain lengths ranging from 50 to 500. The reported results are the median over repeated runs. We highlight the best performance in bold and the second best with an underline. For the novelty and diversity metrics, we only highlight results whose corresponding scTM score is higher than 0.5. 'N/A' stands for not applicable.

Quality: scTM, scRMSD. Novelty: Max TM. Diversity: pairwise TM, Max Clust.

| Model | scTM ↑ (L=50) | scRMSD ↓ (L=50) | Max TM ↓ (L=50) | pairwise TM ↓ (L=50) | Max Clust. ↑ (L=50) | scTM ↑ (L=100) | scRMSD ↓ (L=100) | Max TM ↓ (L=100) | pairwise TM ↓ (L=100) | Max Clust. ↑ (L=100) |
|---|---|---|---|---|---|---|---|---|---|---|
| Native PDBs | 0.91 | 0.74 | N/A | 0.29 | 0.66 | 0.96 | 0.67 | N/A | 0.30 | 0.77 |
| RFdiffusion | 0.95 | 0.45 | 0.65 | 0.58 | 0.67 | 0.98 | 0.48 | 0.76 | 0.41 | 0.32 |
| FrameFlow | 0.91 | 0.58 | 0.75 | 0.68 | 0.39 | 0.94 | 0.70 | 0.72 | 0.55 | 0.49 |
| Chroma | 0.85 | 1.05 | 0.59 | 0.29 | 0.48 | 0.89 | 1.27 | 0.70 | 0.35 | 0.59 |
| FrameDiff(latest) | 0.85 | 1.00 | 0.67 | 0.35 | 0.64 | 0.90 | 1.23 | 0.71 | 0.52 | 0.11 |
| FoldFlow1(sfm) | 0.90 | 0.67 | 0.68 | 0.63 | 0.48 | 0.87 | 1.34 | 0.65 | 0.49 | 0.83 |
| FoldFlow1(base) | 0.79 | 1.19 | 0.66 | 0.53 | 0.76 | 0.81 | 1.70 | 0.62 | 0.48 | 0.83 |
| FoldFlow1(ot) | 0.83 | 1.10 | 0.65 | 0.53 | 0.77 | 0.83 | 1.60 | 0.64 | 0.48 | 0.81 |
| Genie | 0.57 | 3.12 | 0.57 | 0.32 | 0.90 | 0.69 | 3.38 | 0.59 | 0.31 | 0.96 |
| Model | scTM ↑ (L=300) | scRMSD ↓ (L=300) | Max TM ↓ (L=300) | pairwise TM ↓ (L=300) | Max Clust. ↑ (L=300) | scTM ↑ (L=500) | scRMSD ↓ (L=500) | Max TM ↓ (L=500) | pairwise TM ↓ (L=500) | Max Clust. ↑ (L=500) |
|---|---|---|---|---|---|---|---|---|---|---|
| Native PDBs | 0.97 | 0.82 | N/A | 0.28 | 0.77 | 0.97 | 1.07 | N/A | 0.29 | 0.80 |
| RFdiffusion | 0.96 | 1.03 | 0.64 | 0.36 | 0.65 | 0.79 | 5.60 | 0.62 | 0.33 | 0.89 |
| FrameFlow | 0.92 | 1.95 | 0.65 | 0.43 | 0.88 | 0.61 | 7.92 | 0.61 | 0.40 | 0.92 |
| Chroma | 0.87 | 2.47 | 0.66 | 0.36 | 0.67 | 0.72 | 6.71 | 0.60 | 0.29 | 0.99 |
| FrameDiff(latest) | 0.87 | 2.73 | 0.69 | 0.48 | 0.21 | 0.63 | 9.52 | 0.58 | 0.40 | 0.52 |
| FoldFlow1(sfm) | 0.45 | 9.04 | 0.54 | 0.39 | 1.00 | 0.37 | 13.04 | 0.53 | 0.37 | 1.00 |
| FoldFlow1(base) | 0.43 | 9.56 | 0.54 | 0.39 | 0.98 | 0.35 | 13.20 | 0.52 | 0.39 | 1.00 |
| FoldFlow1(ot) | 0.54 | 8.21 | 0.58 | 0.41 | 0.94 | 0.37 | 12.48 | 0.51 | 0.35 | 1.00 |
| Genie | 0.27 | 20.37 | 0.30 | 0.23 | 1.00 | 0.25 | 26.08 | 0.22 | 0.23 | 1.00 |
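The pairwise TM diversity metric above can be read as the average TM-score over all pairs of generated backbones, so lower values mean a more diverse set. A minimal sketch under that assumption (computing the TM-scores themselves would require a structure-alignment tool such as TM-align; here we take the matrix as given):

```python
from itertools import combinations

def mean_pairwise_tm(tm):
    """Average TM-score over all unordered pairs of generated structures
    (off-diagonal entries of a symmetric N x N TM-score matrix).
    Lower values indicate a more diverse set of backbones."""
    pairs = list(combinations(range(len(tm)), 2))
    return sum(tm[i][j] for i, j in pairs) / len(pairs)

# Toy 3x3 matrix; the diagonal (self-similarity) is ignored.
tm = [
    [1.00, 0.40, 0.50],
    [0.40, 1.00, 0.30],
    [0.50, 0.30, 1.00],
]
print(mean_pairwise_tm(tm))  # (0.40 + 0.50 + 0.30) / 3 ≈ 0.40
```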

Sequence Design

We assess the performance of various protein sequence generative models based on the quality, diversity, and novelty of their generated sequences across different chain lengths.

The following table shows the performance of protein sequence generative/language models on sequence generation tasks. The reported results are the mean over repeated runs. The pLDDT score is the output of AlphaFold2. Max TM abbreviates the maximum TM-score against the PDB database. 'N/A' stands for not applicable. We highlight the best performance in bold.

Quality: ppl, pLDDT. Diversity: pairwise TM, Max Clust. Novelty: Max TM.

| Model | ppl ↓ (L=100) | pLDDT ↑ (L=100) | pairwise TM ↓ (L=100) | Max Clust. ↑ (L=100) | Max TM ↓ (L=100) | ppl ↓ (L=200) | pLDDT ↑ (L=200) | pairwise TM ↓ (L=200) | Max Clust. ↑ (L=200) | Max TM ↓ (L=200) |
|---|---|---|---|---|---|---|---|---|---|---|
| Native Seqs | N/A | 68.46 | 0.55 | 0.75 | N/A | N/A | 61.91 | 0.49 | 0.78 | N/A |
| Progen 2 (700M) | 8.28 | 64.00 | 0.42 | 0.94 | 0.64 | 5.68 | 69.91 | 0.40 | 0.91 | 0.69 |
| EvoDiff | 16.89 | 50.20 | 0.43 | 0.98 | 0.69 | 17.28 | 50.66 | 0.36 | 1.00 | 0.71 |
| DPLM (650M) | 6.21 | 85.38 | 0.50 | 0.80 | 0.74 | 4.61 | 93.54 | 0.54 | 0.70 | 0.91 |
| ESM3 (1.4B) | 14.79 | 54.26 | 0.45 | 0.90 | 0.68 | 12.96 | 58.45 | 0.35 | 1.00 | 0.80 |
| Model | ppl ↓ (L=300) | pLDDT ↑ (L=300) | pairwise TM ↓ (L=300) | Max Clust. ↑ (L=300) | Max TM ↓ (L=300) | ppl ↓ (L=500) | pLDDT ↑ (L=500) | pairwise TM ↓ (L=500) | Max Clust. ↑ (L=500) | Max TM ↓ (L=500) |
|---|---|---|---|---|---|---|---|---|---|---|
| Native Seqs | N/A | 61.49 | 0.51 | 0.85 | N/A | N/A | 62.95 | 0.51 | 0.78 | N/A |
| Progen 2 (700M) | 6.25 | 65.69 | 0.42 | 0.93 | 0.66 | 4.27 | 61.45 | 0.32 | 0.95 | 0.68 |
| EvoDiff | 17.13 | 45.14 | 0.31 | 1.00 | 0.68 | 16.51 | 43.14 | 0.31 | 1.00 | 0.69 |
| DPLM (650M) | 3.47 | 93.07 | 0.57 | 0.63 | 0.91 | 3.33 | 87.73 | 0.43 | 0.85 | 0.85 |
| ESM3 (1.4B) | 14.59 | 48.08 | 0.32 | 1.00 | 0.75 | 11.10 | 52.17 | 0.30 | 1.00 | 0.54 |
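Perplexity (ppl) above is the exponentiated negative mean log-likelihood the language model assigns to a sequence; lower means the model finds the sequence more plausible. A small worked example (pure Python, assuming natural-log per-residue probabilities):

```python
import math

def perplexity(log_probs):
    """Sequence perplexity from per-residue natural-log probabilities:
    ppl = exp(-mean(log p)). Lower is better."""
    return math.exp(-sum(log_probs) / len(log_probs))

# If every residue gets probability 1/20 (uniform over the 20 amino
# acids), the perplexity is 20 regardless of sequence length.
uniform = [math.log(1 / 20)] * 100
print(perplexity(uniform))  # ≈ 20.0
```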

Sequence and Structure Co-Design

We assess the performance of various sequence-structure co-design models based on the quality, diversity, and novelty of their jointly generated sequences and structures across different chain lengths. The evaluation metrics include self-consistency TM-score (scTM) and scRMSD for quality, maximum cluster values for structural diversity, and the maximum TM-score to PDB structures for structural novelty.

The following table shows the performance of sequence-structure co-design models on joint generation tasks. The reported results are the mean over repeated runs with the standard deviation. Max TM abbreviates the maximum TM-score against the PDB database. 'N/A' stands for not applicable. We highlight the best performance in bold.

Quality: scTM, scRMSD. Diversity: Max Clust. Novelty: Max TM.

| Model | scTM ↑ (L=100) | scRMSD ↓ (L=100) | Max Clust. ↑ (L=100) | Max TM ↓ (L=100) | scTM ↑ (L=200) | scRMSD ↓ (L=200) | Max Clust. ↑ (L=200) | Max TM ↓ (L=200) |
|---|---|---|---|---|---|---|---|---|
| Native PDBs | 0.91 | 2.98 | 0.75 | N/A | 0.88 | 3.24 | 0.77 | N/A |
| ProteinGenerator | 0.91 | 3.75 | 0.24 | 0.73 | 0.88 | 6.24 | 0.25 | 0.72 |
| ProtPardelle* | 0.56 | 12.9 | 0.57 | 0.66 | 0.64 | 13.67 | 0.10 | 0.69 |
| Multiflow | 0.96 | 1.10 | 0.33 | 0.71 | 0.95 | 1.61 | 0.42 | 0.71 |
| ESM3* | 0.72 | 13.80 | 0.64 | 0.41 | 0.63 | 21.18 | 0.63 | 0.61 |
| Model | scTM ↑ (L=300) | scRMSD ↓ (L=300) | Max Clust. ↑ (L=300) | Max TM ↓ (L=300) | scTM ↑ (L=500) | scRMSD ↓ (L=500) | Max Clust. ↑ (L=500) | Max TM ↓ (L=500) |
|---|---|---|---|---|---|---|---|---|
| Native PDBs | 0.92 | 3.94 | 0.75 | N/A | 0.90 | 9.64 | 0.80 | N/A |
| ProteinGenerator | 0.81 | 9.26 | 0.22 | 0.71 | 0.41 | 33.91 | 0.18 | 0.73 |
| ProtPardelle* | 0.69 | 14.91 | 0.04 | 0.72 | 0.40 | 41.23 | 0.60 | 0.69 |
| Multiflow | 0.96 | 2.14 | 0.58 | 0.71 | 0.83 | 8.48 | 0.67 | 0.68 |
| ESM3 (1.4B)* | 0.59 | 25.5 | 0.52 | 0.73 | 0.54 | 33.7 | 0.37 | 0.77 |
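The Max Clust. metric above reports the number of structural clusters relative to the number of samples; more clusters means greater diversity. A minimal sketch using greedy leader clustering at a TM-score threshold of 0.5 (the benchmark may use a dedicated clustering tool instead; `cluster_fraction` is our illustrative name):

```python
def cluster_fraction(tm, threshold=0.5):
    """Greedy leader clustering on a pairwise TM-score matrix: each
    structure joins the first cluster whose representative it matches
    with TM >= threshold, otherwise it seeds a new cluster.
    Returns (#clusters / #structures); higher means more diverse."""
    reps = []  # indices of cluster representatives
    for i in range(len(tm)):
        if not any(tm[i][r] >= threshold for r in reps):
            reps.append(i)
    return len(reps) / len(tm)

# Structures 0 and 1 are similar (TM 0.8); structure 2 is distinct,
# giving 2 clusters among 3 samples.
tm = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
print(cluster_fraction(tm))  # ≈ 0.67
```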

Motif-Scaffolding

We evaluate the performance of various motif-scaffolding methods across different scaffolds, focusing on their effectiveness in designing scaffold structures. The primary objective of this evaluation is to compare the efficacy of structure-based and sequence-based approaches in generating designable scaffolds.

The following figure shows the motif-scaffolding performance of structure-based and sequence-based methods.

[Figure: motif-scaffolding results for structure-based and sequence-based methods]

Antibody Design

Using 13 metrics spanning four perspectives, we comprehensively evaluate the CDR-H3 regions (both sequence and structure) generated by each method for given antigens, thereby revealing their actual performance. All methods were trained on the same dataset with the hyperparameters reported in their respective papers, and tested on a unified set of 55 complexes from the RAbD dataset.

The following table shows the performance and characteristic preferences of each method (methods capable of generating multiple antibodies per antigen are marked with *). The results also reveal that conventional metrics such as AAR and RMSD are, on their own, insufficient for evaluating antibody design methods.

Accuracy: AAR, RMSD, TM-score. Functionality: Binding Energy. Specificity: SeqSim-outer, SeqSim-inner, PHR.

| Method | AAR ↑ | RMSD ↓ | TM-score ↑ | Binding Energy ↓ | SeqSim-outer ↓ | SeqSim-inner ↑ | PHR ↓ |
|---|---|---|---|---|---|---|---|
| RAbD (natural) | 100.00% | 0.00 | 1.00 | -15.33 | 0.26 | N/A | 45.78% |
| HERN | 33.17% | 9.86 | 0.16 | 1242.77 | 0.41 | N/A | 39.83% |
| MEAN | 33.47% | 1.82 | 0.25 | 263.90 | 0.65 | N/A | 40.74% |
| dyMEAN | 40.95% | 2.36 | 0.36 | 889.28 | 0.58 | N/A | 42.04% |
| *dyMEAN-FixFR | 40.05% | 2.37 | 0.35 | 612.75 | 0.60 | 0.96 | 43.75% |
| *DiffAb | 35.04% | 2.53 | 0.37 | 489.42 | 0.37 | 0.45 | 40.68% |
| *AbDPO | 31.29% | 2.79 | 0.35 | 116.06 | 0.38 | 0.60 | 69.69% |
| *AbDPO++ | 36.25% | 2.48 | 0.35 | 223.73 | 0.39 | 0.54 | 44.51% |
Rationality metrics:

| Method | CN-score ↑ | Clashes-inner ↓ | Clashes-outer ↓ | SeqNat ↑ | Total Energy ↓ | scRMSD ↓ |
|---|---|---|---|---|---|---|
| RAbD (natural) | 50.19 | 0.07 | 0.00 | -1.74 | -16.76 | 1.77 |
| HERN | 0.04 | 0.04 | 3.25 | -1.47 | 5408.74 | 9.89 |
| MEAN | 1.33 | 11.65 | 0.29 | -1.83 | 1077.32 | 2.77 |
| dyMEAN | 1.49 | 9.15 | 0.47 | -1.79 | 1642.65 | 2.11 |
| *dyMEAN-FixFR | 1.14 | 8.88 | 0.48 | -1.82 | 1239.29 | 2.48 |
| *DiffAb | 2.02 | 1.84 | 0.19 | -1.88 | 495.69 | 2.57 |
| *AbDPO | 1.33 | 4.14 | 0.10 | -1.99 | 270.12 | 2.79 |
| *AbDPO++ | 2.34 | 1.66 | 0.08 | -1.78 | 338.14 | 2.50 |

Protein Folding: Single State Prediction

We assess the performance of protein folding as single-state conformation prediction. Protein folding models have played a pivotal role in understanding sequence-structure relationships and serve as a foundational component for protein conformation models. Therefore, it is essential to benchmark their performance in discussions of protein conformation prediction.

The following table shows the performance of protein folding methods on CAMEO 2022. Results are reported as mean/median over 183 proteins. We highlight the best performance in bold and the second best with an underline. "N/A" stands for not applicable. *Unknown amino acids ("X") in the sequence must be removed for EigenFold, which may introduce small differences in the evaluated metrics.

Accuracy: TM-score, RMSD, GDT-TS, lDDT. Quality: CA clash, CA break, PepBond break.

| Model | TM-score ↑ | RMSD ↓ | GDT-TS ↑ | lDDT ↑ | CA clash (%) ↓ | CA break (%) ↓ | PepBond break (%) ↓ |
|---|---|---|---|---|---|---|---|
| AlphaFold2 | 0.871 | 3.21 | 0.860 | 0.900 | 0.3 | 0.0 | 4.8 |
| OpenFold | 0.870 | 3.21 | 0.856 | 0.895 | 0.4 | 0.0 | 2.0 |
| RoseTTAFold2 | 0.859 | 3.52 | 0.845 | 0.888 | 0.3 | 0.2 | 5.5 |
| ESMFold | 0.847 | 3.98 | 0.826 | 0.870 | 0.3 | 0.0 | 4.7 |
| EigenFold* | 0.743 | 7.65 | 0.703 | 0.737 | 8.0 | 0.5 | N/A |
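For reference, the TM-score used throughout these tables is, for a fixed residue alignment, a length-normalized sum over aligned CA distances with Zhang and Skolnick's distance scale d0. A minimal sketch (the full metric also optimizes the superposition, which we omit; d0 is valid for target lengths above 15 residues):

```python
def tm_score(distances, l_target):
    """TM-score for a fixed residue alignment: `distances` are CA-CA
    distances (in Å) of aligned pairs after superposition, `l_target`
    is the target length. d0 is the standard length-dependent scale
    (valid for l_target > 15)."""
    d0 = 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target

# A perfect full-coverage prediction (all distances 0) scores 1.0.
print(tm_score([0.0] * 100, 100))  # 1.0
```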

Multiple State Prediction

In the following, we evaluate the performance of predicting multiple conformational states.

The following table shows the performance on multi-state prediction for BPTI. Accuracy metrics (RMSDens, RMSD Cluster 3) are reported as the mean over 20 bootstrap resamples (with replacement) at different numbers of sampled conformations (N = 10–1000). Diversity and quality scores are evaluated on 1000 conformations for each model. We highlight the best performance in bold and the second best with an underline. "N/A" stands for not applicable due to model resolution. RMSD unit: Å.

Diversity: Pairwise RMSD. Quality: CA clash, CA break, PepBond break.

| Model | RMSDens ↓ (N=10) | RMSDens ↓ (N=100) | RMSDens ↓ (N=500) | RMSDens ↓ (N=1000) | RMSD Clu3 ↓ (N=10) | RMSD Clu3 ↓ (N=100) | RMSD Clu3 ↓ (N=500) | RMSD Clu3 ↓ (N=1000) | Pairwise RMSD | CA clash (%) ↓ | CA break (%) ↓ | PepBond break (%) ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EigenFold | 1.56 | 1.50 | 1.47 | 1.46 | 2.54 | 2.48 | 2.46 | 2.46 | 0.85 | 1.4 | 4.3 | N/A |
| MSA-depth256 | 1.57 | 1.54 | 1.52 | 1.52 | 2.51 | 2.47 | 2.45 | 2.45 | 0.20 | 0.0 | 0.0 | 9.2 |
| MSA-depth64 | 1.60 | 1.54 | 1.51 | 1.50 | 2.48 | 2.40 | 2.35 | 2.33 | 0.55 | 0.0 | 0.0 | 7.9 |
| MSA-depth32 | 1.67 | 1.53 | 1.45 | 1.41 | 2.39 | 2.21 | 1.93 | 1.87 | 2.14 | 0.6 | 0.0 | 10.6 |
| Str2Str-ODE (Tmax=0.15) | 2.36 | 2.19 | 2.10 | 2.08 | 3.03 | 2.68 | 2.60 | 2.56 | 1.86 | 0.0 | 0.0 | 13.9 |
| Str2Str-SDE (Tmax=0.15) | 2.83 | 2.48 | 2.28 | 2.25 | 3.42 | 2.92 | 2.52 | 2.48 | 3.60 | 0.3 | 0.0 | 16.0 |
| AlphaFlow-PDB | 1.53 | 1.45 | 1.42 | 1.41 | 2.48 | 2.43 | 2.41 | 2.40 | 0.86 | 0.0 | 0.0 | 13.2 |
| AlphaFlow-MD | 1.74 | 1.51 | 1.45 | 1.43 | 2.44 | 2.32 | 2.28 | 2.24 | 1.26 | 0.0 | 0.1 | 26.2 |
| ESMFlow-PDB | 1.61 | 1.49 | 1.44 | 1.42 | 2.47 | 2.41 | 2.37 | 2.35 | 0.74 | 0.0 | 0.0 | 6.0 |
| ESMFlow-MD | 1.66 | 1.50 | 1.41 | 1.40 | 2.49 | 2.29 | 2.20 | 2.18 | 1.17 | 0.0 | 0.0 | 14.3 |
| ConfDiff-Open-ClsFree | 1.65 | 1.48 | 1.41 | 1.37 | 2.56 | 2.30 | 2.16 | 2.03 | 1.77 | 0.5 | 0.0 | 5.5 |
| ConfDiff-Open-MD | 1.64 | 1.50 | 1.44 | 1.42 | 2.49 | 2.39 | 2.32 | 2.31 | 1.37 | 0.2 | 0.0 | 4.6 |
| ConfDiff-ESM-ClsFree | 1.58 | 1.45 | 1.41 | 1.39 | 2.50 | 2.39 | 2.35 | 2.33 | 1.52 | 0.5 | 0.0 | 7.5 |
| ConfDiff-ESM-MD | 1.61 | 1.47 | 1.42 | 1.40 | 2.45 | 2.32 | 2.26 | 2.24 | 1.42 | 0.1 | 0.0 | 5.0 |
| ConfDiff-ESM-Energy | 1.63 | 1.47 | 1.43 | 1.42 | 2.55 | 2.43 | 2.41 | 2.40 | 1.26 | 0.1 | 0.0 | 7.5 |
| ConfDiff-ESM-Force | 1.58 | 1.44 | 1.37 | 1.36 | 2.45 | 2.33 | 2.23 | 2.22 | 1.76 | 0.1 | 0.0 | 8.9 |

The following table shows the performance on the conformation prediction task for the apo-holo dataset. apo-TM / holo-TM are the maximum TM-scores of samples against the reference apo/holo structures. 20 conformations are sampled for each protein and results are reported as the mean across 91 proteins. We highlight the best performance in bold and the second best with an underline. "N/A" stands for not applicable due to model resolution.

Accuracy: apo-TM, holo-TM, TMens. Diversity: Pairwise TM. Quality: CA clash, CA break, PepBond break.

| Model | apo-TM ↑ | holo-TM ↑ | TMens ↑ | Pairwise TM | CA clash (%) ↓ | CA break (%) ↓ | PepBond break (%) ↓ |
|---|---|---|---|---|---|---|---|
| apo model | 1.000 | 0.790 | 0.895 | N/A | N/A | N/A | N/A |
| EigenFold | 0.831 | 0.864 | 0.847 | 0.907 | 3.6 | 0.3 | N/A |
| MSA-depth256 | 0.845 | 0.889 | 0.867 | 0.978 | 0.2 | 0.0 | 4.6 |
| MSA-depth64 | 0.844 | 0.883 | 0.863 | 0.950 | 0.2 | 0.0 | 5.7 |
| MSA-depth32 | 0.824 | 0.857 | 0.841 | 0.864 | 0.2 | 0.0 | 8.9 |
| Str2Str-ODE (Tmax=0.1) | 0.762 | 0.778 | 0.770 | 0.954 | 0.2 | 0.0 | 14.0 |
| Str2Str-ODE (Tmax=0.3) | 0.766 | 0.781 | 0.774 | 0.872 | 0.2 | 0.0 | 14.7 |
| Str2Str-SDE (Tmax=0.1) | 0.682 | 0.693 | 0.688 | 0.760 | 0.2 | 1.5 | 22.6 |
| Str2Str-SDE (Tmax=0.3) | 0.680 | 0.689 | 0.684 | 0.639 | 0.2 | 1.4 | 21.1 |
| AlphaFlow-PDB | 0.855 | 0.891 | 0.873 | 0.924 | 0.3 | 0.0 | 6.6 |
| AlphaFlow-MD | 0.857 | 0.863 | 0.860 | 0.894 | 0.2 | 0.0 | 20.8 |
| ESMFlow-PDB | 0.849 | 0.882 | 0.866 | 0.935 | 0.3 | 0.0 | 4.8 |
| ESMFlow-MD | 0.851 | 0.864 | 0.858 | 0.897 | 0.1 | 0.0 | 10.9 |
| ConfDiff-Open-ClsFree | 0.838 | 0.879 | 0.859 | 0.870 | 0.8 | 0.0 | 5.8 |
| ConfDiff-Open-MD | 0.839 | 0.874 | 0.857 | 0.863 | 0.4 | 0.0 | 6.8 |
| ConfDiff-ESM-ClsFree | 0.837 | 0.864 | 0.850 | 0.846 | 0.7 | 0.0 | 4.6 |
| ConfDiff-ESM-MD | 0.836 | 0.862 | 0.849 | 0.846 | 0.3 | 0.0 | 4.1 |
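The apo-TM / holo-TM / TMens columns are consistent with the following computation: best TM-score of the sampled ensemble against each reference, then the average of the two (e.g. the apo-model row: (1.000 + 0.790) / 2 = 0.895). A minimal sketch under that reading:

```python
def tmens(apo_tms, holo_tms):
    """Two-state ensemble accuracy: best TM-score of the sampled
    conformations against the apo reference, the same for the holo
    reference, and their average (TMens)."""
    apo_best, holo_best = max(apo_tms), max(holo_tms)
    return apo_best, holo_best, (apo_best + holo_best) / 2

# Toy scores for a few sampled conformations against each reference:
# best apo-TM 0.83, best holo-TM 0.86, TMens ≈ 0.845.
print(tmens([0.70, 0.83, 0.78], [0.65, 0.86, 0.80]))
```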

Distribution Prediction

The following table shows the performance on distribution prediction for the ATLAS test set. A total of 250 structures were sampled for each protein, and the median values across 82 proteins are reported. The best performance is highlighted in bold, and the second-best is underlined. *These metrics require all-atom or backbone predictions; therefore, EigenFold and Str2Str do not have sufficient resolution for evaluation (indicated as "N/A").

Diversity: Pairwise RMSD, *RMSF. Flexibility (Pearson r): Pairwise RMSD, *Global RMSF, *Per-target RMSF. Distributional accuracy: *RMWD, MD PCA W2, Joint PCA W2, PC sim.

| Model | Pairwise RMSD | *RMSF | Pairwise RMSD r ↑ | *Global RMSF r ↑ | *Per-target RMSF r ↑ | *RMWD ↓ | MD PCA W2 ↓ | Joint PCA W2 ↓ | PC sim > 0.5 (%) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| MD iid | 2.76 | 1.63 | 0.96 | 0.97 | 0.99 | 0.71 | 0.76 | 0.70 | 93.9 |
| MD 2.5 ns | 1.54 | 0.98 | 0.89 | 0.85 | 0.85 | 2.21 | 1.57 | 1.93 | 36.6 |
| EigenFold | 5.96 | N/A | -0.04 | N/A | N/A | N/A | 2.35 | 7.96 | 12.2 |
| MSA-depth256 | 0.84 | 0.53 | 0.25 | 0.34 | 0.59 | 3.63 | 1.83 | 2.90 | 29.3 |
| MSA-depth64 | 2.03 | 1.51 | 0.24 | 0.30 | 0.57 | 4.00 | 1.87 | 3.32 | 18.3 |
| MSA-depth32 | 5.71 | 7.96 | 0.07 | 0.17 | 0.53 | 6.12 | 2.50 | 5.67 | 17.1 |
| Str2Str-ODE (t=0.1) | 1.66 | N/A | 0.13 | N/A | N/A | N/A | 2.12 | 4.42 | 6.1 |
| Str2Str-ODE (t=0.3) | 3.15 | N/A | 0.12 | N/A | N/A | N/A | 2.23 | 4.75 | 9.8 |
| Str2Str-SDE (t=0.1) | 4.74 | N/A | 0.10 | N/A | N/A | N/A | 2.54 | 8.84 | 9.8 |
| Str2Str-SDE (t=0.3) | 7.54 | N/A | 0.00 | N/A | N/A | N/A | 3.29 | 12.28 | 7.3 |
| AlphaFlow-PDB | 2.58 | 1.20 | 0.27 | 0.46 | 0.81 | 2.96 | 1.66 | 2.60 | 37.8 |
| AlphaFlow-MD | 2.88 | 1.63 | 0.53 | 0.66 | 0.85 | 2.68 | 1.53 | 2.28 | 39.0 |
| ESMFlow-PDB | 3.00 | 1.68 | 0.14 | 0.27 | 0.71 | 4.20 | 1.77 | 3.54 | 28.0 |
| ESMFlow-MD | 3.34 | 2.13 | 0.19 | 0.30 | 0.76 | 3.63 | 1.54 | 3.15 | 25.6 |
| ConfDiff-Open-ClsFree | 3.68 | 2.12 | 0.40 | 0.54 | 0.83 | 2.92 | 1.50 | 2.54 | 46.3 |
| ConfDiff-Open-PDB | 2.90 | 1.43 | 0.38 | 0.51 | 0.82 | 2.97 | 1.57 | 2.51 | 34.1 |
| ConfDiff-Open-MD | 3.43 | 2.21 | 0.59 | 0.67 | 0.85 | 2.76 | 1.44 | 2.25 | 35.4 |
| ConfDiff-ESM-ClsFree | 4.04 | 2.84 | 0.31 | 0.43 | 0.82 | 3.82 | 1.72 | 3.06 | 37.8 |
| ConfDiff-ESM-PDB | 3.42 | 2.06 | 0.29 | 0.40 | 0.80 | 3.67 | 1.70 | 3.17 | 34.1 |
| ConfDiff-ESM-MD | 3.91 | 2.79 | 0.35 | 0.48 | 0.82 | 3.67 | 1.66 | 2.89 | 39.0 |
Ensemble observables: Weak contacts J, Transient contacts J, *Exposed residue J, *Exposed MI matrix ρ. Quality: CA break, CA clash, PepBond break.

| Model | Weak contacts J ↑ | Transient contacts J ↑ | *Exposed residue J ↑ | *Exposed MI matrix ρ ↑ | CA break (%) ↓ | CA clash (%) ↓ | PepBond break (%) ↓ |
|---|---|---|---|---|---|---|---|
| MD iid | 0.90 | 0.80 | 0.93 | 0.56 | 0.0 | 0.1 | 3.4 |
| MD 2.5 ns | 0.62 | 0.45 | 0.64 | 0.24 | 0.0 | 0.1 | 3.4 |
| EigenFold | 0.36 | 0.18 | N/A | N/A | 0.7 | 9.6 | N/A |
| MSA-depth256 | 0.30 | 0.28 | 0.33 | 0.06 | 0.0 | 0.2 | 5.9 |
| MSA-depth64 | 0.38 | 0.27 | 0.38 | 0.12 | 0.0 | 0.2 | 8.4 |
| MSA-depth32 | 0.39 | 0.24 | 0.36 | 0.15 | 0.1 | 0.5 | 13.0 |
| Str2Str-ODE (t=0.1) | 0.42 | 0.17 | N/A | N/A | 0.0 | 0.1 | 13.7 |
| Str2Str-ODE (t=0.3) | 0.41 | 0.17 | N/A | N/A | 0.0 | 0.1 | 14.8 |
| Str2Str-SDE (t=0.1) | 0.40 | 0.13 | N/A | N/A | 1.6 | 0.2 | 23.0 |
| Str2Str-SDE (t=0.3) | 0.35 | 0.13 | N/A | N/A | 1.5 | 0.2 | 21.4 |
| AlphaFlow-PDB | 0.44 | 0.33 | 0.42 | 0.18 | 0.0 | 0.2 | 6.6 |
| AlphaFlow-MD | 0.57 | 0.38 | 0.50 | 0.24 | 0.0 | 0.2 | 21.7 |
| ESMFlow-PDB | 0.42 | 0.29 | 0.41 | 0.16 | 0.0 | 0.6 | 5.4 |
| ESMFlow-MD | 0.51 | 0.33 | 0.47 | 0.21 | 0.0 | 0.3 | 10.9 |
| ConfDiff-Open-PDB | 0.47 | 0.34 | 0.43 | 0.18 | 0.0 | 0.9 | 5.7 |
| ConfDiff-Open-ClsFree | 0.54 | 0.33 | 0.47 | 0.21 | 0.0 | 1.2 | 5.7 |
| ConfDiff-Open-MD | 0.59 | 0.36 | 0.50 | 0.24 | 0.0 | 0.8 | 6.3 |
| ConfDiff-ESM-PDB | 0.48 | 0.31 | 0.42 | 0.18 | 0.0 | 1.6 | 3.9 |
| ConfDiff-ESM-ClsFree | 0.54 | 0.31 | 0.47 | 0.18 | 0.0 | 1.8 | 4.3 |
| ConfDiff-ESM-MD | 0.56 | 0.34 | 0.48 | 0.23 | 0.0 | 1.5 | 4.0 |
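The ensemble-observable columns report Jaccard similarities (J) between contact sets from the reference MD ensemble and from the model's samples. A minimal sketch of such a Jaccard computation (the contact definitions themselves, e.g. distance cutoffs and the weak/transient classification, are specified in the paper and omitted here):

```python
def contact_jaccard(contacts_a, contacts_b):
    """Jaccard similarity J = |A ∩ B| / |A ∪ B| between two sets of
    residue-residue contacts, e.g. from the reference MD ensemble
    and from a model's sampled ensemble."""
    a, b = set(contacts_a), set(contacts_b)
    if not a and not b:
        return 1.0  # two empty contact sets agree trivially
    return len(a & b) / len(a | b)

# Contacts as (i, j) residue-index pairs: 2 shared out of 4 total.
md_contacts = {(3, 40), (5, 22), (10, 31)}
model_contacts = {(3, 40), (10, 31), (12, 50)}
print(contact_jaccard(md_contacts, model_contacts))  # 0.5
```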

BibTeX

@misc{ye2024proteinbench,
      title={ProteinBench: A Holistic Evaluation of Protein Foundation Models}, 
      author={Fei Ye and Zaixiang Zheng and Dongyu Xue and Yuning Shen and Lihao Wang and Yiming Ma and Yan Wang and Xinyou Wang and Xiangxin Zhou and Quanquan Gu},
      year={2024},
      eprint={2409.06744},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM},
      url={https://arxiv.org/abs/2409.06744}, 
}