Introduction

Recent years have witnessed a surge in the development of protein foundation models, significantly improving performance on protein prediction and generative tasks ranging from 3D structure prediction and protein design to conformational dynamics. However, the capabilities and limitations of these models remain poorly understood due to the absence of a unified evaluation framework. To fill this gap, we introduce ProteinBench, a holistic evaluation framework designed to enhance the transparency of protein foundation models. Our approach consists of three key components: (i) a taxonomic classification of tasks that broadly encompasses the main challenges in the protein domain, based on the relationships between different protein modalities; (ii) a multi-metric evaluation approach that assesses performance across four key dimensions: quality, novelty, diversity, and robustness; and (iii) in-depth analyses from the perspective of various user objectives, providing a holistic view of model performance. Our comprehensive evaluation of protein foundation models reveals several key findings that shed light on their current capabilities and limitations. To promote transparency and facilitate further research, we publicly release the evaluation datasets, code, and a public leaderboard, together with a general modular toolkit for further analysis. We intend for ProteinBench to be a living benchmark that establishes a standardized, in-depth evaluation framework for protein foundation models, driving their development and application while fostering collaboration within the field.

Inverse Folding

We evaluate the performance of various inverse-folding models for structure-based sequence design, focusing on two distinct objectives: natural evolutionary fitness (in-distribution proteins) and sequence design for de novo designed backbones. The latter is an out-of-distribution problem that tests the robustness of the methods, as these backbones typically contain noise absent from the high-resolution structures deposited in the PDB.

The following table shows the performance of structure-based sequence design models on inverse folding tasks. The reported results are the median over repeated runs. 'N/A' stands for not applicable. ESM-IF1 and ESM3 use all native structures and sequences for model training and are therefore not evaluated on the evolution-distribution-fitting objective.

CASP/CAMEO AAR cover the evolution-distribution-fitting objective; the per-length scTM/pLDDT columns cover de novo backbone-based sequence design.

| Model | CASP AAR ↑ | CAMEO AAR ↑ | scTM ↑ (L=100) | pLDDT ↑ (L=100) | scTM ↑ (L=200) | pLDDT ↑ (L=200) | scTM ↑ (L=300) | pLDDT ↑ (L=300) | scTM ↑ (L=400) | pLDDT ↑ (L=400) | scTM ↑ (L=500) | pLDDT ↑ (L=500) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ProteinMPNN | 0.450 | 0.468 | 0.962 | 94.14 | 0.945 | 89.34 | 0.962 | 90.28 | 0.875 | 83.76 | 0.568 | 67.09 |
| ESM-IF1 | N/A | N/A | 0.810 | 88.83 | 0.635 | 69.67 | 0.336 | 74.36 | 0.449 | 64.59 | 0.462 | 58.97 |
| LM-Design | 0.516 | 0.570 | 0.834 | 78.45 | 0.373 | 58.41 | 0.481 | 69.86 | 0.565 | 59.87 | 0.397 | 56.35 |
| ESM3 | N/A | N/A | 0.942 | 86.60 | 0.486 | 60.69 | 0.632 | 70.78 | 0.564 | 62.63 | 0.452 | 59.37 |
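The AAR (amino-acid recovery) metric above is the fraction of positions at which the designed sequence reproduces the native residue. A minimal sketch, assuming pre-aligned sequences of equal length (the function name `amino_acid_recovery` is ours, not from the benchmark code):

```python
def amino_acid_recovery(native: str, designed: str) -> float:
    """Fraction of positions where the designed sequence recovers
    the native residue (AAR). Assumes equal-length, aligned sequences."""
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    return sum(a == b for a, b in zip(native, designed)) / len(native)

# Toy example: 3 of 4 residues recovered.
print(amino_acid_recovery("MKVL", "MKAL"))  # 0.75
```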

Structure Design

We evaluate the performance of protein foundation models for backbone design. Our analysis focuses on the quality, novelty, and diversity of the generated structures across various chain lengths.

The following table shows the performance of backbone design models across chain lengths ranging from 50 to 500. The reported results are the median over repeated runs. We highlight the best performance in bold and the second best with an underline. For the novelty and diversity metrics, we only highlight results whose corresponding scTM score is higher than 0.5. 'N/A' stands for not applicable.

Quality: scTM, scRMSD. Novelty: Max TM. Diversity: pairwise TM, Max Clust.

| Model | scTM ↑ (L=50) | scRMSD ↓ (L=50) | Max TM ↓ (L=50) | pairwise TM ↓ (L=50) | Max Clust. ↑ (L=50) | scTM ↑ (L=100) | scRMSD ↓ (L=100) | Max TM ↓ (L=100) | pairwise TM ↓ (L=100) | Max Clust. ↑ (L=100) |
|---|---|---|---|---|---|---|---|---|---|---|
| Native PDBs | 0.91 | 0.74 | N/A | 0.29 | 0.66 | 0.96 | 0.67 | N/A | 0.30 | 0.77 |
| RFdiffusion | 0.95 | 0.45 | 0.65 | 0.58 | 0.67 | 0.98 | 0.48 | 0.76 | 0.41 | 0.32 |
| FrameFlow | 0.91 | 0.58 | 0.75 | 0.68 | 0.39 | 0.94 | 0.70 | 0.72 | 0.55 | 0.49 |
| Chroma | 0.85 | 1.05 | 0.59 | 0.29 | 0.48 | 0.89 | 1.27 | 0.70 | 0.35 | 0.59 |
| FrameDiff(latest) | 0.85 | 1.00 | 0.67 | 0.35 | 0.64 | 0.90 | 1.23 | 0.71 | 0.52 | 0.11 |
| FoldFlow1(sfm) | 0.90 | 0.67 | 0.68 | 0.63 | 0.48 | 0.87 | 1.34 | 0.65 | 0.49 | 0.83 |
| FoldFlow1(base) | 0.79 | 1.19 | 0.66 | 0.53 | 0.76 | 0.81 | 1.70 | 0.62 | 0.48 | 0.83 |
| FoldFlow1(ot) | 0.83 | 1.10 | 0.65 | 0.53 | 0.77 | 0.83 | 1.60 | 0.64 | 0.48 | 0.81 |
| Genie | 0.57 | 3.12 | 0.57 | 0.32 | 0.90 | 0.69 | 3.38 | 0.59 | 0.31 | 0.96 |
| Model | scTM ↑ (L=300) | scRMSD ↓ (L=300) | Max TM ↓ (L=300) | pairwise TM ↓ (L=300) | Max Clust. ↑ (L=300) | scTM ↑ (L=500) | scRMSD ↓ (L=500) | Max TM ↓ (L=500) | pairwise TM ↓ (L=500) | Max Clust. ↑ (L=500) |
|---|---|---|---|---|---|---|---|---|---|---|
| Native PDBs | 0.97 | 0.82 | N/A | 0.28 | 0.77 | 0.97 | 1.07 | N/A | 0.29 | 0.80 |
| RFdiffusion | 0.96 | 1.03 | 0.64 | 0.36 | 0.65 | 0.79 | 5.60 | 0.62 | 0.33 | 0.89 |
| FrameFlow | 0.92 | 1.95 | 0.65 | 0.43 | 0.88 | 0.61 | 7.92 | 0.61 | 0.40 | 0.92 |
| Chroma | 0.87 | 2.47 | 0.66 | 0.36 | 0.67 | 0.72 | 6.71 | 0.60 | 0.29 | 0.99 |
| FrameDiff(latest) | 0.87 | 2.73 | 0.69 | 0.48 | 0.21 | 0.63 | 9.52 | 0.58 | 0.40 | 0.52 |
| FoldFlow1(sfm) | 0.45 | 9.04 | 0.54 | 0.39 | 1.00 | 0.37 | 13.04 | 0.53 | 0.37 | 1.00 |
| FoldFlow1(base) | 0.43 | 9.56 | 0.54 | 0.39 | 0.98 | 0.35 | 13.20 | 0.52 | 0.39 | 1.00 |
| FoldFlow1(ot) | 0.54 | 8.21 | 0.58 | 0.41 | 0.94 | 0.37 | 12.48 | 0.51 | 0.35 | 1.00 |
| Genie | 0.27 | 20.37 | 0.30 | 0.23 | 1.00 | 0.25 | 26.08 | 0.22 | 0.23 | 1.00 |
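The pairwise TM diversity metric above can be read as the average TM-score over all pairs of generated backbones, so lower values mean a more diverse set. A minimal sketch under that assumption (computing the TM-scores themselves would require a structure-alignment tool such as TM-align; here we take the matrix as given):

```python
from itertools import combinations

def mean_pairwise_tm(tm):
    """Average TM-score over all unordered pairs of generated structures
    (off-diagonal entries of a symmetric N x N TM-score matrix).
    Lower values indicate a more diverse set of backbones."""
    pairs = list(combinations(range(len(tm)), 2))
    return sum(tm[i][j] for i, j in pairs) / len(pairs)

# Toy 3x3 matrix; the diagonal (self-similarity) is ignored.
tm = [
    [1.00, 0.40, 0.50],
    [0.40, 1.00, 0.30],
    [0.50, 0.30, 1.00],
]
print(mean_pairwise_tm(tm))  # (0.40 + 0.50 + 0.30) / 3 ≈ 0.40
```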

Sequence Design

We assess the performance of various protein sequence generative models based on the quality, diversity, and novelty of their generated sequences across different chain lengths.

The following table shows the performance of protein sequence generative/language models on sequence generation tasks. The reported results are the mean over repeated runs. The pLDDT score is the output of AlphaFold2. Max TM abbreviates the maximum TM-score against the PDB database. 'N/A' stands for not applicable. We highlight the best performance in bold.

Quality: ppl, pLDDT. Diversity: pairwise TM, Max Clust. Novelty: Max TM.

| Model | ppl ↓ (L=100) | pLDDT ↑ (L=100) | pairwise TM ↓ (L=100) | Max Clust. ↑ (L=100) | Max TM ↓ (L=100) | ppl ↓ (L=200) | pLDDT ↑ (L=200) | pairwise TM ↓ (L=200) | Max Clust. ↑ (L=200) | Max TM ↓ (L=200) |
|---|---|---|---|---|---|---|---|---|---|---|
| Native Seqs | N/A | 68.46 | 0.55 | 0.75 | N/A | N/A | 61.91 | 0.49 | 0.78 | N/A |
| Progen 2 (700M) | 8.28 | 64.00 | 0.42 | 0.94 | 0.64 | 5.68 | 69.91 | 0.40 | 0.91 | 0.69 |
| EvoDiff | 16.89 | 50.20 | 0.43 | 0.98 | 0.69 | 17.28 | 50.66 | 0.36 | 1.00 | 0.71 |
| DPLM (650M) | 6.21 | 85.38 | 0.50 | 0.80 | 0.74 | 4.61 | 93.54 | 0.54 | 0.70 | 0.91 |
| ESM3 (1.4B) | 14.79 | 54.26 | 0.45 | 0.90 | 0.68 | 12.96 | 58.45 | 0.35 | 1.00 | 0.80 |
| Model | ppl ↓ (L=300) | pLDDT ↑ (L=300) | pairwise TM ↓ (L=300) | Max Clust. ↑ (L=300) | Max TM ↓ (L=300) | ppl ↓ (L=500) | pLDDT ↑ (L=500) | pairwise TM ↓ (L=500) | Max Clust. ↑ (L=500) | Max TM ↓ (L=500) |
|---|---|---|---|---|---|---|---|---|---|---|
| Native Seqs | N/A | 61.49 | 0.51 | 0.85 | N/A | N/A | 62.95 | 0.51 | 0.78 | N/A |
| Progen 2 (700M) | 6.25 | 65.69 | 0.42 | 0.93 | 0.66 | 4.27 | 61.45 | 0.32 | 0.95 | 0.68 |
| EvoDiff | 17.13 | 45.14 | 0.31 | 1.00 | 0.68 | 16.51 | 43.14 | 0.31 | 1.00 | 0.69 |
| DPLM (650M) | 3.47 | 93.07 | 0.57 | 0.63 | 0.91 | 3.33 | 87.73 | 0.43 | 0.85 | 0.85 |
| ESM3 (1.4B) | 14.59 | 48.08 | 0.32 | 1.00 | 0.75 | 11.10 | 52.17 | 0.30 | 1.00 | 0.54 |
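Perplexity (ppl) above is the exponentiated negative mean log-likelihood the language model assigns to a sequence; lower means the model finds the sequence more plausible. A small worked example (pure Python, assuming natural-log per-residue probabilities):

```python
import math

def perplexity(log_probs):
    """Sequence perplexity from per-residue natural-log probabilities:
    ppl = exp(-mean(log p)). Lower is better."""
    return math.exp(-sum(log_probs) / len(log_probs))

# If every residue gets probability 1/20 (uniform over the 20 amino
# acids), the perplexity is 20 regardless of sequence length.
uniform = [math.log(1 / 20)] * 100
print(perplexity(uniform))  # ≈ 20.0
```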

Sequence and Structure Co-Design

We assess the performance of various sequence-structure co-design models based on the quality, diversity, and novelty of their jointly generated sequences and structures across different chain lengths. The evaluation metrics include self-consistency TM-score (scTM) and scRMSD for quality, maximum cluster values for structural diversity, and the maximum TM-score to PDB structures for structural novelty.

The following table shows the performance of sequence-structure co-design models on joint generation tasks. The reported results are the mean over repeated runs with the standard deviation. Max TM abbreviates the maximum TM-score against the PDB database. 'N/A' stands for not applicable. We highlight the best performance in bold.

Quality: scTM, scRMSD. Diversity: Max Clust. Novelty: Max TM.

| Model | scTM ↑ (L=100) | scRMSD ↓ (L=100) | Max Clust. ↑ (L=100) | Max TM ↓ (L=100) | scTM ↑ (L=200) | scRMSD ↓ (L=200) | Max Clust. ↑ (L=200) | Max TM ↓ (L=200) |
|---|---|---|---|---|---|---|---|---|
| Native PDBs | 0.91 | 2.98 | 0.75 | N/A | 0.88 | 3.24 | 0.77 | N/A |
| ProteinGenerator | 0.91 | 3.75 | 0.24 | 0.73 | 0.88 | 6.24 | 0.25 | 0.72 |
| ProtPardelle* | 0.56 | 12.9 | 0.57 | 0.66 | 0.64 | 13.67 | 0.10 | 0.69 |
| Multiflow | 0.96 | 1.10 | 0.33 | 0.71 | 0.95 | 1.61 | 0.42 | 0.71 |
| ESM3* | 0.72 | 13.80 | 0.64 | 0.41 | 0.63 | 21.18 | 0.63 | 0.61 |
| Model | scTM ↑ (L=300) | scRMSD ↓ (L=300) | Max Clust. ↑ (L=300) | Max TM ↓ (L=300) | scTM ↑ (L=500) | scRMSD ↓ (L=500) | Max Clust. ↑ (L=500) | Max TM ↓ (L=500) |
|---|---|---|---|---|---|---|---|---|
| Native PDBs | 0.92 | 3.94 | 0.75 | N/A | 0.90 | 9.64 | 0.80 | N/A |
| ProteinGenerator | 0.81 | 9.26 | 0.22 | 0.71 | 0.41 | 33.91 | 0.18 | 0.73 |
| ProtPardelle* | 0.69 | 14.91 | 0.04 | 0.72 | 0.40 | 41.23 | 0.60 | 0.69 |
| Multiflow | 0.96 | 2.14 | 0.58 | 0.71 | 0.83 | 8.48 | 0.67 | 0.68 |
| ESM3 (1.4B)* | 0.59 | 25.5 | 0.52 | 0.73 | 0.54 | 33.7 | 0.37 | 0.77 |
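The Max Clust. metric above reports the number of structural clusters relative to the number of samples; more clusters means greater diversity. A minimal sketch using greedy leader clustering at a TM-score threshold of 0.5 (the benchmark may use a dedicated clustering tool instead; `cluster_fraction` is our illustrative name):

```python
def cluster_fraction(tm, threshold=0.5):
    """Greedy leader clustering on a pairwise TM-score matrix: each
    structure joins the first cluster whose representative it matches
    with TM >= threshold, otherwise it seeds a new cluster.
    Returns (#clusters / #structures); higher means more diverse."""
    reps = []  # indices of cluster representatives
    for i in range(len(tm)):
        if not any(tm[i][r] >= threshold for r in reps):
            reps.append(i)
    return len(reps) / len(tm)

# Structures 0 and 1 are similar (TM 0.8); structure 2 is distinct,
# giving 2 clusters among 3 samples.
tm = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
print(cluster_fraction(tm))  # ≈ 0.67
```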

Motif-Scaffolding

We evaluate the performance of various motif-scaffolding methods across different scaffolds, focusing on their effectiveness in designing scaffold structures. The primary objective of this evaluation is to compare the efficacy of structure-based and sequence-based approaches in generating designable scaffolds.

The following figure shows the motif-scaffolding performance of structure-based and sequence-based methods.

[Figure: motif-scaffolding results for structure-based and sequence-based methods]

Antibody Design

Using 13 metrics spanning four perspectives, we comprehensively evaluate the CDR-H3 regions (both sequence and structure) generated by each method for given antigens, thereby revealing their actual performance. All methods were trained on the same dataset with the hyperparameters reported in their respective papers, and tested on a unified set of 55 complexes from the RAbD dataset.

The following table shows the performance and characteristic preferences of each method (methods capable of generating multiple antibodies per antigen are marked with *). The results also reveal that conventional metrics such as AAR and RMSD are, on their own, insufficient for evaluating antibody design methods.

Accuracy: AAR, RMSD, TM-score. Functionality: Binding Energy. Specificity: SeqSim-outer, SeqSim-inner, PHR.

| Method | AAR ↑ | RMSD ↓ | TM-score ↑ | Binding Energy ↓ | SeqSim-outer ↓ | SeqSim-inner ↑ | PHR ↓ |
|---|---|---|---|---|---|---|---|
| RAbD (natural) | 100.00% | 0.00 | 1.00 | -15.33 | 0.26 | N/A | 45.78% |
| HERN | 33.17% | 9.86 | 0.16 | 1242.77 | 0.41 | N/A | 39.83% |
| MEAN | 33.47% | 1.82 | 0.25 | 263.90 | 0.65 | N/A | 40.74% |
| dyMEAN | 40.95% | 2.36 | 0.36 | 889.28 | 0.58 | N/A | 42.04% |
| *dyMEAN-FixFR | 40.05% | 2.37 | 0.35 | 612.75 | 0.60 | 0.96 | 43.75% |
| *DiffAb | 35.04% | 2.53 | 0.37 | 489.42 | 0.37 | 0.45 | 40.68% |
| *AbDPO | 31.29% | 2.79 | 0.35 | 116.06 | 0.38 | 0.60 | 69.69% |
| *AbDPO++ | 36.25% | 2.48 | 0.35 | 223.73 | 0.39 | 0.54 | 44.51% |
Rationality metrics:

| Method | CN-score ↑ | Clashes-inner ↓ | Clashes-outer ↓ | SeqNat ↑ | Total Energy ↓ | scRMSD ↓ |
|---|---|---|---|---|---|---|
| RAbD (natural) | 50.19 | 0.07 | 0.00 | -1.74 | -16.76 | 1.77 |
| HERN | 0.04 | 0.04 | 3.25 | -1.47 | 5408.74 | 9.89 |
| MEAN | 1.33 | 11.65 | 0.29 | -1.83 | 1077.32 | 2.77 |
| dyMEAN | 1.49 | 9.15 | 0.47 | -1.79 | 1642.65 | 2.11 |
| *dyMEAN-FixFR | 1.14 | 8.88 | 0.48 | -1.82 | 1239.29 | 2.48 |
| *DiffAb | 2.02 | 1.84 | 0.19 | -1.88 | 495.69 | 2.57 |
| *AbDPO | 1.33 | 4.14 | 0.10 | -1.99 | 270.12 | 2.79 |
| *AbDPO++ | 2.34 | 1.66 | 0.08 | -1.78 | 338.14 | 2.50 |

Protein Folding: Single State Prediction

We assess the performance of protein folding as single-state conformation prediction. Protein folding models have played a pivotal role in understanding sequence-structure relationships and serve as a foundational component for protein conformation models. Therefore, it is essential to benchmark their performance in discussions of protein conformation prediction.

The following table shows the performance of protein folding methods on CAMEO 2022. Results are reported as mean/median over 183 proteins. We highlight the best performance in bold and the second best with an underline. "N/A" stands for not applicable. *Unknown amino acids ("X") in the sequence must be removed for EigenFold, which may introduce small differences in the evaluated metrics.

Accuracy: TM-score, RMSD, GDT-TS, lDDT. Quality: CA clash, CA break, PepBond break.

| Model | TM-score ↑ | RMSD ↓ | GDT-TS ↑ | lDDT ↑ | CA clash (%) ↓ | CA break (%) ↓ | PepBond break (%) ↓ |
|---|---|---|---|---|---|---|---|
| AlphaFold2 | 0.871 | 3.21 | 0.860 | 0.900 | 0.3 | 0.0 | 4.8 |
| OpenFold | 0.870 | 3.21 | 0.856 | 0.895 | 0.4 | 0.0 | 2.0 |
| RoseTTAFold2 | 0.859 | 3.52 | 0.845 | 0.888 | 0.3 | 0.2 | 5.5 |
| ESMFold | 0.847 | 3.98 | 0.826 | 0.870 | 0.3 | 0.0 | 4.7 |
| EigenFold* | 0.743 | 7.65 | 0.703 | 0.737 | 8.0 | 0.5 | N/A |
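For reference, the TM-score used throughout these tables is, for a fixed residue alignment, a length-normalized sum over aligned CA distances with Zhang and Skolnick's distance scale d0. A minimal sketch (the full metric also optimizes the superposition, which we omit; d0 is valid for target lengths above 15 residues):

```python
def tm_score(distances, l_target):
    """TM-score for a fixed residue alignment: `distances` are CA-CA
    distances (in Å) of aligned pairs after superposition, `l_target`
    is the target length. d0 is the standard length-dependent scale
    (valid for l_target > 15)."""
    d0 = 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target

# A perfect full-coverage prediction (all distances 0) scores 1.0.
print(tm_score([0.0] * 100, 100))  # 1.0
```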

Multiple State Prediction

In the following, we evaluate the performance of predicting multiple conformational states.

The following table shows the performance on multi-state prediction for BPTI. Accuracy metrics (RMSDens, RMSD Cluster 3) are reported as the mean over 20 bootstrap resamples (with replacement) at different numbers of sampled conformations (N = 10–1000). Diversity and quality scores are evaluated on 1000 conformations for each model. We highlight the best performance in bold and the second best with an underline. "N/A" stands for not applicable due to model resolution. RMSD unit: Å.

Diversity: Pairwise RMSD. Quality: CA clash, CA break, PepBond break.

| Model | RMSDens ↓ (N=10) | RMSDens ↓ (N=100) | RMSDens ↓ (N=500) | RMSDens ↓ (N=1000) | RMSD Clu3 ↓ (N=10) | RMSD Clu3 ↓ (N=100) | RMSD Clu3 ↓ (N=500) | RMSD Clu3 ↓ (N=1000) | Pairwise RMSD | CA clash (%) ↓ | CA break (%) ↓ | PepBond break (%) ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EigenFold | 1.56 | 1.50 | 1.47 | 1.46 | 2.54 | 2.48 | 2.46 | 2.46 | 0.85 | 1.4 | 4.3 | N/A |
| MSA-depth256 | 1.57 | 1.54 | 1.52 | 1.52 | 2.51 | 2.47 | 2.45 | 2.45 | 0.20 | 0.0 | 0.0 | 9.2 |
| MSA-depth64 | 1.60 | 1.54 | 1.51 | 1.50 | 2.48 | 2.40 | 2.35 | 2.33 | 0.55 | 0.0 | 0.0 | 7.9 |
| MSA-depth32 | 1.67 | 1.53 | 1.45 | 1.41 | 2.39 | 2.21 | 1.93 | 1.87 | 2.14 | 0.6 | 0.0 | 10.6 |
| Str2Str-ODE (Tmax=0.15) | 2.36 | 2.19 | 2.10 | 2.08 | 3.03 | 2.68 | 2.60 | 2.56 | 1.86 | 0.0 | 0.0 | 13.9 |
| Str2Str-SDE (Tmax=0.15) | 2.83 | 2.48 | 2.28 | 2.25 | 3.42 | 2.92 | 2.52 | 2.48 | 3.60 | 0.3 | 0.0 | 16.0 |
| AlphaFlow-PDB | 1.53 | 1.45 | 1.42 | 1.41 | 2.48 | 2.43 | 2.41 | 2.40 | 0.86 | 0.0 | 0.0 | 13.2 |
| AlphaFlow-MD | 1.74 | 1.51 | 1.45 | 1.43 | 2.44 | 2.32 | 2.28 | 2.24 | 1.26 | 0.0 | 0.1 | 26.2 |
| ESMFlow-PDB | 1.61 | 1.49 | 1.44 | 1.42 | 2.47 | 2.41 | 2.37 | 2.35 | 0.74 | 0.0 | 0.0 | 6.0 |
| ESMFlow-MD | 1.66 | 1.50 | 1.41 | 1.40 | 2.49 | 2.29 | 2.20 | 2.18 | 1.17 | 0.0 | 0.0 | 14.3 |
| ConfDiff-Open-ClsFree | 1.65 | 1.48 | 1.41 | 1.37 | 2.56 | 2.30 | 2.16 | 2.03 | 1.77 | 0.5 | 0.0 | 5.5 |
| ConfDiff-Open-MD | 1.64 | 1.50 | 1.44 | 1.42 | 2.49 | 2.39 | 2.32 | 2.31 | 1.37 | 0.2 | 0.0 | 4.6 |
| ConfDiff-ESM-ClsFree | 1.58 | 1.45 | 1.41 | 1.39 | 2.50 | 2.39 | 2.35 | 2.33 | 1.52 | 0.5 | 0.0 | 7.5 |
| ConfDiff-ESM-MD | 1.61 | 1.47 | 1.42 | 1.40 | 2.45 | 2.32 | 2.26 | 2.24 | 1.42 | 0.1 | 0.0 | 5.0 |
| ConfDiff-ESM-Energy | 1.63 | 1.47 | 1.43 | 1.42 | 2.55 | 2.43 | 2.41 | 2.40 | 1.26 | 0.1 | 0.0 | 7.5 |
| ConfDiff-ESM-Force | 1.58 | 1.44 | 1.37 | 1.36 | 2.45 | 2.33 | 2.23 | 2.22 | 1.76 | 0.1 | 0.0 | 8.9 |

The following table shows the performance on the conformation prediction task for the apo-holo dataset. apo-TM / holo-TM are the maximum TM-scores of samples against the reference apo/holo structures. 20 conformations are sampled for each protein and results are reported as the mean across 91 proteins. We highlight the best performance in bold and the second best with an underline. "N/A" stands for not applicable due to model resolution.

Accuracy: apo-TM, holo-TM, TMens. Diversity: Pairwise TM. Quality: CA clash, CA break, PepBond break.

| Model | apo-TM ↑ | holo-TM ↑ | TMens ↑ | Pairwise TM | CA clash (%) ↓ | CA break (%) ↓ | PepBond break (%) ↓ |
|---|---|---|---|---|---|---|---|
| apo model | 1.000 | 0.790 | 0.895 | N/A | N/A | N/A | N/A |
| EigenFold | 0.831 | 0.864 | 0.847 | 0.907 | 3.6 | 0.3 | N/A |
| MSA-depth256 | 0.845 | 0.889 | 0.867 | 0.978 | 0.2 | 0.0 | 4.6 |
| MSA-depth64 | 0.844 | 0.883 | 0.863 | 0.950 | 0.2 | 0.0 | 5.7 |
| MSA-depth32 | 0.824 | 0.857 | 0.841 | 0.864 | 0.2 | 0.0 | 8.9 |
| Str2Str-ODE (Tmax=0.1) | 0.762 | 0.778 | 0.770 | 0.954 | 0.2 | 0.0 | 14.0 |
| Str2Str-ODE (Tmax=0.3) | 0.766 | 0.781 | 0.774 | 0.872 | 0.2 | 0.0 | 14.7 |
| Str2Str-SDE (Tmax=0.1) | 0.682 | 0.693 | 0.688 | 0.760 | 0.2 | 1.5 | 22.6 |
| Str2Str-SDE (Tmax=0.3) | 0.680 | 0.689 | 0.684 | 0.639 | 0.2 | 1.4 | 21.1 |
| AlphaFlow-PDB | 0.855 | 0.891 | 0.873 | 0.924 | 0.3 | 0.0 | 6.6 |
| AlphaFlow-MD | 0.857 | 0.863 | 0.860 | 0.894 | 0.2 | 0.0 | 20.8 |
| ESMFlow-PDB | 0.849 | 0.882 | 0.866 | 0.935 | 0.3 | 0.0 | 4.8 |
| ESMFlow-MD | 0.851 | 0.864 | 0.858 | 0.897 | 0.1 | 0.0 | 10.9 |
| ConfDiff-Open-ClsFree | 0.838 | 0.879 | 0.859 | 0.870 | 0.8 | 0.0 | 5.8 |
| ConfDiff-Open-MD | 0.839 | 0.874 | 0.857 | 0.863 | 0.4 | 0.0 | 6.8 |
| ConfDiff-ESM-ClsFree | 0.837 | 0.864 | 0.850 | 0.846 | 0.7 | 0.0 | 4.6 |
| ConfDiff-ESM-MD | 0.836 | 0.862 | 0.849 | 0.846 | 0.3 | 0.0 | 4.1 |
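The apo-TM / holo-TM / TMens columns are consistent with the following computation: best TM-score of the sampled ensemble against each reference, then the average of the two (e.g. the apo-model row: (1.000 + 0.790) / 2 = 0.895). A minimal sketch under that reading:

```python
def tmens(apo_tms, holo_tms):
    """Two-state ensemble accuracy: best TM-score of the sampled
    conformations against the apo reference, the same for the holo
    reference, and their average (TMens)."""
    apo_best, holo_best = max(apo_tms), max(holo_tms)
    return apo_best, holo_best, (apo_best + holo_best) / 2

# Toy scores for a few sampled conformations against each reference:
# best apo-TM 0.83, best holo-TM 0.86, TMens ≈ 0.845.
print(tmens([0.70, 0.83, 0.78], [0.65, 0.86, 0.80]))
```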

Distribution Prediction

The following table shows the performance on distribution prediction for the ATLAS test set. A total of 250 structures were sampled for each protein, and the median values across 82 proteins are reported. The best performance is highlighted in bold, and the second-best is underlined. *These metrics require all-atom or backbone predictions; therefore, EigenFold and Str2Str do not have sufficient resolution for evaluation (indicated as "N/A").

Diversity: Pairwise RMSD, *RMSF. Flexibility (Pearson r): Pairwise RMSD, *Global RMSF, *Per-target RMSF. Distributional accuracy: *RMWD, MD PCA W2, Joint PCA W2, PC sim.

| Model | Pairwise RMSD | *RMSF | Pairwise RMSD r ↑ | *Global RMSF r ↑ | *Per-target RMSF r ↑ | *RMWD ↓ | MD PCA W2 ↓ | Joint PCA W2 ↓ | PC sim > 0.5 (%) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| MD iid | 2.76 | 1.63 | 0.96 | 0.97 | 0.99 | 0.71 | 0.76 | 0.70 | 93.9 |
| MD 2.5 ns | 1.54 | 0.98 | 0.89 | 0.85 | 0.85 | 2.21 | 1.57 | 1.93 | 36.6 |
| EigenFold | 5.96 | N/A | -0.04 | N/A | N/A | N/A | 2.35 | 7.96 | 12.2 |
| MSA-depth256 | 0.84 | 0.53 | 0.25 | 0.34 | 0.59 | 3.63 | 1.83 | 2.90 | 29.3 |
| MSA-depth64 | 2.03 | 1.51 | 0.24 | 0.30 | 0.57 | 4.00 | 1.87 | 3.32 | 18.3 |
| MSA-depth32 | 5.71 | 7.96 | 0.07 | 0.17 | 0.53 | 6.12 | 2.50 | 5.67 | 17.1 |
| Str2Str-ODE (t=0.1) | 1.66 | N/A | 0.13 | N/A | N/A | N/A | 2.12 | 4.42 | 6.1 |
| Str2Str-ODE (t=0.3) | 3.15 | N/A | 0.12 | N/A | N/A | N/A | 2.23 | 4.75 | 9.8 |
| Str2Str-SDE (t=0.1) | 4.74 | N/A | 0.10 | N/A | N/A | N/A | 2.54 | 8.84 | 9.8 |
| Str2Str-SDE (t=0.3) | 7.54 | N/A | 0.00 | N/A | N/A | N/A | 3.29 | 12.28 | 7.3 |
| AlphaFlow-PDB | 2.58 | 1.20 | 0.27 | 0.46 | 0.81 | 2.96 | 1.66 | 2.60 | 37.8 |
| AlphaFlow-MD | 2.88 | 1.63 | 0.53 | 0.66 | 0.85 | 2.68 | 1.53 | 2.28 | 39.0 |
| ESMFlow-PDB | 3.00 | 1.68 | 0.14 | 0.27 | 0.71 | 4.20 | 1.77 | 3.54 | 28.0 |
| ESMFlow-MD | 3.34 | 2.13 | 0.19 | 0.30 | 0.76 | 3.63 | 1.54 | 3.15 | 25.6 |
| ConfDiff-Open-ClsFree | 3.68 | 2.12 | 0.40 | 0.54 | 0.83 | 2.92 | 1.50 | 2.54 | 46.3 |
| ConfDiff-Open-PDB | 2.90 | 1.43 | 0.38 | 0.51 | 0.82 | 2.97 | 1.57 | 2.51 | 34.1 |
| ConfDiff-Open-MD | 3.43 | 2.21 | 0.59 | 0.67 | 0.85 | 2.76 | 1.44 | 2.25 | 35.4 |
| ConfDiff-ESM-ClsFree | 4.04 | 2.84 | 0.31 | 0.43 | 0.82 | 3.82 | 1.72 | 3.06 | 37.8 |
| ConfDiff-ESM-PDB | 3.42 | 2.06 | 0.29 | 0.40 | 0.80 | 3.67 | 1.70 | 3.17 | 34.1 |
| ConfDiff-ESM-MD | 3.91 | 2.79 | 0.35 | 0.48 | 0.82 | 3.67 | 1.66 | 2.89 | 39.0 |
Ensemble observables: Weak contacts J, Transient contacts J, *Exposed residue J, *Exposed MI matrix ρ. Quality: CA break, CA clash, PepBond break.

| Model | Weak contacts J ↑ | Transient contacts J ↑ | *Exposed residue J ↑ | *Exposed MI matrix ρ ↑ | CA break (%) ↓ | CA clash (%) ↓ | PepBond break (%) ↓ |
|---|---|---|---|---|---|---|---|
| MD iid | 0.90 | 0.80 | 0.93 | 0.56 | 0.0 | 0.1 | 3.4 |
| MD 2.5 ns | 0.62 | 0.45 | 0.64 | 0.24 | 0.0 | 0.1 | 3.4 |
| EigenFold | 0.36 | 0.18 | N/A | N/A | 0.7 | 9.6 | N/A |
| MSA-depth256 | 0.30 | 0.28 | 0.33 | 0.06 | 0.0 | 0.2 | 5.9 |
| MSA-depth64 | 0.38 | 0.27 | 0.38 | 0.12 | 0.0 | 0.2 | 8.4 |
| MSA-depth32 | 0.39 | 0.24 | 0.36 | 0.15 | 0.1 | 0.5 | 13.0 |
| Str2Str-ODE (t=0.1) | 0.42 | 0.17 | N/A | N/A | 0.0 | 0.1 | 13.7 |
| Str2Str-ODE (t=0.3) | 0.41 | 0.17 | N/A | N/A | 0.0 | 0.1 | 14.8 |
| Str2Str-SDE (t=0.1) | 0.40 | 0.13 | N/A | N/A | 1.6 | 0.2 | 23.0 |
| Str2Str-SDE (t=0.3) | 0.35 | 0.13 | N/A | N/A | 1.5 | 0.2 | 21.4 |
| AlphaFlow-PDB | 0.44 | 0.33 | 0.42 | 0.18 | 0.0 | 0.2 | 6.6 |
| AlphaFlow-MD | 0.57 | 0.38 | 0.50 | 0.24 | 0.0 | 0.2 | 21.7 |
| ESMFlow-PDB | 0.42 | 0.29 | 0.41 | 0.16 | 0.0 | 0.6 | 5.4 |
| ESMFlow-MD | 0.51 | 0.33 | 0.47 | 0.21 | 0.0 | 0.3 | 10.9 |
| ConfDiff-Open-PDB | 0.47 | 0.34 | 0.43 | 0.18 | 0.0 | 0.9 | 5.7 |
| ConfDiff-Open-ClsFree | 0.54 | 0.33 | 0.47 | 0.21 | 0.0 | 1.2 | 5.7 |
| ConfDiff-Open-MD | 0.59 | 0.36 | 0.50 | 0.24 | 0.0 | 0.8 | 6.3 |
| ConfDiff-ESM-PDB | 0.48 | 0.31 | 0.42 | 0.18 | 0.0 | 1.6 | 3.9 |
| ConfDiff-ESM-ClsFree | 0.54 | 0.31 | 0.47 | 0.18 | 0.0 | 1.8 | 4.3 |
| ConfDiff-ESM-MD | 0.56 | 0.34 | 0.48 | 0.23 | 0.0 | 1.5 | 4.0 |
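The ensemble-observable columns report Jaccard similarities (J) between contact sets from the reference MD ensemble and from the model's samples. A minimal sketch of such a Jaccard computation (the contact definitions themselves, e.g. distance cutoffs and the weak/transient classification, are specified in the paper and omitted here):

```python
def contact_jaccard(contacts_a, contacts_b):
    """Jaccard similarity J = |A ∩ B| / |A ∪ B| between two sets of
    residue-residue contacts, e.g. from the reference MD ensemble
    and from a model's sampled ensemble."""
    a, b = set(contacts_a), set(contacts_b)
    if not a and not b:
        return 1.0  # two empty contact sets agree trivially
    return len(a & b) / len(a | b)

# Contacts as (i, j) residue-index pairs: 2 shared out of 4 total.
md_contacts = {(3, 40), (5, 22), (10, 31)}
model_contacts = {(3, 40), (10, 31), (12, 50)}
print(contact_jaccard(md_contacts, model_contacts))  # 0.5
```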

BibTeX

@misc{ye2024proteinbench,
      title={ProteinBench: A Holistic Evaluation of Protein Foundation Models}, 
      author={Fei Ye and Zaixiang Zheng and Dongyu Xue and Yuning Shen and Lihao Wang and Yiming Ma and Yan Wang and Xinyou Wang and Xiangxin Zhou and Quanquan Gu},
      year={2024},
      eprint={2409.06744},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM},
      url={https://arxiv.org/abs/2409.06744}, 
}