A trainable PyTorch reproduction of AlphaFold 3.
For more information on the model’s performance and capabilities, see our technical report.

Follow these steps to set up and run Protenix:
Install Docker (with GPU Support)
Ensure that Docker is installed and configured with GPU support. You can verify that containers can see your GPUs with:
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Pull the Docker image, which was built from this Dockerfile:
docker pull ai4s-cn-beijing.cr.volces.com/infra/protenix:v0.0.1
Clone this repository and cd into it
git clone https://github.com/bytedance/protenix.git
cd ./protenix
pip install -e .
Run Docker with an interactive shell
docker run --gpus all -it -v $(pwd):/workspace -v /dev/shm:/dev/shm ai4s-cn-beijing.cr.volces.com/infra/protenix:v0.0.1 /bin/bash
After running the above commands, you’ll be inside the container’s environment and can execute commands as you would on a normal Linux terminal.
export LAYERNORM_TYPE=fast_layernorm
If the environment variable LAYERNORM_TYPE is set to fast_layernorm, the model will use the layernorm kernel we have developed; otherwise, the native PyTorch layernorm is used. The kernels are compiled the first time fast_layernorm is called.

To use DS4Sci_EvoformerAttention, add --use_deepspeed_evo_attention true to the command line. DS4Sci_EvoformerAttention is implemented based on CUTLASS, so you need to clone the CUTLASS repository and point the environment variable CUTLASS_PATH to it. The Dockerfile already includes this setting:
RUN git clone -b v3.5.1 https://github.com/NVIDIA/cutlass.git /opt/cutlass
ENV CUTLASS_PATH=/opt/cutlass
The kernels will be compiled the first time DS4Sci_EvoformerAttention is called.
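If you are setting this up outside the provided image, a minimal sketch of the equivalent steps (the install location is just an example):

```bash
# Clone CUTLASS and point CUTLASS_PATH at it (mirrors the Dockerfile lines above);
# /opt/cutlass is only an example location.
git clone -b v3.5.1 https://github.com/NVIDIA/cutlass.git /opt/cutlass
export CUTLASS_PATH=/opt/cutlass
# Optionally enable the custom layernorm kernel as well.
export LAYERNORM_TYPE=fast_layernorm
```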
To download the wwPDB dataset and the preprocessed training data, you need at least 1 TB of disk space.
Use the following commands to download the preprocessed wwPDB training databases:
wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data.tar.gz
tar -xzvf /af3-dev/release_data/release_data.tar.gz -C /af3-dev/release_data/
rm /af3-dev/release_data/release_data.tar.gz
The data should be placed in the /af3-dev/release_data/ directory. You can also download it to a different directory, but remember to modify DATA_ROOT_DIR in configs/configs_data.py correspondingly (see the example after the listing below). The data hierarchy after extraction is as follows:
├── components.v20240608.cif [408M] # ccd source file
├── components.v20240608.cif.rdkit_mol.pkl [121M] # rdkit Mol object generated by ccd source file
├── indices [33M] # chain or interface entries
├── mmcif [283G] # raw mmcif data
├── mmcif_bioassembly [36G] # preprocessed wwPDB structural data
├── mmcif_msa [450G] # msa files
├── posebusters_bioassembly [42M] # preprocessed posebusters structural data
├── posebusters_mmcif [361M] # raw mmcif data
├── recentPDB_bioassembly [1.5G] # preprocessed recentPDB structural data
└── seq_to_pdb_index.json [45M] # sequence to pdb id mapping file
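For example, a sketch of downloading to a custom directory instead of /af3-dev/release_data/ (the path below is only an example; point DATA_ROOT_DIR in configs/configs_data.py at whatever directory you choose):

```bash
# Download and extract the release data into a custom directory.
CUSTOM_DATA_DIR=/my/protenix_data   # example path; set DATA_ROOT_DIR to this value
mkdir -p ${CUSTOM_DATA_DIR}
wget -P ${CUSTOM_DATA_DIR} https://af3-dev.tos-cn-beijing.volces.com/release_data.tar.gz
tar -xzvf ${CUSTOM_DATA_DIR}/release_data.tar.gz -C ${CUSTOM_DATA_DIR}
rm ${CUSTOM_DATA_DIR}/release_data.tar.gz
```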
With the above data, you can run the training demo from scratch. components.v20240608.cif and components.v20240608.cif.rdkit_mol.pkl are also used in the inference pipeline to generate the CCD reference features. If you only want to run inference, the full released data is not necessary; you can download just these two files:
wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif
wget -P /af3-dev/release_data/ https://af3-dev.tos-cn-beijing.volces.com/release_data/components.v20240608.cif.rdkit_mol.pkl
Data processing scripts are still being organized and prepared, and distillation data will be released in the future.
Use the following command to download pretrained checkpoint [1.4G]:
wget -P /af3-dev/release_model/ https://af3-dev.tos-cn-beijing.volces.com/release_model/model_v1.pt
The checkpoint should be placed in the /af3-dev/release_model/ directory.
You can use notebooks/protenix_inference.ipynb to run model inference. Alternatively, you can run the script inference_demo.sh:
bash inference_demo.sh
Arguments in this script are explained as follows:
load_checkpoint_path: path to the model checkpoint.
input_json_path: path to a JSON file that fully describes the input.
dump_dir: path to a directory where the inference results will be saved.
dtype: data type used in inference. Valid options include "bf16" and "fp32".
use_deepspeed_evo_attention: whether to use the EvoformerAttention provided by DeepSpeed.
use_msa: whether to use the MSA feature; the default is true. If you want to disable the MSA feature, add --use_msa false to the inference_demo.sh script.
Detailed information on the format of the input JSON file and the output files can be found here.
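As a rough illustration of how these arguments fit together (a sketch only; the actual entrypoint and input path inside inference_demo.sh may differ, so treat the script in the repo as authoritative):

```bash
# Hypothetical invocation mirroring the arguments described above;
# runner/inference.py and the input JSON path are placeholders.
python3 runner/inference.py \
    --load_checkpoint_path /af3-dev/release_model/model_v1.pt \
    --input_json_path ./my_input.json \
    --dump_dir ./output \
    --dtype bf16 \
    --use_deepspeed_evo_attention true \
    --use_msa true
```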
After the installation and data preparations, you can run the following command to train the model from scratch:
bash train_demo.sh
Key arguments in this script are explained as follows:
dtype: data type used in training. Valid options include "bf16" and "fp32".
--dtype fp32: the model will be trained in full FP32 precision.
--dtype bf16: the model will be trained in BF16 mixed precision; by default, the SampleDiffusion, ConfidenceHead, mini-rollout and loss parts will still be trained in FP32 precision. If you want to train and run inference in full BF16 mixed precision, pass the following arguments to train_demo.sh (see the combined sketch after this list):
--skip_amp.sample_diffusion_training false \
--skip_amp.confidence_head false \
--skip_amp.sample_diffusion false \
--skip_amp.loss false \
use_deepspeed_evo_attention: whether to use the EvoformerAttention provided by DeepSpeed, as mentioned above.
ema_decay: the decay rate of the EMA, default is 0.999.
sample_diffusion.N_step: during evaluation, the number of steps for the diffusion process is reduced to 20 to improve efficiency.
data.train_sets/data.test_sets: the datasets used for training and evaluation. If there are multiple datasets, separate them with commas.
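For example, a full BF16 mixed-precision training run combining the flags above might look like this (a sketch; check train_demo.sh for how it forwards extra arguments):

```bash
# Train in full BF16 mixed precision by disabling the FP32 fallbacks listed above.
bash train_demo.sh \
    --dtype bf16 \
    --skip_amp.sample_diffusion_training false \
    --skip_amp.confidence_head false \
    --skip_amp.sample_diffusion false \
    --skip_amp.loss false
```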
Some settings follow those in the AlphaFold 3 paper. The table below shows the training settings for the different fine-tuning stages:
| Arguments | Initial training | Fine tuning 1 | Fine tuning 2 | Fine tuning 3 |
|---|---|---|---|---|
| train_crop_size | 384 | 640 | 768 | 768 |
| diffusion_batch_size | 48 | 32 | 32 | 32 |
| loss.weight.alpha_pae | 0 | 0 | 0 | 1.0 |
| loss.weight.alpha_bond | 0 | 1.0 | 1.0 | 0 |
| loss.weight.smooth_lddt | 1.0 | 0 | 0 | 0 |
| loss.weight.alpha_confidence | 1e-4 | 1e-4 | 1e-4 | 1e-4 |
| loss.weight.alpha_diffusion | 4.0 | 4.0 | 4.0 | 0 |
| loss.weight.alpha_distogram | 0.03 | 0.03 | 0.03 | 0 |
| train_confidence_only | False | False | False | True |
| full BF16-mixed speed (A100, s/step) | ~12 | ~30 | ~44 | ~13 |
| full BF16-mixed peak memory (G) | ~34 | ~35 | ~48 | ~24 |
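For instance, switching from the initial-training settings to the Fine tuning 1 column could look roughly like this (a sketch that assumes these keys can be overridden on the command line in the same way as the flags above):

```bash
# Fine tuning 1 settings taken from the table above (sketch only).
bash train_demo.sh \
    --train_crop_size 640 \
    --diffusion_batch_size 32 \
    --loss.weight.alpha_bond 1.0 \
    --loss.weight.smooth_lddt 0.0
```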
We recommend carrying out the training on A100-80G or H20/H100 GPUs. If using full BF16 mixed-precision training, the initial training stage can also be performed on A800-40G GPUs. On GPUs with smaller memory, such as the A30, you’ll need to reduce the model size, for example by decreasing model.pairformer.nblocks and diffusion_batch_size.
In this version, we do not use the template or RNA MSA features for training, as reflected in the default settings in configs/configs_base.py and configs/configs_data.py:
--model.template_embedder.n_blocks 0 \
--data.msa.enable_rna_msa false \
This will be considered in our future work.
The model also supports distributed training with PyTorch’s torchrun. For example, if you’re running distributed training on a single node with 4 GPUs, you can use:
torchrun --nproc_per_node=4 runner/train.py
You can also pass other arguments with --<ARGS_KEY> <ARGS_VALUE> as you want.
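For example, combining it with the arguments described above:

```bash
# Single-node run on 4 GPUs, with extra arguments forwarded to the trainer.
torchrun --nproc_per_node=4 runner/train.py \
    --dtype bf16 \
    --use_deepspeed_evo_attention true
```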
If you want to fine-tune the model on a specific subset, such as an antibody dataset, you only need to provide a PDB list file and load the pretrained weights as finetune_demo.sh shows:
checkpoint_path="/af3-dev/release_model/model_v1.pt"
...
--load_checkpoint_path ${checkpoint_path} \
--load_checkpoint_ema_path ${checkpoint_path} \
--data.weightedPDB_before2109_wopb_nometalc_0925.base_info.pdb_list examples/subset.txt \
where subset.txt is a file containing PDB IDs like:
6hvq
5mqc
5zin
3ew0
5akv
The implementation of the layernorm operators references OneFlow and FastFold. We used OpenFold for some module implementations, except the LayerNorm.
Please check Contributing for more details.
Please check Code of Conduct for more details.
If you discover a potential security issue in this project, or think you may have discovered one, please notify Bytedance Security via our security center or vulnerability-reporting email.
Please do not create a public GitHub issue.
This project, including code and model parameters, is made available under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/
For commercial use, please reach out to us at ai4s-bio@bytedance.com for the commercial license. We welcome all types of collaborations.