Before You Begin
AlphaFold 3 requires users to accept its terms of use. Click here to access the request form and accept the terms of use.
You will be notified once you have been granted access to AlphaFold 3.
AlphaFold 3 Input and Database
Unlike AlphaFold 2, AlphaFold 3 requires its input to be provided as a JSON file. For example:
{
  "name": "2PV7",
  "sequences": [
    {
      "protein": {
        "id": ["A", "B"],
        "sequence": "GMRESYANENQFGFKTINSDIHKIVIVGGYGKLGGLFARYLRASGYPISILDREDWAVAESILANADVVIVSVPINLTLETIERLKPYLTENMLLADLTSVKREPLAKMLEVHTGAVLGLHPMFGADIASMAKQVVVRCDGRFPERYEWLLEQIQIWGAKIYQTNATEHDHNMTYIQALRHFSTFANGLHLSKQPINLANLLALSSPIYRLELAMIGRLFAQDAELYADIIMDKSENLAVIETLKQTYDEALTFFENNDRQGFIDAFHKVRDWFGDYSEQFLKESRQLLQQANDLKQG"
      }
    }
  ],
  "modelSeeds": [1],
  "dialect": "alphafold3",
  "version": 1
}
For the full documentation on formatting input files, see https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md
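Before submitting a job, it can help to confirm that your input file is syntactically valid JSON. A minimal check, assuming python3 is available on the login node and that your input file is named input.json:
# Exits with an error message if input.json is not valid JSON
python3 -m json.tool input.json > /dev/null && echo "input.json is valid JSON"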
The database is located at the following location on both Ceres and Atlas:
/reference/data/alphafold/3.0.0
Note: The database may be updated as newer stable versions become available; update the database path in your scripts as needed.
Running AlphaFold 3 on Ceres and Atlas
AlphaFold 3 provides options to split the workflow into separate CPU and GPU tasks. This is useful because it allows you to run the “data pipeline”, which does not use GPUs, on normal compute nodes and then run the model inference on GPU nodes.
- To run only the data pipeline, use the --norun_inference option.
- To run only the model inference, use the --norun_data_pipeline option.
If you do not specify either option, AlphaFold 3 runs the full pipeline, which means the job must run on a GPU node.
Please note that by default, the maximum protein sequence length that AlphaFold 3 can analyze is limited by the amount of GPU memory available. SCINet’s A100 GPUs have 80 GB of memory each, which limits the maximum protein sequence length to 5,120 residues. Longer protein sequences can be analyzed by enabling features that allow the GPU to utilize system memory; please see the AlphaFold 3 documentation for more information.
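As a sketch, based on the settings described in the AlphaFold 3 performance documentation (verify the exact variable names and values for your installed version), those features are enabled through environment variables exported before calling run_alphafold.py:
# Allow JAX/XLA to spill GPU memory into host memory for long sequences
# (values taken from the AlphaFold 3 performance docs; confirm before use)
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export TF_FORCE_UNIFIED_MEMORY=true
export XLA_CLIENT_MEM_FRACTION=3.2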
Below are suggested scripts to run AlphaFold 3 on the clusters.
CPU-only data pipeline (Ceres and Atlas)
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -A <Account>
module load alphafold/3.0.0
export DATA_DIR=/reference/data/alphafold/3.0.0
run_alphafold.py \
--db_dir=$DATA_DIR \
--model_dir=$DATA_DIR/model_parameters \
--json_path=/full/path/to/input.json \
--output_dir=/full/path/to/output_dir \
--norun_inference
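For example, if the script above is saved as data_pipeline.sbatch (a hypothetical file name), submit it with:
sbatch data_pipeline.sbatch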
GPU-only model inference (Atlas only)
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 8
#SBATCH -A <Account>
#SBATCH -p gpu-a100
#SBATCH --gres=gpu:1
module load alphafold/3.0.0
export DATA_DIR=/reference/data/alphafold/3.0.0
run_alphafold.py \
--db_dir=$DATA_DIR \
--model_dir=$DATA_DIR/model_parameters \
--json_path=/full/path/to/input.json \
--output_dir=/full/path/to/output_dir \
--norun_data_pipeline
The GPU-only model inference job requires the JSON file generated by the --norun_inference data pipeline job above.
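If you split the workflow into two jobs, Slurm can enforce the ordering with a job dependency. A minimal sketch, assuming the two scripts above are saved as data_pipeline.sbatch and inference.sbatch (hypothetical names), and that inference.sbatch points --json_path at the JSON that the data pipeline writes into its output directory:
# Submit the CPU data pipeline and capture its job ID
jobid=$(sbatch --parsable data_pipeline.sbatch)
# Start inference only if the data pipeline completes successfully
sbatch --dependency=afterok:${jobid} inference.sbatch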
Full pipeline (Atlas only)
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 8
#SBATCH -A <Account>
#SBATCH -p gpu-a100
#SBATCH --gres=gpu:1
module load alphafold/3.0.0
export DATA_DIR=/reference/data/alphafold/3.0.0
run_alphafold.py \
--db_dir=$DATA_DIR \
--model_dir=$DATA_DIR/model_parameters \
--json_path=/full/path/to/input.json \
--output_dir=/full/path/to/output_dir
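As with the other scripts, submit the full pipeline with sbatch and monitor it with the usual Slurm tools; results (predicted structures and confidence outputs) are written under the directory given by --output_dir. Assuming the script above is saved as full_pipeline.sbatch (a hypothetical name):
sbatch full_pipeline.sbatch
squeue -u $USER   # check that the job is queued or running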