
enveda/guild


Guild

Version 1.1.4 — Python ≥3.10, <3.11

Guild is an open-source orchestrator for protein-ligand binding tools. It covers the end-to-end pipeline and can apply multiple docking methods at each step.


Docker

The recommended way to run Guild is via Docker, which bundles all dependencies (Vina, OpenBabel, LocalColabFold, KarmaDock, DiffDock, Boltz).

Build the image

make docker-local

Run docking

All run targets accept the same set of parameters, passed as Make variables. The local repository is volume-mounted into the container at /workspace, so changes to guild/ are reflected immediately without rebuilding.

# All three methods, first 100 rows, batch size 2, with known binders, clean start
make run-guild \
  COMBINATIONS=/workspace/notebooks/data_prep/full_combinations_table.csv \
  METHODS="vina boltz diffdock" \
  HEAD=100 \
  BATCH_SIZE=2 \
  KNOWN_BINDERS=1 \
  CLEAN=1

# Boltz only (GPU)
make run-boltz \
  COMBINATIONS=/workspace/path/to/combos.csv \
  PROJECT=myproject

# Vina only (CPU, no GPU required)
make run-vina \
  COMBINATIONS=/workspace/path/to/combos.csv \
  PROJECT=myproject \
  BATCH_SIZE=5

Parameters

| Parameter | Default | Description |
|---|---|---|
| COMBINATIONS | (required) | Path to the protein–ligand pairs CSV/TSV (use /workspace/… paths) |
| PROJECT | imagerun | Output folder name under data/ (no underscores allowed) |
| METHODS | boltz | Space-separated list of methods: boltz, vina, karmadock, diffdock, gnina |
| BATCH_SIZE | 2 | Number of combinations per batch |
| HEAD | 0 | Take only the first N rows from the combinations table (0 = all) |
| DECOYS | (script default) | Path to the decoys file; omit to use the built-in default (chembl_36_decoys_2.tsv) |
| NO_DECOYS | (empty) | Set to 1 to skip decoy expansion entirely (useful for single-protein runs where you only want to score the supplied ligands) |
| CLEAN | (empty) | Set to 1 to delete the project output folder before running |
| KNOWN_BINDERS | (empty) | Set to 1 to enable known-binders expansion |
| N_WORKERS | 1 | Number of parallel Vina worker processes. Vina is also multithreaded internally, so values >1 may oversubscribe on high-core hosts but are typically fine. |
| BOX | (empty) | Global fallback Vina box file (center_{x,y,z} + size_{x,y,z}). Used for combinations whose CSV box_location cell is empty; per-row values always take precedence. See Custom binding pocket. |
| USE_GPU | 1 | Set empty (USE_GPU=) to drop --gpus all from docker run and forward --no-gpu to the Python script. Use on no-GPU hosts. gnina falls back to CPU; vina and diffdock are unaffected. Do not combine with METHODS=boltz, because Boltz is genuinely GPU-bound. |
| GNINA_INPUT_MODE | (empty) | Set to sdf to skip OpenBabel PDBQT prep entirely when gnina is the only docking method requested; gnina then reads the RDKit-generated SDF + cleaned PDB directly. Co-requesting Vina or any Vina-rescore (boltz/diffdock auto-add a Vina-rescore) falls back to PDBQT with a warning, since OpenBabel still has to run for those methods. |
| MIN_MOL_WT | 250 | Minimum molecular weight filter for known-binder expansion |
| MAX_MOL_WT | 450 | Maximum molecular weight filter for known-binder expansion |
| CHEMBL_VERSION | chembl_36 | ChEMBL version string used for known-binder lookup |
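
As a rough illustration of how HEAD and BATCH_SIZE interact, the sketch below trims a list of combinations to the first N rows (0 meaning all) and splits what remains into fixed-size batches. The function name plan_batches is hypothetical and not part of Guild; the actual batching logic in scripts/run_guild.py may differ.

```python
from typing import List, Sequence


def plan_batches(rows: Sequence[str], head: int = 0, batch_size: int = 2) -> List[List[str]]:
    """Hypothetical helper: apply HEAD, then chunk into batches of BATCH_SIZE."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    # HEAD=0 means "use every row"; any positive value keeps the first N rows.
    selected = list(rows[:head]) if head > 0 else list(rows)
    # Slice the selection into consecutive chunks; the last one may be shorter.
    return [selected[i:i + batch_size] for i in range(0, len(selected), batch_size)]


if __name__ == "__main__":
    combos = [f"combo_{i}" for i in range(7)]
    for number, batch in enumerate(plan_batches(combos, head=5, batch_size=2), start=1):
        print(f"batch_{number}: {batch}")
```

With HEAD=5 and BATCH_SIZE=2, seven input rows become three batches of sizes 2, 2 and 1.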

Targets

| Target | GPU | Description |
|---|---|---|
| run-guild | Yes | Generic target: pass any combination of METHODS |
| run-boltz | Yes | Shortcut for boltz docking |
| run-vina | No | Shortcut for vina docking (CPU only) |
| run-diffdock | No | Shortcut for diffdock docking |
| run-gnina | Yes* | Shortcut for gnina docking (*GPU used for CNN rescoring; pass USE_GPU= for CPU-only) |
| run-plip | No | Re-run only the PLIP interactions step over an existing data/<project>/ tree |

Direct script invocation

You can also call the master script directly inside a container:

python scripts/run_guild.py \
    --project myproject \
    --combinations /workspace/path/to/combos.csv \
    --methods boltz vina diffdock \
    --batch-size 5 \
    --head 100 \
    --decoys /workspace/path/to/decoys.tsv \
    --min-mol-wt 250 \
    --max-mol-wt 450 \
    --chembl-version chembl_36 \
    --use-known-binders \
    --clean

Requirements

Consuming PLIP interactions output

Every make run-* invocation runs PLIP after scoring and writes data/<project>/plip_interactions.tsv, a tab-separated file with one row per docked complex that PLIP could analyze. The file is always written (header-only if no method produced complex PDBs), so downstream code can rely on it being present. External notebooks should read this file directly instead of installing plip locally, because its sdist tries to build openbabel from source, which fails on most CI / lab hosts:

import pandas as pd

project = "myproject"  # the PROJECT value used for the run

plip = pd.read_csv(f"data/{project}/plip_interactions.tsv", sep="\t")

Schema (see guild/constants/plip.py for the canonical list):

| column | meaning |
|---|---|
| protein_config_id | matches the row in combinations.csv |
| smiles | the docked ligand |
| n_hbonds | hydrogen bonds |
| n_hydrophobic | hydrophobic contacts |
| n_pistacking, n_pication | π-stacking and π-cation |
| n_saltbridges, n_halogen, n_waterbridges, n_metal | other interaction types |
| total_interactions | sum across all categories |
| n_unique_residues | unique residue contacts |
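
For environments without pandas, the same file can be read with the standard library. The sketch below is a minimal example under two assumptions: the columns are exactly those listed in the schema above, and total_interactions is the plain sum of the eight n_* interaction columns. The sample rows are made up for illustration, and the helper names are hypothetical.

```python
import csv
import io

# Interaction-count columns named in the schema above.
COUNT_COLUMNS = [
    "n_hbonds", "n_hydrophobic", "n_pistacking", "n_pication",
    "n_saltbridges", "n_halogen", "n_waterbridges", "n_metal",
]

# Made-up rows for illustration; real data lives in data/<project>/plip_interactions.tsv.
SAMPLE_TSV = (
    "protein_config_id\tsmiles\t" + "\t".join(COUNT_COLUMNS)
    + "\ttotal_interactions\tn_unique_residues\n"
    "5zk8-A-3C0-A\tCCCC\t2\t3\t0\t0\t1\t0\t0\t0\t6\t5\n"
    "5zk8-A-3C0-A\tCCO\t1\t1\t0\t0\t0\t0\t0\t0\t2\t2\n"
)


def load_plip_rows(handle):
    """Read the TSV and convert the numeric columns to integers."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for col in COUNT_COLUMNS + ["total_interactions", "n_unique_residues"]:
            row[col] = int(row[col])
        rows.append(row)
    return rows


def total_matches_counts(row):
    """Check that total_interactions equals the sum of the per-type counts."""
    return row["total_interactions"] == sum(row[c] for c in COUNT_COLUMNS)


rows = load_plip_rows(io.StringIO(SAMPLE_TSV))
# Pick the complex with the most interactions (guard against header-only files).
best = max(rows, key=lambda r: r["total_interactions"]) if rows else None
print(best["smiles"] if best else "no complexes analyzed")
```

To read a real file, replace io.StringIO(SAMPLE_TSV) with an open file handle. A header-only file yields an empty list, which the guard on the last lines handles.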

To skip PLIP for a docking run, pass --no-plip to run_guild.py. To re-run only PLIP over an existing project (no re-docking), use:

make run-plip PROJECT=myproject COMBINATIONS=/workspace/path/to/combos.csv METHODS="vina diffdock"

That iterates the existing data/<project>/batches/*/ tree and regenerates plip_interactions.tsv from whatever complex PDBs are present.

Troubleshooting a failed combination

When a docking method fails for a given combination, the project's batch_progress.log and each batch's output.log carry a FAILED ... line that points at a dedicated subprocess transcript:

| Method | Per-combination log path |
|---|---|
| Boltz | batches/<batch>/boltz/<run_id>.subprocess.log |
| gnina | batches/<batch>/gnina/<protein>_<ligand>.subprocess.log |
| DiffDock | batches/<batch>/diffdock/_batch.subprocess.log (batch-level, since DiffDock runs once per batch) |
| Vina | (uses the Python API; failure trace lands in output.log) |

The file contains the full argv, exit code, stdout and stderr — written on every invocation, success or failure. Example FAILED line you'd grep for:

FAILED Boltz 6CTA-A-protein_1 (empty manifest after retry) — see /workspace/data/myproject/batches/batch_1/boltz/6CTA-A-protein_1.subprocess.log

Open that file to see exactly what Boltz / gnina / DiffDock printed before exiting — no grep archaeology in the batch-wide log needed.
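
If a project has many batches, collecting the FAILED lines programmatically can save time. The sketch below is modelled only on the single example line above, so the regular expression is an assumption about the log format rather than a documented contract, and find_failures is a hypothetical helper.

```python
import re

# Pattern inferred from the one example FAILED line; the real format may vary.
FAILED_RE = re.compile(
    r"FAILED\s+(?P<method>\S+)\s+(?P<run_id>\S+)\s+"
    r"\((?P<reason>[^)]*)\).*?see\s+(?P<log>\S+)"
)


def find_failures(log_text):
    """Return one dict per FAILED line found in a batch log."""
    matches = (FAILED_RE.search(line) for line in log_text.splitlines())
    return [m.groupdict() for m in matches if m]


sample = (
    "(unrelated log line)\n"
    "FAILED Boltz 6CTA-A-protein_1 (empty manifest after retry) \u2014 see "
    "/workspace/data/myproject/batches/batch_1/boltz/6CTA-A-protein_1.subprocess.log\n"
)

for failure in find_failures(sample):
    print(failure["method"], failure["reason"], failure["log"], sep=" | ")
```

In practice the text would come from batch_progress.log or a batch's output.log, for example via pathlib.Path(...).read_text().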


How to run

uv run python guild/run.py

If using a notebook to run the code, make sure you pass the home_path as well.

Installations

These installation steps enable the full feature set of Guild, even if you do not use every tool. On a CPU-only machine, delete pyproject.toml, rename pyproject_cpu.toml to pyproject.toml, and only then run uv sync.

Pre-requisites:

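# Install the build tool, then fetch the OpenBabel sources
sudo apt install -y cmake
git clone https://github.com/openbabel/openbabel.git
mkdir -p openbabel/build

# Configure and build out-of-tree (GUI disabled)
cmake -DBUILD_GUI=OFF -S openbabel -B openbabel/build
make -C openbabel/build

# Install from the build directory, refresh the linker cache, and verify
sudo make -C openbabel/build install
sudo ldconfig /usr/local/lib64/
obabel -V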

PLIP dependencies (beyond openbabel):

sudo apt-get update
sudo apt-get install -y swig
sudo apt-get install -y libopenbabel-dev

P2Rank (binding site prediction):

sudo apt update
sudo apt install openjdk-17-jre

wget https://github.com/rdk/p2rank/releases/download/2.4.2/p2rank_2.4.2.tar.gz
tar -xvzf p2rank_2.4.2.tar.gz

Usage

Single run

The Guild object is the focal point of this tool. It takes the input protein and ligand and generates the appropriate folder structure to run all tools. Furthermore, it generates replicates or versions of the input files appropriate for all the tools.

Basic Example:

from guild.run import Guild

# Initialize Guild with protein and ligand information
dock_wizard_object = Guild(
    ligand_smile="CC(=COC=O)CCC1=C(C)CCCC1(C)C",  # SMILES string of the ligand
    ligand_idx="ligand1",                          # Unique identifier for the ligand
    protein_idx="3pbl",                            # Unique identifier for the protein
    protein_file="/path/to/protein.pdb",           # Path to the protein PDB file
    project_name="myproject",                      # Name of the project
    protein_chain="A",                             # Optional: specific chain to use
    original_ligand="3C0",                          # Optional: original ligand ID in PDB
    original_ligand_chain="A",                      # Optional: chain of original ligand
)

# Run docking with all available methods
# Note: box_location is required for AutoDock Vina
dock_wizard_object.dock(
    box_location="/path/to/autodock_box.txt",      # Required for AutoDock Vina
    methods=["vina", "karmadock", "diffdock", "boltz"]  # Optional: specify methods
)

# Run individual docking methods
dock_wizard_object.run_autodock_vina()  # Requires box_location to be set
dock_wizard_object.run_karmadock()
dock_wizard_object.run_diffdock()
dock_wizard_object.run_boltz()

# Analyze docking results (PLIP interaction profiling)
dock_wizard_object.analyze()

Complete Example with Box File:

The box file is necessary to run AutoDock Vina. It defines the search space for docking. An example box file format can be found in the files folder. The box file should contain center coordinates (x, y, z) and size dimensions.

from guild.run import Guild

# Example with all parameters
dock_wizard_object = Guild(
    ligand_smile="CCCC",
    ligand_idx="test1",
    protein_idx="5c1m",
    protein_file="/home/user/Guild/5c1m.pdb",
    project_name="debugproject",
    protein_chain="A",
    original_ligand="LIG",
    original_ligand_chain="A",
)

# Run docking with box file
dock_wizard_object.dock(box_location="/home/user/Guild/autodock_box.txt")

BulkRun

Running in bulk is necessary to leverage the rank percentile score, as it is empirically derived by comparing a ligand of interest against a panel of proteins. The bulk run automatically handles multiple protein-ligand combinations, generates decoys, and computes rank percentile scores.

Input Table Format:

Your input table should be a pandas DataFrame with the following columns:

| protein_config_id | protein_id | protein_path | protein_chain | original_ligand | original_ligand_chain | ligand_id | smiles | ligand_category | is_pdb | box_location |
|---|---|---|---|---|---|---|---|---|---|---|
| 5zk8-A-3C0-A | 5zk8 | path/to/file.pdb | A | 3C0 | A | drug_1 | CCCC | LOI | 1 | (optional) |

Column Descriptions:

  • protein_config_id: Unique identifier for the protein configuration (e.g., {protein_id}-{chain}-{ligand}-{ligand_chain})
  • protein_id: PDB ID or identifier for the protein
  • protein_path: Full path to the protein PDB file
  • protein_chain: Chain identifier to use for docking. To dock into a pocket that spans multiple chains (e.g. a dimer interface), give a comma-separated list such as A,B (any number of chains). See Multi-chain binding pocket.
  • original_ligand: Ligand identifier from the PDB file
  • original_ligand_chain: Chain of the original ligand
  • ligand_id: Unique identifier for the ligand
  • smiles: SMILES string of the ligand
  • ligand_category: Category of ligand (e.g., "LOI" for ligand of interest, "known_binder", etc.) - required for plotting
  • is_pdb: Binary indicator (1 if PDB file, 0 otherwise)
  • box_location (optional): Path to a Vina box file (center_{x,y,z} + size_{x,y,z}) that defines the binding pocket for both Vina and Boltz. Supplied per row but conceptually per protein — see Custom binding pocket.
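
Before starting a long run, it can help to check that the table has the expected columns. The sketch below assumes every column listed above except box_location is required; that reading, and the helper names missing_columns and make_config_id, are illustrative assumptions rather than part of Guild's API.

```python
REQUIRED_COLUMNS = [
    "protein_config_id", "protein_id", "protein_path", "protein_chain",
    "original_ligand", "original_ligand_chain", "ligand_id", "smiles",
    "ligand_category", "is_pdb",
]  # box_location is optional, so it is not checked here


def missing_columns(columns):
    """Return the required column names that are absent, in table order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


def make_config_id(protein_id, chain, ligand, ligand_chain):
    """Build an ID following the {protein_id}-{chain}-{ligand}-{ligand_chain} pattern."""
    return f"{protein_id}-{chain}-{ligand}-{ligand_chain}"


print(missing_columns(["protein_id", "smiles"]))
print(make_config_id("5zk8", "A", "3C0", "A"))
```

With a pandas DataFrame, pass input_table.columns to missing_columns and stop early if the returned list is not empty.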

Basic Example:

import pandas as pd
from guild.bulk import BulkRun

# Create or load your input table
input_table = pd.DataFrame({
    'protein_config_id': ['5zk8-A-3C0-A'],
    'protein_id': ['5zk8'],
    'protein_path': ['/path/to/5zk8.pdb'],
    'protein_chain': ['A'],
    'original_ligand': ['3C0'],
    'original_ligand_chain': ['A'],
    'ligand_id': ['drug_1'],
    'smiles': ['CCCC'],
    'ligand_category': ['LOI'],
    'is_pdb': [1]
})

# Initialize BulkRun
bulk_analysis_object = BulkRun(
    input_table=input_table,
    project_name="mybulkproject",                # Project name (cannot contain underscores)
    methods_to_run=["vina", "karmadock", "diffdock", "boltz"],  # Optional: specify methods
    batch_size=1000,                             # Number of combinations per batch
    decoys=None,                                 # Optional: path to custom decoy file
    min_mol_wt=250,                              # Minimum molecular weight for known binders
    max_mol_wt=450,                              # Maximum molecular weight for known binders
    chembl_version="chembl_36",                  # ChEMBL version for known binders
)

# Run docking for all combinations
bulk_analysis_object.run_docking()

# Compute guild scores (normalizes scores across methods)
bulk_analysis_object.run_guild_scoring(n_processes=None)  # None = use all CPUs

# Generate plots
bulk_analysis_object.plot_guild_scoring()
bulk_analysis_object.plot_unique_proteins_scorings(top_n_hits=5)

# Run PLIP interaction profiling for a specific batch
bulk_analysis_object.run_plip(current_batch="batch_1", verbose=True)

# Plot PLIP interaction comparison
bulk_analysis_object.plot_plip_comparison()

Advanced Example with Custom Settings:

import pandas as pd
from guild.bulk import BulkRun

# Load input table from CSV
input_table = pd.read_csv("input_combinations.csv")

# Initialize with custom settings
bulk_analysis_object = BulkRun(
    input_table=input_table,
    project_name="largescalescreening",    # No underscores allowed in project names
    methods_to_run=["vina", "karmadock"],  # Only run specific methods
    batch_size=500,                                  # Smaller batches for memory management
    decoys="/path/to/custom_decoys.tsv",            # Custom decoy dataset
    min_mol_wt=200,
    max_mol_wt=500,
    chembl_version="chembl_36",
)

# Run docking (processes all batches)
bulk_analysis_object.run_docking()

# Run scoring with multiprocessing
bulk_analysis_object.run_guild_scoring(n_processes=8)  # Use 8 CPU cores

# Access results
print(bulk_analysis_object.guild_scores_df)  # DataFrame with all scores

Methods

Docking

The docking methods available via Guild are, to date:

  • AutoDock Vina: Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem. 2010 Jan 30;31(2):455-61. doi: 10.1002/jcc.21334. PMID: 19499576; PMCID: PMC3041641.
  • KarmaDock: Zhang, X., Zhang, O., Shen, C., Qu, W., Chen, S., Cao, H., Kang, Y., Wang, Z., Wang, E., Zhang, J., Deng, Y., Liu, F., Wang, T., Du, H., Wang, L., Pan, P., Chen, G., Hsieh, C. Y., & Hou, T. Efficient and accurate large library ligand docking with KarmaDock. Nature Computational Science, 2023, 3(9), 789–804. https://doi.org/10.1038/s43588-023-00511-5
  • DiffDock: Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, Tommi Jaakkola. DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking. arXiv: https://arxiv.org/abs/2210.01776
  • Boltz-2: Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, David Kwabi-Addo, Dominique Beaini, Tommi Jaakkola, Regina Barzilay. Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. bioRxiv: https://www.biorxiv.org/content/10.1101/2025.06.14.659707v1

If you use results from any of these tools, please cite the corresponding authors using the references above.

Vina rescore (automatic with DiffDock and Boltz)

When diffdock or boltz is included in the methods list, Guild automatically adds a matching Vina rescore step. The rescore applies Vina's physics-based scoring function to the predicted pose (score-only, no re-docking), giving a kcal/mol ΔG estimate that's comparable across methods.

The two rescore tracks are independent — each produces its own column, so a run that uses both DiffDock and Boltz gets two distinct rescore scores:

| Upstream method | Auto-enabled rescore | Score column |
|---|---|---|
| boltz | vina_rescore_boltz | vina_rescore_boltz_score |
| diffdock | vina_rescore_diffdock | vina_rescore_diffdock_score |

Both score columns are in kcal/mol (lower = stronger predicted binding). Each is independently ranked per protein and folded into the global_rp_score.

Note: boltz_score itself is the protein-ligand ipTM confidence (range [0, 1], higher = more confident structure) — not a binding score. For a binding-strength signal from Boltz, use vina_rescore_boltz_score. Likewise gnina's gnina_score is the Vina-style affinity (kcal/mol, lower = better) while gnina_cnn_score is a pose-confidence side channel that does not participate in guild's rank-percentile aggregation.

Coordinate-frame caveat

Boltz often recentres its predicted complex into its own internal frame, so a receptor PDBQT prepared from the template PDB will not be in the same physical space as the Boltz-output ligand. Guild handles this by extracting both the receptor and the ligand from Boltz's complex PDB on every rescore call. If you call rescore_boltz_pose directly outside the bulk pipeline, do the same — do not reuse the template-frame receptor.

Custom binding pocket

By default, Guild derives the binding pocket from the co-crystal ligand declared in original_ligand / original_ligand_chain (for Boltz, residues within 4 Å of that ligand become pocket_contacts; for Vina, a box is built from its centre). When that information isn't available — apo structures, recombinant assemblies, or pockets predicted by external tools like fpocket / P2Rank — supply a Vina box file instead:

center_x = -7.470
center_y = -15.230
center_z =   5.970
size_x = 13.830
size_y = 15.070
size_z = 15.230

There are two ways to wire it in:

  1. Per-row via the optional box_location column in the combinations CSV — best for multi-protein runs where each protein has its own pocket file.
  2. Global via the BOX= Makefile flag (or --box on the script) — fills the column on rows where it is empty. Per-row values always win.

Precedence per combination: box_location (explicit) > P2Rank prediction (when predict_binding_pocket=True) > derived from original_ligand.
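
The precedence rule can be summarised in a few lines. This is a minimal sketch of the logic as described here, not Guild's implementation: pocket_source is a hypothetical helper, the empty-cell test assumes empty strings or None (not pandas NaN), and the returned labels are illustrative.

```python
def pocket_source(row_box, global_box=None, predict_binding_pocket=False):
    """Return (source, box_path) following the precedence described above."""
    # A per-row box_location always wins over the global fallback.
    if row_box and row_box.strip():
        return "box_location", row_box.strip()
    # The global BOX / --box value only fills rows whose cell is empty.
    if global_box:
        return "box_location", global_box
    # No box at all: use a P2Rank prediction if it was requested...
    if predict_binding_pocket:
        return "p2rank", None
    # ...otherwise fall back to the co-crystal ligand declared in the table.
    return "original_ligand", None


print(pocket_source("/workspace/boxes/row.txt", "/workspace/boxes/global.txt"))
print(pocket_source("", "/workspace/boxes/global.txt"))
print(pocket_source(None, None, predict_binding_pocket=True))
```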

For Boltz, the box is converted to a residue list at runtime: every residue whose Cα is inside the axis-aligned box becomes a contact constraint in the YAML.
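
To make the box-to-residue conversion concrete, the sketch below parses the box file shown earlier and selects residues whose Cα falls inside it. It assumes size_* is the full edge length (so each axis allows ± size/2 around the centre), which is the usual Vina convention; the Cα coordinates and both function names are hypothetical, and Guild's own conversion may differ in detail.

```python
def parse_box(text):
    """Parse 'key = value' lines of a Vina box file into a dict of floats."""
    box = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            box[key.strip()] = float(value)
    return box


def residues_in_box(ca_coords, box):
    """Return residue keys whose C-alpha lies inside the axis-aligned box."""
    inside = []
    for residue, xyz in ca_coords.items():
        # A point is inside when it is within half the edge length on every axis.
        if all(
            abs(coord - box[f"center_{axis}"]) <= box[f"size_{axis}"] / 2
            for axis, coord in zip("xyz", xyz)
        ):
            inside.append(residue)
    return inside


BOX_TEXT = """
center_x = -7.470
center_y = -15.230
center_z =   5.970
size_x = 13.830
size_y = 15.070
size_z = 15.230
"""

# Hypothetical C-alpha coordinates keyed by (chain, residue number).
ca = {("A", 10): (-7.0, -15.0, 6.0), ("A", 55): (20.0, 0.0, 0.0)}
print(residues_in_box(ca, parse_box(BOX_TEXT)))
```

Only residue ("A", 10) is selected here, because ("A", 55) lies far outside the box along x.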

Multi-chain binding pocket

Some pockets sit at the interface of two (or more) protein chains — a dimer interface, a recombinant assembly, an allosteric site between subunits. To dock a small-molecule ligand into such a pocket, list every chain in the protein_chain column, comma-separated:

| protein_config_id | protein_id | protein_path | protein_chain | ... |
|---|---|---|---|---|
| 6CTA-A,B--lig1 | 6CTA | /path/6CTA.pdb | A,B | ... |

This works for an arbitrary number of chains (A, A,B, A,B,C,D, …). A single chain (A) behaves exactly as before, so existing tables are unaffected.

What changes under the hood when more than one chain is listed:

  • Receptor (Vina, gnina, KarmaDock, DiffDock): all listed chains are kept in the prepared receptor, so docking and scoring see the full interface. Residues are renumbered 1-based per chain.
  • Boltz: emits one protein block per chain (each with its own MSA) and a template/pocket constraint spanning every chain. Per-chain MSA generation and the larger complex make Boltz runs proportionally more expensive with more chains.
  • Pocket contacts: collected from every listed chain, each indexed independently from 1, so an interface pocket is fully constrained.
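
The chain handling described above can be sketched as follows. Both helpers are hypothetical illustrations, not Guild functions: the first splits a protein_chain cell, and the second shows what 1-based numbering that restarts on each chain looks like for an ordered list of residues.

```python
def parse_chains(protein_chain):
    """Split a protein_chain cell such as 'A,B' into ['A', 'B']."""
    return [c.strip() for c in protein_chain.split(",") if c.strip()]


def renumber_per_chain(residues):
    """Map (chain, original_number) to a 1-based index that restarts per chain.

    `residues` is an ordered list of (chain, original_number) pairs.
    """
    counters, mapping = {}, {}
    for chain, number in residues:
        if (chain, number) not in mapping:
            counters[chain] = counters.get(chain, 0) + 1
            mapping[(chain, number)] = counters[chain]
    return mapping


print(parse_chains("A,B"))
print(renumber_per_chain([("A", 101), ("A", 102), ("B", 57)]))
```

A single-chain value such as "A" yields a one-element list, so the same code path covers both cases.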

Recommended: supply a box_location (see Custom binding pocket) that covers the interface. The co-crystal-ligand pocket derivation (original_ligand) still uses the primary (first) chain only; a box covers the whole interface cleanly. If you encode the chain segment of protein_config_id, use the same comma form (6CTA-A,B-...) so DiffDock/complex receptor extraction keeps both chains.

Post-analysis

Post-analysis lets Guild combine the results from the multiple docking approaches.

PLIP

PLIP (Protein-Ligand Interaction Profiler) allows evaluating structural interactions between proteins and ligands, including hydrogen bonds, hydrophobic contacts, salt bridges, π-stacking, and more. To cite PLIP use:

  • PLIP Sebastian Salentin, Sven Schreiber, V. Joachim Haupt, Melissa F. Adasme, Michael Schroeder, PLIP: fully automated protein-ligand interaction profiler. Nucleic Acids Res. 2015 Jul 1;43(W1):W443-7. doi: 10.1093/nar/gkv315. PMID: 25873628.

Guild score

Guild score is derived by:

  1. Comparing a ligand of interest against a panel of random molecules selected from ChEMBL.
  2. Comparing the results with known binders, when available.
  3. Ranking the ligand of interest against the random molecules by the score of each specific docking method. This provides an empirical way to put the different scoring systems on a common scale.
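
One simple way to picture the ranking step is as the fraction of decoys that the ligand of interest outscores for a given method. The sketch below is an assumption-laden illustration only: rank_percentile is a hypothetical function, the scores are made up, and Guild's actual formula, its tie handling, and the way per-method ranks are combined into global_rp_score are not specified here.

```python
def rank_percentile(loi_score, decoy_scores, lower_is_better=True):
    """Fraction of decoys that the ligand of interest outscores (0.0 to 1.0)."""
    if not decoy_scores:
        raise ValueError("need at least one decoy score")
    if lower_is_better:
        # Vina-style energies: more negative means stronger predicted binding.
        beaten = sum(1 for s in decoy_scores if loi_score < s)
    else:
        # Confidence-style scores: higher is better.
        beaten = sum(1 for s in decoy_scores if loi_score > s)
    return beaten / len(decoy_scores)


# Made-up scores for illustration only.
vina_decoys = [-6.1, -7.4, -5.0, -8.2]  # kcal/mol, lower = stronger
print(rank_percentile(-7.9, vina_decoys))
print(rank_percentile(0.8, [0.2, 0.5, 0.9], lower_is_better=False))
```

Because the result is a unitless fraction, values from methods with different score scales and directions become directly comparable.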

Karmadock fix

A mismatch between RDKit versions produces different input files, which causes a downstream dimension failure between mol2 and sdf inputs. In KarmaDock/dataset/ligand_feature.py, find these two lines (edge_feature_new is defined in four places in get_ligand_feature()):

edge_feature_new = torch.zeros((edge_index_new.size(1), 20))
edge_feature_new[:, [4, 5, 18]] = 1

Replace their occurrences with:

feat_dim = edge_feature.size(1)
edge_feature_new = torch.zeros((edge_index_new.size(1), feat_dim),
                               dtype=edge_feature.dtype,
                               device=edge_feature.device)

Then find this line in the forward() method of the GraphTransformer Block (around line 436) in KarmaDock/architecture/GraphTransformer_Block.py:

edge_feats = self.edge_encoder(edge_s)

Insert the following block immediately before it:

# Truncate or zero-pad the edge features so their width matches the encoder input.
# (`th` is assumed to be the torch alias already used in this file.)
if edge_s.size(1) > self.edge_encoder.in_features:
    edge_s = edge_s[:, :self.edge_encoder.in_features]
elif edge_s.size(1) < self.edge_encoder.in_features:
    pad = th.zeros(edge_s.size(0),
                   self.edge_encoder.in_features - edge_s.size(1),
                   device=edge_s.device,
                   dtype=edge_s.dtype)
    edge_s = th.cat([edge_s, pad], dim=1)

