📋 Methods

Citable description of the ESM2-AMP discovery pipeline. Pipeline v1.1.0

⚠️ Important Disclaimers

  • All candidates are computationally predicted. No experimental validation has been performed.
  • ESM-2 scores and predicted properties are model outputs subject to false positive/negative rates inherent to ML models.
  • Structure predictions (AlphaFold2, ESMFold) do not reflect true experimental structures.
  • Novelty is assessed against database snapshots from early 2026; novel sequences may already exist in unpublished genomes.
  • This dashboard is intended for research hypothesis generation only. Clinical or industrial use requires independent experimental validation.

Pipeline Overview

1. smORF Extraction

Short open reading frames (smORFs) are extracted from metagenomic assemblies. Sequences are filtered by length (10–50 aa) and valid amino acid composition.

🔧 Custom Python (Biopython)
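The length and composition filter from this step can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the function name passes_smorf_filter and the choice of the 20 standard amino acids as the "valid" alphabet are assumptions.

```python
# Hypothetical sketch of the smORF length/composition filter (not the pipeline's code).
# Assumption: "valid amino acid composition" means only the 20 standard residues.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def passes_smorf_filter(seq: str, min_len: int = 10, max_len: int = 50) -> bool:
    """Return True if seq is min_len..max_len residues long and uses only standard residues."""
    seq = seq.strip().upper().rstrip("*")  # drop a trailing stop-codon marker if present
    return min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA


# Toy input: one 23-residue peptide, one that is too short, one with ambiguous residues.
candidates = ["GIGKFLHSAKKFGKAFVGEIMNS", "MKT", "MKXXLLAAGGKKRRW"]
kept = [s for s in candidates if passes_smorf_filter(s)]
```

Only the first sequence survives: the second is shorter than 10 aa and the third contains the ambiguous residue X.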

2. ESM-2 Embedding + Classification

Each candidate sequence is embedded using ESM-2 (650M-parameter model, facebook/esm2_t33_650M_UR50D). A fine-tuned binary classifier then predicts AMP probability.

🔧 ESM-2 (Meta AI)
📄 Lin et al., 2023. Science 379(6637):1123–1130. doi:10.1126/science.ade2574

3. Biophysical Filter

Candidates must pass charge, hydrophobic moment, helix propensity, and hydrophobic fraction thresholds. Threshold values are derived from known AMPs in APD3/DRAMP.

🔧 modlAMP
📄 Müller et al., 2017. Bioinformatics 33(17):2753–2755
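Two of the descriptors named in this step, net charge and hydrophobic fraction, can be approximated without any library. The sketch below is a simplification under stated assumptions, not modlAMP's implementation: charge counts only K/R and D/E (histidine and termini are ignored), the hydrophobic residue set is one common choice, and the function names are hypothetical. The default charge bounds (2 and 12) are the charge_min and charge_max values from the Screening Thresholds table.

```python
# Hypothetical, simplified descriptors; modlAMP's exact formulas may differ.
HYDROPHOBIC = set("AILMFVWC")  # assumption: one common hydrophobic residue set


def net_charge(seq: str) -> int:
    """Approximate net charge at neutral pH: +1 per K/R, -1 per D/E."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def hydrophobic_fraction(seq: str) -> float:
    """Fraction of residues that belong to the HYDROPHOBIC set."""
    return sum(aa in HYDROPHOBIC for aa in seq) / len(seq)


def passes_charge_filter(seq: str, charge_min: int = 2, charge_max: int = 12) -> bool:
    """Check the net charge against the charge_min/charge_max bounds."""
    return charge_min <= net_charge(seq) <= charge_max


peptide = "GIGKFLHSAKKFGKAFVGEIMNS"
charge = net_charge(peptide)              # 4 Lys - 1 Glu
fraction = hydrophobic_fraction(peptide)  # 10 of 23 residues
```

Hydrophobic moment and helix propensity need a hydrophobicity scale and a helical-wheel projection, so they are left to the descriptor library here.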

4. Novelty Screen

Candidates are compared by CD-HIT pairwise identity against APD3, DRAMP, AMPSphere, NCBI NR, and UniProt. Three tiers: Known (≥90%), Near-Novel (≥50% and <90%), Database-Novel (<50%).

🔧 CD-HIT 4.8.1
📄 Fu et al., 2012. Bioinformatics 28(23):3150–3152
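The three novelty tiers map onto a simple lookup over a candidate's best identity to any reference sequence. The sketch below assumes the identity is given as a percentage and that exactly 50% counts as Near-Novel; the function name novelty_tier is hypothetical and running CD-HIT itself is not shown.

```python
def novelty_tier(best_identity_pct: float) -> str:
    """Map the highest identity (in percent) against the reference databases to a tier."""
    if best_identity_pct >= 90.0:
        return "Known"
    if best_identity_pct >= 50.0:  # assumption: the 50% boundary belongs to Near-Novel
        return "Near-Novel"
    return "Database-Novel"


# Made-up identities for illustration only.
tiers = [novelty_tier(p) for p in (97.5, 90.0, 62.0, 50.0, 31.4)]
```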

5. Hemolysis + Toxicity Prediction

HemoPi-3 SVM model predicts hemolysis probability. ToxinPred2 predicts toxicity. High-risk candidates (hemolysis >0.85) are flagged but not removed.

🔧 HemoPi-3, ToxinPred2
📄 Agrawal et al. (HemoPi), Sharma et al. (ToxinPred2)
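"Flagged but not removed" means the candidate list keeps its length and only gains an annotation. A minimal sketch of that behaviour follows; the Candidate class, its field names, and the example probabilities are hypothetical, and only the 0.85 cutoff comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    seq: str
    hemolysis_prob: float    # assumed to be the predictor's output in [0, 1]
    high_risk: bool = False


def flag_high_risk(candidates, cutoff: float = 0.85):
    """Mark candidates whose hemolysis probability exceeds cutoff; never drop any."""
    for cand in candidates:
        cand.high_risk = cand.hemolysis_prob > cutoff
    return candidates


# Made-up probabilities for illustration only.
pool = [Candidate("GIGKFLHSAKKFGKAFVGEIMNS", 0.91), Candidate("KKLLKKLLKKLL", 0.40)]
flagged = flag_high_risk(pool)
```

Downstream steps can then sort or filter on high_risk without losing the underlying record.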

6. 3-Axis Evaluation

Each candidate is scored on three axes: Axis 1 (Plausibility: ESM score, charge, hydrophobic moment), Axis 2 (Safety: hemolysis/toxicity risk), Axis 3 (Feasibility: length, synthesis complexity).

🔧 Custom scoring (config/pipeline_truth.yaml)

7. Structure Prediction

Structures of top candidates are predicted with ColabFold (AlphaFold2 MMseqs2 pipeline) and/or ESMFold. Confidence is assessed via the pLDDT score.

🔧 ColabFold 1.5, ESMFold
📄 Mirdita et al., 2022. Nature Methods 19:679–682. doi:10.1038/s41592-022-01488-1
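AlphaFold2-style predictors typically write per-residue pLDDT into the B-factor column of the output PDB file, so a per-model confidence summary can be read back with fixed-column parsing. The sketch below assumes that convention and averages over C-alpha atoms; the function names are hypothetical, and the toy records are built in code only to keep the example self-contained. Some tools report pLDDT on a 0 to 1 scale instead of 0 to 100, so the scale should be checked before comparing values.

```python
def mean_plddt(pdb_text: str) -> float:
    """Mean pLDDT over C-alpha atoms, read from the B-factor field (PDB columns 61-66)."""
    values = [
        float(line[60:66])
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]
    if not values:
        raise ValueError("no C-alpha ATOM records found")
    return sum(values) / len(values)


def toy_atom_line(serial: int, name: str, resname: str, resseq: int, bfactor: float) -> str:
    """Build a fixed-width ATOM record with dummy coordinates (illustration only)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {resname:>3} A{resseq:>4}    "
        f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{bfactor:>6.2f}"
    )


toy_pdb = "\n".join([
    toy_atom_line(1, "N", "GLY", 1, 10.0),   # not a C-alpha atom, so it is ignored
    toy_atom_line(2, "CA", "GLY", 1, 85.0),
    toy_atom_line(3, "CA", "ILE", 2, 65.0),
])
average = mean_plddt(toy_pdb)
```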

Screening Thresholds

Source: config/pipeline_truth.yaml v1.1.0

Parameter                    Value
amp_score_threshold          0.41
shortlist_score_threshold    0.5
wetlab_score_threshold       0.8
score_column                 ranking_score_raw
length_min_aa                10
length_max_aa                50
charge_min                   2
charge_max                   12
hydrophobic_moment_min       1
helix_propensity_min         1
hemolysis_threshold          0.5
toxicity_threshold           0.5
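The three score thresholds above suggest nested tiers over the ranking_score_raw column. The sketch below shows one way to apply them, assuming inclusive (>=) comparisons and that each higher tier implies the lower ones; the tier labels and the function name score_tier are hypothetical and do not come from config/pipeline_truth.yaml.

```python
# Threshold values copied from the table above; everything else is an assumption.
THRESHOLDS = {
    "amp_score_threshold": 0.41,
    "shortlist_score_threshold": 0.5,
    "wetlab_score_threshold": 0.8,
}


def score_tier(ranking_score_raw: float, thresholds: dict = THRESHOLDS) -> str:
    """Return the highest tier whose threshold the score meets (inclusive comparison assumed)."""
    if ranking_score_raw >= thresholds["wetlab_score_threshold"]:
        return "wetlab"
    if ranking_score_raw >= thresholds["shortlist_score_threshold"]:
        return "shortlist"
    if ranking_score_raw >= thresholds["amp_score_threshold"]:
        return "amp_candidate"
    return "rejected"


# Made-up scores for illustration only.
labels = [score_tier(s) for s in (0.93, 0.80, 0.55, 0.41, 0.12)]
```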

Novelty Reference Databases

Database     Version        Snapshot Date
APD3         3.0            2026-01-15
DRAMP        3.0            2026-01-20
AMPSphere    2024-Zenodo    2026-02-01
NCBI_NR      -              2026-02-15
UniProt      -              2026-02-01

Identity thresholds: ≥90% → Known, ≥50% and <90% → Near-Novel, <50% → Database-Novel

Tool: cd-hit 4.8.1

Metagenomic Datasets

Hot Springs (Diamante Crater, Costa Rica)

Source: MGnify · MGYA00594154

Permafrost (Boreal Western Canada)

Source: MGnify · MGYA00563809

Deep Sea Vents (Axial Seamount)

Source: MGnify · MGYA00652855

Abyssal Plain (Malaspina Expedition)

Source: MGnify · MGYA00722119

Model Configuration

esm2_model_id: facebook/esm2_t33_650M_UR50D
esm2_revision: main
classifier_checkpoint: production/v4.3_precision.pt
tokenizer_version: facebook/esm2_t33_650M_UR50D
random_seed: 42
device: cpu

How to Cite

If you use AMPHunter or these candidates in your research:

AMPHunter Dashboard

Nearik42/AMPHunter (2026). ESM-2 based antimicrobial peptide discovery pipeline for extreme environment metagenomes. GitHub. https://github.com/Nearik42/AMPHunter

ESM-2 Model

Lin, Z. et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123–1130. doi:10.1126/science.ade2574