Methods
Citable description of the ESM2-AMP discovery pipeline. Pipeline v1.1.0
Important Disclaimers
- All candidates are computationally predicted. No experimental validation has been performed.
- ESM-2 scores and predicted properties are model outputs subject to false positive/negative rates inherent to ML models.
- Structure predictions (AlphaFold2, ESMFold) do not reflect true experimental structures.
- Novelty is assessed against database snapshots from early 2026; novel sequences may already exist in unpublished genomes.
- This dashboard is intended for research hypothesis generation only. Clinical or industrial use requires independent experimental validation.
Pipeline Overview
1. smORF Extraction
Short open reading frames (smORFs) are extracted from metagenomic assemblies. Sequences are filtered by length (10–50 aa) and valid amino acid composition.
2. ESM-2 Embedding + Classification
Each candidate sequence is embedded using ESM-2 (650M-parameter model, facebook/esm2_t33_650M_UR50D). A fine-tuned binary classifier predicts AMP probability.
3. Biophysical Filter
Candidates must pass charge, hydrophobic moment, helix propensity, and hydrophobic fraction thresholds. Threshold values are derived from known AMPs in APD3/DRAMP.
4. Novelty Screen
CD-HIT pairwise identity against APD3, DRAMP, AMPSphere, NCBI NR, and UniProt. Three tiers: Known (≥90%), Near-Novel (50–90%), Database-Novel (<50%).
5. Hemolysis + Toxicity Prediction
HemoPi-3 SVM model predicts hemolysis probability. ToxinPred2 predicts toxicity. High-risk candidates (hemolysis >0.85) are flagged but not removed.
6. 3-Axis Evaluation
Each candidate is scored on: Axis 1 (Plausibility: ESM score, charge, hydrophobic moment), Axis 2 (Safety: hemolysis/toxicity risk), Axis 3 (Feasibility: length, synthesis complexity).
7. Structure Prediction
Top candidates are predicted with ColabFold (AlphaFold2 MMseqs2 pipeline) and/or ESMFold. Confidence is assessed via pLDDT score.
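As a rough illustration of the confidence check in step 7, the sketch below averages pLDDT over the C-alpha atoms of a predicted structure. It assumes the prediction was saved as a PDB file with per-residue pLDDT stored in the B-factor column, which is the usual convention for ColabFold and ESMFold output; the scale (0–100 or 0–1) can depend on the tool and version. The function name `mean_plddt` is hypothetical and is not part of the pipeline.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average the B-factor column over C-alpha ATOM records.

    Assumption: the B-factor field (PDB columns 61-66) holds pLDDT,
    as written by typical ColabFold / ESMFold PDB exports.
    """
    values = []
    for line in pdb_text.splitlines():
        # Atom name occupies PDB columns 13-16; keep one value per residue (CA).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no C-alpha ATOM records found")
    return sum(values) / len(values)
```

Reading one value per residue from the C-alpha atom avoids weighting residues by their atom count. Any cutoff applied to the result is left to the reader, since the source does not state one.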
Screening Thresholds
Source: config/pipeline_truth.yaml v1.1.0
| Parameter | Value |
|---|---|
| amp_score_threshold | 0.41 |
| shortlist_score_threshold | 0.5 |
| wetlab_score_threshold | 0.8 |
| score_column | ranking_score_raw |
| length_min_aa | 10 |
| length_max_aa | 50 |
| charge_min | 2 |
| charge_max | 12 |
| hydrophobic_moment_min | 1 |
| helix_propensity_min | 1 |
| hemolysis_threshold | 0.5 |
| toxicity_threshold | 0.5 |
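To make the table concrete, here is a minimal sketch of how the length, composition, charge, and score thresholds might be applied. The threshold values are copied from the table; everything else is an assumption made for illustration: the function names, the simplified net-charge rule (+1 for K/R, -1 for D/E, ignoring histidine and termini), the inclusive comparisons, and the tier labels. The hydrophobic moment and helix propensity checks are omitted because the source does not say which scales they use.

```python
# Threshold values copied from the Screening Thresholds table.
THRESHOLDS = {
    "amp_score_threshold": 0.41,
    "shortlist_score_threshold": 0.5,
    "wetlab_score_threshold": 0.8,
    "length_min_aa": 10,
    "length_max_aa": 50,
    "charge_min": 2,
    "charge_max": 12,
}

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def net_charge(seq: str) -> int:
    """Simplified net charge (assumption): +1 per K/R, -1 per D/E."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def passes_basic_filters(seq: str, t: dict = THRESHOLDS) -> bool:
    """Check length, standard amino acid composition, and net charge."""
    seq = seq.upper()
    if not t["length_min_aa"] <= len(seq) <= t["length_max_aa"]:
        return False
    if not set(seq) <= STANDARD_AA:
        return False
    return t["charge_min"] <= net_charge(seq) <= t["charge_max"]


def score_tier(score: float, t: dict = THRESHOLDS) -> str:
    """Map a ranking score to a hypothetical tier label (inclusive cutoffs assumed)."""
    if score >= t["wetlab_score_threshold"]:
        return "wetlab"
    if score >= t["shortlist_score_threshold"]:
        return "shortlist"
    if score >= t["amp_score_threshold"]:
        return "candidate"
    return "rejected"
```

For example, `passes_basic_filters("KKLLKKLLKKLL")` accepts a 12-residue peptide with a simplified net charge of +6, while a 4-residue peptide is rejected on length alone.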
Novelty Reference Databases
Identity thresholds: ≥90% → Known, 50–90% → Near-Novel, <50% → Database-Novel
Tool: cd-hit 4.8.1
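The tier assignment can be sketched as a small function. This assumes the highest percent identity of a candidate against all reference databases has already been parsed from the CD-HIT output, and that a value of exactly 50% counts as Near-Novel; the function name `novelty_tier` is hypothetical.

```python
def novelty_tier(max_identity_pct: float) -> str:
    """Map the best database identity (in percent) to a novelty tier."""
    if not 0.0 <= max_identity_pct <= 100.0:
        raise ValueError("identity must be a percentage between 0 and 100")
    if max_identity_pct >= 90.0:
        return "Known"
    if max_identity_pct >= 50.0:  # assumption: the 50% boundary is Near-Novel
        return "Near-Novel"
    return "Database-Novel"
```

A candidate with no database hit at all would be passed in as 0 and land in the Database-Novel tier.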
Metagenomic Datasets
Hot Springs (Diamante Crater, Costa Rica)
Source: MGnify · MGYA00594154
Permafrost (Boreal Western Canada)
Source: MGnify · MGYA00563809
Deep Sea Vents (Axial Seamount)
Source: MGnify · MGYA00652855
Abyssal Plain (Malaspina Expedition)
Source: MGnify · MGYA00722119
Model Configuration
How to Cite
If you use AMPHunter or these candidates in your research:
AMPHunter Dashboard
Nearik42/AMPHunter (2026). ESM-2 based antimicrobial peptide discovery pipeline for extreme environment metagenomes. GitHub. https://github.com/Nearik42/AMPHunter
ESM-2 Model
Lin, Z. et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123–1130. doi:10.1126/science.ade2574