The dataset

Sniff Atlas.

9.67 million canine variants. 188 breeds. Calibrated AI pathogenicity. One open dump.

The first open, breed-stratified catalogue of common canine coding variants with calibrated protein-language-model pathogenicity (ESM2 AUC 0.935 vs OMIA, n=115) and an evidence-graded knowledge graph. CC-BY 4.0. Validated against directly-sequenced genomes from NHGRI. CanFam4. Cite as Gehring 2026.

DOI (concept, always latest)
10.5281/zenodo.20566358
Version
v1.0.1 — 10.5281/zenodo.20572692
License
CC-BY 4.0
Reference assembly
CanFam4 (UU_Cfam_GSD_1.0)
Author
Matt Gehring (ORCID)
Published
2026-06-06
What is in the release

Four layers of canine biology, stitched together so the same dog row in one file lines up with the same dog row in every other file.

Variants.

9,667,790 variants per dog, breed-stratified ALT-allele frequencies across 188 breeds, popmax with both a permissive (N greater than or equal to 20) and a robust (N greater than or equal to 50) sample-size floor, and an aggregate carrier index for the 39,147 functional variants (HIGH or MODERATE impact plus splice). Sourced from the imputed CanVAS callset (Brundage et al. 2026) at Beagle 5.4 DR2 greater than or equal to 0.3 and MAF greater than or equal to 0.01. Files: variant_master.parquet, breed_af.parquet, variant_frequencies.parquet, carrier_index.parquet (aggregate only; no individual dog IDs).
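The two popmax floors can be illustrated with a minimal Python sketch. The record fields (`breed`, `af`, `n`) and the numbers are hypothetical and are not the schema of breed_af.parquet; the sketch only assumes that popmax is the highest ALT-allele frequency among breeds that meet the sample-size floor.

```python
def popmax(breed_rows, min_n):
    """Return (breed, af) for the highest AF among breeds with n >= min_n.

    Returns None when no breed meets the sample-size floor.
    """
    eligible = [row for row in breed_rows if row["n"] >= min_n]
    if not eligible:
        return None
    best = max(eligible, key=lambda row: row["af"])
    return best["breed"], best["af"]


# Hypothetical per-breed rows for one variant (illustrative values only).
rows = [
    {"breed": "Breed A", "af": 0.42, "n": 25},
    {"breed": "Breed B", "af": 0.31, "n": 120},
    {"breed": "Breed C", "af": 0.90, "n": 6},
]

permissive = popmax(rows, min_n=20)  # floor of 20 sampled dogs
robust = popmax(rows, min_n=50)      # floor of 50 sampled dogs
print(permissive, robust)
```

In this toy input the small Breed C sample is ignored by both floors, and the permissive and robust answers differ, which is the reason for shipping both.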

A bcftools-queryable sites VCF.

The interoperability artifact. sniff_atlas.sites.vcf.gz is bgzipped and tabix-indexed; a single position lookup from any pipeline that already speaks VCF returns global AF, popmax breed, ESM2 score, Pangolin splice prediction, phyloP 241-way conservation, SnpEff consequence and impact, and the low-DR2 flag. Drop-in canine analog to the gnomAD sites VCF. Header carries source, version, license, and DOI.

Deleteriousness scores, calibrated.

ESM2-650M log-likelihood ratios scored on every coding variant, with explicit calibration against OMIA pathogenic variants: AUC 0.935 (95% CI 0.908 to 0.959; n=115); ACMG-pathogenic subset AUC 0.942. Pangolin splice scores on splice-region variants. Zoonomia 241-mammal phyloP conservation. Files: deleteriousness/esm2.parquet, pangolin_splice.parquet, phylop241.parquet, plus esm2_calibration.csv with the full ROC + ACMG-subset breakdown.

An evidence-graded knowledge graph.

647 nodes and 66,015 edges in Biolink-compatible KGX parquet, with provenance, a ClinGen-style evidence grade (Definitive, Strong, Moderate, Limited, Predicted), and a confounding-risk flag (LOW, MED, HIGH) on every node and edge. The v1.0.1 release contains the Donner 2023 breed-by-Mendelian-variant carrier-frequency layer; the OMIA disease layer is held out pending curator confirmation and ships in v1.1. Files: knowledge_graph/nodes.parquet, edges.parquet, provenance.parquet, evidence.parquet, plus release.json with the CURIE map and Biolink version.
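As a rough illustration of how the evidence grade and confounding-risk flag might be used together, the Python sketch below filters edges to a minimum grade and an allowed set of risk levels. The grade and risk vocabularies come from the description above; the key names `evidence_grade` and `confounding_risk`, the CURIEs, and the predicate are assumptions for illustration, not the released schema.

```python
# Grades listed from strongest to weakest, as named in the release notes.
GRADE_RANK = {
    "Definitive": 0,
    "Strong": 1,
    "Moderate": 2,
    "Limited": 3,
    "Predicted": 4,
}


def filter_edges(edges, min_grade="Moderate", allowed_risk=("LOW",)):
    """Keep edges graded at least min_grade whose risk flag is allowed."""
    cutoff = GRADE_RANK[min_grade]
    return [
        edge
        for edge in edges
        if GRADE_RANK[edge["evidence_grade"]] <= cutoff
        and edge["confounding_risk"] in allowed_risk
    ]


# Hypothetical edges; identifiers and key names are made up for this sketch.
edges = [
    {"subject": "EX:breed1", "predicate": "EX:carries", "object": "EX:var1",
     "evidence_grade": "Strong", "confounding_risk": "LOW"},
    {"subject": "EX:breed2", "predicate": "EX:carries", "object": "EX:var2",
     "evidence_grade": "Predicted", "confounding_risk": "LOW"},
    {"subject": "EX:breed3", "predicate": "EX:carries", "object": "EX:var3",
     "evidence_grade": "Definitive", "confounding_risk": "HIGH"},
]

kept = filter_edges(edges)
print([edge["object"] for edge in kept])
```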

Validation

The atlas was put through three external tests before release. Two it passed. One reframed the whole product. We tell you about all three.

ESM2 pathogenicity calibration.

The ESM2 log-likelihood ratio was calibrated against OMIA pathogenic variants as ground truth. Overall AUC 0.935 (95% confidence interval 0.908 to 0.959, n=115). The ACMG-pathogenic subset reached AUC 0.942. These are first-pass calibration numbers for a protein language model on the canine genome; to our knowledge, no prior canine baseline has been published to compare against.
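For readers who want to see what an AUC with a confidence interval means operationally, here is a minimal Python sketch on toy data. It is not the calibration code behind the numbers above: the percentile bootstrap is an assumed method, the scores are made up, and the sign convention (more negative LLR treated as more damaging, so scores are negated first) is also an assumption.

```python
import random


def auc(pos, neg):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of ROC AUC).
    """
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(pos, neg, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling each class with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        p = [rng.choice(pos) for _ in pos]
        n = [rng.choice(neg) for _ in neg]
        stats.append(auc(p, n))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Made-up LLRs; negate so that a larger value means "more damaging".
pathogenic_llr = [-9.1, -7.8, -6.5, -3.2]
benign_llr = [-4.0, -2.1, -1.0, 0.3]
damage_pos = [-x for x in pathogenic_llr]
damage_neg = [-x for x in benign_llr]

point = auc(damage_pos, damage_neg)
low, high = bootstrap_ci(damage_pos, damage_neg)
print(point, low, high)
```

The released esm2_calibration.csv remains the authoritative source for the ROC and the ACMG-subset breakdown.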

Non-circular independent replication: NHGRI 722.

The Sniff allele frequencies are imputed against the Dog10K reference panel, which means Dog10K cannot be used to "replicate" Sniff without circularity. The NHGRI Dog Genome Project's 722-dog directly-sequenced cohort (Plassais et al. 2019) predates Dog10K and was used as a genuinely independent replication target. Of the 668 candidates that lift cleanly from CanFam4 to CanFam3.1: 97.5% present, 95.7% allele-concordant, allele-frequency r = 0.760. The MAF-matched control replicates at r = 0.742, meaning the candidates replicate as well as frequency-matched background, and the common-variant background reaches r = 0.953. The 12 absent breed-common candidates are liftover-precision artifacts (three sit on unplaced contigs).
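The three replication statistics can be sketched in a few lines of Python. This is an illustration under assumptions, not the validation pipeline: sites are keyed by hypothetical (chromosome, position) pairs that are assumed to be already lifted to a shared assembly, both fractions are taken over all candidates (the denominators are not stated above), and r is a plain Pearson correlation over the allele-concordant sites.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def replication_summary(atlas, cohort):
    """Return (fraction present, fraction allele-concordant, AF Pearson r).

    Both inputs map (chrom, pos) -> (ref, alt, alt_allele_frequency).
    """
    present = [key for key in atlas if key in cohort]
    concordant = [k for k in present if atlas[k][:2] == cohort[k][:2]]
    r = pearson_r([atlas[k][2] for k in concordant],
                  [cohort[k][2] for k in concordant])
    return len(present) / len(atlas), len(concordant) / len(atlas), r


# Toy sites with made-up alleles and frequencies.
atlas = {
    ("1", 100): ("A", "G", 0.10),
    ("1", 200): ("C", "T", 0.30),
    ("2", 50): ("G", "A", 0.50),
    ("2", 90): ("T", "C", 0.20),
    ("3", 10): ("G", "T", 0.40),
}
cohort = {
    ("1", 100): ("A", "G", 0.12),
    ("1", 200): ("C", "T", 0.25),
    ("2", 50): ("G", "A", 0.55),
    ("2", 90): ("T", "G", 0.18),  # same position, different ALT allele
}

present_frac, concordant_frac, r = replication_summary(atlas, cohort)
print(present_frac, concordant_frac, round(r, 3))
```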

The positive control that reframed the project.

The four-way candidate-discovery funnel was run against the five known canine pathogenic variants that fall within the MAF greater than or equal to 1% scope. It recovered two of them, and both were coat-trait variants (FGF5, TYRP1). It missed the famous breed-common disease variants SOD1-degenerative-myelopathy and PRCD-progressive-retinal-atrophy because the ESM scores for both are too benign-looking (negative 3.5 and positive 0.6 respectively). The funnel is therefore an ESM-damaging-variant detector, not a disease detector. Every output of the discovery example in the dump carries this caveat. The product is reframed from "deleterious variant resource" to "common breed-segregating candidate variant resource" — a precise scope, not a downgrade.

Scope and limits

What the atlas is for, and what it is not for. Stated plainly.

Common variants only, MAF greater than or equal to 1%.

The CanVAS imputation pipeline runs at a 1% minor-allele-frequency floor. Of the 94 OMIA-catalogued canine pathogenic variants that lift cleanly to CanFam4, 89 fall below this floor and are not in the resource. Rare Mendelian disease alleles are the explicit scope gap. The atlas is a common breed-segregating variant resource, not a rare-disease catalog.

Predictions are computational, not clinical.

ESM2 pathogenicity is a calibrated computational prediction. Every record in the dump that surfaces a pathogenicity score carries a predicted_disease_relevance: "UNPROVEN" field. The four-way discovery example is explicitly flagged as candidates, with disease relevance unproven. Nothing in this dump is a clinical diagnosis or veterinary recommendation. The OMIA layer pending v1.1 is curated and citable; the rest is research-grade.

Imputation quality varies by region.

Chromosomes 27 and 32 carry lower-DR2 regions in the underlying imputation. Every variant in those regions is flagged with the low_dr2_region field. Downstream users should weight or exclude those calls accordingly.
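Both options, excluding or down-weighting, can be sketched in Python as follows. The `low_dr2_region` field name comes from the release; treating it as a boolean on each record, the `variant_id` key, and the 0.5 weight are assumptions made for this illustration.

```python
def exclude_low_dr2(records):
    """Drop every record flagged as sitting in a low-DR2 region."""
    return [rec for rec in records if not rec.get("low_dr2_region")]


def dr2_weights(records, low_weight=0.5):
    """Give flagged records a reduced weight and all others a weight of 1.0."""
    return [low_weight if rec.get("low_dr2_region") else 1.0 for rec in records]


# Hypothetical records; identifiers are placeholders.
records = [
    {"variant_id": "v1", "chrom": "5", "low_dr2_region": False},
    {"variant_id": "v2", "chrom": "27", "low_dr2_region": True},
    {"variant_id": "v3", "chrom": "32", "low_dr2_region": True},
]

kept = exclude_low_dr2(records)
weights = dr2_weights(records)
print([rec["variant_id"] for rec in kept], weights)
```

Exclusion is the conservative choice; weighting keeps the calls visible while reducing their influence on aggregate statistics.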

Individual genotypes are not redistributed.

The carrier index in the dump is aggregate only (carrier count, homozygous-alt count, top carrier breeds). Individual dog IDs are not included. The raw CanVAS genotypes are not redistributed here; the canonical source is Brundage et al. 2026 at doi:10.5281/zenodo.19186944.

How to use it

Query the sites VCF directly.

Anyone with bcftools or tabix gets variant-level frequency, pathogenicity, and conservation in one call:

tabix sniff_atlas.sites.vcf.gz 5:56189113-56189113

Returns AF, popmax breed, ESM2 LLR, Pangolin score, phyloP 241-way, SnpEff consequence, and the low-DR2 flag in the INFO column. The header carries source, version, license, and DOI.
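If the tabix output is consumed from a script, the INFO column has to be split into key/value pairs. The Python sketch below does that with the standard library only. The INFO key names (`AF`, `ESM2_LLR`, `LOW_DR2`), the position, and the values in the example line are placeholders; the actual keys are declared in the VCF header.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    if info in ("", "."):
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_vcf_line(line):
    """Split one tab-delimited VCF data line into its fixed columns."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = cols[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "info": parse_info(info),
    }


# Made-up record for illustration; not a real atlas entry.
example = "1\t12345\t.\tG\tA\t.\tPASS\tAF=0.12;ESM2_LLR=-7.4;LOW_DR2"
record = parse_vcf_line(example)
print(record["info"])
```

Values are left as strings on purpose, since the correct numeric type for each key depends on the header declarations.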

Filter the parquet from pandas or DuckDB.

Range queries, predicate filters, and joins work directly against the parquet files. Read variant_master.parquet for the single-row-per-variant summary; breed_af.parquet for the full 9.67M-by-188 frequency matrix. Sample workflows live in the Zenodo record README.
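A predicate filter in pandas might look like the sketch below. The column names (`variant_id`, `esm2_llr`, `global_af`) and both cutoffs are assumptions chosen for illustration; check the Zenodo README for the real schema. The demo runs on a small in-memory frame so it does not depend on the downloaded files.

```python
import pandas as pd


def damaging_common(variants, llr_cutoff=-7.0, min_af=0.05):
    """Rows with an LLR at or below the cutoff and an AF at or above min_af.

    Results are sorted so the most negative LLR comes first.
    """
    mask = (variants["esm2_llr"] <= llr_cutoff) & (variants["global_af"] >= min_af)
    return variants.loc[mask].sort_values("esm2_llr").reset_index(drop=True)


# With the real file (needs pyarrow or fastparquet installed), something like:
# variants = pd.read_parquet("variant_master.parquet")

# Made-up rows standing in for the summary table.
demo = pd.DataFrame(
    {
        "variant_id": ["v1", "v2", "v3", "v4"],
        "esm2_llr": [-9.0, -2.0, -7.5, -8.0],
        "global_af": [0.05, 0.20, 0.10, 0.02],
    }
)

hits = damaging_common(demo)
print(hits)
```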

Agent-callable.

The atlas is wired into a unified query layer that exposes it three ways: a public REST API at api.sniff.world, an MCP server at mcp.sniff.world that any Claude, ChatGPT, Cursor, or Continue session can call by name, and a chat-style search bar on this site. Thirteen RPCs over Streamable HTTP; sub-millisecond joined queries; every response carries the citation chain by design. Registered in the official MCP Registry as sniff-mcp; one-command install via uvx/npx.

How to cite

If you use the Sniff Atlas in published work, cite the concept DOI (it always resolves to the latest version) and the upstream sources whose data the atlas builds on.

Plain text

Gehring, M. (2026). Sniff Atlas v1.0.1: an open, breed-stratified catalogue of common canine coding variants with calibrated protein-language-model pathogenicity. Zenodo. https://doi.org/10.5281/zenodo.20566358. CC-BY 4.0.

BibTeX
@dataset{gehring_2026_sniff_atlas,
  author       = {Gehring, Matt},
  title        = {Sniff Atlas v1.0.1: an open, breed-stratified
                  catalogue of common canine coding variants with
                  calibrated protein-language-model pathogenicity},
  year         = 2026,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.20566358},
  url          = {https://doi.org/10.5281/zenodo.20566358},
  license      = {CC-BY-4.0}
}

Also cite the upstream sources whose data we build on: CanVAS (Brundage et al. 2026, doi:10.5281/zenodo.19186944) for the genotype substrate, Plassais et al. 2019 (Nature Communications 10:1489) for the NHGRI 722 validation cohort, Donner et al. 2023 for the breed-by-variant carrier-frequency layer, and OMIA (Nicholas, Tammen, and the Sydney Informatics Hub, doi:10.25910/2AMR-PV70) for the disease catalogue this work draws on. If your usage hosts or republishes any portion of the Sniff Atlas, the CC-BY-4.0 attribution requires a link back to sniff.world.

Full citation formats (BibTeX, RIS, CITATION.cff, APA), plus the upstream-source citations, are available at sniff.world/cite.

Where to go next

Contact the author: [email protected] (Matt Gehring, ORCID 0009-0001-9531-2861). Self-funded; compute credits acknowledged from NVIDIA Inception, AWS, Google Cloud, Microsoft Azure, and Lambda.

Sources: Sniff Atlas v1.0.1 (Zenodo 10.5281/zenodo.20566358) · CanVAS (Brundage 2026) · NHGRI 722 (Plassais 2019) · Donner 2023 · OMIA