The benchmark for veterinary clinical AI.
5,000 anonymized canine and feline records. Weighted rubric scoring by an independent LLM judge. Results published publicly.
From submission to leaderboard in four steps.
You never see the benchmark cases. We never touch your model weights. Trust flows in both directions.
Request & Approval
Apply for access with your organization details and use case. Our team reviews all requests and approves participation before any benchmark access is granted.
Register Your Model
Submit a secure API endpoint or containerized inference adapter (a minimal adapter sketch follows these steps). Your credentials are encrypted at rest. We call your model; you never touch our data.
Secure Evaluation
VAULT runs all 5,000 canine and feline cases in an isolated sandbox. Outputs are scored by an independent LLM judge against our 5-criterion weighted rubric. Raw outputs are never stored.
Results Published Publicly
Receive a full benchmark report. Your results are published to the public leaderboard — creating a trusted, transparent record of veterinary AI performance.
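To make step 2 concrete, here is a minimal sketch of what a containerized inference adapter could look like, assuming a plain HTTP contract. The `/infer` path, the `case_text` and `summary` field names, and the `MyModel` stub are illustrative assumptions, not the actual VAULT API.

```python
# Hypothetical inference adapter. Assumes VAULT POSTs one case at a time
# and reads a JSON reply; the /infer path and the "case_text"/"summary"
# field names are illustrative, not the actual VAULT contract.
from flask import Flask, request, jsonify

app = Flask(__name__)

class MyModel:
    """Stand-in for your summarizer; replace with your own inference code."""
    def summarize(self, text: str) -> str:
        return text[:200]  # placeholder output

model = MyModel()

@app.route("/infer", methods=["POST"])
def infer():
    case = request.get_json(force=True)
    summary = model.summarize(case["case_text"])  # your weights stay with you
    return jsonify({"summary": summary})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # expose to VAULT's evaluation sandbox
```

Because VAULT calls this endpoint, cases are streamed to your model one at a time and your weights never leave your infrastructure.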
5,000 anonymized clinical records. Yours to evaluate against. Never to download.
The VAULT benchmark dataset consists of 5,000 anonymized canine and feline clinical records, rigorously de-identified and curated by our internal team. It is the shared standard against which all tools are measured — but it lives exclusively within our secure environment.
One platform. Every clinical AI modality.
Every track uses the same LLM-as-a-judge framework — weighted rubric, versioned methodology.
Clinical Summarization
Evaluate AI summarizers on five clinically grounded criteria (Factual Accuracy, Clinical Relevance, Completeness, Chronological Order, and Organization) across 5,000 annotated canine and feline cases.
AI Scribing & Documentation
Benchmark AI scribing tools on their ability to generate accurate, structured clinical documentation from veterinary consultation transcripts.
Radiomics
Evaluate AI models on structured veterinary imaging data — quantitative feature extraction, lesion characterization, and prognostic modeling.
Computer Vision
Benchmark veterinary imaging AI on pathology detection, classification, and diagnostic support across radiographs, ultrasound, and histopathology.
LLM-as-a-judge. Weighted rubric scoring.
VAULT uses a validated LLM-as-a-judge framework adapted from peer-reviewed research in veterinary AI evaluation. An independent LLM judge assesses each model output against five clinically grounded criteria using a structured rubric and an extended reasoning budget.
Each criterion is scored on a 0–3 integer rubric; per-criterion scores are multiplied by clinical importance weights and normalized to a 0–1 composite. Judge temperature is fixed at 0.1 for reproducibility, and every judge output is validated as structured JSON before scoring.
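As a worked illustration of that arithmetic, the sketch below scores one output. The criterion weights are invented for the example; only the five criterion names, the 0–3 integer scale, the weighting step, and the 0–1 normalization come from the methodology above.

```python
# Minimal sketch of the composite score: five 0-3 integer criterion scores
# from the LLM judge, multiplied by clinical importance weights, then
# normalized to a 0-1 composite. These weights are illustrative assumptions;
# VAULT's actual weights are defined by the versioned rubric.
CRITERIA_WEIGHTS = {
    "factual_accuracy":    0.30,
    "clinical_relevance":  0.25,
    "completeness":        0.20,
    "chronological_order": 0.15,
    "organization":        0.10,
}
MAX_SCORE = 3  # rubric ceiling per criterion

def composite(judge_scores: dict[str, int]) -> float:
    """Weighted 0-1 composite from per-criterion 0-3 integer scores."""
    for name, score in judge_scores.items():
        if score not in (0, 1, 2, 3):
            raise ValueError(f"{name}: score {score} outside the 0-3 rubric")
    weighted = sum(CRITERIA_WEIGHTS[c] * judge_scores[c] for c in CRITERIA_WEIGHTS)
    return weighted / (MAX_SCORE * sum(CRITERIA_WEIGHTS.values()))

# Example: a strong summary scoring 3s and 2s yields 0.9 under these weights.
print(round(composite({
    "factual_accuracy": 3, "clinical_relevance": 3, "completeness": 2,
    "chronological_order": 3, "organization": 2,
}), 3))  # prints 0.9
```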
Who's leading in veterinary AI?
Canine & feline cases. Scored by an independent LLM judge using a 5-criterion weighted rubric. All results verified and traceable.
| # | Model / Tool | Organization | Suite | Composite | Factual Acc. | Clinical Rel. | Latency | Verified | Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | VetScribe Pro | Meridian AI | v1.3 | 0.914 | 0.941 | 0.923 | 0.98s | ✓ | Apr 8, 2025 |
| 2 | CliniParse v4.1 | Apex Vet Systems | v1.3 | 0.887 | 0.904 | 0.871 | 1.41s | ✓ | Apr 5, 2025 |
| 3 | Helix Summarizer v2 | Canfield Labs | v1.3 | 0.847 | 0.881 | 0.869 | 1.24s | ✓ | Apr 3, 2025 |
| 4 | VetNotes AI | NovaClinical | v1.2 | 0.831 | 0.856 | 0.814 | 2.11s | ✓ | Mar 28, 2025 |
| 5 | OpenVet-7B | Lakeshore Research | v1.3 | 0.742 | 0.751 | 0.723 | 3.78s | ✓ | Mar 20, 2025 |
Note: The leaderboard above shows illustrative example data. All scores are produced using rubric v1.3 on 5,000 canine and feline cases, and results are published publicly after admin review.
Intentional access. Not bureaucracy.
VAULT requires approval before participants can run benchmarks against our 5,000 canine and feline records. This protects dataset integrity and ensures every published result reflects a legitimate, governed evaluation.
Submit an access request
Tell us about your organization, your AI tool, your use case, and how you intend to use the benchmark results. Takes ~5 minutes.
Team review (1–3 business days)
Sign participant agreement
Receive access & start benchmarking
Built for clinical data from day one.
Every architectural decision was made with data safety, auditability, and trust as non-negotiable constraints.
Zero Raw Data Exposure
The benchmark dataset never leaves our environment. Participants submit models or endpoints — we bring the data to the model, never the other way around.
Approval-Based Access
All participants are manually reviewed and approved before receiving API access. Credentials are encrypted at rest and scoped to benchmark operations only.
Immutable Audit Trail
Every benchmark run, admin action, credential use, and report publication is logged with a tamper-evident audit record, so incidents can be fully reconstructed (a generic hash-chain sketch appears at the end of this section).
LLM-as-a-Judge Evaluation
Scoring is performed by an independent LLM judge using a structured rubric across five weighted criteria. The methodology is adapted from peer-reviewed research, versioned, and publicly documented.
Participant Agreements
All participants must execute a binding data access agreement before benchmarking, covering acceptable use, confidentiality, publication rights, and the consequences of violations.
Public Publication
Benchmark results are published publicly to the leaderboard. Internal admins review every report before publication to ensure quality and accuracy.
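As a generic illustration of the tamper-evident audit trail described above: one common design hash-chains log entries so that editing any historical record breaks verification of every later one. This is our own minimal example under that assumption, not VAULT's actual implementation.

```python
# Generic hash-chain sketch of a tamper-evident audit log: each entry
# embeds the SHA-256 digest of the previous entry, so rewriting history
# invalidates every subsequent record. Illustrative only; not VAULT's
# actual audit implementation.
import hashlib, json, time

def append_entry(log: list[dict], action: str, actor: str) -> None:
    prev_hash = log[-1]["hash"] if log else "0" * 64  # genesis sentinel
    entry = {
        "ts": time.time(),
        "actor": actor,          # e.g. admin ID or service credential
        "action": action,        # e.g. "benchmark_run", "report_published"
        "prev_hash": prev_hash,
    }
    # Hash is computed over the entry body before the "hash" key is set.
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    log.append(entry)

def verify(log: list[dict]) -> bool:
    """Recompute every hash; any edited entry breaks the chain."""
    prev = "0" * 64
    for e in log:
        body = {k: v for k, v in e.items() if k != "hash"}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if e["prev_hash"] != prev or digest != e["hash"]:
            return False
        prev = e["hash"]
    return True

audit: list[dict] = []
append_entry(audit, "benchmark_run", "admin-1")
append_entry(audit, "report_published", "admin-2")
assert verify(audit)
```

Chaining each entry to its predecessor's digest is what makes the trail tamper-evident: altering one record forces an attacker to rewrite every later hash, which an externally anchored head hash defeats.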
Ready to benchmark your veterinary AI?
Apply for access, register your model, and get your first benchmark report within days.
Our team is available to discuss partnership opportunities, custom evaluation requirements, and enterprise benchmark configurations.
Contact Our Team