v1 Beta · Clinical Summarization Open

The benchmark
for veterinary clinical AI.

5,000 anonymized canine and feline records. Weighted rubric scoring by an independent LLM judge. Results published publicly.

5,000
Clinical records
4
Benchmark tracks
0
Records exposed
5
Weighted criteria
vault-cli — benchmark run
LIVE
Zero raw data exposure
Approval-based access
Transparent methodology
Full audit trail
Reproducible results
LLM-as-a-judge scoring
Encrypted at rest + transit
Weighted rubric evaluation
Canine & feline dataset

From submission to leaderboard in four steps.

You never see the benchmark cases. We never touch your model weights. Trust flows in both directions.

Step 01

Request & Approval

Apply for access with your organization details and use case. Our team reviews all requests and approves participation before any benchmark access is granted.

Step 02

Register Your Model

Submit a secure API endpoint or containerized inference adapter. Your credentials are encrypted at rest. We call your model — you never touch our data.
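For a sense of what a registered inference adapter can look like, here is a minimal sketch. The endpoint path, payload fields, and choice of FastAPI are illustrative assumptions; the actual contract is provided after approval.

```python
# Hypothetical inference adapter. Endpoint path and field names are
# illustrative assumptions, not VAULT's published contract.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CaseInput(BaseModel):
    case_id: str          # opaque job-scoped identifier
    structured_case: str  # structured case input sent by VAULT

class SummaryOutput(BaseModel):
    case_id: str
    summary: str

def generate_summary(text: str) -> str:
    # Placeholder: call your own model here. Weights stay on your
    # infrastructure; VAULT only calls this endpoint.
    return "summary of: " + text[:80]

@app.post("/summarize", response_model=SummaryOutput)
def summarize(case: CaseInput) -> SummaryOutput:
    return SummaryOutput(case_id=case.case_id,
                         summary=generate_summary(case.structured_case))
```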

Step 03

Secure Evaluation

VAULT runs all 5,000 canine and feline cases in an isolated sandbox. Outputs are scored by an independent LLM judge against our 5-criterion weighted rubric. Raw outputs are never stored.
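As a rough illustration of that flow (not VAULT's actual code; every name here is a hypothetical stand-in), the evaluation loop has this shape:

```python
# Sketch of the evaluation loop described above. call_model and judge are
# hypothetical stand-ins for the registered endpoint and the LLM judge.
def run_benchmark(call_model, judge, cases):
    per_case_scores = []
    for case in cases:                      # all 5,000 sandboxed cases
        output = call_model(case["input"])  # model sees only structured input
        scores = judge(case, output)        # 0-3 integer score per criterion
        per_case_scores.append(scores)
        del output                          # raw outputs are never persisted
    return per_case_scores                  # later aggregated into the composite
```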

Step 04

Results Published Publicly

Receive a full benchmark report. Your results are published to the public leaderboard — creating a trusted, transparent record of veterinary AI performance.

5,000 anonymized clinical records. Yours to evaluate against. Never to download.

The VAULT benchmark dataset consists of 5,000 anonymized canine and feline clinical records, rigorously de-identified and curated by our internal team. It is the shared standard against which all tools are measured — but it lives exclusively within our secure environment.

5,000
Clinical records
2
Species — canine & feline
5
Evaluation criteria
12+
De-identification passes


Data Protection Guarantees
Zero export — ever
No API endpoint, report, or benchmark output ever exposes a raw clinical record. Records never leave our environment in readable form.
Network-isolated execution
Inference jobs run in sandboxed environments with no external network access. Your model receives only the structured case input.
Full audit log on every job
Every benchmark run is logged with job ID, timestamp, model version, input hash, output hash, and grader version. Immutable; a sketch of one such record appears after this list.
Anti-reconstruction design
Outputs are structured to prevent case reconstruction. Report metrics are aggregated and category-sliced. No per-case outputs returned.
Participant agreement required
All participants must sign a data access and acceptable use agreement before receiving benchmark access. Violations result in revocation.
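To make the audit-record fields concrete, here is a minimal sketch of one log entry. The schema and hashing choices are assumptions based on the fields listed above, not VAULT's internal format.

```python
# Hypothetical audit record built from the fields listed above; hashing
# and serialization choices are illustrative, not VAULT's actual schema.
import hashlib, json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass(frozen=True)  # frozen: fields cannot be mutated after creation
class AuditRecord:
    job_id: str
    timestamp: str
    model_version: str
    input_hash: str    # hash of the case input, never the input itself
    output_hash: str   # hash of the model output, never the output itself
    grader_version: str

record = AuditRecord(
    job_id="job-0001",  # hypothetical identifier
    timestamp=datetime.now(timezone.utc).isoformat(),
    model_version="helix-v2",
    input_hash=sha256("<structured case input>"),
    output_hash=sha256("<model output>"),
    grader_version="rubric-v1.3",
)
print(json.dumps(asdict(record), indent=2))
```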

One platform. Every clinical AI modality.

Every track uses the same LLM-as-a-judge framework — weighted rubric, versioned methodology.

Live

Clinical Summarization

Evaluate AI summarizers on five clinically grounded criteria — Factual Accuracy, Clinical Relevance, Completeness, Chronological Order, and Organization — across 5,000 annotated canine and feline cases.

Start Evaluating
Coming Soon
Target: Q3 2025

AI Scribing & Documentation

Benchmark AI scribing tools on their ability to generate accurate, structured clinical documentation from veterinary consultation transcripts.

Coming Soon
Target: Q4 2025

Radiomics

Evaluate AI models on structured veterinary imaging data — quantitative feature extraction, lesion characterization, and prognostic modeling.

Coming Soon
Target: Q1 2026

Computer Vision

Benchmark veterinary imaging AI on pathology detection, classification, and diagnostic support across radiographs, ultrasound, and histopathology.

LLM-as-a-judge. Weighted rubric scoring.

VAULT uses a validated LLM-as-a-judge framework adapted from peer-reviewed research in veterinary AI evaluation. An independent LLM judge assesses each model output against five clinically grounded criteria using a structured rubric and an extended reasoning budget.

Each criterion is scored on a 0–3 integer rubric; scores are multiplied by clinical importance weights and normalized to a 0–1 composite. Temperature is fixed at 0.1 for reproducibility. All outputs are validated as structured JSON before scoring.
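A worked example, using the criterion weights from the sample breakdown below. The exact normalization (dividing the weighted sum by its maximum, 3 × the weight total) is our reading of the description above, and the per-criterion scores are made up:

```python
# Worked example of the composite formula as described: 0-3 integer scores
# per criterion, weighted, then normalized to 0-1. Scores are invented.
weights = {"factual_accuracy": 2.5, "clinical_relevance": 1.5,
           "completeness": 1.2, "chronological_order": 1.0, "organization": 0.8}
scores = {"factual_accuracy": 3, "clinical_relevance": 2,
          "completeness": 3, "chronological_order": 2, "organization": 3}

weighted = sum(weights[c] * scores[c] for c in weights)  # 18.5 here
composite = weighted / (3 * sum(weights.values()))       # max is 3 * 7.0 = 21
print(round(composite, 3))  # 0.881 for these example scores
```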

LLM judge · Temperature 0.1
Rubric v1.3 — versioned
JSON schema validation
16,384-token reasoning budget
Full Methodology →
Sample Score Breakdown
helix-v2 · Suite v1.3

Factual Accuracy      ×2.5    0.881
Clinical Relevance    ×1.5    0.869
Completeness          ×1.2    0.823
Chronological Order   ×1.0    0.791
Organization          ×0.8    0.912

Weighted Composite            0.847
Median Latency                1.24s
Rank                          #3

Who's leading in veterinary AI?

Canine & feline cases. Scored by an independent LLM judge using a 5-criterion weighted rubric. All results verified and traceable.

Suite: Canine & Feline Summarization v1.3
Example Data
Illustrative example — not real results
Clinical Summarization — Leaderboard
#   Model / Tool          Organization · Suite              Composite   Factual Acc.   Clinical Rel.   Latency   Verified     Date
1   VetScribe Pro         Meridian AI · Suite v1.3          0.914       0.941          0.923           0.98s     ✓ verified   Apr 8, 2025
2   CliniParse v4.1       Apex Vet Systems · Suite v1.3     0.887       0.904          0.871           1.41s     ✓ verified   Apr 5, 2025
3   Helix Summarizer v2   Canfield Labs · Suite v1.3        0.847       0.881          0.869           1.24s     ✓ verified   Apr 3, 2025
4   VetNotes AI           NovaClinical · Suite v1.2         0.831       0.856          0.814           2.11s     ✓ verified   Mar 28, 2025
5   OpenVet-7B            Lakeshore Research · Suite v1.3   0.742       0.751          0.723           3.78s     ✓ verified   Mar 20, 2025

Showing top 5 of 23 published results · View Full Leaderboard

Note: The leaderboard above shows example data for illustration purposes. All scores are produced using rubric v1.3 on 5,000 canine and feline cases. Results are published publicly after admin review.

Intentional access. Not bureaucracy.

VAULT requires approval before participants can run benchmarks against our 5,000 canine and feline records. This protects dataset integrity and ensures every published result reflects a legitimate, governed evaluation.

01

Submit an access request

Tell us about your organization, your AI tool, your use case, and how you intend to use the benchmark results. Takes ~5 minutes.

02

Team review (1–3 business days)

03

Sign participant agreement

04

Receive access & start benchmarking

Request Benchmark Access

By submitting, you agree to our Terms of Use and Data Access Policy. We typically respond within 1–3 business days.

Built for clinical data from day one.

Every architectural decision was made with data safety, auditability, and trust as non-negotiable constraints.

Zero Raw Data Exposure

The benchmark dataset never leaves our environment. Participants submit models or endpoints — we bring the data to the model, never the other way around.

Network isolation
Sandboxed execution

Approval-Based Access

All participants are manually reviewed and approved before receiving API access. Credentials are encrypted at rest and scoped to benchmark operations only.

Role-based permissions
Instant revocation

Immutable Audit Trail

Every benchmark run, admin action, credential use, and report publication is logged with a tamper-evident audit record. Incidents can be fully reconstructed.

Immutable logs
Run traceability

LLM-as-a-Judge Evaluation

Scoring is performed by an independent LLM judge using a structured rubric across five weighted criteria. The methodology is peer-reviewed, versioned, and publicly documented.

Independent LLM judge
Versioned rubric

Participant Agreements

All participants must execute a binding data access agreement before benchmarking. This covers acceptable use, confidentiality, publication rights, and violations.

Legal agreements
Use case review

Public Publication

Benchmark results are published publicly to the leaderboard. Internal admins review every report before publication to ensure quality and accuracy.

Publicly published
Admin-reviewed
Governance & Methodology Documentation
Full benchmark governance charter, grading rubric, anonymization protocol, and publication policy.
Read Documentation →
Encrypted in transit + at rest
Canine & feline records anonymized per HIPAA-equivalent standards
IRB-reviewed dataset curation
Versioned evaluation rubric
Participant agreements required

Common questions

Have a question not answered here?

Contact our team →

Ready to benchmark your veterinary AI?

Apply for access, register your model, and get your first benchmark report within days.

Typical review: 1–3 business days
Beta access currently free
Approval required
Questions?

Our team is available to discuss partnership opportunities, custom evaluation requirements, and enterprise benchmark configurations.

Contact Our Team