GitPedia

Medalign

MedAlign is a clinician-generated dataset for instruction following with electronic medical records.

From som-shahlabยทUpdated June 18, 2026ยทView on GitHubยท

**MedAlign** is a benchmark dataset of *983 clinician-curated natural language instructions* grounded in *275 longitudinal EHRs*. It includes *303 reference responses* to support evaluation of large language models (LLMs) on clinical reasoning, timeline understanding, and multi-document synthesis. The project is distributed under the MIT License license, first published in 2023. Key topics include: benchmarking, ehr, electronic-health-records, foundation-models, healthcare.

MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

MedAlign is a benchmark dataset of 983 clinician-curated natural language instructions grounded in 275 longitudinal EHRs. It includes 303 reference responses to support evaluation of large language models (LLMs) on clinical reasoning, timeline understanding, and multi-document synthesis.

[!WARNING]
MedAlign is a test-only benchmark. It is not intended for supervised model training.

๐Ÿ“Š Dataset Contents

The benchmark includes:

ComponentCount
Patients275
Clinical notes46,252
Distinct note types128
Clinical events (OMOP format)3.6 million
Instructions (deduplicated)983
Reference responses303

All EHR data is longitudinal and standardized using the OMOP Common Data Model (CDM).

โš ๏ธ Usage Restrictions

[!CAUTION]
Violations of the data use agreement will result in revoked access and may trigger institutional reporting.

  • โŒ Do not train or fine-tune models on MedAlign data (evaluation only)
  • โŒ Do not transmit MedAlign data to commercial APIs (e.g., ChatGPT, Claude, Gemini) that are not HIPAA compliant.
  • โŒ Do not redistribute dataset files or any derivative datasets.
  • โœ… Derived research artifacts (e.g., annotations, synthetic data) must be hosted on Redivis with prior approval from the MedAlign team.

All usage must strictly comply with the MedAlign DUA.

๐Ÿ” Access Requirements

To gain access, please complete the following steps:

  1. Apply via Redivis Portal
    Use an academic, government, or industry research email. Applications from personal (e.g., Gmail) accounts will be rejected.

  2. Complete HIPAA-compliant CITI training
    You must include proof of training in your application.

  3. Describe your research use case
    A short paragraph outlining your intended use is sufficient.

  4. Sign the MedAlign Data Use Agreement (DUA)
    This will be sent to you after your application is reviewed.

  5. Verify encryption and secure storage
    You must attest to storing the data on encrypted, access-controlled machines. Cloud use requires HIPAA compliance.

โฑ๏ธ Applications are reviewed within 7โ€“10 business days.

๐Ÿ“š Citation

If you use MedAlign in your work, please cite:

bibtex
@inproceedings{DBLP:conf/aaai/FlemingLHJRTBGS24, author = {Scott L. Fleming and Alejandro Lozano and William J. Haberkorn and Jenelle A. Jindal and Eduardo Reis and Rahul Thapa and Louis Blankemeier and Julian Z. Genkins and Ethan Steinberg and Ashwin Nayak and Birju S. Patel and Chia{-}Chun Chiang and Alison Callahan and Zepeng Huo and Sergios Gatidis and Scott J. Adams and Oluseyi Fayanju and Shreya J. Shah and Thomas Savage and Ethan Goh and Akshay S. Chaudhari and Nima Aghaeepour and Christopher D. Sharp and Michael A. Pfeffer and Percy Liang and Jonathan H. Chen and Keith E. Morse and Emma P. Brunskill and Jason A. Fries and Nigam H. Shah}, title = {MedAlign: {A} Clinician-Generated Dataset for Instruction Following with Electronic Medical Records}, booktitle = {Thirty-Eighth {AAAI} Conference on Artificial Intelligence}, year = {2024}, url = {https://doi.org/10.1609/aaai.v38i20.30205}, doi = {10.1609/AAAI.V38I20.30205}, }

Contributors

Showing top 3 contributors by commit count.

View all contributors on GitHub โ†’

This article is auto-generated from som-shahlab/medalign via the GitHub API.Last fetched: 6/19/2026