We're excited to announce the early preview release of @flexpa/llm-fhir-eval, an open-source evaluation framework designed to benchmark the performance of Large Language Models (LLMs) on FHIR-specific tasks. This framework aims to establish open benchmarks that make research on FHIR and LLM interactions reproducible.
Recent work, such as FHIR-GPT (Yikuan Li et al.) and HealthSageAI's Note-to-FHIR Llama 2 fine-tune, demonstrates the growing need for reproducible evaluation benchmarks in the FHIR and LLM space. @flexpa/llm-fhir-eval addresses this need by providing a standardized way to measure model performance and behavior.
We've started by defining a set of benchmark tasks to evaluate, which are included in this preview release:
The framework includes implementations of existing research benchmarks, such as the FHIR-GPT paper prompt, providing a foundation for comparative analysis and reproducibility.
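To give a sense of what a FHIR-specific evaluation involves, here is a minimal, hypothetical sketch in TypeScript. It is not the framework's actual API; it simply illustrates one common scoring approach for structured-generation benchmarks: comparing an LLM-generated FHIR resource against a gold-standard resource by field-level exact match.

```typescript
// Illustrative sketch only: the real @flexpa/llm-fhir-eval interface may differ.
// Scores an LLM-generated FHIR resource against a reference resource by
// checking resourceType and exact-matching a set of expected field paths.

type FhirResource = { resourceType: string; [key: string]: unknown };

// Resolve a dotted path like "code.coding.0.system" inside a FHIR resource.
function getPath(resource: unknown, path: string): unknown {
  return path.split(".").reduce<unknown>((node, key) => {
    if (node === null || typeof node !== "object") return undefined;
    return (node as Record<string, unknown>)[key];
  }, resource);
}

// Field-level accuracy: fraction of expected paths the generated resource reproduces exactly.
function scoreResource(
  generated: FhirResource,
  expected: FhirResource,
  paths: string[]
): number {
  if (generated.resourceType !== expected.resourceType) return 0;
  const matches = paths.filter((p) => {
    const expectedValue = getPath(expected, p);
    return (
      expectedValue !== undefined &&
      JSON.stringify(getPath(generated, p)) === JSON.stringify(expectedValue)
    );
  });
  return matches.length / paths.length;
}

// Example: compare a model's Observation output against a reference from an eval dataset.
const expected: FhirResource = {
  resourceType: "Observation",
  status: "final",
  code: { coding: [{ system: "http://loinc.org", code: "8867-4" }] },
};
const generated: FhirResource = {
  resourceType: "Observation",
  status: "final",
  code: { coding: [{ system: "http://loinc.org", code: "8867-4" }] },
};

console.log(
  scoreResource(generated, expected, [
    "status",
    "code.coding.0.system",
    "code.coding.0.code",
  ])
); // → 1 when every expected field matches
```

Metrics like this make results comparable across models and prompts, which is the core goal of the benchmark.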
The initial release supports evaluation of:
Your input is crucial to the development of this framework. We welcome discussion of this preview release on FHIR Chat, in particular:
We're focusing on: