Published on November 25, 2024 · Written by Joshua Kelly

LLM FHIR Eval Preview

An open-source evaluation framework designed to benchmark the performance of Large Language Models (LLMs) on FHIR-specific tasks, making research on FHIR and LLM interactions reproducible.

We're excited to announce the early preview release of @flexpa/llm-fhir-eval, an open-source evaluation framework designed to benchmark the performance of Large Language Models (LLMs) on FHIR-specific tasks. This framework aims to establish open benchmarks that make research on FHIR and LLM interactions reproducible.

Recent work, such as FHIR-GPT (Yikuan Li et al.) and HealthSageAI's Note-to-FHIR Llama 2 fine-tune, demonstrates the growing need for reproducible evaluation benchmarks in the FHIR and LLM space. @flexpa/llm-fhir-eval addresses this need by providing a standardized way to measure model performance and behaviors.

Overview

We've started by defining a set of tasks for the benchmark to evaluate, included in this preview release:

  1. FHIR Resource Generation & Validation: Evaluate the ability of LLMs to generate and validate complex FHIR resources.
  2. Summarization: Assess the proficiency of LLMs in summarizing notes into FHIR resources.
  3. FHIRPath Evaluation: Test model capabilities in evaluating complex FHIRPath expressions. This is an exciting area of research for us because the capability was unexpected (see the sketch after this list).
  4. Structured & Unstructured Data Extraction: Extract specific information from both structured FHIR resources and unstructured clinical notes. This is a well-trodden area of research.
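
To make the FHIRPath task concrete, here is a minimal sketch of what an evaluation case might look like: a FHIR resource, an expression the model is asked to evaluate, and an expected result to score against. The type names and scoring function below are illustrative assumptions, not the framework's actual API.

```typescript
// Hypothetical types and test case for illustration only -- not the
// actual @flexpa/llm-fhir-eval API.
interface FhirPathEvalCase {
  resource: Record<string, unknown>; // input FHIR resource as JSON
  expression: string;                // FHIRPath expression the model is asked to evaluate
  expected: unknown[];               // expected result collection
}

const exampleCase: FhirPathEvalCase = {
  resource: {
    resourceType: "Patient",
    name: [{ given: ["Ada"], family: "Lovelace" }],
    birthDate: "1815-12-10",
  },
  expression: "Patient.name.given.first()",
  expected: ["Ada"],
};

// Naive exact-match scoring of the model's parsed output against the
// expected collection; real scoring would also need to handle ordering,
// type coercion, and partial credit.
function scoreExactMatch(actual: unknown[], expected: unknown[]): number {
  return JSON.stringify(actual) === JSON.stringify(expected) ? 1 : 0;
}

console.log(scoreExactMatch(["Ada"], exampleCase.expected)); // 1
```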

The framework includes implementations of existing research benchmarks, such as the FHIR-GPT paper prompt, providing a foundation for comparative analysis and reproducibility.

Supported Models

The initial release supports evaluation of:

  • Anthropic Claude 3.5 Sonnet
  • OpenAI GPT-4o
  • OpenAI GPT-4o Mini

Community Involvement

Your input is crucial to the development of this framework. We welcome discussion of this preview release on FHIR Chat, and in particular:

  • Feedback on the evaluation tasks and methodologies
  • Suggestions for additional benchmarks
  • Contributions to test cases and documentation
  • Sharing of evaluation results and experiences

What's Next?

We're focusing on:

  • Refining the benchmark based on community feedback
  • Implementing prior art and releasing four evaluation tasks for the benchmark
  • Designing and obtaining appropriate test cases for the tasks
