Towards Generalist Biomedical AI

Authors: Tao Tu, Ph.D. https://orcid.org/0000-0001-9191-7938 taotu@google.com, Shekoofeh Azizi, Ph.D. https://orcid.org/0000-0002-7447-6031 shekazizi@google.com, Danny Driess, M.S. https://orcid.org/0000-0002-8258-1659, Mike Schaekermann, Ph.D. https://orcid.org/0000-0002-1735-9680, Mohamed Amin, B.S. https://orcid.org/0009-0002-4874-8272, Pi-Chuan Chang, Ph.D. https://orcid.org/0000-0003-3021-6446, Andrew Carroll, Ph.D. https://orcid.org/0000-0002-4824-6689, Charles Lau, M.B.A. https://orcid.org/0000-0002-3136-9711, Ryutaro Tanno, Ph.D. https://orcid.org/0000-0002-8107-6730, Ira Ktena, Ph.D. https://orcid.org/0000-0001-6677-6547, Anil Palepu, M.S. https://orcid.org/0000-0002-4720-8787, Basil Mustafa, M.S. https://orcid.org/0000-0001-7305-7890, Aakanksha Chowdhery, Ph.D. https://orcid.org/0000-0002-0628-5225, Yun Liu, Ph.D. https://orcid.org/0000-0003-4079-8275, Simon Kornblith, Ph.D. https://orcid.org/0000-0002-9088-2443, David Fleet, Ph.D. https://orcid.org/0000-0003-0734-7114, Philip Mansfield, Ph.D. https://orcid.org/0000-0003-4969-0543, Sushant Prakash, M.S. https://orcid.org/0009-0000-4162-4600, Renee Wong, B.Sc. https://orcid.org/0009-0003-0403-7679, Sunny Virmani, M.S. https://orcid.org/0009-0008-0647-8853, Christopher Semturs, M.S. https://orcid.org/0000-0001-6108-2773, S. Sara Mahdavi, Ph.D. https://orcid.org/0000-0001-6823-598X, Bradley Green, Ph.D. https://orcid.org/0000-0001-9589-0226, Ewa Dominowska, M.S. https://orcid.org/0009-0006-5644-9685, Blaise Aguera y Arcas, M.S. https://orcid.org/0000-0003-2256-9823, Joelle Barral, Ph.D. https://orcid.org/0009-0009-0432-5148, Dale Webster, Ph.D. https://orcid.org/0000-0002-3023-8824, Greg S. Corrado, Ph.D. https://orcid.org/0000-0001-8817-0992, Yossi Matias, Ph.D. https://orcid.org/0000-0003-3960-6002, Karan Singhal, M.S. https://orcid.org/0009-0001-0286-609X, Pete Florence, Ph.D. https://orcid.org/0000-0002-7148-5645, Alan Karthikesalingam, M.D., Ph.D. https://orcid.org/0009-0000-4958-5976, and Vivek Natarajan, M.S. https://orcid.org/0000-0001-7849-2074

Published February 22, 2024
NEJM AI 2024;1(3)
DOI: 10.1056/AIoa2300138

Abstract

Background


Medicine is inherently multimodal, requiring the simultaneous interpretation and integration of insights across many data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence systems that flexibly encode, integrate, and interpret these data might better enable impactful applications ranging from scientific research to health care delivery.

Methods


To catalyze the development of such models, we curated MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks, such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduce Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical artificial intelligence system that flexibly encodes and interprets biomedical data, including clinical language, imaging, and genomics, with the same set of model weights. To further probe the capabilities and limitations of Med-PaLM M, we conducted a radiologist evaluation of model-generated (and human) chest x-ray reports.

Results


We observed encouraging performance across model scales. Med-PaLM M reached performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. In a side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM M reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility.

Conclusions


Although considerable work is needed to validate these models in real-world settings and to understand whether cross-modality generalization is possible, our results represent an important step toward the development of generalist biomedical artificial intelligence systems. (Funded by Alphabet Inc. and/or its subsidiaries.)

Introduction


Medicine is a multimodal discipline. When providing routine care, clinicians interpret data from a wide range of modalities, including clinical notes, laboratory tests, vital signs and observations, medical images, and genomics. By contrast, despite substantial progress in biomedical artificial intelligence (AI), most models today are unimodal and specialized for a single task, and they therefore cannot integrate diverse data streams simultaneously. For example, a mammography AI system may achieve state-of-the-art (SOTA) performance in breast cancer screening while incorporating no information such as BRCA (breast cancer gene) status or magnetic resonance imaging findings. Such a system is also designed for, and restricted to, a predefined set of possible classifications; it lacks the ability to explain its predictions or to reason about them in light of newly acquired contextual information, such as an elevated baseline risk. These factors limit the performance and utility of such narrow, single-task, unimodal, specialist AI systems in real-world applications.

The advent of foundation models [5] offers an opportunity to rethink biomedical AI systems and to develop ones with broader capabilities. These foundation models are typically trained on large-scale data with self-supervised approaches, in which the supervisory signal for learning powerful representations comes from the data itself rather than from external sources (often curated through expensive manual processes) as in supervised learning. Foundation models can be efficiently adapted to many downstream tasks and settings using in-context learning or few-shot fine-tuning [6,7]. Their impressive performance holds the potential to enable the construction of generalist biomedical AI systems that can interpret complex multimodal data to improve biomedical discovery and care delivery. Moreover, they often exhibit impressive generative capabilities that can facilitate effective human-AI interaction and collaboration. These advances make it possible to build a unified biomedical AI system that effectively encodes and interprets multimodal data with complex structure, learns powerful multimodal data representations, and tackles many challenging tasks. As the pace of biomedical data generation and innovation accelerates, so will the potential impact of such models, with a breadth of downstream applications ranging from fundamental biomedical discovery to care delivery. At the same time, the growing training and computational costs of these foundation models make it necessary to develop them as generalists, using shared model parameters to achieve broad applicability across applications and thereby amortize costs.

In this work, we detail our progress toward a generalist biomedical AI system: a unified model that can interpret multiple biomedical data modalities and handle many downstream tasks with the same set of model weights. Beyond advantages at inference time, sharing the same model weights allows the model to leverage attention, the core machine-learning capability underpinning modern transformer-based foundation models [8], to encode diverse multimodal biomedical data in a shared latent representation, with language as a common grounding. This, in turn, may enable compositional generalization to novel, previously unseen combinations of tasks and modalities, a core hallmark of intelligent systems and a hypothesis we explore further in this work.

A key challenge in building generalist systems is the lack of comprehensive multimodal benchmarks. To address this unmet need, we curated MultiMedBench, an open-source multimodal biomedical benchmark spanning language, medical imaging, and genomics modalities and covering 14 diverse biomedical tasks; these tasks include question answering, visual question answering, medical image classification, radiology report generation and summarization, and genomic variant calling. We leveraged MultiMedBench to design and develop Med-PaLM Multimodal (Med-PaLM M), a new large-scale generalist biomedical AI system that builds on recent advances in language and multimodal foundation models. Med-PaLM M is a flexible multimodal sequence-to-sequence architecture that can easily incorporate and interleave various types of multimodal biomedical information. In addition, the expressiveness of the modality-agnostic language decoder makes it possible to handle a variety of biomedical tasks in a simple generative framework with a unified training strategy.

To the best of our knowledge, Med-PaLM M is the first demonstration of a generalist biomedical AI system that can interpret multimodal biomedical data and handle a diverse range of tasks with a single model (Figure 1). Med-PaLM M performed at a level competitive with, and in some cases exceeding, the state of the art (SOTA) set by models specialized for particular domains and tasks on all tasks in MultiMedBench.
Figure 1
Med-PaLM M Overview.
A generalist biomedical artificial intelligence system should be able to handle a diverse range of biomedical data modalities and tasks, ideally using a single set of model weights to enable computationally efficient usage and elegant modeling of cross-modality interactions. To enable progress toward this overarching goal, we curated MultiMedBench, a benchmark spanning 14 diverse biomedical tasks, including question answering, visual question answering, image classification, radiology report generation and summarization, and genomic variant calling. Med-PaLM Multimodal (Med-PaLM M), our proof-of-concept for such a generalist biomedical artificial intelligence system (denoted by the shaded blue area), is competitive with or exceeds prior state-of-the-art results from specialist models (denoted by dotted red lines) on all tasks in MultiMedBench.

Beyond automated metrics, we conducted a radiologist evaluation of chest x-ray reports generated by Med-PaLM M across different model scales. In a blinded side-by-side ranking on 246 retrospective chest x-rays, clinicians expressed a pairwise preference for Med-PaLM M reports over radiologist-generated reports in up to 40.50% of cases. Furthermore, the best Med-PaLM M model produced an average of 0.25 clinically significant errors per report, comparable to the baseline human performance reported in prior work and suggesting potential clinical utility.

In addition to quantitative evaluations of task performance, we observed evidence of zero-shot medical reasoning, generalization to novel medical concepts and tasks, and positive transfer across tasks. These experiments suggest promising potential for such systems in biomedical applications spanning discovery and care delivery.

Methods


MULTIMEDBENCH: A BENCHMARK FOR GENERALIST BIOMEDICAL AI


MultiMedBench is a multitask, multimodal benchmark composed of 12 deidentified open-source datasets and 14 individual tasks, spanning the following axes: (1) task type (i.e., question answering, report generation and summarization, visual question answering, medical image classification, and genomic variant calling); (2) modality (i.e., text, radiology [computed tomography (CT) imaging, magnetic resonance imaging, and x-ray], pathology, dermatology, mammography, and genomics); and (3) output format (i.e., open-ended generation for all tasks, including classification).

An overview of the datasets and tasks in MultiMedBench is provided in Table A4 in the Supplementary Appendix. In total, the benchmark contains more than 1 million samples (details are provided in Section A3 of the Supplementary Appendix).
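For orientation, the task-modality-metric structure of the benchmark can be summarized as a simple registry. The sketch below mirrors the entries of Table 1; the data structure itself is illustrative and is not part of the MultiMedBench release.

```python
# Hypothetical sketch of MultiMedBench's task registry; task names mirror
# Table 1, but this structure is illustrative, not part of the release.
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskSpec:
    task_type: str            # e.g., "question answering"
    modality: str             # e.g., "text", "radiology"
    dataset: str
    metrics: tuple[str, ...]  # evaluation metrics reported in Table 1

MULTIMEDBENCH = [
    TaskSpec("question answering", "text", "MedQA", ("accuracy",)),
    TaskSpec("question answering", "text", "MedMCQA", ("accuracy",)),
    TaskSpec("question answering", "text", "PubMedQA", ("accuracy",)),
    TaskSpec("report summarization", "radiology", "MIMIC-III",
             ("ROUGE-L", "BLEU", "F1-RadGraph")),
    TaskSpec("visual question answering", "radiology", "VQA-RAD",
             ("BLEU-1", "F1")),
    TaskSpec("visual question answering", "radiology", "Slake-VQA",
             ("BLEU-1", "F1")),
    TaskSpec("visual question answering", "pathology", "Path-VQA",
             ("BLEU-1", "F1")),
    TaskSpec("report generation", "chest x-ray", "MIMIC-CXR",
             ("Micro-F1", "Macro-F1", "F1-RadGraph", "BLEU", "ROUGE-L", "CIDEr-D")),
    TaskSpec("image classification", "chest x-ray", "MIMIC-CXR (5 conditions)",
             ("Macro-AUC", "Macro-F1")),
    TaskSpec("image classification", "dermatology", "PAD-UFES-20",
             ("Macro-AUC", "Macro-F1")),
    TaskSpec("image classification", "mammography", "VinDr-Mammo",
             ("Macro-AUC", "Macro-F1")),
    TaskSpec("image classification", "mammography", "CBIS-DDSM (mass)",
             ("Macro-AUC", "Macro-F1")),
    TaskSpec("image classification", "mammography", "CBIS-DDSM (calcification)",
             ("Macro-AUC", "Macro-F1")),
    TaskSpec("variant calling", "genomics", "PrecisionFDA Truth Challenge V2",
             ("Indel-F1", "SNP-F1")),
]
assert len(MULTIMEDBENCH) == 14  # 14 tasks across 12 datasets
```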


MED-PALM M: A PROOF OF CONCEPT FOR GENERALIST BIOMEDICAL AI


Med-PaLM M was developed by fine-tuning and aligning a non-medical generalist model, PaLM-E [12], to the biomedical domain using MultiMedBench. PaLM-E is a multimodal language model that can process sequences of multimodal inputs, including text, vision, and sensor signals.
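Schematically, models in the PaLM-E family map each image to a short sequence of embeddings in the same space as text-token embeddings and splice them into the input sequence. The sketch below illustrates this idea; all dimensions, names, and the toy embedder are assumptions made for illustration, not Med-PaLM M internals.

```python
import numpy as np

# Illustrative sketch of PaLM-E-style multimodal interleaving: image
# features are projected into the language model's token-embedding space
# and spliced into the text sequence at the "<img>" position.
rng = np.random.default_rng(0)
d_model = 512        # token-embedding width of the language model (assumed)
d_image = 128        # width of the vision encoder's output features (assumed)
n_image_tokens = 4   # embeddings contributed per image (assumed)

token_table: dict[str, np.ndarray] = {}  # toy stand-in for a learned table
def embed_token(tok: str) -> np.ndarray:
    if tok not in token_table:
        token_table[tok] = rng.normal(size=(1, d_model))
    return token_table[tok]

W_proj = rng.normal(size=(d_image, d_model))  # learned linear projection

text_tokens = ["Reason:", "cough", "<img>", "Findings:"]
image_features = rng.normal(size=(n_image_tokens, d_image))  # from a vision encoder

pieces = [image_features @ W_proj if tok == "<img>" else embed_token(tok)
          for tok in text_tokens]
sequence = np.concatenate(pieces, axis=0)  # input to the transformer decoder
print(sequence.shape)  # (3 + n_image_tokens, d_model) = (7, 512)
```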
Figure 2

Illustration of Instruction Task Prompting with a One-Shot Exemplar.
The top portion shows the task prompt for the chest x-ray report generation task. It consists of task-specific instructions, a text-only “one-shot exemplar” (omitting the corresponding image but preserving the target answer), and the actual question. The x-ray image is embedded and interleaved with textual context, including view orientation and reason for the study in addition to the question. The bottom portion shows the task prompt for the dermatology classification task. We formulated the skin lesion classification task as a multiple-choice question-answering task with all the class labels provided as individual answer options. Similar to the chest x-ray report generation task, skin lesion image tokens are interleaved with the patient clinical history as additional context to the question. The “<img>” tag in blue denotes the position in the prompt where the image tokens are embedded.
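As a minimal sketch of the prompt format described above (the exact instruction and exemplar wording below are assumptions; only the overall structure, the text-only exemplar, and the "<img>" placeholder follow Figure 2):

```python
# Minimal sketch of the one-shot instruction prompt of Figure 2.
# The instruction and exemplar text are illustrative assumptions; the
# model replaces the "<img>" placeholder with embedded image tokens.
def build_report_prompt(reason_for_study: str, view: str) -> str:
    instructions = (
        "Given the chest x-ray image, the reason for the study, and the "
        "view orientation, describe the findings."
    )
    # Text-only exemplar: the image is omitted but the target answer is kept.
    exemplar = (
        "Reason: shortness of breath. View: AP.\n"
        "Findings: No focal consolidation, pleural effusion, or pneumothorax."
    )
    question = f"<img> Reason: {reason_for_study}. View: {view}.\nFindings:"
    return "\n\n".join([instructions, exemplar, question])

print(build_report_prompt("cough and fever", "PA"))
```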

Evaluation


Details of the methods, datasets, and tasks used in each evaluation are described in Section A5 of the Supplementary Appendix.

Results


PERFORMANCE OF MED-PALM M ON DIVERSE MULTIMODAL MULTIMEDBENCH TASKS


Med-PaLM M was compared with two baselines: (1) prior task-specific SOTA specialist models for each MultiMedBench task; and (2) a generalist model baseline (PaLM-E 84B) without any biomedical domain fine-tuning. This model size variant (rather than PaLM-E 562B) was used because of computational constraints. Across MultiMedBench tasks, Med-PaLM M's best result (across three model sizes) exceeded the prior SOTA on 5 of 12 tasks (for two tasks, we were unable to find a prior SOTA comparable to our setup) while remaining competitive on the rest. The full results are summarized in Table 1.
Table 1
| Task Type | Modality | Dataset | Metric | SOTA | PaLM-E (84B) | Med-PaLM M (84B) |
| --- | --- | --- | --- | --- | --- | --- |
| Question answering | Text | Medical Question Answering (MedQA) | Accuracy | 86.50% [16] | 28.83% | 46.11% |
| | | Medical Multiple-Choice Question Answering (MedMCQA) | Accuracy | 72.30% [16] | 33.35% | 47.60% |
| | | PubMed Question Answering (PubMedQA) | Accuracy | 81.80% [16] | 64.00% | 71.40% |
| Report summarization | Radiology | Medical Information Mart for Intensive Care (MIMIC)-III | ROUGE-L | 38.70% [17] | 3.30% | 31.47% |
| | | | BLEU | 16.20% [17] | 0.34% | 15.36% |
| | | | F1-RadGraph | 40.80% [17] | 8.00% | 33.96% |
| Visual question answering | Radiology | Visual Question Answering Radiology (VQA-RAD) | BLEU-1 | 71.03% [18] | 59.19% | 69.38% |
| | | | F1 | NA3 | 38.67% | 59.90% |
| | | Semantically-Labeled Knowledge-Enhanced Visual Question Answering (Slake-VQA) | BLEU-1 | 78.60% [19] | 52.65% | 92.70% |
| | | | F1 | 78.10% [19] | 24.53% | 89.28% |
| | Pathology | Pathology Visual Question Answering (Path-VQA) | BLEU-1 | 70.30% [19] | 54.92% | 70.16% |
| | | | F1 | 58.40% [19] | 29.68% | 59.51% |
| Report generation | Chest x-ray | MIMIC Chest X-ray (MIMIC-CXR) | Micro-F1-14 | 44.20% [20] | 15.40% | 53.56% |
| | | | Macro-F1-14 | 30.70% [20] | 10.11% | 39.83% |
| | | | Micro-F1-5 | 56.70% [21] | 5.51% | 57.88% |
| | | | Macro-F1-5 | NA3 | 4.85% | 51.60% |
| | | | F1-RadGraph | 24.40% [14] | 11.66% | 26.71% |
| | | | BLEU-1 | 39.48% [20] | 19.86% | 32.31% |
| | | | BLEU-4 | 13.30% [21] | 4.60% | 11.31% |
| | | | ROUGE-L | 29.60% [22] | 16.53% | 27.29% |
| | | | CIDEr-D | 49.50% [23] | 3.50% | 26.17% |
| Image classification | Chest x-ray | MIMIC-CXR (5 conditions) | Macro-AUC | 81.27% [24] | 51.48% | 78.35% |
| | | | Macro-F1 | NA3 | 7.83% | 36.83% |
| | Dermatology | PAD-UFES-20 | Macro-AUC | NA2 | 63.37% | 97.27% |
| | | | Macro-F1 | NA2 | 1.38% | 84.32% |
| | Mammography | VinDr-Mammo | Macro-AUC | 64.50% [25] | 51.49% | 71.76% |
| | | | Macro-F1 | NA3 | 16.06% | 35.70% |
| | | Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM) (mass) | Macro-AUC | NA | 47.75% | 73.09% |
| | | | Macro-F1 | NA | 7.77% | 49.98% |
| | | CBIS-DDSM (calcification) | Macro-AUC | NA3 | 40.67% | 82.22% |
| | | | Macro-F1 | 70.71% [26] | 11.37% | 63.81% |
| Variant calling | Genomics | PrecisionFDA (Truth Challenge V2) | Indel-F1 | 99.40% [27] | 53.01% | 97.04% |
| | | | SNP-F1 | 99.70% [27] | 52.84% | 99.32% |

Performance Comparison on MultiMedBench.*
*
Med-PaLM Multimodal (Med-PaLM M) was compared with specialist state-of-the-art (SOTA) models and a generalist model (PaLM-E 84-billion-parameter [84B]) without biomedical domain fine-tuning. Across all tasks, datasets, and metric combinations in MultiMedBench, Med-PaLM M performance was near or exceeded the SOTA performance. Results for scaling experiments varying the size of Med-PaLM M are presented in Section A7 in the Supplementary Appendix. Note that these results were achieved by Med-PaLM M with the same set of model weights without any task-specific customization. Bracketed numbers after SOTA values are reference citations. MIMIC-CXR denotes MIMIC Chest X-ray.
NA, NA2, and NA3 indicate the absence of a SOTA metric due to variations in task setups, a different train/test split, and the metric not being reported, respectively.

On the medical question answering tasks, Med-PaLM M was compared with the more recent Med-PaLM 2 results [16], and we observed higher performance from Med-PaLM 2 (Table 1). However, compared with the baseline PaLM model on which Med-PaLM M is built, Med-PaLM M improved substantially over the previously reported PaLM results [11] on all three question answering datasets under the same few-shot setup.

Furthermore, compared with PaLM-E 84B, the generalist multimodal baseline without biomedical domain fine-tuning, Med-PaLM M exhibited performance improvements on all 14 tasks, often by a significant margin, demonstrating the importance of domain adaptation. Taken together, these results demonstrate the strong capabilities of Med-PaLM M and the usefulness of domain-specific fine-tuning. Detailed results for each individual task are further described in Section A6 of the Supplementary Appendix; the performance of Med-PaLM M across scales is discussed in Section A7.


MED-PALM M EXHIBITS ZERO-SHOT GENERALIZATION TO NOVEL MEDICAL TASKS AND CONCEPTS


Training a generalist biomedical AI system with language as a common grounding across tasks and modalities allows the system to tackle new tasks by combining knowledge learned from other tasks and modalities (i.e., compositional generalization). We highlight preliminary evidence suggesting that Med-PaLM M can generalize to novel medical concepts and unseen tasks in a zero-shot fashion. We also observed zero-shot multimodal reasoning as an emergent capability of Med-PaLM M. Finally, we demonstrate the benefits of positive task transfer resulting from the model's multitask, multimodal training.


Evidence of Generalization to Novel Medical Concepts


We probed the zero-shot generalization capability of Med-PaLM M for an unseen medical concept by evaluating its ability to detect tuberculosis (TB) abnormalities in chest x-rays from the Montgomery County (MD) dataset. As shown in Table A1, Med-PaLM M performed competitively with the SOTA result (95.65%) obtained by a specialized ensemble model optimized for this dataset. Performance was similar across the three model variants (83% to 88%), consistent with findings on other medical image classification tasks in MultiMedBench. Because the classification task was set up as open-ended question answering and scored for binary correctness, we did not compute area-under-the-curve metrics, which require predicted probabilities.
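Concretely, binary-correctness scoring of an open-ended answer can be as simple as an exact match after light normalization; the helper below is a hypothetical sketch, not the study's evaluation code.

```python
# Hypothetical sketch: scoring an open-ended classification answer by
# binary correctness (exact match after normalization).
def _normalize(s: str) -> str:
    return " ".join(s.lower().strip().rstrip(".").split())

def is_correct(generated: str, label: str) -> bool:
    return _normalize(generated) == _normalize(label)

predictions = ["Yes, TB is present.", "No"]  # toy model outputs
labels = ["yes, tb is present", "yes"]       # toy references
accuracy = sum(is_correct(p, l) for p, l in zip(predictions, labels)) / len(labels)
print(f"accuracy = {accuracy:.2%}")  # no AUC: open-ended outputs lack probabilities
```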


Evidence of Emergent Zero-Shot Multimodal Medical Reasoning


We also qualitatively explored the zero-shot chain-of-thought capability of Med-PaLM M on the Montgomery County TB dataset. In contrast to the classification setup, we prompted the model with a text-only exemplar to generate a report describing the findings in a given image, in addition to a yes/no classification prediction. Figure 3 shows qualitative examples of zero-shot chain-of-thought reasoning from the Med-PaLM M 84B and 562B variants. In particular, both Med-PaLM M variants correctly identified the lesions associated with TB. However, according to review by an expert radiologist, some omissions and errors remained in the model-generated reports. Med-PaLM M 12B failed to generate a coherent response, suggesting that language-model scale plays a key role in zero-shot chain-of-thought multimodal reasoning capability.
Figure 3

Evidence of Emergent Zero-Shot Multimodal Medical Reasoning with Med-PaLM M.
Large Med-PaLM Multimodal (Med-PaLM M) models exhibit zero-shot chain-of-thought reasoning capability in identifying and describing tuberculosis (TB)-related findings in chest x-ray images. The model is prompted with task-specific instructions and a text-only exemplar to generate a report describing findings in the given x-ray image. Note that the “<img>” tag in black indicates a dummy token instead of a real image to ensure that this task remains zero-shot. Model predictions from Med-PaLM M 84-billion-parameter (84B) and 562B are shown together with the annotations from an expert radiologist. Both models correctly localized the major TB-related cavitary lesion in the right upper lobe. However, neither model addressed the small cavitary lesion in the left upper lobe (Med-PaLM M 562B was considered better than Med-PaLM M 84B in this example, as it also alluded to the opacity in the right middle lobe and did not make the incorrect statement that the left lung was clear). The “<img>” tag in blue denotes the position in the prompt where the image tokens are embedded.


Evidence of Generalization to Novel Tasks


Although Med-PaLM M was trained only with single-view chest x-ray image inputs, we observed its ability to generalize to a novel task setup with multiview visual inputs. Specifically, on a subset of MIMIC-CXR in which each report is accompanied by two views, we observed that Med-PaLM M attained zero-shot performance comparable to that seen on the single-view report generation task, as shown in Table A2. This ability is promising, because medical imaging studies often benefit from interpretation across multiple studies (e.g., the current and prior studies) or multiple views (e.g., frontal and lateral).


Evidence of Positive Task Transfer


To demonstrate the positive task transfer arising from joint training across modalities and tasks, we performed an ablation study in which we trained a Med-PaLM M 84B variant with the MIMIC-CXR classification task excluded and compared it against the Med-PaLM M 84B variant trained on the complete MultiMedBench mixture. As shown in Table A3, the model trained jointly on both the report generation and classification tasks performed better on all report generation metrics. We also found that the model trained only on the chest x-ray report generation task could generalize to the abnormality detection task in a zero-shot fashion with impressive performance, suggesting that the model can learn from the more complex report generation task to distinguish between types of abnormalities and thereby generalize to a novel task setup.


EVALUATION OF MED-PALM M RADIOLOGY REPORT GENERATION ACROSS MODEL SCALES


To assess the clinical applicability of Med-PaLM M, we conducted a radiologist evaluation of model-generated chest x-ray reports (with reports provided by reference radiologists as a baseline). In this evaluation framework, report sources were masked and presentation order was randomized, and the quality of reports generated by Med-PaLM M was studied in detail as a function of model scale, as described in the following sections.

Side-by-Side Evaluation


In a side-by-side evaluation of 246 cases, four clinicians ranked the quality of four radiology reports, comparing the radiologist-provided reference reports from the MIMIC-CXR dataset with reports generated by Med-PaLM M at different model scales (12B, 84B, and 562B).

Figure 4A summarizes how often each rater ranked a report generated by one of the three Med-PaLM M variants, or the reference report, best among the four candidates. Averaged over all four raters, the radiologist-provided reference reports were ranked best in 37.14% of cases, followed by the reports generated by Med-PaLM M 84B, ranked best in 25.78% of cases; the other two model scales, 12B and 562B, were ranked best in 19.49% and 17.59% of cases, respectively.
Figure 4

Side-by-Side Human Evaluation.
Four clinician raters ranked the quality of four radiology reports in a side-by-side evaluation, comparing the radiologist-provided reference report from the MIMIC Chest X-ray dataset versus reports generated by different Med-PaLM Multimodal model scale variants (12-, 84-, and 562-billion-parameter). Panel A shows the best-ranked report in the four-way comparison. Panel B shows the pairwise preference of each model scale compared with the reference report. R denotes reviewer.

To directly compare the reports generated by each Med-PaLM M model against the radiologist-provided reference reports, we derived pairwise preferences from the four-way rankings, with a breakdown provided for each rater and model scale (Figure 4B). Averaged over all four raters, Med-PaLM M 84B was preferred in 40.50% of cases, followed by the other two model scales, 12B and 562B, which were preferred in 34.05% and 32.00% of cases, respectively.
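Deriving a pairwise preference from a four-way ranking is mechanical: a model's report is preferred over the reference whenever it is ranked above it. A minimal sketch under that assumption, with hypothetical source labels:

```python
# Sketch: derive pairwise preference vs. the reference report from a
# four-way ranking. Each ranking lists report sources from best to worst.
from collections import Counter

def pairwise_pref_vs_reference(rankings: list[list[str]]) -> Counter:
    """Count, per model, how often it outranks the 'reference' report."""
    wins = Counter()
    for ranking in rankings:
        ref_pos = ranking.index("reference")
        for pos, source in enumerate(ranking):
            if source != "reference" and pos < ref_pos:
                wins[source] += 1
    return wins

# Toy example with two rated cases (source labels are hypothetical).
rankings = [
    ["84B", "reference", "562B", "12B"],
    ["reference", "84B", "12B", "562B"],
]
wins = pairwise_pref_vs_reference(rankings)
for model in ("12B", "84B", "562B"):
    print(f"Med-PaLM M {model} preferred in {wins[model] / len(rankings):.0%} of cases")
```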

Independent Evaluation


We report the rates of omissions and errors that radiologists identified in the findings paragraphs generated by Med-PaLM M (Figure A1). The trends differed for omissions and errors. For omissions, we observed the lowest mean rate of 0.12 omissions per report (95% confidence interval [CI], 0.10 to 0.15) for both the Med-PaLM M 12B and 84B models, followed by 0.13 (95% CI, 0.11 to 0.16) for the 562B model.

In contrast, we measured the lowest mean error rate of 0.25 errors per report (95% CI, 0.22 to 0.28) for Med-PaLM M 84B, followed by 0.28 (95% CI, 0.24 to 0.31) for Med-PaLM M 12B and 0.29 (95% CI, 0.25 to 0.32) for the 562B model. Notably, this error rate is comparable to the baseline error rates of human radiologists reported on the MIMIC-CXR dataset in a prior study.
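For reference, a per-report mean with a 95% normal-approximation confidence interval can be computed as shown below; the counts are toy values, and the study's exact interval procedure may differ.

```python
# Sketch: mean errors per report with a 95% normal-approximation CI.
# Toy counts for illustration only.
import math

errors_per_report = [0, 1, 0, 0, 2, 0, 1, 0, 0, 0]  # hypothetical counts
n = len(errors_per_report)
mean = sum(errors_per_report) / n
var = sum((x - mean) ** 2 for x in errors_per_report) / (n - 1)  # sample variance
half_width = 1.96 * math.sqrt(var / n)  # normal approximation for the mean
print(f"mean = {mean:.2f} (95% CI, {mean - half_width:.2f} to {mean + half_width:.2f})")
```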

The present analysis was limited to clinically relevant errors, ensuring a specific focus on clinical interpretation. These include errors related to the presence, location, or severity of a clinical finding. An example of a non-clinical error is a reference to a view or a prior study that does not exist; such errors stem from biases in the training data.

These trends across model scales were the same for the subset of omissions and errors that the radiologist raters flagged as significant. Table A13 provides an overview of error and omission rates, including non-clinical errors.

Figure A2 shows a qualitative example of chest x-ray reports generated by Med-PaLM M across the three model sizes, together with the target reference report. For this example, our panel of radiologists judged the Med-PaLM M 12B report to have two clinically significant errors and one omission, the Med-PaLM M 84B report to have no errors or omissions, and the Med-PaLM M 562B report to have one clinically insignificant error and no omissions.

Discussion


This study aimed to build and evaluate a generalist biomedical AI system that can interpret a wide range of biomedical modalities using a single set of weights. To the best of our knowledge, Med-PaLM M is the first demonstration of such a model, performing at a level near or exceeding prior SOTA models on a diverse range of multimodal tasks. Progress in AI has been driven to a large extent by the development of high-quality benchmarks. Although several single-task biomedical AI datasets exist, there have been few attempts to unify them into benchmarks that facilitate the development of generalist biomedical AI systems. Our curation of MultiMedBench is a step toward addressing this unmet need. However, the benchmark has several important limitations, including the limited size of the individual datasets (a cumulative size of approximately 1 million samples) and limited modality and task diversity (e.g., the lack of life-science datasets such as transcriptomics and proteomics). We hope that MultiMedBench constitutes a beginning toward providing the community with a growing number of such curated multimodal biomedical datasets.

The finding that a single generalist model can attain strong performance on several clinically relevant tasks using the same set of model parameters, without task-specific customization, is a promising result for real-world clinical applications, in which clinical decision-making often requires interpreting several different kinds of information. For example, multidisciplinary team meetings are now common (and even mandated) in the care of patients with cancer across many health systems, bringing together expertise from different specialties to discuss findings from many different modalities spanning pathology, radiology, genomics, medical and radiation oncology, surgery, and more.

Narrow, task- and modality-specific AI systems could be used to support several different specialists, for example, as assistive tools for the separate interpretation of pathology, radiology, genomics, or electronic health record information. However, each individual tool would require additional infrastructure, commissioning, safety monitoring, training, and workflow adaptation, representing a significant potential limitation to the broader adoption of AI in multimodal workflows. This also prevents a single flexible tool from analyzing information across multiple domains without task-specific retraining, leading to calls for the development of more generalist technical approaches.

PaLM-E, the foundation model underlying Med-PaLM M, is a highly capable generalist AI model, as evidenced by its strong performance on a wide range of vision-language and embodied robotics tasks. Yet it performed poorly on MultiMedBench; biomedical fine-tuning markedly improved performance at scale. This improvement may be attributable in part to the distribution shift that the biomedical domain as a whole presents relative to the large body of non-medical tasks and modalities.

The results of the radiologist evaluation of radiology reports generated by Med-PaLM M suggest encouraging performance of the model on a challenging multimodal task. In up to 40.50% of cases, a report generated by Med-PaLM M was judged superior to the human-generated reference report. Furthermore, the mean number of clinically significant errors in the model responses was comparable to that reported for human-generated reports in prior studies on the same dataset. These results demonstrate the rapid progress of automated radiology report generation and hint at its potential for future clinical applications.

Although generalist biomedical AI systems offer exciting possibilities, other approaches to developing multimodal biomedical AI systems may be more applicable depending on data availability, pretrained models, compute resources, and application scenarios. These include leveraging frozen encoders with adapter layers to glue together multimodal biomedical AI systems, and developing systems that can interface with specialized biomedical encoders or task-specific agents through tool use.

The present study has several limitations that must be addressed before real-world implementation. For example, AI systems are known to exhibit differences in accuracy across sensitive attributes, with the potential to perpetuate or amplify existing health inequities; more rigorous testing of the fairness properties of Med-PaLM M is therefore needed, in workflows that more broadly simulate real use. Moreover, AI in biomedicine is typically used as an assistive rather than an autonomous tool, and such systems are known to induce both overreliance and underreliance. Further research is needed to evaluate the potential assistive benefit of Med-PaLM M, which was not explored in this article. In particular, evaluations such as the radiology report generation task could be conducted in which radiologists are provided with the system as an assistive tool, investigating alternative metrics such as time savings, the accuracy of the clinician-AI system across multiple settings, and the rate of errors requiring clinician correction.

We conducted only a limited exploration of Med-PaLM M's generalization to novel tasks; future work should examine whether the strong performance observed on previously unseen medical imaging pathologies translates to additional unseen modalities (e.g., three-dimensional imaging). MultiMedBench represents a minimum viable multimodal benchmark that could also be expanded substantially in the future. For example, although we report performance on genomic variant calling, there is considerable potential to extend the explored applications to expression prediction, pharmacogenomics, disease variants, and more. A comprehensive exploration of the potential of generalist systems might also ideally probe use cases requiring the interpretation of audio signals, video data, and many other kinds of medical information, as well as tasks that test conversational capabilities and broader use cases integrating information from multiple domains; this topic is beyond the scope of our work.

Although the development of generally capable biomedical AI systems is exciting, for these systems to be useful in practice or to unlock novel applications, they will need to match or exceed specialized, single-task models or otherwise reach clinically applicable levels of performance. Although beyond the scope of this work, progress here will require careful consideration of safety and equity in the development and validation of these systems.

Notes


A data sharing statement provided by the authors is available with the full text of this article.

This study was funded by Alphabet Inc. and/or its subsidiaries ("Alphabet").

Disclosure forms provided by the authors are available with the full text of this article.

This project was an extensive collaboration between Google Research and Google DeepMind teams. We thank Andrew Sellergren, Yuan Liu, Michael Howell, Julie Wang, Sho Kannan, Christine Kingsley, Roy Lee, Naama Hammel, Jay Hartford, Preeti Singh, Kavita Kulkarni, Gavriel Goidel, Si Wai Man, Amy Wang, Sami Lachgar, Lauren Winer, Maggie Shiels, Annisah Um’rani, John Guilyard, Shravya Shetty, and Evan Rapoport for their valuable insights and feedback during the course of this study. We are also grateful to Karen DeSalvo, Zoubin Ghahramani, James Manyika, and Jeff Dean for their support during this project.

Supplementary Material


Supplementary Appendix (aioa2300138_appendix.pdf)

Disclosure Forms (aioa2300138_disclosures.pdf)

Data Sharing Statement (aioa2300138_data-sharing.pdf)

References

1.
Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542:115-118.
2.
Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016;316:2402-2410.
3.
Tomašev N, Glorot X, Rae JW, et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 2019;572:116-119.
4.
McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature 2020;577:89-94.
5.
Bommasani R, Hudson DA, Adeli E, et al. On the opportunities and risks of foundation models. August 16, 2021 (https://arxiv.org/abs/2108.07258). Preprint.
6.
Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Adv Neural Inf Process Syst 2020;33:1877-1901.
7.
Azizi S, Culp L, Freyberg J, et al. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat Biomed Eng 2023;7:756-779.
8.
Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inf Process Syst 2017;30.
9.
Lake BM, Baroni M. Human-like systematic generalization through a meta-learning neural network. Nature 2023;623:115-121.
10.
Chowdhery A, Narang S, Devlin J, et al. PaLM: scaling language modeling with pathways. April 5, 2022 (https://arxiv.org/abs/2204.02311). Preprint.
11.
Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature 2023;620:172-180.
12.
Driess D, Xia F, Sajjadi MSM, et al. PaLM-E: an embodied multimodal language model. March 6, 2023 (https://arxiv.org/abs/2303.03378). Preprint.
13.
Chen X, Wang X, Changpinyo S, et al. PaLI: a jointly-scaled multilingual language-image model. September 14, 2022 (https://arxiv.org/abs/2209.06794). Preprint.
14.
Jeong J, Tian K, Li A, et al. Multimodal image-text matching improves retrieval-based chest X-ray report generation. March 29, 2023 (https://arxiv.org/abs/2303.17579). Preprint.
15.
Wei J, Bosma M, Zhao VY, et al. Finetuned language models are zero-shot learners. September 3, 2021 (https://arxiv.org/abs/2109.01652). Preprint.
16.
Singhal K, Tu T, Gottweis J, et al. Towards expert-level medical question answering with large language models. May 16, 2023 (https://arxiv.org/abs/2305.09617). Preprint.
17.
Van Veen D, Van Uden C, Attias M, et al. RadAdapt: radiology report summarization via lightweight domain adaptation of large language models. May 2, 2023 (https://arxiv.org/abs/2305.01146). Preprint.
18.
Bazi Y, Rahhal MMA, Bashmal L, Zuair M. Vision-language model for visual question answering in medical imagery. Bioengineering (Basel) 2023;10:380.
19.
van Sonsbeek T, Derakhshani MM, Najdenkoska I, Snoek CG, Worring M. Open-ended medical visual question answering through prefix tuning of language models. March 10, 2023 (https://arxiv.org/abs/2303.05977). Preprint.
20.
Nicolson A, Dowling J, Koopman B. Improving chest X-ray report generation by leveraging warm-starting. January 24, 2022 (https://arxiv.org/abs/2201.09405). Preprint.
21.
Miura Y, Zhang Y, Tsai EB, Langlotz CP, Jurafsky D. Improving factual completeness and consistency of image-to-text radiology report generation. October 20, 2020 (https://arxiv.org/abs/2010.10042). Preprint.
22.
Bannur S, Hyland S, Liu Q, et al. Learning to exploit temporal structure for biomedical vision-language processing. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023:15016-15027.
23.
Tanida T, Müller P, Kaissis G, Rueckert D. Interactive and explainable region-guided radiology report generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023:7433-7442.
24.
Rammuni Silva RS, Fernando P. Effective utilization of multiple convolutional neural networks for chest x-ray classification. SN Comput Sci 2022;3:492.
25.
Wantlin K, Wu C, Huang S-C, et al. BenchMD: a benchmark for modality-agnostic learning on medical images and sensors. April 17, 2023 (https://arxiv.org/abs/2304.08486). Preprint.
26.
Panambur AB, Madhu P, Maier A. Effect of random histogram equalization on breast calcification analysis using deep learning. Bildverarbeitung für die Medizin 2022: Proceedings, German Workshop on Medical Image Computing. Heidelberg, Germany: 2023:173-178.
27.
Poplin R, Chang P-C, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol 2018;36:983-987.
28.
Vankov II, Bowers JS. Training neural networks to encode symbols enables combinatorial generalization. Philos Trans R Soc Lond B Biol Sci 2020;375:20190309.
29.
Wei J, Tay Y, Bommasani R, et al. Emergent abilities of large language models. June 15, 2022 (https://arxiv.org/abs/2206.07682). Preprint.
30.
Oloko-Oba M, Viriri S. Ensemble of EfficientNets for the diagnosis of tuberculosis. Comput Intell Neurosci 2021;2021:9790894.
31.
Kline A, Wang H, Li Y, et al. Multimodal machine learning in precision health: a scoping review. NPJ Digit Med 2022;5:171.
32.
Rollet Q, Bouvier V, Moutel G, et al. Multidisciplinary team meetings: are all patients presented and does it impact quality of care and survival — a registry-based study. BMC Health Serv Res 2021;21:1032.
33.
Expert Advisory Group on Cancer. A policy framework for commissioning cancer services: a report by the Expert Advisory Group on Cancer to the Chief Medical Officers of England and Wales: guidance for purchasers and providers of cancer services. London: Department of Health, 1995.
34.
Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med 2019;17:195.
35.
Borges do Nascimento IJ, Abdulazeem H, Vasanthan LT, et al. Barriers and facilitators to utilizing digital health technologies by healthcare professionals. NPJ Digit Med 2023;6:161.
36.
Moor M, Banerjee O, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature 2023;616:259-265.
37.
Zhang R, Han J, Zhou A, et al. Llama-adapter: efficient fine-tuning of language models with zero-init attention. March 28, 2023 (https://arxiv.org/abs/2303.16199). Preprint.
38.
Schick T, Dwivedi-Yu J, Dessì R, et al. Toolformer: language models can teach themselves to use tools. February 9, 2023 (https://arxiv.org/abs/2302.04761). Preprint.
39.
Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring fairness in machine learning to advance health equity. Ann Intern Med 2018;169:866-872.
40.
Alharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 2022;16:26.


History

Submitted: September 18, 2023
Revised: November 7, 2023
Accepted: November 27, 2023
Published online: February 22, 2024
Published in issue: February 22, 2024

Notes

Dr. Tu can be contacted at taotu@google.com; and Dr. Azizi can be contacted at shekazizi@google.com.
Tao Tu and Shekoofeh Azizi contributed equally to this article. Alan Karthikesalingam and Vivek Natarajan jointly supervised this work.
