Development and analysis of medical instruction-tuning for Japanese large language models
Amid the ongoing impact of large language models (LLMs) such as ChatGPT, adapting LLMs to the medical domain has emerged as a crucial research frontier. Because mainstream LLMs are typically designed for general-purpose use, constructing a medical LLM through domain adaptation remains a substantial challenge. Although instruction-tuning, particularly based on low-rank adaptation (LoRA), has recently become a common strategy for fine-tuning LLMs, its precise role in domain adaptation remains unclear. Here, we investigated how LoRA-based instruction-tuning improves performance on Japanese medical question-answering tasks through a multifaceted evaluation of multiple-choice questions, including scoring based on “Exact match” and “Gestalt distance” in addition to conventional accuracy. Our findings suggest that LoRA-based instruction-tuning can partially incorporate domain-specific knowledge into LLMs, with larger models showing more pronounced effects. Furthermore, our results underscore the potential of adapting English-centric models to Japanese applications in domain adaptation, while also highlighting the persistent limitations of Japanese-centric models. This work represents a pioneering effort toward enabling medical institutions to fine-tune and operate models without relying on external services.
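For illustration only: the abstract names “Exact match” and “Gestalt distance” as scoring criteria alongside accuracy. Gestalt-style string similarity is commonly computed with the Ratcliff/Obershelp (gestalt pattern matching) algorithm, which Python's standard difflib module implements. The minimal sketch below shows one plausible reading of these two scores under that assumption; the function names, whitespace normalization, and toy answer strings are illustrative and are not taken from the paper's actual evaluation code.

```python
import difflib


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the generated answer matches the reference exactly, else 0.0."""
    return float(prediction.strip() == reference.strip())


def gestalt_similarity(prediction: str, reference: str) -> float:
    """Ratcliff/Obershelp (gestalt pattern matching) similarity in [0, 1],
    as implemented by Python's difflib; a corresponding distance is 1 - similarity."""
    return difflib.SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()


# Toy usage on a hypothetical multiple-choice answer string (illustrative only).
pred, gold = "b. aspirin", "b. aspirin"
print(exact_match(pred, gold))                 # 1.0 for an exact string match
print(gestalt_similarity("b. asprin", gold))   # partial credit (~0.95) for a near-miss generation
```

Under this reading, exact match rewards only verbatim answers, while the gestalt-based score gives graded credit when a free-form generation is close to, but not identical with, the reference choice.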