Development and analysis of medical instruction-tuning for Japanese large language models
Amid the ongoing impact of large language models (LLMs) such as ChatGPT, adapting LLMs to the medical domain has emerged as a crucial research frontier. Because mainstream LLMs are typically designed for general-purpose use, constructing a medical LLM through domain adaptation remains a substantial challenge. Although instruction-tuning, particularly based on low-rank adaptation (LoRA), has recently become a common strategy for fine-tuning LLMs, its precise role in domain adaptation remains unclear. Here, we investigated how LoRA-based instruction-tuning improves performance on Japanese medical question-answering tasks through a multifaceted evaluation of multiple-choice questions, including scoring based on “Exact match” and “Gestalt distance” in addition to conventional accuracy. Our findings suggest that LoRA-based instruction-tuning can partially incorporate domain-specific knowledge into LLMs, with larger models showing more pronounced effects. Furthermore, our results underscore the potential of adapting English-centric models to Japanese applications in domain adaptation, while also highlighting the persistent limitations of Japanese-centric models. This work represents a pioneering effort toward enabling medical institutions to fine-tune and operate models without relying on external services.
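For illustration only: the abstract names “Exact match” and “Gestalt distance” as scoring criteria alongside accuracy. Gestalt-style string similarity is commonly computed with the Ratcliff/Obershelp (gestalt pattern matching) algorithm, which Python's standard difflib module implements. The minimal sketch below shows one plausible reading of these two scores under that assumption; the function names, whitespace normalization, and toy answer strings are illustrative and are not taken from the paper's actual evaluation code.

```python
import difflib


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the generated answer matches the reference exactly, else 0.0."""
    return float(prediction.strip() == reference.strip())


def gestalt_similarity(prediction: str, reference: str) -> float:
    """Ratcliff/Obershelp (gestalt pattern matching) similarity in [0, 1],
    as implemented by Python's difflib; a corresponding distance is 1 - similarity."""
    return difflib.SequenceMatcher(None, prediction.strip(), reference.strip()).ratio()


# Toy usage on a hypothetical multiple-choice answer string (illustrative only).
pred, gold = "b. aspirin", "b. aspirin"
print(exact_match(pred, gold))                 # 1.0 for an exact string match
print(gestalt_similarity("b. asprin", gold))   # partial credit (~0.95) for a near-miss generation
```

Under this reading, exact match rewards only verbatim answers, while the gestalt-based score gives graded credit when a free-form generation is close to, but not identical with, the reference choice.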