AccScience Publishing / AIH / Volume 1 / Issue 2 / DOI: 10.36922/aih.2695

Development and analysis of medical instruction-tuning for Japanese large language models

Issey Sukeda1*, Masahiro Suzuki2, Hiroki Sakaji3, Satoshi Kodera1
1 Department of Cardiovascular Medicine, Graduate School of Medicine, The University of Tokyo, Bunkyo, Tokyo, Japan
2 Department of Systems Innovation, School of Engineering, The University of Tokyo, Bunkyo, Tokyo, Japan
3 Faculty of Information Science and Technology, Hokkaido University, Sapporo, Hokkaido, Japan
AIH 2024, 1(2), 107–116
Submitted: 10 January 2024 | Accepted: 13 March 2024 | Published: 8 April 2024
© 2024 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License.

Amid the ongoing impact of large language models (LLMs) such as ChatGPT, adapting LLMs to the medical domain has emerged as a crucial research frontier. Because mainstream LLMs tend to be designed for general-purpose applications, constructing a medical LLM through domain adaptation remains a major challenge. Although instruction-tuning, particularly with low-rank adaptation (LoRA), has recently become a common strategy for fine-tuning LLMs, its precise role in domain adaptation remains unclear. Here, we investigated how LoRA-based instruction-tuning improves performance on Japanese medical question-answering tasks through a multifaceted evaluation of multiple-choice questions, scoring responses by “Exact match” and “Gestalt distance” in addition to conventional accuracy. Our findings suggest that LoRA-based instruction-tuning can partially incorporate domain-specific knowledge into LLMs, with larger models showing more pronounced effects. Furthermore, our results underscore the potential of adapting English-centric models for Japanese applications in domain adaptation, while also highlighting the persisting limitations of Japanese-centric models. This initiative represents a pioneering effort toward enabling medical institutions to fine-tune and operate models without relying on external services.
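
As an illustration of the “Exact match” and “Gestalt distance” scores named in the abstract, the following is a minimal sketch rather than the authors' evaluation code. It assumes that Gestalt distance is one minus the Ratcliff/Obershelp (gestalt pattern matching) similarity, which Python's standard `difflib.SequenceMatcher` computes; the definition used in the article may differ. The function names and the whitespace stripping are hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def exact_match(prediction: str, answer: str) -> bool:
    """Return True when the generated text equals the reference answer.

    Assumption: surrounding whitespace is ignored before comparing.
    """
    return prediction.strip() == answer.strip()


def gestalt_similarity(prediction: str, answer: str) -> float:
    """Ratcliff/Obershelp similarity in [0, 1], as computed by difflib."""
    return SequenceMatcher(None, prediction.strip(), answer.strip()).ratio()


def gestalt_distance(prediction: str, answer: str) -> float:
    """Assumed definition: 1 - similarity, so 0.0 means identical strings."""
    return 1.0 - gestalt_similarity(prediction, answer)


if __name__ == "__main__":
    # Hypothetical model output compared against a hypothetical reference choice.
    reference = "急性心筋梗塞"
    output = "心筋梗塞"
    print(exact_match(output, reference))
    print(round(gestalt_distance(output, reference), 3))
```

A score of this kind gives partial credit to an output that nearly reproduces the correct choice, whereas exact match counts only verbatim agreement.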

Medical large language models
Domain adaptation
Low-rank adaptation
This study was supported by the Japan Agency for Medical Research and Development (Grant Number: JP23hk0102078h0003).
Conflict of interest
The authors declare they have no competing interests.
Artificial Intelligence in Health, Electronic ISSN: 3029-2387 Published by AccScience Publishing