AccScience Publishing / EJMO / Online First / DOI: 10.36922/EJMO025150097
ORIGINAL RESEARCH ARTICLE

Evaluation of DeepSeek-R1 and its distilled models for performance and cost efficiency in oncology

Xiao Wei1† Fangcen Liu1† Kai Xin2* Lijing Zhu2*
1 Department of Pathology, Nanjing Drum Tower Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, Jiangsu, China
2 Department of Oncology, Nanjing Drum Tower Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, Jiangsu, China
†These authors contributed equally to this work.
Received: 8 April 2025 | Revised: 17 April 2025 | Accepted: 6 May 2025 | Published online: 3 June 2025
© 2025 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
Abstract

Introduction: Malignant tumors represent a significant public health threat, and the integration of artificial intelligence into health care is increasingly becoming a priority. Many oncology institutions are already considering using DeepSeek-R1 to assist physicians with complex medical decisions. However, sufficient evidence regarding the accuracy, consistency, and cost efficiency of DeepSeek-R1 and its distilled models in oncology decision-making is still lacking. This study aims to fill this gap by evaluating the performance and cost-effectiveness of DeepSeek-R1 and its distilled models in oncology, providing critical insights into their potential for clinical integration.

Objectives: This study aimed to systematically evaluate the performance, consistency, and cost efficiency of the open-source large language model (LLM) DeepSeek-R1 and its distilled variants in the context of oncology decision-making, using a benchmark derived from the MedQA dataset.

Methods: A custom oncology question set of 1,206 multiple-choice questions was curated from MedQA. Seven models, DeepSeek-R1 and six distilled versions, were evaluated using an automated testing framework. Accuracy, consistency, latency, and token consumption were compared across models. Statistical tests, including McNemar's test and the Wilcoxon signed-rank test, were used to assess differences in performance. Questions were also categorized into clinical task types (diagnosis, treatment, triage, and follow-up) for subgroup analysis.

Results: DeepSeek-R1 achieved the highest performance (accuracy: 91.38%; consistency: 90.47%), whereas DeepSeek-R1-Distill-Qwen-32B was the only distilled model to exceed the 0.8 threshold on both metrics (accuracy: 88.72%; consistency: 81.44%). DeepSeek-R1 demonstrated significantly higher accuracy than its distilled counterpart (p<0.05), particularly in diagnosis- and treatment-related tasks (p<0.05). However, it also exhibited significantly greater latency and token consumption. A Cohen's kappa value of 0.575 indicated moderate agreement between the two models.

Conclusion: DeepSeek-R1 is better suited to high-stakes oncology tasks requiring high accuracy and consistency, whereas DeepSeek-R1-Distill-Qwen-32B offers a cost-effective alternative for outpatient or resource-limited settings. These findings support a task- and resource-adaptive deployment strategy for LLMs in clinical oncology.
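The paired comparisons described above (McNemar's test on per-question correctness and Cohen's kappa for inter-model agreement) can be sketched as follows. This is a minimal illustration, not the authors' actual evaluation code: the correctness vectors below are hypothetical, and the McNemar statistic uses the standard continuity-corrected form with 1 degree of freedom.

```python
def accuracy(correct):
    """Fraction of questions answered correctly (1 = correct, 0 = wrong)."""
    return sum(correct) / len(correct)

def mcnemar_chi2(a, b):
    """Continuity-corrected McNemar chi-square statistic (1 df) for two
    models graded on the same questions; only discordant pairs matter."""
    n01 = sum(1 for x, y in zip(a, b) if x and not y)  # a right, b wrong
    n10 = sum(1 for x, y in zip(a, b) if not x and y)  # a wrong, b right
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    po = sum(1 for x, y in zip(a, b) if x == y) / n       # observed
    pe = (sum(a) / n) * (sum(b) / n) \
         + ((n - sum(a)) / n) * ((n - sum(b)) / n)        # expected by chance
    return (po - pe) / (1 - pe)

# Hypothetical per-question correctness for two models on five questions
r1      = [1, 1, 1, 0, 0]
distill = [1, 0, 1, 1, 0]
print(accuracy(r1), accuracy(distill))
print(mcnemar_chi2(r1, distill))
print(cohens_kappa(r1, distill))
```

In practice one would compare the statistic against the chi-square distribution (or use an exact binomial test when discordant counts are small) to obtain the p-values reported in the abstract.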

Keywords
DeepSeek-R1
Distilled models
Oncology
Performance
Cost efficiency
Funding
None.
Conflict of interest
The authors declare no conflicts of interest.
Eurasian Journal of Medicine and Oncology, Electronic ISSN: 2587-196X Print ISSN: 2587-2400, Published by AccScience Publishing