Evaluation of DeepSeek-R1 and its distilled models for performance and cost efficiency in oncology

Introduction: Malignant tumors pose a significant public health threat, and integrating artificial intelligence into health care is an increasingly urgent priority. Many oncology institutions are already considering DeepSeek-R1 to assist physicians with complex medical decisions, yet evidence on the accuracy, consistency, and cost efficiency of DeepSeek-R1 and its distilled models in oncology decision-making remains insufficient. This study addresses that gap by evaluating the performance and cost-effectiveness of DeepSeek-R1 and its distilled models in oncology, providing insights into their potential for clinical integration.

Objectives: To systematically evaluate the performance, consistency, and cost efficiency of the open-source large language model (LLM) DeepSeek-R1 and its distilled variants in oncology decision-making, using a benchmark derived from the MedQA dataset.

Methods: A custom oncology question set of 1,206 multiple-choice questions was curated from MedQA. Seven models, DeepSeek-R1 and six distilled versions, were evaluated with an automated testing framework, and accuracy, consistency, latency, and token consumption were compared across models. Statistical tests, including McNemar and Wilcoxon signed-rank, were used to assess differences in performance. Questions were also categorized by clinical task type (diagnosis, treatment, triage, and follow-up) for subgroup analysis.

Results: DeepSeek-R1 achieved the highest performance (accuracy: 91.38%; consistency: 90.47%), whereas DeepSeek-R1-Distill-Qwen-32B was the only distilled model to exceed the 0.8 threshold on both metrics (accuracy: 88.72%; consistency: 81.44%). DeepSeek-R1 was significantly more accurate than this distilled counterpart (p<0.05), particularly on diagnosis- and treatment-related tasks (p<0.05), but it also exhibited significantly greater latency and token consumption. A Cohen's kappa of 0.575 indicated moderate agreement between the two models.

Conclusion: DeepSeek-R1 is better suited to high-stakes oncology tasks that demand high accuracy and consistency, whereas DeepSeek-R1-Distill-Qwen-32B offers a cost-effective alternative for outpatient or resource-limited settings. These findings support a task- and resource-adaptive deployment strategy for LLMs in clinical oncology.
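The paired comparisons reported above (McNemar's test on per-question correctness and Cohen's kappa for inter-model agreement) can be sketched with the Python standard library. This is a minimal illustration, not the study's evaluation framework; the correctness arrays are hypothetical stand-ins for the models' per-question results:

```python
import math

def mcnemar_p(a_correct, b_correct):
    """Continuity-corrected McNemar chi-square p-value for the paired
    per-question correctness of two models on the same question set."""
    b01 = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    b10 = sum(1 for x, y in zip(a_correct, b_correct) if not x and y)
    if b01 + b10 == 0:          # no discordant pairs: no evidence of difference
        return 1.0
    stat = (abs(b01 - b10) - 1) ** 2 / (b01 + b10)
    # Survival function of chi-square with 1 df: P(X > stat) = erfc(sqrt(stat/2))
    return math.erfc(math.sqrt(stat / 2))

def cohens_kappa(a_correct, b_correct):
    """Cohen's kappa on the two models' correct/incorrect outcomes."""
    n = len(a_correct)
    agree = sum(1 for x, y in zip(a_correct, b_correct) if x == y)
    po = agree / n                      # observed agreement
    pa, pb = sum(a_correct) / n, sum(b_correct) / n
    pe = pa * pb + (1 - pa) * (1 - pb)  # chance agreement
    return (po - pe) / (1 - pe)
```

A kappa of 0.575, as reported for DeepSeek-R1 versus DeepSeek-R1-Distill-Qwen-32B, falls in the conventional "moderate agreement" band (0.41–0.60).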