Evaluation of DeepSeek-R1 and its distilled models for performance and cost efficiency in oncology

Introduction: Malignant tumors pose a significant public health threat, and integrating artificial intelligence into health care is an increasingly urgent priority. Many oncology institutions are already considering DeepSeek-R1 to assist physicians with complex medical decisions, yet evidence on the accuracy, consistency, and cost efficiency of DeepSeek-R1 and its distilled models in oncology decision-making remains insufficient. This study addresses that gap by evaluating the performance and cost-effectiveness of DeepSeek-R1 and its distilled models in oncology, providing insights into their potential for clinical integration.

Objectives: To systematically evaluate the performance, consistency, and cost efficiency of the open-source large language model (LLM) DeepSeek-R1 and its distilled variants in oncology decision-making, using a benchmark derived from the MedQA dataset.

Methods: A custom oncology question set of 1,206 multiple-choice questions was curated from MedQA. Seven models, DeepSeek-R1 and six distilled versions, were evaluated with an automated testing framework, and accuracy, consistency, latency, and token consumption were compared across models. Statistical tests, including McNemar and Wilcoxon signed-rank, were used to assess differences in performance. Questions were also categorized by clinical task type (diagnosis, treatment, triage, and follow-up) for subgroup analysis.

Results: DeepSeek-R1 achieved the highest performance (accuracy: 91.38%; consistency: 90.47%), whereas DeepSeek-R1-Distill-Qwen-32B was the only distilled model to exceed the 0.8 threshold on both metrics (accuracy: 88.72%; consistency: 81.44%). DeepSeek-R1 was significantly more accurate than this distilled counterpart (p<0.05), particularly on diagnosis- and treatment-related tasks (p<0.05), but it also exhibited significantly greater latency and token consumption. A Cohen's kappa of 0.575 indicated moderate agreement between the two models.

Conclusion: DeepSeek-R1 is better suited to high-stakes oncology tasks that demand high accuracy and consistency, whereas DeepSeek-R1-Distill-Qwen-32B offers a cost-effective alternative for outpatient or resource-limited settings. These findings support a task- and resource-adaptive deployment strategy for LLMs in clinical oncology.
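The paired comparisons reported above (McNemar's test on per-question correctness and Cohen's kappa for inter-model agreement) can be sketched with the Python standard library. This is a minimal illustration, not the study's evaluation framework; the correctness arrays are hypothetical stand-ins for the models' per-question results:

```python
import math

def mcnemar_p(a_correct, b_correct):
    """Continuity-corrected McNemar chi-square p-value for the paired
    per-question correctness of two models on the same question set."""
    b01 = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    b10 = sum(1 for x, y in zip(a_correct, b_correct) if not x and y)
    if b01 + b10 == 0:          # no discordant pairs: no evidence of difference
        return 1.0
    stat = (abs(b01 - b10) - 1) ** 2 / (b01 + b10)
    # Survival function of chi-square with 1 df: P(X > stat) = erfc(sqrt(stat/2))
    return math.erfc(math.sqrt(stat / 2))

def cohens_kappa(a_correct, b_correct):
    """Cohen's kappa on the two models' correct/incorrect outcomes."""
    n = len(a_correct)
    agree = sum(1 for x, y in zip(a_correct, b_correct) if x == y)
    po = agree / n                      # observed agreement
    pa, pb = sum(a_correct) / n, sum(b_correct) / n
    pe = pa * pb + (1 - pa) * (1 - pb)  # chance agreement
    return (po - pe) / (1 - pe)
```

A kappa of 0.575, as reported for DeepSeek-R1 versus DeepSeek-R1-Distill-Qwen-32B, falls in the conventional "moderate agreement" band (0.41–0.60).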