RefSAM3D: Adapting the Segment Anything Model with cross-modal references for three-dimensional medical image segmentation

© 2025 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).

Abstract

The Segment Anything Model (SAM), originally built on a two-dimensional (2D) vision transformer, excels at capturing global patterns in 2D natural images but faces challenges when applied to three-dimensional (3D) medical imaging modalities such as computed tomography and magnetic resonance imaging. These modalities require capturing spatial information in volumetric space for tasks such as organ segmentation and tumor quantification. To address this challenge, we introduce RefSAM3D, which adapts SAM to 3D medical imaging by incorporating a 3D image adapter and cross-modal reference prompt generation. Our approach modifies the visual encoder to handle 3D inputs and enhances the mask decoder to generate 3D masks directly. We also integrate textual prompts to improve segmentation accuracy and consistency in complex anatomical scenarios. By employing a hierarchical attention mechanism, the model captures and integrates information across different scales. Extensive evaluations on multiple medical imaging datasets demonstrate that RefSAM3D outperforms state-of-the-art methods. Our work thus advances the application of SAM to the accurate segmentation of complex anatomical structures in medical imaging.
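As a rough illustration of what handling 3D inputs in a transformer encoder can involve, the sketch below splits a toy volume into cubic patches (one token per patch) and passes the tokens through a generic residual bottleneck adapter. It is not the RefSAM3D implementation: the function names `patchify_3d` and `bottleneck_adapter`, the patch size, the bottleneck width, and the zero-initialized up-projection are all assumptions chosen for illustration, and the adapter shown is the common down-project/up-project design rather than the 3D image adapter described in the article.

```python
import numpy as np

def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubic patches, one token per patch."""
    d, h, w = volume.shape
    if d % patch or h % patch or w % patch:
        raise ValueError("each volume dimension must be divisible by the patch size")
    gd, gh, gw = d // patch, h // patch, w // patch
    # (gd, p, gh, p, gw, p) -> (gd, gh, gw, p, p, p) -> (num_tokens, p**3)
    x = volume.reshape(gd, patch, gh, patch, gw, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(gd * gh * gw, patch ** 3)

def bottleneck_adapter(tokens, w_down, w_up):
    """Generic residual adapter (assumed design): down-project, ReLU, up-project, add input."""
    hidden = np.maximum(tokens @ w_down, 0.0)
    return tokens + hidden @ w_up

rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 16, 16))     # toy stand-in for a CT/MRI sub-volume
tokens = patchify_3d(volume, patch=4)         # 2 * 4 * 4 = 32 tokens of length 64
w_down = rng.standard_normal((64, 8)) * 0.02  # hypothetical bottleneck width of 8
w_up = np.zeros((8, 64))                      # zero init: adapter starts as the identity
adapted = bottleneck_adapter(tokens, w_down, w_up)
```

With the up-projection initialized to zero, the adapter initially leaves the tokens unchanged, which is one common way to add trainable parameters to a pretrained encoder without disturbing its starting behavior.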

Keywords

Three-dimensional medical imaging

Cross-modal reference prompt

Volumetric segmentation

Vision transformer

Funding

None.

Conflict of interest

The authors declare they have no competing interests.

Artificial Intelligence in Health, Electronic ISSN: 3029-2387, Print ISSN: 3041-0894, published by AccScience Publishing