RefSAM3D: Adapting the Segment Anything Model with cross-modal references for three-dimensional medical image segmentation

The Segment Anything Model (SAM), built on a two-dimensional (2D) vision transformer, excels at capturing global patterns in 2D natural images but struggles with three-dimensional (3D) medical imaging modalities such as computed tomography and magnetic resonance imaging. These modalities require modeling spatial information across the full volume for tasks such as organ segmentation and tumor quantification. To address this challenge, we introduce RefSAM3D, which adapts SAM to 3D medical imaging by incorporating a 3D image adapter and cross-modal reference prompt generation. Our approach modifies the visual encoder to handle 3D inputs and enhances the mask decoder to generate 3D masks directly. We also integrate textual prompts to improve segmentation accuracy and consistency in complex anatomical scenarios. A hierarchical attention mechanism lets the model capture and integrate information across different scales. Extensive evaluations on multiple medical imaging datasets demonstrate that RefSAM3D outperforms state-of-the-art methods. Our work thus advances the application of SAM to the accurate segmentation of complex anatomical structures in medical imaging.
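To make the 2D-to-3D gap concrete, the sketch below counts the tokens a vision transformer sees when an input is cut into non-overlapping patches. The 2D case uses SAM's published setting of 1024 x 1024 images with 16 x 16 patches. The 3D case is purely illustrative: the 128 x 128 x 128 volume, the 16 x 16 x 16 cubic patch, and the helper name `patch_grid` are assumptions made for this example, not details of RefSAM3D.

```python
from math import prod


def patch_grid(input_shape, patch_shape):
    """Return (grid, n_tokens) for non-overlapping patches.

    Hypothetical helper for illustration only; it is not part of SAM
    or RefSAM3D. Each axis of the input must be divisible by the
    corresponding patch size.
    """
    if len(input_shape) != len(patch_shape):
        raise ValueError("input and patch must have the same number of axes")
    for size, patch in zip(input_shape, patch_shape):
        if size % patch != 0:
            raise ValueError(f"axis of size {size} is not divisible by {patch}")
    # Number of patches along each axis, then the total token count.
    grid = tuple(size // patch for size, patch in zip(input_shape, patch_shape))
    return grid, prod(grid)


# 2D: SAM's default input resolution and patch size.
grid_2d, tokens_2d = patch_grid((1024, 1024), (16, 16))

# 3D: an assumed volume and cubic patch size, chosen only for illustration.
grid_3d, tokens_3d = patch_grid((128, 128, 128), (16, 16, 16))

print(grid_2d, tokens_2d)  # (64, 64) 4096
print(grid_3d, tokens_3d)  # (8, 8, 8) 512
```

A 2D encoder applied slice by slice would tokenize each slice independently and never relate patches along the depth axis, whereas a 3D patch grid gives every token a depth index as well. This is one way to read the requirement that the visual encoder handle 3D inputs.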