ORIGINAL RESEARCH ARTICLE

Integrated sources model: A new space-learning model for heterogeneous multi-view data reduction, visualization, and clustering

Paul Fogel1*, Christophe Geissler1, Franck Augé2, Galina Boldina3, George Luta4
1 Data Services, Mazars, Courbevoie, France
2 Translational Precision Medicine, Sanofi, Vitry-sur-Seine, France
3 Precision Medicine and Computational Biology, Sanofi, Vitry-sur-Seine, France
4 Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center, Washington, D.C., United States of America
AIH 2024, 1(3), 89–113; https://doi.org/10.36922/aih.3427
Submitted: 16 April 2024 | Accepted: 5 June 2024 | Published: 24 July 2024
© 2024 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/)
Abstract

In machine learning, multi-view data involve multiple distinct sets of attributes (“views”) for a common set of observations; when each view comprises the same attributes considered in different contexts, the data are said to contain multiple views of homogeneous format, which can be conceptualized as a tensor. In this article, we describe a novel approach for integrating multiple views of heterogeneous format into a common latent space using a workflow that involves non-negative matrix and tensor factorization (NMF/NTF). This approach, which we refer to as the integrated sources model (ISM), consists of two main steps: embedding and analysis. In the embedding step, the views are transformed into matrices with common non-negative components. In the analysis step, the transformed views are combined into a tensor and decomposed using NTF. We also present a variant of ISM, the integrated latent sources model (ILSM), which offers significant advantages over ISM in terms of computational efficiency and in cases where the views are highly unbalanced with regard to the number of attributes per view. Notably, ISM can be extended to process multi-omic and multi-view datasets even in the presence of missing views. We provide a proof-of-concept analysis using five examples, including the UCI Digits dataset (the University of California Irvine Pen-Based Recognition of Handwritten Digits dataset), a public cell-type gene signatures dataset, and a multi-omic single-cell dataset. These examples demonstrate that, in most cases, multi-view clustering is better achieved with ISM or its variant ILSM than with other latent space approaches. We also show how the non-negativity and sparsity of the ISM model components enable straightforward interpretations, in contrast to other approaches that involve latent factors of mixed signs. Finally, we present potential applications to single-cell multi-omics and spatial mapping, including spatial imaging, spatial transcriptomics, and computational biology, which are currently under evaluation. ISM relies on state-of-the-art algorithms invoked through a simple workflow implemented in Python.
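For readers who want a concrete picture of the two-step workflow summarized above, the following is a minimal sketch of an ISM-like pipeline. It is not the authors' implementation: scikit-learn's NMF stands in for the embedding step, TensorLy's non-negative PARAFAC stands in for the NTF analysis step, and the helper `ism_like`, the common rank `k`, and the per-view scaling are illustrative assumptions.

```python
# Minimal sketch of an ISM-like two-step workflow (illustrative, not the authors' code).
# Assumptions: all views are non-negative observation-by-attribute matrices sharing rows;
# scikit-learn's NMF approximates the embedding step and TensorLy's non-negative
# PARAFAC approximates the NTF analysis step.
import numpy as np
from sklearn.decomposition import NMF
import tensorly as tl
from tensorly.decomposition import non_negative_parafac


def ism_like(views, k, seed=0):
    """views: list of (n_obs, p_v) non-negative arrays; k: common latent rank."""
    # Embedding step: factorize each view and keep its observation scores,
    # giving every view a representation in a common k-dimensional non-negative space.
    embedded = []
    for X in views:
        W = NMF(n_components=k, init="nndsvda", random_state=seed).fit_transform(X)
        embedded.append(W / (W.max() + 1e-12))  # crude per-view rescaling
    # Analysis step: stack the embedded views into an (n_obs, k, n_views) tensor
    # and decompose it with non-negative tensor factorization (CP/PARAFAC).
    T = tl.tensor(np.stack(embedded, axis=2))
    weights, (obs_factors, comp_factors, view_factors) = non_negative_parafac(
        T, rank=k, init="random", random_state=seed
    )
    return obs_factors, comp_factors, view_factors


# Example usage with random non-negative data standing in for two heterogeneous views:
rng = np.random.default_rng(0)
views = [np.abs(rng.normal(size=(200, 50))), np.abs(rng.normal(size=(200, 30)))]
scores, _, _ = ism_like(views, k=10)
```

In this sketch, the observation-mode factor matrix returned by the tensor decomposition plays the role of the integrated latent representation on which the observations can subsequently be clustered or visualized.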

Keywords
Principal component analysis
Non-negative matrix factorization
Non-negative tensor factorization
Multi-view clustering
Canonical correlation analysis
Common principal components
Multidimensional scaling
Funding
None.
Conflict of interest
Franck Augé and Galina Boldina are employees of Sanofi and may hold shares and/or stock options in the company. All other authors declare no conflicts of interest.