Integrated sources model: A new space-learning model for heterogeneous multi-view data reduction, visualization, and clustering

© 2024 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).

Abstract

In machine learning, multi-view data involve multiple distinct sets of attributes (“views”) for a common set of observations; when each view has the same attributes considered in different contexts, the data are said to contain multiple views of homogeneous format, which can be conceptualized as a tensor. In this article, we describe a novel approach for integrating multiple views of heterogeneous format into a common latent space using a workflow that involves non-negative matrix and tensor factorization (NMF/NTF). This approach, which we refer to as the integrated sources model (ISM), consists of two main steps: embedding and analysis. In the embedding step, the views are transformed into matrices with common non-negative components. In the analysis step, the transformed views are combined into a tensor and decomposed using NTF. We also present a variant of ISM, the integrated latent sources model (ILSM), which offers significant advantages over ISM in terms of computational efficiency and in cases where the views are highly unbalanced with regard to the number of attributes per view. Notably, ISM can be extended to process multi-omic and multi-view datasets even in the presence of missing views. We provide a proof-of-concept analysis using five examples, including the UCI Digits (the University of California Irvine Pen-Based Recognition of Handwritten Digits) dataset, a public cell-type gene signatures dataset, and a multi-omic single-cell dataset. These examples demonstrate that, in most cases, multi-view clustering is better achieved with ISM or its variant ILSM than with other latent space approaches. We also show how the non-negativity and sparsity of the ISM model components enable straightforward interpretations, in contrast to other approaches that involve latent factors of mixed signs. Finally, we present potential applications to single-cell multi-omics and spatial mapping, including spatial imaging, spatial transcriptomics, and computational biology, which are currently under evaluation. ISM relies on state-of-the-art algorithms invoked through a simple workflow implemented in Python.
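The two-step idea in the abstract (embed each view on common non-negative components, then stack the embedded views into a tensor and factorize it) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `nmf`, `embed_views`, and `ntf` are hypothetical, the plain multiplicative-update rules are a stand-in chosen for brevity, and the embedding shown here (NMF on the concatenated views, then a per-view refit with the shared components held fixed) is only one simple way to obtain matrices with common components.

```python
import numpy as np

EPS = 1e-9  # guards against division by zero in the multiplicative updates


def nmf(X, k, n_iter=200, seed=0):
    """Plain multiplicative-update NMF: X (n x p) ~= W (n x k) @ H (k x p)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))
    H = rng.random((k, X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + EPS)
        W *= (X @ H.T) / (W @ (H @ H.T) + EPS)
    return W, H


def embed_views(views, k, n_iter=200):
    """Embedding step (assumed simplification): represent every view as an
    n x k non-negative matrix expressed on components shared by all views."""
    # Shared components are learned once on the column-wise concatenation.
    _, H = nmf(np.hstack(views), k, n_iter)
    widths = [v.shape[1] for v in views]
    blocks = np.split(H, np.cumsum(widths)[:-1], axis=1)
    embedded = []
    for X, Hv in zip(views, blocks):
        Wv = np.full((X.shape[0], k), 0.5)
        for _ in range(n_iter):  # update Wv only; Hv stays fixed
            Wv *= (X @ Hv.T) / (Wv @ (Hv @ Hv.T) + EPS)
        embedded.append(Wv)
    return np.stack(embedded, axis=2)  # tensor of shape n x k x n_views


def ntf(T, rank, n_iter=200, seed=0):
    """Analysis step: non-negative CP decomposition by multiplicative updates,
    T[i, j, v] ~= sum_r A[i, r] * B[j, r] * C[v, r]."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.random((d, rank)) for d in T.shape)
    for _ in range(n_iter):
        A *= np.einsum('ijv,jr,vr->ir', T, B, C) / (A @ ((B.T @ B) * (C.T @ C)) + EPS)
        B *= np.einsum('ijv,ir,vr->jr', T, A, C) / (B @ ((A.T @ A) * (C.T @ C)) + EPS)
        C *= np.einsum('ijv,ir,jr->vr', T, A, B) / (C @ ((A.T @ A) * (B.T @ B)) + EPS)
    return A, B, C


# Synthetic example: three heterogeneous views of the same 30 observations.
rng = np.random.default_rng(1)
views = [rng.random((30, p)) for p in (8, 5, 12)]
T = embed_views(views, k=3)
A, B, C = ntf(T, rank=3)
print(T.shape, A.shape, B.shape, C.shape)
```

In this sketch, the rows of `A` give each observation's coordinates in the common latent space and could be passed to a clustering routine, while `C` indicates how strongly each view loads on each component. Because every factor stays non-negative, the components can be read as additive parts, which is the interpretability property the abstract emphasizes.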

Keywords

Principal component analysis

Non-negative matrix factorization

Non-negative tensor factorization

Multi-view clustering

Canonical correlation analysis

Common principal components

Multidimensional scaling

Funding

None.

Conflict of interest

Franck Augé and Galina Boldina are employees of Sanofi and may hold shares and/or stock options in the company. All other authors declare no conflicts of interest.

Artificial Intelligence in Health, Electronic ISSN: 3029-2387 Print ISSN: 3041-0894, Published by AccScience Publishing