Abstract
Advances in modern sequencing techniques have generated an unprecedented amount of multi-omics data, which provide great opportunities to quantitatively explore functional genomes from different but complementary perspectives. However, distinct modalities and sequencing technologies generate diverse types of data, which greatly complicates statistical modeling because each data type requires its own uniquely optimized methods. In this paper, we propose a unified framework for Bayesian nonparametric matrix factorization that infers overlapping bi-clusters for multi-omics data. The proposed method adaptively discretizes different types of observations into common latent states, on which cluster structures are built hierarchically. Because the method is Bayesian nonparametric, it automatically determines the number of clusters. We demonstrate the utility of the proposed method using simulation studies and applications to four datasets: a single-cell RNA-sequencing dataset, a combined single-cell RNA-sequencing and single-cell ATAC-sequencing dataset, a bulk RNA-sequencing dataset, and a DNA methylation dataset. These applications reveal several findings that are consistent with the biological literature.
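As a rough illustration of the idea of overlapping bi-clusters defined on shared latent states, the toy sketch below may help. It is not the model proposed in the paper: the binary membership matrices, the Boolean-product rule, the two-level latent state, and the Poisson and Gaussian emission parameters are all assumptions chosen for illustration, and the number of clusters is fixed by hand rather than inferred.

```python
import numpy as np


def overlapping_biclusters(row_members, col_members):
    """Boolean product: entry (i, j) is active if row i and column j share a cluster.

    Rows and columns may belong to several clusters at once, which is what
    makes the resulting bi-clusters overlap.
    """
    return (row_members @ col_members.T) > 0


def simulate_two_modalities(states, rng):
    """Emit a count matrix and a continuous matrix from the same latent states.

    The rates and means below are arbitrary illustrative values.
    """
    counts = rng.poisson(np.where(states, 8.0, 1.0))
    continuous = rng.normal(np.where(states, 2.0, 0.0), 0.5)
    return counts, continuous


# Hypothetical toy memberships: 4 samples, 5 features, 2 clusters.
row_members = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
col_members = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 0]])

states = overlapping_biclusters(row_members, col_members)
counts, continuous = simulate_two_modalities(states, np.random.default_rng(0))
```

Here the second sample and the third feature each belong to both clusters, so the two bi-clusters overlap, and the count and continuous matrices differ in data type while sharing one latent state matrix.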
Acknowledgements
Yang Ni is partially supported by the National Science Foundation, NSF DMS-1918851 and NSF DMS-2112943. Robert S. Chapkin is partially supported by the Allen Endowed Chair in Nutrition & Chronic Disease Prevention, and the National Institutes of Health (Grant Nos. R01-ES025713, R01-CA202697, R35-CA197707, and T32-CA090301). Kejun He is partially supported by the National Natural Science Foundation of China under Grant 11801560.
Cite this article
Zhou, F., He, K., Cai, J.J. et al. A Unified Bayesian Framework for Bi-overlapping-Clustering Multi-omics Data via Sparse Matrix Factorization. Stat Biosci 15, 669–691 (2023). https://doi.org/10.1007/s12561-022-09350-w
DOI: https://doi.org/10.1007/s12561-022-09350-w