
Enabling network inference methods to handle missing data and outliers

RiuNet: Institutional Repository of the Universitat Politècnica de València



dc.contributor.author Folch-Fortuny, Abel es_ES
dc.contributor.author Fernández Villaverde, Alejandro es_ES
dc.contributor.author Ferrer Riquelme, Alberto José es_ES
dc.contributor.author Rodríguez Banga, Julio es_ES
dc.date.accessioned 2016-05-30T09:36:13Z
dc.date.available 2016-05-30T09:36:13Z
dc.date.issued 2015-09-03
dc.identifier.issn 1471-2105
dc.identifier.uri http://hdl.handle.net/10251/64905
dc.description © 2015 Folch-Fortuny et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. es_ES
dc.description.abstract [EN] Background: The inference of complex networks from data is a challenging problem in biological sciences, as well as in a wide range of disciplines such as chemistry, technology, economics, or sociology. The quantity and quality of the data greatly affect the results. While many methodologies have been developed for this task, they seldom take into account issues such as missing data or outlier detection and correction, which need to be properly addressed before network inference. Results: Here we present an approach to (i) handle missing data and (ii) detect and correct outliers based on multivariate projection to latent structures. The method, called trimmed scores regression (TSR), enables network inference methods to analyse incomplete datasets by imputing the missing values coherently with the latent data structure. Furthermore, it substitutes the faulty values in a dataset by proper estimations. We provide an implementation of this approach, and show how it can be integrated with any network inference method as a preliminary data curation step. This functionality is demonstrated with a state-of-the-art network inference method based on mutual information distance and entropy reduction, MIDER. Conclusion: The methodology presented here enables network inference methods to analyse a large number of incomplete and faulty datasets that could not be reliably analysed so far. Our comparative studies show the superiority of TSR over other missing data approaches used by practitioners. Furthermore, the method allows for outlier detection and correction. es_ES
dc.description.sponsorship Research in this study was partially supported by the European Union through project BioPreDyn (FP7-KBBE 289434), and the Spanish Ministry of Science and Innovation and FEDER funds from the European Union through grants MultiScales (DPI2011-28112-C04-02, DPI2011-28112-C04-03), and SynBioFactory (DPI2014-55276-C5-1-R, DPI2014-55276-C5-2-R). AF Villaverde also acknowledges funding from the Xunta de Galicia through an I2C postdoctoral fellowship (I2C ED481B 2014/133-0). We also gratefully acknowledge Associate Professor Francisco Arteaga for his help in the adaptation of TSR to the PCA model building context. en_EN
dc.language English es_ES
dc.publisher BioMed Central es_ES
dc.relation.ispartof BMC Bioinformatics es_ES
dc.rights Attribution (by) es_ES
dc.subject Network inference es_ES
dc.subject Missing data es_ES
dc.subject Outlier detection es_ES
dc.subject Projection to latent structures es_ES
dc.subject Trimmed scores regression es_ES
dc.subject Information theory es_ES
dc.subject Mutual information es_ES
dc.subject.classification STATISTICS AND OPERATIONS RESEARCH es_ES
dc.title Enabling network inference methods to handle missing data and outliers es_ES
dc.type Article es_ES
dc.identifier.doi 10.1186/s12859-015-0717-7
dc.relation.projectID info:eu-repo/grantAgreement/EC/FP7/289434/EU/From Data to Models: New Bioinformatics Methods and Tools for Data-Driven Predictive Dynamic Modelling in Biotechnological Applications/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MICINN//DPI2011-28112-C04-02/ES/MONITORIZACION, INFERENCIA, OPTIMIZACION Y CONTROL MULTI-ESCALA: DE CELULAS A BIORREACTORES. (MULTISCALES)/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MICINN//DPI2011-28112-C04-03/ES/INFERENCIA, MONITORIZACION, OPTIMIZACION Y CONTROL MULTI-ESCALA: DE CELULAS A BIORREACTORES (MULTISCALES)/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/Xunta de Galicia//I2C ED481B 2014%2F133-0/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MINECO//DPI2014-55276-C5-1-R/ES/BIOLOGIA SINTETICA PARA LA MEJORA EN BIOPRODUCCION: DISEÑO, OPTIMIZACION, MONITORIZACION Y CONTROL/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/MINECO//DPI2014-55276-C5-2-R/ES/BIOLOGIA SINTETICA PARA LA MEJORA DE BIOPROCESOS: DISEÑO, OPTIMIZACION, MONITORIZACION Y CONTROL/ es_ES
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Estadística e Investigación Operativa Aplicadas y Calidad - Departament d'Estadística i Investigació Operativa Aplicades i Qualitat es_ES
dc.description.bibliographicCitation Folch-Fortuny, A.; Fernández Villaverde, A.; Ferrer Riquelme, AJ.; Rodríguez Banga, J. (2015). Enabling network inference methods to handle missing data and outliers. BMC Bioinformatics. 16(283):1-12. https://doi.org/10.1186/s12859-015-0717-7 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://dx.doi.org/10.1186/s12859-015-0717-7 es_ES
dc.description.upvformatpinicio 1 es_ES
dc.description.upvformatpfin 12 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 16 es_ES
dc.description.issue 283 es_ES
dc.relation.senia 294027 es_ES
dc.identifier.pmid 26335628 en_EN
dc.identifier.pmcid PMC4559359 en_EN
dc.contributor.funder European Commission
dc.contributor.funder Ministerio de Ciencia e Innovación
dc.contributor.funder Xunta de Galicia
dc.contributor.funder Ministerio de Economía y Competitividad
dc.description.references Albert R, Barabási AL. Statistical mechanics of complex networks. Rev Mod Phys. 2002; 74(1):47–97. es_ES
dc.description.references Newman MEJ. The structure and function of complex networks. SIAM Rev. 2003; 45(2):167–256. es_ES
dc.description.references De Smet R, Marchal K. Advantages and limitations of current network inference methods. Nat Rev Microbiol. 2010; 8(10):717–29. es_ES
dc.description.references Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D, Stolovitzky G. Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci. 2010; 107(14):6286–291. es_ES
dc.description.references Prill RJ, Saez-Rodriguez J, Alexopoulos LG, Sorger PK, Stolovitzky G. Crowdsourcing network inference: the DREAM predictive signaling network challenge. Sci Signal. 2011; 4(189):7. es_ES
dc.description.references Lecca P, Priami C. Biological network inference for drug discovery. Drug Discovery Today. 2013; 18(5-6):256–64. es_ES
dc.description.references Maetschke SR, Madhamshettiwar PB, Davis MJ, Ragan MA. Supervised, semi-supervised and unsupervised inference of gene regulatory networks. Brief Bioinform. 2013; 15(2):195–211. es_ES
dc.description.references Grung B, Manne R. Missing values in principal component analysis. Chemometr Intell Lab Syst. 1998; 42(1-2):125–39. es_ES
dc.description.references Arteaga F, Ferrer A. Missing data. In: Comprehensive chemometrics chemical and biochemical data analysis. Amsterdam: Elsevier: 2009. p. 285–314. es_ES
dc.description.references Jackson JE. A user’s guide to principal components. Hoboken: Wiley Ser Probab Stat; 2004. es_ES
dc.description.references Walczak B, Massart DL. Dealing with missing data. Chemometr Intell Lab Syst. 2001; 58(1):15–27. es_ES
dc.description.references Martens H, Jr Russwurm H. Food research and data analysis. London; New York, NY, USA: Elsevier Applied Science; 1983. es_ES
dc.description.references Arteaga F, Ferrer A. Dealing with missing data in MSPC: Several methods, different interpretations, some examples. J Chemom. 2002; 16(8-10):408–18. es_ES
dc.description.references Folch-Fortuny A, Arteaga F, Ferrer A. PCA model building with missing data: new proposals and a comparative study. Chemometr Intell Lab Syst. 2015; 146:77–88. es_ES
dc.description.references Liao SG, Lin Y, Kang DD, Chandra D, Bon J, Kaminski N, et al. Missing value imputation in high-dimensional phenomic data: imputable or not, and how? BMC Bioinforma. 2014; 15(1):346. es_ES
dc.description.references Wold S, Esbensen K, Geladi P. Principal component analysis. Chemometr Intell Lab Syst. 1987; 2(1-3):37–52. es_ES
dc.description.references Kourti T, MacGregor JF. Process analysis, monitoring and diagnosis, using multivariate projection methods. Chemometr Intell Lab Syst. 1995; 28(1):3–21. es_ES
dc.description.references Ferrer A. Latent structures-based multivariate statistical process control: A paradigm shift. Qual Eng. 2014; 26(1):72–91. es_ES
dc.description.references Villaverde AF, Ross J, Morán F, Banga JR. MIDER: Network inference with mutual information distance and entropy reduction. PLoS ONE. 2014; 9(5):96732. es_ES
dc.description.references Shannon CE. A mathematical theory of communication. Bell Sys Tech J. 1948; 27(3):379–423. es_ES
dc.description.references Cover TM, Thomas JA. Elements of information theory, 1st ed. New York: Wiley-Interscience; 1991. es_ES
dc.description.references Villaverde AF, Ross J, Banga JR. Reverse engineering cellular networks with information theoretic methods. Cells. 2013; 2(2):306–29. es_ES
dc.description.references Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007; 5(1):8. es_ES
dc.description.references Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, et al. ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinforma. 2006; 7(Suppl 1):7. es_ES
dc.description.references Meyer PE, Kontos K, Lafitte F, Bontempi G. Information-theoretic inference of large transcriptional regulatory networks. EURASIP J Bioinforma Syst Biol. 2007; 2007(1):79879. es_ES
dc.description.references Luo W, Hankenson KD, Woolf PJ. Learning transcriptional regulatory networks from high throughput gene expression data using continuous three-way mutual information. BMC Bioinforma. 2008; 9:467. es_ES
dc.description.references Zoppoli P, Morganella S, Ceccarelli M. TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinforma. 2010; 11:154. es_ES
dc.description.references Wu CC, Huang HC, Juan HF, Chen ST. GeneNetwork: an interactive tool for reconstruction of genetic networks using microarray data. Bioinformatics (Oxford, England). 2004; 20(18):3691–693. es_ES
dc.description.references Gustafsson M, Hörnquist M, Lombardi A. Constructing and analyzing a large-scale gene-to-gene regulatory network–lasso-constrained inference and biological validation. IEEE/ACM Trans Comput Biol Bioinform. 2005; 2(3):254–61. es_ES
dc.description.references Guthke R, Möller U, Hoffmann M, Thies F, Töpfer S. Dynamic network reconstruction from gene expression data applied to immune response during bacterial infection. Bioinformatics (Oxford, England). 2005; 21(8):1626–34. es_ES
dc.description.references Schulze S, Henkel SG, Driesch D, Guthke R, Linde J. Computational prediction of molecular pathogen-host interactions based on dual transcriptome data. Front Microbiol. 2015; 6:65. es_ES
dc.description.references Hurley D, Araki H, Tamada Y, Dunmore B, Sanders D, Humphreys S, et al. Gene network inference and visualization tools for biologists: application to new human transcriptome datasets. Nucleic Acids Res. 2012; 40(6):2377–398. es_ES
dc.description.references Souto MCd, Jaskowiak PA, Costa IG. Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinforma. 2015; 16(1):64. es_ES
dc.description.references Guitart-Pla O, Kustagi M, Rügheimer F, Califano A, Schwikowski B. The Cyni framework for network inference in Cytoscape. Bioinformatics (Oxford, England). 2015; 31(9):1499–1501. es_ES
dc.description.references Camacho J, Picó J, Ferrer A. Data understanding with PCA: Structural and variance information plots. Chemometr Intell Lab Syst. 2010; 100(1):48–56. es_ES
dc.description.references Wold S. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics. 1978; 20(4):397–405. es_ES
dc.description.references Camacho J, Ferrer A. Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: theoretical aspects. J Chemom. 2012; 26(7):361–73. es_ES
dc.description.references Little RJA, Rubin DB. Statistical analysis with missing data, 2nd ed. Hoboken, NJ: Wiley-Interscience; 2002. es_ES
dc.description.references Ferrer A. Multivariate statistical process control based on principal component analysis (MSPC-PCA): Some reflections and a case study in an autobody assembly process. Qual Eng. 2007; 19(4):311–25. es_ES
dc.description.references MacGregor JF, Kourti T. Statistical process control of multivariate processes. Control Eng Pract. 1995; 3(3):403–14. es_ES
dc.description.references Stanimirova I, Daszykowski M, Walczak B. Dealing with missing values and outliers in principal component analysis. Talanta. 2007; 72(1):172–8. es_ES
dc.description.references Abdi H, Williams LJ. Principal component analysis. Wiley Interdiscip Rev Comput Stat. 2010; 2(4):433–59. es_ES
dc.description.references Camacho J, Picó J, Ferrer A. The best approaches in the on-line monitoring of batch processes based on PCA: Does the modelling structure matter? Anal Chim Acta. 2009; 642(1-2):59–68. es_ES
dc.description.references González-Martínez JM, de Noord OE, Ferrer A. Multisynchro: a novel approach for batch synchronization in scenarios of multiple asynchronisms. J Chemom. 2014; 28(5):462–75. es_ES
dc.description.references Samoilov MS. Reconstruction and Functional Analysis of General Chemical Reactions and Reaction Networks. California, United States: Stanford University; 1997. es_ES
dc.description.references Samoilov M, Arkin A, Ross J. On the deduction of chemical reaction pathways from measurements of time series of concentrations. Chaos (Woodbury, NY). 2001; 11(1):108–14. es_ES
dc.description.references Cantone I, Marucci L, Iorio F, Ricci MA, Belcastro V, Bansal M, et al. A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell. 2009; 137(1):172–81. es_ES
dc.description.references Arkin A, Shen P, Ross J. A test case of correlation metric construction of a reaction pathway from measurements. Science. 1997; 277(5330):1275–9. es_ES
dc.description.references Schaffter T, Marbach D, Floreano D. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics (Oxford, England). 2011; 27(16):2263–270. es_ES
dc.description.references Marbach D, Schaffter T, Mattiussi C, Floreano D. Generating realistic in silico gene networks for performance assessment of reverse engineering methods. J Comput Biol. 2009; 16(2):229–39. es_ES
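
Illustrative note on the abstract: the record describes trimmed scores regression (TSR) as imputing missing values "coherently with the latent data structure" before network inference. The sketch below shows one way that idea might look in NumPy. It is a simplified illustration written from the general description, not the authors' implementation: the function name `tsr_like_impute`, its parameters, and the stopping rule are all assumptions made here, and it assumes at least two variables and no column that is entirely missing.

```python
import numpy as np

def tsr_like_impute(X, n_components=2, max_iter=200, tol=1e-8):
    """Iterative PCA-based imputation in the spirit of TSR (hypothetical sketch).

    Missing entries are marked with NaN. Each iteration fits a PCA model on
    the currently filled data, then re-estimates every missing entry by
    regressing the missing variables on the scores computed from the
    observed variables only (the "trimmed" scores).
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Start from column-mean imputation.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    rows_with_gaps = np.where(missing.any(axis=1))[0]
    for _ in range(max_iter):
        mean = filled.mean(axis=0)
        Z = filled - mean                      # centred data
        S = np.cov(Z, rowvar=False)            # covariance of filled data
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)
        P = Vt[:n_components].T                # loadings, shape (vars, A)

        new = filled.copy()
        for i in rows_with_gaps:
            m = missing[i]                     # missing variables in row i
            o = ~m                             # observed variables in row i
            Po = P[o]                          # loadings of observed variables
            t_trim = Z[i, o] @ Po              # trimmed scores
            # Least-squares regression of missing variables on trimmed scores.
            B = np.linalg.pinv(Po.T @ S[np.ix_(o, o)] @ Po) @ Po.T @ S[np.ix_(o, m)]
            new[i, m] = mean[m] + t_trim @ B

        change = np.linalg.norm(new[missing] - filled[missing])
        filled = new
        if change < tol:
            break
    return filled
```

A curated matrix returned by such a function could then be handed to any network inference tool as a preprocessing step, which is the integration pattern the abstract describes; the number of components would normally be chosen by cross-validation and is fixed here only for brevity.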

