Aggregative quantification for regression

Bella Sanjuán, Antonio; Ferri Ramírez, César; Hernández Orallo, José; Ramírez Quintana, María José

doi:10.1007/s10618-013-0308-z

Files in this item

Name: DMKDD.pdf

Size: 615.6 KB

Format: PDF

Description: Author's version.

Name: art%3A10.1007%2Fs ...

Size: 833.9 KB

Format: PDF

Description: Publisher's version

dc.contributor.author	Bella Sanjuán, Antonio	es_ES
dc.contributor.author	Ferri Ramírez, César	es_ES
dc.contributor.author	Hernández Orallo, José	es_ES
dc.contributor.author	Ramírez Quintana, María José	es_ES
dc.date.accessioned	2015-04-27T12:05:54Z
dc.date.available	2015-04-27T12:05:54Z
dc.date.issued	2014-03-01
dc.identifier.issn	1384-5810
dc.identifier.uri	http://hdl.handle.net/10251/49300
dc.description	The final publication is available at Springer via http://dx.doi.org/10.1007/s10618-013-0308-z	es_ES
dc.description.abstract	The problem of estimating the class distribution (or prevalence) for a new unlabelled dataset (from a possibly different distribution) is a very common problem which has been addressed in one way or another in the past decades. This problem has been recently reconsidered as a new task in data mining, renamed quantification when the estimation is performed as an aggregation (and possible adjustment) of a single-instance supervised model (e.g., a classifier). However, the study of quantification has been limited to classification, while it is clear that this problem also appears, perhaps even more frequently, with other predictive problems, such as regression. In this case, the goal is to determine a distribution or an aggregated indicator of the output variable for a new unlabelled dataset. In this paper, we introduce a comprehensive new taxonomy of quantification tasks, distinguishing between the estimation of the whole distribution and the estimation of some indicators (summary statistics), for both classification and regression. This distinction is especially useful for regression, since predictions are numerical values that can be aggregated in many different ways, as in multi-dimensional hierarchical data warehouses. We focus on aggregative quantification for regression and see that the approaches borrowed from classification do not work. We present several techniques based on segmentation which are able to produce accurate estimations of the expected value and the distribution of the output variable. We show experimentally that these methods especially excel for the relevant scenarios where training and test distributions dramatically differ.	es_ES
dc.description.sponsorship	We would like to thank the anonymous reviewers for their careful reviews, insightful comments and very useful suggestions. This work was supported by the MEC/MINECO projects CONSOLIDER-INGENIO CSD2007-00022 and TIN 2010-21062-C02-02, GVA project PROMETEO/2008/051, the COST-European Cooperation in the field of Scientific and Technical Research IC0801 AT, and the REFRAME project granted by the European Coordinated Research on Long-term Challenges in Information and Communication Sciences & Technologies ERA-Net (CHIST-ERA), and funded by the Ministerio de Economía y Competitividad in Spain.	en_EN
dc.language	English	es_ES
dc.publisher	Springer Verlag (Germany)	es_ES
dc.relation.ispartof	Data Mining and Knowledge Discovery	es_ES
dc.rights	All rights reserved	es_ES
dc.subject	Quantification	es_ES
dc.subject	Regression quantification	es_ES
dc.subject	Probability estimation	es_ES
dc.subject	Segmentation	es_ES
dc.subject	Distribution	es_ES
dc.subject	Aggregation	es_ES
dc.subject.classification	LENGUAJES Y SISTEMAS INFORMATICOS	es_ES
dc.title	Aggregative quantification for regression	es_ES
dc.type	Article	es_ES
dc.identifier.doi	10.1007/s10618-013-0308-z
dc.relation.projectID	info:eu-repo/grantAgreement/MEC//CSD2007-00022/ES/Agreement Technologies/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/COST//IC0801/EU/Agreement Technologies/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/MICINN//TIN2010-21062-C02-02/ES/SWEETLOGICS-UPV/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/Generalitat Valenciana//PROMETEO08%2F2008%2F051/ES/Advances on Agreement Technologies for Computational Entities (atforce)/	es_ES
dc.rights.accessRights	Open	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació	es_ES
dc.description.bibliographicCitation	Bella Sanjuán, A.; Ferri Ramírez, C.; Hernández Orallo, J.; Ramírez Quintana, MJ. (2014). Aggregative quantification for regression. Data Mining and Knowledge Discovery. 28(2):475-518. https://doi.org/10.1007/s10618-013-0308-z	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.publisherversion	http://link.springer.com/article/10.1007%2Fs10618-013-0308-z	es_ES
dc.description.upvformatpinicio	475	es_ES
dc.description.upvformatpfin	518	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.description.volume	28	es_ES
dc.description.issue	2	es_ES
dc.relation.senia	263092
dc.contributor.funder	Generalitat Valenciana	es_ES
dc.contributor.funder	European Cooperation in Science and Technology	es_ES
dc.contributor.funder	Ministerio de Educación y Ciencia	es_ES
dc.contributor.funder	Ministerio de Ciencia e Innovación	es_ES

This item appears in the following collection(s)

Articles, conferences, monographs [48360]

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia