TeMex: The Web Template Extractor

Alarte, Julián; Insa Cabrera, David; Silva Galiana, Josep Francesc; Tamarit Muñoz, Salvador

doi:10.1145/2740908.2742835

Files in this item

Name: paper.pdf

Size: 1.400Mb

Format: PDF

Description: Author's version.

Name: Published-ACM.pdf

Size: 1.614Mb

Format: PDF

Description: Publisher's version.

dc.contributor.author	Alarte, Julián	es_ES
dc.contributor.author	Insa Cabrera, David	es_ES
dc.contributor.author	Silva Galiana, Josep Francesc	es_ES
dc.contributor.author	Tamarit Muñoz, Salvador	es_ES
dc.date.accessioned	2016-07-01T12:49:21Z
dc.date.available	2016-07-01T12:49:21Z
dc.date.issued	2015-05
dc.identifier.isbn	978-1-4503-3473-0
dc.identifier.uri	http://hdl.handle.net/10251/66952
dc.description	© ACM 2015. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published by ACM in Proceedings of the 24th International Conference on World Wide Web (pp. 155-158), http://dx.doi.org/10.1145/2740908.2742835	es_ES
dc.description.abstract	This paper presents and describes TeMex, a site-level web template extractor. TeMex is fully automatic, and it can work with online webpages without any preprocessing stage (no information about the template or the associated webpages is needed). More importantly, it does not need a predefined set of webpages to perform the analysis: TeMex only needs a URL. Unlike previous approaches, it includes a mechanism to identify candidate webpages that share the same template. This mechanism increases both recall and precision, and it also reduces the number of webpages loaded and processed. We describe the tool and its internal architecture, and we present the results of its empirical evaluation.	es_ES
dc.description.sponsorship	This work has been partially supported by the EU (FEDER) and the Spanish Ministerio de Economía y Competitividad (Secretaría de Estado de Investigación, Desarrollo e Innovación) under Grant TIN2013-44742-C4-1-R and by the Generalitat Valenciana under Grant PROMETEOII/2015/013. David Insa was partially supported by the Spanish Ministerio de Educación under FPU Grant AP2010-4415. Salvador Tamarit was partially supported by research project POLCA, Programming Large Scale Heterogeneous Infrastructures (610686), funded by the European Union, STREP FP7.	es_ES
dc.format.extent	4	es_ES
dc.language	English	es_ES
dc.publisher	ACM	es_ES
dc.rights	All rights reserved	es_ES
dc.subject.classification	LIBRARY AND INFORMATION SCIENCE	es_ES
dc.subject.classification	COMPUTER LANGUAGES AND SYSTEMS	es_ES
dc.title	TeMex: The Web Template Extractor	es_ES
dc.type	Conference paper	es_ES
dc.identifier.doi	10.1145/2740908.2742835
dc.relation.projectID	info:eu-repo/grantAgreement/MINECO//TIN2013-44742-C4-1-R/ES/VALIDACION ASISTIDA DE PROGRAMAS MEDIANTE METODOS PRECISOS Y RIGUROSOS PARA UNA INGENIERIA DEL SOFTWARE ROBUSTA/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/EC/FP7/610686/EU/Programming Large Scale Heterogeneous Infrastructures/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/GVA//PROMETEOII%2F2015%2F013/ES/SmartLogic: Logic Technologies for Software Security and Performance/	es_ES
dc.relation.projectID	info:eu-repo/grantAgreement/MECD//AP2010-4415/ES/AP2010-4415/	es_ES
dc.rights.accessRights	Open	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació	es_ES
dc.description.bibliographicCitation	Alarte, J.; Insa Cabrera, D.; Silva Galiana, JF.; Tamarit Muñoz, S. (2015). TeMex: The Web Template Extractor. ACM. https://doi.org/10.1145/2740908.2742835	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.conferencename	24th International World Wide Web Conference (WWW 2015)	es_ES
dc.relation.conferencedate	May 18-22, 2015	es_ES
dc.relation.conferenceplace	Florence, Italy	es_ES
dc.relation.publisherversion	http://dx.doi.org/10.1145/2740908.2742835	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.relation.senia	286700	es_ES
dc.contributor.funder	Ministerio de Economía y Competitividad	es_ES
dc.contributor.funder	European Regional Development Fund	es_ES
dc.contributor.funder	Generalitat Valenciana	es_ES
dc.contributor.funder	European Commission	es_ES
dc.contributor.funder	Ministerio de Educación, Cultura y Deporte	es_ES

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia