
TeMex: The Web Template Extractor

RiuNet: Institutional repository of the Polytechnic University of Valencia


dc.contributor.author Alarte, Julián es_ES
dc.contributor.author Insa Cabrera, David es_ES
dc.contributor.author Silva Galiana, Josep Francesc es_ES
dc.contributor.author Tamarit Muñoz, Salvador es_ES
dc.date.accessioned 2016-07-01T12:49:21Z
dc.date.available 2016-07-01T12:49:21Z
dc.date.issued 2015-05
dc.identifier.isbn 978-1-4503-3473-0
dc.identifier.uri http://hdl.handle.net/10251/66952
dc.description © ACM 2015. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published by ACM in Proceedings of the 24th International Conference on World Wide Web (pp. 155-158), http://dx.doi.org/10.1145/2740908.2742835 es_ES
dc.description.abstract This paper presents and describes TeMex, a site-level web template extractor. TeMex is fully automatic and works with online webpages without any preprocessing stage (no information about the template or the associated webpages is needed). More importantly, it does not need a predefined set of webpages to perform the analysis: TeMex only needs a URL. Unlike previous approaches, it includes a mechanism to identify webpage candidates that share the same template. This mechanism increases both recall and precision, and it also reduces the number of webpages loaded and processed. We describe the tool and its internal architecture, and we present the results of its empirical evaluation. es_ES
dc.description.sponsorship This work has been partially supported by the EU (FEDER) and the Spanish Ministerio de Economía y Competitividad (Secretaría de Estado de Investigación, Desarrollo e Innovación) under Grant TIN2013-44742-C4-1-R and by the Generalitat Valenciana under Grant PROMETEOII/2015/013. David Insa was partially supported by the Spanish Ministerio de Educación under FPU Grant AP2010-4415. Salvador Tamarit was partially supported by research project POLCA, Programming Large Scale Heterogeneous Infrastructures (610686), funded by the European Union, STREP FP7. es_ES
dc.format.extent 4 es_ES
dc.language English es_ES
dc.publisher ACM es_ES
dc.relation MINECO-FEDER/TIN2013-44742-C4-1-R es_ES
dc.relation GV/PROMETEOII/2015/013 es_ES
dc.relation MECD/FPU/AP2010-4415 es_ES
dc.rights All rights reserved es_ES
dc.subject.classification LIBRARY AND INFORMATION SCIENCE es_ES
dc.subject.classification COMPUTER LANGUAGES AND SYSTEMS es_ES
dc.title TeMex: The Web Template Extractor es_ES
dc.type Conference paper es_ES
dc.identifier.doi 10.1145/2740908.2742835
dc.relation.projectID info:eu-repo/grantAgreement/EC/FP7/610686/EU es_ES
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Alarte, J.; Insa Cabrera, D.; Silva Galiana, JF.; Tamarit Muñoz, S. (2015). TeMex: The Web Template Extractor. ACM. https://doi.org/10.1145/2740908.2742835 es_ES
dc.description.accrualMethod S es_ES
dc.relation.conferencename 24th International World Wide Web Conference (WWW 2015) es_ES
dc.relation.conferencedate May 18-22, 2015 es_ES
dc.relation.conferenceplace Florence, Italy es_ES
dc.relation.publisherversion http://dx.doi.org/10.1145/2740908.2742835 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.relation.senia 286700 es_ES
dc.contributor.funder Ministerio de Economía y Competitividad es_ES
dc.contributor.funder European Regional Development Fund es_ES
dc.contributor.funder Generalitat Valenciana es_ES
dc.contributor.funder Ministerio de Educación es_ES
dc.contributor.funder European Commission es_ES
dc.relation.references Overlay extension. Available from URL: https://developer.mozilla.org/en-US/Add-ons/Overlay_Extensions, 2005. es_ES
dc.relation.references J. Alarte, D. Insa, J. Silva, and S. Tamarit. Automatic Detection of Webpages that Share the Same Web Template. In M. H. ter Beek and A. Ravara, editors, Proceedings of the 10th International Workshop on Automated Specification and Verification of Web Systems (WWV 14), volume 163 of Electronic Proceedings in Theoretical Computer Science, pages 2--15. Open Publishing Association, July 2014. es_ES
dc.relation.references J. Alarte, D. Insa, J. Silva, and S. Tamarit. A Benchmark Suite for Template Detection and Content Extraction. CoRR, abs/1409.6182, 2014. es_ES
dc.relation.references Z. Bar-Yossef and S. Rajagopalan. Template detection via data mining and its applications. In Proceedings of the 11th International Conference on World Wide Web (WWW'02), pages 580--591, New York, NY, USA, 2002. ACM. es_ES
dc.relation.references M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff. Cleaneval: a Competition for Cleaning Web Pages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC'08), pages 638--643. European Language Resources Association, May 2008. es_ES
dc.relation.references D. Gibson, K. Punera, and A. Tomkins. The volume and evolution of web page templates. In A. Ellis and T. Hagino, editors, Proceedings of the 14th International Conference on World Wide Web (WWW'05), pages 830--839. ACM, May 2005. es_ES
dc.relation.references T. Gottron. Evaluating content extraction on HTML documents. In V. Grout, D. Oram, and R. Picking, editors, Proceedings of the 2nd International Conference on Internet Technologies and Applications (ITA'07), pages 123--132. National Assembly for Wales, September 2007. es_ES
dc.relation.references D. d. C. Reis, P. B. Golgher, A. S. Silva, and A. H. F. Laender. Automatic web news extraction using tree edit distance. In Proceedings of the 13th International Conference on World Wide Web (WWW'04), pages 502--511, New York, NY, USA, 2004. ACM. es_ES
dc.relation.references K. Vieira, A. L. da Costa Carvalho, K. Berlt, E. S. de Moura, A. S. da Silva, and J. Freire. On finding templates on web collections. World Wide Web, 12(2):171--211, 2009. es_ES
dc.relation.references K. Vieira, A. S. da Silva, N. Pinto, E. S. de Moura, J. M. B. Cavalcanti, and J. Freire. A fast and robust method for web page template detection and removal. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM'06), pages 258--267, New York, NY, USA, 2006. ACM. es_ES
dc.relation.references T. Weninger, W. H. Hsu, and J. Han. CETR: Content Extraction via Tag Ratios. In M. Rappa, P. Jones, J. Freire, and S. Chakrabarti, editors, Proceedings of the 19th International Conference on World Wide Web (WWW'10), pages 971--980. ACM, April 2010. es_ES
dc.relation.references L. Yi, B. Liu, and X. Li. Eliminating noisy information in web pages for data mining. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), pages 296--305, New York, NY, USA, 2003. ACM. es_ES

