Page-Level Main Content Extraction from Heterogeneous Webpages

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia

dc.contributor.author Alarte, Julián es_ES
dc.contributor.author Silva, Josep es_ES
dc.date.accessioned 2022-04-05T06:28:24Z
dc.date.available 2022-04-05T06:28:24Z
dc.date.issued 2021-12 es_ES
dc.identifier.issn 1556-4681 es_ES
dc.identifier.uri http://hdl.handle.net/10251/181752
dc.description.abstract [EN] The main content of a webpage is often surrounded by boilerplate elements related to the template, such as menus, advertisements, copyright notices, and comments. For crawlers and indexers, isolating the main content from the template and other noisy information is an essential task, because processing and storing noisy information wastes resources such as bandwidth, storage space, and computing time. Moreover, detecting and extracting the main content is useful in areas such as data mining, web summarization, and content adaptation to low resolutions. This work introduces a new technique for main content extraction. In contrast to most techniques, it extracts not only text but also other types of content, such as images and animations. It is a page-level technique based on the Document Object Model (DOM), so it needs to load only a single webpage to extract the main content. As a consequence, it is efficient enough to be used online (in real time). We have empirically evaluated the technique on a suite of real heterogeneous benchmarks, obtaining very good results compared with other well-known content extraction techniques. es_ES
dc.description.sponsorship This work has been partially supported by the EU (FEDER) and the Spanish MCI/AEI under grants TIN2016-76843-C4-1-R and PID2019-104735RB-C41, by the Generalitat Valenciana under grant Prometeo/2019/098 (DeepTrust), and by TAILOR, a project funded by the EU Horizon 2020 research and innovation programme under GA No 952215. es_ES
dc.language English es_ES
dc.publisher Association for Computing Machinery es_ES
dc.relation.ispartof ACM Transactions on Knowledge Discovery from Data es_ES
dc.rights All rights reserved es_ES
dc.subject Information retrieval es_ES
dc.subject Content extraction es_ES
dc.subject Template extraction es_ES
dc.subject Web mining es_ES
dc.subject Block detection es_ES
dc.subject.classification LENGUAJES Y SISTEMAS INFORMATICOS es_ES
dc.title Page-Level Main Content Extraction from Heterogeneous Webpages es_ES
dc.type Article es_ES
dc.identifier.doi 10.1145/3451168 es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-104735RB-C41/ES/SAFER-UPV: ANALISIS Y VALIDACION DE SOFTWARE Y RECURSOS WEB/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/AEI//TIN2016-76843-C4-1-R//METODOS RIGUROSOS PARA EL INTERNET DEL FUTURO/ es_ES
dc.relation.projectID info:eu-repo/grantAgreement/EC/H2020/952215/EU es_ES
dc.relation.projectID info:eu-repo/grantAgreement/GENERALITAT VALENCIANA//PROMETEO%2F2019%2F098//DEEPTRUST/ es_ES
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Alarte, J.; Silva, J. (2021). Page-Level Main Content Extraction from Heterogeneous Webpages. ACM Transactions on Knowledge Discovery from Data. 15(6):1-21. https://doi.org/10.1145/3451168 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion https://doi.org/10.1145/3451168 es_ES
dc.description.upvformatpinicio 1 es_ES
dc.description.upvformatpfin 21 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 15 es_ES
dc.description.issue 6 es_ES
dc.relation.pasarela S\428851 es_ES
dc.contributor.funder GENERALITAT VALENCIANA es_ES
dc.contributor.funder AGENCIA ESTATAL DE INVESTIGACION es_ES
dc.contributor.funder European Regional Development Fund es_ES
dc.contributor.funder COMISION DE LAS COMUNIDADES EUROPEA es_ES

