dc.contributor.author | Alarte, Julián | es_ES |
dc.contributor.author | Silva, Josep | es_ES |
dc.date.accessioned | 2022-04-05T06:28:24Z | |
dc.date.available | 2022-04-05T06:28:24Z | |
dc.date.issued | 2021-12 | es_ES |
dc.identifier.issn | 1556-4681 | es_ES |
dc.identifier.uri | http://hdl.handle.net/10251/181752 | |
dc.description.abstract | [EN] The main content of a webpage is often surrounded by boilerplate elements related to the template, such as menus, advertisements, copyright notices, and comments. For crawlers and indexers, isolating the main content from the template and other noisy information is an essential task, because processing and storing noisy information wastes resources such as bandwidth, storage space, and computing time. In addition, detecting and extracting the main content is useful in different areas, such as data mining, web summarization, and content adaptation to low resolutions. This work introduces a new technique for main content extraction. In contrast to most techniques, it extracts not only text but also other types of content, such as images and animations. It is a Document Object Model-based, page-level technique, so it needs to load only a single webpage to extract the main content. As a consequence, it is efficient enough to be used online (in real time). We have empirically evaluated the technique using a suite of real heterogeneous benchmarks, producing very good results compared with other well-known content extraction techniques. | es_ES |
dc.description.sponsorship | This work has been partially supported by the EU (FEDER) and the Spanish MCI/AEI under grants TIN2016-76843-C4-1-R and PID2019-104735RB-C41, by the Generalitat Valenciana under grant Prometeo/2019/098 (DeepTrust), and by TAILOR, a project funded by EU Horizon 2020 research and innovation programme under GA No 952215. | es_ES |
dc.language | English | es_ES |
dc.publisher | Association for Computing Machinery | es_ES |
dc.relation.ispartof | ACM Transactions on Knowledge Discovery from Data | es_ES |
dc.rights | All rights reserved | es_ES |
dc.subject | Information retrieval | es_ES |
dc.subject | Content extraction | es_ES |
dc.subject | Template extraction | es_ES |
dc.subject | Web mining | es_ES |
dc.subject | Block detection | es_ES |
dc.subject.classification | COMPUTER LANGUAGES AND SYSTEMS | es_ES |
dc.title | Page-Level Main Content Extraction from Heterogeneous Webpages | es_ES |
dc.type | Article | es_ES |
dc.identifier.doi | 10.1145/3451168 | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-104735RB-C41/ES/SAFER-UPV: ANALISIS Y VALIDACION DE SOFTWARE Y RECURSOS WEB/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/AEI//TIN2016-76843-C4-1-R//METODOS RIGUROSOS PARA EL INTERNET DEL FUTURO/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/EC/H2020/952215/EU | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/GENERALITAT VALENCIANA//PROMETEO%2F2019%2F098//DEEPTRUST/ | es_ES |
dc.rights.accessRights | Open | es_ES |
dc.contributor.affiliation | Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació | es_ES |
dc.description.bibliographicCitation | Alarte, J.; Silva, J. (2021). Page-Level Main Content Extraction from Heterogeneous Webpages. ACM Transactions on Knowledge Discovery from Data. 15(6):1-21. https://doi.org/10.1145/3451168 | es_ES |
dc.description.accrualMethod | S | es_ES |
dc.relation.publisherversion | https://doi.org/10.1145/3451168 | es_ES |
dc.description.upvformatpinicio | 1 | es_ES |
dc.description.upvformatpfin | 21 | es_ES |
dc.type.version | info:eu-repo/semantics/publishedVersion | es_ES |
dc.description.volume | 15 | es_ES |
dc.description.issue | 6 | es_ES |
dc.relation.pasarela | S\428851 | es_ES |
dc.contributor.funder | GENERALITAT VALENCIANA | es_ES |
dc.contributor.funder | AGENCIA ESTATAL DE INVESTIGACION | es_ES |
dc.contributor.funder | European Regional Development Fund | es_ES |
dc.contributor.funder | COMISION DE LAS COMUNIDADES EUROPEA | es_ES |