dc.contributor.author | Insa Cabrera, David | es_ES |
dc.contributor.author | Silva Galiana, Josep Francesc | es_ES |
dc.contributor.author | Tamarit, Salvador | es_ES |
dc.date.accessioned | 2014-05-21T12:32:06Z | |
dc.date.issued | 2013-11 | |
dc.identifier.issn | 1567-8326 | |
dc.identifier.uri | http://hdl.handle.net/10251/37664 | |
dc.description.abstract | The main content in a webpage is usually centered and visible without the need to scroll. It is often surrounded by the navigation menus of the website, and it can include advertisements, panels, banners, and other not necessarily related information. The process of automatically extracting the main content of a webpage is called content extraction. Content extraction is an area of research of wide interest due to its many applications. Concretely, it is useful not only for the final human user, but it is also frequently used as a preprocessing stage of different systems (e.g., robots, indexers, crawlers) that need to extract the main content of a web document to avoid the treatment and processing of other useless information. In this work we present a new technique for content extraction that is based on the information contained in the DOM tree. The technique analyzes the hierarchical relations of the elements in the webpage and the distribution of textual information in order to identify the main block of content. Thanks to the hierarchy imposed by the DOM tree, the technique achieves considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words and tags), but it also gives us very precise information regarding the related components in a block (not necessarily textual, such as images or videos), thus producing very cohesive blocks. © 2013 Elsevier Inc. All rights reserved. | es_ES |
dc.description.sponsorship | This work has been partially supported by the Spanish Ministerio de Economia y Competitividad (Secretaria de Estado de Investigacion, Desarrollo e Innovacion) under Grant TIN2008-06622-C03-02 and by the Generalitat Valenciana under Grant PROMETEO/2011/052. Salvador Tamarit was partially supported by the Spanish MICINN under FPI Grant BES-2009-015019. David Insa was partially supported by the Spanish Ministerio de Educacion under FPU Grant AP2010-4415. | en_EN |
dc.format.extent | 15 | es_ES |
dc.language | English | es_ES |
dc.publisher | Elsevier | es_ES |
dc.relation.ispartof | Journal of Logic and Algebraic Programming | es_ES |
dc.rights | All rights reserved | es_ES |
dc.subject | Content extraction | es_ES |
dc.subject | Block detection | es_ES |
dc.subject | DOM | es_ES |
dc.subject | Information retrieval | es_ES |
dc.subject.classification | LENGUAJES Y SISTEMAS INFORMATICOS | es_ES |
dc.title | Using the words/leafs ratio in the DOM tree for content extraction | es_ES |
dc.type | Article | es_ES |
dc.identifier.doi | 10.1016/j.jlap.2013.01.002 | |
dc.relation.projectID | info:eu-repo/grantAgreement/MICINN//TIN2008-06622-C03-02/ES/VERIFICACION Y DEPURACION AGILES ORIENTADAS A MEJORAR LA SEGURIDAD DEL SOFTWARE/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/GVA//PROMETEO%2F2011%2F052/ES/LOGICEXTREME: TECNOLOGIA LOGICA Y SOFTWARE SEGURO/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/MICINN//BES-2009-015019/ES/BES-2009-015019/ | es_ES |
dc.relation.projectID | info:eu-repo/grantAgreement/ME//AP2010-4415/ES/AP2010-4415/ | es_ES |
dc.rights.accessRights | Open | es_ES |
dc.contributor.affiliation | Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació | es_ES |
dc.description.bibliographicCitation | Insa Cabrera, D.; Silva Galiana, JF.; Tamarit, S. (2013). Using the words/leafs ratio in the DOM tree for content extraction. Journal of Logic and Algebraic Programming. 82(8):311-325. https://doi.org/10.1016/j.jlap.2013.01.002 | es_ES |
dc.description.accrualMethod | S | es_ES |
dc.relation.publisherversion | http://dx.doi.org/10.1016/j.jlap.2013.01.002 | es_ES |
dc.description.upvformatpinicio | 311 | es_ES |
dc.description.upvformatpfin | 325 | es_ES |
dc.type.version | info:eu-repo/semantics/publishedVersion | es_ES |
dc.description.volume | 82 | es_ES |
dc.description.issue | 8 | es_ES |
dc.relation.senia | 249674 | |
dc.contributor.funder | Ministerio de Educación | es_ES |
dc.contributor.funder | Generalitat Valenciana | es_ES |