dc.contributor.author | López Romero, Sergio | es_ES |
dc.contributor.author | Silva Galiana, Josep Francesc | es_ES |
dc.contributor.author | Insa Cabrera, David | es_ES |
dc.date.accessioned | 2015-03-05T09:14:30Z | |
dc.date.available | 2015-03-05T09:14:30Z | |
dc.date.issued | 2012-06 | |
dc.identifier.issn | 1870-9044 | |
dc.identifier.uri | http://hdl.handle.net/10251/47738 | |
dc.description.abstract | This article introduces a new approach for content extraction that exploits the hierarchical inter-relations of the elements in a webpage. Content extraction is a technique used to extract the main textual content from a webpage. This is useful for filtering out advertisements and all other information that is not part of the main content. The main idea behind our approach is to use the DOM tree as an explicit representation of the inter-relations of the elements in a webpage. Using the information contained in the DOM tree, we can identify blocks of content and easily determine which of the blocks contains the most text. Thanks to this information, the technique achieves considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words and tags), but it also gives us very precise information regarding the related components in a block, thus producing very cohesive blocks. | es_ES |
dc.language | English | es_ES |
dc.publisher | IPN, Centro de Innovación y Desarrollo Tecnológico en Cómputo | es_ES |
dc.relation.ispartof | Research and Development in Computer Science and Engineering | es_ES |
dc.rights | Attribution - NonCommercial (by-nc) | es_ES |
dc.subject | Content extraction | es_ES |
dc.subject | Block detection | es_ES |
dc.subject | DOM | es_ES |
dc.subject.classification | LENGUAJES Y SISTEMAS INFORMATICOS | es_ES |
dc.title | Content Extraction based on Hierarchical Relations in DOM Structures | es_ES |
dc.type | Article | es_ES |
dc.rights.accessRights | Open | es_ES |
dc.contributor.affiliation | Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació | es_ES |
dc.description.bibliographicCitation | López Romero, S.; Silva Galiana, JF.; Insa Cabrera, D. (2012). Content Extraction based on Hierarchical Relations in DOM Structures. Research and Development in Computer Science and Engineering. 45:5-12. http://hdl.handle.net/10251/47738 | es_ES |
dc.description.accrualMethod | S | es_ES |
dc.relation.publisherversion | http://www.cidetec.ipn.mx/polibits/Paginas/issue45.aspx | es_ES |
dc.description.upvformatpinicio | 5 | es_ES |
dc.description.upvformatpfin | 12 | es_ES |
dc.type.version | info:eu-repo/semantics/publishedVersion | es_ES |
dc.description.volume | 45 | es_ES |
dc.relation.senia | 235315 |