
Using the DOM tree for content extraction

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia


dc.contributor.author López, Sergio es_ES
dc.contributor.author Silva Galiana, Josep Francesc es_ES
dc.contributor.author Insa Cabrera, David es_ES
dc.date.accessioned 2015-02-02T19:48:37Z
dc.date.available 2015-02-02T19:48:37Z
dc.date.issued 2012-10
dc.identifier.issn 2075-2180
dc.identifier.uri http://hdl.handle.net/10251/46658
dc.description.abstract The main information of a webpage is usually mixed with menus, advertisements, panels, and other not necessarily related information, and it is often difficult to isolate this information automatically. This is precisely the objective of content extraction, a research area of wide interest due to its many applications. Content extraction is useful not only for the final human user; it is also frequently used as a preprocessing stage in systems that need to extract the main content of a web document to avoid processing other, useless information. Another interesting application of content extraction is displaying webpages on small screens such as mobile phones or PDAs. In this work we present a new technique for content extraction that uses the DOM tree of the webpage to analyze the hierarchical relations of the elements in the webpage. Thanks to this information, the technique achieves considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words and tags), but it also gives us very precise information regarding the related components in a block, thus producing very cohesive blocks. es_ES
dc.language English es_ES
dc.publisher Open Publishing Association es_ES
dc.relation.ispartof Electronic Proceedings in Theoretical Computer Science es_ES
dc.rights Attribution (by) es_ES
dc.subject.classification COMPUTER LANGUAGES AND SYSTEMS es_ES
dc.title Using the DOM tree for content extraction es_ES
dc.type Article es_ES
dc.identifier.doi 10.4204/EPTCS.98
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation López, S.; Silva Galiana, JF.; Insa Cabrera, D. (2012). Using the DOM tree for content extraction. Electronic Proceedings in Theoretical Computer Science. 98(Proceedings 8th International Workshop on Automated Specification and Verification of Web Systems):46-59. doi:10.4204/EPTCS.98 es_ES
dc.description.accrualMethod S es_ES
dc.relation.publisherversion http://dx.doi.org/10.4204/EPTCS.98.6 es_ES
dc.description.upvformatpinicio 46 es_ES
dc.description.upvformatpfin 59 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.description.volume 98 es_ES
dc.description.issue Proceedings 8th International Workshop on Automated Specification and Verification of Web Systems es_ES
dc.relation.senia 249355
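
The abstract describes using the hierarchical relations in the DOM tree to find a cohesive main block. The record does not give the algorithm itself, so the following is only a minimal illustrative sketch of one simple DOM-based heuristic, not the authors' technique. It assumes that the main content is the deepest element that still holds most of the page text. The names `Node`, `TreeBuilder`, `parse_html`, `main_block`, the `threshold` value of 0.6, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be left open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling, not page content.
IGNORED_TEXT_TAGS = {"script", "style"}


class Node:
    """Minimal DOM-like element (hypothetical helper, not a real DOM API)."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # mix of Node objects and text strings, in order

    def text(self):
        """All text in this subtree, with whitespace normalized."""
        pieces = [c if isinstance(c, str) else c.text() for c in self.children]
        return " ".join(" ".join(pieces).split())

    def text_length(self):
        return len(self.text())


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack[-1].tag not in IGNORED_TEXT_TAGS and data.strip():
            self.stack[-1].children.append(data)


def parse_html(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


def main_block(root, threshold=0.6):
    """Descend while one child element holds at least `threshold` of the text."""
    node = root
    while True:
        total = node.text_length()
        dominant = [c for c in node.children
                    if isinstance(c, Node) and total
                    and c.text_length() / total >= threshold]
        if not dominant:
            return node
        node = dominant[0]


PAGE = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body>
  <ul id="menu"><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>
  <div id="main">
    <h1>Content extraction</h1>
    <p>The main content of a page is usually surrounded by menus and advertisements.</p>
    <p>A DOM-based extractor looks at how elements are nested to find the main block.</p>
  </div>
  <div id="footer">Contact us<br>Legal notice</div>
</body></html>
"""

tree = parse_html(PAGE)
block = main_block(tree)
print(block.tag, block.attrs.get("id"))
print(block.text())
```

On the sample page the descent stops at the `div` with `id="main"`, because neither of its paragraphs alone reaches the threshold, so the heading and both paragraphs stay together as one block. A fixed threshold is a crude choice; it is used here only to keep the sketch short.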

