Information extraction from Webpages based on DOM distances

Castillo, Carlos; Valero Llinares, Héctor; Guadalupe Ramos, José; Silva Galiana, Josep Francesc

doi:10.1007/978-3-642-28601-8_16

Files in this item

Name: cicling2012_submi ...

Size: 1.339Mb

Format: PDF

Description: Author's version.

Name: Information Extraction ...

Size: 309.7Kb

Format: PDF

Description: Publisher's version

dc.contributor.author	Castillo, Carlos	es_ES
dc.contributor.author	Valero Llinares, Héctor	es_ES
dc.contributor.author	Guadalupe Ramos, José	es_ES
dc.contributor.author	Silva Galiana, Josep Francesc	es_ES
dc.date.accessioned	2014-02-24T07:29:47Z
dc.date.issued	2012
dc.identifier.isbn	978-3-642-28600-1
dc.identifier.issn	0302-9743
dc.identifier.uri	http://hdl.handle.net/10251/35896
dc.description.abstract	Retrieving information from the Internet is a difficult task, as demonstrated by the lack of real-time tools able to extract information from webpages. The main cause is that most webpages on the Internet are implemented in plain (X)HTML, a language that lacks structured semantic information. For this reason, much of the effort in this area has been directed to the development of techniques for URL extraction. This field has produced good results, which modern search engines implement. In contrast, extracting information from a single webpage has produced poor results or very limited tools. In this work we define a novel technique for information extraction from single webpages or collections of interconnected webpages. The technique uses DOM distances to retrieve information, which allows it to work with any webpage and, thus, to retrieve information online. Our implementation and experiments demonstrate the usefulness of the technique.	es_ES
dc.format.extent	13	es_ES
dc.language	English	es_ES
dc.publisher	Springer Verlag (Germany)	es_ES
dc.relation.ispartof	Computational Linguistics and Intelligent Text Processing	es_ES
dc.relation.ispartofseries	Lecture Notes in Computer Science;7182
dc.rights	All rights reserved	es_ES
dc.subject.classification	COMPUTER LANGUAGES AND SYSTEMS	es_ES
dc.title	Information extraction from Webpages based on DOM distances	es_ES
dc.type	Book chapter	es_ES
dc.embargo.lift	10000-01-01
dc.embargo.terms	forever	es_ES
dc.identifier.doi	10.1007/978-3-642-28601-8_16
dc.rights.accessRights	Open	es_ES
dc.contributor.affiliation	Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació	es_ES
dc.description.bibliographicCitation	Castillo, C.; Valero Llinares, H.; Guadalupe Ramos, J.; Silva Galiana, JF. (2012). Information extraction from Webpages based on DOM distances. In Computational Linguistics and Intelligent Text Processing. Springer Verlag (Germany). 181-193. doi:10.1007/978-3-642-28601-8_16	es_ES
dc.description.accrualMethod	S	es_ES
dc.relation.conferencename	13th International Conference, CICLing 2012	es_ES
dc.relation.conferencedate	March 11-17, 2012	es_ES
dc.relation.conferenceplace	New Delhi, India	es_ES
dc.relation.publisherversion	http://link.springer.com/chapter/10.1007%2F978-3-642-28601-8_16	es_ES
dc.description.upvformatpinicio	181	es_ES
dc.description.upvformatpfin	193	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	es_ES
dc.relation.senia	214743
dc.description.references	Dalvi, B., Cohen, W.W., Callan, J.: Websets: Extracting sets of entities from the web using unsupervised information extraction. Technical report, Carnegie Mellon School of Computer Science (2011)	es_ES
dc.description.references	Kushmerick, N., Weld, D.S., Doorenbos, R.: Wrapper induction for information extraction. In: Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI 1997) (1997)	es_ES
dc.description.references	Cohen, W.W., Hurst, M., Jensen, L.S.: A flexible learning system for wrapping tables and lists in html documents. In: Proceedings of the international World Wide Web conference (WWW 2002), pp. 232–241 (2002)	es_ES
dc.description.references	Lee, P.Y., Hui, S.C., Fong, A.C.M.: Neural networks for web content filtering. IEEE Intelligent Systems 17(5), 48–57 (2002)	es_ES
dc.description.references	Anti-Porn Parental Controls Software. Porn Filtering (March 2010), http://www.tueagles.com/anti-porn/	es_ES
dc.description.references	Kang, B.-Y., Kim, H.-G.: Web page filtering for domain ontology with the context of concept. IEICE - Trans. Inf. Syst. E90, D859–D862 (2007)	es_ES
dc.description.references	Henzinger, M.: The Past, Present and Future of Web Information Retrieval. In: Proceedings of the 23rd ACM Symposium on Principles of Database Systems (2004)	es_ES
dc.description.references	W3C Consortium. Resource Description Framework (RDF), www.w3.org/RDF	es_ES
dc.description.references	W3C Consortium. Web Ontology Language (OWL), www.w3.org/2004/OWL	es_ES
dc.description.references	Microformats.org. The Official Microformats Site (2009), http://microformats.org	es_ES
dc.description.references	Khare, R., Çelik, T.: Microformats: a Pragmatic Path to the Semantic Web. In: Proceedings of the 15th International Conference on World Wide Web, pp. 865–866 (2006)	es_ES
dc.description.references	Khare, R.: Microformats: The Next (Small) Thing on the Semantic Web? IEEE Internet Computing 10(1), 68–75 (2006)	es_ES
dc.description.references	Gupta, S., et al.: Automating Content Extraction of HTML Documents. World Wide Web 8(2), 179–224 (2005)	es_ES
dc.description.references	Li, P., Liu, M., Lin, Y., Lai, Y.: Accelerating Web Content Filtering by the Early Decision Algorithm. IEICE Transactions on Information and Systems E91-D, 251–257 (2008)	es_ES
dc.description.references	W3C Consortium, Document Object Model (DOM), www.w3.org/DOM	es_ES
dc.description.references	Baeza-Yates, R., Castillo, C.: Crawling the Infinite Web: Five Levels Are Enough. In: Leonardi, S. (ed.) WAW 2004. LNCS, vol. 3243, pp. 156–167. Springer, Heidelberg (2004)	es_ES
dc.description.references	Micarelli, A., Gasparetti, F.: Adaptive Focused Crawling. In: The Adaptive Web, pp. 231–262 (2007)	es_ES
dc.description.references	Nielsen, J.: Designing Web Usability: The Practice of Simplicity. New Riders Publishing, Indianapolis (2010) ISBN 1-56205-810-X	es_ES
dc.description.references	Zhang, J.: Visualization for Information Retrieval. The Information Retrieval Series. Springer, Heidelberg (2007) ISBN 3-54075-1475	es_ES
dc.description.references	Hearst, M.A.: TileBars: Visualization of Term Distribution Information. In: Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, Denver, CO, pp. 59–66 (May 1995)	es_ES
dc.description.references	Gottron, T.: Evaluating Content Extraction on HTML Documents. In: Proceedings of the 2nd International Conference on Internet Technologies and Applications, pp. 123–132 (2007)	es_ES
dc.description.references	Apache Foundation. The Apache crawler Nutch (2010), http://nutch.apache.org	es_ES

This item appears in the following collection(s)

Articles, conferences, monographs [48357]

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia
