dc.contributor.author | Castillo, Carlos | es_ES |
dc.contributor.author | Valero Llinares, Héctor | es_ES |
dc.contributor.author | Guadalupe Ramos, José | es_ES |
dc.contributor.author | Silva Galiana, Josep Francesc | es_ES |
dc.date.accessioned | 2014-02-24T07:29:47Z | |
dc.date.issued | 2012 | |
dc.identifier.isbn | 978-3-642-28600-1 | |
dc.identifier.issn | 0302-9743 | |
dc.identifier.uri | http://hdl.handle.net/10251/35896 | |
dc.description.abstract | Retrieving information from the Internet is a difficult task, as demonstrated by the lack of real-time tools able to extract information from webpages. The main cause is that most webpages on the Internet are implemented in plain (X)HTML, a language that lacks structured semantic information. For this reason, much of the effort in this area has been directed to the development of techniques for URL extraction. This field has produced good results, which modern search engines implement. In contrast, extracting information from a single webpage has produced poor results or very limited tools. In this work we define a novel technique for information extraction from single webpages or collections of interconnected webpages. This technique is based on DOM distances to retrieve information. This allows the technique to work with any webpage and, thus, to retrieve information online. Our implementation and experiments demonstrate the usefulness of the technique. | es_ES |
dc.format.extent | 13 | es_ES |
dc.language | English | es_ES |
dc.publisher | Springer Verlag (Germany) | es_ES |
dc.relation.ispartof | Computational Linguistics and Intelligent Text Processing | es_ES |
dc.relation.ispartofseries | Lecture Notes in Computer Science;7182 | |
dc.rights | All rights reserved | es_ES |
dc.subject.classification | COMPUTER LANGUAGES AND SYSTEMS | es_ES |
dc.title | Information extraction from Webpages based on DOM distances | es_ES |
dc.type | Book chapter | es_ES |
dc.embargo.lift | 10000-01-01 | |
dc.embargo.terms | forever | es_ES |
dc.identifier.doi | 10.1007/978-3-642-28601-8_16 | |
dc.rights.accessRights | Open | es_ES |
dc.contributor.affiliation | Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació | es_ES |
dc.description.bibliographicCitation | Castillo, C.; Valero Llinares, H.; Guadalupe Ramos, J.; Silva Galiana, JF. (2012). Information extraction from Webpages based on DOM distances. In Computational Linguistics and Intelligent Text Processing. Springer Verlag (Germany). 181-193. doi:10.1007/978-3-642-28601-8_16 | es_ES |
dc.description.accrualMethod | S | es_ES |
dc.relation.conferencename | 13th International Conference, CICLing 2012 | es_ES |
dc.relation.conferencedate | March 11-17, 2012 | es_ES |
dc.relation.conferenceplace | New Delhi, India | es_ES |
dc.relation.publisherversion | http://link.springer.com/chapter/10.1007%2F978-3-642-28601-8_16 | es_ES |
dc.description.upvformatpinicio | 181 | es_ES |
dc.description.upvformatpfin | 193 | es_ES |
dc.type.version | info:eu-repo/semantics/publishedVersion | es_ES |
dc.relation.senia | 214743 | |
dc.description.references | Dalvi, B., Cohen, W.W., Callan, J.: Websets: Extracting sets of entities from the web using unsupervised information extraction. Technical report, Carnegie Mellon School of Computer Science (2011) | es_ES |
dc.description.references | Kushmerick, N., Weld, D.S., Doorenbos, R.: Wrapper induction for information extraction. In: Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI 1997) (1997) | es_ES |
dc.description.references | Cohen, W.W., Hurst, M., Jensen, L.S.: A flexible learning system for wrapping tables and lists in html documents. In: Proceedings of the international World Wide Web conference (WWW 2002), pp. 232–241 (2002) | es_ES |
dc.description.references | Lee, P.Y., Hui, S.C., Fong, A.C.M.: Neural networks for web content filtering. IEEE Intelligent Systems 17(5), 48–57 (2002) | es_ES |
dc.description.references | Anti-Porn Parental Controls Software. Porn Filtering (March 2010), http://www.tueagles.com/anti-porn/ | es_ES |
dc.description.references | Kang, B.-Y., Kim, H.-G.: Web page filtering for domain ontology with the context of concept. IEICE - Trans. Inf. Syst. E90, D859–D862 (2007) | es_ES |
dc.description.references | Henzinger, M.: The Past, Present and Future of Web Information Retrieval. In: Proceedings of the 23rd ACM Symposium on Principles of Database Systems (2004) | es_ES |
dc.description.references | W3C Consortium. Resource Description Framework (RDF), www.w3.org/RDF | es_ES |
dc.description.references | W3C Consortium. Web Ontology Language (OWL), www.w3.org/2004/OWL | es_ES |
dc.description.references | Microformats.org. The Official Microformats Site (2009), http://microformats.org | es_ES |
dc.description.references | Khare, R., Çelik, T.: Microformats: a Pragmatic Path to the Semantic Web. In: Proceedings of the 15th International Conference on World Wide Web, pp. 865–866 (2006) | es_ES |
dc.description.references | Khare, R.: Microformats: The Next (Small) Thing on the Semantic Web? IEEE Internet Computing 10(1), 68–75 (2006) | es_ES |
dc.description.references | Gupta, S., et al.: Automating Content Extraction of HTML Documents. World Wide Web 8(2), 179–224 (2005) | es_ES |
dc.description.references | Li, P., Liu, M., Lin, Y., Lai, Y.: Accelerating Web Content Filtering by the Early Decision Algorithm. IEICE Transactions on Information and Systems E91-D, 251–257 (2008) | es_ES |
dc.description.references | W3C Consortium, Document Object Model (DOM), www.w3.org/DOM | es_ES |
dc.description.references | Baeza-Yates, R., Castillo, C.: Crawling the Infinite Web: Five Levels Are Enough. In: Leonardi, S. (ed.) WAW 2004. LNCS, vol. 3243, pp. 156–167. Springer, Heidelberg (2004) | es_ES |
dc.description.references | Micarelli, A., Gasparetti, F.: Adaptive Focused Crawling. In: The Adaptive Web, pp. 231–262 (2007) | es_ES |
dc.description.references | Nielsen, J.: Designing Web Usability: The Practice of Simplicity. New Riders Publishing, Indianapolis (2010) ISBN 1-56205-810-X | es_ES |
dc.description.references | Zhang, J.: Visualization for Information Retrieval. The Information Retrieval Series. Springer, Heidelberg (2007) ISBN 3-54075-1475 | es_ES |
dc.description.references | Hearst, M.A.: TileBars: Visualization of Term Distribution Information. In: Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, Denver, CO, pp. 59–66 (May 1995) | es_ES |
dc.description.references | Gottron, T.: Evaluating Content Extraction on HTML Documents. In: Proceedings of the 2nd International Conference on Internet Technologies and Applications, pp. 123–132 (2007) | es_ES |
dc.description.references | Apache Foundation. The Apache crawler Nutch (2010), http://nutch.apache.org | es_ES |