Towards information system development for data extraction from web
DOI:
https://doi.org/10.20998/2079-0023.2018.22.08
Keywords:
information, web search, data extraction, data source, data mining, language standards, information technology
Abstract
Today, the Internet contains a huge number of information sources that are used constantly in daily life. Information with a similar meaning is often presented in different forms on different resources (for example, electronic libraries, online stores, and news sites). In this paper, we analyze how to extract the information a user requires from a certain type of web source. We analyze the data extraction problem and identify the strengths and weaknesses of the main approaches to it. The main aspects of web knowledge extraction are formulated. Approaches and information technologies for solving syntactic analysis (parsing) problems in existing information systems are analyzed. Based on this analysis, we address the task of developing models and software components for extracting data from certain types of web resources. A conceptual model of data extraction was developed that treats the web space as an external data source. A requirements specification for the software component was created; it allows work on the project to continue with a clear understanding of the requirements and constraints for implementation. During software modeling, activity, sequence, and deployment diagrams were developed, which will then be used to build the finished software application. For further development of the software, a programming platform and the types of testing (load and unit) were defined. The results suggest that the proposed design solution, which will be implemented as a prototype of the software system, can extract data from different sources on the basis of a single semantic template.
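The idea of extracting data from different sources on the basis of a single semantic template can be illustrated with a minimal sketch. It is not the authors' design: the template fields, the source names (`store_a`, `library_b`), and the class attributes are all hypothetical, and the sketch assumes that each source marks a field with a distinct, non-nested HTML element. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Hypothetical semantic template: the fields every source should yield.
TEMPLATE = ("title", "author", "price")

# Hypothetical per-source rules: template field -> class attribute on that site.
SOURCE_RULES = {
    "store_a": {"title": "product-name", "author": "writer", "price": "cost"},
    "library_b": {"title": "book-title", "author": "book-author"},
}


class ClassTextExtractor(HTMLParser):
    """Collects the text of the first element carrying each wanted class."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field
        self.current_field = None  # field whose text is being collected
        self.record = {}

    def handle_starttag(self, tag, attrs):
        # The class attribute may be absent or valueless, hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            field = self.class_to_field.get(cls)
            if field and field not in self.record:
                self.current_field = field
                self.record[field] = ""
                break

    def handle_data(self, data):
        if self.current_field:
            self.record[self.current_field] += data

    def handle_endtag(self, tag):
        # Assumption: field elements are not nested, so any end tag closes the field.
        self.current_field = None


def extract(source, html):
    """Map one page from a known source onto the shared template."""
    rules = SOURCE_RULES[source]
    parser = ClassTextExtractor({cls: field for field, cls in rules.items()})
    parser.feed(html)
    parser.close()
    # Fields the source does not provide are reported as None.
    return {field: parser.record.get(field, "").strip() or None for field in TEMPLATE}
```

Calling `extract("store_a", page_html)` and `extract("library_b", other_html)` returns dictionaries with the same keys, so downstream code can treat records from both sources uniformly. Supporting a new source then only requires a new entry in the rule table.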
License
Copyright (c) 2018 Bulletin of National Technical University "KhPI". Series: System Analysis, Control and Information Technologies
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).