Towards information system development for data extraction from web

Yulia Mukolaivna Gontar; Kateryna Victorivna Tkach; Bohdan Oleksandrovych Yena; Artem Victorovych Vasylenko

doi:10.20998/2079-0023.2018.22.08

Authors

Yulia Mukolaivna Gontar https://orcid.org/0000-0002-3748-5086
Kateryna Victorivna Tkach https://orcid.org/0000-0001-7104-800X
Bohdan Oleksandrovych Yena https://orcid.org/0000-0003-4791-956X
Artem Victorovych Vasylenko https://orcid.org/0000-0003-3121-4856

DOI:

https://doi.org/10.20998/2079-0023.2018.22.08

Keywords:

information, web search, data extraction, data source, data mining, language standards, information technology

Abstract

Today, the Internet contains a huge number of information sources that are constantly used in daily life. Information that is similar in meaning is often presented in different forms on different resources (for example, electronic libraries, online stores, and news sites). In this paper, we analyze the extraction of information required by the user from certain types of web sources. The data extraction problem is analyzed, and the strengths and weaknesses of the main approaches to data extraction are identified. The main aspects of web knowledge extraction are formulated. Approaches and information technologies for solving syntactic analysis problems on the basis of existing information systems are analyzed. Based on this analysis, the task of developing models and software components for extracting data from certain types of web resources is addressed. A conceptual model of data extraction was developed that treats the web space as an external data source. A requirements specification for the software component was created, which allows work on the project to continue with a clear understanding of the requirements and constraints for implementation. During software modeling, activity, sequence, and deployment diagrams were developed, which will then be used to build the finished software application. For further development of the software, a programming platform and the types of testing (load and unit) were defined. The results suggest that the proposed design solution, which will be implemented as a prototype of the software system, can perform the task of extracting data from different sources on the basis of a single semantic template.
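
The idea of extracting data from different sources on the basis of a single template can be illustrated with a minimal sketch. It is not the authors' design: the template fields, the source names, the per-source rules, and the sample pages below are all hypothetical, and the matching is deliberately simple (the first element with a given tag and class, no support for nested elements of the same tag). It relies only on Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Hypothetical shared template: the fields every source must yield.
TEMPLATE = ("title", "author")

# Hypothetical per-source rules: field -> (tag, class) locating the value.
SOURCE_RULES = {
    "library": {"title": ("h1", "book-title"), "author": ("span", "writer")},
    "store": {"title": ("div", "product-name"), "author": ("p", "by")},
}


class FieldExtractor(HTMLParser):
    """Collects the text of the first element matching each (tag, class) rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.record = {}
        self.current = None  # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for field, (rule_tag, rule_class) in self.rules.items():
            if tag == rule_tag and rule_class in classes and field not in self.record:
                self.current = field
                self.record[field] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.record[self.current] += data

    def handle_endtag(self, tag):
        # Close the field when its own tag ends (nested same-name tags are not handled).
        if self.current is not None and tag == self.rules[self.current][0]:
            self.record[self.current] = self.record[self.current].strip()
            self.current = None


def extract(source, html):
    """Return one record shaped by TEMPLATE, whatever the source markup looks like."""
    parser = FieldExtractor(SOURCE_RULES[source])
    parser.feed(html)
    parser.close()
    return {field: parser.record.get(field) for field in TEMPLATE}


# Two made-up pages that present the same information in different markup.
library_page = '<h1 class="book-title">Example Title</h1><span class="writer">A. Author</span>'
store_page = '<div class="product-name">Example Title</div><p class="by">A. Author</p>'

records = [extract("library", library_page), extract("store", store_page)]
```

Both pages yield records with the same keys, so downstream code depends only on the template and not on the markup of any particular site. Supporting a new source in this sketch means adding one entry to SOURCE_RULES.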
