Proceedings of the Southwest State University. Series: IT Management, Computer Science, Computer Engineering. Medical Equipment Engineering

Distributed Algorithm for Extracting Text Information from News Sites Using Big Data Technologies

https://doi.org/10.21869/2223-1536-2021-11-4-8-25

Abstract

Purpose of the research. The purpose of this work is to develop an algorithm and a software system that extract information from news sites in a distributed mode using big data technologies.

Methods. Extracting the key concepts from news site content helps artificial intelligence algorithms investigate economic, political, and social phenomena in different contexts. This task is close to the text summarization problem, which is actively studied in current research. However, significantly fewer works address big data algorithms for text summarization. We propose a new algorithm for efficiently extracting the meaning of texts from a large number of news sites, based on the Apache Spark big data framework. The meaning of the news is analyzed with Google BERT, a modern neural network architecture for natural language processing. Groups of news items are separated from each other with the k-means clustering algorithm. The number of clusters is determined automatically using the gap statistic method. Site content is scraped in distributed mode by Chrome browsers driven by Selenium WebDriver.
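As an illustration of how the gap statistic can pick the number of k-means clusters, here is a minimal single-machine sketch in plain Python. It is not the authors' implementation: the abstract describes clustering BERT-derived representations on Apache Spark, whereas this sketch works on toy 2-D points. The names `kmeans_dispersion` and `gap_statistic`, the farthest-point initialisation, and the bounding-box uniform reference distribution are all assumptions made for the example.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_dispersion(points, k, rng, iters=50):
    """Run Lloyd's k-means once and return the sum of squared distances
    from every point to its nearest centre (hypothetical helper)."""
    # Farthest-point initialisation, starting from one random point.
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each centre as the mean of its group; keep the old
        # centre if a group is empty.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return (chosen_k, gaps), where gaps[k - 1] is Gap(k).

    Reference data sets are drawn uniformly from the bounding box of
    the input, and the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}
    is chosen.
    """
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        # The floor avoids log(0) when k reaches the number of distinct points.
        log_w = math.log(max(kmeans_dispersion(points, k, rng), 1e-12))
        ref_logs = []
        for _ in range(n_refs):
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(max(kmeans_dispersion(ref, k, rng), 1e-12)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
        gaps.append(mean_ref - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_refs))
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps
```

Calling `gap_statistic(points, k_max=4)` on a list of coordinate tuples returns the selected cluster count together with the gap values for k = 1..4. In a distributed setting the k-means step would presumably be delegated to the cluster framework rather than run in a Python loop.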

Results. The article presents the detailed algorithms of the implemented software system, along with a mathematical model and the architecture of the distributed software system.

Conclusion. The evaluation of our algorithm using the ROUGE metric demonstrates satisfactory summarization quality of news texts.
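For readers unfamiliar with ROUGE, the sketch below shows the n-gram overlap idea behind ROUGE-N. The abstract does not say which ROUGE variant or tooling the authors used, so this is only an assumed, simplified version with whitespace tokenisation, lowercasing, and no stemming. The function names `ngrams` and `rouge_n` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: clipped n-gram overlap between a candidate
    summary and a single reference summary."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}
```

A call such as `rouge_n(generated_summary, reference_summary, n=2)` returns recall, precision, and F1 for bigram overlap. Published evaluations typically rely on an established ROUGE package instead of a hand-written scorer.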

About the Authors

Y. A. Kachanov
Volgograd State Technical University
Russian Federation

Yurii A. Kachanov, Lecturer of the Department of Software of Automated Systems

 28 V. I. Lenina av., Volgograd 400005



P. D. Kravchenya
Volgograd State Technical University
Russian Federation

Pavel D. Kravchenya, Cand. of Sci. (Physical and Mathematical), Associate Professor of the Department of Electronic Computing Machines and Systems

28 V. I. Lenina av., Volgograd 400005

 



M. A. Kuznetsov
Volgograd State Technical University
Russian Federation

Mikhail A. Kuznetsov, Cand. of Sci. (Engineering), Associate Professor of the Department of Electronic Computing Machines and Systems

 28 V. I. Lenina av., Volgograd 400005



A. S. Kuznetsova
Volgograd State Technical University
Russian Federation

Agnessa S. Kuznetsova, Senior Lecturer of the Department of Software of Automated Systems

28 V. I. Lenina av., Volgograd 400005

 



V. V. Gilka
Volgograd State Technical University
Russian Federation

Vadim V. Gilka, Senior Lecturer of the Department of Software of Automated Systems

28 V. I. Lenina av., Volgograd 400005




For citations:


Kachanov Y.A., Kravchenya P.D., Kuznetsov M.A., Kuznetsova A.S., Gilka V.V. Distributed Algorithm for Extracting Text Information from News Sites Using Big Data Technologies. Proceedings of the Southwest State University. Series: IT Management, Computer Science, Computer Engineering. Medical Equipment Engineering. 2021;11(4):8-25. (In Russ.) https://doi.org/10.21869/2223-1536-2021-11-4-8-25


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2223-1536 (Print)