
Proceedings of the Southwest State University. Series: IT Management, Computer Science, Computer Engineering. Medical Equipment Engineering


Algorithm for intelligent processing of text information

https://doi.org/10.21869/2223-1536-2024-14-3-22-35

Abstract

The purpose of research. The goal of the research is to develop an intelligent processing algorithm for classifying text information. Because the volume of information grows every day, significant content must be separated from unimportant content quickly and efficiently, which makes the development of such an algorithm an urgent task.

Methods. A method is proposed for classifying text information presented in one or more natural languages. It comprises five key stages: receiving a task, accumulating tasks in a queue, processing a task, generating the processing result, and outputting the result. An input task arrives as an HTTP request whose body contains a file object. If tasks arrive faster than they can be processed, they accumulate in the queue. The active task is selected on a first-in, first-out (FIFO) basis and then processed: the received data is decoded into a string using UTF-8 encoding, and the string is searched for patterns in order to categorize it. Once categorization is complete, the result for the selected task is generated, and from it a response to the original HTTP request is formed, the body of which contains the list of categories found.
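The stages described above can be sketched in a few lines of standard C++. This is a minimal illustration under stated assumptions, not the authors' Qt implementation: the names Category, Task, TaskQueue, categorize, and toLowerAscii are hypothetical, HTTP handling is omitted, the pattern search is reduced to case-insensitive (ASCII only) substring matching, and the request body is assumed to already be valid UTF-8.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical category definition: a name plus the keywords that signal it.
struct Category {
    std::string name;
    std::vector<std::string> keywords;
};

// Hypothetical task: the raw bytes taken from the body of an HTTP request.
struct Task {
    std::vector<unsigned char> body;
};

// Lowercases ASCII letters only; bytes of multi-byte UTF-8 sequences are left untouched.
std::string toLowerAscii(std::string s) {
    for (char& c : s) {
        if (c >= 'A' && c <= 'Z') {
            c = static_cast<char>(c - 'A' + 'a');
        }
    }
    return s;
}

// Searches the decoded string for keyword patterns and returns the names of
// all categories with at least one match, in the order the categories are given.
std::vector<std::string> categorize(const std::string& text,
                                    const std::vector<Category>& categories) {
    const std::string haystack = toLowerAscii(text);
    std::vector<std::string> found;
    for (const Category& category : categories) {
        for (const std::string& keyword : category.keywords) {
            if (haystack.find(toLowerAscii(keyword)) != std::string::npos) {
                found.push_back(category.name);
                break;  // one hit is enough to assign the category
            }
        }
    }
    return found;
}

// Accumulates tasks and processes them in first-in, first-out order.
class TaskQueue {
public:
    void submit(Task task) { pending_.push(std::move(task)); }

    bool empty() const { return pending_.empty(); }

    // Takes the oldest task, treats its bytes as a UTF-8 string (no validation
    // is performed in this sketch), and returns the list of categories found.
    std::vector<std::string> processNext(const std::vector<Category>& categories) {
        assert(!pending_.empty());
        Task task = std::move(pending_.front());
        pending_.pop();
        const std::string text(task.body.begin(), task.body.end());
        return categorize(text, categories);
    }

private:
    std::queue<Task> pending_;
};
```

Submitting two tasks and calling processNext twice returns the categories of the first submitted task first, which is the FIFO behavior the method describes. In a real service the returned list would be serialized into the body of the HTTP response.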

Results. A method and an algorithm for processing text data have been developed to determine the topics present in an input data set. The software implementation of the algorithm works with text data in various languages.

Conclusion. The text data classification algorithm was implemented in the C++ programming language using the Qt libraries, version 5.11. This implementation showed a throughput of 1-5 MB per second on a homogeneous input text data set. The algorithm also correctly processes files with damaged formats.
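A throughput figure of this kind is the number of bytes processed divided by the elapsed time. The helper below is a hedged sketch of that arithmetic only; the function name throughputMBps is hypothetical, and the convention 1 MB = 1,000,000 bytes is an assumption, since the abstract does not say which definition of a megabyte was used.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Converts a processed byte count and an elapsed time into megabytes per second.
// Assumption: 1 MB = 1,000,000 bytes (the source does not specify the unit convention).
double throughputMBps(std::size_t bytes, std::chrono::duration<double> elapsed) {
    assert(elapsed.count() > 0.0);  // a zero or negative interval has no meaningful rate
    return static_cast<double>(bytes) / 1e6 / elapsed.count();
}
```

For instance, 3,000,000 bytes processed in 2 seconds corresponds to 1.5 MB per second, which falls inside the reported range. The elapsed time could be taken from two readings of std::chrono::steady_clock around the processing call.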

About the Authors

S. V. Efanov
Southwest State University
Russian Federation

Sergei V. Efanov, Post-Graduate Student

50 Let Oktyabrya Str. 94, Kursk 305040



E. N. Ivanova
Southwest State University
Russian Federation

Elena N. Ivanova, Candidate of Sciences (Engineering), Associate Professor of the Department of Computer Science

50 Let Oktyabrya Str. 94, Kursk 305040



I. E. Chernetskaya
Southwest State University
Russian Federation

Irina E. Chernetskaya, Doctor of Sciences (Engineering), Head of the Department of Computer Science

50 Let Oktyabrya Str. 94, Kursk 305040





For citations:


Efanov S.V., Ivanova E.N., Chernetskaya I.E. Algorithm for intelligent processing of text information. Proceedings of the Southwest State University. Series: IT Management, Computer Science, Computer Engineering. Medical Equipment Engineering. 2024;14(3):22-35. (In Russ.) https://doi.org/10.21869/2223-1536-2024-14-3-22-35



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2223-1536 (Print)