A method for determining an object from a limited sample based on a fuzzy description in natural language

Authors: Bryanskaya E.V.
Published in issue: #1(78)/2023
DOI: 10.18698/2541-8009-2023-1-856
Category: Informatics, Computer Engineering and Control | Chapter: System Analysis, Control, and Information Processing, Statistics
Keywords: natural language, natural language text processing, ontology, “bag-of-words”, vectorization, TF-IDF, fuzzy duplicates, cosine similarity, semantic network, syntactic graph
Published: 15.02.2023

The article addresses the problem of identifying an object from a limited sample given a fuzzy description in Russian. The proposed method combines two main approaches to typical problems in this area: one based on a statistical algorithm, the other on a semantic network. Each requires its own ontology. The knowledge base for the first stage is built with an adapted TF-IDF method; the knowledge base for the second stage is built from a set of syntactic graphs. Cosine similarity is used to find fuzzy duplicates between the user's query and the knowledge base prepared in advance. The paper investigates how the sample size affects the similarity measure and the accuracy of object identification. It also evaluates the proportion of queries passed to the second stage, in particular to determine what percentage of them result from an incorrect assumption made at the first stage.
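
The first, statistical stage described above can be illustrated with a minimal sketch. It is not the author's adapted TF-IDF: it assumes plain TF-IDF with smoothed IDF, whitespace tokenization without lemmatization, and a toy English knowledge base instead of Russian descriptions. The names `knowledge_base`, `build_tfidf`, `vectorize`, `cosine`, and `best_match` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace split; a real system for Russian
    # would need lemmatization and stop-word handling.
    return text.lower().split()


def vectorize(tokens, idf):
    # Sparse TF-IDF vector; terms unknown to the knowledge base are ignored.
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: (count / total) * idf[term] for term, count in counts.items()}


def build_tfidf(docs):
    # docs maps an object name to its textual description.
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    # Smoothed IDF keeps a positive weight for terms found in every document.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    vectors = {name: vectorize(tokens, idf) for name, tokens in tokenized.items()}
    return idf, vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, idf, vectors):
    # Return the knowledge-base object most similar to the query.
    query_vec = vectorize(tokenize(query), idf)
    scores = {name: cosine(query_vec, vec) for name, vec in vectors.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical knowledge base, invented for this example only.
knowledge_base = {
    "piston": "cylindrical part that moves up and down inside the cylinder",
    "crankshaft": "shaft that converts reciprocating motion into rotation",
    "spark plug": "device that ignites the fuel mixture with an electric spark",
}

idf, vectors = build_tfidf(knowledge_base)
name, score = best_match("the thing that makes a spark to ignite fuel", idf, vectors)
print(name, round(score, 3))
```

In this sketch the query is matched purely on shared word forms, so "ignite" does not match "ignites". Gaps of this kind are one reason a purely statistical stage may need to be complemented by a second, semantic stage such as the one the article proposes.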
