Multi-criteria recognition of text-to-topic correspondence based on the TF-IDF algorithm

DOI: 10.31673/2412-9070.2025.027728

Authors

  • В. М. Данильченко (Danylchenko V. M.), State University of Information and Communication Technologies, Kyiv
  • С. І. Отрох (Otrokh S. I.), National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute», Ukraine
  • М. О. Шалигін (Shaligin M. O.), National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute», Ukraine
  • А. Г. Донець (Donets A. G.), National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute», Ukraine

Abstract

This research examines the role of statistical methods in determining whether textual content matches a user's specific interests. As the volume of digital information keeps growing, accurate and efficient document classification has become essential across many applications. The study assesses the potential of the Term Frequency-Inverse Document Frequency (TF-IDF) metric and then implements a refined algorithm based on it. The modifications introduced aim to optimize the algorithm's performance in analyzing and categorizing diverse sets of textual data.
A comprehensive account of the program development process is provided, emphasizing the integration of a multi-criteria decision-making framework to achieve a nuanced understanding of text-to-topic relevance. This multi-faceted approach considers various linguistic features and statistical indicators beyond the basic TF-IDF scores, thereby enabling a more robust and accurate classification outcome. Furthermore, the research addresses the crucial pre-processing steps involved in handling textual data. Detailed attention is given to the methodologies of text normalization, which includes techniques such as stemming, lemmatization, and case conversion, to reduce data dimensionality and improve the consistency of feature representation.
The study also rigorously explores the application of effective filtering techniques, particularly the identification and removal of stop words, which are common words that carry minimal semantic weight and can often introduce noise into the classification process. The strategic implementation of these normalization and filtering methods is shown to significantly contribute to the overall precision and recall of the proposed text classification system.
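The normalization and filtering steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the suffix list, and the function names `naive_stem` and `preprocess` are assumptions made for illustration, and the naive suffix stripping only stands in for a real stemmer or lemmatizer.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Assumption: naive suffix stripping stands in for a real stemmer or lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on Latin letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

Calling `preprocess("The cats are running in the garden")` lowercases the text, discards "the", "are", and "in", and reduces the remaining tokens to crude stems, so fewer distinct features reach the weighting stage.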
The outcome of this research is a highly effective and adaptable solution for the identification of documents that are genuinely relevant to a user's specified thematic focus. The potential applications of this solution span a wide spectrum of information management and retrieval systems. In the context of search engines, the enhanced classification capabilities can lead to more precise and contextually appropriate search results, improving user satisfaction and information discovery. Similarly, in information filtering systems, the proposed approach can facilitate the delivery of tailored content streams, reducing information overload and enhancing user engagement.
Ultimately, this research underscores the growing significance of employing innovative and statistically sound methods for the automated filtering and categorization of textual data in the age of big data. The proposed multi-criteria TF-IDF-based algorithm offers a valuable contribution to the field, providing a practical and efficient means of navigating and extracting relevant information from the vast digital landscape. The findings highlight the potential for significant advancements in various information-intensive applications through the intelligent automation of text analysis and classification processes.
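As a rough illustration of how TF-IDF weights could feed a multi-criteria relevance decision, the Python sketch below computes standard TF-IDF weights and then applies two criteria. The abstract does not specify the authors' criteria or thresholds, so the two criteria used here (summed topic-term weight and topic-term coverage), the default thresholds, and the names `tf_idf` and `is_relevant` are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: tf-idf weight} mapping per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def is_relevant(doc_weights: dict[str, float], topic_terms: set[str],
                min_score: float = 0.05, min_coverage: float = 0.5) -> bool:
    """Hypothetical two-criteria test; both thresholds are assumptions."""
    if not topic_terms:
        return False
    present = [t for t in topic_terms if t in doc_weights]
    score = sum(doc_weights[t] for t in present)   # criterion 1: total weight
    coverage = len(present) / len(topic_terms)     # criterion 2: term coverage
    return score >= min_score and coverage >= min_coverage


docs = [
    ["network", "traffic", "router"],
    ["football", "match", "goal"],
    ["network", "goal", "score"],
]
topic = {"network", "router"}
weights = tf_idf(docs)
results = [is_relevant(w, topic) for w in weights]
```

Requiring both criteria means a document that mentions a single topic term many times does not pass on weight alone; it must also cover enough of the topic vocabulary.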

Keywords: processing; data; TF-IDF method; document classification; multi-criteria recognition; stop words; text normalization; data set; deep learning; technology.

Published

2025-07-20

Section

Articles