Methods of data cleaning for forecasting investments in education

DOI: 10.31673/2412-9070.2025.061205

Authors

  • Т. О. Бажан (Bazhan T.), State University of Information and Communication Technologies, Kyiv
  • В. Ф. Криворучко (Kryvoruchko V.), State University of Information and Communication Technologies, Kyiv

Abstract

This article examines the role of data cleaning in forecasting investments in the educational sector, emphasizing that the quality of input data is crucial for the accuracy and reliability of machine learning predictive models. Poor-quality data leads to distorted patterns and, consequently, to erroneous investment forecasts, which can negatively affect the allocation of financial resources and the development of the educational system. The specific nature of educational data, its diversity, and its susceptibility to errors make thorough cleaning especially necessary.
Based on an analysis of existing literature, it was found that while there is a significant body of research on general data cleaning methods and the application of machine learning in education, there is a lack of focused studies that specifically investigate the effectiveness of various data cleaning methods for improving the accuracy of investment forecasting in the educational field. This underscores the scientific novelty and relevance of the conducted research.
The aim of the study is to develop and substantiate an effective data cleaning method aimed at improving the accuracy of forecasting investments in education. To achieve this goal, a number of tasks were set, including analyzing existing methods, conducting a comparative analysis of their effectiveness, identifying the most suitable approaches, developing possible enhancements, creating a block diagram of the proposed method, and formulating practical recommendations.
The paper thoroughly examines the fundamental stage of data cleaning in the machine learning pipeline, which precedes the creation and training of models. A comparative analysis of key data cleaning methods is presented, including handling missing values (row/column deletion, mean/median/mode imputation, predictive imputation), outlier detection and treatment (visualization, statistical methods, machine learning algorithms, outlier transformation), duplicate removal, error and inconsistency correction (spell check/format validation, source reconciliation, rule-based validation), as well as data scaling and normalization (Min-Max Scaling, StandardScaler) and data type conversion.
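A few of the method families listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the records, the field layout, and the choice of median imputation, IQR-based clipping (with the `inclusive` quantile method), and Min-Max scaling are assumptions made only for illustration, using the Python standard library.

```python
from statistics import median, quantiles

def drop_duplicates(rows):
    """Remove exact duplicate rows, keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    m = median(observed)
    return [m if v is None else v for v in values]

def clip_outliers_iqr(values, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Hypothetical records: (institution, year, investment amount)
records = [
    ("School A", 2021, 120.0),
    ("School B", 2021, None),    # missing value
    ("School A", 2021, 120.0),   # exact duplicate
    ("School C", 2021, 135.0),
    ("School D", 2021, 110.0),
    ("School E", 2021, 5000.0),  # suspicious outlier
]

unique = drop_duplicates(records)
amounts = impute_median([r[2] for r in unique])
clipped = clip_outliers_iqr(amounts)
scaled = min_max_scale(clipped)
```

The order of the steps matters in this sketch: duplicates are removed before imputation so that repeated rows do not bias the median, and outliers are clipped before scaling so that a single extreme value does not compress the remaining values toward zero.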
A selection of the most suitable cleaning methods for forecasting investments in education is proposed, taking into account the specific nature of educational data. These include comprehensive missing value handling, robust outlier detection and treatment, thorough duplicate detection and elimination, strict rule-based data validation using domain knowledge, and format harmonization and data type conversion. Opportunities for improving data cleaning methods are also discussed: the development of hybrid approaches, consideration of the context of educational data, automation of the cleaning process using machine learning, creation of interactive tools, and evaluation of the impact of cleaning methods on forecast quality.
Practical recommendations for using data cleaning methods for educational investment forecasts are provided, with an emphasis on understanding the specific characteristics of educational data (its origin, hierarchical structure, temporal dependencies, categorical features, sensitivity to policy changes), comprehensive handling of missing values, robust outlier detection and treatment, specific cleaning methods for educational data (standardization of categorical features, consistency control between levels, validation based on standards), integration and reconciliation of data from different sources, and evaluation of the impact of cleaning on forecast accuracy. The importance of involving experts from the educational sector at all stages of the process is highlighted.
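The education-specific checks named above (standardization of categorical features, consistency control between levels, and rule-based validation) might look roughly like the following sketch. The label mapping, the record fields `type`, `investment`, and `students`, and the 1% tolerance are hypothetical and would in practice come from domain experts and official standards.

```python
# Hypothetical mapping of spelling variants to canonical institution types
CANONICAL_TYPES = {
    "school": "school",
    "sch.": "school",
    "secondary school": "school",
    "university": "university",
    "univ": "university",
}

def standardize_type(raw):
    """Map a free-text institution type to a canonical label, or None if unknown."""
    return CANONICAL_TYPES.get(raw.strip().lower())

def validate_record(record):
    """Return a list of rule violations found in one record."""
    problems = []
    if standardize_type(record["type"]) is None:
        problems.append("unknown institution type")
    if record["investment"] < 0:
        problems.append("negative investment")
    if record["students"] <= 0:
        problems.append("non-positive enrollment")
    return problems

def check_level_consistency(district_total, institution_amounts, tolerance=0.01):
    """Check that institution-level amounts sum to the district-level total
    within a relative tolerance."""
    difference = abs(sum(institution_amounts) - district_total)
    return difference <= tolerance * max(abs(district_total), 1.0)

# Usage on hypothetical data
bad_record = {"type": "kindergarten?", "investment": -5.0, "students": 0}
issues = validate_record(bad_record)
consistent = check_level_consistency(300.0, [100.0, 200.0])
inconsistent = check_level_consistency(300.0, [100.0, 150.0])
```

Returning a list of violations rather than raising on the first one lets a reviewer from the educational sector see every problem in a record at once and decide whether to correct or discard it.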
The conclusions state that high-quality data cleaning is critically important for building reliable predictive models in the field of educational investment. The proposed comprehensive approach, combining missing value handling and robust outlier detection and treatment methods, significantly improves the quality of input data and enhances forecast accuracy. Prospects for further research include testing the method on larger volumes of real-world data and comparing its effectiveness with other existing data cleaning approaches.

Keywords: data cleaning; data quality; machine learning; investment forecasting; educational data; predictive models.

Published

2025-12-30

Section

Articles