The impact of data preparation techniques on house price prediction task

Authors

DOI:

https://doi.org/10.17308/sait/1995-5499/2025/1/133-142

Keywords:

real estate price prediction, feature engineering, dimensionality reduction, PCA, autoencoders, one-hot encoding, handling outliers, target encoding

Abstract

Accurate house price prediction is considered critical for decision-making in the real estate sector, where datasets are often characterized by missing values, outliers, and skewed distributions. In this study, the impact of various data preprocessing techniques on the performance of the XGBoost algorithm for predicting house prices is investigated. A real estate dataset from Kaggle is used to analyze and compare methods such as missing value imputation, categorical encoding, log transformation, and dimensionality reduction. The results show that preprocessing techniques significantly improve model performance, with certain approaches greatly reducing prediction errors and improving efficiency. Advanced methods, such as PCA with normalization and log transformation, produced the best results, showing the importance of choosing effective preprocessing steps. This study provides practical guidance for using data preprocessing to improve machine learning models, offering insights particularly relevant to real estate price prediction and other structured data applications.
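
As a rough illustration of three of the preprocessing steps named in the abstract (missing value imputation, categorical encoding, and log transformation), the following minimal sketch uses only the Python standard library. It is not the study's pipeline: the toy data, the column names (`areas`, `districts`, `prices`), the helper functions, and the choice of median imputation and one-hot encoding are all assumptions made for illustration.

```python
import math
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def one_hot(categories):
    """Encode category labels as 0/1 indicator rows, one column per level."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, rows


def log_transform(prices):
    """Compress a right-skewed target with log(1 + x)."""
    return [math.log1p(p) for p in prices]


def inverse_log_transform(log_prices):
    """Map predictions made on the log scale back to the price scale."""
    return [math.expm1(v) for v in log_prices]


# Hypothetical toy data (not taken from the Kaggle dataset used in the study).
areas = [50.0, None, 120.0, 80.0]
districts = ["north", "south", "north", "east"]
prices = [100_000.0, 150_000.0, 900_000.0, 200_000.0]

areas_filled = impute_median(areas)
levels, district_rows = one_hot(districts)
log_prices = log_transform(prices)
restored = inverse_log_transform(log_prices)
```

In a setup like this, a model would be trained on the log-scaled target, and its predictions passed through `inverse_log_transform` before errors are reported in the original price units.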

Author Biography

  • Ihcene Zitoune, Kazan Federal University

    Second-year PhD student, Department of Data Analysis and Programming Technologies, Kazan Federal University

Published

2025-05-12

Section

Intelligent Information Systems, Data Analysis and Machine Learning

How to Cite

The impact of data preparation techniques on house price prediction task. (2025). Proceedings of Voronezh State University. Series: Systems Analysis and Information Technologies, 1, 133-142. https://doi.org/10.17308/sait/1995-5499/2025/1/133-142