The impact of data preparation techniques on house price prediction task

Ihcene Zitoune

doi:10.17308/sait/1995-5499/2025/1/133-142

Authors

Ihcene Zitoune Kazan Federal University https://orcid.org/0009-0003-1641-0266

Keywords:

real estate price prediction, feature engineering, dimensionality reduction, PCA, autoencoders, one-hot encoding, handling outliers, target encoding

Abstract

Accurate house price prediction is critical for decision-making in the real estate sector, where datasets often contain missing values, outliers, and skewed distributions. This study investigates how various data preprocessing techniques affect the performance of the XGBoost algorithm for predicting house prices. Using a real estate dataset from Kaggle, it analyzes and compares methods such as missing value imputation, categorical encoding, log transformation, and dimensionality reduction. The results show that preprocessing significantly improves model performance, with certain approaches greatly reducing prediction error and improving efficiency. Advanced methods, such as PCA with normalization and log transformation, produced the best results, underscoring the importance of choosing effective preprocessing steps. The study offers practical guidance on using data preprocessing to improve machine learning models, with insights particularly relevant to real estate price prediction and other structured-data applications.
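
As a minimal illustration of one technique named in the abstract, the sketch below applies a log transformation to a right-skewed price column and compares skewness before and after. The price values are hypothetical toy data, not taken from the study's Kaggle dataset, and the `skewness` helper is an assumption introduced here for illustration; the abstract does not specify how the author implemented the transformation.

```python
import math


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


# Hypothetical sale prices (toy data, not from the paper): a few expensive
# houses create a long right tail.
prices = [95_000, 120_000, 135_000, 150_000, 160_000,
          180_000, 210_000, 250_000, 400_000, 755_000]

# log1p compresses the right tail; expm1 is its exact inverse, so model
# predictions made on the log scale can be mapped back to prices.
log_prices = [math.log1p(p) for p in prices]
restored = [math.expm1(v) for v in log_prices]

print(f"skewness of raw prices: {skewness(prices):.2f}")
print(f"skewness of log prices: {skewness(log_prices):.2f}")
```

In a setup like the one the abstract describes, a model would typically be trained on the log-scale target and its predictions passed through `math.expm1` before errors are reported in price units.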