The impact of data preparation techniques on house price prediction task

Abstract

Accurate house price prediction is critical for decision-making in the real estate sector, where datasets are often affected by missing values, outliers, and skewed distributions. This study investigates the impact of various data preprocessing techniques on the performance of the XGBoost algorithm for house price prediction. Using a real estate dataset from Kaggle, methods such as missing value imputation, categorical encoding, log transformation, and dimensionality reduction are analyzed and compared. The results show that preprocessing significantly improves model performance, with certain approaches markedly reducing prediction errors and improving efficiency. Advanced methods, such as PCA combined with normalization and log transformation, produced the best results, underscoring the importance of choosing effective preprocessing steps. The study provides practical guidance on using data preprocessing to improve machine learning models, offering insights particularly relevant to real estate price prediction and other structured-data applications.
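
To make these steps concrete, the sketch below shows one way such a preprocessing pipeline could be assembled with scikit-learn and XGBoost: imputation of missing values, one-hot categorical encoding, standardization followed by PCA, and a log-transformed target. The column names, file path, and hyperparameters are hypothetical illustrations, not the exact configuration reported in the paper.

```python
# Illustrative sketch only: column names, file path, and hyperparameters are hypothetical.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBRegressor

# Hypothetical feature lists; the actual Kaggle dataset columns may differ.
numeric_cols = ["area_sqm", "n_rooms", "year_built"]
categorical_cols = ["city", "property_type"]

df = pd.read_csv("real_estate.csv")      # placeholder path
X = df[numeric_cols + categorical_cols]
y = np.log1p(df["price"])                # log transformation of the skewed target

preprocess = ColumnTransformer([
    # Numeric features: impute missing values, normalize, then reduce dimensionality with PCA
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=0.95)),
    ]), numeric_cols),
    # Categorical features: impute with the mode, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("prep", preprocess),
    ("xgb", XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
pred_price = np.expm1(model.predict(X_test))   # invert the log transform to get prices
```

Wrapping the preprocessing and the regressor in a single pipeline ensures that imputation, scaling, PCA, and encoding are fitted only on the training split, avoiding leakage when the model is evaluated on held-out data.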

Author biography

Ihcene Zitoune, Kazan Federal University

Second-year PhD student, Department of Data Analysis and Programming Technologies, Kazan Federal University

Published
2025-05-12
How to Cite
Zitoune, I. (2025). The impact of data preparation techniques on house price prediction task. Вестник ВГУ. Серия: Системный анализ и информационные технологии, (1), 133-142. https://doi.org/10.17308/sait/1995-5499/2025/1/133-142
Section
Intelligent systems, data analysis and machine learning