Viscous gravitational algorithm for clustering inaccurate data
DOI:
https://doi.org/10.17308/sait.2022.1/9203
Keywords:
data clustering, imprecise data, gravity algorithm, viscosity, Pauli repulsion
Abstract
Clustering is one of the basic problems of machine learning, along with pattern recognition, classification and forecasting. It is especially important in the analysis of Big Data, which can only be processed by computer. However, the problem of automatically partitioning data into clusters while taking into account errors in the initial data still has no unambiguous solution and calls for more adequate approaches, including automatic determination of the number of clusters. The paper proposes a new method for data clustering based on a modification of the gravitational algorithm, which draws an analogy with the formation of stellar clusters through the attraction of masses under the law of universal gravitation. When this approach is applied to data clustering, real physical masses are replaced by points in a multidimensional data space, and the motion of these points under their mutual attraction leads to the formation of clusters. The disadvantage of the gravitational method is inertia, which can hinder the clustering process and lead to the ejection of accelerated particles from a cluster while it is forming. To exclude such phenomena, we use a model of viscous motion of the particles representing the data, together with a natural limit on cluster size due to the repulsion of particles. The repulsive force is modelled by an interaction of the Pauli form for fermions with the same spins, with a Gaussian distribution of the error density. The basic equations describing the steps of this modification of the gravitational algorithm are given.
A numerical example demonstrates the features and advantages of the viscous gravitational algorithm in comparison with the k-means method and the density-based DBSCAN method, including automatic termination of the procedure once the main clustering process is complete. The results allow for blind clustering of Big Data and can be generalized to multidimensional optimization problems.
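The idea of viscous (inertia-free) motion with short-range repulsion can be illustrated with a minimal Python sketch. It does not reproduce the paper's equations, which the abstract does not give. The softened inverse-square attraction, the Gaussian-shaped repulsion term, unit masses, the stopping rule and all names and parameter values (`pair_coefficient`, `viscous_step`, `relax`, `G`, `eps`, `A`, `sigma`, `gamma`, `dt`, `tol`) are assumptions chosen for illustration only.

```python
import math

def pair_coefficient(r2, G=1.0, eps=0.5, A=20.0, sigma=0.3):
    """Net pull (> 0) or push (< 0) per unit separation at squared distance r2.

    Assumed forms, not taken from the paper: softened gravity between
    unit masses minus a Gaussian-shaped short-range repulsion.
    """
    attraction = G / (r2 + eps * eps) ** 1.5
    repulsion = A * math.exp(-r2 / (2.0 * sigma * sigma))
    return attraction - repulsion

def viscous_step(points, dt=0.05, gamma=1.0):
    """One overdamped update: velocity = force / gamma, so no momentum is kept."""
    new_points = []
    for i, p in enumerate(points):
        force = [0.0] * len(p)
        for j, q in enumerate(points):
            if i == j:
                continue
            diff = [qk - pk for pk, qk in zip(p, q)]
            c = pair_coefficient(sum(d * d for d in diff))
            for k, d in enumerate(diff):
                force[k] += c * d
        new_points.append([pk + dt * fk / gamma for pk, fk in zip(p, force)])
    return new_points

def relax(points, tol=1e-6, max_steps=500, **step_kwargs):
    """Repeat viscous steps until the largest displacement drops below tol."""
    step = 0
    for step in range(1, max_steps + 1):
        moved = viscous_step(points, **step_kwargs)
        shift = max(math.dist(a, b) for a, b in zip(points, moved))
        points = moved
        if shift < tol:
            break
    return points, step

if __name__ == "__main__":
    # Two distant points drift together; two nearly coincident points are pushed apart.
    print(viscous_step([[0.0, 0.0], [4.0, 0.0]]))
    print(relax([[0.0, 0.0], [0.1, 0.0]]))
```

Because each update sets the velocity directly from the current force, a point cannot overshoot and escape a forming cluster by accumulated momentum, and the repulsion term keeps points from collapsing onto one another. Under these assumed parameters, a pair of points settles at a finite separation where the two terms balance.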