PCA Algorithms in the Visualization of Big Data from Polish Digital Libraries

Visualizing large data sets from Polish digital libraries requires the careful preparation of a comprehensive, consolidated data set. Differences in how digital resources are organized, together with other factors that make distributed data and metadata heterogeneous, call for the use of clustering algorithms. To this end, the authors applied principal component analysis (PCA) and compared its results with those of k-means. PCA provides efficient dimensionality reduction for multidimensional data but is highly sensitive to outlying values and to differences in the underlying stochastic distributions. To suppress noise in the input data, a deterministic model in the form of the Langevin function was applied first. This leads to a “flattening” of the distribution of factors influencing the data structure. With this approach, the categories most relevant to information systems were identified and Polish digital libraries were visualized.
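The preprocessing idea can be illustrated with a minimal sketch. It assumes that the Langevin function L(x) = coth(x) - 1/x is applied element-wise to standardized features before PCA; the abstract does not state how the authors applied it, so this is an illustration, not their implementation. The helper names and the two-column sample data (an object count and a collection count per library) are hypothetical, and the leading principal axis is found by plain power iteration rather than by a library routine.

```python
import math


def langevin(x):
    """Langevin function L(x) = coth(x) - 1/x, which maps the reals into (-1, 1)."""
    if abs(x) < 1e-4:
        return x / 3.0  # leading Taylor term; avoids the 0/0 form at x = 0
    return 1.0 / math.tanh(x) - 1.0 / x


def standardize(rows):
    """Centre every column and scale it to unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by 0.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def first_component(rows, iters=200):
    """Leading principal axis, by power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centred) / n for j in range(d)]
           for i in range(d)]
    # Uniform start vector; it would fail only if it were exactly
    # orthogonal to the leading eigenvector.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


# Made-up sample: the fourth row is an extreme outlier (assumption for illustration).
raw = [[120.0, 3.0], [150.0, 4.0], [130.0, 3.0], [9000.0, 40.0], [140.0, 5.0]]

# "Flatten" the standardized values so that no single row dominates the variance.
flattened = [[langevin(v) for v in row] for row in standardize(raw)]
axis = first_component(flattened)
```

Because the Langevin function is bounded, every transformed value lies strictly between -1 and 1, so the outlier row can no longer dominate the covariance matrix that PCA diagonalizes. A k-means comparison would then be run on the same transformed rows.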