Classification – Preprocessed and high-dimensional data sets
Classification – Preprocessed and high-dimensional data sets
DFG-Forschungsgruppen
Disciplines
Computer Sciences (50%); Mathematics (50%)
Keywords
-
Classification,
Differential Privacy,
High-Dimensional Data,
Imbalanced Data
In this project, so-called "classification methods" are to be developed and mathematically investigated, which address different new challenges in modern data processing. A "classification method" is a precisely defined algorithm that assigns an object to one of several possible classes based on the objects properties. These objects can, for example, be people who are assigned either to the class "healthy" or to the class "sick" based on the results of medical examinations. Another example would be the automatic assignment of pieces of music to genres such as `Rock`, `Country`, `Classic`, etc. Nowadays there are of course a large number of statistical methods for such classification problems available, which in their respective area of application also deliver very good or even in a certain sense optimal results. However, many modern classification problems, such as those mentioned above, pose particular challenges for data analysis, which are often no longer met by traditional methods. The project specifically addresses two of these new challenges. On the one hand, classification procedures are to be developed for which the original data are not available, due to considerations of data privacy protection, but only suitably anonymized data are provided. The original data for example sensitive patient data in medical research must first be preprocessed in order to ensure the anonymity of the patient. The diagnostic classification method that is subsequently used must take this anonymization step into account in a suitable manner in order to minimize the expected number of possible misclassifications. Another challenge with modern classification problems is the large amount of available data. Even though the basic principle "the more information there is about the object to be classified, the better" still holds true, it is by no means always clear how the relevant information, that ensures reliable assignment to one of the possible classes, is to be extracted from the existing flood of data. This becomes clear, for instance, from the above example of music classification, where, from the computer`s point of view, the available information is given as millions of zeros and ones. Classification methods for such large amounts of data are also confronted with the additional problem that large scale data processing is very time consuming. Conventional methods would therefore require too much time to generate a classification, which is why computationally more efficient approximate methods are to be developed and investigated in this project.
The goal of this project was to develop and mathematically analyze so-called "classification methods" that address various new challenges arising in modern data processing. A "classification method" is a precisely defined algorithm that assigns an object to one of several possible classes based on its properties. Such objects might, for example, be individuals who-based on the results of medical examinations-are assigned to either the "healthy" or "ill" class. Another example would be the automatic categorization of musical pieces into classes such as "rock," "country," "classical," etc. Two of the challenges in classification that are of great interest today concern, on the one hand, the protection of data owners' privacy (as in the example of patient data) and, on the other hand, the enormous computational effort required to make classification decisions due to the complexity and high dimensionality of the data (as in the example of musical recordings). However, as the project progressed, it soon became clear that in order to develop suitable classification methods, the mathematically simpler case of regression methods needed to be addressed first, since there was no sufficient preliminary work in this area for our purposes. Therefore, the main project work focused on the development of regression methods that account for the issues of privacy protection and computational feasibility in high-dimensional settings. For the problem of data processing under privacy protection, we succeeded-using the increasingly popular concept of differential privacy-in developing a comprehensive statistical theory of optimal estimation methods, along with numerical techniques for their practical computation. The idea of differential privacy is to modify potentially sensitive raw data using an appropriate random mechanism before sharing or publishing them, so that the data no longer reveal information about any particular individual but still, in aggregate, allow valid conclusions about the population. Here, we managed to identify the random mechanism that is optimal under the strong criterion of statistical efficiency, and we developed an algorithm for its practical implementation. For the problem of time-consuming computation of regression methods in high-dimensional data, we succeeded in developing an algorithm based on a slight modification of the popular gradient method. It can be shown mathematically that this algorithm is not only faster to execute than the standard statistical methods currently in use, but also yields, in a strong sense, the best possible predictions. In a follow-up project, the insights gained and the methods developed will be extended to the classification setting.
- Universität Wien - 100%
- Cristina Butucea, ENSAE - France
- Angelika Rohde, Universität Freiburg - Germany
Research Output
- 11 Citations
- 8 Publications
-
2025
Title Comparing regularisation paths of (conjugate) gradient estimators in ridge regression DOI 10.48550/arxiv.2503.05542 Type Preprint Author Hucker L Link Publication -
2025
Title Implicit vs. explicit regularization for high-dimensional gradient descent DOI 10.48550/arxiv.2502.10578 Type Preprint Author Stark T Link Publication -
2025
Title Robust signal recovery in Hadamard spaces DOI 10.1016/j.jmva.2025.105469 Type Journal Article Author Köstenberger G Journal Journal of Multivariate Analysis Pages 105469 Link Publication -
2023
Title Uncertainty quantification via cross-validation and its variants under algorithmic stability DOI 10.48550/arxiv.2312.14596 Type Preprint Author Amann N Link Publication -
2023
Title Interactive versus noninteractive locally differentially private estimation: Two elbows for the quadratic functional DOI 10.1214/22-aos2254 Type Journal Article Author Butucea C Journal The Annals of Statistics -
2024
Title Efficiency in local differential privacy DOI 10.1214/24-aos2425 Type Journal Article Author Steinberger L Journal The Annals of Statistics -
2024
Title Efficient Estimation of a Gaussian Mean with Local Differential Privacy DOI 10.48550/arxiv.2402.04840 Type Preprint Author Kalinin N Link Publication -
2023
Title Conditional predictive inference for stable algorithms DOI 10.1214/22-aos2250 Type Journal Article Author Steinberger L Journal The Annals of Statistics