Projectdetail

DFG-Forschungsgruppen

Disciplines

Computer Sciences (50%); Mathematics (50%)

Keywords

Classification, Differential Privacy, High-Dimensional Data, Imbalanced Data

Abstract

Final report

In this project, so-called "classification methods" are to be developed and mathematically investigated, which address different new challenges in modern data processing. A "classification method" is a precisely defined algorithm that assigns an object to one of several possible classes based on the objects properties. These objects can, for example, be people who are assigned either to the class "healthy" or to the class "sick" based on the results of medical examinations. Another example would be the automatic assignment of pieces of music to genres such as `Rock`, `Country`, `Classic`, etc. Nowadays there are of course a large number of statistical methods for such classification problems available, which in their respective area of application also deliver very good or even in a certain sense optimal results. However, many modern classification problems, such as those mentioned above, pose particular challenges for data analysis, which are often no longer met by traditional methods. The project specifically addresses two of these new challenges. On the one hand, classification procedures are to be developed for which the original data are not available, due to considerations of data privacy protection, but only suitably anonymized data are provided. The original data for example sensitive patient data in medical research must first be preprocessed in order to ensure the anonymity of the patient. The diagnostic classification method that is subsequently used must take this anonymization step into account in a suitable manner in order to minimize the expected number of possible misclassifications. Another challenge with modern classification problems is the large amount of available data. Even though the basic principle "the more information there is about the object to be classified, the better" still holds true, it is by no means always clear how the relevant information, that ensures reliable assignment to one of the possible classes, is to be extracted from the existing flood of data. This becomes clear, for instance, from the above example of music classification, where, from the computer`s point of view, the available information is given as millions of zeros and ones. Classification methods for such large amounts of data are also confronted with the additional problem that large scale data processing is very time consuming. Conventional methods would therefore require too much time to generate a classification, which is why computationally more efficient approximate methods are to be developed and investigated in this project.

The goal of this project was to develop and mathematically analyze so-called "classification methods" that address various new challenges arising in modern data processing. A "classification method" is a precisely defined algorithm that assigns an object to one of several possible classes based on its properties. Such objects might, for example, be individuals who-based on the results of medical examinations-are assigned to either the "healthy" or "ill" class. Another example would be the automatic categorization of musical pieces into classes such as "rock," "country," "classical," etc. Two of the challenges in classification that are of great interest today concern, on the one hand, the protection of data owners' privacy (as in the example of patient data) and, on the other hand, the enormous computational effort required to make classification decisions due to the complexity and high dimensionality of the data (as in the example of musical recordings). However, as the project progressed, it soon became clear that in order to develop suitable classification methods, the mathematically simpler case of regression methods needed to be addressed first, since there was no sufficient preliminary work in this area for our purposes. Therefore, the main project work focused on the development of regression methods that account for the issues of privacy protection and computational feasibility in high-dimensional settings. For the problem of data processing under privacy protection, we succeeded-using the increasingly popular concept of differential privacy-in developing a comprehensive statistical theory of optimal estimation methods, along with numerical techniques for their practical computation. The idea of differential privacy is to modify potentially sensitive raw data using an appropriate random mechanism before sharing or publishing them, so that the data no longer reveal information about any particular individual but still, in aggregate, allow valid conclusions about the population. Here, we managed to identify the random mechanism that is optimal under the strong criterion of statistical efficiency, and we developed an algorithm for its practical implementation. For the problem of time-consuming computation of regression methods in high-dimensional data, we succeeded in developing an algorithm based on a slight modification of the popular gradient method. It can be shown mathematically that this algorithm is not only faster to execute than the standard statistical methods currently in use, but also yields, in a strong sense, the best possible predictions. In a follow-up project, the insights gained and the methods developed will be extended to the classification setting.

Research institution(s)

Universität Wien - 100%

International project participants

Cristina Butucea, ENSAE - France
Angelika Rohde, Universität Freiburg - Germany

Research Output

8 Publications

Publications

Title	Efficiency in local differential privacy
DOI	10.1214/24-aos2425
Type	Journal Article
Author	Steinberger L
Journal	The Annals of Statistics

Title	Efficient Estimation of a Gaussian Mean with Local Differential Privacy
DOI	10.48550/arxiv.2402.04840
Type	Preprint
Author	Kalinin N
Link	Publication

Title	Comparing regularisation paths of (conjugate) gradient estimators in ridge regression
DOI	10.48550/arxiv.2503.05542
Type	Preprint
Author	Hucker L
Link	Publication

Title	Implicit vs. explicit regularization for high-dimensional gradient descent
DOI	10.48550/arxiv.2502.10578
Type	Preprint
Author	Stark T
Link	Publication

Title	Robust signal recovery in Hadamard spaces
DOI	10.1016/j.jmva.2025.105469
Type	Journal Article
Author	Köstenberger G
Journal	Journal of Multivariate Analysis

Title	Conditional predictive inference for stable algorithms
DOI	10.1214/22-aos2250
Type	Journal Article
Author	Leeb H
Journal	The Annals of Statistics

Title	Interactive versus noninteractive locally differentially private estimation: Two elbows for the quadratic functional
DOI	10.1214/22-aos2254
Type	Journal Article
Author	Butucea C
Journal	The Annals of Statistics

Title	Uncertainty quantification via cross-validation and its variants under algorithmic stability
DOI	10.48550/arxiv.2312.14596
Type	Preprint
Author	Amann N
Link	Publication

Go to overview page Discover

Go to overview page Funding

Go to overview page About Us

Go to overview page News

Classification – Preprocessed and high-dimensional data sets

Classification – Preprocessed and high-dimensional data sets

Disciplines

Keywords

Research Output

Contact

General information

Go to overview page Discover

Go to overview page Funding

Go to overview page About Us

Go to overview page News

SOCIAL MEDIA

SCILOG

Classification – Preprocessed and high-dimensional data sets

Classification – Preprocessed and high-dimensional data sets

Disciplines

Keywords

Research Output