High-dimensional data sets in discriminant analysis
DFG Research Units
Disciplines
Computer Sciences (20%); Mathematics (80%)
Keywords
- Discriminant Analysis
- High-Dimension
- Regularization
- Gradient Descent
Despite the unprecedented success of modern artificial intelligence, the precise reasons for the effectiveness of these complex methods are still far from being fully understood. Given the widespread use and application of AI, however, a systematic understanding of the strengths, weaknesses, and safety of these technologies is of great societal interest. The aim of this project is therefore to provide a mathematically precise description and analysis of a particular type of AI, namely so-called classification algorithms, with respect to their statistical reliability and computational feasibility. In addition, statistical methods will be developed that ensure the protection of individual privacy when such algorithms are applied. A classification algorithm is a computational procedure that assigns digitized observational units (e.g., hospital patients, texts, videos) to an appropriate class (e.g., healthy or ill, spam or legitimate email, film genre). Modern classification problems are primarily characterized by their high dimensionality. This means that the observational units can be very complex digital objects such as images or videos, and that the algorithms used are themselves characterized by a large number of free parameters. This enormous complexity is very difficult to capture and analyze in a mathematically rigorous manner. For this reason, in this project we will initially focus on a particular form of high-dimensional classification algorithm, namely linear discriminant analysis. However, when the dimension becomes too high, the problem of interpolation arises, in which every available data point is assigned to its class without error. We are interested in the predictive performance of such an interpolating classifier. Classical statistical theory suggests that this performance will be rather poor and instead recommends the approach of so-called ℓ2-regularization for stabilization.
However, it is largely unknown whether this approach remains effective in the high-dimensional setting and how its computational implementation can be designed efficiently. To address this, new methodological approaches based on gradient methods will be developed and their statistical accuracy will be investigated. With regard to the privacy protection problem mentioned above, the widely discussed statistical paradigm of differential privacy (DP) is particularly relevant. In particular, the recent development of f-DP provides intuitively interpretable statements about the type of protection that can be guaranteed. However, it is largely unclear how to select an appropriate randomization strategy from the vast number of possible options. In this project, we will investigate to what extent existing methods can be extended to higher dimensions and to the new framework of f-DP.
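To make the combination of ℓ2-regularization and gradient methods concrete, the following is a minimal sketch and not the project's actual methodology. It assumes two classes labeled 0 and 1 and approximates the regularized discriminant direction w = (S + λI)⁻¹(μ₁ − μ₀) by plain gradient descent on the quadratic objective ½ wᵀ(S + λI)w − wᵀ(μ₁ − μ₀). The function names `ridge_lda_direction_gd` and `predict`, the penalty `lam`, the step-size rule, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def ridge_lda_direction_gd(X, y, lam=1.0, n_steps=500):
    """Approximate w = (S + lam*I)^{-1} (mu1 - mu0) by gradient descent.

    Hypothetical helper for illustration; assumes labels y in {0, 1}.
    """
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    d = mu1 - mu0

    # Pooled within-class sample covariance; singular when p > n - 2,
    # which is why the ridge term lam * I is added.
    Xc = np.vstack([X0 - mu0, X1 - mu1])
    S = Xc.T @ Xc / (len(X) - 2)
    A = S + lam * np.eye(X.shape[1])

    # Step size 1/L, where L is the largest eigenvalue of A
    # (eigvalsh returns eigenvalues in ascending order).
    step = 1.0 / np.linalg.eigvalsh(A)[-1]

    w = np.zeros_like(d)
    for _ in range(n_steps):
        w -= step * (A @ w - d)  # gradient of the quadratic objective
    return w, (mu0 + mu1) / 2

def predict(X, w, midpoint):
    """Assign class 1 if the point lies on the mu1 side of the hyperplane."""
    return ((X - midpoint) @ w > 0).astype(int)

# Simulated high-dimensional example: more features (p) than samples (n).
rng = np.random.default_rng(0)
n, p = 40, 100
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p)) + 0.5 * y[:, None]

w, mid = ridge_lda_direction_gd(X, y, lam=1.0)
train_pred = predict(X, w, mid)
```

In this sketch the iteration converges to the closed-form ridge solution because the objective is strongly convex for any λ > 0. Stopping the iteration early or letting λ tend to zero gives different classifiers, and comparing their predictive performance is the kind of question the abstract raises.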
- Universität Wien - 100%
- Angelika Rohde, Universität Freiburg - Germany, project partner