C-Perform: Methods and Tools for Collocation Extraction and Performance-Oriented Parsing
Disciplines
Computer Sciences (75%); Mathematics (10%); Linguistics and Literature (15%)
Keywords
COMPUTATIONAL LINGUISTICS,
CORPUS-BASED NATURAL LANGUAGE PROCESSING,
COLLOCATIONS,
LEXICALIZATION,
NATURAL LANGUAGE PROCESSING,
PARSING
The aim of this project is to lay the foundations for a new generation of systems that enable fast, efficient and robust natural language processing while remaining sufficiently general. Based on the assumption that particular aspects of performance are grammaticalized, we pursue a novel approach to grammar in which performance and competence aspects are interleaved within the grammar model itself. In particular, we aim to model the interaction of generativity, which is the distinctive feature of competence, and lexicalization, which is a feature of language usage. To achieve this goal, the influence of lexicalization on generativity is studied using the phenomenon of collocations. The interaction of lexical and structural information is modeled by means of corpus-based statistical techniques. Owing to the impact of generative grammar on linguistics, collocations have been regarded as a phenomenon outside the grammar. In general, the reduction of grammar to competence aspects has led to grammar models that account for the dichotomy of syntactically correct versus incorrect utterances but ignore the fact that some of the correct analyses are more adequate than others. This emphasis on competence information leads to ambiguity, a severe problem for processing because the search space becomes large, and thus to fairly slow systems. Control and compilation strategies have been developed in computational linguistics to reduce ambiguity and thus gain processing efficiency. These approaches are useful means of mimicking performance but do not tackle the fundamental problem. Concurrently, we have witnessed a renaissance of statistics within natural language processing. Performance aspects influence stochastic language models because they are reflected in the language data (corpora). Likelihood replaces the true-false dichotomy, which enables the processing of unrestricted text.
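One common corpus-based statistic for finding collocation candidates is pointwise mutual information (PMI), which compares how often two words occur together with how often they would co-occur by chance. The sketch below is only an illustration of that general idea; the abstract does not say which association measures the project uses, and the function name `pmi_scores`, the `min_count` threshold and the restriction to adjacent word pairs are assumptions made here.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    Hypothetical helper for illustration; not the project's extraction method.
    Pairs seen fewer than `min_count` times are skipped, because PMI is
    unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi          # observed probability of the pair
        p_w1 = unigrams[w1] / n_uni    # marginal probability of the first word
        p_w2 = unigrams[w2] / n_uni    # marginal probability of the second word
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    text = "das Gesetz tritt in Kraft . die Regel tritt in Kraft . er geht in die Stadt"
    ranked = sorted(pmi_scores(text.split()).items(), key=lambda kv: -kv[1])
    for pair, score in ranked:
        print(pair, round(score, 2))
```

A purely frequency-based ranking of this kind uses no structural information, which is the limitation the text goes on to address.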
But statistical models are linguistically poor, which makes them reliable only for very restricted domains. This is where the results of this project are intended to bring improvement. In order to arrive at efficient and sufficiently general systems, we need to combine statistical models with elaborate linguistic knowledge. One way to achieve this is to provide corpora with linguistically elaborate annotation schemes. Grammatical competence can also alleviate another inherent problem of statistical models: since the number of model parameters is limited by the size of the training corpus, a linguistically guided pre-selection of appropriate candidate parameters is crucial. Within the project, stochastic grammars with different degrees of lexicalization will be induced from a German newspaper corpus. Parametrization of the grammar models is guided by insights gained from corpus-based retrieval of collocations. The initial model will be trained on annotated portions of the corpus. The parameters will be systematically varied and tested in a number of parsing experiments. With parsing, an additional aspect of performance comes into play. With respect to collocation extraction, corpus pre-processing tools will be adapted to automatically enrich raw text with the structural information required for collocation extraction. As a theoretical result, the project will provide insights into the interaction of generativity and lexicalization within collocations and, as a consequence, insights into the interaction of competence and performance aspects of natural language. As a practical outcome, the project provides methods and tools for automatic high-precision extraction of collocations from raw text, methods and tools to induce a highly lexicalized stochastic grammar model from arbitrary corpora, and a CKY-type stochastic parser that can be parametrized with respect to the grammar.
Both the grammar model and the parser are designed specifically for the requirements of robust and efficient processing of real-world German text. They thus overcome the disadvantages of existing stochastic parsers for German, which have largely been developed on the basis of English, a language which, in contrast to German, has little inflection, rigid word order and a fairly restricted number of non-local phenomena. Interest in performance-oriented grammar models is not restricted to computational linguistics; such models are also a topic of research in theoretical linguistics and psycholinguistics. Thus the work within the project can benefit from a broader range of research, and results achieved in the project are expected to influence research in the other fields.
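To make the notion of a CKY-type stochastic parser concrete, the following is a minimal sketch of probabilistic CKY (Viterbi) parsing. It assumes a grammar in Chomsky normal form passed in as plain dictionaries, so the grammar is a parameter of the parser. The function name `cky_best_parse`, the data layout and the three-word toy grammar are hypothetical and are not taken from the project.

```python
import math
from collections import defaultdict


def cky_best_parse(words, lexical, binary, start="S"):
    """Return (log_prob, tree) for the most probable parse, or None.

    lexical: dict mapping word -> list of (category, probability)
    binary:  dict mapping (left_cat, right_cat) -> list of (parent_cat, probability)
    Both tables are illustrative assumptions about how a grammar could be passed in.
    """
    n = len(words)
    # chart[(i, j)] maps a category to its best (log_prob, tree) over words[i:j]
    chart = defaultdict(dict)

    # Fill in the spans of width 1 from the lexicon.
    for i, word in enumerate(words):
        for cat, prob in lexical.get(word, []):
            chart[(i, i + 1)][cat] = (math.log(prob), (cat, word))

    # Combine adjacent spans bottom-up, keeping only the best analysis per category.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            cell = chart[(i, j)]
            for k in range(i + 1, j):
                for lcat, (lscore, ltree) in chart[(i, k)].items():
                    for rcat, (rscore, rtree) in chart[(k, j)].items():
                        for parent, prob in binary.get((lcat, rcat), []):
                            score = lscore + rscore + math.log(prob)
                            if parent not in cell or score > cell[parent][0]:
                                cell[parent] = (score, (parent, ltree, rtree))

    return chart[(0, n)].get(start)


if __name__ == "__main__":
    # Toy grammar with made-up probabilities, for illustration only.
    lexical = {"der": [("DET", 1.0)], "Hund": [("N", 1.0)], "bellt": [("VP", 1.0)]}
    binary = {("DET", "N"): [("NP", 0.5)], ("NP", "VP"): [("S", 1.0)]}
    print(cky_best_parse(["der", "Hund", "bellt"], lexical, binary))
```

Keeping only the highest-scoring analysis per category and span is what replaces the correct-versus-incorrect decision with a ranking by likelihood. A parser for German would also need to handle the freer word order and non-local phenomena mentioned above, which this sketch does not attempt.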