Architecture and Development of High-Quality PoS Tagger
Disciplines
Computer Sciences (60%); Linguistics and Literature (40%)
Keywords
Computational Linguistics,
Constraint Grammars,
Linguistic Methodology,
Human Language Technology,
Part-of-Speech tagging,
Natural Language Processing
The project aims to develop and implement a new architecture for high-quality Part-of-Speech (PoS) tagging. PoS taggers resolve the ambiguity of word forms in text, at least on the level of part of speech (e.g., German "sieben" is ambiguous between numeral and verb) or on some finer level (e.g., the gender of German "Leiter"). Currently, two types of approaches exist:
- statistical taggers, which assign each word its "most probable" reading as learned automatically from a tagged training text (i.e., without using explicit rules of the language);
- "Constraint Grammar" taggers, which use explicit, linguistically based grammar rules that in current systems are written entirely by hand.
Both approaches have strengths as well as drawbacks. The project aims to combine them into a single tagging architecture (tagging system) in which the strengths of both approaches are reinforced while their weaknesses compensate for each other. The tagging architecture should thus be able to overcome the current quality barrier of about 93-96% reliability. Training a statistical tagger can proceed swiftly (provided a tagged training corpus is available), since the methods are well understood and widespread, but methods for the efficient development of rules for a Constraint Grammar tagger are still missing. Building such a tagger therefore currently requires extraordinary experience and skill in writing the large number of individual, language-specific and (sometimes) complicated rules. To improve this situation, defining an effective methodology for creating the rules of a Constraint Grammar tagger will constitute an important subtask of the project. Apart from these more theoretical aims, the project will also validate the developed methodology in a practical demonstration and evaluate the practical results achieved.
This amounts to the following three main objectives of the project, which are at the same time its three contributions to the field of PoS tagging:
1. proposing and advocating a novel tagger architecture that combines the statistical and the Constraint Grammar based tagging schemes into a tagging system with higher accuracy than either of its components taken alone;
2. developing a systematic method for writing the rules of a Constraint Grammar tagger, together with a novel and (provably) more powerful method of applying them;
3. implementing and evaluating a combined tagger for German.
Part-of-Speech (PoS) tagging is the process of automatically labelling each word in a text with its correct PoS label. For example, the sentence "Time flies like an arrow" should be labelled like this: "Time (Noun) flies (Verb) like (Prep) an (Article) arrow (Noun)". PoS tags convey important linguistic information, and many natural language processing systems use PoS tagging as a pre-processing step. Why is PoS tagging difficult? Because words are ambiguous: "Time" can be a verb or a noun, "flies" a verb or a noun, and so on. State-of-the-art taggers use statistical knowledge gained from large corpora to disambiguate between these possibilities. They generally perform quite well, selecting the correct Part-of-Speech tag about 97 times out of 100. But although errors are few, they are sometimes embarrassing, that is, they are errors no human would ever make. Moreover, many applications call for even better performance: 99 out of 100 should be achievable. The primary goal of the project was to develop and implement a methodology for a high-quality, linguistically motivated partial PoS tagger for German that avoids "embarrassing" errors. Such a tagger performs disambiguation strictly on a linguistic basis, i.e. its architecture has the following properties:
- the initialization step labels each word with all its morphological readings ("PoS tags");
- the tagger proper removes all those morphological readings of a word which are (grammatically) impossible in the particular context.
In the course of the project we developed a methodology for expressing linguistic constraints in a concise form; specifically, we concentrated on designing rules that describe impossible sequences of PoS tags. When applied to a sentence, these rules eliminate the readings that would give rise to such sequences. Altogether we identified about 160 such rules for German.
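The two steps described above (label each word with all its readings, then remove readings that are impossible in context) can be illustrated with a minimal sketch. It is not the project's implementation: it assumes, for simplicity, that every rule is a forbidden pair of adjacent tags, and the function name `eliminate_impossible`, the tag names and the single example rule are all hypothetical. A reading is removed only when it is incompatible with every reading still available for a neighbouring word, so the result may remain partially ambiguous.

```python
from typing import List, Set, Tuple

Tag = str
Bigram = Tuple[Tag, Tag]


def eliminate_impossible(readings: List[Set[Tag]],
                         forbidden: Set[Bigram]) -> List[Set[Tag]]:
    """Remove readings that cannot occur in any allowed sequence of tags.

    `readings` holds the candidate tags of each word (the initialization
    step); `forbidden` holds hypothetical rules, each one a pair of adjacent
    tags that is assumed to be grammatically impossible.
    """
    result = [set(tags) for tags in readings]  # leave the input untouched
    changed = True
    while changed:  # repeat until no rule removes anything more
        changed = False
        for i, tags in enumerate(result):
            for tag in sorted(tags):  # sorted() copies, so removal is safe
                # A reading survives only if at least one reading of each
                # neighbouring word can legally stand next to it.
                left_ok = i == 0 or any(
                    (prev, tag) not in forbidden for prev in result[i - 1])
                right_ok = i == len(result) - 1 or any(
                    (tag, nxt) not in forbidden for nxt in result[i + 1])
                # Never delete the last remaining reading of a word.
                if not (left_ok and right_ok) and len(tags) > 1:
                    tags.discard(tag)
                    changed = True
    return result


# Toy input: "the flies like", with one assumed rule that an article
# cannot be directly followed by a verb.
sentence = [{"Article"}, {"Noun", "Verb"}, {"Verb", "Prep"}]
rules = {("Article", "Verb")}
reduced = eliminate_impossible(sentence, rules)
```

In this toy run the Verb reading of the second word is removed, while the third word keeps both of its readings because the assumed rule says nothing about it.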
The principal advantage of the approach is that the output of the system is fully reliable, in the sense that the tagger commits no errors during its operation. However, this method does not normally perform full disambiguation; full disambiguation takes place only where the linguistic knowledge employed allows for it (this corresponds to linguistic reality, since many sentences are inherently ambiguous). To achieve full disambiguation, the system is complemented by a standard statistical tagger that completes the disambiguation down to a single tag per word. Because our system reduces the number of possible PoS tags that form the input to the statistical tagger, the overall quality of the combined system surpasses that of purely statistical taggers. An evaluation on a large corpus of newspaper articles demonstrated that the combined tagger incorporating our system comes close to the ideal: it reached almost 98 correct tag assignments out of 100.
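The combination described above can be sketched as follows. This is only an illustration under strong assumptions: the "statistical tagger" is reduced to a hypothetical table of per-word tag scores (a real statistical tagger would also model context), and the function name `pick_most_probable` and all numbers are invented for the example. The point it shows is that the statistical choice is restricted to the readings left over by the rule-based step.

```python
from typing import Dict, List, Set


def pick_most_probable(words: List[str],
                       readings: List[Set[str]],
                       scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Choose one tag per word, but only among the readings still allowed.

    `scores` is a hypothetical stand-in for a statistical tagger: it maps
    each word to assumed probabilities of its tags. Tags without a score
    count as 0.0.
    """
    chosen = []
    for word, tags in zip(words, readings):
        word_scores = scores.get(word, {})
        # sorted() makes the result deterministic when scores are tied.
        best = max(sorted(tags), key=lambda tag: word_scores.get(tag, 0.0))
        chosen.append(best)
    return chosen


# Toy input: the readings are assumed to be the output of the rule-based
# step, which has already removed the Verb reading of "flies".
words = ["the", "flies", "like"]
remaining = [{"Article"}, {"Noun"}, {"Verb", "Prep"}]
toy_scores = {
    "flies": {"Verb": 0.7, "Noun": 0.3},
    "like": {"Prep": 0.6, "Verb": 0.4},
}
tags = pick_most_probable(words, remaining, toy_scores)
```

Here the score table alone would prefer Verb for "flies", but that reading is no longer available, so the combined result assigns Noun; for "like", where both readings remain, the scores decide.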