Utrecht University

Steven Krauwer


CLASK: Combining Linguistic and Statistical Knowledge (1993-96)

This page will expire and be deleted by April 2000.

The aim of the CLASK project was to outline directions for research into robustness techniques, covering both the ambiguity problem and the problem of ill-formed or unexpected input. At the heart of the enterprise is the belief that, given the current state of linguistic knowledge, solutions have to be based in part on what we understand (i.e. the linguistic knowledge that can be expressed in discrete rules), and in part on what we do not quite understand but can measure precisely enough to allow for extrapolation (i.e. patterns of linguistic behavior that can be described statistically). Rather than setting rule-based and statistical approaches in opposition and arguing over which is to be preferred, the project aims to combine them in such a way that their strong points are exploited to the maximum and the negative effects of their shortcomings are reduced to a minimum. The framework adopted by the project is the DOP framework, which has the clear advantage that linguistic knowledge is not sacrificed to existing probabilistic methods, and which integrates statistical knowledge in a way compatible with the description of complex linguistic phenomena and the building of rich interpretations in conformance with mainstream linguistic (or semantic) representation theories. The intended output of the project is a collection of tools, methods and techniques that should help to reduce the robustness problem and that should be general enough to be applicable in different contexts and environments.

During the first phase of CLASK, the activities were directed toward designing and implementing an efficient deterministic integrated parser and disambiguator for DOP grammars. A pilot implementation has been operational since February 1995 and performs very efficiently, in both time and space, on realistic DOP grammars (in comparison to the previous non-deterministic unit implemented at Alfa-Informatica in Amsterdam). The design of this unit is based on the observation that DOP models are stochastically enriched constrained Context-Free Grammars (CFGs), which implies that the parser+disambiguator unit can be built as an extension of a CKY-based CFG parser. The second phase of CLASK is directed toward both improving the performance of this unit and dealing with ill-formed text (error correction). The error-correction task can also be seen as a case study of the effectiveness of integrating statistical and syntactic knowledge.
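To make the idea of a CKY-based parser that also disambiguates more concrete, the following is a minimal sketch of Viterbi CKY parsing for a plain probabilistic CFG in Chomsky normal form. It is not the CLASK implementation: DOP grammars use tree fragments rather than single rules as their elementary units, and the toy grammar, the rule probabilities and the function name `viterbi_cky` are all assumptions made purely for illustration.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form (hypothetical; not taken from CLASK).
# Binary rules: (left child, right child) -> list of (parent, probability)
BINARY = {
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
    ("Det", "N"): [("NP", 0.6)],
}
# Lexical rules: word -> list of (category, probability)
LEXICAL = {
    "she": [("NP", 0.4)],
    "saw": [("V", 1.0)],
    "the": [("Det", 1.0)],
    "dog": [("N", 1.0)],
}


def viterbi_cky(words):
    """Return (probability, tree) of the best S analysis, or None."""
    n = len(words)
    # chart[(i, j)] maps a category to (best probability, tree) for words[i:j]
    chart = defaultdict(dict)

    # Fill the diagonal with lexical categories.
    for i, word in enumerate(words):
        for cat, p in LEXICAL.get(word, []):
            chart[(i, i + 1)][cat] = (p, (cat, word))

    # Combine adjacent spans bottom-up, keeping only the most probable
    # analysis per category and span (this is the disambiguation step).
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lcat, (lp, ltree) in chart[(i, k)].items():
                    for rcat, (rp, rtree) in chart[(k, j)].items():
                        for parent, p in BINARY.get((lcat, rcat), []):
                            prob = p * lp * rp
                            cell = chart[(i, j)]
                            if parent not in cell or prob > cell[parent][0]:
                                cell[parent] = (prob, (parent, ltree, rtree))

    return chart[(0, n)].get("S")


if __name__ == "__main__":
    print(viterbi_cky("she saw the dog".split()))
```

Parsing and disambiguation happen in the same chart pass here, which is the sense in which such a unit is "integrated". A DOP parser would additionally have to handle the fact that several fragment derivations can yield the same parse tree, which this simplified sketch does not attempt.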

Some papers and reports

  1. Efficient Disambiguation by means of Stochastic Tree Substitution Grammars.
    Sima'an, K., Bod, R., Krauwer, S., and Scha, R. (1994).
    In Proceedings of the International Conference on New Methods in Language Processing (NeMLaP'94), Manchester, pages 50-58. Centre for Computational Linguistics, UMIST.

Steven Krauwer (s.krauwer@uu.nl)
Utrecht Institute of Linguistics UiL OTS
Faculty of Humanities, Utrecht University
Drift 10, 3512 BS Utrecht, Netherlands
Phone +31 30 253 6050
[Page last modified: 28-04-2020]