Whatever linguistic theory we consider, the
processing of natural language cannot be accomplished without a
representation of structured data. As soon as our natural language model
becomes richer than a simple bag-of-words,
the data representation is no longer linear (e.g. syntactic parse trees,
as opposed to POS tag sequences).
Classical machine learning approaches attempt to
represent structured syntactic/semantic objects with a flat feature
representation, i.e. attribute-value vectors. However, this raises two problems:
1. There is no well-defined theoretical motivation for the feature
model. Structural properties may not fit in any flat feature representation.
2. When the linguistic phenomenon is complex, we may not be able to find
any suitable linear representation.
Kernel methods for NLP aim to
solve both of the above problems. First, a kernel function allows us to
express the similarity between two objects without explicitly defining their
feature space. As a result, we avoid most feature representation
problems.
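
As a minimal sketch of this first point, the Python fragment below compares a
homogeneous degree-2 polynomial kernel with the scalar product in the feature
space it implies. The kernel choice, the function names and the toy vectors are
assumptions made for illustration only; they are not taken from the text above.

```python
from itertools import product


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: (x . y)^2."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot ** 2


def explicit_features(x):
    """Explicit feature map for the same kernel: all ordered products x_i * x_j."""
    return [a * b for a, b in product(x, repeat=2)]


# Toy vectors (hypothetical data).
x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]

# Similarity computed directly by the kernel, without building any features.
implicit = poly2_kernel(x, y)

# The same similarity computed as a scalar product of explicit feature vectors.
phi_x, phi_y = explicit_features(x), explicit_features(y)
explicit = sum(a * b for a, b in zip(phi_x, phi_y))

print(implicit, explicit, len(phi_x))
```

Both routes give the same value, but the kernel never enumerates the nine
product features. For richer objects such as trees, the implied feature space
is far too large to enumerate, which is where the kernel formulation pays off.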
Second, a linguistic phenomenon can be modeled at
a more abstract level, where the modeling process is easier. For example,
which features would you use to learn the difference between a correct and
an incorrect syntactic parse tree? By using the parse tree itself rather than
any of its feature representations, we let the learner focus only on
the properties that are useful for the decision. The tree kernel proposed in
(Collins and Duffy 2002) measures the similarity between trees in terms of
all their common substructures.
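
The idea of counting common substructures can be sketched in a few lines of
Python. This is a simplified illustration in the spirit of the Collins and
Duffy recursion, not their implementation: it omits the decay factor, encodes
trees as nested tuples `(label, child, ...)` with words as plain strings, and
assumes that words occur only as children of pre-terminal nodes. All names and
the two toy trees are hypothetical.

```python
def child_label(node):
    """Label of a child: the word itself for a leaf, else the node's category."""
    return node if isinstance(node, str) else node[0]


def production(node):
    """Grammar production at a node, e.g. ('NP', ('D', 'N'))."""
    return (node[0], tuple(child_label(c) for c in node[1:]))


def is_preterminal(node):
    """True if all children are words (assumed to be plain strings)."""
    return all(isinstance(c, str) for c in node[1:])


def internal_nodes(tree):
    """All non-leaf nodes of a tree, in pre-order."""
    if isinstance(tree, str):
        return []
    nodes = [tree]
    for child in tree[1:]:
        nodes.extend(internal_nodes(child))
    return nodes


def common_fragments(n1, n2):
    """Number of common tree fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    count = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        count *= 1 + common_fragments(c1, c2)
    return count


def tree_kernel(t1, t2):
    """Sum the common-fragment counts over all pairs of internal nodes."""
    return sum(common_fragments(a, b)
               for a in internal_nodes(t1)
               for b in internal_nodes(t2))


# Two toy parse trees that differ in a single word (hypothetical data).
T1 = ("S", ("NP", ("D", "the"), ("N", "dog")), ("VP", ("V", "barks")))
T2 = ("S", ("NP", ("D", "the"), ("N", "cat")), ("VP", ("V", "barks")))

print(tree_kernel(T1, T2))  # 15 shared fragments
print(tree_kernel(T1, T1))  # 24 fragments in T1 itself
```

The recursion never lists the fragments explicitly: it only multiplies counts
over matching productions. The naive double loop over node pairs is kept for
readability; a practical implementation would memoize `common_fragments`.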
Third, although kernel functions can be seen as
scalar products in feature spaces, these spaces never have to be built
explicitly, so we preserve the advantage of including a large (possibly
infinite) number of features. Moreover, kernel methods can be used with
Support Vector Machines, which are among the most accurate classification
approaches.
Finally, the mathematical formalism behind kernel
methods allows us to clearly separate the learning algorithms from features
and representation spaces. This makes it easier to compare performance
across spaces (e.g. a baseline vs. more complex spaces).
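
To make this separation concrete, here is a small sketch in which one learning
algorithm is run unchanged with two different kernels. A kernel perceptron is
used instead of a Support Vector Machine only to keep the example
self-contained; the function names, the kernels and the XOR-style toy data are
assumptions made for illustration.

```python
def linear_kernel(x, y):
    """Baseline space: plain scalar product."""
    return sum(a * b for a, b in zip(x, y))


def quadratic_kernel(x, y):
    """Richer space: inhomogeneous degree-2 polynomial kernel."""
    return (1 + linear_kernel(x, y)) ** 2


def decision_score(x, examples, labels, alphas, kernel):
    """Kernel expansion of the decision function for input x."""
    return sum(a * yj * kernel(xj, x)
               for a, xj, yj in zip(alphas, examples, labels))


def train_kernel_perceptron(examples, labels, kernel, epochs=20):
    """Learner that touches the data only through kernel evaluations."""
    alphas = [0] * len(examples)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(examples, labels)):
            if y * decision_score(x, examples, labels, alphas, kernel) <= 0:
                alphas[i] += 1  # count one more mistake on example i
                mistakes += 1
        if mistakes == 0:
            break
    return alphas


def accuracy(examples, labels, alphas, kernel):
    """Fraction of training examples classified correctly."""
    correct = 0
    for x, y in zip(examples, labels):
        score = decision_score(x, examples, labels, alphas, kernel)
        predicted = 1 if score > 0 else -1
        correct += predicted == y
    return correct / len(examples)


# XOR-style toy problem: the label is the sign of x1 * x2.
X = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
Y = [1, -1, -1, 1]

results = {}
for name, kernel in [("linear", linear_kernel), ("quadratic", quadratic_kernel)]:
    alphas = train_kernel_perceptron(X, Y, kernel)
    results[name] = accuracy(X, Y, alphas, kernel)

print(results)
```

Only the `kernel` argument changes between the two runs. The quadratic space
separates this toy data perfectly while the linear baseline cannot, so any
difference in accuracy is attributable to the representation space rather than
to the learning algorithm.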
Given the above properties, we believe that kernel
methods are a useful mathematical tool for studying the reciprocal impact of
natural language structures, e.g. of syntactic structures on semantic frames
and vice versa.
