Dependency Parsing and Neural Networks

by Jungo Kasai



There have been two major classes of data-driven methods that are often called transition-based and graph-based methods (Kubler et al., 2009)1.

Transition-based methods start by defining a transition system, or state machine, for mapping a sentence to its dependency graph. The parsing process is then reduced to inducing a model for predicting the next transition, given the current state and the transition history so far.

Graph-based methods instead induce a model for assigning scores to the candidate dependency graphs for a sentence, and the parsing problem is posed as finding the maximum-scoring dependency tree for the input sentence, given the induced model. This is often called maximum spanning tree (MST) parsing.

Transition-based vs Graph-based

Empirical studies showed that a transition-based parser and a graph-based parser yielded similar performance across languages (McDonald and Nivre, 2011) 2, but the two strands of data-driven parsing methods manifest the fundamental trade-off. The former prefers rich feature representations with parsing history over global training and exhaustive search, and the latter allows for global training and inference at the expense of rich feature representations (Kubler et al., 2009) 1. It should be noted that despite the similar overall performance, the two parsers produce different types of errors, corresponding to the theoretical trade-off. See McDonald and Nivre (2011) for details 2.

Remedies for the Trade-off

Recent neural network models for transition-based and graph-based parsing are designed to target such trade-off. Andor et al. (2016) 3 developed a transition-based parser using feed-forward neural networks that performs global training approximated by beam search in contrast with local training (Chen and Manning, 2014) 4. The globally normalized objective addresses the label bias problem and makes global training effective in the transition-based parsing setting. Kiperwasser and Goldberg (2016) 5 incorporated a dynamic oracle in a BiLSTM transition-based parser that remedies global error propagation. Kiperwasser and Goldberg (2016) 5 and Dozat and Manning (2017) 6 proposed a graph-based parser that has access to rich feature representations obtained from BiLSTMs.

It should be noted, however that there still remains a fundamental difference between transition-based and graph-based parsing: there is no notion of parsing history in graph-based parsing. Zheng 2017 7, for instance, proposes incremental graph-based parsing that performs rescoring and relaxes first-order factorization i.e. the arc-factorization assumption. Integration of the two classes of algorithms should be one direction for further improvement in parsing.


This blog post is inspired by the conversation that I had with Professor Joakim Nivre at EMNLP 2017.