Publications

A Linguistic Perspective on Reference: Choosing a Feature Set for Generating Referring Expressions in Context

Published in Proceedings of the 28th International Conference on Computational Linguistics, 2020

This paper reports on a structured evaluation of feature-based Machine Learning algorithms for selecting the form of a referring expression in discourse context. Based on this evaluation, we selected seven feature sets from the literature, amounting to 65 distinct linguistic features. The features were then grouped into 9 broad classes. After building Random Forest models, we used Feature Importance Ranking and Sequential Forward Search methods to assess the importance of the features. Combining the results of the two methods, we propose a consensus feature set. The 6 features in our consensus set come from 4 different classes, namely grammatical role, inherent features of the referent, antecedent form and recency.

Download here

Computational Interpretations of Recency for the Choice of Referring Expressions in Discourse

Published in Proceedings of the First Workshop on Computational Approaches to Discourse, 2020

First, we discuss the most common linguistic perspectives on the concept of recency and propose a taxonomy of recency metrics employed in Machine Learning studies for choosing the form of referring expressions in discourse context. We then report on a Multi-Layer Perceptron study and a Sequential Forward Search experiment, followed by Bayes Factor analysis of the outcomes. The results suggest that recency metrics counting paragraphs and sentences contribute to referential choice prediction more than other recency-related metrics. Based on the results of our analysis, we argue that sensitivity to discourse structure is important for recency metrics used in determining referring expression forms.

Download here

20 years of UK Budget speeches: correspondence analysis vs. networks of n-grams

Published in JADT 2016: 13th International Conference on the Statistical Analysis of Textual Data, 2016

In computational linguistics, repeated segments in a text can be derived and named in various ways. The notion of n-gram is very popular but retains a mathematical flavor that is ill-suited to hermeneutical efforts. On the other hand, corpus linguistics uses many definitions for multi-word expressions, each one suited to a specific field of linguistics. The present paper contributes to the debate on term extraction, a particular notion regarding multi-word expressions that is widely used in summarization of scientific texts. We compare and contrast two methods of term extraction: distance-based maps vs. graph-based maps. The use of network theory for the automated analysis of texts is here expanded to include the concept of community around newly identified keywords.

Download here