Measurement of nonlinear distance in data derived from linguistic corpora

Newcastle University author(s): Dr Hermann Moisl

Abstract

Cluster analysis is a family of mathematically based computational methods for identification and graphical display of proximity relations among data objects, where ‘proximity’ is the generic term for the degree of similarity or dissimilarity between and among those objects. It has long been used for this purpose in applications like hypothesis generation, hypothesis confirmation, and dimensionality reduction across a broad range of science and engineering disciplines, and, as digital electronic language corpora become increasingly important in most branches of linguistics, its application to data derived from such corpora becomes ever more relevant to linguistic research.
Hierarchical cluster analysis is a long-established and widely used clustering method. Its popularity derives from the intuitive accessibility of its results. On the one hand, output from a hierarchical analysis is a constituency tree which, when visually represented, provides a detailed and exhaustive map of the similarity relations among data objects. On the other, because hierarchical clustering has been well studied, interpretation is supported by a thorough understanding of the characteristics of hierarchical clustering and its associated problems. Textbook accounts of hierarchical analysis all have a self-imposed limitation, however: they assume that the relationships between and among the variables describing data objects are linear, and cluster those objects on the basis of linear measurement of proximity among them. It has become increasingly clear in recent decades that nonlinearity is the default case in natural processes. This nonlinearity can manifest itself in data describing these processes, and hierarchical analysis of nonlinear data based on linear proximity measurement can give results that are inaccurate in some proportion to the nature and degree of the nonlinearity.
Linguistic communication among humans is generated by a natural process known to be highly nonlinear, the brain. Any analysis of data derived from language use, and cluster analysis in particular, must therefore consider the possibility that nonlinearity will be present. The aim of this discussion is to extend the applicability of hierarchical cluster analysis to nonlinear data. It is in four main parts. The first part outlines the nature of nonlinearity in data. The second shows why linear proximity measurement of nonlinear data generates inaccurate cluster analytic results. Using concepts from mathematical topology and graph theory, the third proposes a way of simultaneously identifying nonlinearity in data and, where found, of transforming linear to nonlinear proximity measurement for use with hierarchical cluster analysis. And, finally, the fourth shows that the concepts introduced in the preceding three parts are of more than theoretical interest in corpus-based linguistic research by identifying nonlinearity in data derived from a speech corpus, and showing that hierarchical cluster analyses of that data based on linear and nonlinear proximity measurement give substantially different results.
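The contrast between linear and nonlinear proximity measurement can be illustrated with a small sketch. The abstract does not say which graph construction the paper uses, so the code below assumes a k-nearest-neighbour graph with shortest-path (geodesic) distances, one common way of approximating distance along a curved data manifold; it is not presented as the author's method. The function names, the value of k, and the semicircle toy data are all hypothetical.

```python
import math

def euclidean_matrix(points):
    """Linear proximity: straight-line distance between every pair of points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def geodesic_matrix(dist, k):
    """Nonlinear proximity (assumed construction): link each point to its
    k nearest neighbours, then take shortest paths through that graph
    using the Floyd-Warshall algorithm."""
    n = len(dist)
    graph = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        # Position 0 of the sorted list is the point itself (distance 0), so skip it.
        nearest = sorted(range(n), key=lambda j: dist[i][j])[1:k + 1]
        for j in nearest:
            graph[i][j] = graph[j][i] = dist[i][j]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = graph[i][m] + graph[m][j]
                if via < graph[i][j]:
                    graph[i][j] = via
    return graph

# Hypothetical toy data: 21 points evenly spaced on a semicircle of radius 1.
points = [(math.cos(math.pi * t / 20), math.sin(math.pi * t / 20)) for t in range(21)]

linear = euclidean_matrix(points)
nonlinear = geodesic_matrix(linear, k=2)

# Between the two ends of the semicircle, the straight-line distance is the
# diameter (about 2), while the graph distance follows the curve and
# approaches the arc length (about pi).
print(linear[0][20], nonlinear[0][20])
```

Either matrix could then be passed to a hierarchical clustering routine; on curved data like this the two matrices can yield different trees, because points far apart along the curve can still be close in a straight line.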


Publication metadata

Author(s): Moisl H

Editor(s): Obradović, I., Kelih, E., Köhler, R.

Publication type: Conference Proceedings (inc. Abstract)

Publication status: Published

Conference Name: Methods and Applications of Quantitative Linguistics: Selected papers of the 8th International Conference on Quantitative Linguistics (QUALICO)

Year of Conference: 2012

Pages: 172-183

Print publication date: 01/07/2013

Publisher: University of Belgrade and Academic Mind

ISBN: 9788674664650

