Cluster Analysis for Corpus Linguistics

Moisl, H

Abstract

The rapid development of digital electronic information technology since the second half of the twentieth century has seen the emergence of corpus linguistics, the analysis of collections or corpora of spoken and written language, as a distinct subdiscipline within linguistics. As it has matured, the nature of its relationship to the parent discipline has clarified, and the dominant view is that corpus linguistics is a methodology for hypothesis testing. Like the other sciences, linguistics uses the de facto standard hypothetico-deductive scientific method in which (i) some aspect of the natural world is selected for study, and a research question that will substantially further scientific knowledge of the domain of interest is posed, (ii) a hypothesis that answers the research question is stated, and (iii) the validity of the hypothesis is tested by observation of the domain; it is the role of corpus linguistics to provide the language resources and analytical tools to facilitate testing. The proposed book argues that corpus linguistics can offer the parent discipline much more than this. Because falsifiable hypotheses are central in scientific method, it is natural to ask how they are generated. The consensus in philosophy of science is that hypothesis generation is non-algorithmic, that is, not reducible to a formula, but is rather driven by human intellectual creativity in response to a research question. In principle it doesn’t matter where hypotheses come from. Any one of us, whatever our background, could suddenly state an utterly novel hypothesis that, say, unifies quantum mechanics and Einsteinian cosmology, but this kind of inspiration is highly unlikely and must be exceedingly rare. In practice, there are two general approaches to hypothesis generation.
One approach is to select a theoretical framework, become familiar with existing theorems in that framework, and then attempt to derive from them new theorems relevant to the research question by deductive inference. The other is to become familiar with the domain of interest by observation of it, and on that basis attempt to formulate generalisations relevant to the research question by inductive inference. These approaches are typically used in conjunction. Cluster analysis is a family of mathematically based computational methods whose aim is to discover structure which may be latent in data and to represent such structure in an intuitively accessible way. It is extensively used in science and engineering, where awareness of latent structure revealed by analysis serves as a basis for generation of hypotheses about the domain from which the data was abstracted. The proposed book argues that it can also be used for this purpose in linguistics. More specifically, the argument is that cluster analysis can serve as an empirical methodology for the generation of linguistic hypotheses by inductive inference based on discovery of latent structure in data abstracted from natural language corpora. In developing this argument the book integrates existing theoretical knowledge of cluster analysis and its application with original research by the author into a coherent exposition.

Publication metadata

Author(s): Moisl H

Series Editor(s): Koehler, R; Grzybek, P; Altmann, G

Publication type: Authored Book

Publication status: Published

Edition: 1

Series Title: Quantitative Linguistics

Year: 2015

Number of Volumes: 1

Number of Pages: 381

Print publication date: 01/01/2015

Publisher: De Gruyter Mouton

Place Published: Berlin

URL: http://www.degruyter.com/view/product/248853

ISBN: 9783110363814
