
INTRODUCTION

Corpora are often used by linguists as the raw material from which language description may be
fashioned, and the role is no less relevant for CALL package designers. Corpora can provide the basis of
accurate, empirically justified linguistic observations on which to base CALL materials. Additionally, the
corpora themselves, typically via concordancing, may become the raw material of CALL-based teaching
itself. The corpus may be viewed, in certain contexts, as an item bank. The uses of corpora in CALL are
many, and a knowledge of corpus methods is increasingly indispensable to CALL package designers.

EARLY CORPUS LINGUISTICS

"Early corpus linguistics" is a term we use here to describe linguistics before the advent of
Chomsky. Field linguists, for example Boas (1940) who studied American-Indian languages, and
later linguists of the structuralist tradition all used a corpus-based methodology. However, that
does not mean that the term "corpus linguistics" was used in texts and studies from this era.
Below is a brief overview of some interesting corpus-based studies predating 1950.

1 Definition of a corpus

The concept of carrying out research on written or spoken texts is not restricted to corpus
linguistics. Indeed, individual texts are often used for many kinds of literary and linguistic
analysis - the stylistic analysis of a poem, or a conversation analysis of a TV talk show.
However, the notion of a corpus as the basis for a form of empirical linguistics is different from
the examination of single texts in several fundamental ways.

In principle, any collection of more than one text can be called a corpus (corpus being Latin for
"body", hence a corpus is any body of text). But the term "corpus", when used in the context of
modern linguistics, tends most frequently to have more specific connotations than this simple
definition suggests. The following list describes the four main characteristics of the modern corpus.

2 Sampling and representativeness

Often in linguistics we are not merely interested in an individual text or author, but a whole
variety of language. In such cases we have two options for data collection:

 We could analyse every single utterance in that variety - however, this option is impracticable
except in a few cases, for example with a dead language which only has a few texts. Usually,
however, analysing every utterance would be an unending and impossible task.
 We could construct a smaller sample of that variety. This is a more realistic option.

As discussed in Section 1.4, one of Chomsky's criticisms of the corpus approach was that
language is infinite - therefore, any corpus would be skewed. In other words, some utterances
would be excluded because they are rare, others which are much more common might be
excluded by chance, and alternatively, extremely rare utterances might also be included several
times. Although nowadays modern computer technology allows us to collect much larger
corpora than those that Chomsky was thinking about, his criticisms still must be taken seriously.
This does not mean that we should abandon corpus linguistics, but rather that we should try to establish
ways in which a less biased and more representative corpus may be constructed.

We are therefore interested in creating a corpus which is maximally representative of the variety
under examination, that is, one which provides as accurate a picture as possible of the
tendencies of that variety, as well as their proportions. What we are looking for is a broad range
of authors and genres which, when taken together, may be considered to "average out" and
provide a reasonably accurate picture of the entire language population in which we are
interested.
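
As a concrete illustration, the sketch below shows one way a corpus builder might draw a stratified
sample of texts so that each genre contributes a pre-set number of words. It is a minimal Python sketch:
the catalogue of candidate texts, the genre labels and the word targets are all invented for the example,
and a real project would work from a proper sampling frame rather than a hand-written list.

    import random

    # Minimal sketch of stratified sampling for corpus design. The catalogue of
    # candidate texts, the genre labels and the word targets below are invented.
    catalogue = [
        {"id": "text-001", "genre": "press", "words": 2000},
        {"id": "text-002", "genre": "press", "words": 2000},
        {"id": "text-003", "genre": "fiction", "words": 2000},
        {"id": "text-004", "genre": "fiction", "words": 2000},
        {"id": "text-005", "genre": "academic", "words": 2000},
    ]
    targets = {"press": 4000, "fiction": 2000, "academic": 2000}  # words per genre

    def sample_corpus(catalogue, targets, seed=0):
        """Randomly pick texts from each genre until its word target is met."""
        rng = random.Random(seed)
        corpus = []
        for genre, target in targets.items():
            pool = [t for t in catalogue if t["genre"] == genre]
            rng.shuffle(pool)
            total = 0
            for text in pool:
                if total >= target:
                    break
                corpus.append(text)
                total += text["words"]
        return corpus

    print([t["id"] for t in sample_corpus(catalogue, targets)])

The point of the sketch is simply that representativeness is designed in at collection time, by balancing
genres against explicit targets, rather than recovered afterwards.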

3 Finite size

The term "corpus" also implies a body of text of finite size, for example, one million words. This
is not universally so. For example John Sinclair's Cobuild team at the University of Birmingham
initiated the construction and analysis of a monitor corpus in the 1980s. Such a "collection of
texts", as Sinclair's team preferred to call the Cobuild corpus, is an open-ended entity - texts are
constantly being added to it, so it gets bigger and bigger. Monitor corpora are of interest to
lexicographers who can trawl a stream of new texts looking for the occurrence of new words, or
for changing meanings of old words. Their main advantages are:

 They are not static - new texts can always be added, unlike the synchronic "snapshot" provided
by finite corpora.
 Their scope - they provide for a large and broad sample of language.

Their main disadvantage is:

 They are not such a reliable source of quantitative data (as opposed to qualitative data) because
they are constantly changing in size and are less rigorously sampled than finite corpora.

With the exception of monitor corpora, it should be noted that it is more often the case that a
corpus consists of a finite number of words. Usually this figure is determined at the beginning of
a corpus-building project. For example, the Brown Corpus contains 1,000,000 running words of
text. Unlike the monitor corpus, when a corpus reaches its grand total of words, collection stops
and the corpus is not increased in size. (An exception is the London-Lund corpus, which was
increased in the mid-1970s to cover a wider variety of genres.)

4 Machine-readable form

Nowadays the term "corpus" nearly always implies the additional feature "machine-readable".
This was not always the case as in the past the word "corpus" was only used in reference to
printed text.

Today few corpora are available in book form - one which does exist in this way is "A Corpus of
English Conversation" (Svartvik and Quirk 1980) which represents the "original" London-Lund
corpus. Corpus data (not excluding context-free frequency lists) is occasionally available in other
forms of media. For example, a complete key-word-in-context concordance of the LOB corpus is
available on microfiche, and with spoken corpora copies of the actual recordings are sometimes
available - this is the case with the Lancaster/IBM Spoken English Corpus but not with the
London-Lund corpus.

Machine-readable corpora possess the following advantages over written or spoken formats:

 They can be searched and manipulated at speed. (This is something which we covered at the end
of Part One). A minimal search sketch follows this list.
 They can easily be enriched with extra information. (We will examine this in detail later.)

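To make the first advantage concrete, the following minimal Python sketch searches a machine-readable
text for a keyword and prints each hit with its surrounding context, in the style of a key-word-in-context
(KWIC) concordance. The sample sentence and the kwic function are invented for illustration; real
concordancers work over millions of words, but they rest on the same idea.

    # Minimal sketch of a key-word-in-context (KWIC) search over machine-readable
    # text. The sample sentence is invented; a real corpus would be read from files.
    text = "the book is on the table and the table is near the window"
    tokens = text.split()

    def kwic(tokens, keyword, span=3):
        """Return every occurrence of keyword with `span` words of context."""
        lines = []
        for i, tok in enumerate(tokens):
            if tok.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - span):i])
                right = " ".join(tokens[i + 1:i + 1 + span])
                lines.append(f"{left:>25} [{tok}] {right}")
        return lines

    for line in kwic(tokens, "table"):
        print(line)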

5 A standard reference

There is often a tacit understanding that a corpus constitutes a standard reference for the
language variety that it represents. This presupposes that it will be widely available to other
researchers, which is indeed the case with many corpora - e.g. the Brown Corpus, the LOB
corpus and the London-Lund corpus.

 One advantage of a widely available corpus is that it provides a yardstick by which successive
studies can be measured. So long as the methodology is made clear, new results on related topics
can be directly compared with already published results without the need for re-computation.
 The use of a standard corpus also means that a continuous base of data is being used. This implies that
any variation between studies is less likely to be attributable to differences in the data and more to
the adequacy of the assumptions and methodology contained in the study.

6 Multilingual corpora

Not all corpora are monolingual, and an increasing amount of work is being carried out on the
building of multilingual corpora, which contain texts in several different languages.

First we must make a distinction between two types of multilingual corpora. The first can really
be described as small collections of individual monolingual corpora, in the sense that the same
procedures and categories are used for each language, but each contains completely different
texts in those several languages. For example, the Aarhus corpus of Danish, French and English
contract law consists of a set of three monolingual law corpora; the texts are not translations of
one another.

The second type of multilingual corpora (and the one which receives the most attention) is
parallel corpora. This refers to corpora which hold the same texts in more than one language.
The parallel corpus dates back to mediaeval times when "polyglot bibles" were produced which
contained the biblical texts side by side in Hebrew, Latin and Greek etc.

A parallel corpus is not immediately user-friendly. For the corpus to be useful it is necessary to
identify which sentences in the sub-corpora are translations of each other, and which words are
translations of each other. A corpus which shows these identifications is known as an aligned
corpus as it makes an explicit link between the elements which are mutual translations of each
other. For example, in a corpus the sentences "Das Buch ist auf dem Tisch" and "The book is on
the table" might be aligned to one another. At a further level, specific words might be aligned,
e.g. "Das" with "The". This is not always a simple process, however, as often one word in one
language might correspond to two words in another language, e.g. the German word "raucht" could
be equivalent to "is smoking" in English.
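
The sketch below shows one simple way an aligned parallel corpus might be represented in Python, using
the German-English pair from the text. The data structure and the index-pair encoding of word
alignments are illustrative assumptions only; real projects typically use standard interchange formats
rather than ad hoc structures like this.

    # Minimal sketch of one way an aligned parallel corpus might be represented.
    # The structure and the index-pair word alignments are illustrative only.
    aligned_corpus = [
        {
            "de": "Das Buch ist auf dem Tisch",
            "en": "The book is on the table",
            # (German position, English position); alignments need not be
            # one-to-one: "raucht" would map to the two tokens "is smoking".
            "alignment": [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)],
        },
    ]

    for pair in aligned_corpus:
        de_tokens = pair["de"].split()
        en_tokens = pair["en"].split()
        for i, j in pair["alignment"]:
            print(f"{de_tokens[i]:<6} <-> {en_tokens[j]}")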

At present there are few cases of annotated parallel corpora, and those which exist tend to be
bilingual rather than multilingual. However, two EU-funded projects (CRATER and
MULTEXT) are aiming to produce genuinely multilingual parallel corpora. The Canadian
Hansard corpus is annotated, and contains parallel texts in French and English, but it only covers
a restricted range of text types (proceedings of the Canadian Parliament). However, this is an
area of growth, and the situation is likely to change dramatically in the near future.

7 Text encoding and annotation

If a corpus is said to be unannotated, it appears in its existing raw state of plain text, whereas an
annotated corpus has been enhanced with various types of linguistic information.
Unsurprisingly, the utility of the corpus is increased when it has been annotated: it is no
longer a body of text in which linguistic information is merely implicitly present, but one which may be
considered a repository of linguistic information, since the implicit information has been made
explicit through the process of annotation.

For example, the form "gives" contains the implicit part-of-speech information "third person
singular present tense verb" but it is only retrieved in normal reading by recourse to our pre-
existing knowledge of the grammar of English. However, in an annotated corpus the form
"gives" might appear as "gives_VVZ", with the code VVZ indicating that it is a third person
singular present tense (Z) form of a lexical verb (VV). Such annotation makes it quicker and
easier to retrieve and analyse information about the language contained in the corpus.
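
The following minimal Python sketch shows how such word_TAG annotation can be exploited: the tagged
string is split into (word, tag) pairs, and all VVZ forms are retrieved directly, with no appeal to a
reader's knowledge of English grammar. The example sentence is invented, and the tags other than VVZ
are loosely modelled on CLAWS-style codes purely for illustration.

    # Minimal sketch of retrieving information from word_TAG annotation. The
    # sentence is invented; tags other than VVZ are illustrative, CLAWS-like codes.
    tagged_text = "She_PPHS1 gives_VVZ the_AT book_NN1 to_II him_PPHO1"

    def parse_tagged(text):
        """Split word_TAG tokens into (word, tag) pairs."""
        return [tuple(token.rsplit("_", 1)) for token in text.split()]

    pairs = parse_tagged(tagged_text)

    # Retrieve all third person singular present tense lexical verbs.
    print([word for word, tag in pairs if tag == "VVZ"])   # ['gives']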

8 Types of annotation

Certain kinds of linguistic annotation, which involve the attachment of special codes to words in
order to indicate particular features, are often known as "tagging" rather than annotation, and the
codes which are assigned to features are known as "tags". These terms will be used in the
sections which follow.

Part-of-speech tagging is the most basic type of linguistic corpus annotation - the aim being to assign
to each lexical unit in the text a code indicating its part of speech. Part-of-speech annotation is useful
because it increases the specificity of data retrieval from corpora, and also forms an essential
foundation for further forms of analysis (such as syntactic parsing and semantic field annotation).
Part-of-speech annotation also allows us to distinguish between homographs.

Part-of-speech annotation was one of the first types of annotation to be performed on corpora and is
the most common today. One reason for this is that it is a task that can be carried out to a
high degree of accuracy by a computer. Greene & Rubin (1971) achieved a 71% accuracy rate of
correctly tagged words with their early part-of-speech tagging program (TAGGIT). In the early
1980s the UCREL team at Lancaster University reported a success rate of 95% using their
program CLAWS.
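
Accuracy figures of this kind are simply the proportion of tokens whose automatically assigned tag
matches a manually corrected reference. A minimal Python sketch of that calculation, using invented tag
sequences:

    # Minimal sketch of how a tagger's accuracy figure is computed: the share of
    # tokens whose automatic tag matches a manually corrected gold standard.
    # Both tag sequences below are invented.
    gold = ["AT", "NN1", "VVZ", "AT", "NN1"]
    auto = ["AT", "NN1", "VVZ", "II", "NN1"]

    correct = sum(1 for g, a in zip(gold, auto) if g == a)
    print(f"{correct / len(gold):.0%}")   # 80%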

9 Phonetic transcription

Spoken language corpora can also be transcribed using a form of phonetic transcription. Not
many examples of publicly available phonetically transcribed corpora exist at the time of writing.
This is possibly because phonetic transcription is a form of annotation which needs to be carried
out by humans rather than computers, and these transcribers must be highly skilled in the perception
and transcription of speech sounds. Phonetic transcription is therefore a very time-consuming task.

Another problem is that phonetic transcription works on the assumption that the speech signal
can be divided into single, clearly demarcated "sounds". In fact, these "sounds" do not
have such clear boundaries, so what phonetic transcription takes to be the same sound
might differ according to context.

Nevertheless, phonetically transcribed corpora are extremely useful to the linguist who lacks the
technological tools and expertise for the laboratory analysis of recorded speech. One such
example is the MARSEC corpus, which is derived from the Lancaster/IBM Spoken English
Corpus and has been developed at the Universities of Lancaster and Leeds; it will include a
phonetic transcription.

10 Problem-oriented tagging

Problem-oriented tagging, as described by de Haan (1984), is the phenomenon whereby users
will take a corpus, either already annotated, or unannotated, and add to it their own form of
annotation, oriented particularly towards their own research goal. This differs in two ways from
the other types of annotation we have examined in this section.

i. It is not exhaustive. Not every word (or sentence) is tagged - only those which are directly
relevant to the research. This is something which problem-oriented tagging has in common with
anaphoric annotation.
ii. Annotation schemes are selected not for broad coverage and theory-neutrality, but for the
relevance of the distinctions which they make to the specific questions that the researcher wishes to
ask of his/her data.

Although it is difficult to generalise further about this form of corpus annotation, it is an
important type to keep in mind in the context of practical research using corpora.
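
As a minimal illustration of problem-oriented tagging, the Python sketch below tags only the tokens
relevant to one invented research question - identifying verbs of speech - and leaves everything else
untouched. The tag name and the verb list are assumptions made for the example, not part of any
published annotation scheme.

    # Minimal sketch of problem-oriented tagging: only tokens relevant to one
    # (invented) research question receive a tag; everything else is left as is.
    SPEECH_VERBS = {"say", "says", "said", "tell", "tells", "told",
                    "ask", "asks", "asked"}

    def tag_speech_verbs(tokens):
        """Attach a research-specific tag only to the tokens of interest."""
        return [f"{tok}_SPEECH" if tok.lower() in SPEECH_VERBS else tok
                for tok in tokens]

    sentence = "He said that she would tell them tomorrow".split()
    print(" ".join(tag_speech_verbs(sentence)))
    # He said_SPEECH that she would tell_SPEECH them tomorrow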

11 Corpora in language studies

In this section we will examine a few of the roles which corpora may play in the study of
language. The importance of corpora to language study is aligned to the importance of empirical
data. Empirical data enable the linguist to make objective statements, rather than those which are
subjective, or based upon the individual's own internalised cognitive perception of language.
Empirical data also allow us to study language varieties, such as dialects or earlier periods of a
language, for which a rationalist, introspection-based approach is not possible.
It is important to note that although many linguists may use the term "corpus" to refer to any
collection of texts, when it is used here it refers to a body of text which is carefully sampled to be
maximally representative of the language or language variety. Corpus linguistics, proper, should
be seen as a subset of the activity within an empirical approach to linguistics. Although corpus
linguistics entails an empirical approach, empirical linguistics does not always entail the use of a
corpus.

In the following pages we'll consider the roles which corpus use may play in a number of
different fields of study related to language. We will focus on the conceptual issues of why
corpus data are important to these areas, and how they can contribute to the advancement of
knowledge in each, providing real examples of corpus use.

12 Corpora in lexical studies

Empirical data were used in lexicography long before the discipline of corpus linguistics was
invented. Samuel Johnson, for example, illustrated his dictionary with examples from literature,
and in the 19th Century the Oxford Dictionary used citation slips to study and illustrate word
usage. Corpora, however, have changed the way in which linguists can look at language.

A linguist who has access to a corpus, or other (non-representative) collection of machine-readable
text can call up all the examples of a word or phrase from many millions of words of
text in a few seconds. Dictionaries can be produced and revised much more quickly than before,
thus providing up-to-date information about language. Also, definitions can be more complete
and precise since a larger number of natural examples are examined.

Examples extracted from corpora can easily be organised into more meaningful groups for
analysis - for example, by sorting the right-hand context of the word alphabetically, so that all
instances of a particular collocate can be seen together. Furthermore, because corpus data
contain a rich amount of textual information - regional variety, author, date, genre, part-of-
speech tags etc. - it is easier to tie down usages of particular words or phrases as being typical of
particular regional varieties, genres and so on.
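
A minimal Python sketch of the sorting idea just described: concordance lines for a keyword are sorted
alphabetically on their right-hand context, so that recurring collocates line up together. The sentences
are invented and far too few to be meaningful; they only illustrate the mechanics.

    # Minimal sketch of sorting concordance lines on their right-hand context so
    # that instances of the same collocate line up together. Sentences are invented.
    sentences = [
        "she made a strong case for reform",
        "he drank a strong cup of tea",
        "they built a strong case against him",
        "it was a strong argument overall",
    ]

    def concordance(sentences, keyword):
        lines = []
        for s in sentences:
            tokens = s.split()
            for i, tok in enumerate(tokens):
                if tok == keyword:
                    lines.append((" ".join(tokens[:i]), tok, " ".join(tokens[i + 1:])))
        return sorted(lines, key=lambda line: line[2])   # sort on right-hand context

    for left, kw, right in concordance(sentences, "strong"):
        print(f"{left:>15} [{kw}] {right}")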

The open-ended (constantly growing) monitor corpus has its greatest role in dictionary building
as it enables lexicographers to keep on top of new words entering the language, or existing words
changing their meanings, or the balance of their use according to genre etc. However, finite
corpora also have an important role in lexical studies - in the area of quantification. It is possible
to produce reliable frequency counts rapidly and to subdivide these counts across various
dimensions according to the varieties of language in which a word is used.

Finally, the ability to call up word combinations rather than individual words, and the existence
of mutual information tools which establish relationships between co-occurring words mean that
we can treat phrases and collocations more systematically than was previously possible. A
phraseological unit may constitute a piece of technical terminology or an idiom, and collocations
are important clues to specific word senses.
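
One widely used measure of this kind is (pointwise) mutual information, which compares how often two
words occur together with how often they would be expected to co-occur by chance:
MI(x, y) = log2(P(x, y) / (P(x) P(y))). The Python sketch below estimates these probabilities from raw
counts in a tiny invented text; it illustrates the calculation only, and does not reproduce any particular
tool mentioned above.

    import math
    from collections import Counter

    # Minimal sketch of a mutual information score for an adjacent word pair:
    #     MI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
    # Probabilities are estimated from raw counts in a tiny invented text, which
    # is far too small for reliable figures.
    tokens = ("strong tea is better than weak tea but strong coffee "
              "beats weak coffee every time").split()

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)

    def mutual_information(x, y):
        p_xy = bigrams[(x, y)] / (n - 1)
        p_x = unigrams[x] / n
        p_y = unigrams[y] / n
        return math.log2(p_xy / (p_x * p_y))

    print(round(mutual_information("strong", "tea"), 2))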

13 Corpora and grammar

Grammatical (or syntactic) studies have, along with lexical studies, been the most frequent types
of research which have used corpora. Corpora are a useful tool for syntactical research because
of:

 The potential for the representative quantification of a whole language variety.
 Their role as empirical data for the testing of hypotheses derived from grammatical theory.

Many smaller-scale studies of grammar using corpora have included quantitative data analysis
(for example, Schmied's 1993 study of relative clauses). There is now a greater interest in the
more systematic study of grammatical frequency - for example, Oostdijk & de Haan (1994) are
aiming to analyse the frequency of the various English clause types.

Since the 1950s the rational-theory based/empiricist-descriptive division in linguistics (see
Section 1) has often meant that these two approaches have been viewed as separate and in
competition with each other. However, there is a group of researchers who have used corpora in
order to test essentially rationalist grammatical theory, rather than use it for pure description or
the inductive generation of theory.

At Nijmegen University, for instance, primarily rationalist formal grammars are tested on real-
life language found in computer corpora (Aarts 1991). The formal grammar is first devised by
reference to introspective techniques and to existing accounts of the grammar of the language.
The grammar is then loaded into a computer parser and is run over a corpus to test how far it
accounts for the data in the corpus: see Section 5, Module 3.5, headed Parsing and tagging. The
grammar is then modified to take account of those analyses which it missed or got wrong.
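
The sketch below imitates this workflow on a toy scale: a small hand-written context-free grammar is run
over a handful of "corpus" sentences and the proportion it can parse is reported. It uses NLTK's chart
parser purely for illustration - the Nijmegen work used its own formal grammars and parsing tools - and
both the grammar and the sentences are invented.

    import nltk

    # Toy version of the workflow described above: a hand-written formal grammar
    # is run over corpus sentences to see how many it accounts for.
    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> Det N
        VP -> V NP | V PP
        PP -> P NP
        Det -> 'the' | 'a'
        N  -> 'book' | 'table' | 'linguist'
        V  -> 'reads' | 'writes'
        P  -> 'on'
    """)
    parser = nltk.ChartParser(grammar)

    corpus_sentences = [
        "the linguist reads a book".split(),
        "a book reads on the table".split(),   # parses, however odd it sounds
        "the linguist sleeps".split(),         # fails: 'sleeps' is not in the grammar
    ]

    covered = 0
    for sent in corpus_sentences:
        try:
            if any(True for _ in parser.parse(sent)):
                covered += 1
        except ValueError:                     # a word the grammar does not cover
            pass

    print(f"{covered}/{len(corpus_sentences)} sentences parsed")

The sentences that fail to parse point to rules or lexical entries the grammar still lacks, which is
precisely the feedback used to revise it.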

14 Corpora and sociolinguistics

Although sociolinguistics is an empirical field of research it has hitherto relied primarily upon
the collection of research-specific data which is often not intended for quantitative study and is
thus not often rigorously sampled. Sometimes the data are also elicited rather than naturalistic
data. A corpus can provide what these kinds of data cannot provide - a representative sample of
naturalistic data which can be quantified. Although corpora have not as yet been used to a great
extent in sociolinguistics, there is evidence that this is a growing field.

The majority of studies in this area have concerned themselves with lexical studies in the area of
language and gender. Kjellmer (1986), for example, used the Brown and LOB corpora to
examine the masculine bias in American and British English. He looked at the occurrence of
masculine and feminine pronouns, and at the occurrence of the items man/men and
woman/women. As one would expect, the frequencies of the female items were much lower than
the male items in both corpora. Interestingly, however, the female items were more common in
British English than in American English. Another hypothesis of Kjellmer's was not supported by
the corpora - that women would be less "active", that is, that they would more frequently be the
objects rather than the subjects of verbs. In fact, men and women had similar subject/object ratios.

Holmes (1994) makes two important points about the methodology of these kinds of study,
which are worth bearing in mind. First, when classifying and counting occurrences, the context of
the lexical item should be considered. For instance, whilst there is a non-gender-marked
alternative for policeman/policewoman, namely police officer, there is no such alternative for the
-ess form in Duchess of York. The latter form should therefore be excluded from counts of
"sexist" suffixes when looking at gender bias in writing. Second, Holmes points out the difficulty
of classifying a form when it is actively undergoing semantic change. She argues that the word
man can refer to a single male (as in the phrase A 35 year old man was killed) or can have a
generic meaning which refers to mankind (as in Man has engaged in warfare for centuries). In
phrases such as we need the right man for the job it is difficult to decide whether man is gender
specific or could be replaced by person. These simple points should encourage a more critical
approach to data classification in further sociolinguistic work using corpora, both within and
without the area of gender studies.
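
A Kjellmer-style count is easy to reproduce in outline. The Python sketch below tallies a few masculine
and feminine items in the Brown corpus as distributed with NLTK (it assumes the corpus has already been
downloaded); the item lists are illustrative rather than Kjellmer's actual ones, and the raw counts say
nothing by themselves without the kind of contextual checks Holmes recommends.

    from collections import Counter
    from nltk.corpus import brown   # assumes nltk.download('brown') has been run

    # Minimal sketch of a Kjellmer-style count of gendered items in the Brown
    # corpus. The item lists are illustrative, not Kjellmer's actual ones.
    male_items = {"he", "him", "his", "man", "men"}
    female_items = {"she", "her", "hers", "woman", "women"}

    counts = Counter()
    for word in brown.words():
        w = word.lower()
        if w in male_items:
            counts["male"] += 1
        elif w in female_items:
            counts["female"] += 1

    print(counts)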

Conclusion

In this section we have seen how language study has benefited from exploiting corpus data. To
summarise, the main advantages of corpora are:

 Sampling and quantification: Because a corpus is sampled to maximally represent the
population, any findings taken from the corpus can be generalised to the larger population. Hence
quantification in corpus linguistics is more meaningful than other forms of linguistic
quantification because it can tell us about a variety of language, not just that which is being
analysed.
 Ease of access: As all of the data collection has been dealt with by someone else, the researcher
does not have to go through the issues of sampling, collection and encoding. The majority of
corpora are readily available, either free or at low cost. Once a corpus has been
obtained, it is usually easy to access the data within it, e.g. by using a concordance program.
 Enriched data: Many corpora have already been enriched with additional linguistic information
such as part-of-speech annotation, parsing and prosodic transcription. Hence data retrieval from
annotated corpora can be easier and more specific than with unannotated data.
 Naturalistic data: Corpus data is not always completely unmonitored in the sense that the people
producing the spoken or written texts may be aware that they are contributing to the building of a
corpus. For the most part, however, the data are largely naturalistic, unmonitored and the product
of real social contexts. The corpus thus provides one of the most reliable sources of naturally
occurring data that can be examined.
