
Corpus linguistics - an introduction

 by Friederike Müller and Birgit Waibel

  1. Basics
  2. List of corpora available in Freiburg
  3. Software
  4. Exercises
  5. Points to consider when conducting corpus-linguistic research
  6. Further reading

 

 

 

1. Basics

 

What is a corpus?

A corpus (plural corpora, German “das Korpus”, not “der”) is a collection of texts used for linguistic analyses, usually stored in an electronic database so that the data can be accessed easily by means of a computer. Corpus texts usually consist of thousands or millions of words and are not made up of the linguist’s or a native speaker’s invented examples but of authentic (naturally occurring) spoken and written language.

The majority of present-day corpora are “balanced” or “systematic”. This means that the texts are collected (“compiled”) according to specific principles, such as different genres, registers or styles of English (e.g. written or spoken English, newspaper editorials or technical writing); these sampling principles do not follow language-internal but language-external criteria. For example, the texts for a corpus are not selected because of their high number of relative clauses but because they are instances of a predefined text type, say broadcast English in a hypothetical corpus of Australian English. Examples of balanced corpora are the International Corpus of English (ICE), the British National Corpus (BNC), or the Brown and Lancaster-Oslo/Bergen (LOB) corpora and their Freiburg updates (Frown and F-LOB).

A corpus is thus a systematic, computerised collection of authentic language used for linguistic analysis.

 

What is corpus linguistics and why is it useful?

Based on the above definition of a corpus, corpus linguistics is the study of language by means of naturally occurring language samples; analyses are usually carried out with specialised software programmes on a computer. Corpus linguistics is thus a method to obtain and analyse data quantitatively and qualitatively rather than a theory of language or even a separate branch of linguistics on a par with e.g. sociolinguistics or applied linguistics. The corpus-linguistic approach can be used to describe language features and to test hypotheses formulated in various linguistic frameworks. To name but a few examples, corpora recording different stages of learner language (beginners, intermediate, and advanced learners) can provide information for foreign language acquisition research; by means of historical corpora it is possible to track the development of specific features in the history of English like the emergence of the modal verbs gonna and wanna; or sociolinguistic markers of specific age groups such as the use of like as a discourse marker can be investigated for purposes of sociolinguistic or discourse-analytical research.

The great advantage of the corpus-linguistic method is that language researchers do not have to rely on their own or other native speakers’ intuition or even on made-up examples. Rather, they can draw on a large amount of authentic, naturally occurring language data produced by a variety of speakers or writers in order to confirm or refute their own hypotheses about specific language features on the basis of an empirical foundation.

 

 

What types of corpora are there?

In the following, a list of some of the most common types of corpora is provided.

 

  • General corpora, such as the British National Corpus or the Bank of English, contain a large variety of both written and spoken language, as well as different text types, by speakers of different ages, from different regions and from different social classes.
  • Synchronic corpora, such as F-LOB and Frown, record language data collected for one specific point in time, e.g. written British and American English of the early 1990s.
  • Historical corpora, such as ARCHER and the Helsinki corpus, consist of corpus texts from earlier periods of time. They usually span several decades or centuries, thus providing diachronic coverage of earlier stages of language.
  • Learner corpora, such as the International Corpus of Learner English and the Cambridge Learner Corpus, are collections of data produced by foreign language learners, such as essays or written exams.
  • Corpora for the study of varieties, such as the International Corpus of English and the Freiburg English Dialect Corpus, represent different regional varieties of a language.

 

There is also a large variety of specialized corpora, e.g. Michigan Corpus of Academic Spoken English (MICASE), useful for various types of research (cf. e.g. http://www.helsinki.fi/varieng/CoRD/corpora/index.html).

It should be pointed out that the above listed types of corpora are not necessarily mutually exclusive – F-LOB and Frown, for example, are both synchronic and regional corpora, and even “become” historical when paired with their 1960s counterparts LOB and Brown.



2. List of corpora available in Freiburg


You can download a list of corpora here. Please note that with new corpora constantly being compiled this list is not exhaustive but constitutes a selection of well known and widely used corpora of the English language.
 



3. Software

 

In order to analyse a corpus and search for certain words or phrases (strings), you need special software. Some software packages are designed for a specific corpus, for example Sara for the BNC or ICECUP for the ICE Great Britain. ‘Concordancers’, on the other hand, can be used for the analysis of almost any corpus.

One of the most frequently used concordancers is WordSmith Tools. Its two most important tools, Concord and WordList, will be explained in more detail below.

As an alternative to WordSmith, you can also use a concordancer called AntConc, which can be downloaded for free. At www.antlab.sci.waseda.ac.jp/antconc_index.html you will find both links for the download and an online help system explaining its basic functions. The most useful functions of AntConc are explained below.

 

 

3.1    WordSmith Concord

 

Click on the Wordsmith icon on the desktop to open the program. Select concord in order to search a corpus for a certain word or phrase. You can now choose a corpus and select those files of the corpus you want to analyse.

As a case study, let us analyse the use of English prepositional phrases by German and Italian learners of English in the International Corpus of Learner English (ICLE). The underlying assumption is that German learners frequently use ‘possibility to do something’ (native language interference from German ‘Möglichkeit, etwas zu tun’) while Italian learners prefer an of-construction as the direct translation of ‘possibilità di fare qc’ (‘possibility of doing sth.’).

In the ‘choose text’ option, mark all texts written by Italian learners (those beginning with ‘it’) and put them into the directory by drag and drop. Click ok.

Screenshot 1: Getting started

 Then go to concord – settings – search word, e.g. ‘possibility’, and click go now (see below for further options to type in a search word or phrase). This will get all occurrences of ‘possibility’ for you to analyse. For a better overview, you can sort them e.g. according to the first word to the right.

Screenshot 2: Concordance 1

The number on the left indicates the number of occurrences; on the right, further information, such as the source files, is provided. In the toolbar, you find a number of functions which are useful to work with the data.

In order to view the samples with larger or smaller context, click on view - grow or view - shrink.

You can also re-sort the data (edit - resort) according to the words in the first to fifth position to the left or the right of ‘possibility’. To compare the use of prepositions occurring with ‘possibility’, for example, sort according to first word right (R1).

Additionally, you can delete irrelevant examples by pressing delete on your keyboard and then choosing the zap-option (edit - zap) from the toolbar. In our example, all hits where ‘possibility’ is not used with a verb phrase (e.g. ‘the possibility for women’, ‘possibility of an adoption’) are deleted, leaving only the occurrences of ‘possibility’ + prepositional verb phrase.

Furthermore, you can view the most frequent collocations or the most frequent clusters by clicking on the respective tabs below. This will show you that Italian learners use ‘possibility to’ (39 occurrences) slightly less often than ‘possibility of’ (43 occurrences), thus neither firmly corroborating nor contradicting the above hypothesis:

Screenshot 3: Clusters

Now you can start a new concordance, e.g. with the German ICLE subcorpus and compare the results.

To continue working with this data at a later point in time, you can save it as a Concord file (file - save as). If you do not have WordSmith on your computer at home, it is better to save the data as a .txt file or as an Excel spreadsheet, as you will not be able to open a Concord file without the WordSmith software. Saving it as an Excel table allows you to work with the data later on, e.g. to copy and paste some of the examples, or to count the occurrences, compare frequencies across different corpora or points in time, and finally draw graphs.

 

Some further options for entering a search word or phrase:

By using the asterisk *, you can widen the scope of your search. For example, entering going as a search word will provide you only with all instances of going; entering going to with all instances of going to. If you type in go*, on the other hand, you will get all words beginning with go-, e.g. going, goes, gold. Searching for *ing, you will get all words ending in –ing, e.g. swimming, dancing, sing.

You can also type in several words as your search words, e.g. go/ going/ goes.
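The search options above can be illustrated outside WordSmith with a small Python sketch. It is not WordSmith's own implementation: the function name `wildcard_to_regex` is hypothetical, and it assumes that `*` stands for any run of letters within a word and that `/` separates alternative search words, as described above.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Turn a search entry such as 'go*' or 'go/ going/ goes' into a regex.

    Assumption: '*' matches zero or more word characters and '/' separates
    alternatives; this only mimics the behaviour described in the text.
    """
    alternatives = []
    for alt in pattern.split("/"):
        alt = alt.strip()
        # Escape the entry literally, then re-enable '*' as a wildcard.
        alternatives.append(re.escape(alt).replace(r"\*", r"\w*"))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)

text = "She goes swimming while gold prices are going up; they sing and go."
print(wildcard_to_regex("go*").findall(text))              # words beginning with go-
print(wildcard_to_regex("*ing").findall(text))             # words ending in -ing
print(wildcard_to_regex("go/ going/ goes").findall(text))  # listed forms only
```

As in the examples above, the wildcard searches also return unwanted words such as gold or sing, which then have to be discarded by hand.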

Additionally, it is possible to analyse the co-occurrence of two words within a certain distance. In order to do this, you need to type in one word as ‘Search Word or Phrase’ and the other as ‘Context Words’ in the tab ‘advanced’. Additionally, indicate the ‘search horizons’, i.e. how many words to the left or right the second word might occur with respect to the first. For example, if you want to analyse a collocation such as have a look in a span of five words, enter have/ has/ had/ having as search word and a look as context word; click on ‘0L’ and ‘5R’ to find all instances where look is found within five words to the right of have.

Screenshot 4: Searching collocates
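The idea of a context word within search horizons can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the function `cooccurrences` is hypothetical, the text is split into words with a simple regular expression, the context word is a single word (look rather than a look), and sentence boundaries are ignored.

```python
import re

def cooccurrences(text, search_words, context_word, left=0, right=5):
    """Return (token position, window) pairs where context_word occurs within
    `left` tokens before or `right` tokens after one of the search_words."""
    tokens = re.findall(r"\w+", text.lower())
    targets = {w.lower() for w in search_words}
    hits = []
    for i, token in enumerate(tokens):
        if token not in targets:
            continue
        before = tokens[max(0, i - left):i]   # empty when left == 0
        after = tokens[i + 1:i + 1 + right]
        if context_word.lower() in before + after:
            hits.append((i, " ".join(before + [token] + after)))
    return hits

sample = "We had a quick look at the data. They have no time today to go and look at it."
for position, window in cooccurrences(sample, ["have", "has", "had", "having"], "look"):
    print(position, window)
```

With horizons 0L and 5R, only the first sentence is reported: in the second one, look is more than five words to the right of have.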

Screenshot 5: Creating a WordList

 

WordSmith WordList

The tool WordList generates word lists of the selected text files and enables you to compare the length of text files or corpora. Moreover, you can use WordList to compare the frequency of a word in different text files or across genres and to identify common clusters.

Choose the text files for your analysis as described in the section above, but now use WordList instead of Concord.

Screenshot 6: Wordlist statistics

 

In the tabs below you can choose between three views: a word list sorted by frequency, a word list in alphabetical order, and a table of statistical information.

‘Wordlist statistics’ compares the frequencies of words in each category of the respective corpus (e.g. in each text of the German and Italian sub-corpora of the ICLE), providing information on the number of words, the average length of words and sentences, etc.
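To make the three views more concrete, here is a minimal Python sketch of a frequency list, an alphabetical list and a few basic statistics. It is not how WordList itself works: the function names are hypothetical, words are identified with a simple regular expression, and sentences are split at ., ! and ?, so the figures will differ from WordSmith's on real corpus files.

```python
import re
from collections import Counter

def word_list(text):
    """Count how often each word form occurs (case-insensitive)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def basic_statistics(text):
    """Return token and type counts plus mean word and sentence length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "tokens": len(tokens),
        "types": len(set(tokens)),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
        "mean_sentence_length": len(tokens) / len(sentences),
    }

sample = "The cat sat on the mat. The dog sat too."
freq = word_list(sample)
print(freq.most_common(2))       # frequency view
print(sorted(freq))              # alphabetical view
print(basic_statistics(sample))  # statistics view
```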

 

For further information and explanations of the different tools, you can always resort to the Wordsmith Tools Help window.

 

3.2     AntConc Concordance tool


This tool shows the words or word strings you want to analyse in their textual context.

1. Start the AntConc programme (download from www.antlab.sci.waseda.ac.jp/antconc_index.html)

2. Select the files you want to analyse: File > Open file(s)

3. Choose the tab "Concordance"

4. Type in a search word (“Search Term”, bottom left-hand corner)

 

Example: how to find all occurrences of make:

-         only one word form: type in make, makes, made, making separately

-         several word forms: use of wildcards

          i.      ma* gives you all of the above word forms, but also all other words beginning in ma-, e.g. man, mankind, marry, etc.: * stands for 0 or more characters

          ii.      ma?e gives you make and made, but also maze, male and mate: ? stands for any 1 character

-         other wildcards

          i.      @ stands for 0 or 1 word

          ii.      # stands for any one word

          iii.      | stands for OR

5. Determine how large the context of the concordance line should be: the default setting of “Search Window Size” is 50 characters, but you generally need more context, so set it to 200 or 250 characters

6. Click “Start”

7. “Concordance Hits” shows you the overall number of occurrences (remember that not all occurrences need to be relevant for your analysis!)

8. If you want to see the whole text of one concordance line, move the mouse over the highlighted search term in the concordance line and click.

9. Deleting unwanted concordance lines: on your keyboard press “control” + click on the line you want to delete, then press “delete” on your keyboard. Click on “Sort” (under the “Search Term” box) to reorder the remaining concordance lines so that you are left with consistent numbering.

10. Save your results: File > Save output to text file
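What a concordance tool does in the steps above can be sketched in a few lines of Python. This is an illustration rather than AntConc's actual behaviour: the function `concordance` is hypothetical, it matches whole words only, and its `window` parameter counts characters on each side of the hit, which is an assumption about how a character-based window might be applied.

```python
import re

def concordance(text, term, window=50):
    """Yield (left context, hit, right context) for each whole-word match,
    with up to `window` characters of context on each side."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = r"\b" + re.escape(term) + r"\b"
    for match in re.finditer(pattern, flat, re.IGNORECASE):
        left = flat[max(0, match.start() - window):match.start()]
        right = flat[match.end():match.end() + window]
        yield left, match.group(), right

sample = "They make plans. We made up our minds, and you make up stories."
lines = list(concordance(sample, "make", window=20))
for number, (left, hit, right) in enumerate(lines, start=1):
    print(f"{number:>3}  {left:>20} [{hit}] {right}")
print("Concordance Hits:", len(lines))
```

Note that a search for make alone does not find made, which is why the separate word forms or a wildcard are needed.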

 

 

How to refine your search:

 

Example 1: Finding the phrasal verb make up

-         Click on “Advanced” next to the “Search Term” box

-         Type in make in the “Search Term” box

-         Activate the box “Contexts Words and Horizons”

-         Type in “up” in the box under “Context Words”, then click on “Add”

-         Define the search horizon (e.g. 0 words to the left and 5 words to the right of make, since up follows the verb)

-         Click on “Apply”

-         Click on “Start”

 

 

Example 2: Finding words clustering around take

-         Select the tab “Clusters”

-         Type in take as search term

-         “Search Term Position”: Decide if you want to find the words preceding (activate “on right”, i.e. take is on the right) or following take (activate “on left”, i.e. take is on the left)

-         Using “Cluster Size”, define how long you want your cluster to be (e.g. at least 3 words including take)

-         Click on “Start”
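The cluster search in Example 2 can be approximated with the following Python sketch. It rests on several assumptions: the function `clusters` is hypothetical, it counts clusters of exactly `size` words rather than a minimum size, and it ignores sentence boundaries, so it is not a reimplementation of AntConc.

```python
import re
from collections import Counter

def clusters(text, term, size=3, term_position="left"):
    """Count word sequences of `size` tokens that start with `term`
    (term_position="left") or end with it (term_position="right")."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != term.lower():
            continue
        if term_position == "left":
            gram = tokens[i:i + size]
        else:
            gram = tokens[max(0, i - size + 1):i + 1]
        if len(gram) == size:  # skip incomplete clusters at the text edges
            counts[" ".join(gram)] += 1
    return counts

sample = "I take care of it. You take care of them. We take part in it."
print(clusters(sample, "take", size=3, term_position="left").most_common())
```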

 

 

Example 3: Finding collocates of take

-         Select the tab “Collocates”

-         Type in take as search term

-         Define the span of words to the left and right of take: “Window span” from e.g. 0L to 5R

-         Click on “Start”
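Example 3 can likewise be sketched in Python. The hypothetical function `collocates` below simply counts the words found within the chosen span and ranks them by raw frequency; a concordancer may rank collocates by a statistical measure instead, so the ordering here is an assumption made for illustration only.

```python
import re
from collections import Counter

def collocates(text, term, left=0, right=5):
    """Count the words occurring up to `left` tokens before and
    `right` tokens after each occurrence of `term`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == term.lower():
            counts.update(tokens[max(0, i - left):i])
            counts.update(tokens[i + 1:i + 1 + right])
    return counts

sample = "I take care of it. You take care of them. We take part in it."
print(collocates(sample, "take", left=0, right=2).most_common())
```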

 

 

WATCH OUT:

 

-         When you have done a search with context words via the “Advanced” search function, and then want to do a search without context words, make sure to clear the context words you used for your previous search.

-         When you are using a part-of-speech-tagged corpus like F-LOB or Frown, and you do not want the tags to show up, go to “Global Settings” > “Tag Settings” > “Hide Tags”




4. Exercises

 

1. The BNC online (www.natcorp.ox.ac.uk) offers a free search facility for simple searches. For example, you can check whether a certain collocation is used by British native speakers. The BNC online counts all instances of your search items but displays at most 50 random examples.

 

 

a) Here you can find some translations of German collocations. Do they exist in English and are they used by native British speakers?

 

  • arm wie eine Kirchenmaus = poor as a church mouse
  • die Ansprüche runterschrauben = to screw down one’s standards
  • ausflippen = to flip out
  • den Nagel auf den Kopf treffen = to hit the nail on the head
  • das Geld zum Fenster rausschmeißen = to throw money out of the window
  • den Wald vor lauter Bäumen nicht mehr sehen = to not see the wood for the trees


b) The following words have a different meaning in German and English. Check their use in the BNC and verify their meaning in the OED (www.oed.com; can only be accessed in the campus network):

 

  • beamer
  • smoking (noun)
  • sympathetic

 

 

2. If you want to analyse different varieties of English, you can use the ICE (International Corpus of English) corpora. For this exercise, refer to the ICE New Zealand.

  

a) ‘Wahine’ is a word from Maori meaning woman/female/wife. How many occurrences can you find both in the written and in the spoken part of ICE NZ?

b) The word ‘panache’ occurs once in ICE NZ - in which file?

c) Find the word nice in ICE NZ. Which adverb does it most frequently collocate with?

d) Which is the most efficient search strategy for finding all instances of to shake x’s head (including its inflected forms)?



5. Points to consider when conducting corpus-linguistic research

 

  • Make sure you have enough time to conduct your corpus-linguistic research! Don’t start two or three days before your actual presentation – you should be finished by then! Depending on your topic/research question, you’ll need two or three weeks to analyse your features.
  • Choose your corpus/corpora carefully: a large corpus (1,000,000 words or more) is usually suitable for most kinds of linguistic research, while a small corpus (200,000 to 500,000 words) may only be sufficient for frequent syntactic structures such as the present perfect or the analysis of the more common modal verbs.
  • Get to know your corpus/corpora: text types, size, language variety, etc.
  • If you compare two or more different corpora, e.g. the German and Swedish ICLE sub-corpora, be aware that each sub-corpus may consist of a different number of words (e.g. 265,341 words in the German ICLE, 248,578 words in the Swedish ICLE). When you present your corpus-linguistic results, you have to make sure that your figures are comparable. It is no use saying that feature X occurred 5 times in the German ICLE and 5 times in the Swedish ICLE if the total numbers of words differ in the two corpora; you need a common basis. You can solve this problem by extrapolating (normalising) your figures to a common denominator. A frequently used common denominator in corpus-linguistic research is 1 million words, but you can also use other figures, e.g. 250,000 words. This is how you calculate the extrapolation: multiply the number of occurrences you counted in corpus A by 1,000,000, then divide this figure by the actual size of corpus A. E.g. feature X occurred 5 times in 265,341 words in the German ICLE: 5 multiplied by 1,000,000 = 5,000,000, divided by 265,341 = 18.84 occurrences of feature X in 1 million words. You then do the same calculation with the results from the Swedish ICLE: 5 multiplied by 1,000,000, divided by 248,578 = 20.11 occurrences of feature X in 1 million words. Although the difference in size between the two corpora used here is small, you still have to do the extrapolation. Otherwise you would be comparing apples and oranges!
  • Be careful to find all occurrences of your feature! If, for example, you search for the collocation make a decision, your search strategy has to be such that you find all inflectional variants of MAKE (mak* would give you make, makes, making but not made. However, it also gives you maker. Ma* would give you also made but then you are faced with any word starting in ma-, such as man, mankind, mad, Mary, etc.) Also, DECISION might be pre-modified by an adjective such as useful or personal which you might want to include in your analysis as well, so make sure you don’t forget these examples during your search (e.g. by using the search string a * decision).
  • Not all concordance lines need to be relevant for your research. If you search for the phrasal verb make up, you will find a number of nominal or adjectival uses of this phrasal verb, such as “She put on her make up” or “Her beautifully made up face”. In WordSmith, you can discard such unwanted concordance lines by highlighting them, then pressing delete. When you have marked all unwanted examples in your concordance in this way, you use the “zap” function so that the unwanted examples are discarded and you are left only with those occurrences you actually need.
  • A high frequency of your researched feature does not necessarily mean that your feature is distributed evenly across the entire corpus you used. Check the file names given for the hits to rule out the possibility that only one or two authors or speakers produced all the examples you have found.
  • Make sure you don’t over-generalise your results. If, for example, you used a very small corpus of written academic American English, you mustn’t claim that your results are valid for American English as a whole or even for English in general. Qualify your research results by saying that your results hold only as far as written academic American English is concerned and that further research into other types of English needs to be conducted for more general conclusions about the features you researched.
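The extrapolation to a common denominator described in the list above can be written as a short Python function. The function name `per_million` is hypothetical; the figures are the sub-corpus sizes given above for the German and Swedish ICLE (265,341 and 248,578 words) and the example count of 5 occurrences.

```python
def per_million(count, corpus_size, base=1_000_000):
    """Normalise a raw count to occurrences per `base` words."""
    return count * base / corpus_size

german = per_million(5, 265_341)
swedish = per_million(5, 248_578)
print(f"German ICLE:  {german:.2f} per million words")
print(f"Swedish ICLE: {swedish:.2f} per million words")
```

Passing a different `base`, e.g. 250,000, gives figures on another common denominator; what matters is that the same base is used for every corpus being compared.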



6. Further reading

 

  • Baker, Paul, Andrew Hardie & Tony McEnery. 2006. A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press.
  • Biber, Douglas, Susan Conrad & Randi Reppen. 1998. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: CUP.
  • Fillmore, Charles J. 1992. “‘Corpus linguistics’ or ‘Computer-aided armchair linguistics’”. In: Svartvik, Jan (ed.) Directions in Corpus Linguistics. Berlin: de Gruyter. 35-60.
  • Kennedy, Graeme. 1998. An introduction to Corpus Linguistics. London & New York: Longman.
  • Leech, Geoffrey. 1992. “Corpora and theories of linguistic performance”. In: Svartvik, Jan (ed.) Directions in Corpus Linguistics. Berlin: de Gruyter. 105-122.
  • McEnery, Tony & Andrew Wilson. 2001. Corpus Linguistics. Edinburgh: Edinburgh UP.
  • Meyer, Charles F. 2002. English Corpus Linguistics: An Introduction. Cambridge: CUP.
  • Scherer, Carmen. 2006. Korpuslinguistik. Heidelberg: Winter.
  • Teubert, Wolfgang & Anna Čermáková. 2007. Corpus Linguistics: A Short Introduction. London: Continuum.


