Interview with Mark Davies

June 2, 2009

I had the pleasure of interviewing Mark Davies, creator of COCA (Corpus of Contemporary American English), via Skype. During our conversation, he explained the workings of his electronic corpus, corpus linguistics and some differences between the Spanish and English languages. Below is an extract of what was covered:

MD:        Tell me about what you do.

RJ:        I'm a Spanish-English freelance translator in Santiago, Chile. I specialize in legal texts, especially certificates and divorce decrees from Mexico and Central and South America. I also translate marketing documents from Spain.

MD:        You said that you used COCA. As a native speaker, how does it help you? In most cases, you've already got the intuitions.

RJ:        True. I am a native speaker of English, but the language I know is limited in scope. When I translate, I like to know not just how you could say a word or phrase, but rather the best way to say it.

RJ:        What is a corpus?

MD:        A corpus is a collection of texts almost always in electronic form that can be efficiently searched to get answers to language-oriented questions. An online newspaper could potentially be a corpus. However, corpus linguists typically distinguish between a corpus and a text archive.

A key concept of a corpus is that it should be representative of whatever language it is trying to model. If you want to have a corpus of Spanish and all you have are newspapers, that is not a good corpus. It tells you plenty about newspaper Spanish, but it doesn't tell you about other genres.

In COCA, I try to include an equal number of texts from four different genres to give a sample of the range of English. This is an important concept in corpus linguistics.

RJ:        How do you decide what genres to choose?

MD:        In corpus linguistics, during the past twenty or thirty years, there have been four main genres. On one end is spoken conversation. On the other end are academic journals. In the middle, you have fiction, which can look spoken because of dialog. Then there are magazines and newspapers. They aren't as formal as academic journals, but are more formal than fiction. Those are the four accepted genres to be included in a corpus.

RJ:        Do you have a team to compile the corpora?

MD:        No, I do this work myself.

RJ:        How do you avoid being overwhelmed by the sheer amount of information?

MD:        It is an issue in efficiently searching the data. Once you get to a size of a few million words, if you don't have the right architecture, queries are going to slow down. If you're searching for exact words and phrases, it's not too bad. Think of Google: trillions of words and it takes you a quarter of a second. It's not difficult to search text for exact words and phrases. It is more difficult to search hundreds of millions of words for complex linguistic structures (parts of speech, synonyms, etc.). That is where the right architecture comes into play. Otherwise it isn't feasible.

RJ:        What do you mean exactly by architecture?

MD:        Architecture is the way you store data and the way the words and sentences are annotated (parts of speech, lemmatization). In COCA you can say "Find me all of the forms of all of the synonyms of clean as a verb followed by a definite article or an article followed by a noun." There are few architectures that work for large amounts of data.
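
To make the idea of annotation concrete, here is a minimal Python sketch of that kind of query. It is not COCA's actual implementation: the token list, the tag names, the synonym set and the function name are all assumptions made up for the example. Each token carries its word form, its lemma and a part-of-speech tag, and the search looks for a verb whose lemma is a synonym of "clean", followed by an article and then a noun.

```python
# Hypothetical annotated tokens: (word form, lemma, part-of-speech tag).
TOKENS = [
    ("She", "she", "PRON"), ("cleaned", "clean", "VERB"), ("the", "the", "ART"),
    ("kitchen", "kitchen", "NOUN"), ("and", "and", "CONJ"), ("was", "be", "VERB"),
    ("washing", "wash", "VERB"), ("a", "a", "ART"), ("window", "window", "NOUN"),
    ("in", "in", "PREP"), ("a", "a", "ART"), ("clean", "clean", "ADJ"),
    ("shirt", "shirt", "NOUN"),
]

# Hypothetical synonym set for the verb "clean".
CLEAN_SYNONYMS = {"clean", "wash", "scrub", "wipe"}


def find_verb_article_noun(tokens, lemmas):
    """Return (verb, article, noun) word triples where the verb's lemma is in `lemmas`."""
    hits = []
    for i in range(len(tokens) - 2):
        (w1, lemma1, pos1), (w2, _, pos2), (w3, _, pos3) = tokens[i:i + 3]
        if pos1 == "VERB" and lemma1 in lemmas and pos2 == "ART" and pos3 == "NOUN":
            hits.append((w1, w2, w3))
    return hits


matches = find_verb_article_noun(TOKENS, CLEAN_SYNONYMS)
print(matches)
```

Because the search works on lemmas and tags rather than raw strings, "cleaned" and "washing" both match, while "clean" used as an adjective in "a clean shirt" does not.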

RJ:        What software do you use?

MD:        I use relational databases. I use Microsoft SQL Server, but I have to write all of the search algorithms.
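
As a rough illustration of how a relational database can hold an annotated corpus, the sketch below stores one row per token and finds a phrase pattern with a self-join on adjacent positions. It uses Python's built-in SQLite module rather than Microsoft SQL Server, and the table name, column names and sample rows are hypothetical; the interview does not describe the real schema or search algorithms.

```python
import sqlite3

# In-memory database with one row per token (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tokens (pos_id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT)"
)
rows = [
    (1, "He", "he", "PRON"), (2, "scrubs", "scrub", "VERB"), (3, "the", "the", "ART"),
    (4, "floor", "floor", "NOUN"), (5, "then", "then", "ADV"), (6, "wiped", "wipe", "VERB"),
    (7, "a", "a", "ART"), (8, "table", "table", "NOUN"),
]
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?)", rows)

# An index on (lemma, pos) lets the database narrow down candidate verbs quickly.
conn.execute("CREATE INDEX idx_lemma_pos ON tokens (lemma, pos)")

# Self-join on consecutive positions: verb (by lemma) + article + noun.
query = """
SELECT t1.word, t2.word, t3.word
FROM tokens t1
JOIN tokens t2 ON t2.pos_id = t1.pos_id + 1
JOIN tokens t3 ON t3.pos_id = t1.pos_id + 2
WHERE t1.pos = 'VERB' AND t1.lemma IN ('clean', 'scrub', 'wipe')
  AND t2.pos = 'ART' AND t3.pos = 'NOUN'
ORDER BY t1.pos_id
"""
phrases = conn.execute(query).fetchall()
print(phrases)
```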

RJ:        What have you learned from your corpus? What have you been able to glean from it?

MD:        Wow. Lots. I'm an historical linguist by training. I've created corpora to study how language changes. I just got a large grant from the National Endowment for the Humanities to create a large corpus of historical American English. I've done that for Portuguese and for Spanish. That work gives insight into how and why language changes. Another important thing is genre. When we say "English does this" or "Spanish does this," that is simplistic. We need to say "Spoken English does this" or "Newspaper Spanish does this." The language of different genres is utterly different. Corpora can help us figure that out. I also use corpora for teaching. When I taught Spanish, I would have my advanced students look at the corpora and realize that things were usually a lot more complicated than the simplistic grammar rules.

Corpora also help us with semantics, i.e. finding out what words mean from context. Corpora give us collocates, i.e. words that occur near other words.
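
A minimal Python sketch of collocate counting follows. The sample sentence, the window size and the function name are assumptions chosen for illustration; a real corpus tool would typically also weight the counts statistically.

```python
from collections import Counter


def collocates(tokens, node, window=2):
    """Count the words that occur within `window` positions of each `node` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


# Invented sample text, lowercased and split on whitespace.
text = "she likes strong coffee and strong tea but he only drinks strong coffee"
top = collocates(text.split(), "strong", window=1)
print(top.most_common(3))
```

Here "coffee" turns up next to "strong" more often than any other word, which is the kind of pattern that hints at what a word means and how it is used.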

RJ:        Would you say it has a use for translating?

MD:        Definitely. At least 30% of COCA users use it for translation. You'll have someone from Belgium or Chile or Hong Kong who speaks English well but is not a native speaker of English. They make lots of queries every day on nuances; what is used in which genre.

RJ:        I must confess that I really enjoy using COCA and seeing how words fit together. Sometimes I do it just for the fun of it.

MD:        Even for non-linguists and non-translators, people interested in language, there are a lot of users who use it at that level. In COCA for any given week, there are 7,000-8,000 unique users.

RJ:        I found COCA to be quite user friendly and much better than Google which I use as something of a giant corpus.

MD:        Google has fundamental problems as a corpus. It's not tagged for parts of speech or lemmas, so it's difficult to perform grammar-oriented queries. You can't do substrings, suffixes and prefixes. You can't do morphology or word formation. There are two problems that are worse still. First, you will see that a certain word or phrase occurs x number of times, but that doesn't give you any sense of whether it's formal or informal. In other words, Google doesn't know about genres. Second, when you enter a phrase of more than one word, the number of hits that Google returns (e.g. 79,000 times) is just a wild guess. Google doesn't know much about frequencies; it just guesses. If you try to determine which is more common, A or B, you have no way of knowing. The number of hits is meaningless in this case. With a corpus, you can figure that out.
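
As a small illustration of the kind of comparison a genre-labelled corpus allows, the Python sketch below counts two words in each genre and normalizes by genre size. The two tiny "genres" and their sentences are invented for the example, and the function name is hypothetical; real corpora usually report frequencies per million words.

```python
# Toy genre-labelled corpus; the sentences are invented for illustration.
CORPUS = {
    "spoken": "well i reckon we should buy it and i reckon they agree".split(),
    "academic": "we consider the first model and we consider the second model too".split(),
}


def per_thousand(tokens, word):
    """Occurrences of `word` per 1,000 tokens, so genres of different sizes are comparable."""
    return 1000 * tokens.count(word) / len(tokens)


rates = {
    genre: {word: per_thousand(tokens, word) for word in ("reckon", "consider")}
    for genre, tokens in CORPUS.items()
}

for genre, by_word in rates.items():
    print(genre, by_word)
```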

RJ:        I wanted to ask you about the difference between Old Spanish and Old English. I got the impression that Old Spanish is closer to Modern Spanish than Old English is to Modern English. What is your impression?

MD:        In English, you have a Germanic language, 99% Germanic in 1000 A.D. Then Romance was added on top of Germanic. In Spanish, however, you don't have that. That's why Old English is so hard to understand. It's a different language because we have to go back beyond the Norman Conquest.

RJ:        Would you say that Spanish is more continuous in that regard?

MD:        Yes, definitely. English is languages A and B, Germanic and Romance, going to language C. Spanish is really just language A going to language A. Sure, there are changes, but it isn't this huge mix of two totally different languages like you get in English. Also, Spain and England were in very different situations in the 1700s. England was open to the scientific advances in the rest of Europe, whereas Spain had cut itself off from this. That's why English has always been more open than Spanish.

RJ:        You mentioned you were publishing a book. Could you share that with our readers?

MD:        Back in 2005, I did a frequency dictionary in Spanish with Routledge. It was based on the Corpus del Español; 5,000 lemmas in Spanish. Then in 2007 I did that for Portuguese, and when COCA came out last year, I contacted Routledge. They said they wanted to do a frequency dictionary of English. This will be the top 5,000 words based on frequency. Unlike the other dictionaries, for each word, this one will give you the top collocates (words that co-occur with the entry). This will help give us a picture on what that word does. That book will probably be available in December of this year.

RJ:        Listening to you talk about these materials makes me want to hole up and do nothing else but devour them.

MD:        It gets addictive. Sometimes I'm reading the newspaper or a magazine article and I see a word and I say to myself: "I wonder what's going on with that phrase. Is it used more now than it was twenty years ago? How does it compare with this other word? I get the feeling it's informal. Is it informal?" I have intuitions as we do as native speakers. But to be able to go to a corpus, put in a query and answer all of those questions within two or three seconds gets addictive.

The fun thing about language is that on one hand it is complex. On the other hand, we're surrounded by it every second of the day. Particle physics is complex, but we're not surrounded by it every day. Language, however, combines those two factors.

RJ:        I'm through with my questions. Is there anything you would like to add, Mark?

MD:        Yes. I have set up a portal for corpora that you might like to mention. It allows you to say "I use the corpus for translation. I am from Spain or France or Chile." Then you can find other people who use the corpus for the same types of reasons from the same part of the world and you can get in touch with those people.

RJ:        Thank you very much for your time, Mark.

MD:        Certainly. Good luck.

About Mark Davies:

I am a professor of Corpus Linguistics in the Department of Linguistics and English Language at Brigham Young University in Provo, Utah. From 1992 to 2003, I was a professor of Spanish Linguistics at Illinois State University.

My primary areas of research and activity are:
--            Corpus and computational linguistics
--            Design and optimization of linguistic databases
--            Web scripting and web-database integration
--            Historical linguistics and syntactic variation
--            English, Spanish, and Portuguese

Reed D. James

