This is going to start out a bit technical and academic, but I promise that if you wade through it you’ll be rewarded with some very useful, practical information that can significantly improve how you learn Spanish, or any other second language: not just learning it better but learning it more efficiently, so that you need far less time to become fluent. It will also help you design a study system based on precisely what you want to do with Spanish: speak with native speakers, read fiction, read and/or write in a technical or academic field, or some combination thereof. First, let’s start with some definitions so that we can understand what’s going on here:
Lexeme: A lexeme is a word reduced to its most basic meaning. For example, the word “water” could be a noun referring to H2O, or it could be a verb referring to the act of giving water to a plant, so it counts as two separate and distinct lexemes even though it’s the same word, “water”. The reason it’s done this way is that learning both definitions is, for our purposes, the same as learning two different words that each have only one definition: it requires the same amount of time, effort, and memory space in your head. So when we ask “how many words do you need to know?” we’re counting lexemes, or in lay terms, “definitions”. In other words, each definition counts as a separate “word” regardless of whether those definitions share the same combination of letters (what would commonly be called a word): the noun “water” and the verb “water” are two separate lexemes. I should also note that different forms of the same word (such as verb conjugations) count as one lexeme as long as the same basic definition is maintained, so “is” and “are” are not two separate lexemes, but one.
Corpus: Latin for “body”. The body of material that you base your information on, in this case books, newspapers, transcripts of spoken language, etc. Basically it means your data set. With regard to determining word frequency, the corpus is what you looked at to determine which words occur and how often. If your corpus for making an English frequency list is 13th-century Bibles, then your data isn’t going to be very relevant to the contemporary language.
Register: The setting in which the language is used. We’ll use three distinct registers: oral (spoken language), written fiction, and written non-fiction.
Range: How widely used the word is. If your corpus consists of four books, and a particular word makes up 5% of all words in one book and never occurs in any of the others, that one book will make the word look a lot more important than it really is; you would say that word “has a very narrow range”. For example, if a diet book is part of a small corpus, you might find that words like “protein”, “cardiovascular”, and “glycemic” end up on your frequency list when they probably shouldn’t, because they aren’t often used in daily conversation or most written communication. The corpus contains something with a very narrow and specific subject matter, and therefore some of the words used in it have a very narrow range. Usually it wouldn’t be this extreme, since a competent researcher would remove such obvious outliers from the corpus, but you will see words that are used very frequently in written communication and hardly at all in spoken communication, and vice versa. That is important and something to take note of.
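To make the difference between frequency and range concrete, here is a minimal Python sketch. The four tiny “books” are made-up data and `frequency_and_range` is a hypothetical helper name of my own, not anything taken from the study discussed below; it simply counts how often each word occurs overall and in how many documents it shows up.

```python
from collections import Counter

def frequency_and_range(documents):
    """Return {word: (total_count, number_of_documents_containing_it)}."""
    totals = Counter()
    doc_counts = Counter()
    for text in documents:
        words = text.lower().split()
        totals.update(words)
        doc_counts.update(set(words))  # count each word once per document
    return {word: (totals[word], doc_counts[word]) for word in totals}

# Made-up toy corpus: the second "book" plays the role of the diet book.
corpus = [
    "the dog ate the food",
    "protein protein protein is the key",
    "the cat saw the dog",
    "the food is here",
]

stats = frequency_and_range(corpus)
print(stats["protein"])  # frequent overall, but found in only one document
print(stats["the"])      # frequent and found in every document
```

A word like “protein” in this toy data has a respectable total count but a range of one document, which is exactly the kind of word you would want to treat with suspicion on a frequency list.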
First, some data
The primary study that I’m going off of here is the one done by Mark Davies at Brigham Young University.
This is a truly unique and valuable study: you would think that there would be beaucoup data out there on word frequency lists and dictionaries in Spanish, but there isn’t. The primary reason is that it’s so damned difficult, time-consuming, and expensive to do a study like this properly. The most recent comparable study before this one was done in 1964 (he mentions it at the beginning).
They extracted the 6000 most frequent lexemes and broke them down by written fiction, written non-fiction, and oral (spoken); they then further organized the data by lexeme type (noun, adjective, adverb, etc.), so you’ll see which type of word is used the most, which allows you to focus your studies appropriately.
By the way, the frequency dictionary that this study was written about, which contains the entire list of the 6000 highest-frequency lexemes, is available on Amazon if you’re interested. For some reason the hardcover is ridiculously expensive at $135, but there are paperback and Kindle editions at much more reasonable prices ($34.15 and $28.76, respectively).
Right, let’s get down to brass tacks. According to the above study, for Spanish:
- Learning the first 1000 most frequently used words in the entire language will allow you to understand 76.0% of all non-fiction writing, 79.6% of all fiction writing, and an astounding 87.8% of all oral speech.
- Learning the top 2000 most frequently used words will get you to 84% for non-fiction, 86.1% for fiction, and 92.7% for oral speech.
- And learning the top 3000 most frequently used words will get you to 88.2% for non-fiction, 89.6% for fiction, and 94.0% for oral speech.
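Coverage figures like the ones above come from a simple calculation: rank the words by frequency, add up the occurrences of the top N, and divide by the total number of words in the corpus. Here is a rough Python sketch of that idea. The sample sentence is made up, `coverage` is a hypothetical helper name, and it counts raw word forms rather than lexemes, so treat it as an illustration of the arithmetic and not as the study’s actual method.

```python
from collections import Counter

def coverage(tokens, top_n):
    """Fraction of all tokens accounted for by the top_n most frequent words."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)

# Made-up sample; a real frequency study uses millions of words.
text = "el perro y el gato y el pez viven en la casa de la familia"
tokens = text.split()

for n in (1, 3, 5):
    print(n, f"{coverage(tokens, n):.1%}")
```

Even in this tiny sample you can see the pattern from the study: a handful of very common words accounts for a large share of everything that is said, and each additional word adds less than the one before it.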
Essentially, if you’re primarily interested in speaking to people, as most language learners are, just learning the top 1000 words will get you to the point where you can understand nearly 90% of the spoken language (87.8%, per the figures above). This is more than enough to muddle through nearly any conversation. Sure, you’ll have to stop the speaker frequently to get them to define words for you, and you may have to pull out your dictionary quite often, but my point is that it’s enough of a base for you to actually start speaking to people, which is the most important part of learning any language. You’ll be able to say nearly anything you need to in some way or another, and you should be able to understand the general gist of what someone else is saying to you, even if you do have to stop and ask them for help a few times.
Professor Arguelles, arguably one of the world’s foremost experts on language learning, who is himself fluent in eleven languages and has studied 58 at some point or another, addressed this in a fascinating thread on my favorite language-learning forum, HTLAL, concerning how many words you need to learn. He is directly addressing the above study in this quote, and he does a superb job of boiling it down in practical terms that are useful to us language learners:
“The maddening thing about these numbers and statistics is that they are impossible to pin down precisely and thus they vary from source to source. The rounded numbers that I use to explain this to my students I usually write in a bull’s eye target on the whiteboard, but I don’t have the computer skills to draw circles in this post, so I will just have to give a list:
250 words constitute the essential core of a language, those without which you cannot construct any sentence.
750 words constitute those that are used every single day by every person who speaks the language.
2500 words constitute those that should enable you to express everything you could possibly want to say, albeit often by awkward circumlocutions.
5000 words constitute the active vocabulary of native speakers without higher education.
10,000 words constitute the active vocabulary of native speakers with higher education.
20,000 words constitute what you need to recognize passively in order to read, understand, and enjoy a work of literature such as a novel by a notable author.”
Now, in the above study by Davies, here’s where things start to get really interesting:
“Assume that a language learner is aiming for 90% coverage in each of the four parts of speech that represent open classes — nouns, verbs, adjectives, and adverbs. This 90% figure will be obtained by knowing about 2600 nouns, 230 verbs, 980 adjectives, and 50 adverbs, or a total of about 3800 total forms.” [refer to page 110 of the study for a detailed table that breaks down these four word types in much greater detail]
So you can see that nouns completely dominate the average spoken vocabulary (the above data is from the spoken, not written, corpus), constituting 2600 out of roughly 3800 lexemes, which is 68.4%, or more than two-thirds, of the lexemes needed for that 90% coverage. You should keep in mind, however, that each verb is counted as a single lexeme no matter how it is conjugated, so saying that you only need to know 230 verbs is a bit misleading: you not only have to know each of those verbs, you also have to know a bunch of different conjugations for each one (e.g. you don’t just have to learn ‘ser’, you have to learn ‘soy’, ‘eres’, ‘es’, ‘somos’, ‘son’, ‘fui’, ‘fuiste’, ‘fue’, ‘sea’, ‘seamos’, ‘sean’, etc.).
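If you want to check the arithmetic yourself, here is a short Python sketch using the four figures from the Davies quote. One thing it makes visible: the four numbers add up to 3860, which the study rounds to “about 3800”, so the noun share against the exact sum comes out a touch lower than the 68.4% figure computed against 3800. Either way it is about two-thirds.

```python
# Figures quoted from the Davies study for 90% coverage per open word class.
needed = {"nouns": 2600, "verbs": 230, "adjectives": 980, "adverbs": 50}

total = sum(needed.values())  # exact sum of the four quoted figures

# Share of the total taken up by each part of speech.
shares = {pos: count / total for pos, count in needed.items()}

print("total:", total)
for pos, share in shares.items():
    print(f"{pos}: {share:.1%}")
```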
Also, they found that (here’s where we get into register and range) certain words had a very high frequency in one of the three registers (oral, written fiction, written non-fiction) but barely appeared at all in the other two, or were present in two (typically both written registers) but hardly at all in the third. So some words are far, far more valuable to learn than others depending on which register you’re most interested in becoming proficient in. Have a look at the two tables below. The first shows the ten words with the greatest difference between oral and non-fiction that are extremely common in speech; the second shows the ten words with the greatest difference that are extremely common in written non-fiction but rare in oral speech. I’m not sure I explained that well; if not, leave a comment and I’ll try again:
Why? For what purpose are you learning this language?
How you intend to use the language in question (Spanish or whatever the case may be for you) is very important in determining which words you should focus on. This comes into play primarily with regard to whether you’re more concerned about the spoken language or the written language. Most language learners are far more concerned about being able to actually speak to native speakers than they are with anything else, though there are exceptions, such as an engineer who only wants to read the original German or Japanese instruction manuals and schematics for the devices used in his field and does not need to speak the language. There are also special needs: someone who is most interested in the spoken language but also needs special emphasis in a certain area, such as the businessman who not only wants to speak basic everyday Japanese but also needs to learn business terms that are specific to his job and wouldn’t be common anywhere else.
So…what are you going to use it for? Do you have any special needs or areas of interest whose terminology you would like to learn in your target language? I’m a pretty big computer nerd, so in addition to everyday spoken Spanish, I might also like to know how to say things like “hard drive”, “TCP/IP”, “Python [the programming language]”, “blog”, “forum”, “social news”, “search engine”, “link”, etc. See what I mean? Don’t neglect areas like that; everyone has some. Whether you’re into cars or rugby or chess or collecting dead insects, you’re likely going to want to know how to say the words and phrases that are common only in those specific subjects.
Practical Application, or: What’s the point of all this?
Look, if you use a quality SRS (Spaced Repetition Software) like Anki and spend 30-45 minutes a day studying vocabulary, you can very easily learn 20, 30, even 50 new words per day, up to the point where you’ve got a couple thousand words in your target language within a month or two. If you do that, and maybe practice speaking a bit by watching subtitled movies and repeating after the native speakers (pause, repeat what someone just said, rewind and repeat as necessary until you’ve got it) for a couple of weeks, you’ll be at the point where you can start conversing with native speakers via a good language exchange like The Mixxer. You’ll be awkward and slow at first, but you will be able to muddle through, and you will pick up speed very, VERY rapidly if you make it a habit to speak with a native for an hour or so a day, every day (remember: consistency!!!). I promise you, you’ll be conversationally fluent within a couple of months of the time that you started conversing with natives. Boom, you’re there. Here’s a quote from someone commenting on that HTLAL thread I mentioned above:
“I can add from my experience that knowledge of about 1500 words allows you to get a fairly general picture of everything you read. This is the number of Hungarian words I learned since march. I write them all down on flashcards and count how much each day – that’s why I can pinpoint the number.
At the same time it is obvious that my 1500 word vocab isnt’t tweaked to efficency in basic communication. I simply write down and translate everything I read and lately also the words I manage to pick up from radio. Thats why I know the hungarian word for “voter turnout” but I don’t know yet how to book a flight or hotel room :/”
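Coming back to the daily study pace suggested before that quote, it’s worth running the numbers for your own target. This is just a back-of-the-envelope Python sketch: `days_to_target` is a hypothetical helper, and it only counts how long it takes to introduce the new words, ignoring the review time an SRS piles on top.

```python
import math

def days_to_target(target_words, new_words_per_day):
    """Whole days needed to have introduced target_words new words."""
    return math.ceil(target_words / new_words_per_day)

# How long does a 2000-word base take at each of the paces mentioned?
for pace in (20, 30, 50):
    print(f"{pace} words/day -> {days_to_target(2000, pace)} days")
```

At 50 words a day a 2000-word base takes 40 days; at 20 a day it takes 100, so pick a pace you can actually keep up every day.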
Also, I’ll tell you right now that the best way to learn vocabulary is to do it contextually. What does that mean? It means taking material that you’re actually interested in reading or listening to and extracting vocabulary from it to learn, as opposed to going off of some kind of list. Check out Tim Ferriss’ awesome post, How to Learn Any Language in 3 Months, which delves a lot more into the matter: he talks about learning Japanese through something he was genuinely interested in, judo. Pick a book, podcast, movie, or whatever in the subject that you’re interested in, dive straight into it, and enter every word you come across that you don’t know into your SRS or write it on a flashcard for review. You will learn massive amounts of vocabulary that way in very short periods of time, believe me.
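If you’re reading on a computer, pulling out the words you don’t know yet can be partly automated. Here’s a minimal Python sketch of the idea, assuming you keep your known words in a plain set; the `unknown_words` function, the tiny known-word set, and the sample passage are all made up for illustration, and it works on raw word forms, so conjugated verbs will show up as separate entries.

```python
import re

def unknown_words(text, known):
    """Return the words in text that are not in known, in first-seen order."""
    seen = set()
    result = []
    # [^\W\d_]+ matches runs of letters, including accented ones like "í".
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word not in known and word not in seen:
            seen.add(word)
            result.append(word)
    return result

# Hypothetical list of words you already know.
known = {"el", "la", "de", "en", "es", "un"}
passage = "El judo es un arte marcial. El arte de la caída es difícil."

for word in unknown_words(passage, known):
    print(word)  # candidates to add to your SRS or flashcards
```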
And this leads into my last, and most important, point: all of this is just a means to an end, and that end is speaking. You must speak. The whole point of figuring out all this word frequency crap is just so you can get away from it as fast as possible and into the realm of actually talking to native speakers, because that is where you really learn the language. Memorizing all the vocabulary and grammar rules in the world, as my friend Benny loves to say, will not ever get you anywhere near fluent. I’ll leave you with a quote from a native Czech speaker and fellow language nerd (it’s the last post in that HTLAL thread):
“Yesterday I met a woman who has been taking Czech lessons twice a week for two years. I asked her some very simple questions “Do you like coffee?”, “Are you Czech?” and she was completely tongue tied. The best she could manage was “Urm, arm, yes” to the first question, and “no” to the second.
At first I imagined she didn’t know much Czech at all. I decided to probe into her vocabulary, and found it was quite extensive. She knew words like “octopus” and “hovercraft” in Czech. Yes somehow couldn’t say “To be honest, I prefer tea”.
I gave her a two hour lesson in how to construct useful conversational phrases. Starting off with simple things like “I have to say that ..” and “Don’t be upset, but” and building up and chaining these things together into more complex sentences such as “That isn’t something I have given much thought to, but … now that I reflect on it, … my personal opinion is …”
She told me it was a very uplifting lesson, since she now felt “fluent” in Czech rather than being frozen with a trapped vocabulary of thousands of words. In fact, she got back to me later that after the lesson, she went into the city and had sophisticated and stressless conversations in a couple shops and with a waitress in an ice-cream parlour.
Of course, I was delighted to hear this, and it certainly gave my ego a boost. But, what was most joyful for me to hear is that it would now give her future learning a “usefulness filter”. She said that now she wouldn’t just remember lists of words, but rather filter them through how useful they would be in real conversations, and that real conversations, with real people, will help her get a reality check on this as she goes along.”
You can learn all the vocabulary in the world, but if you don’t learn how to use it, you’re never going to be fluent, and the only way to do that is to speak with native speakers.
Additional Resources and Further Reading
Here’s an excellent paper by Paul Nation and Robert Waring at the Notre Dame Seishin University in Japan called: Vocabulary Size, Text Coverage And Word Lists
Here’s a very widely circulated list of the 1000 most common words in English.