How it works

The test estimates a respondent’s receptive vocabulary — the number of words that can be recognized in reading and listening. Measuring this precisely would require checking a person’s knowledge of tens of thousands of words one by one, which is unrealistic. Instead, we use Item Response Theory (IRT), a modern framework for designing and scoring tests.

In IRT, a respondent’s vocabulary knowledge is treated as a latent trait (their ability) that can be represented by a single number. The test presents words of various difficulties and asks whether the respondent knows them. For example, “cat” is very easy, while “recusant” is very difficult. Word difficulty is strongly tied to how often a word is encountered: the rarer the word, the harder it tends to be. IRT provides the mathematical foundation for estimating a respondent’s ability from their responses. Once we know a respondent’s ability and the difficulty of every word in our database, we can estimate the probability that the respondent knows each word. Summing these probabilities gives an estimate of the respondent’s total vocabulary size.
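
The summation step can be illustrated with a minimal sketch. It assumes a one-parameter logistic (Rasch) model, which is only one of several IRT models and is not necessarily the one used by the test. The function names and the difficulty values are hypothetical.

```python
import math
from typing import Sequence


def p_known(ability: float, difficulty: float) -> float:
    """Probability of knowing a word under an assumed Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_vocabulary_size(ability: float, difficulties: Sequence[float]) -> float:
    """Sum the per-word probabilities to get an expected number of known words."""
    return sum(p_known(ability, b) for b in difficulties)


# Hypothetical difficulties on a logit scale (not real calibration data).
difficulties = [-3.0, -1.0, 0.0, 1.0, 3.0]
size = expected_vocabulary_size(0.0, difficulties)
print(f"Expected number of known words: {size:.2f}")
```

In a real database the list would hold one difficulty per word (or word family), so the sum runs over tens of thousands of entries rather than five.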

To make the test both quick and precise, we use Computerized Adaptive Testing (CAT). After each response, the system updates its estimate of the respondent’s ability and selects the next word so that its difficulty is close to that estimate. Words near the respondent’s level are the most informative, because the answer to a word that is far too easy or far too hard is almost certain in advance. The estimate becomes more precise with every step, and the test finishes automatically once the required level of precision is reached.
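
The adaptive loop described above might look roughly like the following sketch. It assumes a Rasch model, a grid-based posterior mean as the ability estimate, a normal prior with standard deviation 2, and a stopping rule based on the posterior standard deviation. None of these choices is confirmed by the test itself, and the item bank and the simulated respondent are made up.

```python
import math
from typing import Callable, Dict, List, Tuple

GRID = [i / 10.0 for i in range(-60, 61)]  # candidate ability values, -6.0 to 6.0


def p_known(theta: float, b: float) -> float:
    """Assumed Rasch model: probability of knowing a word of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_ability(responses: List[Tuple[float, bool]]) -> Tuple[float, float]:
    """Posterior mean and standard deviation of ability on a fixed grid."""
    log_w = []
    for theta in GRID:
        lw = -0.5 * (theta / 2.0) ** 2  # assumed normal prior, sd = 2
        for b, known in responses:
            p = p_known(theta, b)
            lw += math.log(p if known else 1.0 - p)
        log_w.append(lw)
    top = max(log_w)  # subtract the maximum to avoid numerical underflow
    weights = [math.exp(lw - top) for lw in log_w]
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(bank: Dict[str, float], answer: Callable[[str], bool],
            target_se: float = 0.5, max_items: int = 30) -> Tuple[float, float, int]:
    """Ask the unused word closest to the current estimate until precise enough."""
    remaining = dict(bank)
    responses: List[Tuple[float, bool]] = []
    theta, se = 0.0, float("inf")
    while remaining and len(responses) < max_items and se > target_se:
        word = min(remaining, key=lambda w: abs(remaining[w] - theta))
        b = remaining.pop(word)
        responses.append((b, answer(word)))
        theta, se = estimate_ability(responses)
    return theta, se, len(responses)


# Hypothetical bank: 41 placeholder words with difficulties from -4.0 to 4.0.
bank = {f"w{i:02d}": -4.0 + 0.2 * i for i in range(41)}


def knows(word: str) -> bool:
    """Simulated respondent who knows every word easier than 0.9."""
    return bank[word] < 0.9


theta, se, n_items = run_cat(bank, knows)
print(f"ability={theta:.2f}, se={se:.2f}, items used={n_items}")
```

Picking the nearest difficulty is the simplest selection rule. Under the Rasch model it coincides with picking the item that carries the most information at the current estimate.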

Word difficulties

Our database contains more than 600 calibrated test words whose difficulties were estimated directly from test-taker data. The remaining words have difficulty values predicted using machine-learning models. These predictions draw on multiple reliable linguistic resources, each capturing a different aspect of word usage:

BNC (British National Corpus) — a large corpus of British English texts from various genres.
COCA (Corpus of Contemporary American English) — a balanced corpus of modern American English from spoken and written sources.
SUBTLEX-US — a corpus of movie and TV subtitles.
enTenTen — a very large corpus of English texts collected from the web.
Word prevalence norms — data on how familiar words are to native speakers.
VXGL (Vocabulary eXpected Grade Level) — a word list tagged with expected U.S. school grade level.
Age of acquisition norms — estimates of the age at which words are typically learned by native speakers.
GSE Teacher Toolkit — word difficulty scores aligned with the Global Scale of English and CEFR levels.
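
As a rough illustration of predicting difficulty for uncalibrated words, the sketch below uses k-nearest-neighbour regression: a new word receives the average difficulty of the calibrated words whose features are closest to its own. The actual models and features used by the test are not specified here, so the algorithm, the two features, and all numbers are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence, Tuple

# Each entry: (feature vector, calibrated difficulty). Hypothetical values only.
# Assumed features: (rarity score, age of acquisition in years).
Calibrated = List[Tuple[Tuple[float, ...], float]]


def knn_predict(features: Sequence[float], calibrated: Calibrated, k: int = 2) -> float:
    """Average the difficulty of the k calibrated words nearest in feature space."""
    nearest = sorted(calibrated, key=lambda pair: math.dist(features, pair[0]))[:k]
    return sum(difficulty for _, difficulty in nearest) / len(nearest)


calibrated_words: Calibrated = [
    ((0.0, 0.0), -2.0),    # a very common, early-learned word
    ((1.0, 1.0), 0.0),     # a mid-range word
    ((10.0, 10.0), 4.0),   # a rare, late-learned word
]

predicted = knn_predict((0.4, 0.4), calibrated_words, k=2)
print(f"Predicted difficulty: {predicted:.2f}")
```

In practice the features would need to be put on comparable scales before distances are computed, and a model with more capacity would usually replace this toy regressor.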

Unit of measurement

The test reports vocabulary in word families. A word family includes a base word, its regular inflections, and its derived forms, following the criteria described in Bauer & Nation (1993). For example, limit, limitation, limitations, limited, limiting, limitless, limitlessly, limits, unlimited all belong to the same family. Our database contains 25,000 word families.
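
To show what counting in word families means, here is a small sketch that maps word forms to a family head and counts distinct families. The “limit” family is taken from the example above. The second family and the function name are hypothetical additions for illustration.

```python
from typing import Dict, Iterable

# Map every word form to the head word of its family.
FAMILY_OF: Dict[str, str] = {
    form: "limit"
    for form in [
        "limit", "limitation", "limitations", "limited", "limiting",
        "limitless", "limitlessly", "limits", "unlimited",
    ]
}
FAMILY_OF.update({form: "cat" for form in ["cat", "cats"]})  # hypothetical family


def count_families(known_forms: Iterable[str]) -> int:
    """Count distinct word families represented among the known word forms."""
    return len({FAMILY_OF[form] for form in known_forms if form in FAMILY_OF})


print(count_families(["limits", "unlimited", "cats"]))
```

Knowing “limits” and “unlimited” therefore counts once, not twice, toward the reported vocabulary size.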

CEFR thresholds

To estimate CEFR levels from vocabulary size, we combined graded word lists from three reputable sources.

These sources allowed us to estimate how many word families a learner is expected to know at levels A1–C1. For the C2 threshold, we used the vocabulary size corresponding to the 25th percentile of adult native speakers, based on data from the myVocab vocabulary test.
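
Once thresholds exist, turning a vocabulary size into a CEFR level is a simple lookup. The sketch below shows one way to do it. The cutoff values are placeholders invented for this example and are not the thresholds used by the test.

```python
from bisect import bisect_right
from typing import List, Tuple

# Placeholder cutoffs in word families. NOT the real thresholds.
THRESHOLDS: List[Tuple[int, str]] = [
    (500, "A1"), (1000, "A2"), (2000, "B1"),
    (4000, "B2"), (8000, "C1"), (12000, "C2"),
]
CUTOFFS = [cutoff for cutoff, _ in THRESHOLDS]


def cefr_level(word_families: float) -> str:
    """Return the highest level whose cutoff the vocabulary size has reached."""
    index = bisect_right(CUTOFFS, word_families)
    return THRESHOLDS[index - 1][1] if index else "below A1"


print(cefr_level(3000))
```

Keeping the cutoffs in a sorted list means a threshold can be revised without touching the lookup logic.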

Reliability

To make sure each result is trustworthy, we run several checks:

Non-word traps. Every test includes a few made-up words. If a respondent marks too many of them as known, the result is flagged as unreliable.
Multiple-choice follow-ups. When a respondent says they know a word, they may be asked to choose its correct meaning from four options. If they make too many mistakes, the result is flagged as unreliable.
Answer pattern check (in progress).
Convergence and consistency check (in progress).

These checks do not change the vocabulary estimate itself; they simply indicate whether the final result can be trusted.
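
The two checks that are already in place could be combined along the following lines. This is a sketch only: the cutoff rates, the class, and the function name are assumptions, since the text does not say how many mistakes count as “too many”.

```python
from dataclasses import dataclass


@dataclass
class ReliabilityReport:
    nonword_ok: bool
    followup_ok: bool

    @property
    def reliable(self) -> bool:
        """A result is trusted only if every implemented check passes."""
        return self.nonword_ok and self.followup_ok


def check_reliability(nonwords_shown: int, nonwords_claimed: int,
                      followups_asked: int, followups_wrong: int,
                      max_nonword_rate: float = 0.2,        # assumed cutoff
                      max_followup_error_rate: float = 0.25  # assumed cutoff
                      ) -> ReliabilityReport:
    """Flag a result without altering the vocabulary estimate itself."""
    nonword_rate = nonwords_claimed / nonwords_shown if nonwords_shown else 0.0
    error_rate = followups_wrong / followups_asked if followups_asked else 0.0
    return ReliabilityReport(
        nonword_ok=nonword_rate <= max_nonword_rate,
        followup_ok=error_rate <= max_followup_error_rate,
    )


report = check_reliability(nonwords_shown=5, nonwords_claimed=3,
                           followups_asked=4, followups_wrong=0)
print(report.reliable)
```

The report is kept separate from the estimate, which mirrors the point above: the checks only say whether the result can be trusted.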