Linguistics · 5 min read · March 7, 2026

What Are Hapax Legomena and How to Calculate Them

A hapax legomenon is a word that appears exactly once in a text. Learn what this linguistic concept reveals about vocabulary richness and writing style.


Among all the statistics that linguists use to analyse a text, few are as simple, or as revealing, as the hapax legomenon: a word that appears exactly once in a given text or corpus. Count the hapaxes, compare them to the total vocabulary, and you get a useful signal about the richness and diversity of a writer's language.

The Etymology

The term comes from Ancient Greek: ἅπαξ λεγόμενον (hápax legómenon), meaning "(something) said only once." Classical scholars originally used it to describe words found only once in the entire surviving corpus of a language — making them notoriously difficult to translate, because there is no other context to help infer their meaning.

Today the term is used more broadly in corpus linguistics and computational text analysis to describe words that appear exactly once within any defined text or corpus.

How to Calculate Hapax Legomena

Counting hapaxes is straightforward once you have a word frequency table:

  1. Tokenise the text — split it into individual words.
  2. Normalise the tokens — convert to lowercase, strip punctuation.
  3. Count the frequency of each unique word.
  4. Count how many unique words have a frequency of exactly 1. That is your hapax count.

Hapax Count = number of unique words with frequency = 1
Hapax Ratio (%) = (Hapax Count ÷ Total Unique Words) × 100

For example, if a 1,000-word text has 400 unique words and 220 of them appear only once, the hapax ratio is 220 ÷ 400 × 100 = 55%.
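The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular tool does it: the function name `hapax_stats` and the tokenising pattern are assumptions, and the pattern only recognises unaccented English letters with an optional internal apostrophe.

```python
import re
from collections import Counter

def hapax_stats(text):
    """Return (hapax_count, unique_word_count, hapax_ratio_percent)."""
    # Steps 1-2: lowercase the text, then keep runs of letters
    # (allowing one internal apostrophe, as in "don't").
    # Assumption: plain a-z only; accented letters would need a wider pattern.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Step 3: frequency of each unique word.
    freq = Counter(tokens)
    # Step 4: unique words seen exactly once.
    hapax_count = sum(1 for n in freq.values() if n == 1)
    unique = len(freq)
    ratio = 100 * hapax_count / unique if unique else 0.0
    return hapax_count, unique, ratio

sample = "The cat sat on the mat. The dog sat too."
count, unique, ratio = hapax_stats(sample)
print(f"{count} hapaxes out of {unique} unique words ({ratio:.1f}%)")
```

In the sample sentence, "the" and "sat" repeat, so the five remaining words are the hapaxes out of seven unique words.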

What Does the Hapax Ratio Tell You?

The hapax ratio is a proxy for vocabulary richness. The higher the proportion of words that appear only once, the more varied and diverse the vocabulary — and the less repetition there is in the text.

  • High hapax ratio (60%+): Rich, varied vocabulary. The writer is not repeating themselves. Common in literary fiction, essays, and academic writing.
  • Moderate hapax ratio (40–60%): Balanced. Some words are reused for emphasis or coherence, but vocabulary is still diverse.
  • Low hapax ratio (below 40%): Repetitive vocabulary. Common in children's books, instructional texts, or texts that deliberately repeat key terms for clarity.
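Note: A high hapax ratio is not always better. Technical writing and instructional content intentionally reuse terminology to avoid confusion. The ratio also tends to fall as a text gets longer, so these bands are rough guides best applied to texts of similar length. Context determines what is appropriate.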
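As a small sketch, the bands above can be turned into a lookup. The function name `describe_hapax_ratio` is hypothetical, and the 40% and 60% cut-offs are simply this article's rules of thumb, not a standard.

```python
def describe_hapax_ratio(ratio_percent):
    """Map a hapax ratio (in percent) to the rough bands used in this article."""
    # Thresholds are rules of thumb, not a standard.
    if ratio_percent >= 60:
        return "high"
    if ratio_percent >= 40:
        return "moderate"
    return "low"

for value in (72.0, 55.0, 31.5):
    print(value, describe_hapax_ratio(value))
```

The 55% ratio from the worked example falls in the moderate band.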

Hapax Legomena in Practice

Authorship Attribution

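Hapax counts are used in stylometry, the statistical analysis of writing style. Authors tend to show characteristic vocabulary patterns, including hapax ratios, that can help when authenticating disputed texts. The disputed Federalist Papers are the classic test case: the best-known study, by Mosteller and Wallace, relied mainly on function-word frequencies to separate Madison from Hamilton, and later studies have applied vocabulary-richness measures, including hapax-based ones, to the same question.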

Language Learning

For language learners, hapax-heavy texts are more challenging because each hapax introduces a word the reader is unlikely to encounter again soon, giving fewer opportunities for natural repetition and reinforcement.

Corpus Linguistics and Zipf's Law

Hapax legomena are a natural consequence of Zipf's Law, which states that in a large corpus the frequency of a word is roughly inversely proportional to its rank in the frequency table. The long tail of the frequency distribution, the words that appear very rarely, is made up largely of hapaxes. In large corpora, roughly half of all unique words tend to be hapaxes.
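One way to see this tail is to build a frequency spectrum (how many distinct words occur once, twice, and so on) alongside a rank-frequency table. The sketch below uses hypothetical helper names and a toy token list, so it only illustrates the bookkeeping. Under an ideal Zipf distribution, rank × frequency would stay roughly constant, which real texts only approximate.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency n to the number of distinct words occurring n times."""
    word_freq = Counter(tokens)
    return Counter(word_freq.values())

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency), most frequent first."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, n, rank * n)
            for rank, (word, n) in enumerate(ranked, start=1)]

# Toy data; a real check needs a corpus of many thousands of words.
tokens = "a a a a b b c d e".split()

spectrum = frequency_spectrum(tokens)
print("hapaxes:", spectrum[1])  # words that occur exactly once
for row in rank_frequency(tokens):
    print(row)
```

Here `spectrum[1]` is the hapax count, and `spectrum[2]` counts words that occur exactly twice.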

How TextAnalyzer Displays Hapax Legomena

TextAnalyzer counts the total number of hapaxes in your text and shows both the raw count and its proportion relative to the total unique word count. You can explore which specific words are hapaxes by using the Word Frequency table — any word with a frequency of 1 is a hapax legomenon.

Tips for Writers

  • If your hapax ratio is very low, look for overused words in the frequency table and consider replacing some with more precise synonyms.
  • If your hapax ratio is very high in a technical document, check whether you're introducing too many new terms without reinforcement.
  • Use the frequency table alongside hapax counts to get a complete picture of your vocabulary usage patterns.
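For the first tip, a rough way to surface overused words is to list the most frequent repeated words in a draft. This is a sketch under stated assumptions: `most_repeated` is a hypothetical helper, and the `min_length` filter is only a crude stand-in for a proper stop-word list, used here to skip short function words such as "the" and "of".

```python
import re
from collections import Counter

def most_repeated(text, top_n=5, min_length=4):
    """Return up to top_n (word, count) pairs for words used more than once."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Assumption: dropping short words roughly filters out function words.
    freq = Counter(t for t in tokens if len(t) >= min_length)
    return [(word, n) for word, n in freq.most_common(top_n) if n > 1]

draft = "Really good ideas need really good words, really."
print(most_repeated(draft))
```

Words that top this list are candidates for a synonym or a rewrite, although in technical documents repeated terms are often deliberate.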

Try it yourself

Paste your text into TextAnalyzer to see all these statistics — and more — calculated instantly.
