Corpus of English Children’s Literature

(COECL)

Would you like access to COECL for researching children’s literature? If so, please message me.

  • 178 unabridged texts of English children’s literature

  • published between 1900 and 2020

  • 7,603,947 words

  • includes many texts consistently present in primary schools and children’s libraries

COECL for Graphophonemic Analysis

(COECL-GPA)

This corpus for graphophonemic analysis takes the 5,000 most frequent words in COECL (covering ~90% of the original corpus) and codes 8,478,734 vowels into graphotactics (spelling patterns), phonemic sequences (sound patterns), and their correspondences. It is the first corpus of its kind and a resource for studying the natural distribution of English phonics in authentic children’s literature.
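As a rough illustration of the word-selection step described above, the sketch below counts word frequencies and reports how much of a token list the top-n words cover. It is only a minimal sketch, not the procedure used to build COECL-GPA: the function name `top_n_coverage` and the toy sentence are hypothetical, and real tokenization of the corpus would need more care than a whitespace split.

```python
from collections import Counter


def top_n_coverage(tokens, n):
    """Return the n most frequent words and the share of all tokens they cover."""
    counts = Counter(tokens)
    top = counts.most_common(n)              # list of (word, frequency) pairs
    covered = sum(freq for _, freq in top)   # tokens accounted for by the top-n words
    total = sum(counts.values())             # all tokens in the input
    share = covered / total if total else 0.0
    return [word for word, _ in top], share


# Toy data for illustration only; not drawn from COECL.
tokens = "the cat sat on the mat and the dog sat too".split()
words, share = top_n_coverage(tokens, 2)
print(words, round(share, 3))
```

With a full corpus, the same calculation with n set to 5,000 would give the kind of coverage figure quoted above.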

In January 2025, the graphophonemic coding for COECL-GPA was updated with more refined graphotactics. COECL-GPA has two forms: American English (COECL-GPA AE) and Singaporean English (COECL-GPA SE). Please see below for resources from each graphophonemic corpus.

American English

164 graphotactics

91 phonemic sequences

366 graphemic-phonemic correspondences

Singaporean English

164 graphotactics

78 phonemic sequences

348 graphemic-phonemic correspondences

Read the original research.

Paquin, S. (2024). Frequency distribution of graphemic-phonemic correspondences of vowels in English children's literature (Publication No. 31640390) [Master's thesis, University of Massachusetts Boston]. ProQuest Dissertations & Theses Global.

CamTESOL 2025 Presentation

overview of findings

comparison of the natural graphophonemic distribution with phonics curricula

potential importance for phonics instruction in Inner Circle and World Englishes

explanation of how to use lexical sets to phonemically retune graphotactics to World Englishes