
SPCL Plenaries

Title: AI-Driven Literacy Education for Creole Speakers: Towards an Inclusive Digital Future

Dr. Andre Coy



Caribbean societies are linguistically diverse, with Creole languages shaping everyday communication. However, the dominance of Standard English in formal education often marginalises Creole-speaking learners, creating barriers to literacy and academic success. This plenary explores the potential of artificial intelligence (AI) and speech technologies to support literacy education in Creole-speaking communities.

The plenary introduces the work of CARiLIT (Caribbean Applied Research in Literacy and Instructional Technologies), a team of researchers developing an AI-driven literacy tutor designed to engage learners whose first language differs structurally and lexically from Standard English. By leveraging automatic speech recognition (ASR) and natural language processing (NLP), the tutor can process Creole-influenced speech, provide adaptive feedback, and facilitate literacy development in a way that acknowledges and values learners' linguistic backgrounds. This presentation examines the challenges of training AI on underrepresented languages, including data scarcity, orthographic variation, and sociolinguistic stigma.

Additionally, the broader implications of integrating AI into Caribbean education are discussed, emphasising the importance of culturally responsive technology. As AI continues to shape global education, it is imperative that tools be developed to reflect and accommodate linguistic diversity rather than reinforce existing inequalities.

This plenary invites Caribbean technologists, linguists, educators, and policymakers to consider how AI can be harnessed to support Creole-speaking learners while fostering linguistic inclusion. By centering AI in the conversation on literacy development, it opens up new possibilities for equitable and effective education in multilingual societies.

 

CREOLE LANGUAGE DATA IN THE AGE OF AI – GOOD DATA, BAD DATA, DATA BY WHO, DATA FOR WHO?

Professor Silvia Kouwenberg



Data are the foundation of our work as linguists. In this paper, I focus on how we obtain data and use them. Where Artificial Intelligence appears to open new frontiers in research, we must also consider how to ethically push those boundaries. Some questions to be considered include:

• How do we determine that particular data can be used to arrive at valuable, reliable insights? Are all data equal in principle?

• What methods of data collection yield useful data? Are all methods to be trusted in principle?

• How is ownership of data determined and to whose benefit are data deployed?

There are no straightforward answers to any of these questions. In fact, challenges to the integrity of data, and to the ethics of their collection and use, have been around for decades. In this presentation, I therefore look at the nature of data, ways that we obtain them, and the critical questions that can be raised regarding their validity. I will explore the traditional distinction between naturalistic data and elicited data – the latter outright rejected by some researchers as yielding artificial data which are inherently invalid. We have to ask: what data are acceptable? Is there a good data vs. bad data dichotomy?

In the study of creole languages, linguists exist on a spectrum from the armchair linguist who must trust their sources of data completely, to the non-native speaker linguist who collaborates with native speakers to obtain data, to the native speaker linguist who can rely on introspection, and on whom we tend to confer an unassailably privileged status. I will question the different positions on this spectrum, the value we attribute to them, and the ethical implications for the work produced by these researchers. Is there an implied spectrum from bad (or not-so-good) data to good data? A further complication arises where a particular variety, typically the most basilectal, is reified as most worthy of our attention. Whose data are good enough?

In contrast with the direct ownership and control over data obtained by introspection, the non-native speaker linguist who is reliant on native speaker collaborators claims ownership of recordings and transcriptions which are ultimately converted to theses and publications. While university ethics committees ponder the implications of these practices, linguists also ask themselves whether sufficient benefits accrue to those whose language they work on. Answers to this question can be sought at the level of individual speakers, communities of speakers, or even scholarly communities, depending on the beliefs one holds about the ultimate goals of one's work, and these beliefs are not shared by all.

A new type of data source has come to the fore over the past twenty years: born-digital data scraped off the internet or collected directly from their authors – with the assumption that the native speaker status of the content creators elevates the data to an unassailable trust rating. This work has focused on textual data, but there is of course also a rich trove of video and audio available on various platforms. The performative nature of these data sources again prompts an exploration of the presumed naturalness of texts and speech produced by native speakers: what data are good enough?

Finally, I will turn to some of the ways the field may be advanced by AI tools. Recalling many tedious hours spent on transcription, I for one acknowledge the usefulness of automated transcribers, parsers and translators. But who decides whether it is ethical to feed data into the training of AI tools which were never intended for it, and which data are good enough to be so used?

Although a consideration of data types, collection methods, and ownership and control over data raises more questions than answers, I will argue that bad data do exist, that some collection methods are more likely to produce bad data, that focusing on any one variety does not do justice to the linguistic practices of communities, that born-digital data deserve to be studied as separate genres, and that a pragmatic ethics must inform our relationship with AI.
