Using digital technology to uncover ‘invisible’ patterns in language

By Adnan Ajšić

If you have seen the 1999 movie The Matrix, you will remember the title scene, in which green code tumbles down a black screen like digital rain. Later in the movie, Tank, one of the characters, ‘reads’ the code in real time to understand what’s going on inside the matrix itself. (It turns out that the green code was actually sushi recipes written out in Japanese characters.) What people like me do is a little bit like reading the green code of our social reality.

I am a corpus linguist interested in the ways people talk about language (language-related discourses) and the beliefs about language that underlie such talk (language ideologies). Corpus linguists build digital collections of naturally occurring language (corpora), such as newspaper articles, parliamentary debates, or social media posts, and then use specialized software and statistics to analyze these collections for meaningful language patterns. At the same time, I am also a sociolinguist, that is, the kind of linguist who studies the links between language and society. As our primary means of communication, language is deeply embedded in society and faithfully reflects things such as social relationships and attitudes. In my case, ‘meaningful patterns’ refers to the ways in which who we are and how we think about language are encoded in the more or less systematic and coherent statements we make about language, whether we do so consciously or not.

My study is based upon a collection (corpus) containing 34 million words from 53,000 newspaper articles published over a period of five years. You can see how that’s a little bit like looking at the green code in The Matrix. To make sense of such a large amount of text, I first used several pieces of custom-written and off-the-shelf corpus software to code my data. In the next step, I analyzed this data using statistical analysis software and a variety of statistical techniques, such as exploratory factor analysis, to identify recurrent patterns. Factor analysis is useful here because it can help us make sense of a large amount of information by looking at correlations and automatically grouping like things together. Finally, I drew on the relevant social, cultural, historical, and political contexts to interpret those linguistic patterns and explain what they mean.
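
To give a rough sense of the correlation step that factor analysis builds on, here is a minimal sketch in Python. It is not the pipeline used in the study: the four-sentence toy corpus, the chosen terms, and all function names are invented for illustration. The sketch counts how often each term occurs per 1,000 words in each text and then computes Pearson correlations between terms. Factor analysis goes a step further and groups together terms whose frequencies rise and fall together.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus (invented for illustration, not data from the study).
DOCS = [
    "our language is our identity and our nation protects its language",
    "the school reform changes how language and grammar are taught in school",
    "the nation and its identity depend on language said the minister",
    "teachers say grammar lessons in school need more time",
]
# Hypothetical terms whose distribution across texts we want to compare.
TERMS = ["nation", "identity", "school", "grammar"]


def rates_per_thousand(text, terms):
    """Return each term's frequency per 1,000 words in a single text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return [1000 * counts[term] / len(words) for term in terms]


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_table(docs, terms):
    """Correlate every pair of terms across all texts."""
    rows = [rates_per_thousand(doc, terms) for doc in docs]
    columns = list(zip(*rows))  # one column of rates per term
    return {
        (a, b): pearson(columns[i], columns[j])
        for i, a in enumerate(terms)
        for j, b in enumerate(terms)
        if i < j
    }


TABLE = correlation_table(DOCS, TERMS)

if __name__ == "__main__":
    for (a, b), r in TABLE.items():
        print(f"{a:>8} ~ {b:<8} r = {r:+.2f}")
```

In this toy data, ‘nation’ and ‘identity’ always appear in the same texts, so they correlate strongly and positively, while ‘nation’ and ‘school’ correlate negatively. With thousands of articles and many more features, it is this kind of table that factor analysis condenses into a handful of groupings.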

In this specific study, the ‘matrix’ is the relationship between language and ethnonationalism in the Balkans, the southeastern corner of Europe. Ethnonationalism is a type of nationalism which insists on biological, historical, and cultural ties between people as the only appropriate criteria for determining whether someone belongs to a certain community. In other words, if your ancestors were members of a group and you do things in a certain way, eat certain kinds of food, are a member of a certain religion, and speak a certain language, you are a member of this group. Or not. This is important because, in most places in the world, whether we are considered to be a member of a certain group decides whether we can do certain things, whether we can stay where we are, and in extreme cases, whether we live or not.

What I have found is interesting in a variety of ways. First, it turns out that we can use scientific methods to study the links between language and nationalism with a great deal of reliability and precision, both in the present and over time, because the way we talk about language is systematic, recurrent, and reflects who we are or want to be. Second, talk about language in modern society (at least in print media) is mostly limited to domains such as education, literary culture, and politics, including group identity. So, language matters, but only in some ways. And third, what we believe about language, consciously or not, depends to a large extent not only on our identity, ethnic, national, and otherwise, but also on what we think is in our political interest, broadly defined. In sum, we can learn a great deal about our society by looking at how we speak and think about language.

Unless, of course, our corpora are made up of sushi recipes.

About the Article

The article ‘Capturing Herder: A three-step approach to the identification of language ideologies using corpus linguistics and critical discourse analysis’, published in the journal Corpora, outlines an innovative and robust approach to uncovering language-related discourses and language ideologies. It also shows how language-related discourses and language ideologies can help us understand the politics of ethnonationalism.

Corpora is an international, peer-reviewed journal of corpus linguistics focusing on the many and varied uses of corpora both in linguistics and beyond.

About the Author

Adnan Ajšić teaches sociolinguistics and discourse analysis at the American University of Sharjah. He is the author of Language and Ethnonationalism in Contemporary West Central Balkans: A Corpus-based Approach (Palgrave Macmillan, 2021).