Since the birth of modern English-language novels in the 1700s, male and female characters from Paul Atreides to Elizabeth Bennet have laughed, grinned, felt and acted through their pages. A new study conducted using a machine learning algorithm has offered a fresh perspective on their histories. “The Transformation of Gender in English-Language Fiction,” published this week in the journal Cultural Analytics, analyzed the presentation of gender in more than 100,000 novels and found a paradox in novels of the 20th century: as rigid gender roles seemed to dissipate, indicating more equality between the sexes, the number of women characters and the proportion of women authors both decreased.

Built by study author Ted Underwood, a professor of English and of Information Sciences at the University of Illinois, and his coauthor, information scientist David Bamman of the University of California, Berkeley, the algorithm analyzed the characters and authors of 104,000 novels, far more than you’ll read in a lifetime. Underwood and Bamman originally built the algorithm for a previous study on characterization; they were joined in the current study by coauthor Sabrina Lee, a graduate student at the University of Illinois. The novels were selected primarily from the HathiTrust Digital Library and represented a selection of bestsellers from the years 1703 to 2009. The list includes popular titles like Pride and Prejudice, Dune and some of the novels of Raymond Chandler.

After sectioning the data by time, the researchers were able to see trends over certain periods. Between about 1800 and the 1970s, for instance, they found a “steady decline” in the proportion of women authors, from about 50 percent to less than 25 percent. In the same period, they saw a decline in the number of named women characters. Those trends start to reverse in the latter part of the 20th century. Over the full period studied, they also found dramatic and rapid shifts in the words used to characterize gender, as well as a decrease in the number of specifically gendered words.

Many of the words used to characterize gender weren’t explicitly gendered, like “heart” or “house,” although potentially gendered words like “skirt” or “mustache” weren’t excluded. For instance, in the 1800s the verb “felt” was more associated with women, while the verb “got” was more often associated with men. Those associations faded over time, and by the 1900s other words were more prominently associated with men and women. In the 1900s, words related to mirth became more associated with women and there was a corresponding decline in the use of those words in relation to men. “Women smile and laugh,” the authors write, “but mid-century men, apparently, can only grin and chuckle.” Similarly, in the 19th century, there’s much more discussion of feelings, at first mostly in regard to women characters. In the 20th century, there’s a lot more about bodies and clothes; mid-century men, for example, are constantly putting things in pockets or taking them out.

It’s the kind of result that demonstrates the need for machine learning approaches, according to Underwood. “The reality is, culture doesn’t come with clear definitions of what gender is or what even a literary genre is,” he says. “And machine learning is letting us work with concepts that are fuzzy.”

The method has more frequently been used to work with banking data or to help self-driving cars stay safe, so it might seem like a strange fit for analyzing the novel. But Underwood, and other scholars in the field of digital humanities, see great potential.

Seth Long, an English professor at the University of Nebraska who also works in the field of digital humanities, says these unexpected results demonstrate the power of big data for humanities scholarship. “Statistical modelling is going to require a very different way of understanding literary history,” he says. An algorithm is a blank slate until given information, but once it has that information, it can pull out patterns that people can’t. In this case, the results disrupt scholarly assumptions about how the history of literature should track with the history of women’s social progress.

“When you see [the study] alongside more traditional literary historical projects, you can see connections that you may not have otherwise seen,” says Claire Jarvis, a professor of English at Stanford University. The study confirms, in a quantitative way, some of the “hunches” she’s had about the path of literature. Those include the decrease in the proportion of women authors over much of the time period studied, a result that surprised Underwood.

“I would have expected to see some progress, just in terms of equality of representation in women in fiction,” Underwood says. “Maybe not a lot of progress, but some progress. And we really don’t see any.”

The first novels to use modern English were viewed more as entertainment and less as a legitimate literary endeavor. But “as the novel becomes more and more respectable,” says Jarvis, “it becomes less associated with female authorship.” In other words: men got in on writing novels when it started to look like a “serious” pursuit.

Although literary historians have talked about women’s departure from the novel at certain points before, says Underwood, nobody’s done the kind of broad-scale work that would demonstrate continuous trends. That’s where machine learning comes in.

“Literature scholars, we’re very aware that there are silences,” says Lee, referring to places in literary history where books weren’t written. Another silence she feels is important is the growing absence of named women characters in the novels studied. She’s a fan of the novels of pseudonymous Italian author Elena Ferrante, and says that the characterization of female friendship in Ferrante’s books highlights the “silence” of female friendship in fiction elsewhere, from both the past and the present. For her, the study underscores the same thing, and highlights “the importance of works with women seeing women.” The absence of women from the novel “has quietly shaped the way we feel about literary history,” Underwood says.

The authors note that their study doesn’t cover all novels written during this time period, and is missing representation from genre fiction such as romance novels and detective fiction, which became popular in the 20th century. However, the researchers took steps to correct for their bias by testing their database against other databases. The books they selected represent literature that was considered important by academic libraries, and the authors note that there’s more work to be done on genre fiction. “Literary gender may be constructed differently in different genres, or in different parts of the literary field,” the authors write.

Machine learning methods offer a new way to look at the silences and presences of the past and, oddly, do so through the lens of prediction. Generally, algorithms are used to make predictions or detect patterns based on a set of information, but Long says their use for history is that they can detect long-term trends in the past, as well as the present or future. “I think that’s such a powerful way of keeping our own interpretations in check,” he says.