National Geographic News: NATIONALGEOGRAPHIC.COM/NEWS

Computer Programs Pass Judgment on Students' Writing

Brian Kladko
The Record (Hackensack, N.J.)
July 24, 2001
 
How devastating would it be to find these comments on your latest assignment in Composition 101?

"Your essay does not coherently focus or communicate your ideas; is organized very weakly or doesn't develop ideas enough; generalizes and does not provide examples or support to make your points clear."

Now imagine getting that criticism from a computer.

Demoralizing? How about dehumanizing? Well, too bad. It's already happening.

Some schools and universities are now using software programs that evaluate writing—not just spelling and grammar, but content, structure, even tone. Educators find the technology helpful to score standardized tests, grade final exams, and give students instant feedback on their writing.

The essay portion of the Graduate Management Admission Test (GMAT), the admissions exam required by business schools, is now graded by a human and a computer, with the two scores averaged. At Camden County College in New Jersey, the same technology, known as "e-rater," helps the English department faculty wade through the thousands of final exam essays written each semester for Composition I and II. The e-rater is one of two graders of the final exam.

In Boulder, Colorado, sixth-grade teachers are using another computer program, the Intelligent Essay Assessor, to help students refine their compositions before handing them in to the teacher for a grade.

"I don't believe a program like this will ever replace a teacher, and I don't believe that's the intent of it, either," said Ronald Lamb, one of the Boulder teachers using the technology. "I think the intention is to give kids much more feedback, and much more immediate feedback, than a teacher could provide to a class of 20 or 30 students."

Human Input

The technology has yet to win over academics such as William C. Dowling, an English professor at Rutgers University in New Jersey. He believes that evaluating a piece of writing is too subtle a task for a machine.

"To recognize various kinds of good writing, you're going to encounter examples that the machine can't deal with, that the grader has to be thinking to recognize. You need a grader that can think," said Dowling, a linguistics expert.

The companies that market the new computer programs don't claim their products can think. But the creators say that by using techniques such as "natural language processing" and "latent semantic analysis," their programs agree with human graders as often as, and sometimes more often than, two human graders agree with each other.

The Educational Testing Service, which designs and grades the GMAT and other widely used standardized tests, said its e-rater program comes within one point of a human grader 98 percent of the time, using the six-point scale that is now a common approach to grading essays on standardized tests.

If there is a difference of more than one point between the scores of the computer and a human evaluator, the essay is read by another person and the three scores are averaged.
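The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not ETS's actual code: the function names are hypothetical, and the article does not say how averaged scores are rounded, so the sketch returns the plain average.

```python
def adjacent_agreement(human_scores, machine_scores, tolerance=1):
    """Fraction of essays whose two scores differ by at most `tolerance` points."""
    pairs = list(zip(human_scores, machine_scores))
    if not pairs:
        raise ValueError("need at least one pair of scores")
    close = sum(1 for h, m in pairs if abs(h - m) <= tolerance)
    return close / len(pairs)


def final_score(human, machine, second_human=None):
    """Average the human and computer scores on the six-point scale.

    If the two differ by more than one point, a second human score is
    required and all three scores are averaged.
    """
    if abs(human - machine) <= 1:
        return (human + machine) / 2
    if second_human is None:
        raise ValueError("scores differ by more than one point; "
                         "a second human reading is needed")
    return (human + machine + second_human) / 3


# Three of these four essays are scored within one point of each other.
print(adjacent_agreement([4, 5, 2, 6], [4, 4, 4, 6]))  # 0.75
print(final_score(4, 5))                               # 4.5
```

An agreement figure such as the 98 percent cited by ETS would correspond to `adjacent_agreement` returning 0.98 over a large set of essays.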

ETS, which began using e-rater to grade the test two years ago, has cut its GMAT costs by U.S. $1.7 million a year because graders now have to read fewer essays. The organization can also return scores to test takers in ten days, instead of the four weeks it used to take.

But Sam Graziano, who took the GMAT last month, wasn't thrilled to learn that a computer would evaluate his writing, and thereby help decide whether he is admitted to a top business school.

"I'm a computer science major, and it's kind of hard for me to understand an algorithm that could grade an essay," said Graziano. "At this time, I wouldn't really trust it."

Some Limitations

Another essay-grading program, IntelliMetric, is muscling its way into the standardized testing industry, and a new program called Accuplacer decides the appropriate course level for incoming college students.

The programs take different approaches to their task, but they all rely on a database of essays that have already been graded by humans. The programs are smart enough, according to their inventors, to recognize which characteristics correspond to higher scores.

ETS's e-rater focuses mostly on how an essay is written, not its meaning. For example, it looks for cue words—such as "however," "because," and "therefore"—that are key to framing an argument. It also looks for variety in the arrangement of phrases, clauses, and sentences. And to recognize whether an essay is on topic, it looks for certain words based on the previously graded essays in its database.
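To make those three signals concrete, here is a toy Python sketch that counts cue words, measures how much sentence length varies, and checks vocabulary overlap with previously graded essays. Everything in it is an assumption for illustration: the cue-word list contains only the three words quoted above, and the feature definitions are simple stand-ins rather than the measures e-rater actually computes.

```python
import re

# Only the three cue words named in the article; the real list is not public here.
CUE_WORDS = {"however", "because", "therefore"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def essay_features(essay, graded_essays):
    """Return three toy features for one essay."""
    words = tokenize(essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    lengths = [len(tokenize(s)) for s in sentences]

    # How many cue words the essay uses.
    cue_count = sum(1 for w in words if w in CUE_WORDS)
    # Crude variety proxy: gap between longest and shortest sentence.
    length_spread = max(lengths) - min(lengths) if lengths else 0
    # Share of the essay's distinct words that also occur in graded essays.
    topic_vocab = {w for e in graded_essays for w in tokenize(e)}
    distinct = set(words)
    overlap = len(distinct & topic_vocab) / len(distinct) if distinct else 0.0

    return {"cue_words": cue_count,
            "length_spread": length_spread,
            "topic_overlap": overlap}


graded = ["Taxes pay for schools and roads.",
          "Lower taxes can help firms grow."]
essay = "Taxes fund schools. However, high taxes hurt growth because firms leave."
print(essay_features(essay, graded))
```

A production system would presumably combine many more features and weight them against the human-assigned scores in its database; this sketch only shows the kind of surface evidence involved.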

The Intelligent Essay Assessor is geared more toward the content of a composition. The program is primed by feeding it a batch of essays already graded by humans, or text that serves as the basis for the essays, such as a history or science book.

The program analyzes the relationships between the words, looking for patterns. It recognizes how the words fit together—for example, it recognizes that "the doctor operated on the patient" is similar to "the surgeon wielded the scalpel." In that way, its creators say, the Intelligent Essay Assessor comes to understand the words. It can then compare that meaning with the essays to be graded.
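The idea behind latent semantic analysis can be illustrated with a very small Python sketch. It builds a term-by-document count matrix, keeps only the two strongest dimensions, and compares words in that reduced space. The four-document corpus, the function names, and the use of power iteration in place of a library singular value decomposition are all assumptions made to keep the example self-contained; real systems train on large text collections and weight the counts.

```python
import math
import random


def term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are raw counts."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for words in tokenized for w in words})
    matrix = [[words.count(term) for words in tokenized] for term in vocab]
    return vocab, matrix


def top_eigenvectors(sym, k, iters=200, seed=0):
    """Top-k eigenvectors of a symmetric matrix via power iteration with deflation."""
    rng = random.Random(seed)
    n = len(sym)
    work = [row[:] for row in sym]
    vectors = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]  # fresh random start
        for _ in range(iters):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * work[i][j] * v[j]
                         for i in range(n) for j in range(n))
        vectors.append(v)
        # Remove the direction just found so the next pass finds another.
        for i in range(n):
            for j in range(n):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return vectors


def lsa_term_vectors(docs, k=2):
    """Project each term's row of counts onto the top-k document directions."""
    vocab, matrix = term_document_matrix(docs)
    n_docs = len(docs)
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n_docs)]
            for i in range(n_docs)]
    basis = top_eigenvectors(gram, k)
    return {term: [sum(r * b for r, b in zip(row, vec)) for vec in basis]
            for term, row in zip(vocab, matrix)}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


docs = ["doctor operated patient", "surgeon operated patient",
        "teacher graded essay", "professor graded essay"]
vocab, counts = term_document_matrix(docs)
raw = dict(zip(vocab, counts))
reduced = lsa_term_vectors(docs, k=2)

print(cosine(raw["doctor"], raw["surgeon"]))          # 0.0: never in the same document
print(cosine(reduced["doctor"], reduced["surgeon"]))  # approximately 1.0
```

In this toy corpus "doctor" and "surgeon" never appear together, yet they end up pointing the same way in the reduced space because they share the same neighbors, while "doctor" and "teacher" stay unrelated.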

"It isn't as simple as looking at which words occur together," said Thomas Landauer, a University of Colorado professor who has done research on the technology. "It's a much deeper process than that."

The Intelligent Essay Assessor, Landauer said, is best at evaluating answers in fact-filled subjects, such as science and history. The program can look at a student's essay and decide what points are missing.

A study that compared essays written under the program's tutelage with those written without such help concluded that the computer-aided essays were consistently better.

The programs do have their limits. They can't deal with creativity, such as metaphors or unconventional writing styles. If confronted by quirks, the computer is supposed to alert its handlers that the essay is unusual and needs to be read by a human.

The e-rater also can be fooled. For example, if the word "therefore" is one of the words it's looking for, it will probably give the writer credit for using it even if it's the first word in the essay, said Marisa Farnum, a writing assessment specialist at ETS. A teacher, on the other hand, might consider such a use of "therefore" completely inappropriate and penalize the student for it.
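The weakness Farnum describes is easy to reproduce with a position-blind counter. The Python sketch below is a hypothetical illustration, not e-rater's logic: `count_cue_word` gives credit wherever the word appears, and `cue_opens_essay` is the kind of extra check a human reader applies without thinking.

```python
import re


def count_cue_word(essay, cue="therefore"):
    """Position-blind count: credits the cue word wherever it appears."""
    return re.findall(r"[a-z]+", essay.lower()).count(cue)


def cue_opens_essay(essay, cue="therefore"):
    """True when the cue word is the very first word, with nothing to follow from."""
    words = re.findall(r"[a-z]+", essay.lower())
    return bool(words) and words[0] == cue


bad = "Therefore, cats are better than dogs."
good = "Cats purr; therefore, they are calm."

# Both essays earn the same credit from the naive counter.
print(count_cue_word(bad), count_cue_word(good))    # 1 1
# Only the position check tells them apart.
print(cue_opens_essay(bad), cue_opens_essay(good))  # True False
```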

Some professors, such as Dowling at Rutgers, think the programs will be unable to process students' more complex and original writing. Dennis Baron, the head of the English department at the University of Illinois in Urbana, has the opposite fear: that the programs won't be able to see past a student's weaknesses.

"I've been reading student writing for 35 years, and you just get a feel for what the student is trying to say," Baron said. "They don't always hit the nail right on the head, and it's those times—when you know they're on the right track, they've almost got it, and they haven't quite said it—that you want to give them some credit for this. I don't know that you can program a machine to do that kind of gray area."

Copyright 2001, The Record (Hackensack, N.J.)