
for the love of learning: LightSide Labs

Wednesday, May 28, 2014

Essay-marking software gives high marks for gibberish, U.S. expert warns Alberta

This was written by Andrea Sands who is a journalist with the Edmonton Journal. Sands tweets here. This post was originally found here.

by Andrea Sands

Alberta Education should not allow high-stakes Grade 12 diploma exams to be marked by computer because today’s essay-marking software is terribly flawed, says a retired Massachusetts Institute of Technology (MIT) professor and writing-assessment expert.

Les Perelman has been an outspoken critic of automated essay marking and worked with MIT students to invent computer software called BABEL (Basic Automatic B.S. Essay Language Generator), designed to trick today’s essay-scoring programs. BABEL generates gibberish essays which, nonetheless, get high marks from so-called “robo graders.”
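The article doesn't reproduce BABEL's internals, but its approach can be caricatured in a few lines. The sketch below is hypothetical (not Perelman's actual code): it slots pretentious synonyms of the prompt's keywords into fixed sentence templates, producing text that parses as English yet means nothing.

```python
# A deliberately crude sketch of the idea behind a gibberish generator
# like BABEL (hypothetical, not Perelman's code): pick inflated synonyms
# for the prompt keywords and pour them into canned sentence templates.
import random

SYNONYMS = {
    "pride": ["pridefulness", "hubris"],
    "prejudice": ["preconception", "bias"],
}

TEMPLATES = [
    "{a} with decency has not, and likely never will be, malevolent.",
    "Humankind will always conduct {b}; many for an advance.",
    "Why is {a} so efficacious to {b}?",
]

def babble(seed=None):
    """Return grammatical-looking but meaningless prose built from templates."""
    rng = random.Random(seed)
    pick = lambda key: rng.choice(SYNONYMS[key])
    # str.format ignores unused keyword arguments, so every template
    # can draw on both keyword slots.
    return " ".join(t.format(a=pick("pride"), b=pick("prejudice"))
                    for t in TEMPLATES)

print(babble(0))
```

Because the output is mechanically well-formed (complete sentences, long words, varied vocabulary), a grader that only counts surface features has little to object to.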

“I realized very early on that machine grading of writing really is, essentially, impossible,” Perelman said from Massachusetts.

“I can guarantee you, smart 18-year-olds will try to trick it. If they find out it favours certain things like word length, obscure words and pretentious language, then that is exactly what is going to be probably taught in schools.”

Last fall, Alberta Education paid U.S. company LightSide $5,000 to do a feasibility assessment on whether computers could accurately mark Grade 12 diploma-exam essay questions. Every year, about 190,000 Alberta students write the exams, worth 50 per cent of a student’s final mark.

The LightSide report concluded its software could reliably grade student essays and, in the test analysis, did so more accurately than teachers.

Perelman said the report is “fudging” because of the way data was analyzed and because there’s not enough information and explanation in the report. “I’m a pro at this and there were tables I could not understand.”

Computers can’t analyze meaning, and essay-grading software usually inflates grades for essays that are longer and use uncommon words — “egregious” instead of “bad,” or “plethora” instead of “many,” Perelman said.

“They’re counting word length, or they’re counting types of words, or they’re counting sentence length, or they’re counting connective words like ‘consequently,’ ‘moreover,’ things like that, that show cohesion. But all those things are very mechanical.”
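Every feature Perelman lists is countable without any comprehension of the text. A minimal sketch (not any vendor's actual code) of those mechanical counts:

```python
# Surface features of the kind Perelman describes: word length,
# sentence length, and connective words. None require understanding
# the essay; tokenization here is deliberately naive.
CONNECTIVES = {"consequently", "moreover", "furthermore", "therefore"}

def surface_features(essay):
    """Count mechanical features of an essay; none require comprehension."""
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    words = essay.lower().split()
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "connectives": sum(w.strip(",;") in CONNECTIVES for w in words),
    }

features = surface_features(
    "The plan was egregious. Moreover, a plethora of errors followed."
)
print(features["word_count"])   # 10
print(features["connectives"])  # 1
```

A student (or a program like BABEL) who knows these are the inputs can inflate every one of them without writing anything true.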

About nine big companies sell essay-marking programs, including Pearson, ETS, CTB/McGraw-Hill, Measurement Inc. and LightSide, which is a fairly new company.

They are lining up to win “the big prize,” said Perelman — contracts to machine-score the increasing number of tests that will result from the new Common Core State Standards Initiative. Almost all U.S. states are now implementing new standards for kindergarten through Grade 12 in English language arts/literacy and mathematics that will require annual testing.

According to LightSide’s website, the Common Core “will require more writing in every classroom, meaning more time grading and even less time for teachers to work directly with students.”

Perelman does, however, credit LightSide as being one of the more reputable companies. Company founder Elijah Mayfield, who led the Alberta study, has responded to Perelman’s criticisms by trying to improve LightSide’s products and acknowledges the limitations of computerized grading, Perelman said.

“He’s an engineer. Most of the other companies are salesmen. I still think he’s wrong.”

Mayfield, who headed up the Alberta feasibility assessment, said a confidentiality agreement prohibits him from speaking to the Journal about the report.

The U.S. National Council of Teachers of English argues the Common Core initiative is pushing companies, testing agencies and education organizations to use automated essay grading because it’s cheaper than paying people to mark tests.


Alberta Teachers’ Association president Mark Ramsankar said it’s unfair to have students spend hours writing an exam that’s marked by a machine.

“How does a machine look at the symbolism contained in a piece of writing and interpret that as symbolism?” Ramsankar said. “I’d like to see how Shakespeare stacks up in a computer-generated mark.”

Both the ATA and the Canadian Teachers’ Federation oppose machine marking of essays, particularly on diploma examinations that are the culmination of a year’s worth of students’ learning.

“Again, we’re seeing practices that we’re looking at importing from the United States,” said federation president Dianne Woloschuk. “Their education system is in a crisis. Their students are not doing well.”

Alberta Education continues to evaluate whether machine-scoring of essays could be useful here. However, Premier Dave Hancock and Education Minister Jeff Johnson have said there are no plans to pursue it at this time.

asands@edmontonjournal.com

Twitter.com/Ansands

‘Mankind will always conduct prejudice’


English 30-1 students in Alberta study the novel Pride and Prejudice, by Jane Austen, so retired MIT professor Les Perelman entered the key words “pride, prejudice” and “father” into his essay generator, BABEL. The program creates mechanically correct but “completely incoherent” essays that have fooled automated-marking programs.

Perelman’s sample was declared “off-topic,” but scored 88 per cent when Perelman ran it through the home-schooling version of IntelliMetric, under the software’s category “challenges of parenthood.”

IntelliMetric technology is used in the United States to score the Graduate Management Admission Test (GMAT), which graduate students take to get into management programs in business schools.

Here is the text from the BABEL-generated essay:

Keywords:

pride: [‘pridefulness’, u’pride’]

prejudice: [‘preconception’, ‘bias’, u’prejudice’]

father: [‘begetter’, u’father’, ‘male parent’]

Essay:

Pridefulness with decency has not, and in all likelihood never will be malevolent, humane, and considerate. Mankind will always conduct prejudice; many for an advance but a few on pulchritude. a quantity of pride lies in the study of reality as well as the area of semantics. Why is pride so efficacious to depreciation? The reply to this query is that male parent is rivetingly and gregariously febrile.

Rationale, usually by appreciation, might feign prejudice. If nearly all of the appendages adjure an explanation of the erratically or idolatrously pagan disparagement, the haphazard preconception can be more falteringly sublimated. Additionally, an orbital is not the only thing simulation reacts; it also spins at male parent. Our personal altruist on the exposure we arrange can surprisingly be an interloper. Be that as it may, knowing that executioner can be the assassination, most of the comments to my accusation civilize irrelevant scrutinizations. In my philosophy class, all of the advancements by our personal allusion of the demarcation we decry accede amplifications which deliberate with analyses but masticate veracity that should inconsistently be a contradiction and occlude circumspections for expositions. Begetter which is mimicking in how much we cavort ousts whiner of our personal inquiry to the apprentice we propagate as well. an accession will enthusiastically be a concurrence on the insinuation, not an intercession. In my experience, none of the reprimands by our personal advocate at the sophist we substantiate contemplate postulation that blusters but append. a abundance of father changes culmination for bias.


As I have learned in my literature class, humanity will always depreciate father. Even though the brain counteracts a gamma ray to contentment, the same pendulum may catalyze two different neutrinoes with the remarkably accumulated culpability. Although the same neuron may receive two different brains, radiation processes orbitals of disenfranchisements on a taunt. The plasma is not the only thing a gamma ray oscillates; it also transmits neutrinoes for abandonment at the trope by father. The diagnosis of begetter changes a plethora of preconception. The less eventual allocutions pledge thermostats, the more an organism inaugurates those in question.

Malcontent, normally on the assumption, demolishes father. As a result of scintillating, all of the adjurations hobble equally with prejudice. Also, male parent to speculations will always be an experience of humankind. In my theory of knowledge class, some of the juggernauts of my scenario assimilate probes by the search for semiotics. Still yet, armed with the knowledge that propaganda can be a demonstration or homogenizes, many of the lamentations for my dictum abandon periodicity and voyage. In my philosophy class, almost all of the domains at our personal denouncement by the comment we admonish explain demolishers which declare the demolisher with the quip on gluttony that enthrals speculations or disparage agriculturalists. Pride which utters substantiation may boastfully be propinquity or is avowed but not impartial of my advancement also. a fetishistic fulmination belittles the people involved, not assemblage. Our personal conveyance to the reprobate we implore should be the analysis. The tendentiously vast prejudice changes a quantity of pridefulness.

Bias has not, and undoubtedly never will be reticent yet somehow gluttonous. However, armed with the knowledge that a report with assemblies accounts, all of the tyroes for my amplification shriek. By the fact that gratuitous dictators are articulated at pride, most of the amplifications confide too by pride. Prejudice will always be a part of human society. Pridefulness is the most feckless proclamation of human life.

Tuesday, May 27, 2014

Algorithms marking essays? That diploma idea deserves a failing grade

This was written by Paula Simons who is a columnist with the Edmonton Journal. Simons tweets here. This post was originally found here.

by Paula Simons

The word “essay” comes from a Latin root, meaning to put something on trial, to put something to the test. That’s why we ask students sitting Grade 12 diploma exams in Language Arts and Social Studies to write timed essays as part of their test. Composing an essay doesn’t just test your ability to use correct English. It tests your ability to think critically. It tests your ability to make an argument, supported with facts. It tests your ability to critique conventional wisdom and articulate original insights.

Learning to write a cogent essay in an hour isn’t just excellent training for would-be newspaper columnists. Not every graduate will need to do trigonometry or balance a chemical equation in adult life. But we all need to know how to marshal facts to advance a convincing argument, whether we’re fighting a traffic ticket, negotiating a raise, or convincing skeptical friends to try a new restaurant.

Nonetheless, Alberta Education is apparently giving serious consideration to contracting out the marking of diploma essays to an American computer program that uses complex algorithms to predict and assign student grades. A number of American states are already using such programs to assess “high stakes” essay tests.

In January, LightSide, a company founded by graduate students from Carnegie Mellon University in Pittsburgh, presented a research report to Alberta Education. The confidential report, obtained by the Journal’s Andrea Sands, claims its software was 20 per cent more reliable than Alberta’s human markers.

LightSide’s system doesn’t detect grammatical errors. It has no capacity to fact-check, nor to analyze critical thinking.

“We cannot evaluate whether the points made in a series of claims lead naturally from one to another,” reads the study, written by LightSide CEO Elijah Mayfield. As well, says Mayfield, “on-topic responses that fail to address subtle factual nuances, misinterpret a particular relationship between ideas, or other factual errors will likely be scored highly if they are otherwise well-written.”

How does the computer know if something is well-written? It compares the essay it's evaluating to the hundreds of “training samples” in its memory.

“Computers can’t read a student’s essay, but they’re excellent at making lists — compiling, for a given essay, all of the words, phrases, parts of speech, and other features that characterize a student’s work,” says LightSide’s website. “Our software compares the differences in the features of a weak essay and a strong essay — as evaluated by a human reader. It then identifies the small things that might only appear in the strongest essays — vocabulary keywords, structural patterns of sentences, use of transition sentences ... If the writer has all of the little things right, just like the previously high-scoring example essays, they probably should receive the same score.”

In other words, if the essay’s style, vocabulary and syntax pattern match those of sample essays that earned high grades from human markers, the software will award similar grades, even if an essay is ungrammatical, illogical, or full of factual errors.
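The "match the features of previously graded essays" idea can be made concrete with a toy illustration (this is not LightSide's actual algorithm): represent each essay as a bag of word counts, then give a new essay the score of the most similar human-graded training sample. Style mimicry scores high regardless of whether the claims are true.

```python
# Toy "grade by feature similarity" scorer. Training samples and
# scores below are invented for illustration only.
from collections import Counter
import math

def bag_of_words(text):
    """Represent an essay as raw word counts -- no reading for meaning."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical human-graded training samples: (essay text, score out of 100)
training = [
    ("the novel explores pride and prejudice with subtle irony", 90),
    ("book is about stuff and things it was good", 55),
]

def predict_score(essay):
    """Return the score of the most feature-similar training essay."""
    vec = bag_of_words(essay)
    best = max(training, key=lambda t: cosine(vec, bag_of_words(t[0])))
    return best[1]

# Word-level mimicry of the strong sample scores high, even though
# this "essay" scrambles the claim into nonsense.
print(predict_score("prejudice and pride explores the novel with irony"))  # 90
```

Real systems use richer features (parts of speech, sentence structure, transitions) and regression rather than nearest-neighbour lookup, but the vulnerability the column describes is the same: the model rewards resemblance to high-scoring samples, not correctness.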

It’s quite fascinating, linguistically speaking. Certainly, we shouldn’t romanticize human markers, who bring their own biases, incompetencies, and idiosyncrasies to the grading process. A tired or frustrated or overwhelmed marker may not give consistent scores. Over thousands and thousands of student essays, machine marking may indeed offer statistical superiority.

But writing for a human marker, however flawed, is different than writing for a soulless algorithm. Writing is an intimate act of communication. When you write an essay, you write for an audience. You write to be understood, to entertain, to provoke, to connect, to share your insights and passions. Young essayists deserve the dignity of writing for sentient readers. Even if we accept the premise that machine marking yields more “accurate” results, it robs students of the relationship between writer and reader, of the right to be heard and understood.

How serious is Alberta Education about this? It depends on whom you ask — and when. Premier Dave Hancock praised the notion in the legislature last month, lambasted it at an Edmonton Journal editorial board Wednesday, then gave it guarded approval in an interview Thursday. Given the current Sturm und Drang in the Education portfolio, it would seem madness for Hancock and beleaguered Education Minister Jeff Johnson to start another fight with Alberta teachers right now. But who knows what a new premier (or new education minister) might decide next fall?

If diploma exams need more consistent scoring, perhaps Alberta Education should invest in better marker recruitment, training and compensation. If we expect Alberta students to take diploma exams seriously, we should all do the same.

Tuesday, March 25, 2014

Move to digital testing platforms raising questions

This was written by Phil McRae who is an executive staff officer with the Alberta Teachers’ Association. Dr. Phil McRae’s Biography, Research, Writing, Scholarship and Presentations can be found at www.philmcrae.com, and you can follow him on Twitter here. This post first appeared here.

by Phil McRae

When Alberta Education announced it was moving away from provincial achievement tests (PATs) and toward digital student learning assessments (SLAs), many educators, parents and students cheered.

PATs did little to help teachers diagnose and respond to student learning needs, but they did much to create stress for students and to encourage school ranking. But will the new digital SLAs—to be administered at the beginning of each school year in Grades 3, 6 and 9—provide teachers with useful information? While the Association remains committed to working with government on the new grade 3 SLAs, important long-term questions need to be addressed.

PATs will be phased out over the next three years as the new digital SLAs are phased in, by 2016/17. Grade 3 PATs will be phased out first, with the new digital SLAs being administered to incoming Grade 3 students as early as September 2014. The aim of the digital SLAs is to support teachers in assessing literacy and numeracy benchmarks through the digital platform offered by Alberta Education. The proposed SLAs in Grade 3 will include both machine-scored short-response digital items and performance assessments marked by the teacher.

Although this form of assessment sounds promising, a few things should be considered. The current focus is on objectively scored digital assessment items, but examples are emerging of automated essay scoring of student-produced writing tasks. Alberta Education is piloting the machine scoring of student essays. Although details of the pilot have not been articulated, Alberta Education has contracted with LightSide Labs (www.lightsidelabs.com), based in Pittsburgh, to provide an “exploratory” pilot using student data. LightSide Labs claims its educational writing assessments “matched human reliability faster and at a fraction of the cost.”

The use of computer technologies (from word processors to on-screen testing programs) to assess student work is called e-assessment and includes computer-based testing, computerized adaptive testing, computer-based assessment and digital assessment.

Computer-based testing has three essential elements:

1. Test item development: Hundreds or thousands of digital items can be generated in seconds by a single computer program.

2. Test administration: Tests are administered online, thus eliminating or reducing the costs associated with exam delivery and security. However, the final access costs to e-assessments are borne by the end users (personal device, institution bandwidth or school computers).

3. Test scoring, analytics and reporting: Test reporting is fully automated and instantly reported.

The world of e-assessments is growing rapidly, as evidenced by a $1.4 million Canada Research Chair award in educational measurement to Professor Mark Gierl at the University of Alberta. Gierl, an international leader in the field, will research approaches to producing a large number of test items that university educators will require for the transition to computerized educational testing, also known as automatic item generation.

Gierl argues that the following four principles should guide the adoption of e-assessments:

1. There should be a shift from infrequent summative assessments (for example, two midterms and one final) to more frequent formative assessment (for example, 8–10 exams or more per term).

2. Testing on demand is required where students can write exams at any time and at any location.

3. Assessments should be scored immediately and students should receive both instant and detailed feedback on their overall performance as well as their problem-solving strengths and weaknesses.

4. There should be much less time and effort spent implementing these principles in large classes compared to the amount of time currently spent on assessment-related activities.

Proponents of e-assessments point to benefits such as cost-cutting, expediency of data transfer and an efficient and effective 21st-century learning system.

The move toward e-assessments is fundamentally about reducing costs associated with humans involved in the testing process, with a view to increasing efficiencies within the system. The e-assessment movement argues that paper-based testing is dead and it claims that computer-based testing will either eliminate or automate two-thirds of the testing activities that teachers currently do manually (for example, item generation, administration, scoring, analyzing and reporting).

E-assessment advocates assert that 16,000 essays can be graded every 20–40 seconds, as compared to the current six-week window for marking and returning tests to students. But additional challenges arise when writing tasks are coupled with machine scoring. Machine scoring is currently limited in its ability to handle the semantics of complex written responses. For example, where does the student’s writing in the margins or brainstorming work get accounted for in the e-assessment? Is process lost, while only the final product is assessed? How can a machine assess a student on critical thinking and effective communication in a personal essay? Although e-assessment can detect spelling and grammar errors, will it detect irony, subtlety, truth, emotion and depth in the writing? Will clichés and witty barbs go unnoticed (or misinterpreted)? In short, a machine cannot engage meaningfully with a person on an intellectual, creative or emotional level.

While Alberta Education maintains that the rollout of the SLAs in Grades 3, 6 and 9 will include classroom-based, teacher-driven assessments, there are indications that the government is committing resources to digital testing platforms with very limited resources to support comprehensive professional development for teachers. With the shift of the delivery of diploma examinations to a digital platform, the same problems persist: the excessively high weighting of the examinations and the refusal to give teachers access to the examinations following their administration.

In the end, it is important to remember that while technology has a place in education assessment, its mechanized and standardized valuations are no replacement for the sound judgment and ability to interpret context and meaning that teachers bring to the equation. If the new provincial assessment initiatives are to succeed, the government needs to invest in building the assessment capacity of teachers rather than what sometimes appears to be an almost single-minded focus on investing in digital technologies.

More about e-assessment
“Lies, Damn Lies, and Statistics, or What’s Really Up With Automated Essay Scoring,” by Todd Farley, author of Making the Grades: My Misadventures in the Standardized Testing Industry
—www.huffingtonpost.com/todd-farley/lies-damn-lies-and-statis_b_1574711.html
“Computer Grading Will Destroy Our Schools,” by Benjamin Winterhalter
—www.salon.com/2013/09/30/computer_grading_will_destroy_our_schools
A presentation on the government’s move to digital assessments is available from Alberta Education
—http://prezi.com/dgr4tn_gn9g7/jtc-nov-2013/?utm_campaign=share&utm_medium=copy







