OpenAI’s chat software ChatGPT, if let loose on the world, would score between a B and a B- on Wharton business school’s Operations Management exam, and would approach or exceed the score needed to pass the US Medical Licensing Exam (USMLE).

While this may say more about the static, document-centric nature of testing material than the intellectual prowess of software, it’s nonetheless a matter of concern and interest for educators, and just about everyone else living in the age of automation.

Academics have been fretting that assistive systems like ChatGPT and GitHub’s Copilot (based on an OpenAI model called Codex) have become so capable that teachers will need to reevaluate how they teach and mark exams.

In educational settings, AI advice is becoming commonplace: The Stanford Daily just reported, “a large number of students have already used ChatGPT on their final exams.” An estimated 17 percent of students, based on an anonymous poll of 4,497 respondents, said they had used ChatGPT to assist in fall quarter assignments and exams, with 5 percent saying they had submitted material directly from ChatGPT with little or no editing – which is presumably an honor code violation.

Separately, Christian Terwiesch, a professor at the Wharton School of the University of Pennsylvania, and a group of medical researchers mostly affiliated with Ansible Health, decided to put ChatGPT, an arguably amoral automated advisor and factually-challenged expert system, to the test.

Both Terwiesch and the Ansible Health boffins made clear that ChatGPT has limitations and gets things wrong. Overall, they gave it middling marks but said they expect AI assistive systems will find a place in teaching and in other sectors.

The model has, after all, been trained on countless pieces of human-made writing, and so its ability to guesstimate a satisfactory answer to a question from all that inhaled knowledge and factoids isn’t unexpected.

“First, it does an amazing job at basic operations management and process analysis questions including those that are based on case studies,” said Terwiesch in his paper. “Not only are the answers correct, but the explanations are excellent.”

That said, he observed that ChatGPT makes simple math mistakes and fumbles advanced process analysis questions. However, the AI model is responsive to guidance: it can successfully correct itself when given hints from a human expert.

Human guidance has also served as a source of malicious input, as demonstrated by Microsoft’s Tay chatbot and by subsequent research.

Doctor, doctor

The medical research group that wrote “Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models” includes “ChatGPT” as a co-author.

“ChatGPT contributed to the writing of several sections of this manuscript,” the biological authors state in their paper.

Other organizational affiliations of the authors include: Massachusetts General Hospital, Harvard Medical School, in Boston, Massachusetts; Warren Alpert Medical School, Brown University, in Providence, Rhode Island; and the Department of Medical Education at UWorld, LLC, a health e-learning firm based in Dallas, Texas.

The authors – Tiffany Kung, Morgan Cheatham, ChatGPT, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, and Victor Tseng – came to a conclusion similar to that of Wharton’s Terwiesch. Specifically, they found that ChatGPT performed passably – above the variable passing threshold of about 60 percent – on the USMLE, provided indeterminate answers are left out of the tally. And they expect large language models (LLMs) will play a growing role in medical education and in clinical decision making.

“ChatGPT yields moderate accuracy approaching passing performance on USMLE,” the authors state in their paper. “Exam items were first encoded as open-ended questions with variable lead-in prompts. This input format simulates a free natural user query pattern. With indeterminate responses censored/included, ChatGPT accuracy for USMLE Steps 1, 2CK, and 3 was 68.0 percent/42.9 percent, 58.3 percent/51.4 percent, and 62.4 percent/55.7 percent, respectively.”
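The gap between each pair of figures comes down to what goes in the denominator. As a rough illustration only, here is a minimal Python sketch of the two scoring conventions. The function name and the counts are hypothetical, not taken from the paper, and it assumes that an indeterminate response simply counts as not correct when it is included.

```python
def accuracy(correct: int, incorrect: int, indeterminate: int,
             censor_indeterminate: bool) -> float:
    """Percent accuracy under one of two scoring conventions.

    censor_indeterminate=True  -> indeterminate responses are dropped
                                  from the denominator ("censored").
    censor_indeterminate=False -> indeterminate responses stay in the
                                  denominator and count as not correct
                                  ("included"). This is an assumption
                                  about how the tally works.
    """
    total = correct + incorrect
    if not censor_indeterminate:
        total += indeterminate
    if total == 0:
        raise ValueError("no scored responses")
    return 100.0 * correct / total


# Hypothetical counts, chosen only to show the effect.
censored = accuracy(50, 30, 20, censor_indeterminate=True)
included = accuracy(50, 30, 20, censor_indeterminate=False)
print(f"censored: {censored:.1f} percent, included: {included:.1f} percent")
```

With the same number of correct answers, dropping the indeterminate ones shrinks the denominator, so the censored score can never be lower than the included one.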

Describing ChatGPT’s performance as “approaching passing” is a generous way of phrasing it, particularly as the higher figures leave indeterminate answers out of the count. Arriving in a physician’s office and seeing a diploma advertising a grade of D might provoke a bit more concern among patients.

But the researchers maintain that the things ChatGPT did get right conformed closely with accepted answers and that the AI model has improved remarkably, having months earlier achieved a success rate of only about 36.7 percent.

Interestingly, they observed that ChatGPT performed better than PubMedGPT, an LLM based solely on biomedical data that managed accuracy of only about 50.8 percent (based on unpublished data).

“We speculate that domain-specific training may have created greater ambivalence in the PubMedGPT model, as it absorbs real-world text from ongoing academic discourse that tends to be inconclusive, contradictory, or highly conservative or noncommittal in its language,” the authors state.

Essentially, the less scientific, more opinionated material that went into ChatGPT’s training, like patient-facing disease explanation pamphlets, appears to have made ChatGPT more opinionated.

“As AI becomes increasingly proficient, it will soon become ubiquitous, transforming clinical medicine across all healthcare sectors,” the authors conclude, adding that the clinicians associated with Ansible Health have been using ChatGPT in their workflows and have reported a 33 percent reduction in the time required to complete documentation and indirect patient care tasks.

This perhaps explains Microsoft’s decision to funnel billions into OpenAI for its future software.

The utility of ChatGPT in an education setting – despite the fact that it’s often wrong – was underscored in a blog post published Sunday by Thomas Rid, professor of strategic studies and the founding director of the Alperovitch Institute for Cybersecurity Studies.

Rid describes a recent five-day Malware Analysis and Reverse Engineering course taught by Juan Andres Guerrero-Saade.

“Five days later I no longer had any doubt: this thing will transform higher education,” said Rid. “I was one of the students. And I was blown away by what machine learning was able to do for us, in real time. And I say this as somebody who had been a hardened skeptic of the artificial intelligence hype for many years. Note that I didn’t say ‘likely’ transform. It will transform higher education.”

Guerrero-Saade, in a Twitter thread, acknowledges that ChatGPT got things wrong but insists the tool helped students come up with better answers. He suggests that it functions like a personal teaching assistant for each student.

“Fearmongering around AI (or outsized expectations of perfect outputs) cloud the recognition of this LLMs staggering utility: as an assistant able to quickly coalesce information (right or wrong) with extreme relevance for a more discerning intelligence (the user) to work with,” he wrote.

Rid argues that while concerns about AI as a mechanism for plagiarism and cheating in education need to be addressed, the more important conversation has to do with how AI tools can improve educational outcomes. ®