I fact-checked ChatGPT with Bard, Claude, and Copilot – and this AI was the most confidently incorrect

Generative artificial intelligence (AI) is notoriously prone to factual errors. So, what do you do when you've asked ChatGPT to generate 150 presumed facts and you don't want to spend an entire weekend confirming each one by hand?

Also: AI in 2023: A year of breakthroughs that left no human thing unchanged

Well, in my case, I turned to other AIs. In this article, I'll explain the project, consider how each AI performed in a fact-checking showdown, and offer some final thoughts and cautions if you also want to venture down this maze of twisty little passages that are all alike.

The project

Last week, we published a really fun project where we had DALL-E 3, running inside ChatGPT, generate 50 picturesque images that it thought represented each US state. I also had ChatGPT list "the three most interesting facts you know about the state." The results were, as my editor put it in the article's title, "gloriously strange."

ChatGPT put the Golden Gate Bridge somewhere in Canada. The tool put Lady Liberty both in the midwestern US and somewhere on Manhattan island. And it generated two Empire State Buildings. In short, ChatGPT got its abstract expressionism funk on, but the results were pretty cool.

Also: I asked DALL-E 3 to create a portrait of every US state, and the results were gloriously strange

As for the individual facts, they were mostly on target. I'm pretty good with US geography and history, and thought that few of ChatGPT's generated facts stood out as wildly wrong. But I didn't do any independent fact-checking. I just read over the results and pronounced them good enough.

But what if we really want to know the accuracy of those 150 fact bullets? That kind of question seems like a good project for an AI.

Methodology

So here's the thing. If GPT-4, the OpenAI large language model (LLM) used by ChatGPT Plus, generated the fact statements, I wasn't comfortable letting it check them. That's like asking high school students to write a history paper without using any references, and then self-correct their work. They're already starting with suspect information, and then you're letting them correct themselves? No, that doesn't sound right to me.

Also: Two breakthroughs made 2023 tech's most innovative year in over a decade

But what if we fed those facts to other LLMs inside other AIs? Both Google's Bard and Anthropic's Claude have their own LLMs. Bing uses GPT-4, but I figured I'd test its responses just to be a completionist.

As you'll see, I got the best feedback from Bard, so I fed its responses back into ChatGPT in a round-robin perversion of the natural order of the universe. It was a cool project.
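If you'd rather not paste 50 states' worth of facts into multiple chat windows by hand, the same fan-out can be scripted. Below is a minimal sketch using the OpenAI and Anthropic Python SDKs; the model names, the state_facts.txt file, and the exact prompt wording are my assumptions for illustration. My actual tests ran through the chat interfaces, not the APIs.

```python
# Minimal sketch: fan the same fact list out to two LLMs for cross-checking.
# Assumes the openai and anthropic SDKs are installed and that the
# OPENAI_API_KEY and ANTHROPIC_API_KEY environment variables are set.
# Model names and the "state_facts.txt" path are illustrative assumptions.
from openai import OpenAI
from anthropic import Anthropic

PROMPT = (
    "The following text contains state names followed by three facts "
    "for each state. Examine the facts and identify any that are in "
    "error for that state:\n\n"
)

def check_with_gpt4(facts: str) -> str:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT + facts}],
    )
    return resp.choices[0].message.content

def check_with_claude(facts: str) -> str:
    client = Anthropic()
    resp = client.messages.create(
        model="claude-2.1",
        max_tokens=2048,
        messages=[{"role": "user", "content": PROMPT + facts}],
    )
    return resp.content[0].text

if __name__ == "__main__":
    facts = open("state_facts.txt", encoding="utf-8").read()
    for name, checker in (("GPT-4", check_with_gpt4), ("Claude", check_with_claude)):
        print(f"--- {name} ---\n{checker(facts)}\n")
```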

Anthropic Claude

Claude uses the Claude 2 LLM, which is also used in Notion's AI implementation. Claude allowed me to feed it a PDF containing the full set of facts (without the pictures). Here's what I got back:

[Screenshot: Claude's fact-check response. Screenshot by David Gewirtz/ZDNET]

Overall, Claude found the fact list to be mostly accurate, but it did have some clarifications for three items. I limited how long the ChatGPT facts could be, and that limit inhibited nuance in the fact descriptions. Claude's fact check took issue with some of that lack of nuance.

Overall, it was an encouraging response.
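As an aside, Claude's web interface accepts the PDF directly, but if you wanted to script that hand-off, you'd extract the text yourself before calling a text-only API. Here's a minimal sketch using the pypdf library; the filename is a placeholder of my own invention.

```python
# Sketch: pull plain text out of the facts PDF so it can be passed to an
# LLM API that accepts only text. "state_facts.pdf" is a made-up filename.
from pypdf import PdfReader

def pdf_to_text(path: str = "state_facts.pdf") -> str:
    reader = PdfReader(path)
    # extract_text() can return None for image-only pages, so guard with "".
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)

print(pdf_to_text()[:500])  # peek at the first 500 characters
```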

Copilot… or nopilot?

Then we get to Microsoft's Copilot, the renamed Bing Chat AI. Copilot doesn't allow PDFs to be uploaded, so I tried pasting in the text of all 50 states' facts. That approach failed immediately, because Copilot only accepts prompts of up to 2,000 characters:

[Screenshot: Copilot's 2,000-character prompt limit. Screenshot by David Gewirtz/ZDNET]
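Had I wanted to push through that limit, the obvious workaround would have been to split the list into per-state chunks that each fit under the cap and submit them one at a time. Here's a rough sketch of that kind of splitter; the 2,000-character budget and the blank-line-separated input format are assumptions based on my test file, not documented Copilot behavior.

```python
# Sketch: split a blank-line-separated fact list into prompt-sized chunks,
# each under Copilot's roughly 2,000-character limit (instruction included).
PROMPT = ("The following text contains state names followed by three facts "
          "for each state. Identify any that are in error:\n\n")
LIMIT = 2000

def chunk_facts(text: str, limit: int = LIMIT) -> list[str]:
    budget = limit - len(PROMPT)      # leave room for the instruction itself
    chunks, current = [], ""
    for block in text.split("\n\n"):  # one block of facts per state
        candidate = (current + "\n\n" + block).strip()
        if len(candidate) > budget and current:
            chunks.append(PROMPT + current)  # flush the full chunk
            current = block
        else:
            current = candidate
    if current:
        chunks.append(PROMPT + current)
    return chunks
```

For this test, though, I kept things simpler.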

I asked Copilot the following:

The following text contains state names followed by three facts for each state. Please examine the facts and identify any that are in error for that state

Here's what I got back:

[Screenshot: Copilot repeating the facts back. Screenshot by David Gewirtz/ZDNET]

It pretty much just repeated the fact data I asked it to check. So, I tried to guide it with a more forceful prompt:

[Screenshot: Copilot's response to the more forceful prompt. Screenshot by David Gewirtz/ZDNET]

Once again, it gave me back the facts I asked it to verify. I found this output very odd, because Copilot uses the same LLM as ChatGPT. Clearly, Microsoft has tuned it differently than ChatGPT.

I gave up and moved on to Bard.

Bard

Google has just announced its new Gemini LLM. I don't yet have access to Gemini, so I ran these tests on Google's PaLM 2 model.

Additionally: What’s Gemini? Every part it’s best to find out about Google’s new AI mannequin

Compared to Claude and Copilot, Bard knocked it out of the park. Or, to be more Shakespearean about it, Bard "doth bestride the narrow world like a Colossus."

Check out the results below:

[Screenshot: Bard's fact-check results. Screenshot by David Gewirtz/ZDNET]

It's important to note that many state facts aren't even agreed upon by the states themselves, or there are nuances. As I'll show you in the next section, I fed this list back to ChatGPT, and it found two discrepancies in the Alaska and Ohio answers.

But there are other misses here. In some ways, Bard overcompensated for the assignment. For example, Bard correctly stated that states other than Maine produce lobsters. But Maine goes all-in on its lobster production. I've never been to another state that has miniature lobster traps as one of its most popular tourist-trap trinkets.

Also: I spent a weekend with Amazon's free AI courses, and highly recommend you do, too

Or let’s decide Nevada and Space 51. ChatGPT stated, “Prime-secret navy base, rumored UFO sightings.” Bard tried to appropriate, saying “Space 51 is not simply rumored to have UFO sightings. It is an actual top-secret navy facility, and its function is unknown.” They’re saying just about the identical factor. Bard simply missed the nuance that comes from having a decent phrase restrict.

Another place Bard picked on ChatGPT without understanding context was Minnesota. Yes, Wisconsin has a lot of lakes, too. But ChatGPT didn't claim Minnesota had the most lakes. It just described Minnesota as the "Land of 10,000 Lakes," which is one of Minnesota's most common slogans.

Bard got hung up on Kansas as well. ChatGPT said Kansas is "Home to the geographic center of the contiguous US." Bard claimed it was South Dakota. And that would be true if you count Alaska and Hawaii. But ChatGPT said "contiguous," and that honor goes to a point near Lebanon, Kansas.

Also: These are the jobs most likely to be taken over by AI

I could go on, and I will in the next section, but you get the point. Bard's fact-checking seems impressive, but it often misses the point and gets things just as wrong as any other AI.

Before we move on to ChatGPT's limited fact check of Bard's fact check, let me point out that most of Bard's entries were either incorrect or wrong-headed. And yet, Google puts its AI answers in front of most search results. Does that concern you? It sure worries me.

Such a marvel, my lords and ladies, is not to be spoken of.

ChatGPT

Right off the top, I could tell Bard got one of its facts wrong: Alaska is far bigger than Texas. So, I thought, let's see if ChatGPT can fact-check Bard's fact check. For a moment, I thought this bit of AI tail-chasing might knock the moon out of Earth's orbit, but then I decided I'd risk the entire structure of our universe, because I knew you'd want to know what happened.

Here's what I fed ChatGPT:

[Screenshot: the prompt fed to ChatGPT. Screenshot by David Gewirtz/ZDNET]

And here's what ChatGPT said (and, for clarity, the moon did remain in orbit):

[Screenshot: ChatGPT's cross-check of Bard's fact check. Screenshot by David Gewirtz/ZDNET]

As you can see, ChatGPT took issue with Bard's erroneous claim that Texas is the biggest state. It also had a bit of a tizzy over Ohio vs. Kansas as the birthplace of aviation, which is more controversial than most schools teach.

Also: 7 ways to make sure your data is ready for generative AI

It's generally accepted that Wilbur and Orville Wright flew the first airplane (actually at Kitty Hawk, North Carolina), although they built their Wright Flyer in Dayton, Ohio. That said, Sir George Cayley (1804), Henri Giffard (1852), Félix du Temple (1874), Clément Ader (1890), Otto Lilienthal (1891), Samuel Langley (1896), Gustave Whitehead (1901), and Richard Pearse (1902), hailing from New Zealand, the UK, France, Germany, and other parts of the US, all have somewhat legitimate claims to being first in flight.

But we'll give the point to ChatGPT, because it only had 10 words to make its claim, and Ohio was where the Wright Brothers had their bicycle shop.
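If you want to reproduce that bit of tail-chasing yourself (moon permitting), the second pass is easy to script. Here's a sketch that feeds one AI's fact-check report back to GPT-4 for review; the function name and prompt wording are mine, assumed for illustration.

```python
# Sketch: the round-robin second pass. first_report is another AI's
# fact-check of the original list; GPT-4 is asked to check the checker.
from openai import OpenAI

def cross_check(original_facts: str, first_report: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        "Below is a list of state facts, followed by another AI's "
        "fact-check of that list. Identify any claims in the fact-check "
        "that are themselves incorrect.\n\n"
        f"FACTS:\n{original_facts}\n\nFACT-CHECK:\n{first_report}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```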

Conclusions and caveats

Let's get something out of the way up front: if you're delivering a paper or a document where you need your facts to be right, do your own fact-checking. Otherwise, your Texas-sized ambitions might get buried under an Alaska-sized problem.

As we saw in our tests, the results (as with Bard) can look quite impressive, but be completely or partially wrong. Overall, it was interesting to ask the various AIs to cross-check each other, and it's a process I'll probably explore further, but the results were only conclusive in how inconclusive they were.

Copilot gave up entirely and simply asked to go back to its nap. Claude took issue with the nuance of a few answers. Bard hit hard on a whole slew of answers. But apparently, to err isn't only human; it's AI as well.

Also: These 5 major tech advances of 2023 were the biggest game-changers

In conclusion, I must quote the real Bard and say, "Confusion now hath made his masterpiece!"

What do you think? What sort of egregious errors have you seen from your favorite AI? Are you comfortable trusting the AIs for facts, or will you now do your own fact-checking? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter on Substack, and follow me on Twitter at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.
