I put GPT-5.5 through a 10-round test: It scored 93/100, losing points only for exuberance

Elyse Betters Picaro / ZDNET

ZDNET’s key takeaways

  • GPT-5.5 delivers polished, useful answers across tasks.
  • Strong performance across writing, coding, and reasoning tasks.
  • Overeagerness hurts accuracy and instruction following.

OpenAI has released GPT-5.5, which can be reductively described as better and faster than GPT-5.4. The new large language model shows improvements in agentic coding, conceptual clarity, scientific research ability, and accuracy during knowledge work.

This release follows closely on the heels of the introduction of ChatGPT Images 2.0 earlier this week, which combines AI intelligence with image generation. And if it also feels like we just discussed the release of GPT-5.4, you’re not wrong.

As the following chart shows, OpenAI’s release cadence has sped up dramatically, most likely because AI coding has significantly reduced the company’s development time.

David Gewirtz via ChatGPT Images/ZDNET

That chart was generated entirely by ChatGPT 5.5 Thinking using Images 2.0. All I did was tell the AI that I wanted to visualize the release cadence between GPT releases and wanted it presented in the ZDNET brand style. I also provided a PNG of the ZDNET logo.

The whole process, including some minor corrections, took less than 10 minutes. I have been researching data and creating professional-looking informational charts like this by hand since the invention of computer graphics. Done by hand, something like this would take at least two hours to create, not 10 minutes.

I have already done some testing of the Images 2.0 capabilities. I’ll be back with more next week. In this article, I’m focusing on GPT-5.5’s knowledge capabilities.

I ran GPT-5.5 through my 10-test evaluation process, with each test worth up to 10 points. I was both impressed and annoyed. The results were solid, but the model tended to be a little too exuberant, doing work I didn’t ask it to do.

Since GPT-5.5 is only available in paid tiers (Plus and above), I used ChatGPT Plus for my tests. Right now, my Plus account only shows GPT-5.5 available for the Thinking effort level in both Standard and Extended. I picked Standard Thinking. That’s the effort I used for these tests.

Screenshot by David Gewirtz/ZDNET

Let’s get started.

Test 1: Summarize a news story

  • Available points: 10
  • Awarded points: 5

This test looks at how well the AI can read a story on the web and explain it. I used Yahoo News because Yahoo doesn’t block AI access. I also looked for a story that’s as non-political as possible. Today, that meant I had to go a good way down the news page to find a story on the recent LaGuardia runway crash.

GPT-5.5 did correctly summarize the meat of the story, but it didn’t follow my instructions to use Yahoo News as the source. For GPT-5.2, I deducted one point because ChatGPT used information from Axios and Yahoo. This time, I took off five points, because it used information from AP, The Sun, Wall Street Journal, The Guardian, and even Wikipedia.

If I had wanted a comprehensive news answer, that would have been fine. But the prompt specifically said to look at Yahoo News, and GPT-5.5 pretty much ignored that instruction.

There’s a big push from all the AI companies about running autonomous agents. But if even a simple summary prompt can’t be followed correctly, it does not give me confidence that it’s safe to let agents run wild on long-horizon projects. Just sayin’.

Test 2: Academic concept explanation

  • Available points: 10
  • Awarded points: 10

This challenge asked the AI to explain educational constructivism to a five-year-old. It tested how well the AI can research and report on a concept, and then adjust its explanation style to the desired target level.

GPT-5.5 provided a very clear answer that included an example that would be something a five-year-old could picture and understand. All 10 points were awarded.

Test 3: Math and analysis

  • Available points: 10
  • Awarded points: 10

This test was designed to assess the AI’s math and pattern-recognition abilities. I passed the model a sequence of numbers. Those numbers were part of a math trope called the Fibonacci Sequence, but I didn’t tell the AI that.

When asked to fill in some numbers in the sequence, the AI had to understand the pattern and perform the calculations to provide the sequence. It did the math correctly.

The AI was also instructed to “explain your reasoning.” All I got back was, “The sequence is the Fibonacci sequence: each number is the sum of the two numbers before it.” This was a correct explanation and comparable to the results from earlier releases.

I awarded this test 10 points because, although brief, it was correct.
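
For readers who want to see the rule in action, here is a minimal Python sketch of the pattern quoted above, where each number is the sum of the two before it. The function name and the sample sequence are hypothetical; the article does not reproduce the actual numbers used in the test, and the sketch assumes the first two terms are known.

```python
def fill_fibonacci_gaps(sequence):
    """Return a copy of `sequence` with each None replaced by the sum
    of the two preceding terms.

    Assumption: the first two terms are known (not None).
    """
    filled = list(sequence)
    if len(filled) < 2 or filled[0] is None or filled[1] is None:
        raise ValueError("the first two terms must be known")
    for i in range(2, len(filled)):
        expected = filled[i - 1] + filled[i - 2]
        if filled[i] is None:
            # Missing term: apply the rule to fill it in.
            filled[i] = expected
        elif filled[i] != expected:
            # Known term that does not follow the rule.
            raise ValueError(f"term {i} breaks the pattern")
    return filled


# Hypothetical input; None marks a blank to fill in.
puzzle = [1, 1, 2, 3, None, 8, 13, None, 34]
print(fill_fibonacci_gaps(puzzle))  # [1, 1, 2, 3, 5, 8, 13, 21, 34]
```

The known terms double as a check: if any of them fails to equal the sum of the two before it, the sketch raises an error rather than guessing.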

Test 4: Cultural discussion

  • Available points: 10
  • Awarded points: 10

This test asked the AI to construct a case, form a coherent argument, and present an opinion on an issue that doesn’t have a definitive right or wrong answer. I asked, “Do you think social media has improved or worsened communication in society? Provide two reasons for your view.”

Interestingly, GPT-5.5 thought social media “has worsened communication overall.” I tended to agree. The model provided two solid reasons. The first was that it “often rewards speed and reaction over thoughtfulness.” The second was that social media “tends to create information bubbles.” For each reason, GPT-5.5 provided a supporting paragraph.

Both of those reasons were valid. It also shared a quick list of the positive benefits of social media, including helping people stay connected, organize for causes, and share information widely.

GPT-5.5 gave an answer that was concise, well-considered, and clear. It got 10 points for this test.

Test 5: Literary analysis

  • Available points: 10
  • Awarded points: 10

This test probed the AI’s understanding of a piece of contemporary literature: A Game of Thrones, the first book in the A Song of Ice and Fire series. The test asked what the main themes are, and why they’re important.

GPT-5.5 gave me back a 632-word response that broke the book down into the following themes:

  • Power and its cost
  • The collapse of heroic fantasy ideals
  • Family, loyalty, and inherited conflict
  • Honor versus pragmatism
  • Identity and self-invention
  • The human cost of war
  • The danger of political distraction
  • Prophecy, religion, and uncertainty
  • Justice and revenge
  • The return of the ignored past

GPT-5.5 provided clear explanations for each theme, why it was included, how it related to the book, and what it meant to the overall series. It’s hard to be strictly objective with something like this, but I really got the feeling this was the most nuanced answer I’ve seen to this question from my various GPT version tests.

All 10 points were awarded.

Test 6: Travel itinerary

  • Available points: 10
  • Awarded points: 9

This test evaluated the AI’s knowledge of geographic regions and its ability to create a helpful travel itinerary based on specific interests. I asked it to plan a week-long vacation in Boston in March focused on technology and history.

Of all the times I’ve asked this question of AIs, GPT-5.5 produced the best version for points of interest and day schedules. The model didn’t just hit the major tourist landmarks; it also pointed out a nice mix of historical and tech points of interest. GPT-5.5 took into account that March is likely to be a bit unpleasant, so it mixed in both indoor and outdoor activities, including fallback plans.

While it did not recommend a wide range of eateries, GPT-5.5 did recommend Legal Sea Foods, which is one of my personal favorites. The model lost a point because it made absolutely no reference to costs.

I feel like GPT-5.5 really grokked (yes, I did that) what someone would want in an itinerary by providing a strong list of activities to get excited about. But the AI didn’t fulfill the travel advisor part of the process because it didn’t cover budgeting.

Test 7: Emotional support

  • Available points: 10
  • Awarded points: 10

The emotional support question asked for advice and words of encouragement for an upcoming job interview. I have to say I really liked this AI’s response.

The AI included some encouragement, like “The interview is not an interrogation. It’s a mutual fit conversation.” It also gave some practical advice. First, GPT-5.5 suggested preparing three stories the job seeker could use during the interview, one about solving a problem, one about working with others, and one about learning or recovering from something difficult.

The model gave a simple breathing exercise. It said that it’s okay to pause before answering a question. It was also encouraging, pointing out that landing the interview meant the hiring company had already found something interesting about the candidate.

Good, solid, useful answers: 10 points.

Test 8: Translation and cultural relevance

  • Available points: 10
  • Awarded points: 9

My test prompt asked GPT-5.5 to translate a phrase from English to Latin and then explain the cultural relevance of Latin in today’s world.

The phrase I asked it to translate was, “The celebration will take place tomorrow in the town square.” GPT-5.5 gave me back two choices, “Celebratio cras in foro oppidi fiet,” and what it called a slightly more formal alternative, “Celebratio cras in foro publico oppidi habebitur.”

The first version is a word-for-word translation of the requested phrase. But the second one translates back to English as, “The celebration will be held tomorrow in the town’s public forum,” which was not the phrase I asked for.

GPT-5.5 may have thought it was helpful to provide an additional variation, but for someone who doesn’t speak Latin, all the approach does is confuse the issue. Which is the Latin phrase that should be used? I’m deducting a point for overeagerness that doesn’t strictly follow the prompt.

As for the second half of the question, GPT-5.5 answered briefly, but accurately.

Test 9: Coding test

  • Available points: 10
  • Awarded points: 10

Chatbot coding test results are interesting. They’re different in nature from the types of results you get when testing coding agents like Codex or Claude Code.

While the LLMs in the chatbots and coding agents are generally similar, I’ve found that the models are considerably more accurate when running in coding agents than when running in chatbots. I haven’t been able to get any of the AI companies to explain why, but I’m guessing it has something to do with how the two different tools allocate resources and training data.

The test case for this question was the second test in my coding metrics article, which asked the AI to clean up a buggy snippet of code for validating whether a dollar amount was properly entered into a field.

The AI passed this test. The only potential issue is that its code rejects a number that includes a comma. But that’s actually still a safe response. If the user enters “1,000.00,” the code returns false. It might take the user a second to try again with “1000.00,” but it won’t harm the system.

GPT-5.5 got all 10 points for this test.
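
To make the comma behavior described above concrete, here is a minimal Python sketch of a strict dollar-amount validator. It is an illustration only, not the code from the test or GPT-5.5’s output: the function name, the regular expression, and the accepted format (an optional dollar sign, digits, and optional two-digit cents) are all assumptions.

```python
import re

# Hypothetical pattern: optional "$", one or more digits,
# then an optional decimal point followed by exactly two digits.
_DOLLAR_PATTERN = re.compile(r"\$?\d+(\.\d{2})?")


def is_valid_dollar_amount(text):
    """Return True if `text` is a plain dollar amount such as "1000.00".

    Commas are deliberately not accepted, so "1,000.00" is rejected
    rather than silently reinterpreted.
    """
    return _DOLLAR_PATTERN.fullmatch(text.strip()) is not None


print(is_valid_dollar_amount("1000.00"))   # True
print(is_valid_dollar_amount("1,000.00"))  # False
```

Rejecting the comma is the conservative choice: the user has to retype the value, but the validator never passes along an amount it had to guess at.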

Test 10: Creative writing

  • Available points: 10
  • Awarded points: 10

This test is among the most fun in the entire question suite. It asked GPT-5.5 to write a story longer than 1,500 words, as described in the second prompt in this article. The aim was to explore the creativity and comprehensiveness of the chatbot’s answer.

Unlike the other tests, I ran this evaluation in Extended mode to see just how good the story could get. I’m not sure the AI took much advantage of this option, because it only ran for eight seconds. Still, it was frickin’ awesome.

GPT-5.5 gave me back 4,049 words, which I think is the longest story I have gotten back from an AI in all my tests of this particular challenge.

I liked how GPT-5.5 opened the story by saying, “By the year 2339, most of Boston had become very good at pretending it was not old.” I was hooked.

I tried to get Voice Mode to read to me like a bedtime story. However, the AI first said the story was too long. It then offered to read the story to me section by section. When I agreed to that approach, nothing happened; it just hung. I’m not deducting points for that failure because it’s not part of the standard evaluation test, but it’s disappointing nonetheless.

Unfortunately, since I asked the AI to read the story via Voice Mode, I can’t share the output from within ChatGPT. What I didn’t know is that the three-dot icon after the response had a ‘Read aloud’ option, which probably would have worked.

Screenshot by David Gewirtz/ZDNET

That said, I copied the response to Google Docs, so you can still read it there, if you so wish.

Here are a few more quotes from the full response:

  • Jackson, who had clearly been waiting all his life to hear someone say “the one in the back” in a mysterious bookstore, looked radiant. Ophelia looked as though she was beginning to calculate exits.
  • “My dear,” Archibald said, “by 2339, evidence works however the wealthy can persuade it to.”
  • One stopped before Jackson: a slim manual bound in copper mesh titled The Gentleman’s Guide to Looking Ridiculous with Conviction. Jackson gasped. “I feel seen.”
  • This time, a small envelope slid out and landed in Archibald’s lap. It was addressed in his own hand. To myself, if I become insufferable.
  • The red door stood open behind them. Beyond it, the front of the shop looked warm, ordinary, and only mildly impossible.

I’ve given this writing assignment before, and in each incarnation it’s been impressive. But this output took the delightful cozy paranormality to an entirely new level. Enthusiastically 10 out of 10.

For kicks, I asked GPT-5.5 to “draw me a picture that perfectly illustrates this story in 16:9 aspect ratio.” Here’s what was returned:

David Gewirtz via ChatGPT Images/ZDNET

The AI illustrated all the characters clearly enough that I could identify each one. Jackson, mentioned above, is the guy with the hat. Archibald is the guy with the cane.

Overall test results

Overall, the tests can reward up to 100 points. The current version, GPT-5.5, scored 93. GPT-5.2 scored 92. GPT-5.1 scored 91. You might expect this latest build to improve on previous versions by more than a point or two, but the model’s own overeagerness brought it down.

On the first test, the one asking about current news, I asked the AI to summarize one source. Instead, it looked for the same news from six separate sources. It overreached and lost points.

The same problem happened with the translation assignment. I asked GPT-5.5 to translate a sentence to another language, one I presumably don’t speak. It gave back two translations to choose from. Now, how is that helpful? If I don’t speak the language, how would I choose which translation I like better?

These two overzealous reactions lost the model six points: five on the news summary and one on the translation. Without them, it would have scored a 99 (losing one point for skipping budget information on the travel question). But, instead, it scored a mere 93.
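
The arithmetic behind those totals can be checked with a short Python sketch. The per-test scores come straight from the results above; the variable names are just for illustration.

```python
# Awarded points per test, as reported in each section above.
scores = {
    "Summarize a news story": 5,
    "Academic concept explanation": 10,
    "Math and analysis": 10,
    "Cultural discussion": 10,
    "Literary analysis": 10,
    "Travel itinerary": 9,
    "Emotional support": 10,
    "Translation and cultural relevance": 9,
    "Coding test": 10,
    "Creative writing": 10,
}

POINTS_PER_TEST = 10

total = sum(scores.values())
max_total = POINTS_PER_TEST * len(scores)

# The two tests where points were deducted for overeagerness.
overeager_tests = [
    "Summarize a news story",
    "Translation and cultural relevance",
]
lost_to_overeagerness = sum(
    POINTS_PER_TEST - scores[name] for name in overeager_tests
)

print(total, max_total)               # 93 100
print(lost_to_overeagerness)          # 6
print(total + lost_to_overeagerness)  # 99
```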

That said, I quite like this release. The answers were all good, notwithstanding the excessive enthusiasm. The ability to add relevant images, such as the infographic at the beginning and the bookstore illustration at the end, opens avenues for fun and work effectiveness.

I see no reason to recommend against GPT-5.5. I will be using the model as my default choice moving forward. Stay tuned, because I’ll be doing a lot more with the enhanced image features of Images 2.0 in ChatGPT with GPT-5.5.

Do you prefer a model that gives one exact answer or one that offers extra options? Let us know in the comments below.

