Models

Google launches Gemini 3.5 Flash, a new LLM, and faces skepticism

Strong numbers on Google's own charts. In the AI community, the mood is different, especially after Cursor released Composer 2.5.

Published 19 May 2026 · 6 min read

Gemini 3.5 Flash announced on stage at a Google keynote

On May 19, Google launched Gemini 3.5 Flash. Logan Kilpatrick, product lead for Google AI Studio and the Gemini API at Google DeepMind, called it the company's most capable Flash model yet, tuned over six months for real-world agent workflows. The model is rolling out across Google's products now. On stage, the story was clear: more intelligence, more speed, a better price.

Google's own numbers look strong. DeepMind's published eval table puts 3.5 Flash at or near the top on agentic benchmarks, multimodal tasks, and coding tests such as Terminal-Bench 2.1, close to GPT-5.5 and ahead of Claude Sonnet 4.6 on several rows.

Gemini 3.5 Flash benchmark comparison table from Google DeepMind. Source: deepmind.google/models/evals-methodology/gemini-3-5-flash/

The mood in the AI community was different. The day before, Cursor had released Composer 2.5, and the conversation quickly moved to what models actually deliver in real coding work, and at what price. Benchmarks do not tell the whole story. Once developers had tried Gemini 3.5 Flash for themselves, the verdict was that the model lags behind in practice.

That raises a fair question: how can one of the world's largest companies, with its advantages in data, infrastructure, and research talent, fall short of the other leading labs right now, especially when Cursor ships a model that performs far better on the tasks developers actually measure?

Bar chart comparing benchmark score and cost per task for Composer 2.5, Gemini 3.5 Flash, and other frontier models. Community comparison (via @shiri_shh): Composer 2.5 at 63.2% and ~$0.55 per task versus Gemini 3.5 Flash at 49.8% and ~$1.94.

The community chart is not official Google data, but it captures the gap. Composer 2.5 scored 13.4 percentage points higher on the coding eval (63.2% versus 49.8%) at under a third of the cost per task (about $0.55 versus $1.94, or roughly 28%).
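To make the gap concrete, here is a minimal Python sketch that works through the chart's arithmetic. The scores and per-task costs are the approximate community figures quoted above, not official data. The "cost per solved task" metric is an assumption for illustration: it treats the benchmark score as the share of tasks solved, which the chart itself does not confirm.

```python
# Approximate figures from the community chart (via @shiri_shh); not official Google data.
results = {
    "Composer 2.5": {"score_pct": 63.2, "cost_per_task_usd": 0.55},
    "Gemini 3.5 Flash": {"score_pct": 49.8, "cost_per_task_usd": 1.94},
}


def cost_ratio(model_a: str, model_b: str) -> float:
    """Cost per task of model_a as a fraction of model_b's cost per task."""
    return results[model_a]["cost_per_task_usd"] / results[model_b]["cost_per_task_usd"]


def cost_per_solved_task(model: str) -> float:
    """Hypothetical metric: cost per attempted task divided by the solve rate.

    Assumes the score is the fraction of tasks solved, so an average of
    1 / solve_rate attempts is paid for per solved task.
    """
    entry = results[model]
    return entry["cost_per_task_usd"] / (entry["score_pct"] / 100)


if __name__ == "__main__":
    ratio = cost_ratio("Composer 2.5", "Gemini 3.5 Flash")
    print(f"Composer 2.5 costs {ratio:.0%} of Gemini 3.5 Flash per task")
    for name in results:
        print(f"{name}: ~${cost_per_solved_task(name):.2f} per solved task")
```

Under that assumption, the difference per solved task is wider than the raw per-task prices suggest, because the cheaper model also solves more tasks.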

Criticism went further than one graph. Prominent voices such as Theo, former Twitch engineer and co-founder of T3 Chat, argued that Google's launch materials talk up speed and benchmarks but say less about what the model actually costs to run. Pricing is about $1.50 per million input tokens and $9 per million output tokens, a steep step up from earlier Flash models. Reasoning tokens can push the bill even higher. In his own agentic test, Gemini 3.5 Flash failed to rewrite a small game. GPT-5.5 completed it, including a 3D version.
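A rough Python sketch shows how those rates turn into a bill. The per-million prices are the approximate figures quoted above. Two things are assumptions for illustration only: that reasoning tokens are billed at the output rate, and the token counts in the example, which are hypothetical and not taken from any real run.

```python
# Approximate list prices quoted in the article, in USD per million tokens.
INPUT_USD_PER_MILLION = 1.50
OUTPUT_USD_PER_MILLION = 9.00


def request_cost_usd(input_tokens: int, output_tokens: int, reasoning_tokens: int = 0) -> float:
    """Estimate the cost of one request.

    Assumption: reasoning tokens are billed at the output rate.
    """
    billed_output = output_tokens + reasoning_tokens
    total = input_tokens * INPUT_USD_PER_MILLION + billed_output * OUTPUT_USD_PER_MILLION
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agentic task: 200k tokens read, 20k tokens written.
    without_reasoning = request_cost_usd(200_000, 20_000)
    with_reasoning = request_cost_usd(200_000, 20_000, reasoning_tokens=60_000)
    print(f"Without reasoning tokens: ${without_reasoning:.2f}")
    print(f"With 60k reasoning tokens: ${with_reasoning:.2f}")
```

In this hypothetical case the reasoning tokens alone account for more than half of the total, which is why headline per-token prices can understate what an agentic workload costs.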

Gemini 3.5 Flash looks competitive on Google's own benchmarks, but the AI community measures something else: verified third-party tests, price per solved task, and whether the code runs. Against Cursor's Composer 2.5, that gap is hard to ignore.