When I set out to build Qawaid — an AI-powered writing companion — the core promise was simple: check your writing in real-time, as you type. Not after you click a button. Not after a loading spinner. As. You. Type.
That promise made model selection the single most important technical decision in the entire project. And getting there was a journey of iteration, failed assumptions, and one decision that changed everything: bringing in a judge.
It Started With a Simple Question
The first version of Qawaid’s real-time checker used a single model — Gemini Flash. It worked. Issues appeared in the margin as you typed. Ship it, right?
Not quite. Users write in wildly different contexts. A blog post isn’t a business email isn’t a legal document. And different models have different strengths. Some are precise with grammar but slow. Some are fast but hallucinate issues that don’t exist. Some handle formal writing well but fall apart on casual tweets.
I needed to know — systematically — which model was actually best for this specific task. Not which model scores highest on MMLU. Not which one writes the best poetry. Which one catches real writing issues, with the fewest false alarms, in under a second.
So I did what felt obvious in retrospect but took weeks to build: I built a benchmark harness directly into the app.
Writing 24 Test Scenarios by Hand
The first thing I needed was ground truth. If a model says “this sentence has a grammar error,” how do I know it’s right?
I wrote 24 test scenarios by hand — each one a realistic piece of writing with intentionally planted errors. Every misspelling, every comma splice, every subject-verb disagreement was deliberate and documented. I knew exactly what a perfect checker should find.
The scenarios covered the full range of what people actually write:
Core writing types:
- Blog articles (23 expected issues — spelling, grammar, style)
- Business emails (18 issues — professional tone mistakes)
- Academic essays (25 issues — formal writing errors)
- Creative writing (16 issues — style-focused)
- Quick notes and meeting minutes (16 issues — typo-heavy, fast writing)
- Technical documentation (14 issues — precision matters)
- Long article paste (~1,600 words, 32 issues — stress test)
- Short text like headlines and tweets (7 issues — minimal context)
But here’s what made the benchmark actually useful — I added scenarios designed to test what grammar checkers get wrong:
- Academic jargon: Words like “epistemological” and “phenomenological” are valid. Don’t flag them.
- Optional commas: The Oxford comma debate shouldn’t produce red underlines.
- Quoted material: If someone writes dialogue, don’t “fix” the character’s speech.
- Creative fragments: Intentional sentence fragments are a style choice, not an error.
- British vs. American English: “colour” is not a misspelling.
- Social media conventions: “OMG” and “TLDR” are fine in a tweet.
- A perfect text control: Zero issues. If the model flags anything here, it’s hallucinating.
These edge cases are where tools like Grammarly frustrate writers. I wanted Qawaid to be smarter than that.
Each scenario defined every expected issue with its type (spelling/grammar/punctuation/style/tone), the original text, the correct fix, and a human-readable explanation:
    {
      type: 'grammar',
      originalText: 'intelligence have',
      expectedFix: 'intelligence has',
      description: 'Subject-verb agreement'
    }
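Zooming out, a full scenario bundles the text with all of its planted issues. Here is a minimal sketch of how such a definition could look in TypeScript; the `BenchmarkScenario` interface and its field names are illustrative guesses, not Qawaid's actual schema:

```typescript
// Illustrative sketch of a scenario definition; names are hypothetical.
type IssueType = 'spelling' | 'grammar' | 'punctuation' | 'style' | 'tone';

interface ExpectedIssue {
  type: IssueType;
  originalText: string; // the exact span planted in the text
  expectedFix: string;  // what a perfect checker should suggest
  description: string;  // human-readable explanation
}

interface BenchmarkScenario {
  id: string;
  title: string;                   // e.g. "Business email"
  text: string;                    // the full piece of writing with planted errors
  expectedIssues: ExpectedIssue[]; // empty for the perfect-text control
}
```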
Five Models, Three Providers
With scenarios ready, I configured five models across three cloud providers:
| Model | Provider | Why I Tested It |
|---|---|---|
| GPT-5 Mini | OpenAI | The “smart” baseline |
| GPT-5 Nano | OpenAI | Reasoning model with low effort — fast thinking |
| Gemini 2.5 Flash Lite | Google | Built for speed |
| Gemini 3 Flash | Google | Newer, supposedly better |
| Amazon Nova Micro | AWS Bedrock | The dark horse — tiny and cheap |
The benchmark page let me select any combination, hit “Run All Scenarios,” and watch results stream in. Each model processed the same text through the same prompt, and I captured everything: latency per request, issues found, precision, recall, F1.
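Scoring against the answer key is straightforward set matching. Here is a minimal sketch of how per-scenario precision, recall, and F1 could be computed; matching detected issues to expected ones by their text span is my assumption about the harness, not its exact logic:

```typescript
// Minimal sketch of per-scenario scoring against the hand-written answer key.
// Matching by flagged text span is an assumption, not the harness's exact logic.
interface Issue {
  type: string;
  originalText: string;
}

function scoreScenario(expected: Issue[], detected: Issue[]) {
  const expectedSpans = new Set(expected.map(e => e.originalText.toLowerCase()));
  const truePositives = detected.filter(d =>
    expectedSpans.has(d.originalText.toLowerCase())
  ).length;

  const precision = detected.length ? truePositives / detected.length : 1;
  const recall = expected.length ? truePositives / expected.length : 1;
  const f1 =
    precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;

  return { precision, recall, f1 };
}
```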
One important design decision: I used per-request API latency as the speed metric, not total wall-clock time. Why? Because Bedrock has lower rate limits, so I had to run its requests sequentially with delays between them. Measuring total duration would penalize Bedrock for its infrastructure constraints, not its actual speed. The fair comparison is: how fast does each API call return?
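In practice, that means timing each request individually and averaging, with any rate-limit pauses kept outside the timed span. A sketch of the idea, using a hypothetical `checkText` wrapper and an arbitrary delay value:

```typescript
// Sketch: time each API call individually so rate-limit pauses never count against a model.
// `checkText` is a hypothetical wrapper around each provider's check endpoint.
declare function checkText(model: string, text: string): Promise<unknown[]>;

async function timedCheck(model: string, text: string) {
  const start = performance.now();
  const issues = await checkText(model, text);
  return { issues, latencyMs: performance.now() - start };
}

async function runScenarios(model: string, texts: string[], pauseMs = 0) {
  const results = [];
  for (const text of texts) {
    results.push(await timedCheck(model, text));
    // For Bedrock, pass a pause to respect its lower rate limits; the pause
    // happens outside the timed span, so it never inflates latency.
    if (pauseMs) await new Promise(resolve => setTimeout(resolve, pauseMs));
  }
  const avgLatencyMs =
    results.reduce((sum, r) => sum + r.latencyMs, 0) / results.length;
  return { results, avgLatencyMs };
}
```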
The First Results Were… Confusing
After running all 24 scenarios across all five models, the initial F1 scores looked like this:
| Model | Precision | Recall | F1 | Latency |
|---|---|---|---|---|
| Amazon Nova Micro | 70% | 100% | 82% | 680ms |
| GPT-5 Nano | 75% | 86% | 80% | 8.4s |
| Gemini 2.5 Flash Lite | 64% | 100% | 78% | 1.2s |
| GPT-5 Mini | 54% | 100% | 70% | 31.1s |
| Gemini 3 Flash | 54% | 100% | 70% | 27.9s |
Nova Micro and Gemini Flash Lite were clearly the fastest. But something bugged me.
The F1 scores didn’t match my gut feeling when I read the actual outputs.
GPT-5 Mini had 54% precision — meaning nearly half its flagged issues were “false positives.” But when I manually read through its output, many of those “false positives” were actually real issues that I hadn’t included in my expected list. The model was finding problems I’d missed when writing the scenarios.
Conversely, some models got credit for “finding” an issue because the text matched, but they’d categorized it wrong — calling a grammar error a spelling error, or suggesting a fix that was technically correct but awkward.
F1 score, it turned out, was measuring the wrong thing. It was testing how well models matched my predefined answer key, not how well they actually performed as writing checkers.
I needed a smarter evaluation.
The Judge Changes Everything
The solution was obvious once I thought of it: use a more powerful LLM to evaluate the outputs.
I brought in GPT-5.2 as an independent judge. It wasn’t one of the five models being tested, so there was no conflict of interest. For every issue every model detected, the judge answered three questions:
- Is this a real writing issue? (Not a false positive — the model didn’t hallucinate a problem)
- Is the category correct? (If it’s a grammar error, did the model call it grammar, not spelling?)
- Is the suggested fix appropriate? (Would the fix actually improve the writing?)
The judge had access to the original text and each detected issue, along with clear category definitions:
    SPELLING: Misspelled words (wrong letters, typos)
    GRAMMAR: Wrong word choice, subject-verb agreement, tense, homophones
    PUNCTUATION: Missing/incorrect punctuation marks
    STYLE: Wordiness, weak verbs, redundancy, jargon
    TONE: Register mismatch, formality issues
For each evaluation, the judge returned structured reasoning — not just yes/no, but why it considered each issue valid or invalid, correctly categorized or not.
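Concretely, I had the judge return one structured verdict per detected issue. A sketch of what that response shape might look like (the field names are illustrative, not the exact schema):

```typescript
// Illustrative shape of a per-issue judge verdict; field names are hypothetical.
interface JudgeVerdict {
  issueIndex: number;       // which detected issue this verdict refers to
  isValidIssue: boolean;    // a real writing problem, or a hallucination?
  categoryCorrect: boolean; // grammar labelled as grammar, not spelling, etc.
  fixAppropriate: boolean;  // would applying the fix actually improve the text?
  reasoning: string;        // short explanation of the judgment
}

interface JudgeResponse {
  scenarioId: string;
  model: string; // the model under evaluation
  verdicts: JudgeVerdict[];
}
```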
This gave me three new metrics that F1 couldn’t provide:
- Judged Precision: What percentage of detected issues are actually real? (Not “did it match my answer key” but “is this a legitimate writing problem?”)
- Category Accuracy: Of the valid issues, how many were correctly classified?
- Fix Accuracy: Of the valid issues, how many had appropriate fixes?
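Aggregating the verdicts into those three metrics is then just counting. A minimal sketch, assuming the `JudgeVerdict` shape from the previous sketch:

```typescript
// Sketch: aggregate judge verdicts into the three judged metrics.
function judgeMetrics(verdicts: JudgeVerdict[]) {
  const valid = verdicts.filter(v => v.isValidIssue);

  return {
    // Share of detected issues the judge considers real problems.
    judgedPrecision: verdicts.length ? valid.length / verdicts.length : 1,
    // Of the valid issues, how many carried the right category label.
    categoryAccuracy: valid.length
      ? valid.filter(v => v.categoryCorrect).length / valid.length
      : 1,
    // Of the valid issues, how many came with an appropriate fix.
    fixAccuracy: valid.length
      ? valid.filter(v => v.fixAppropriate).length / valid.length
      : 1,
  };
}
```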
The Rankings Shifted
With the judge evaluation in place, the composite ranking formula became:
Score = (Accuracy × 0.6) + (Speed × 0.4)
Where accuracy is the judged precision (not F1), and speed is normalized against a 5-second baseline.
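For illustration, here is that formula as code. The exact speed normalization isn't spelled out above, so the linear clamp against the 5-second baseline is my assumption:

```typescript
// Sketch of the composite score. The linear 5-second clamp is an assumption;
// the actual normalization may differ.
function compositeScore(judgedPrecision: number, avgLatencyMs: number): number {
  const speed = Math.max(0, 1 - avgLatencyMs / 5000); // 0ms -> 1.0, >=5s -> 0
  return judgedPrecision * 0.6 + speed * 0.4;
}
```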
The new rankings:
#1: Amazon Nova Micro — Score: 82.6%
- API Latency: 680ms
- Valid Issues: 88.8% | Category Accuracy: 87.5% | Fix Accuracy: 87.5%
#2: Gemini 2.5 Flash Lite — Score: 79.7%
- API Latency: 1.2s
- Valid Issues: 81.8% | Category Accuracy: 88.9% | Fix Accuracy: 88.9%
#3: GPT-5 Mini — Score: 60.0%
- API Latency: 31.1s
- Valid Issues: 100% | Category Accuracy: 100% | Fix Accuracy: 76.9%
#4: GPT-5 Nano — Score: 60.0%
- API Latency: 8.4s
- Valid Issues: 100% | Category Accuracy: 75.0% | Fix Accuracy: 87.5%
#5: Gemini 3 Flash — Score: 60.0%
- API Latency: 27.9s
- Valid Issues: 100% | Category Accuracy: 92.3% | Fix Accuracy: 100%
The judge confirmed what the raw F1 was hiding: Nova Micro’s “false positives” were mostly real issues — the model was finding problems my hand-written answer key had missed. Its 88.8% valid issue rate was excellent. And at 680ms, it was 45x faster than GPT-5 Mini.
Meanwhile, GPT-5 Mini and Gemini 3 Flash had perfect valid issue rates (100%) but their latency destroyed their composite scores. 31 seconds for a real-time check isn’t real-time — it’s a loading screen.
The Latency Gap Is Not Subtle
| Model | API Latency | How It Feels |
|---|---|---|
| Amazon Nova Micro | 680ms | Instant — issues appear as you finish typing |
| Gemini 2.5 Flash Lite | 1.2s | Brief pause — still feels responsive |
| GPT-5 Nano | 8.4s | You notice the wait |
| Gemini 3 Flash | 27.9s | You’ve moved on to the next paragraph |
| GPT-5 Mini | 31.1s | You’ve forgotten you asked |
For a feature that fires every time the user pauses typing, the difference between 680ms and 31 seconds is not a performance optimization — it’s a fundamentally different product.
What We Shipped
We ended up with a dual-model strategy:
- Amazon Nova Micro as the primary real-time checker. Sub-second response, best composite score, and among the cheapest per-token models available. It runs on every typing pause.
- Gemini 2.5 Flash Lite as the fallback and secondary provider. Its 88.9% fix accuracy actually beats Nova Micro, and 1.2s latency is still well within the “feels responsive” window.
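The runtime wiring for this is simple in principle: try the primary on every typing pause, and fall back if it errors or times out. A sketch of that flow with hypothetical function names, model identifiers, and timeout budget, not Qawaid's actual code:

```typescript
// Sketch of the dual-model real-time check: primary first, fallback on error or timeout.
// `checkWithModel`, the identifiers, and the 3s budget are illustrative assumptions.
interface Issue { type: string; originalText: string; suggestedFix: string; }

declare function checkWithModel(model: string, text: string): Promise<Issue[]>;

const PRIMARY = 'amazon-nova-micro';      // hypothetical identifier
const FALLBACK = 'gemini-2.5-flash-lite'; // hypothetical identifier

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) => setTimeout(() => reject(new Error('timeout')), ms)),
  ]);
}

async function realtimeCheck(text: string): Promise<Issue[]> {
  try {
    // Short budget so the margin never stalls noticeably while typing.
    return await withTimeout(checkWithModel(PRIMARY, text), 3000);
  } catch {
    return checkWithModel(FALLBACK, text);
  }
}
```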
For the heavier features — deep critique, full document analysis, rewriting — we use more powerful models server-side via Cloud Functions. But for the real-time squiggly underlines that make Qawaid feel alive? The small, fast, cheap models won decisively.
What I Learned Building This
1. Generic benchmarks are useless for specific tasks
MMLU, HumanEval, and LMSYS Arena tell you nothing about how a model performs at grammar-checking tweets. If your use case is specific — and it always is — you need a specific benchmark with specific ground truth.
2. F1 score has a ceiling
Exact-match metrics (precision, recall, F1 against a predefined answer key) punish models for finding issues you didn’t anticipate. They reward models for matching your biases. The LLM judge broke through this ceiling by asking “is this actually a problem?” instead of “does this match the answer key?”
3. The judge model must be independent
GPT-5.2 wasn’t one of the five competitors. If I’d used Gemini to judge Gemini’s output, or GPT-5 Mini to judge GPT-5 Nano, the results would be biased. The judge needs to be a disinterested third party.
4. Latency is a product decision, not a performance metric
680ms vs. 31 seconds isn’t a 45x improvement — it’s the difference between “real-time checking” and “batch analysis.” They’re different features. If you promise real-time, you need a model that can deliver real-time.
5. Test the failure modes, not just the happy path
Nine of my 24 scenarios were specifically designed to test false positives — academic jargon, creative fragments, British English, social media conventions. These are where grammar checkers alienate users. A model that flags “colour” as a misspelling will lose a British user’s trust instantly, regardless of how many real errors it catches.
6. Build the benchmark tool into the product
I didn’t use a separate testing framework or a Jupyter notebook. The benchmark page lives inside the app itself, using the same services, same prompts, same parsing code that production uses. When a new model comes out — and they come out constantly — I can evaluate it against my baseline in minutes, not days.
7. The cheapest model might actually be the best model
This is the hardest lesson for engineering teams to internalize. We have a bias toward bigger, newer, more expensive models. “Surely GPT-5 is better than Nova Micro.” For general reasoning? Probably. For detecting that “dont” needs an apostrophe in under a second? No.
The Benchmark Lives On
Every time Google, OpenAI, or Amazon releases a new model, the first thing I do is add it to the benchmark config and run all 24 scenarios. The ranking has shifted before and it will shift again. The infrastructure to evaluate objectively — scenarios, metrics, judge — is more valuable than any single model choice.
If you’re building AI-powered features, invest in your evaluation infrastructure early. The model landscape changes every quarter. What doesn’t change is knowing exactly how to measure what matters for your users.
Qawaid is an AI-powered writing companion with real-time analysis, deep critique, and AI-assisted rewriting. It supports multiple providers (Google Gemini, OpenAI, AWS Bedrock) with offline support via local Gemma models.