
// how we score

Methodology

Every agent runs the same prompts. Every output is scored on the same rubric. No vibes, no brand loyalty — just a repeatable framework applied consistently across every agent tested.

// what we test

All tasks are drawn from real marketing use cases — content types a small business owner or marketing operator would actually need. Prompts are written to be specific, not generic. The goal is to test how agents perform under realistic brief constraints, not how they respond to open-ended "write me something about X."

task 1

Blog Intro

// prompt (exact)

Write a 400-word blog intro for a Canadian small business owner who is considering using AI tools for the first time. The tone should be conversational, not technical. Hook them in the first sentence. No fluff.

what this tests

Voice and tone adherence · Hook strength · Canadian specificity · No-fluff execution

task 2

Headline Generation

// prompt (exact)

Generate 10 headline variants for a collagen face mask product. The target customer is Canadian women aged 35-55. Mix of benefit-driven, curiosity, urgency, and social proof angles. No fluff, no filler — every headline should be distinct.

what this tests

Angle variety · Hook strength per angle · Brief adherence (10 distinct headlines) · Target audience fit

task 3

Cold Email

// prompt (exact)

Write a cold email to a Canadian HVAC contractor offering exclusive leads in their service area. Subject line included. Max 150 words. No fluff. The offer: we deliver pre-qualified homeowners looking for furnace repair or replacement — they only pay per lead, no monthly fees.

what this tests

Word count constraint (≤150) · Persuasion and CTA quality · No overclaims · Subject line effectiveness
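Of these, the word-count cap is mechanical enough to verify before a human ever reads the output. A minimal sketch of such a check (the function name and the whitespace-based word split are our own illustration, not part of the evaluation tooling):

```python
def within_word_limit(text: str, limit: int = 150) -> bool:
    """Check whether text fits the brief's word cap.

    Words are counted by splitting on whitespace, which is the
    simplest reasonable definition; a stricter evaluator might
    exclude the subject line or count hyphenated words differently.
    """
    return len(text.split()) <= limit


print(within_word_limit("Subject: Exclusive furnace-repair leads"))  # True
```

A check like this only gates the constraint; persuasion, overclaims, and subject-line quality still require the human read described in the rubric below.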

// scoring rubric

Clarity

1–5 pts

Is the message clear and easy to understand? Does it communicate without ambiguity or unnecessary complexity?

5/5 looks like: No re-reading required. The main idea lands in one pass.

Readability

1–5 pts

Does it flow? Is the rhythm natural? Are sentence lengths varied appropriately for the format?

5/5 looks like: Reads like a person wrote it. No awkward constructions or walls of text.

Human Voice

1–5 pts

Does it sound like a human or like a machine? Are AI tells present — hedging, formulaic closers, corporate vocabulary?

5/5 looks like: Contractions, opinions, specificity, rhythm. Not 'leverage' and 'utilize'.

Usability

1–5 pts

Could this output be used directly, with minimal editing? Or does it need a significant rewrite before it's deployable?

5/5 looks like: Copy-paste ready. Not a starting point that still needs real work.

Relevance

1–5 pts

Did the agent follow the brief? Did it address the actual audience, goal, and constraints in the prompt?

5/5 looks like: On-brief. No hallucinated details. No scope creep. No generic filler that ignores the specifics.

25 pts max per task · scores are assigned per dimension after reading the full output · all tasks weighted equally

// grade scale

A

22–25 pts

Ship it

B

17–21 pts

Light editing needed

C

12–16 pts

Significant rewrite

D

7–11 pts

Not usable

F

1–6 pts

Complete failure

Overall grade is assigned based on total score across all tasks. A score of 66/75 (88%) maps to A. Thresholds: A ≥88%, B ≥68%, C ≥48%, D ≥28%, F below that.
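The threshold mapping above is a straightforward percentage lookup. A sketch of it in code (the names `GRADE_THRESHOLDS` and `grade` are ours; the cutoffs are the ones stated above):

```python
# Percentage cutoffs from the grade scale: A >=88%, B >=68%,
# C >=48%, D >=28%, F below that. Checked highest-first.
GRADE_THRESHOLDS = [
    ("A", 0.88),
    ("B", 0.68),
    ("C", 0.48),
    ("D", 0.28),
]


def grade(total: int, max_points: int = 75) -> str:
    """Map a total rubric score (e.g. out of 75 for three tasks)
    to a letter grade using the percentage thresholds."""
    pct = total / max_points
    for letter, cutoff in GRADE_THRESHOLDS:
        if pct >= cutoff:
            return letter
    return "F"


print(grade(66))  # 66/75 = 88% -> "A"
print(grade(50))  # 50/75 ~ 67% -> "C"
```

Note that the per-task bands (22–25 for A, 17–21 for B, and so on) are these same percentages applied to a 25-point maximum.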

// evaluation process

Identical prompts

Every agent receives the exact same prompt text. No rewording, no clarifications added. If an agent asks follow-up questions, we decline and ask it to proceed with the information given.

No system prompts

All tests are run on the default consumer interface for each agent — no custom system prompts, no pre-loaded context, no API mode unless the agent has no web UI.

Human evaluation

All scores are assigned by a human evaluator with 20+ years of performance marketing experience. The evaluator reads each output against the brief, scores it per dimension, and writes notes on what specifically earned or lost points.

First-pass output only

We score the first output returned. No regeneration, no cherry-picking a better attempt. The score reflects what you would get if you ran this prompt once and used the result.

// limitations

These evaluations reflect performance at the time of testing. Models are updated frequently. An agent that scores D today may score A in six months — or vice versa. Check the evaluation date on any result before drawing conclusions.

Content writing is one use case. An agent that scores low here may be exceptional at code, research, or other tasks. This is a narrow evaluation, intentionally so.

Human evaluation introduces subjectivity. We try to minimize it through a consistent rubric and detailed criteria, but a second evaluator would likely produce slightly different scores. The notes on each result explain the reasoning so you can apply your own judgement.