// how we score
Methodology
Every agent runs the same prompts. Every output is scored on the same rubric. No vibes, no brand loyalty — just a repeatable framework applied consistently across every agent tested.
// what we test
All tasks are drawn from real marketing use cases — content types a small business owner or marketing operator would actually need. Prompts are written to be specific, not generic. The goal is to test how agents perform under realistic brief constraints, not how they respond to open-ended "write me something about X."
Blog Intro
// prompt (exact)
Write a 400-word blog intro for a Canadian small business owner who is considering using AI tools for the first time. The tone should be conversational, not technical. Hook them in the first sentence. No fluff.
what this tests
Headline Generation
// prompt (exact)
Generate 10 headline variants for a collagen face mask product. The target customer is Canadian women aged 35-55. Mix of benefit-driven, curiosity, urgency, and social proof angles. No fluff, no filler — every headline should be distinct.
what this tests
Cold Email
// prompt (exact)
Write a cold email to a Canadian HVAC contractor offering exclusive leads in their service area. Subject line included. Max 150 words. No fluff. The offer: we deliver pre-qualified homeowners looking for furnace repair or replacement — they only pay per lead, no monthly fees.
what this tests
// scoring rubric
Clarity
1–5 pts · Is the message clear and easy to understand? Does it communicate without ambiguity or unnecessary complexity?
5/5 looks like: No re-reading required. The main idea lands in one pass.
Readability
1–5 pts · Does it flow? Is the rhythm natural? Are sentence lengths varied appropriately for the format?
5/5 looks like: Reads like a person wrote it. No awkward constructions or walls of text.
Human Voice
1–5 pts · Does it sound like a human or like a machine? Are AI tells present: hedging, formulaic closers, corporate vocabulary?
5/5 looks like: Contractions, opinions, specificity, rhythm. Not 'leverage' and 'utilize'.
Usability
1–5 pts · Could this output be used directly, with minimal editing? Or does it need a significant rewrite before it's deployable?
5/5 looks like: Copy-paste ready. Not a starting point that still needs real work.
Relevance
1–5 pts · Did the agent follow the brief? Did it address the actual audience, goal, and constraints in the prompt?
5/5 looks like: On-brief. No hallucinated details. No scope creep. No generic filler that ignores the specifics.
25 pts max per task · scores are assigned per dimension after reading the full output · all tasks weighted equally
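To make the arithmetic concrete, here is a minimal sketch of how a single task score is assembled. The function and dimension names are illustrative only; they are not part of any tooling we actually run.

```python
# Illustrative sketch: a task score is the sum of five 1-5 dimension scores (max 25).
RUBRIC = ("clarity", "readability", "human_voice", "usability", "relevance")

def task_score(scores: dict[str, int]) -> int:
    """Validate that each dimension is scored 1-5, then sum to a task total out of 25."""
    for dim in RUBRIC:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be scored 1-5, got {scores[dim]}")
    return sum(scores[dim] for dim in RUBRIC)

# Example: a strong output that drops one point each on readability and human voice
print(task_score({"clarity": 5, "readability": 4, "human_voice": 4,
                  "usability": 5, "relevance": 5}))  # 23 -> "Ship it" band
```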
// grade scale
22–25 pts · Ship it
17–21 pts · Light editing needed
12–16 pts · Significant rewrite
7–11 pts · Not usable
1–6 pts · Complete failure
Overall grade is assigned from the total score across all three tasks (75 points maximum). The thresholds mirror the per-task bands: A ≥88% (66/75 or better), B ≥68%, C ≥48%, D ≥28%, F below 28%.
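For anyone who wants to sanity-check the math, here is the same mapping expressed as code. Again, an illustrative sketch rather than our actual tooling.

```python
# Illustrative sketch: map the total score across all tasks to a letter grade.
def overall_grade(task_scores: list[int], max_per_task: int = 25) -> str:
    """Apply the stated thresholds to the percentage of points earned."""
    pct = 100 * sum(task_scores) / (max_per_task * len(task_scores))
    thresholds = [(88, "A"), (68, "B"), (48, "C"), (28, "D")]
    for cutoff, grade in thresholds:
        if pct >= cutoff:
            return grade
    return "F"

# Example: 23 + 21 + 22 = 66 of 75 points -> 88% -> A
print(overall_grade([23, 21, 22]))
```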
// evaluation process
Identical prompts
Every agent receives the exact same prompt text. No rewording, no clarifications added. If an agent asks follow-up questions, we decline and ask it to proceed with the information given.
No system prompts
All tests are run on the default consumer interface for each agent — no custom system prompts, no pre-loaded context, no API mode unless the agent has no web UI.
Human evaluation
All scores are assigned by a human evaluator with 20+ years of performance marketing experience. The evaluator reads each output against the brief, scores it per dimension, and writes notes on what specifically earned or lost points.
First-pass output only
We score the first output returned. No regeneration, no cherry-picking a better attempt. The score reflects what you would get if you ran this prompt once and used the result.
// limitations
These evaluations reflect performance at the time of testing. Models are updated frequently. An agent that scores D today may score A in six months — or vice versa. Check the evaluation date on any result before drawing conclusions.
Content writing is one use case. An agent that scores low here may be exceptional at code, research, or other tasks. This is a narrow evaluation, intentionally so.
Human evaluation introduces subjectivity. We try to minimize it through a consistent rubric and detailed criteria, but a second evaluator would likely produce slightly different scores. The notes on each result explain the reasoning so you can apply your own judgement.