Before diving into the details, let me clarify something important. The blog post you are about to read was not written by ChatGPT alone. I am not a believer in completely automated AI content that removes the human entirely from the process. Even with the most advanced systems available today, there is still a strong need for editorial judgment, structure, and voice.

That being said, if you are pressed for time and want to create a high-quality article, the recent progress of large language models makes this a realistic option. In fact, what I will show you here is how I managed to write a blog post that scored 9.1 out of 10 in SEMrush's Writing Assistant using nothing more than one carefully designed prompt.

This is a score rarely achieved even by well-performing human-written posts. My experiment demonstrates not only the technical potential of these models but also how to combine them with human intervention to create optimized, engaging, and credible blog content.

TL;DR

I tested nine different large language models by generating 162 blog posts across nine topics, using two prompt variations. Grok 3 turned out to be the strongest performer, even producing the single highest-scoring article with a 9.1 out of 10 rating.

Adding the phrase "You are an expert SEO copywriter" at the beginning of the prompt had almost no consistent effect. In some cases, particularly with Chinese-trained models, it even lowered the quality. American-trained models performed better overall, especially in terms of readability, which emerged as one of the most decisive scoring factors.

If your target audience is looking for informative and accessible articles, the most effective strategy is to use Grok 3, GPT-5, or Claude Opus 4.1 with a refined prompt. By fine-tuning the process and testing against SEMrush's Writing Assistant, it is possible to achieve near-perfect scores that rival or even surpass human-written content.

Definitions

To avoid confusion, let's briefly revisit two key terms.

Large Language Model (LLM)
A large language model, or LLM, is a type of artificial intelligence that has "read" huge amounts of text from books, websites, and articles. Because of this training, it can put words together in a way that sounds natural, almost like a human. Tools like ChatGPT are built on LLMs: they can answer questions, explain things, help you write a blog post, or even chat with you like a very knowledgeable friend. The LLMs discussed in this post are GPT-5, Grok 3, Claude Sonnet 4, Claude Opus 4.1, Gemini 2.5 Flash, Qwen 3, Kimi K2, and DeepSeek V3.

SEO (Search Engine Optimization)
SEO stands for search engine optimization. It's all about helping your website or blog show up higher in Google when people search for something. Imagine you write a blog post—good SEO makes sure that the right people can actually find it. This is done by using the same words people type into Google (called search queries), organizing your page clearly, and making sure the content is genuinely useful. In simple terms, SEO is how you make your content easier to discover online.

Introduction

A while back, I tested several AI-powered tools that claimed to generate SEO-optimized product descriptions. To my disappointment, most were not very helpful. Rather than streamlining the process, they produced awkward generated content or overcomplicated the workflow.

This led me to the conclusion that the best approach is to work directly with large language models rather than with the smaller tools that AI developers build around them. In this blog post, I explore which models perform best at writing a blog post, and how to prompt them effectively. While blog posts and product descriptions differ in format, the underlying principles of readability, keyword targeting, and content marketing apply to both.

Experiments

To compare models fairly, I designed a systematic experiment. I asked nine large language models to write blog posts on nine different topics. For each case, I used two versions of the same prompt: one standard version and one with the additional instruction "You are an expert SEO copywriter." This gave me a total of 162 generated posts.

The models I tested included GPT-5, Gemini 2.5 Flash, Claude Sonnet 4, Claude Opus 4.1, Qwen 3, Grok 3, Kimi K2, and DeepSeek V3. To make the process efficient, I wrote a small program that automated prompt generation; a minimal sketch of such a loop is shown below. This program can be a starting point both for writing a blog post and for creating product descriptions, which I plan to cover in a future article. The code and all the prompts used in this experiment are available in a public GitHub repository: SEO Writing Prompts.
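For concreteness, here is what such a prompt-generation loop can look like in Python. The topic slugs, prompt template, and file layout are my own placeholders, not the actual code from the repository:

```python
from itertools import product
from pathlib import Path

# Models named in this post; the full experiment covered nine models and nine topics.
MODELS = ["GPT-5", "Gemini 2.5 Flash", "Claude Sonnet 4", "Claude Opus 4.1",
          "Qwen 3", "Grok 3", "Kimi K2", "DeepSeek V3"]
TOPICS = ["topic-01", "topic-02", "topic-03"]  # placeholder slugs

PERSONA = "You are an expert SEO copywriter. "
# Base template; the real prompts also specified keywords, tone, and length.
TEMPLATE = "Write an SEO-optimized blog post about {topic}."

out_dir = Path("prompts")
out_dir.mkdir(exist_ok=True)

# One prompt file per (model, topic, persona-variant) combination;
# with 9 models, 9 topics, and 2 variants this yields the 162 posts.
for model, topic, with_persona in product(MODELS, TOPICS, [False, True]):
    prompt = (PERSONA if with_persona else "") + TEMPLATE.format(topic=topic)
    variant = "persona" if with_persona else "plain"
    fname = f"{model.replace(' ', '-')}__{topic}__{variant}.txt"
    (out_dir / fname).write_text(prompt, encoding="utf-8")
```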

Every article produced was evaluated with SEMrush's Writing Assistant. This tool scores content across three key dimensions: readability, SEO performance, and tone of voice, with an optional originality check. Using feedback from these scores, I refined my prompts four times until I reached a final version that consistently produced the strongest results.
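The Writing Assistant is a web tool, so the scores have to be collected by hand. Once they are in a simple CSV, a few lines of Python are enough to reproduce per-model averages like the ones in the results table below; the CSV layout here is my assumption:

```python
import csv
from collections import defaultdict

# Assumed layout of scores.csv (recorded by hand from the Writing Assistant):
# model,topic,variant,score   e.g. "Grok 3,topic-01,plain,7.9"
scores_by_model = defaultdict(list)
with open("scores.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, fieldnames=["model", "topic", "variant", "score"]):
        scores_by_model[row["model"]].append(float(row["score"]))

# Print each model's average score, highest first.
for model, scores in sorted(scores_by_model.items(),
                            key=lambda kv: sum(kv[1]) / len(kv[1]), reverse=True):
    print(f"{model:20s} {sum(scores) / len(scores):.4f}")
```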

Results

The differences between models quickly became clear. Grok 3 delivered the best overall performance, followed by Claude Opus 4.1 and Gemini 2.5 Flash. GPT-5 was close behind, while the Chinese models—Qwen 3, Kimi K2, and DeepSeek V3—lagged behind the others.

Large Language Model    Average Score
Grok 3                  7.5333
Claude Opus 4.1         7.3375
Gemini 2.5 Flash        7.2389
GPT-5                   7.1667
Qwen 3                  7.0278
Claude Sonnet 4         6.9000
Kimi K2                 6.8611
DeepSeek V3             6.4611

Average scores

The table above shows each model's average score across all blog posts and both prompt variations.

The single best article of the experiment was produced by Grok 3, which achieved a 9.1 score. This is rarely reached even by high-quality human-written blogs.

When testing the effect of the additional "expert SEO copywriter" sentence, the results were underwhelming. On average, the improvement was only 0.05 points. For Chinese models, the effect was negative, while American-trained models showed a slight but inconsistent boost.

This suggests that simply adding stylistic flourishes to prompts does not guarantee better results. Instead, you need to fine-tune the process, iterate carefully, and keep in mind long-term ranking factors such as readability, meta-description optimization, and click-through-rate improvements.

Additional Insights

Some qualitative differences between models were just as revealing as the scores. Claude Sonnet 4 consistently produced overly complex and difficult-to-read text, ignoring the prompt's request for simplicity. Its poor readability pulled its scores down significantly.

By contrast, Claude Opus 4.1 produced text that was much easier to follow, even though it is a more advanced and typically more resource-intensive model. This shows that more capable models do not always generate more complicated writing; in some cases, they produce cleaner and more approachable content.

I also experimented with controversial prompts. For example, I asked the models to generate arguments for why young people should not read books. Most of them complied, producing surprisingly persuasive but questionable reasons. Claude Opus 4.1, however, refused to generate the article altogether, pointing out that research does not support such claims.

Finally, I noticed a consistent divide between American-trained and Chinese-trained models. The American models performed better on English-language SEO tasks, likely because they were trained on larger volumes of English data. It's possible that the results would reverse if the same experiment were run in Chinese.

How I Would Approach Writing With LLMs

Based on the experiment, my advice depends on the type of article being written. For broad, general topics, I would recommend starting with a top-performing model such as Grok 3, GPT-5, or Claude Opus 4.1, using the refined prompt that I developed. You can even ask the model to suggest blog topics, generate a short description, and propose keywords to insert into the prompt instead of coming up with them yourself; an example of such a brainstorming prompt is shown below.
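For example, a brainstorming prompt along these lines works well (the wording and topic are illustrative, not the exact prompt from my experiment):

```
Suggest five blog topics about home coffee brewing. Pick the most promising
one, write a two-sentence description of the post, and list eight keywords,
ordered by how likely people are to type them into Google.
```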

Once the draft is ready, it should be tested in SEMrush's Writing Assistant. Feedback on readability, keyword use, and tone can either be incorporated manually or fed back into the model for further refinement. A few iterations are usually enough to push the score close to 9.
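This loop can be semi-automated. Below is a minimal sketch using the OpenAI Python client as an example; the same pattern works with any of the models above. Since I am not aware of a public scoring API for the Writing Assistant, the feedback is pasted in by hand each round, and the model name and prompt wording are placeholders:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("draft.txt", encoding="utf-8") as f:
    draft = f.read()

# Each round: run the draft through SEMrush's Writing Assistant in the
# browser, paste its feedback here, and let the model revise the draft.
while True:
    feedback = input("Paste SEMrush feedback (empty to stop): ").strip()
    if not feedback:
        break
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder; use whichever model you are testing
        messages=[
            {"role": "system",
             "content": "You revise blog posts for readability, SEO, and tone of voice."},
            {"role": "user",
             "content": f"Revise the draft below based on this feedback.\n\n"
                        f"FEEDBACK:\n{feedback}\n\nDRAFT:\n{draft}"},
        ],
    )
    draft = response.choices[0].message.content
    with open("draft.txt", "w", encoding="utf-8") as f:
        f.write(draft)  # keep the latest revision for the next round
```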

For more personal, experience-based blogs, I would take a slightly different route. Start with your own draft, outlining the structure, headings, and main points you want to cover. Then, pass that draft to the model for SEO refinement and polishing. From there, the same SEMrush-based feedback loop can be applied to finalize the article, with human intervention ensuring it feels authentic.

Conclusion

This experiment shows that AI has reached the point where it can generate near-perfect SEO blog posts. The strongest models—Grok 3, GPT-5, and Claude Opus 4.1—are capable of producing content that matches or surpasses human performance in key optimization metrics.

However, the human role remains crucial. Drafts generated by AI benefit from careful review, editorial refinement, and iterative feedback cycles. Adding simple instructions like "act as an expert SEO copywriter" does little by itself, but a well-structured prompt combined with systematic improvement produces excellent results.

In the end, AI should not be treated as a replacement for writers but as a powerful assistant. When paired with human oversight, it offers speed, consistency, and optimization, while the human provides authenticity, creativity, and judgment.

That balance creates blog posts that not only perform well in SEO but also succeed in content marketing by truly engaging the target audience, whether through core posts or complementary guest posts. Over the long term, this combination of AI efficiency and human creativity is what leads to strong online visibility and sustainable results.