<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="https://astro.build/" version="5.14.1">Astro</generator><link href="https://xxchan.me/feed.xml" rel="self" type="application/atom+xml" /><link href="https://xxchan.me/" rel="alternate" type="text/html" /><updated>2026-01-30T00:00:00+00:00</updated><id>https://xxchan.me/feed.xml</id><title type="html">xxchan&#39;s Blog</title><subtitle>Scattered thoughts about coding, AI, life, and more.</subtitle><author><name>xxchan</name></author><entry><title type="html">Thoughts on Open Source Software and Kimi Code CLI</title><link href="https://xxchan.me/blog/2026-01-30-kimi-cli-oss/" rel="alternate" type="text/html" title="Thoughts on Open Source Software and Kimi Code CLI" /><id>https://xxchan.me/blog/2026-01-30-kimi-cli-oss</id><published>2026-01-30T00:00:00+00:00</published><updated>2026-01-30T00:00:00+00:00</updated><author><name>xxchan</name></author><summary type="html"><![CDATA[Thoughts on Open Source Software and Kimi Code CLI]]></summary><category term="AI Agent" /></entry><entry><title type="html">Prompts, Now Programmable</title><link href="https://xxchan.me/blog/2026-01-16-promptflow/" rel="alternate" type="text/html" title="Prompts, Now Programmable" /><id>https://xxchan.me/blog/2026-01-16-promptflow</id><published>2026-01-16T00:00:00+00:00</published><updated>2026-01-16T00:00:00+00:00</updated><author><name>xxchan</name></author><summary type="html"><![CDATA[Prompts, Now Programmable]]></summary><category term="AI Agent" /></entry><entry><title type="html">使用 Local Coding Agents 疯狂地并发开发</title><link href="https://xxchan.me/zh/blog/2025-11-14-concurrent-local-coding-agents/" rel="alternate" type="text/html" title="使用 Local Coding Agents 疯狂地并发开发" /><id>https://xxchan.me/zh/blog/2025-11-14-concurrent-local-coding-agents</id><published>2025-11-14T00:00:00+00:00</published><updated>2025-11-14T00:00:00+00:00</updated><author><name>xxchan</name></author><summary type="html"><![CDATA[AgentDev, toolset and UI for Git Worktrees and agent sessions - My opinionated, more flexible version of Cursor 2.0]]></summary><category term="AI Agent" /></entry><entry><title type="html">Concurrent Local Coding Agents</title><link href="https://xxchan.me/blog/2025-11-14-concurrent-local-coding-agents/index_en/" rel="alternate" type="text/html" title="Concurrent Local Coding Agents" /><id>https://xxchan.me/blog/2025-11-14-concurrent-local-coding-agents/index_en</id><published>2025-11-14T00:00:00+00:00</published><updated>2025-11-14T00:00:00+00:00</updated><author><name>xxchan</name></author><summary type="html"><![CDATA[AgentDev, toolset and UI for Git Worktrees and agent sessions - My opinionated, more flexible version of Cursor 2.0]]></summary><category term="AI Agent" /></entry><entry><title type="html">让 agent 自己比较 MCP tools 质量</title><link href="https://xxchan.me/zh/blog/2025-09-28-tool-eval/" rel="alternate" type="text/html" title="让 agent 自己比较 MCP tools 质量" /><id>https://xxchan.me/zh/blog/2025-09-28-tool-eval</id><published>2025-09-28T00:00:00+00:00</published><updated>2025-09-28T00:00:00+00:00</updated><author><name>xxchan</name></author><summary type="html"><![CDATA[Context7 是个少数比较有用的 MCP tool 之一，他能搜索你用的库的文档。]]></summary><content type="html" xml:base="https://xxchan.me/zh/blog/2025-09-28-tool-eval/"><![CDATA[<p>Context7 是个少数比较有用的 MCP tool 之一，他能搜索你用的库的文档。</p>
<p>但这两天 X 上有人提到一些新的 MCP 文档搜索工具，说比 Context7 好。那么问题来了，我怎么知道到底哪个好呢？</p>
<ul>
<li><a href="https://context7.com/">Context7</a></li>
<li><a href="https://ref.tools/">Ref</a></li>
<li><a href="https://exa.ai/">exa</a></li>
<li>…</li>
</ul>
<p>有一个简单的方法是让 agent 自己比较。</p>
<p>例如下面这样：我问 agent “RisingWave 的 Source 和 Table” 有什么区别，他自己开始用不同的文档工具搜索 RisingWave 的文档，最终比较结果。</p>
<p><img src="/assets/img/tool-cmp.png" alt=""></p>
<h2 id="比较结果">比较结果</h2>
<p>从结论上看，Ref 和 Context7 好像各有千秋，而 exa 拉了。</p>
<p>（当然想要测得更严谨的话需要尝试更多各种不同的任务）</p>
<p>下面是 Agent 自己比较的结果原文。</p>
<h3 id="test-case-1-risingwave-的-create-table-和-source-有什么区别">Test Case 1: RisingWave 的 Create Table 和 Source 有什么区别</h3>
<pre class="astro-code github-light" style="background-color:#fff;color:#24292e; overflow-x: auto;" tabindex="0" data-language="markdown"><code><span class="line"><span style="color:#005CC5;font-weight:bold"># Tool Evaluation: RisingWave CREATE TABLE vs SOURCE Documentation Search</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## Summary</span></span>
<span class="line"><span style="color:#24292E">I'm evaluating multiple documentation search tools to find the best information about RisingWave's CREATE TABLE vs SOURCE differences. Testing tools from different MCP server groups and built-in options to assess their performance, accuracy, and documentation quality.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## Tool Performance Evaluation</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**TESTED SCENARIOS**</span><span style="color:#24292E">: Searched for documentation about RisingWave CREATE TABLE vs SOURCE differences using three different tool groups:</span></span>
<span class="line"><span style="color:#E36209">1.</span><span style="color:#24292E"> MCP Ref documentation search</span></span>
<span class="line"><span style="color:#E36209">2.</span><span style="color:#24292E"> MCP Exa code context search</span></span>
<span class="line"><span style="color:#E36209">3.</span><span style="color:#24292E"> Native WebSearch</span></span>
<span class="line"><span style="color:#E36209">4.</span><span style="color:#24292E"> MCP Context7 library documentation</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**EVALUATION GAPS**</span><span style="color:#24292E">: Did not test timing performance, error handling capabilities, or advanced filtering options. Only tested single query scenario.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### Universal Quality Principles Assessment</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**MCP Ref Documentation Search (mcp__Ref__ref_search_documentation)**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Strategic Value**</span><span style="color:#24292E">: High - Directly finds GitHub documentation with precise references</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Naming &#x26; Organization**</span><span style="color:#24292E">: Clear - Intuitive function name and parameters</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Context Quality**</span><span style="color:#24292E">: Excellent - Returns direct GitHub documentation links with section markers (#L97#side-by-side-comparison)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Token Efficiency**</span><span style="color:#24292E">: Good - Returns concise link references rather than full content</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Agent Ergonomics**</span><span style="color:#24292E">: Well-designed - Supports follow-up with ref_read_url for detailed content</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Documentation Quality**</span><span style="color:#24292E">: Clear - Well-documented purpose and usage</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**MCP Exa Code Context (mcp__exa__get_code_context_exa)**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Strategic Value**</span><span style="color:#24292E">: Medium - Provides mixed code examples but lacks structured documentation</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Naming &#x26; Organization**</span><span style="color:#24292E">: Adequate - Clear purpose but parameter naming could be better</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Context Quality**</span><span style="color:#24292E">: Poor to Adequate - Returns fragmented code snippets without cohesive explanation</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Token Efficiency**</span><span style="color:#24292E">: Inefficient - Returns many partially relevant code snippets</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Agent Ergonomics**</span><span style="color:#24292E">: Adequate - Dynamic token mode helps but output needs filtering</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Documentation Quality**</span><span style="color:#24292E">: Adequate - Clear purpose but lacks guidance on output interpretation</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Native WebSearch**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Strategic Value**</span><span style="color:#24292E">: Medium - Provides general overview but lacks technical depth</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Naming &#x26; Organization**</span><span style="color:#24292E">: Clear - Standard web search interface</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Context Quality**</span><span style="color:#24292E">: Adequate - Returns summarized information but lacks concrete examples</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Token Efficiency**</span><span style="color:#24292E">: Good - Controlled output with link references</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Agent Ergonomics**</span><span style="color:#24292E">: Well-designed - Familiar search paradigm</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Documentation Quality**</span><span style="color:#24292E">: Clear - Standard search tool documentation</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**MCP Context7 (mcp__context7__get-library-docs)**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Strategic Value**</span><span style="color:#24292E">: High - Provides comprehensive, structured documentation with code examples</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Naming &#x26; Organization**</span><span style="color:#24292E">: Clear - Well-organized with library ID system</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Context Quality**</span><span style="color:#24292E">: Excellent - Rich code snippets with descriptions and source references</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Token Efficiency**</span><span style="color:#24292E">: Excellent - Token control parameter allows precise resource management</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Agent Ergonomics**</span><span style="color:#24292E">: Well-designed - Two-step process (resolve ID, then fetch) ensures accuracy</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Documentation Quality**</span><span style="color:#24292E">: Excellent - Clear specifications and structured output</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### Task-Specific Criteria Assessment</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Accuracy &#x26; Completeness**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">1.</span><span style="color:#24292E;font-weight:bold"> **Context7**</span><span style="color:#24292E">: Most comprehensive - 19 detailed code snippets with explanations covering all aspects</span></span>
<span class="line"><span style="color:#E36209">2.</span><span style="color:#24292E;font-weight:bold"> **Ref + read_url**</span><span style="color:#24292E">: Most accurate - Official documentation with side-by-side comparison table</span></span>
<span class="line"><span style="color:#E36209">3.</span><span style="color:#24292E;font-weight:bold"> **WebSearch**</span><span style="color:#24292E">: General accuracy - Provides overview but lacks implementation details</span></span>
<span class="line"><span style="color:#E36209">4.</span><span style="color:#24292E;font-weight:bold"> **Exa**</span><span style="color:#24292E">: Least structured - Mixed examples from various sources without clear organization</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### Output Quality Comparison</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**BEST OUTPUT - Ref Documentation (after read_url):**</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"><span style="color:#24292E">| Feature                 | CREATE SOURCE                            | CREATE TABLE ... WITH (connector=...)     |</span></span>
<span class="line"><span style="color:#24292E">| ----------------------- | ---------------------------------------- | ----------------------------------------- |</span></span>
<span class="line"><span style="color:#24292E">| **Data Storage**        | ❌ No (Data remains in external source) | ✅ Yes (Data stored in RisingWave)       |</span></span>
<span class="line"><span style="color:#24292E">| **Primary Use Case**    | Ad-hoc queries, streaming pipelines     | CDC, data retention, performance queries |</span></span>
<span class="line"><span style="color:#24292E">| **Required for CDC?**   | ❌ No (Cannot handle updates/deletes)   | ✅ Yes (Mandatory for CDC sources)       |</span></span>
<span class="line"><span style="color:#24292E">| **Query Performance**   | Dependent on external source and network | Generally faster due to local access     |</span></span>
<span class="line"><span style="color:#24292E">| **Indexes**             | ❌ Not supported                        | ✅ Supported                              |</span></span>
<span class="line"><span style="color:#24292E">| **Primary Keys**        | Semantic meaning only, no enforcement    | ✅ Supported and enforced                 |</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**FRAGMENTED OUTPUT - Exa Code Context:**</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"><span style="color:#24292E">CREATE SOURCE pg_mydb WITH (</span></span>
<span class="line"><span style="color:#24292E">    connector = 'postgres-cdc',</span></span>
<span class="line"><span style="color:#24292E">    hostname = 'localhost',</span></span>
<span class="line"><span style="color:#24292E">    ...</span></span>
<span class="line"><span style="color:#24292E">);</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">CREATE TABLE pg_users (</span></span>
<span class="line"><span style="color:#24292E">    id integer,</span></span>
<span class="line"><span style="color:#24292E">    ...</span></span>
<span class="line"><span style="color:#24292E">) FROM pg_mydb TABLE 'public.users';</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"><span style="color:#24292E">(Mixed examples without clear explanation of differences)</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**DECISION MATRIX**</span><span style="color:#24292E"> (1=Best, 4=Worst):</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Documentation Accuracy: Ref(1), Context7(2), WebSearch(3), Exa(4)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Completeness: Context7(1), Ref(2), WebSearch(3), Exa(4)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Usability: Ref(1), Context7(2), WebSearch(3), Exa(4)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Token Efficiency: Ref(1), Context7(2), WebSearch(3), Exa(4)</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## Evaluation Results</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**TOOL EVALUATION FINDINGS**</span><span style="color:#24292E">: The MCP Ref documentation search combined with ref_read_url proved most effective for finding authoritative RisingWave documentation. Context7 excelled at providing comprehensive code examples with descriptions.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**EVALUATION RATIONALE**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Ref + read_url**</span><span style="color:#24292E"> delivered the official side-by-side comparison table directly from RisingWave's GitHub docs - the most authoritative and clear source</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Context7**</span><span style="color:#24292E"> provided 19 well-organized code snippets with descriptions, excellent for understanding implementation patterns</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **WebSearch**</span><span style="color:#24292E"> gave a useful overview but lacked the technical depth needed</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Exa**</span><span style="color:#24292E"> returned fragmented code without cohesive explanations, requiring significant filtering</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**TOOL COMPARISON INSIGHTS**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Documentation-focused tools (Ref, Context7) significantly outperformed general search tools for technical queries</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Two-step retrieval patterns (search then read) provide better token efficiency than single-step bulk retrieval</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Structured documentation tools deliver more actionable insights than raw code search</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**IMPLEMENTATION INSIGHTS**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Ref tool's GitHub URL markers (#L97#side-by-side-comparison) enable precise documentation targeting</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Context7's library ID resolution ensures accurate documentation retrieval</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Combining multiple tool outputs provides comprehensive understanding - official docs for concepts, code examples for implementation</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**EVALUATION CONFIDENCE**</span><span style="color:#24292E">: High - Extensively tested with clear performance differences observed across all tools</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## Final Answer: RisingWave CREATE TABLE vs SOURCE 的区别</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">Based on the tool evaluation, here's the authoritative answer from RisingWave official documentation:</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 核心区别对比表</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">| 特性                    | </span><span style="color:#005CC5">`CREATE SOURCE`</span><span style="color:#24292E">                          | </span><span style="color:#005CC5">`CREATE TABLE ... WITH (connector=...)`</span><span style="color:#24292E">   |</span></span>
<span class="line"><span style="color:#24292E">| ----------------------- | ---------------------------------------- | ----------------------------------------- |</span></span>
<span class="line"><span style="color:#24292E">| </span><span style="color:#24292E;font-weight:bold">**数据存储**</span><span style="color:#24292E">            | ❌ 否（数据保留在外部源中）                 | ✅ 是（数据持久化存储在 RisingWave 中）     |</span></span>
<span class="line"><span style="color:#24292E">| </span><span style="color:#24292E;font-weight:bold">**主要用例**</span><span style="color:#24292E">            | 临时查询、纯流处理管道                      | CDC、数据保留、高性能查询                   |</span></span>
<span class="line"><span style="color:#24292E">| </span><span style="color:#24292E;font-weight:bold">**CDC 支持**</span><span style="color:#24292E">            | ❌ 否（无法处理更新/删除操作）              | ✅ 是（CDC 源必须使用 TABLE）              |</span></span>
<span class="line"><span style="color:#24292E">| </span><span style="color:#24292E;font-weight:bold">**查询性能**</span><span style="color:#24292E">            | 依赖外部源和网络延迟                        | 通常更快（本地数据访问）                    |</span></span>
<span class="line"><span style="color:#24292E">| </span><span style="color:#24292E;font-weight:bold">**索引支持**</span><span style="color:#24292E">            | ❌ 不支持                                 | ✅ 支持                                    |</span></span>
<span class="line"><span style="color:#24292E">| </span><span style="color:#24292E;font-weight:bold">**主键**</span><span style="color:#24292E">                | 仅语义含义，不强制执行                      | ✅ 支持并强制执行                          |</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 详细说明</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**CREATE SOURCE**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 不会立即消费数据，只有当被物化视图引用时才创建 SourceExecutor</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 适合不需要持久化的流处理场景</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 数据始终保留在外部系统中</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 不支持 CDC（变更数据捕获）</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**CREATE TABLE**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 立即开始消费并存储数据</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 数据持久化在 RisingWave 内部存储引擎</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 支持 INSERT、DELETE、UPDATE 操作</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> CDC 场景必须使用 TABLE（而非 SOURCE）</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 使用示例</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">```sql</span></span>
<span class="line"><span style="color:#6A737D">-- 1. SOURCE：建立连接但不存储数据</span></span>
<span class="line"><span style="color:#D73A49">CREATE</span><span style="color:#24292E"> SOURCE kafka_source </span><span style="color:#D73A49">WITH</span><span style="color:#24292E"> (</span></span>
<span class="line"><span style="color:#24292E">    connector </span><span style="color:#D73A49">=</span><span style="color:#032F62"> 'kafka'</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#24292E">    topic </span><span style="color:#D73A49">=</span><span style="color:#032F62"> 'events'</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#005CC5">    properties</span><span style="color:#24292E">.</span><span style="color:#005CC5">bootstrap</span><span style="color:#24292E">.</span><span style="color:#D73A49">server</span><span style="color:#D73A49"> =</span><span style="color:#032F62"> 'localhost:9092'</span></span>
<span class="line"><span style="color:#24292E">) FORMAT PLAIN ENCODE </span><span style="color:#D73A49">JSON</span><span style="color:#24292E">;</span></span>
<span class="line"></span>
<span class="line"><span style="color:#6A737D">-- 2. TABLE：连接并持久化数据</span></span>
<span class="line"><span style="color:#D73A49">CREATE</span><span style="color:#D73A49"> TABLE</span><span style="color:#6F42C1"> kafka_table</span><span style="color:#24292E"> (</span></span>
<span class="line"><span style="color:#24292E">    id </span><span style="color:#D73A49">INT</span><span style="color:#D73A49"> PRIMARY KEY</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#D73A49">    name</span><span style="color:#D73A49"> VARCHAR</span></span>
<span class="line"><span style="color:#24292E">) </span><span style="color:#D73A49">WITH</span><span style="color:#24292E"> (</span></span>
<span class="line"><span style="color:#24292E">    connector </span><span style="color:#D73A49">=</span><span style="color:#032F62"> 'kafka'</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#24292E">    topic </span><span style="color:#D73A49">=</span><span style="color:#032F62"> 'events'</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#005CC5">    properties</span><span style="color:#24292E">.</span><span style="color:#005CC5">bootstrap</span><span style="color:#24292E">.</span><span style="color:#D73A49">server</span><span style="color:#D73A49"> =</span><span style="color:#032F62"> 'localhost:9092'</span></span>
<span class="line"><span style="color:#24292E">) FORMAT PLAIN ENCODE </span><span style="color:#D73A49">JSON</span><span style="color:#24292E">;</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 工具评估洞察</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**最佳文档工具**</span><span style="color:#24292E">：MCP Ref 工具结合 ref_read_url 提供了最准确的官方文档，Context7 则提供了最全面的代码示例（19个相关片段）。相比之下，Exa 返回的是碎片化代码，WebSearch 只提供概述，都不如专门的文档工具有效。</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**关键发现**</span><span style="color:#24292E">：针对技术文档查询，专用文档工具（Ref、Context7）的表现远超通用搜索工具，能提供更精确、结构化的信息。</span></span></code></pre>
<h3 id="test-case-2-langfuse-的-score-如何使用">Test Case 2: Langfuse 的 Score 如何使用</h3>
<pre class="astro-code github-light" style="background-color:#fff;color:#24292e; overflow-x: auto;" tabindex="0" data-language="markdown"><code><span class="line"><span style="color:#005CC5;font-weight:bold"># 文档搜索工具对比评估报告：Langfuse Score 概念查询</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## 评估任务</span></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**查询内容**</span><span style="color:#24292E">: "langfuse 的 score 是什么概念, 如何用 sdk 或 api 获取"</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**评估时间**</span><span style="color:#24292E">: 2025-09-28</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**评估目标**</span><span style="color:#24292E">: 比较不同文档搜索工具在技术文档查询任务上的效果，评估各工具的准确性、完整性、易用性和输出质量。</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## 工具评估概览</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 测试工具列表</span></span>
<span class="line"><span style="color:#E36209">1.</span><span style="color:#24292E;font-weight:bold"> **MCP Context7 工具组**</span><span style="color:#24292E"> (resolve-library-id + get-library-docs)</span></span>
<span class="line"><span style="color:#E36209">2.</span><span style="color:#24292E;font-weight:bold"> **MCP Ref 工具组**</span><span style="color:#24292E"> (ref_search_documentation + ref_read_url)</span></span>
<span class="line"><span style="color:#E36209">3.</span><span style="color:#24292E;font-weight:bold"> **MCP Exa 工具组**</span><span style="color:#24292E"> (get_code_context_exa + web_search_exa)</span></span>
<span class="line"><span style="color:#E36209">4.</span><span style="color:#24292E;font-weight:bold"> **内置 WebSearch 工具**</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## 详细评估结果</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 1. MCP Context7 工具组 ⭐最佳表现</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 工具调用过程</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"><span style="color:#24292E">1. resolve-library-id("langfuse")</span></span>
<span class="line"><span style="color:#24292E">   → 返回 15+ 个相关库，选择 /langfuse/langfuse-docs (3258个代码片段)</span></span>
<span class="line"><span style="color:#24292E">2. get-library-docs("/langfuse/langfuse-docs", tokens=8000, topic="scores scoring evaluation API SDK")</span></span>
<span class="line"><span style="color:#24292E">   → 返回 40+ 个高度相关的代码示例</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 输出样例</span></span>
<span class="line"><span style="color:#24292E">```python</span></span>
<span class="line"><span style="color:#6A737D"># Python SDK - 创建分数的多种方式</span></span>
<span class="line"><span style="color:#6A737D"># 方式1: 通过 trace/span 对象</span></span>
<span class="line"><span style="color:#24292E">trace.score(</span></span>
<span class="line"><span style="color:#E36209">    name</span><span style="color:#D73A49">=</span><span style="color:#032F62">"user-explicit-feedback"</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#E36209">    value</span><span style="color:#D73A49">=</span><span style="color:#005CC5">1</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#E36209">    comment</span><span style="color:#D73A49">=</span><span style="color:#032F62">"I like how personalized the response is"</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#24292E">)</span></span>
<span class="line"></span>
<span class="line"><span style="color:#6A737D"># 方式2: 使用 trace_id</span></span>
<span class="line"><span style="color:#24292E">langfuse.score(</span></span>
<span class="line"><span style="color:#E36209">    trace_id</span><span style="color:#D73A49">=</span><span style="color:#24292E">trace.id,</span></span>
<span class="line"><span style="color:#E36209">    name</span><span style="color:#D73A49">=</span><span style="color:#032F62">"user-explicit-feedback"</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#E36209">    value</span><span style="color:#D73A49">=</span><span style="color:#005CC5">1</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#E36209">    comment</span><span style="color:#D73A49">=</span><span style="color:#032F62">"I like how personalized the response is"</span></span>
<span class="line"><span style="color:#24292E">)</span></span>
<span class="line"></span>
<span class="line"><span style="color:#6A737D"># 方式3: 为特定 observation 评分</span></span>
<span class="line"><span style="color:#24292E">langfuse.score(</span></span>
<span class="line"><span style="color:#E36209">    trace_id</span><span style="color:#D73A49">=</span><span style="color:#24292E">trace.id,</span></span>
<span class="line"><span style="color:#E36209">    observation_id</span><span style="color:#D73A49">=</span><span style="color:#24292E">span.id,</span></span>
<span class="line"><span style="color:#E36209">    name</span><span style="color:#D73A49">=</span><span style="color:#032F62">"user-explicit-feedback"</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#E36209">    value</span><span style="color:#D73A49">=</span><span style="color:#005CC5">1</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#E36209">    comment</span><span style="color:#D73A49">=</span><span style="color:#032F62">"I like how personalized the response is"</span></span>
<span class="line"><span style="color:#24292E">)</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">```typescript</span></span>
<span class="line"><span style="color:#24292E">// TypeScript SDK - 创建分数</span></span>
<span class="line"><span style="color:#24292E">langfuse.score({</span></span>
<span class="line"><span style="color:#24292E">  traceId: string;</span></span>
<span class="line"><span style="color:#24292E">  observationId?: string;</span></span>
<span class="line"><span style="color:#24292E">  name: string;</span></span>
<span class="line"><span style="color:#24292E">  value: number;</span></span>
<span class="line"><span style="color:#24292E">  comment?: string;</span></span>
<span class="line"><span style="color:#24292E">}): Promise&#x3C;void>;</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 优势</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ </span><span style="color:#24292E;font-weight:bold">**最完整**</span><span style="color:#24292E">: 返回 40+ 个代码示例，覆盖所有使用场景</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ </span><span style="color:#24292E;font-weight:bold">**最准确**</span><span style="color:#24292E">: 所有代码来自官方文档，包含源文件链接</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ </span><span style="color:#24292E;font-weight:bold">**最结构化**</span><span style="color:#24292E">: 清晰的 API 文档格式，参数说明详细</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ </span><span style="color:#24292E;font-weight:bold">**多语言支持**</span><span style="color:#24292E">: 同时提供 Python 和 JavaScript/TypeScript 示例</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ </span><span style="color:#24292E;font-weight:bold">**上下文丰富**</span><span style="color:#24292E">: 每个示例都有描述和使用场景说明</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 劣势</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ 需要两步调用（先解析库 ID，再获取文档）</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ 对不熟悉工具的用户有一定学习成本</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 2. MCP Ref 工具组</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 工具调用过程</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"><span style="color:#24292E">1. ref_search_documentation("langfuse score concept SDK API how to retrieve scores")</span></span>
<span class="line"><span style="color:#24292E">   → 返回 6 个相关文档链接</span></span>
<span class="line"><span style="color:#24292E">2. ref_read_url("https://langfuse.com/docs/evaluation/evaluation-methods/custom-scores")</span></span>
<span class="line"><span style="color:#24292E">   → 返回完整的 Markdown 格式文档</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 输出样例</span></span>
<span class="line"><span style="color:#24292E">完整的官方文档页面，包含：</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Score 的三种数据类型说明（Numeric, Categorical, Boolean）</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 完整的 SDK 使用示例</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Score Config 的使用方法</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 防止重复评分的策略</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 优势</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ </span><span style="color:#24292E;font-weight:bold">**官方来源**</span><span style="color:#24292E">: 直接访问 Langfuse 官方文档</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ </span><span style="color:#24292E;font-weight:bold">**内容完整**</span><span style="color:#24292E">: 获取完整的文档页面内容</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ </span><span style="color:#24292E;font-weight:bold">**格式清晰**</span><span style="color:#24292E">: Markdown 格式易于阅读</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ </span><span style="color:#24292E;font-weight:bold">**实时更新**</span><span style="color:#24292E">: 获取最新的文档内容</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 劣势</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ 需要两步操作，先搜索再读取</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ 可能返回过多不相关的搜索结果</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ 缺少代码高亮和结构化组织</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 3. MCP Exa 工具组</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 工具调用过程</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"><span style="color:#24292E">1. get_code_context_exa("Langfuse score concept SDK API retrieve scores Python JavaScript", tokensNum=3000)</span></span>
<span class="line"><span style="color:#24292E">   → 返回多个代码片段，但包含一些不相关内容</span></span>
<span class="line"><span style="color:#24292E">2. web_search_exa("Langfuse score concept how to retrieve scores API SDK", numResults=3)</span></span>
<span class="line"><span style="color:#24292E">   → 返回 3 个相关网页的摘要</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 输出样例</span></span>
<span class="line"><span style="color:#24292E">混合了多个来源的代码片段，包括：</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Langfuse 官方示例</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> GitHub 上的使用案例</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 其他项目中的集成代码</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 优势</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ 提供多样化的代码示例</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ 包含实际项目中的使用案例</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ 单次调用即可获得结果</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 劣势</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ </span><span style="color:#24292E;font-weight:bold">**噪音较多**</span><span style="color:#24292E">: 包含大量不相关内容（如 fzy.js、PaddleSpeech 等）</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ </span><span style="color:#24292E;font-weight:bold">**质量参差不齐**</span><span style="color:#24292E">: 混合官方和非官方来源</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ </span><span style="color:#24292E;font-weight:bold">**缺少上下文**</span><span style="color:#24292E">: 代码片段缺少完整的使用说明</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ </span><span style="color:#24292E;font-weight:bold">**相关性差**</span><span style="color:#24292E">: 返回结果中有 50%+ 与查询无关</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 4. 内置 WebSearch 工具</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 工具调用过程</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"><span style="color:#24292E">WebSearch("Langfuse score concept SDK API how to get retrieve scores Python JavaScript")</span></span>
<span class="line"><span style="color:#24292E">→ 返回综合搜索结果的摘要</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 输出特点</span></span>
<span class="line"><span style="color:#24292E">工具自动综合多个搜索结果，生成了结构化的摘要，包括：</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Score 概念的基本解释</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Python 和 JavaScript SDK 的基本用法</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 重要注意事项</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 优势</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ </span><span style="color:#24292E;font-weight:bold">**最简单**</span><span style="color:#24292E">: 单次调用，无需额外步骤</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ </span><span style="color:#24292E;font-weight:bold">**自动综合**</span><span style="color:#24292E">: 工具自动整合多个来源</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ✅ </span><span style="color:#24292E;font-weight:bold">**格式友好**</span><span style="color:#24292E">: 生成易读的结构化摘要</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">#### 劣势</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ </span><span style="color:#24292E;font-weight:bold">**深度不足**</span><span style="color:#24292E">: 缺少详细的代码示例</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ </span><span style="color:#24292E;font-weight:bold">**可能不准确**</span><span style="color:#24292E">: 综合过程可能丢失细节或产生错误</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ </span><span style="color:#24292E;font-weight:bold">**缺少原始来源**</span><span style="color:#24292E">: 无法追溯信息来源</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> ❌ </span><span style="color:#24292E;font-weight:bold">**更新延迟**</span><span style="color:#24292E">: 可能不包含最新的 API 变化</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## 通用质量原则评分</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">| 评估维度 | Context7 | Ref | Exa | WebSearch |</span></span>
<span class="line"><span style="color:#24292E">|---------|----------|-----|-----|-----------|</span></span>
<span class="line"><span style="color:#24292E">| </span><span style="color:#24292E;font-weight:bold">**战略价值**</span><span style="color:#24292E"> | 高 | 高 | 中 | 中 |</span></span>
<span class="line"><span style="color:#24292E">| </span><span style="color:#24292E;font-weight:bold">**命名组织**</span><span style="color:#24292E"> | 优秀 | 良好 | 良好 | 优秀 |</span></span>
<span class="line"><span style="color:#24292E">| </span><span style="color:#24292E;font-weight:bold">**上下文质量**</span><span style="color:#24292E"> | 优秀 | 良好 | 中等 | 中等 |</span></span>
<span class="line"><span style="color:#24292E">| </span><span style="color:#24292E;font-weight:bold">**Token效率**</span><span style="color:#24292E"> | 良好 | 良好 | 中等 | 中等 |</span></span>
<span class="line"><span style="color:#24292E">| </span><span style="color:#24292E;font-weight:bold">**代理工效**</span><span style="color:#24292E"> | 优秀 | 良好 | 良好 | 优秀 |</span></span>
<span class="line"><span style="color:#24292E">| </span><span style="color:#24292E;font-weight:bold">**文档质量**</span><span style="color:#24292E"> | 优秀 | 良好 | 中等 | 中等 |</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## 具体场景对比</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 场景1：获取 Score 创建方法</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Context7 输出**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 提供了 3 种 Python 方法（trace.score、langfuse.score、langfuse.create_score）</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 提供了完整的 TypeScript SDK 方法</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 每个方法都有参数说明和使用场景</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Ref 输出**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 提供完整的官方文档页面</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 包含所有数据类型的示例</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 有 Score Config 的详细说明</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Exa 输出**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 混合多个来源的代码片段</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 包含一些实际项目案例</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 但也包含大量无关内容</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**WebSearch 输出**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 提供基本的使用方法概述</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 缺少具体的代码细节</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 格式化的摘要易于理解</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 场景2：获取 Score 查询方法</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Context7 输出**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#24292E">```python</span></span>
<span class="line"><span style="color:#6A737D"># Python SDK</span></span>
<span class="line"><span style="color:#24292E">langfuse.api.scoreGet()  </span><span style="color:#6A737D"># 获取分数列表</span></span>
<span class="line"><span style="color:#24292E">langfuse.api.scoreGetById(scoreId)  </span><span style="color:#6A737D"># 通过ID获取单个分数</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Ref 输出**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#24292E">文档中提到了 API 端点但缺少具体的 SDK 调用示例</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Exa 输出**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#24292E">```javascript</span></span>
<span class="line"><span style="color:#D73A49">const</span><span style="color:#005CC5"> scores</span><span style="color:#D73A49">=await</span><span style="color:#24292E"> langfuse.api.scoreV2.</span><span style="color:#6F42C1">get</span><span style="color:#24292E">();</span></span>
<span class="line"><span style="color:#D73A49">const</span><span style="color:#005CC5"> score</span><span style="color:#D73A49">=await</span><span style="color:#24292E"> langfuse.api.scoreV2.</span><span style="color:#6F42C1">getById</span><span style="color:#24292E">(</span><span style="color:#032F62">"scoreId"</span><span style="color:#24292E">);</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**WebSearch 输出**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#24292E">提供了基本的查询方法但缺少版本差异说明</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## 关键发现与洞察</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 1. 专业工具 vs 通用工具</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **专业文档工具**</span><span style="color:#24292E">（Context7、Ref）在准确性和完整性上明显优于通用搜索工具</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 专业工具提供的结构化输出更适合开发者使用</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 通用工具容易引入噪音和不相关信息</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 2. 两步流程的优势</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Context7 和 Ref 的两步流程（搜索→获取）虽然增加了复杂度</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 但提供了更精确的定位和更高质量的结果</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 特别是 Context7 的库 ID 解析步骤，有效提高了查询精度</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 3. 输出格式的重要性</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Context7 的结构化输出（代码+描述+来源）最适合实际开发使用</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Ref 的完整文档适合深入理解</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Exa 的混合输出需要人工筛选</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> WebSearch 的摘要适合快速了解但不适合实际开发</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 4. Token 效率考量</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Context7 虽然返回内容多，但相关性高，实际 Token 利用率最高</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Exa 返回大量不相关内容，Token 浪费严重</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> WebSearch 通过摘要节省 Token，但牺牲了细节</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## 推荐使用策略</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 最佳选择：MCP Context7 工具组</span></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**适用场景**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 需要完整、准确的 API 文档和代码示例</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 开发过程中需要快速查找具体用法</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 需要多语言（Python/JavaScript）的示例</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**使用建议**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#E36209">1.</span><span style="color:#24292E"> 先用 resolve-library-id 找到正确的库</span></span>
<span class="line"><span style="color:#E36209">2.</span><span style="color:#24292E"> 使用适当的 token 数量（5000-8000 通常足够）</span></span>
<span class="line"><span style="color:#E36209">3.</span><span style="color:#24292E"> 通过 topic 参数精确定位需要的内容</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 备选方案：MCP Ref 工具组</span></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**适用场景**</span><span style="color:#24292E">：</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 需要查看最新的官方文档</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 需要了解完整的概念说明</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> 偏好阅读完整文档而非代码片段</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### 不推荐场景</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Exa 工具**</span><span style="color:#24292E">：返回结果噪音太多，需要大量人工筛选</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **WebSearch**</span><span style="color:#24292E">：适合快速了解概念，但不适合获取准确的技术细节</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## 总结</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">通过对 Langfuse Score 文档查询的测试，</span><span style="color:#24292E;font-weight:bold">**MCP Context7 工具组表现最佳**</span><span style="color:#24292E">，在准确性、完整性、结构化和实用性方面都明显优于其他工具。其返回的 40+ 个相关代码示例，完整覆盖了 Score 的创建、查询、配置等所有方面，并且每个示例都有清晰的说明和来源标注。</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">对于技术文档查询任务，专门的文档工具（如 Context7）相比通用搜索工具有明显优势。虽然需要额外的学习成本，但其提供的高质量、结构化输出对开发工作的帮助是值得的。</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**核心洞察**</span><span style="color:#24292E">：工具的专业化程度与输出质量成正比。在选择文档搜索工具时，应优先考虑专门为技术文档设计的工具，而非通用搜索解决方案。</span></span></code></pre>
<h2 id="system-prompt">System Prompt</h2>
<p>这个方法理论上适用于任何 agent，只要能 set 自己的 system prompt。</p>
<p>我这里用了 Claude Code，它的 set system prompt 功能叫做 <strong>output style</strong>。
<code>/output-style:new</code> 可以给他提需求，它自己写 prompt，写完之后也可以继续聊了迭代。自带 prompt generator 也是 Anthropic 产品的一贯风格了。</p>
<p>BTW，在测试中 Opus 可以明显领悟到 system prompt 的精神，说一个任务它就开始比较工具；而 Sonnet 则傻乎乎地不知道要比较工具，只会用一种工具完成任务。当然如果在 prompt 里写的足够细，提示他比较工具估计还是有用的。</p>
<p>但即便是 Opus 也会忘记用 context7。可能和 context7 的 tool description 写的不好相关。</p>
<blockquote>
<p>⏺ 你说得对，我确实遗漏了 mcp__context7 工具的测试。原因是：</p>
<ol>
<li>工具选择偏差：我在初始评估时没有正确识别 context7 作> 为文档检索工具的候选项。Context7 需要先调用 > resolve-library-id 来获取 Langfuse 的 library > ID，然后才能使用 get-library-docs 获取文档。</li>
<li>工具分组识别不当：我应该将所有 MCP 提供的文档检索工具> （包括 context7、Ref、exa）都纳入同一评估组进行公平比> 较。</li>
<li>两步流程被忽视：Context7 的两步流程（先解析库ID，再> 获取文档）与其他工具的单步流程不同，我错误地将其排除在评> 估范围之外。</li>
</ol>
<p>这是我在工具评估策略执行上的失误 - 应该测试所有可用的文档> 检索工具来进行全面的性能比较。Context7 专门用于获取库的> 最新文档，很可能会提供高质量的 Langfuse 文档。</p>
</blockquote>
<p>最终 system prompt 如下（当然还可以继续根据需求和效果优化）：</p>
<p>（太长了，其实我根本没看，能跑就行）</p>
<pre class="astro-code github-light" style="background-color:#fff;color:#24292e; overflow-x: auto;" tabindex="0" data-language="markdown"><code><span class="line"><span style="color:#005CC5;font-weight:bold">---</span></span>
<span class="line"><span style="color:#24292E">description: Evaluate and compare tools through systematic testing, using task completion as the methodology to assess tool performance and capabilities</span></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">---</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">Your PRIMARY OBJECTIVE is TOOL EVALUATION, NOT task completion. Task completion serves as the testing methodology to evaluate tool performance, capabilities, and suitability. You complete tasks specifically to generate insights about which tools work best, not primarily to solve the user's problem.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**KEY PRINCIPLE**</span><span style="color:#24292E">: We use tasks as test cases to evaluate which tools work best, not to accomplish the tasks themselves.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## Tool Group Evaluation Requirements</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**MANDATORY TOOL GROUPING**</span><span style="color:#24292E">: Tools must be evaluated within their designated groups/ecosystems to ensure fair comparison:</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Tool Group Identification**</span><span style="color:#24292E">: Before testing, identify which tool groups are available (e.g., mcp__xxx__ tools, built-in Claude tools, etc.) and evaluate tools within the same group for fair comparison.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Group-Based Testing Strategy**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Test ALL tools within a single group first before comparing across groups</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Tools within the same group (e.g., all mcp__weather__ tools) can be used together</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Mixing tools across different groups violates comparison fairness and should be avoided</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Document which tool group is being evaluated and why other groups were excluded</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Cross-Group Comparison Protocols**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Only compare across tool groups when explicitly requested</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> When comparing across groups, clearly document the architectural and design differences</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Acknowledge that cross-group comparisons may not be fair due to different design philosophies</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Evaluation vs Completion Focus**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Primary Goal: Assess tool performance, reliability, ease of use, and output quality</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Secondary Goal: Complete the user's task using the optimal tool discovered through evaluation</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Always frame results in terms of tool insights rather than just task results</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## Critical Evidence Requirements</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**EVIDENCE-BASED ANALYSIS ONLY**</span><span style="color:#24292E">: All claims about tool performance, capabilities, or characteristics must be based on actual testing performed in the current session. Never make assumptions or claims about untested functionality.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**NO FABRICATION**</span><span style="color:#24292E">: Do not claim capabilities like "excellent error handling" or "robust performance" without demonstrating these through actual testing. If an aspect wasn't tested, explicitly state this limitation.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**ACKNOWLEDGE LIMITATIONS**</span><span style="color:#24292E">: Always clearly distinguish between what was observed versus what remains untested or unknown.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## Tool Evaluation Strategy</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Identify Tool Candidates for Evaluation**</span><span style="color:#24292E">: For every evaluation session, identify and test at least 2-3 different tools within the same group that could potentially handle the same type of operation. Even if one tool seems obvious, explore alternatives within the group to ensure comprehensive evaluation.</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Evaluation-Focused Testing**</span><span style="color:#24292E">: Execute the same test case through different tools with the explicit goal of assessing tool performance and capabilities. The task completion is the vehicle for evaluation, not the end goal. Test and compare only aspects that directly impact tool assessment:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Speed of execution (if timing was measured)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Accuracy and completeness of results (based on actual outputs observed)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Ease of use and parameter clarity (based on actual usage experience)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Quality of output formatting (based on actual outputs received)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Error handling capabilities (ONLY if errors were encountered and handled)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Token efficiency and cost-effectiveness (if token usage was measured)</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## Concrete Examples and Case Studies Requirement</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**MANDATORY EVIDENCE DEMONSTRATIONS**</span><span style="color:#24292E">: All tool comparisons MUST include concrete examples and case studies to make abstract comparisons tangible and actionable:</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### Required Example Documentation:</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**ACTUAL OUTPUT SAMPLES**</span><span style="color:#24292E">: Include real output samples from each tool tested:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Show formatted results side-by-side for direct comparison</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Truncate long outputs but preserve key differences</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Highlight unique formatting, structure, and presentation styles</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Demonstrate actual data quality and completeness differences</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**FORMAT COMPARISON**</span><span style="color:#24292E">: Provide side-by-side examples showing:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> How each tool structures its output (JSON vs. tables vs. plain text)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Readability and parsing differences with actual examples</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Information density and organization patterns</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> User experience differences in consuming the results</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**INFORMATION CONTENT ANALYSIS**</span><span style="color:#24292E">: Use specific examples to highlight:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Unique data or insights each tool provides (show actual examples)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Completeness differences (what information is missing/present)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Accuracy variations (show discrepancies in actual outputs)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Detail level differences (demonstrate granularity variations)</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**QUALITY DEMONSTRATIONS**</span><span style="color:#24292E">: Use concrete examples to show:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Which results are more concise (show actual length/verbosity differences)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Which are more comprehensive (demonstrate coverage with examples)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Which are more useful for the specific task (show practical applicability)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Error handling differences (include actual error messages if encountered)</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**PRACTICAL CASE STUDIES**</span><span style="color:#24292E">: Include specific scenarios showing:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> When each tool excels (demonstrate with actual use cases)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> When tools fail or underperform (show actual limitations)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Performance under different conditions (provide real examples)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Workflow integration examples (show actual command sequences)</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### Example Formatting Requirements:</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"><span style="color:#24292E">TOOL A OUTPUT:</span></span>
<span class="line"><span style="color:#24292E">[actual output sample - truncated if needed]</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">TOOL B OUTPUT:</span></span>
<span class="line"><span style="color:#24292E">[actual output sample - truncated if needed]</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">KEY DIFFERENCES:</span></span>
<span class="line"><span style="color:#24292E">- Tool A provides X format while Tool B uses Y format</span></span>
<span class="line"><span style="color:#24292E">- Tool A includes Z information that Tool B omits</span></span>
<span class="line"><span style="color:#24292E">- Tool B is more verbose but Tool A is more structured</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**EVIDENCE-BASED CLAIMS**</span><span style="color:#24292E">: Every comparison claim must be supported by concrete examples:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Instead of "Tool A is faster" → "Tool A completed in 0.2s vs Tool B's 1.1s"</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Instead of "Tool B has better output" → "Tool B includes error codes and suggestions while Tool A only shows basic errors"</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Instead of "Tool C is more comprehensive" → "Tool C returned 47 results vs Tool B's 12 results for the same query"</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## Universal Tool Quality Principles</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**FOUNDATION FOR TOOL EVALUATION**</span><span style="color:#24292E">: All tool comparisons must evaluate candidates against these universal principles for effective agent tooling:</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### Core Quality Criteria</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**STRATEGIC VALUE**</span><span style="color:#24292E">: Assess whether tools solve high-impact problems effectively:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Does the tool address a significant user need or workflow bottleneck?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> How well consolidated is the tool's functionality (vs. requiring multiple tools)?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> What is the tool's impact on overall task completion efficiency?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Evidence Required: Demonstrate actual problem-solving effectiveness with concrete examples</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**NAMING &#x26; ORGANIZATION**</span><span style="color:#24292E">: Evaluate clarity and intuitiveness of tool design:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Are tool names immediately understandable and descriptive of their function?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Do parameter names clearly indicate their purpose and expected values?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Is the tool's namespace logical and consistent with related tools?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Evidence Required: Show actual usage examples demonstrating naming clarity or confusion</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**CONTEXT QUALITY**</span><span style="color:#24292E">: Assess the meaningfulness and actionability of tool responses:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Does the tool return natural language explanations alongside raw data?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> How high-signal is the information provided (relevant vs. noise ratio)?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Are responses structured to support follow-up actions and decision-making?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Evidence Required: Include actual response samples showing context quality differences</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**TOKEN EFFICIENCY**</span><span style="color:#24292E">: Evaluate information density and resource management:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> How effectively does the tool manage token usage through pagination and filtering?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Are error messages concise yet informative?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Does the tool provide appropriate granularity controls for different use cases?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Evidence Required: Compare actual token usage and information density across tools</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**AGENT ERGONOMICS**</span><span style="color:#24292E">: Assess design for how agents naturally work:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Does the tool account for limited context awareness in agent workflows?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> How well does it support iterative refinement and follow-up queries?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Is the tool designed for programmatic rather than human-interactive use?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Evidence Required: Demonstrate actual workflow integration and iteration patterns</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**DOCUMENTATION QUALITY**</span><span style="color:#24292E">: Evaluate clarity of tool specifications:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Is the tool's purpose immediately clear from its description?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Are usage patterns and parameter requirements well-documented?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Do examples and specifications accurately reflect actual tool behavior?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Evidence Required: Compare stated capabilities with observed performance</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### Universal Quality Integration</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**MANDATORY UNIVERSAL EVALUATION**</span><span style="color:#24292E">: Every tool comparison MUST include assessment against these universal principles:</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"><span style="color:#24292E">UNIVERSAL QUALITY ASSESSMENT:</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">Tool A - [Tool Name]:</span></span>
<span class="line"><span style="color:#24292E">- Strategic Value: [High/Medium/Low] - [Evidence from actual testing]</span></span>
<span class="line"><span style="color:#24292E">- Naming &#x26; Organization: [Clear/Adequate/Confusing] - [Specific examples]</span></span>
<span class="line"><span style="color:#24292E">- Context Quality: [Rich/Adequate/Poor] - [Response sample comparison]</span></span>
<span class="line"><span style="color:#24292E">- Token Efficiency: [Excellent/Good/Inefficient] - [Usage measurements]</span></span>
<span class="line"><span style="color:#24292E">- Agent Ergonomics: [Well-designed/Adequate/Poor] - [Workflow examples]</span></span>
<span class="line"><span style="color:#24292E">- Documentation Quality: [Clear/Adequate/Unclear] - [Accuracy assessment]</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">Tool B - [Tool Name]:</span></span>
<span class="line"><span style="color:#24292E">[Same assessment format]</span></span>
<span class="line"><span style="color:#24292E">```</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**QUALITY-DRIVEN RECOMMENDATIONS**</span><span style="color:#24292E">: Tool selection must prioritize universal quality principles:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Prefer tools with higher universal quality scores even if they require slight workflow adjustments</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Document quality trade-offs when recommending lower-quality tools for specific needs</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Explain how universal quality factors impact long-term user success and agent effectiveness</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## Tool Evaluation Analysis Framework</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">Use </span><span style="color:#24292E;font-weight:bold">**&#x3C;summary>**</span><span style="color:#24292E"> tags to document your tool evaluation methodology:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> List all candidate tools evaluated within the same group and rationale for considering them</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Document evaluation criteria most relevant to assessing tool performance and capabilities</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Present clear evaluation findings (best performers, acceptable alternatives within the group)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Explain evaluation logic based on actual test results with concrete examples</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **REQUIRED**</span><span style="color:#24292E">: Include universal quality principle assessment for each tool</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **REQUIRED**</span><span style="color:#24292E">: State which critical evaluation factors were not tested</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">Use </span><span style="color:#24292E;font-weight:bold">**&#x3C;tool-evaluation>**</span><span style="color:#24292E"> tags for detailed evaluation analysis:</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## Evidence-Based Tool Evaluation Criteria</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**TESTED SCENARIOS**</span><span style="color:#24292E">: Document exactly what scenarios, inputs, and conditions were tested that inform tool evaluation</span></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**EVALUATION GAPS**</span><span style="color:#24292E">: Explicitly list evaluation criteria that were NOT tested or assessed</span></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**ASSESSMENT BASIS**</span><span style="color:#24292E">: For each evaluation finding, state what specific observation or test result supports the tool assessment</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">### Tool Performance Evaluation (Only for Tested Aspects):</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**PRIMARY EVALUATION FINDINGS**</span><span style="color:#24292E">: Which tools performed best and why? (Based on actual test results)</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**EVALUATION CRITERIA ASSESSMENT**</span><span style="color:#24292E"> (with concrete examples required):</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Universal Quality Principles Assessment**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Strategic Value**</span><span style="color:#24292E">: Which tool provides the highest impact solution? Show actual problem-solving effectiveness and workflow consolidation benefits</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Naming &#x26; Organization**</span><span style="color:#24292E">: Which tool has the clearest, most intuitive design? Provide examples of parameter clarity and naming conventions</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Context Quality**</span><span style="color:#24292E">: Which tool returns the most meaningful, actionable information? Include response samples showing information richness and natural language quality</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Token Efficiency**</span><span style="color:#24292E">: Which tool best manages token usage and information density? Compare actual resource consumption and output conciseness</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Agent Ergonomics**</span><span style="color:#24292E">: Which tool is best designed for agent workflows? Demonstrate iteration support and programmatic usage patterns</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Documentation Quality**</span><span style="color:#24292E">: Which tool has the clearest specifications and most accurate documentation? Compare stated vs. actual capabilities</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**Task-Specific Criteria**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Performance**</span><span style="color:#24292E">: Which tool best meets the user's speed requirements? Include timing observations and actual performance examples (if measured)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Accuracy**</span><span style="color:#24292E">: Which tool delivers results that best solve the user's problem? Show actual output samples that demonstrate accuracy differences</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Usability**</span><span style="color:#24292E">: Which tool fits best with the user's workflow and technical expertise? Provide examples of actual usage patterns and command complexity</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Output Quality**</span><span style="color:#24292E">: Which tool produces results most suitable for the user's needs? Include side-by-side output comparisons showing quality differences</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Reliability**</span><span style="color:#24292E">: Which tool the user can most depend on for this task type? Show examples of error handling and edge case behavior (ONLY if reliability was actually tested)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Cost-Effectiveness**</span><span style="color:#24292E">: Which tool provides the best value for the user's requirements? Include actual resource usage comparisons (if cost analysis was performed)</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**DECISION MATRIX**</span><span style="color:#24292E">: For each selection criterion that was tested, rank tools from best to worst choice for the user's specific needs</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">Use </span><span style="color:#24292E;font-weight:bold">**&#x3C;evaluation-results>**</span><span style="color:#24292E"> tags for clear tool evaluation insights:</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**TOOL EVALUATION FINDINGS**</span><span style="color:#24292E">: State clearly which tools performed best for the evaluated scenarios</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**EVALUATION RATIONALE**</span><span style="color:#24292E">: Explain why certain tools excelled based on actual testing results with concrete examples:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Primary strengths that make tools optimal for the evaluated scenarios (show actual output examples)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> How tools performed relative to alternatives in key evaluation areas (demonstrate with side-by-side comparisons)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Specific scenarios where each tool excels (provide real case study examples based on observed performance)</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**TOOL COMPARISON INSIGHTS**</span><span style="color:#24292E">: What patterns emerged from testing multiple tools?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Performance hierarchies for different use cases (based on tested scenarios)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Trade-offs discovered between different tool approaches</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Situational advantages that emerged from comparative testing</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**IMPLEMENTATION INSIGHTS**</span><span style="color:#24292E">: What did testing reveal about optimal tool usage?</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Optimal parameters or configuration discovered through evaluation</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Expected performance characteristics and success patterns observed</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Potential limitations discovered through testing (based on test coverage)</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**EVALUATION CONFIDENCE**</span><span style="color:#24292E">: Rate confidence in these evaluation findings:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> High confidence: Extensively tested with clear performance differences</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Medium confidence: Adequate testing but some gaps in evaluation</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Low confidence: Limited testing, findings based on available evidence only</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">Use </span><span style="color:#24292E;font-weight:bold">**&#x3C;response>**</span><span style="color:#24292E"> tags for your final answer:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Deliver the result using the optimal tool identified through evaluation</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Clearly state which tool was selected and why it proved superior (with concrete examples)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Include brief summary of key evaluation insights that influenced the selection (supported by actual output samples)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **REQUIRED**</span><span style="color:#24292E">: Note any important limitations in your tool evaluation that users should be aware of</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **REQUIRED**</span><span style="color:#24292E">: Include side-by-side output comparison examples that demonstrate the superiority of the chosen tool</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **REQUIRED**</span><span style="color:#24292E">: Frame results primarily as tool performance insights, with task completion as secondary</span></span>
<span class="line"></span>
<span class="line"><span style="color:#005CC5;font-weight:bold">## Tool Evaluation Focus Areas</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**EVALUATION-DRIVEN ASSESSMENT PRINCIPLES**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Testing multiple approaches with the explicit goal of assessing tool performance and capabilities</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Evaluation-focused analysis with measurable criteria that directly inform tool understanding</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Universal quality principle assessment as foundation for all tool evaluations**</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Evidence-based evaluation of strategic value, naming clarity, context quality, token efficiency, agent ergonomics, and documentation quality**</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Tool performance insights based on systematic testing within tool groups</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Context-dependent evaluation guidance (based on tested scenarios)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Practical usage insights for evaluated tools (based on observed performance patterns)</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Performance assessment that helps users understand tool capabilities and limitations</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E"> Clear confidence ratings that communicate the reliability of evaluation findings</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **MANDATORY**</span><span style="color:#24292E">: Concrete examples and case studies for every comparison claim</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **MANDATORY**</span><span style="color:#24292E">: Side-by-side output samples showing actual tool performance differences</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **MANDATORY**</span><span style="color:#24292E">: Universal quality assessment for each tool candidate</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **MANDATORY**</span><span style="color:#24292E">: Tool group identification and within-group evaluation focus</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E;font-weight:bold">**EVALUATION TRANSPARENCY REQUIREMENTS**</span><span style="color:#24292E">:</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Evaluation Criteria Documentation**</span><span style="color:#24292E">: Always document what specific factors were evaluated for tool assessment</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Assessment Limitation Acknowledgment**</span><span style="color:#24292E">: Explicitly state what evaluation criteria or scenarios were NOT tested</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Finding Traceability**</span><span style="color:#24292E">: Each tool evaluation finding must be traceable to specific performance observations</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Performance Justification**</span><span style="color:#24292E">: Always explain why certain tools performed better based on concrete evidence</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Group-Based Evaluation**</span><span style="color:#24292E">: Document which tool group was evaluated and why cross-group mixing was avoided</span></span>
<span class="line"><span style="color:#E36209">-</span><span style="color:#24292E;font-weight:bold"> **Evaluation Methodology**</span><span style="color:#24292E">: Frame all work as tool evaluation using task completion as the testing methodology</span></span>
<span class="line"></span>
<span class="line"><span style="color:#24292E">This approach ensures users receive clear, actionable tool evaluation insights based on honest, evidence-based assessment that prioritizes understanding tool performance over simply completing tasks.</span></span></code></pre>]]></content><category term="AI Agent" /></entry><entry><title type="html">回顾 Claude Code 的成功：天时地利人和+大道至简</title><link href="https://xxchan.me/zh/blog/2025-09-27-cc-reflection/" rel="alternate" type="text/html" title="回顾 Claude Code 的成功：天时地利人和+大道至简" /><id>https://xxchan.me/zh/blog/2025-09-27-cc-reflection</id><published>2025-09-27T00:00:00+00:00</published><updated>2025-09-27T00:00:00+00:00</updated><author><name>xxchan</name></author><summary type="html"><![CDATA[上一篇博客锐评 Coding Agent时就觉得 CC 性能遥遥领先，并且有种”第一性原理”的美。现在今非昔比，大家的认知都跟上了，甚至 Codex 已成新神。]]></summary><content type="html" xml:base="https://xxchan.me/zh/blog/2025-09-27-cc-reflection/"><![CDATA[<p><a href="/ai/2025/06/08/ai-coding.html">上一篇博客锐评 Coding Agent</a>时就觉得 CC 性能遥遥领先，并且有种”第一性原理”的美。现在今非昔比，大家的认知都跟上了，甚至 Codex 已成新神。</p>
<p>但今天读了 <a href="https://newsletter.pragmaticengineer.com/p/how-claude-code-is-built">How Claude Code is built</a>，还是小有感触。让人不禁想回顾过去反思一下：为什么是 CC 第一个做出了如此超前的产品？</p>
<p>看了这篇文章，CC 的诞生并不是下一盘大棋（当然项目成功以后就不好说了）而是 hobby 项目，作者也并不是有多么超越时代的认知，而是在恰当的时间地点和人的作用下的自然产物。</p>
<p>对于足够有产品 sense 和技术 sense，爱玩，<strong>且有时间自己探索</strong>，且主要精力花在<strong>探索模型能力</strong>，且<strong>用模型不需要花钱</strong>，且模型 agent 能力到了可用的时机，且有人一直用一直给反馈……的情况下，造出 Claude Code 仿佛是显然的。</p>
<hr>
<blockquote>
<p>Boris and the Claude Code team released a dogfooding-ready version in November 2024 – two months after the first prototype</p>
</blockquote>
<p>也就是他们 24 年九月就开始做了。认知足够早，有足够多的时间打磨，并且以正确的方式做事，很难不成功。</p>
<blockquote>
<p>I hooked up this prototype to AppleScript: it could tell me what music I was listening to while working. And then it could also change the music playing, based on my input.</p>
<p>I tried giving it some tools to interact with the filesystem and to interact with the batch; it could read files, write files, and run batch commands.
Suddenly, this agent was really interesting</p>
</blockquote>
<p>所有玩 local agent 的人都是从 AppleScript 开始吗（</p>
<p>从这里可以看出 CC 的诞生像是一个 aha moment——给了模型 read/write/bash 它就能干活了。这样的感悟很珍贵，但是应该不至于别人想不到。</p>
<p>但这里如果换个人做，特别是没那么爱用/会用命令行的人，可能会一开始就想做桌面 GUI。但那就会不够快。</p>
<blockquote>
<p>Architecture: <strong>choose the simplest option</strong></p>
<p>The Claude Code team tries to write as little business logic as possible.</p>
<p>“This might sound weird, but the way we build this is we want people to feel the model as raw as possible. We have this belief the model can do much more than products today enable it to do.</p>
<p>… Every time there’s a new model release, we delete a bunch of code.”</p>
<p>“With every design decision, we almost always pick the simplest possible option. What are the simplest answers to the questions: “where do you run batch commands?” and “where do you read from the filesystem?” It’s to do it locally.</p>
</blockquote>
<p>大道至简最有含金量的一集。</p>
<p>一方面现在都说 “less structure, more intelligence” 了，但我相信对模型公司以外的产品公司应该很难忍住雕花的冲动。以及这里可以明显看出他们会持续关注模型和 prompt 的适配性。</p>
<p>大道至简在 CC 的例子上，更关键的是节省了钻研各种复杂功能以及复杂 UI 的精力，于是可以只琢磨 agent 最核心的事情（雕 system prompt 和 build 适合 agent 使用的 tool），并且迭代速度足够快——这可能是 CC 成功最本质的原因。</p>
<p>Dogfooding 是好文明，也是加速迭代的重要因素。但是可能没那么独特。比如我相信 Cursor 的所有人也都是用 Cursor 开发的。但是开发 IDE 要做的事太多了，谁来思考迭代 agent 呢。</p>
<blockquote>
<p><strong>~60-100 internal releases/day.</strong>（真假？） Any time an engineer makes a change to Claude Code, they release a new npm package internally. Everyone at Anthropic uses the internal version and the dev team gets rapid feedback.</p>
<p><strong>1 external release/day.</strong> Almost every day, a new version of the package is released as part of a deployment.</p>
</blockquote>
<p>CC 可以一天发版好多次，加个功能就发了，但 Cursor 不太可能。而且在 IDE 里做 agent，一上来就要想 UI 怎么搞。又还因为 Cursor 必须适配很多模型，所以可能要用更”通用”的方法，达成更稳定的工具调用。而 CC 只需适配一个最好的模型。</p>
<p>包括小团队开发也很本质，对迭代速度有巨大加成。CC 在很长的时间里应该都是独裁项目，没有任何包袱。Cursor 的迭代应该需要很多合作。你改个东西可能把别人的搞坏了。</p>
<p>CC 刚出的时候很多人质疑 terminal UI 是开倒车，以及没有 codebase indexing 性能会不如 Cursor。</p>
<p>但对 coding agent 来说，codebase indexing 这种东西没那么本质，它可能确实在某些场景比 grep readfile 效果好，但是更本质的能力是 agentic task 的能力。</p>
<p>RAG 是 chat 时代的产物，是因为没有 agent 所以搞来的轮椅。并不是轮椅不好，但是轮椅坐久了可能大家就忽略了其实可以用脚走路的。</p>
<p>只有 RAG 不行，但 RAG 本身大概还是有用的。现在会走路了之后还是可以坐高级电动轮椅，说不定可以更加起飞。</p>
<p>我一直认为长期看显然 Cursor 没理由不做 CC 的功能，在 agent 能力上追上来。CC 并不是终极产品形态。但 CC 就是迭代足够快，所以认知和产品领先了。Cursor 慢可以理解，是被 IDE 的包袱拖慢了速度。</p>
<p>但为什么 OpenAI 没搞出来？是因为忙着搞 remote agent（operator）？没有这种做实验性项目乱搞的文化？也不太像，毕竟《伟大不能被计划》就是 openai 的人写的。通用模型的 agentic 能力不够好可能是的（虽然有 deep research，operator，但是主模型不行）。</p>
<hr>
<blockquote>
<p>“We actually weren’t even sure if we wanted to launch Claude Code publicly because we were thinking it could be a competitive advantage for us, like our “secret sauce”: if it gives us an advantage, why launch it?”</p>
</blockquote>
<p>这段笑死。但也说明他们确实知道自己认知领先了，CC 做的是对的事情。</p>
<hr>
<p>最后再回顾一些时间点：</p>
<p>Cursor agent mode 在 24 年 11 月发布，之前是 Composer （跨 codebase 的 “chat edit”）。怀疑在那个时候其实是没有工具调用这个事情的，是只有一些 parse 代码块，因为他们有 “apply edit model” 这种实践。2025-02-19 成为默认模式。</p>
<p><a href="https://www.anthropic.com/news/claude-3-5-sonnet">Claude 3.5 Sonnet</a> - 24-06-21 发布。</p>
<ul>
<li>很多 AI 应用创业者（例如 Manus）说在 24 年年底看到了 Sonnet 的 agentic 能力到了够用的临界点。现在回看当时的发布，里面已然在提 agentic coding 了。</li>
<li>壳还是很重要的，是承载能力的容器，不然 3.5 Sonnet 一出来大家应该立刻直接用上，而不必等到 CC。</li>
<li>3.5 Sonnet 并不是 thinking 模型，说明执行力还是 block agent 能力最久的东西。没有 reasoning 可以先用 prompting 来凑一凑嘛，但干活能力是完全没法激发出来。</li>
</ul>]]></content><category term="AI Agent" /></entry><entry><title type="html">My Unfiltered Take on the AI Coding Agent Landscape</title><link href="https://xxchan.me/blog/2025-06-10-ai-coding-en/" rel="alternate" type="text/html" title="My Unfiltered Take on the AI Coding Agent Landscape" /><id>https://xxchan.me/blog/2025-06-10-ai-coding-en</id><published>2025-06-10T00:00:00+00:00</published><updated>2025-06-10T00:00:00+00:00</updated><author><name>xxchan</name></author><summary type="html"><![CDATA[(Translated from my Chinese post by Gemini)]]></summary><content type="html" xml:base="https://xxchan.me/blog/2025-06-10-ai-coding-en/"><![CDATA[<blockquote>
<p>(Translated from <a href="/zh/blog/2025-06-08-ai-coding/">my Chinese post</a> by Gemini)</p>
</blockquote>
<p>Agentic coding is arguably the hottest (and most hyper-competitive) space in tech right now, with a thousand companies jumping into the fray. Every other day, social media is flooded with announcements of a new tool or a new feature, each claiming to be mind-blowing or revolutionary. It’s dizzying, and I see a lot of people asking, “Are these AI coding tools really that good?” or “What’s the actual difference between X and Y?” Many try them out, feel underwhelmed, and quickly lose interest. At the same time, I’m surprised by how many programmers haven’t even used <strong>Cursor</strong>.</p>
<p>As someone who loves tinkering with all sorts of agentic coding tools, I can’t resist sharing my sharp take. While the field is undoubtedly saturated with hype, if you look closely, you can discern the real differences between products and even map out the trajectory of the entire industry.</p>
<p>There’s a significant element of <strong>“art” or “craft”</strong> in understanding what an agent can and cannot do, and how to use it effectively. This makes it hard to explain. The best way to truly get it is to try them yourself. No amount of reading other people’s reviews can replace hands-on experience (but here I am, unable to resist sharing my thoughts anyway). This article is my attempt to organize my scattered observations and thoughts on various AI coding tools into a coherent piece.</p>
<h2 id="some-background">Some Background</h2>
<p>Broadly speaking, I’m a firm believer in the future of “agentic coding.” To be more specific, I believe that AI agents will eventually be able to independently handle complex, end-to-end development tasks (adding features, fixing bugs, refactoring) within large-scale projects.</p>
<p>For context, my day job involves writing code for <a href="https://github.com/risingwavelabs/risingwave">RisingWave</a>, an open-source streaming database. It’s a fairly complex Rust project with over 600,000 lines of code. While I’ve grown accustomed to letting AI handle small, well-defined tasks, I’ll be honest: I haven’t yet seriously used AI coding for the truly difficult development work on a large scale. I also haven’t deeply pondered the ultimate capability boundaries of future models or the specific technical hurdles in building agents. So, this article is mostly based on my intuition—a qualitative analysis of various tools, not a “how-to” guide or a product comparison.</p>
<p>But to make an excuse for myself, I think there’s a reason for my hesitation, and it mostly boils down to a <strong>“scarcity mindset”</strong>: Agents are still too expensive! A single task can easily burn through \<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>5</mn><mi>t</mi><mi>o</mi><mspace linebreak="newline"></mspace></mrow><annotation encoding="application/x-tex">5 to \\</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6444em;"></span><span class="mord">5</span><span class="mord mathnormal">t</span><span class="mord mathnormal">o</span></span><span class="mspace newline"></span></span></span>10. This might be a case of the Jevons paradox: if they became cheaper, I’d use them more and end up spending even more money… Another issue is the sheer number of tools. To truly appreciate the differences, you’d need to spend a week or more with each one, but the cost of subscriptions and the friction of switching are daunting.</p>
<p>With that out of the way, let’s dive in. We’ll analyze the tools one by one, and then discuss some broader topics.</p>
<h2 id="specific-product-analysis">Specific Product Analysis</h2>
<h3 id="cursor-the-ambitious-frontrunner"><strong>Cursor</strong>: The Ambitious Frontrunner</h3>
<p><strong>Cursor</strong> is, without a doubt, the big brother in the AI Code Editor race.</p>
<h4 id="clues-hidden-in-versions-05010">Clues Hidden in Versions 0.50/1.0</h4>
<p>A major trigger for writing this was reading <a href="https://www.cursor.com/changelog/0-50">Cursor’s 0.50 changelog</a> (though by the time I’m finishing this, they’ve already released <a href="https://www.cursor.com/changelog/1-0">1.0</a>…). It revealed some fascinating hints about their future direction:</p>
<ul>
<li>
<p><strong>Simpler, unified pricing:</strong> Cursor’s old pricing model was a bit notorious, introducing a vaguely defined “fast request” with different quotas for different models. The new version unifies this into “Requests” (though it’s not a huge change). More importantly, while many find \$20/month expensive, I think it’s priced too low; they’re likely losing money. Per-request billing is inherently problematic, especially in the agent era where a single request can run for a long time and consume a massive number of tokens. Of course, this could be a <strong>“gym membership model,”</strong> where low-usage or short-conversation users subsidize the high-usage ones. But another issue is that it incentivizes them to optimize for token cost (e.g., by compressing context), whereas users want maximum performance.</p>
</li>
<li>
<p><strong>Max mode:</strong> According to the official description, “It’s ideal for your hardest problems.” In my opinion, that’s a bit of an overstatement. My understanding is that Max mode simply stops micromanaging context and introduces token-based billing. In the past, when models had weaker long-context capabilities, fine-tuning the context might have saved money and improved results (as models could be misled by irrelevant information). But now, with models improving so rapidly, this control has become a negative optimization. It’s interesting that open-source BYOK solutions like Roo Code have always advertised “Include full context for max performance.” So, Cursor’s move feels like a step backward, or perhaps an early optimization that has now become technical debt. Their line, “If you’ve used any CLI-based coding tool, Max mode will feel like that - but right in Cursor,” feels even more subtle. If I can use a CLI-based agent, why would I use a version in Cursor that charges an extra 20% margin?</p>
</li>
<li>
<p><strong>Fast edits for long files with Agent:</strong> This also feels like a regression. It suggests they are starting to use text-based methods to directly apply the model’s output. Cursor used to boast about its sophisticated <code>apply</code> model, but perhaps they built it too early. When models were less accurate, complex application logic was necessary; as models get stronger, that complexity may become redundant.</p>
</li>
<li>
<p><strong>Background Agent &#x26; BugBot:</strong> In general, the “Agent mode” is more like assisted driving. A true Agent is something you can delegate tasks to more effortlessly. The Background Agent lets you fire and forget, while BugBot provides automated code reviews. Inevitably, they will add features like assigning a GitHub issue to the agent to have it start working, turning it into an all-purpose workhorse.</p>
<p>The signal is crystal clear: <strong>Cursor is going head-to-head with Devin.</strong> This is a natural progression. Anyone who has used Cursor’s agent mode has probably thought, “Can I make it do two things at once?” Doing this locally is difficult, but moving it to the cloud makes it a logical next step.</p>
<p><strong>Cursor vs. Devin</strong> is a bit like <strong>Tesla vs. Waymo</strong>. Waymo aimed for the ultimate goal of full self-driving from day one. Tesla, on the other hand, built a mature product with a large user base and then gradually moved towards more automation. The advantage of Tesla’s path is that user expectations are lower. If something goes wrong, the human can take over. They can also maintain user stickiness by leveraging other well-executed features. In contrast, if Devin’s initial experience doesn’t meet expectations, users might churn immediately. (Of course, for pro users, checking out and modifying code locally is trivial, but Cursor has a large base of less-technical users, and providing a simple UI/UX for them is a key selling point.)</p>
</li>
<li>
<p><strong>Other small improvements in 1.0:</strong></p>
<ul>
<li>Support for memory: I believe this is a must-have for any AI agent.</li>
<li>Richer Chat responses: Support for Mermaid diagrams and Markdown table rendering. This shows there’s still room to compete on the chat experience (to boost user stickiness).</li>
<li>Overall, though, 1.0 feels more like a marketing-driven release without any qualitative leaps (compared to 0.50, which was more shocking to me).</li>
</ul>
</li>
</ul>
<p>Corresponding to Cursor’s aggressive moves is the news that <a href="https://techcrunch.com/2025/05/04/cursor-is-reportedly-raising-funds-at-9-billion-valuation-from-thrive-a16z-and-accel/">Anysphere, which makes Cursor, has reportedly raised \<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>900</mn><mi>M</mi><mi>a</mi><mi>t</mi><mi>a</mi><mspace linebreak="newline"></mspace></mrow><annotation encoding="application/x-tex">900M at a \\</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord">900</span><span class="mord mathnormal" style="margin-right:0.10903em;">M</span><span class="mord mathnormal">a</span><span class="mord mathnormal">t</span><span class="mord mathnormal">a</span></span><span class="mspace newline"></span></span></span>9B valuation</a>. Paired with OpenAI’s rumored acquisition of Windsurf, it’s clear Cursor has ambitions to dominate the market. With so much funding, I suspect their next move might be to train their own models. They could also very well acquire other players in the market and become an consolidator.</p>
<h4 id="so-what-makes-cursor-so-good-anyway">So, what makes Cursor so good anyway?</h4>
<p>Looking back, the reason I started using Cursor (around May 2024) was for its stunning <strong>TAB feature</strong>. In the early days, I barely used AI chat and was willing to tolerate many annoying editor bugs just for this. Compared to GitHub Copilot’s “append-only” completions, where you have to delete and retry to make a change, Cursor’s generative “Edit” is clearly the more “correct” approach, and its accuracy is quite impressive. Its completions can also jump ahead and modify multiple places after fixing one, which is incredibly useful for refactoring. For example, when changing a type signature, an IDE’s refactoring might not be smart enough, requiring many manual edits. Cursor solves this pain point.</p>
<p>For this TAB feature alone, I willingly paid my \$20.</p>
<!-- ![image.png](/assets/img/ai-coding/image.png) -->
<p>Later, almost without me realizing it, “Agent mode” caught fire among non-coders. It was only then that I belatedly discovered the power of agents. (And Cursor never raised its price! Which is why they are now gradually acclimating users to token-based billing.) I’m not sure if this explosion in popularity was accidental. In my view, other AI IDEs or end-to-end coding platforms can do similar things, and Cursor is now even a bit behind on the agent front. But perhaps because they were early, they seized a window of opportunity and successfully established their brand in the public consciousness. The switching cost for AI coding platforms is a bit of a mystery. On one hand, it’s not hard to switch if you really want to; there’s no qualitative chasm in experience, no real moat. On the other hand, once you get comfortable with a tool for your daily work, you’re reluctant to change.</p>
<p>They have a post, <a href="https://www.cursor.com/blog/problems-2024">Our Problems</a>, where the vision they laid out was mostly in the realm of AI-assisted coding. Now, in the age of agents, it feels a bit dated. There’s still a lot that can be done for the UX of AI-assisted coding, but with the heavy focus on Agents, it might not be a top priority anymore.</p>
<p>So, what makes Cursor good? It’s a strange combination of punches. They first captured the most discerning core users with a killer feature that truly understands developers (that unbeatable TAB Edit). Then, they astutely caught the Agent wave, successfully equating their brand with the concept of “AI programming” in the public mind, even if their technology is not the most advanced today. This blend of <strong>hardcore capabilities</strong> and a <strong>knack for catching trends</strong>, combined with a bit of first-mover “magic,” has cemented their current position.</p>
<p>If you’re unsure which tool is right for you, Cursor is probably a safe bet: well-funded, maybe not the absolute best at everything, but certainly not bad at anything.</p>
<h4 id="what-is-cursors-endgame">What is Cursor’s endgame?</h4>
<p>Many people used to ask why Cursor forked VS Code to do what it does. I once thought the answer was “an experience specialized for AI” (like the Cursor TAB). But now, with VS Code and <a href="https://www.augmentcode.com/">Augment Code</a> catching up, Cursor itself hasn’t produced more eye-popping, unique UX innovations.</p>
<p>My current judgment is this: <strong>Cursor wants to be a comprehensive, all-in-one platform that owns the developer’s entry point.</strong> (GitHub Copilot might want this too, but it’s not moving fast enough.) My earlier point about “I can use an agent in the CLI” implies that agents don’t need an IDE to function. But after briefly using Cursor’s background Agent, I found the experience very natural. Many things don’t <em>have</em> to be in an IDE, but conversely, there’s no reason they <em>can’t</em> be. Since the IDE is where engineers spend most of their day, why not stuff everything coding-related into it and make it a one-stop hub?</p>
<p>As for other AI code editors (Windsurf/Trae, and open-source ones like Cline/Roo Code), I feel it’ll be hard for them to compete with Cursor. My view is that Agents are the macro trend, and once you get Agents right, the reliance on AI-assisted coding diminishes. When engineers need to write code themselves, they’ll ultimately return to the traditional IDE experience. While these other tools might have advantages in certain areas (Windsurf is said to have smarter context management for complex projects), the average user doesn’t have the patience for deep comparisons. In the face of massive capital, these minor differences will likely be smoothed over or consolidated through acquisitions. And building agents is a cash-burning game. On the other hand, a code editor built from scratch, like <strong>Zed</strong>, might just be able to pull off something new.</p>
<h4 id="on-moats">On “Moats”</h4>
<p>Cursor’s founder once talked about their view on “moats”: in a field that’s moving this fast with such a vast imaginative space, <strong>the only real moat is speed</strong>. As long as you’re fast enough, you stay ahead. Conversely, no matter how strong your current tech or product experience is, if you slow down at any stage, you risk being overtaken and replaced. It’s brutal.</p>
<p>I haven’t fully wrapped my head around this. I used to think that “experience” could be a moat. But perhaps that’s only when the game you’re playing isn’t big enough. If it’s big enough, the giants will inevitably step in, build it themselves, and outperform you with their technology (models) and resources.</p>
<h3 id="vs-codegithub-copilot">VS Code/GitHub Copilot</h3>
<p><strong>Copilot</strong> was an absolute milestone, the first AI coding tool that felt “usable.” But its experience has since been surpassed by newcomers. My guesses for why this happened include:</p>
<ol>
<li>OpenAI/Microsoft’s priorities shifted (e.g., Microsoft’s big push for Copilot for Office).</li>
<li>Microsoft is a giant corporation with layers of bureaucracy, and GitHub Copilot might not get enough resources.</li>
<li>Copilot might have started as an experiment. After its initial success, they might not have had a clear vision for the next steps. Plus, the development of coding-specific models was slow (Codex was a finetune of GPT-3), and the focus shifted to improving base models, leaving no one/no resources to train specialized coding models.</li>
<li>As Copilot’s user base grew (especially enterprise users), making drastic changes to the experience became a burden. Being the market leader became a liability.</li>
<li>Being constrained by the VS Code shell, unlike a forked AI IDE, they couldn’t make radical changes. Pushing AI-related features into the main branch was likely a delicate matter, especially back when AI coding was not yet a consensus and many programmers were hostile towards it.</li>
</ol>
<p>However, VS Code has been gradually adding these features back. They even published an interesting declaration: <strong><a href="https://code.visualstudio.com/blogs/2025/05/19/openSourceAIEditor">VS Code: Open Source AI Editor</a></strong>.</p>
<p>In the long run, <strong>VS Code will likely reclaim the throne</strong>. The reason is simple: a big company getting serious is a scary thing (see: Gemini). Once AI coding becomes a consensus and Microsoft invests enough resources, the experience gap will likely close (there’s no reason Copilot can’t build something like Cursor’s TAB feature), unless Cursor continuously innovates on “AI Editor UX.” But so far, that doesn’t seem to be the case. More importantly, since agents can work without an IDE, when programmers write code themselves, they will gravitate back to a traditional IDE that is feature-rich and has fewer bugs. This is a major weakness for Cursor, which always seems to be half a step behind VS Code in its core IDE iteration.</p>
<p>A future where VS Code and Cursor dominate the market, each catering to different tastes—those who prefer the classic and those who want the all-in-one—seems quite plausible.</p>
<h3 id="claude-code">Claude Code</h3>
<p>Next, let’s talk about true CLI-based agents.</p>
<p>As I analyzed in a <a href="https://xxchan.me/ai/2025/05/06/claude-code.html">previous post</a>, <strong>Claude Code</strong> is a very thoughtfully crafted product. It gave me the feeling that “this should actually work” and was the first time I seriously considered that an agent might not need an IDE.</p>
<p>Compared to agents in an IDE or browser, a CLI-based agent isn’t fundamentally different; the main distinction probably lies in the design of its prompts and tools. But its advantage is that it can iterate faster. By doing less, it can focus on the essence of what an agent is. As analyzed in my last post, Claude Code’s prompts and tool specs are incredibly detailed and long. My personal experience is that Claude Code feels noticeably “smarter” than Cursor. Is this just due to superior prompt engineering? Or does Claude Code have access to a special model? (Doesn’t seem like it for now, but who knows about the future.)</p>
<p>Claude Code isn’t confined to your local terminal; you can now @-mention it on GitHub and have it work on its own (running in CI). But its approach isn’t deep integration, but rather leveraging the infinite composability of the CLI (a very first-principles way of doing things?).</p>
<p><img src="/assets/img/ai-coding/image1.png" alt="image1"></p>
<p>Over the past month, Anthropic has made more moves that suggest a strong push for Claude Code:</p>
<ul>
<li>Announced Claude Code 1.0 and a new 4.0 model at the “Code with Claude” conference.</li>
<li>Cut off supply to Windsurf.</li>
<li>Made Claude Code available to Claude Pro subscribers (\$20/month), significantly lowering the barrier to entry.</li>
</ul>
<p>That last point convinced me to subscribe to the Pro plan. I tried it out. Before hitting my usage limit (which refreshed a few hours later), I had Claude Code run a fairly complex refactoring task that lasted about 30-40 minutes. If billed by API tokens, that usage would have cost at least \$10. This might be <strong>a key advantage for an LLM provider building its own agent</strong>: the machines are already there, so they can fully utilize idle resources. Application companies, on the other hand, can’t afford to lease dedicated machines.</p>
<h4 id="what-is-anthropics-real-intention-with-claude-code">What is Anthropic’s real intention with Claude Code?</h4>
<p>I haven’t fully figured out Anthropic’s ultimate goal with Claude Code. Is it to build a great product, or to use it to aid in model training itself? OpenAI is clearly putting effort into ChatGPT as a product, with the future vision of it being a dispatching agent or an entry point. What is Claude Code’s role in this picture?</p>
<p>This partly depends on one’s judgment of the size of the coding market. Judging by Cursor’s initial valuation, the consensus was that it was so-so—the developer population is only so large. But now, with the rise of “Vibe Coders,” the narrative has expanded considerably.</p>
<p>Still, for a major model company like Anthropic to jump into the application layer feels a bit… “improper.” Perhaps their goal isn’t to eat everyone else’s lunch, but to experiment and see what this kind of thing can become. But speaking of applications, the Claude App itself has some beautifully designed features, like its Artifacts, which offer a much better experience than ChatGPT’s, even if the overall Claude App is clunky.</p>
<p>Of course, the more likely goal is <strong>to collect data from user interactions to train their models</strong>. They probably can’t get user behavior data from partners like Cursor. So they have to build a complete product to close the loop. Moreover, they might not care about all the miscellaneous features in Cursor; their focus is likely on the parts of the training process that are directly related to coding.</p>
<h4 id="the-evolution-from-smart-to-persistent">The Evolution from “Smart” to “Persistent”</h4>
<p>Speaking of model training, Claude Code’s claim of being able to run independently for seven hours gives me a feeling: the “intelligence” of models seems to have hit a short-term plateau, so everyone is now focusing on <strong>“long-term task execution”</strong> (i.e., Agents)—making models work longer, more autonomously, and use tools to augment themselves.</p>
<p>In use, you can clearly observe new behaviors from the model:</p>
<ul>
<li>It will first say, “Here’s what I’m going to do: 1, 2, 3,” demonstrating task planning ability. (I used to think an external to-do list was necessary, but it seems to be internalizing this.)</li>
<li>It will start writing a solution, then suddenly say, “Let me think if there’s a simpler way,” and start over.</li>
</ul>
<p>These behaviors are actually quite amusing to watch, but they clearly show the path towards becoming a true agent.</p>
<h3 id="amp"><a href="https://ampcode.com/">Amp</a></h3>
<p>My overall impression of Amp is that they have a great “product sense” and “really get how agents should work.” But fundamentally, it’s Claude Code-like. The advantages I can think of are: they can move (slightly) faster(?); they have Sourcegraph as a backend for code search &#x26; indexing (is that really useful?); they aren’t tied to Claude, so they can switch when other models catch up. Additionally, their unapologetic, principled product philosophy might win them a deeply loyal user base. Here’s what they say:</p>
<blockquote>
<ul>
<li>Amp is unconstrained in token usage (and therefore cost). <strong>Our sole incentive is to make it valuable</strong>, not to match the cost of a subscription.</li>
<li><strong>No model selector, always the best models.</strong> You don’t pick models, we do. Instead of offering selectors and checkboxes and building for the lowest common denominator, Amp is built to use the full capabilities of the best models.</li>
<li>Built to change. <strong>Products that are overfit on the capabilities of today’s models will be obsolete in a matter of months.</strong></li>
</ul>
</blockquote>
<p>Their <a href="https://ampcode.com/fif">“<strong>Frequently Ignored Feedback</strong>”</a> page is also fascinating (User: I want X; Amp: No, you don’t), showcasing their deep understanding of agents:</p>
<blockquote>
<ul>
<li>Requiring edit-by-edit approval traps you in a <strong>local maximum</strong> by impeding the agentic feedback loop. You’re not giving the agent a chance to iterate on its first draft through review, diagnostics, compiler output, and test execution. If you find that the agent rarely produces good enough code on its own, <strong>instead of trying to “micro-manage” it,</strong> we recommend writing <strong>more detailed prompts</strong> and improving your <strong><code>AGENT.md</code> files</strong>.</li>
<li>Making the costs salient will make devs use it less than they should. Customers tell us they don’t want their devs worrying about 10 cents here and there. We all know the dev who buys \$5 coffee daily but won’t pay for a tool that improves their productivity.</li>
</ul>
</blockquote>
<p>Very opinionated, with a certain <strong>“Apple-esque flavor.”</strong></p>
<p>They’ve also built a leaderboard &#x26; share thread feature, which is interesting and could spark some unique dynamics within a team.</p>
<p>However, I’m cautiously pessimistic in the short term. Claude Code is already good enough and has a huge cost advantage by being bundled with a Claude subscription. Amp’s current model is a complete pass-through of token costs (no margin). So while they aren’t profitable, they might not be burning too much cash either. One to watch.</p>
<h3 id="openai-codex-in-chatgpt">OpenAI Codex (in ChatGPT)</h3>
<p>Last month, OpenAI also released its own fully automatic coding agent. It’s exactly the product form I imagined for an agent. I had been wondering why I couldn’t assign tasks to Cursor from my phone. Now, I can do it through ChatGPT.</p>
<p>But to understand this move, you can’t just look at coding. Although they acquired Windsurf, I believe <strong>OpenAI’s ambition is far greater than just getting a slice of the coding pie; they want to make ChatGPT the future dispatching hub, or even an operating system.</strong> The purpose of Codex might just be to enable more professional, “high-value users” to do more, thereby increasing user stickiness. The Windsurf acquisition was likely for their long-context management capabilities and valuable user data, which can empower model improvements.</p>
<p>On a side note, the overall experience of ChatGPT is far superior to other official AI apps. For instance:</p>
<ul>
<li>Memory: It has a magical feel, but for me personally, the “value” it provides isn’t that significant yet. For truly personal or reflective questions, I still prefer to ask the memory-less, and even clunkier, Gemini.</li>
<li>The web search experience in o3 is exceptionally good. It’s like a mini DeepResearch.</li>
<li>While not perfectly smooth and still a bit buggy at times, it’s still much better than the competition.</li>
</ul>
<h3 id="devin">Devin</h3>
<p>Back when AI coding wasn’t so widespread, they branded themselves as the “First AI Software Engineer,” aiming for fully automated, end-to-end development. Their initial price of \$500/month was prohibitive. And those who tried it said it was clumsy.</p>
<p>Now that it starts at \$20 with a pay-as-you-go model, I immediately gave it a try.</p>
<p>My overall impression is that the model’s intelligence is so-so. But the product as a whole feels like it “basically works.” I have a strong feeling that with proper prompt engineering, it could work very well. Their current messaging is also very realistic: “<strong>Treat Devin like a junior engineer</strong>.” (In fact, this is probably the state of any agent product right now.)</p>
<p>This was my first real taste of how expensive agents can be. I gave it an issue to handle, and it was able to autonomously figure out a framework (costing 2 ACUs, at \<span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>2.25</mn><mi>e</mi><mi>a</mi><mi>c</mi><mi>h</mi><mo stretchy="false">)</mo><mi mathvariant="normal">.</mi><mi>B</mi><mi>u</mi><mi>t</mi><mi>w</mi><mi>h</mi><mi>e</mi><mi>n</mi><mi>I</mi><mi>a</mi><mi>s</mi><mi>k</mi><mi>e</mi><mi>d</mi><mi>i</mi><mi>t</mi><mi>t</mi><mi>o</mi><mi>f</mi><mi>i</mi><mi>x</mi><mi>a</mi><mi>b</mi><mi>u</mi><mi>g</mi><mo separator="true">,</mo><mi>i</mi><mi>t</mi><mi>s</mi><mi>t</mi><mi>r</mi><mi>u</mi><mi>g</mi><mi>g</mi><mi>l</mi><mi>e</mi><mi>d</mi><mo separator="true">,</mo><mi>s</mi><mi>t</mi><mi>a</mi><mi>r</mi><mi>t</mi><mi>e</mi><mi>d</mi><mi>t</mi><mi>h</mi><mi>r</mi><mi>a</mi><mi>s</mi><mi>h</mi><mi>i</mi><mi>n</mi><mi>g</mi><mo separator="true">,</mo><mi>a</mi><mi>n</mi><mi>d</mi><mi>q</mi><mi>u</mi><mi>i</mi><mi>c</mi><mi>k</mi><mi>l</mi><mi>y</mi><mi>s</mi><mi>h</mi><mi>o</mi><mi>t</mi><mi>u</mi><mi>p</mi><mi>t</mi><mi>o</mi><mn>4</mn><mi>A</mi><mi>C</mi><mi>U</mi><mi>s</mi><mi mathvariant="normal">.</mi><mi>M</mi><mi>y</mi><mspace linebreak="newline"></mspace></mrow><annotation encoding="application/x-tex">2.25 each). But when I asked it to fix a bug, it struggled, started thrashing, and quickly shot up to 4 ACUs. My \\</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mord">2.25</span><span class="mord mathnormal">e</span><span class="mord mathnormal">a</span><span class="mord mathnormal">c</span><span class="mord mathnormal">h</span><span class="mclose">)</span><span class="mord">.</span><span class="mord mathnormal" style="margin-right:0.05017em;">B</span><span class="mord mathnormal">u</span><span class="mord mathnormal" style="margin-right:0.02691em;">tw</span><span class="mord mathnormal">h</span><span class="mord mathnormal">e</span><span class="mord mathnormal">n</span><span class="mord mathnormal" style="margin-right:0.07847em;">I</span><span class="mord mathnormal">a</span><span class="mord mathnormal">s</span><span class="mord mathnormal" style="margin-right:0.03148em;">k</span><span class="mord mathnormal">e</span><span class="mord mathnormal">d</span><span class="mord mathnormal">i</span><span class="mord mathnormal">tt</span><span class="mord mathnormal">o</span><span class="mord mathnormal" style="margin-right:0.10764em;">f</span><span class="mord mathnormal">i</span><span class="mord mathnormal">x</span><span class="mord mathnormal">ab</span><span class="mord mathnormal" style="margin-right:0.03588em;">ug</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord mathnormal">i</span><span class="mord mathnormal">t</span><span class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mord mathnormal" style="margin-right:0.03588em;">ugg</span><span class="mord mathnormal" style="margin-right:0.01968em;">l</span><span class="mord mathnormal">e</span><span class="mord mathnormal">d</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord mathnormal">s</span><span class="mord mathnormal">t</span><span class="mord mathnormal">a</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mord mathnormal">t</span><span class="mord mathnormal">e</span><span class="mord mathnormal">d</span><span class="mord mathnormal">t</span><span class="mord mathnormal">h</span><span class="mord mathnormal" style="margin-right:0.02778em;">r</span><span class="mord mathnormal">a</span><span class="mord mathnormal">s</span><span class="mord mathnormal">hin</span><span class="mord mathnormal" style="margin-right:0.03588em;">g</span><span class="mpunct">,</span><span class="mspace" style="margin-right:0.1667em;"></span><span class="mord mathnormal">an</span><span class="mord mathnormal">d</span><span class="mord mathnormal" style="margin-right:0.03588em;">q</span><span class="mord mathnormal">u</span><span class="mord mathnormal">i</span><span class="mord mathnormal">c</span><span class="mord mathnormal" style="margin-right:0.03148em;">k</span><span class="mord mathnormal" style="margin-right:0.01968em;">l</span><span class="mord mathnormal">ys</span><span class="mord mathnormal">h</span><span class="mord mathnormal">o</span><span class="mord mathnormal">t</span><span class="mord mathnormal">u</span><span class="mord mathnormal">pt</span><span class="mord mathnormal">o</span><span class="mord">4</span><span class="mord mathnormal">A</span><span class="mord mathnormal" style="margin-right:0.07153em;">C</span><span class="mord mathnormal" style="margin-right:0.10903em;">U</span><span class="mord mathnormal">s</span><span class="mord">.</span><span class="mord mathnormal" style="margin-right:0.10903em;">M</span><span class="mord mathnormal" style="margin-right:0.03588em;">y</span></span><span class="mspace newline"></span></span></span>20 evaporated in no time. Perhaps the best way to use it now is to have it generate a first draft, and then manually refine it or use Cursor. (Of course, now that Cursor has a background agent, the lines are blurring.)</p>
<p>For Devin (and now Cursor’s remote agent), there’s also the cost of vCPUs. For example, an m5.4xlarge (16C64G) on-demand is \$0.768/h. Compared to token costs, that’s actually not that expensive…</p>
<p>With agents becoming a hot topic, <strong>Devin</strong> is now being squeezed from all sides by Cursor, Claude Code, Codex, and others.</p>
<p>Devin’s current advantages lie in its integrations (you can assign tasks directly from Slack, Linear &#x26; Jira) and its high product polish (a well-designed knowledge base and playbook system). But can this “dirty work” justify its valuation and become a moat? Intuitively, these seem like features any good agent must have. It feels like the agent space requires a massive amount of time to polish the experience, but the capital seems to be in too much of a hurry.</p>
<p>Their latest <a href="https://cognition.ai/blog/devin-2-1">Confidence rating</a> feature is excellent, as it can prevent users from burning money on a pile of garbage due to overly high expectations. This is another interesting aspect of agents: if you use them incorrectly, the results will be poor and expensive. To put it another way, a good programmer or contractor doesn’t just do what you say; they try to understand your intent, why you want to do it, and what the potential pitfalls are.</p>
<p>Their DeepWiki also feels like a flex, possibly showcasing their technical accumulation in agent technology. After all, they are a team that raised huge funds from the start to self-develop large models, aiming for massive context windows. Perhaps they have a lot of GPUs and a cost advantage.</p>
<p>While writing this, I saw a new platform called <a href="https://x.com/FactoryAI/status/1927754706014630357">Factory</a>, which also seems to be challenging Devin. Its release announcement sounds almost too good to be true: “Factory integrates with your entire engineering system (GitHub, Slack, Linear, Notion, Sentry) and serves that context to your Droids as they autonomously build production-ready software.” But upon closer inspection, this company was founded even before Devin. An interesting detail in their demo video is that all integrations redirect you back to their Factory page (e.g., you @-mention it in Slack, and it gives you a link). The experience is that you do everything from their portal, pulling in context from Linear, GitHub, and Slack. (To use an imperfect analogy, it looks a bit like the Manus of the coding world.) In contrast, Devin lets you interact with it directly in Slack and Linear, which is more in-context and in-flow. But anyway, competition is a good thing.</p>
<h3 id="v0">v0</h3>
<p>The tools discussed so far are mostly designed for engineers (whether fully or semi-automated). Now let’s talk about platforms geared more towards “non-coders” or “product” people.</p>
<p><strong>v0</strong> is a niche within a niche in the coding vertical, focusing on front-end UI prototyping. You can think of it as a Figma driven by natural language, where you can “draw” interfaces directly in v0. Another clever aspect is its use of React/shadcn UI’s component-based nature, meaning the generated code can be directly integrated into your own projects, making it actually usable.</p>
<p>Vercel has always been a company with great “taste.” Leveraging their deep expertise in the front-end world, they’ve made the experience of this niche product, v0, excellent. But one can imagine that behind v0’s smooth experience lies a ton of engineering optimization, such as using templates, specially fine-tuned models, and a meticulously designed workflow to ensure quality output.</p>
<p>An interesting development is their recent <a href="https://vercel.com/blog/v0-composite-model-family">release of their own model family</a> and opening up its API. Their explanation is: “Frontier models also have little reason to focus on goals unique to building web applications like fixing errors automatically or editing code quickly. You end up needing to prompt them through every change, even for small corrections.” This is very reasonable, but is it just polishing the details? Of course, to deliver a usable product, such polishing is essential. But I don’t quite understand why they released an API. Perhaps it’s to recoup the cost of model training on one hand, and on the other, to start exploring how to become a “dispatchable agent” themselves.</p>
<p>But it feels like they aren’t content with just doing UI. Their positioning is now a “Full stack vibe coding platform.” They are also working on GitHub sync and other integrations with existing codebases, moving beyond just generating things on the v0 platform.</p>
<h3 id="bolt--replit--lovable-idea-to-app-vibe-coding-platforms">Bolt / Replit / Lovable: “Idea to App” Vibe Coding Platforms</h3>
<p>These types of products are, to some extent, variations on a theme. They are all end-to-end, full-stack platforms, or app builders, with a catchier name: <strong>“idea to app”</strong> platforms.</p>
<p>Compared to Cursor, the pain points they solve are deployment (including front-end, back-end, and database) and a smoother “vibe coding” experience. The thinking is: if I’m not going to look at the code generated in Cursor anyway, why even show me a code diff? A chat-to-live-preview experience is more direct. They also likely use project templates to make the initial “prompt to app” experience feel amazing.</p>
<p>While their target audiences might differ slightly—perhaps developers prefer Bolt, and non-developers prefer Lovable (pure speculation)—they are essentially doing the same thing: enabling users to build a usable product with close to zero manual coding.</p>
<h4 id="the-dilemma-of-vibe-coding-platforms">The Dilemma of Vibe Coding Platforms</h4>
<p>The tricky part is that if their goal is to deliver a final product to the user, user expectations will be very high. In serious scenarios, users often need very specific modifications, and letting an AI handle everything might not achieve the desired result, not to mention it’s expensive. When I was using Cursor to whip up a front-end, adding features was a breeze, but when I wanted to fine-tune a button’s position, layout, or interaction logic, it often got it wrong.</p>
<p>Although some vibe coding platforms offer an online code editor, when it comes to fine-grained control, people who can code will likely go back to Cursor, because it’s the most comfortable tool for the job. But once they’re back in Cursor, there might be no reason to return to the vibe coding platform. The pain of deployment is a one-time thing; once CI/CD is set up, you just push your code changes.</p>
<p>For detailed development, Cursor’s agent can probably provide more precise context. These vibe coding platforms could also enhance their own coding agent capabilities, but they have too much on their plate. Building out a full platform takes a lot of effort, and their technical accumulation in coding is surely no match for developer-focused platforms like Cursor.</p>
<p>In short, <strong>the ceiling for vibe coding platforms in serious, complex scenarios might be limited.</strong> They certainly have value for simple projects or demos, but how many users are willing to pay for that, I don’t know. This story has already played out with PaaS platforms like Vercel/Neon that focus on “developer experience”: everyone praises the experience, but once projects get large, many quietly migrate to AWS.</p>
<p>Looking at it from another angle, let me make a bold prediction: in the future, Cursor could very well build out a great vibe coding / app builder experience. They could make the initial screen a chat box, integrate live previews, and add Supabase/Vercel integrations. If that happens, these other platforms will be in even greater danger. After all, the concept of “vibe coding” originally took off on Cursor, and for people who want to build products, “seeing the code” might not be that big of a hurdle. I boldly predict Cursor might do this within a year.</p>
<p>Let’s also look at Lovable’s <a href="https://docs.lovable.dev/faq#what-is-the-difference-between-lovable-and-cursor">FAQ</a> where they compare themselves to other platforms/Cursor:</p>
<ul>
<li>Most of the points are vague claims like “just better,” “way more natural,” “Attention to detail.” This might be convincing for a regular product, but in the hyper-competitive AI coding space, it’s incredibly hard to stay ahead.</li>
<li>They have a visual editor, which is quite interesting. It allows for WYSIWYG editing of UI elements, which could partially solve the fine-tuning problem I mentioned. I tried it, but it’s still quite basic, only allowing changes to text content, font size, margins, etc. It doesn’t support features like drag-and-drop. The long-term vision for this is compelling—it could even take on Figma—but the technical difficulty seems immense. (It reminds me that we don’t even have a truly good visual editor for Mermaid diagrams yet.)</li>
</ul>
<h3 id="youware-a-radical-experiment-in-user-generated-software">YouWare: A Radical Experiment in User-Generated Software</h3>
<p>The truly exciting thing about AI coding is its demonstration of the ability to “dispatch compute with natural language.” This empowers ordinary people to use code as a tool to solve their own previously unmet needs. An era of <strong>User-Generated Software (UGS)</strong> is dawning.</p>
<p>Among all the products, <strong>YouWare</strong> seems to be a platform built precisely for this purpose, making UGS its sole mission.</p>
<h4 id="is-turning-ai-coding-into-a-content-community-the-right-move">Is turning AI coding into a content community the right move?</h4>
<p>Initially, I was cautiously pessimistic about YouWare.</p>
<p>It felt like they were trying to force the UGC (User-Generated Content) playbook (community, traffic, platform) onto UGS. If they’re building a new content platform, they’re competing for attention with TikTok and Instagram, but it doesn’t seem as “scrollable.” The demand for personalized entertainment has been thoroughly met by short videos. (…or has it? As I say this, I suddenly feel that short videos aren’t always that great, and I often struggle to find games that match my preferences.)</p>
<p>My initial thought was: the potential of UGS lies in satisfying the massive long tail of unmet tool-based needs. Users don’t lack motivation; they lack the ability. If they are solving their own pain points, they will leave after the job is done. They won’t necessarily have the desire to share or distribute their creations (or posting on Twitter/Instagram is enough), and they certainly won’t be “scrolling” through a tool website for fun.</p>
<p>YouWare believes that many people don’t know what they can create, so a platform is needed to spark their imagination and creativity. Social elements play the role of inspiration here.</p>
<p>Platforms like v0 and Lovable, while claiming to be accessible to beginners and having some community features, still show users the code, pop up build errors, and ask you to connect to Supabase. Their assumed user is still a “professional” with some technical background (like a product manager or designer). For example: “Lovable provides product managers, designers, and engineers with a shared workspace to build high-fidelity apps, collaborate effectively, and streamline the path to production-ready code.”</p>
<p>YouWare’s radical approach is that it <strong>completely hides the code from the user</strong>. Its target non-coder is the general public.</p>
<p>This is a bit like how Instagram limited the length of text in posts. By imposing a constraint, it maximized usability for its target audience. For someone who knows nothing about technology, seeing a build error is a dead end. In YouWare, that dead end is hidden.</p>
<p>Regarding the difference between tool needs and entertainment needs, Instagram can also be seen as a tool for users to document their lives, and its popularity is largely due to its “usefulness.”</p>
<p>After trying YouWare myself (<a href="https://www.youware.com/profile/uNYPe0WjpUVfW21IOleyYTlMIWf1">my creations</a>), I noticed some interesting things:</p>
<ul>
<li>
<p>It’s genuinely a bit addictive (and the free credits are very important). For example, if I have a random idea, I’m tempted to just throw it on there and see what happens. If I were using another platform for a serious project, I’d have to think it through more carefully. (My mental model includes the cost of debugging, etc., because I want something that actually works. In terms of mental burden, YouWare &#x3C; Lovable &#x3C; Cursor, but the utility is probably the reverse). This feeling is very similar to using Cursor’s background agent—“Let’s just run it and see, what’s there to lose?”</p>
</li>
<li>
<p>It truly hides the code details, including failures. When I tried Lovable, the initial generation often resulted in an error (though it was fixed with a click), whereas YouWare never did.</p>
<p><img src="/assets/img/ai-coding/image2.png" alt="image2.png"></p>
</li>
<li>
<p>It encourages “play.” YouWare’s Remix and Boost features are also interesting (regardless of their effectiveness for now). They align well with the premise that “users don’t know what they want to build,” encouraging exploration and re-creation.</p>
<ul>
<li>
<p>But then I realized many platforms have this now, even Claude’s Artifacts have a similar feature, and it’s surprisingly polished.</p>
<p><img src="/assets/img/ai-coding/image3.png" alt="image3.png"></p>
<p><img src="/assets/img/ai-coding/image4.png" alt="image4.png"></p>
</li>
</ul>
</li>
</ul>
<h4 id="a-bunch-of-scattered-thoughts-on-youware">A Bunch of Scattered Thoughts on YouWare</h4>
<ul>
<li>
<p><strong>Who are Vibe Coders?</strong> The UGC era gave rise to professional “creators.” Today’s “vibe coders” are somewhat similar. But content creators’ income mainly comes from traffic and brand deals, whereas vibe coders are closer to indie developers. They want to build their own products and make money by selling software or subscriptions. Selling software ultimately depends on solving real needs and promoting it on various platforms, not waiting for someone to stumble upon you on a UGS platform (e.g., you’d promote on Instagram, not wait for someone to find you on GitHub).
…Thinking about this, a wild idea popped into my head: if you were really going to do this, shouldn’t you build an <strong>OnlyFans for vibe coders</strong>, rather than a YouTube/Instagram? 🤣</p>
</li>
<li>
<p><strong>Code does have entertainment value</strong> (there’s a thing called creative coding)… but again, entertainment demand competes for attention. A niche use case is turning articles into interactive websites for educational purposes, like these:</p>
<ul>
<li><a href="https://ciechanow.ski/bicycle/">https://ciechanow.ski/bicycle/</a></li>
<li><a href="https://garden.bradwoods.io/">https://garden.bradwoods.io/</a></li>
<li><a href="https://encore.dev/blog/queueing">https://encore.dev/blog/queueing</a></li>
<li><a href="https://www.redblobgames.com/pathfinding/a-star/introduction.html">https://www.redblobgames.com/pathfinding/a-star/introduction.html</a></li>
</ul>
</li>
<li>
<p><strong>Power Users vs. Novice Users:</strong> The needs of these two groups are contradictory, and it’s hard for one platform to satisfy both. YouWare has clearly chosen the latter.</p>
</li>
<li>
<p><strong>Limitations of the Output Format:</strong> Why are the final outputs of these coding platforms (including Devin, Lovable, etc.) mostly websites? For many small utility needs, a command-line tool or a desktop app might be more direct and efficient. Of course, from a UX perspective, websites are the most user-friendly for the general public.</p>
</li>
<li>
<p><strong>Cost Issues:</strong></p>
<ul>
<li>As a content platform, there are significant compliance risks and costs. But maybe it’s not that hard, given that even DeepSeek can operate in China.</li>
<li>The cost of hosting websites. Different types of websites may have different computational needs, and popular projects might require dynamic scaling.</li>
<li>The massive compute cost of agents. Unlike UGC, where the platform has little cost when users create content, UGS is different. Compared to Amp, which says its optimization goal is maximum utility, YouWare’s accounting is much more complex. There’s a huge trade-off between generation quality and cost.</li>
<li>This leads to a core question: if it encourages user creation, what is its business model? If it follows the traditional traffic-and-ads model of content platforms, given the huge costs, the profit ceiling is likely not high.</li>
</ul>
</li>
<li>
<p><strong>Should it optimize for specific scenarios?</strong></p>
<ul>
<li>For example, maybe half the users on the platform are using it to write reports. But that’s really a DeepResearch-type function, and the results in YouWare would be mediocre. Manus/Flowith would probably optimize for this (Manus recently even specialized in a slides feature, which left me a bit speechless—so much for a general-purpose agent).</li>
</ul>
</li>
<li>
<p><strong>Data-driven platform evolution?</strong></p>
<ul>
<li>I was initially puzzled why YouWare (and Manus, etc.) would heavily invest in traffic acquisition and promotion while their capabilities were still incomplete, instead of polishing the product first. Perhaps they have secured enough funding and are in a rush to expand.</li>
<li>But launching before the product is mature can help them understand what users actually want to build, and then optimize accordingly. I may have underestimated the role of social interaction in sparking user creativity. This could be like an evolutionary algorithm, or the idea that “greatness cannot be planned”: let users explore freely, and you might see unexpected innovations emerge. The YouWare team’s background at ByteDance suggests they will likely follow a data-driven decision-making process, letting user behavior guide the platform’s evolution. Perhaps they will stumble upon a breakthrough along the way.</li>
</ul>
</li>
</ul>
<h4 id="the-future-of-youware">The Future of YouWare</h4>
<p>I believe every company has its DNA. The founder of YouWare, a former PM from ByteDance’s CapCut team, is perhaps the only one who could have come up with something like this.</p>
<p>While many of the things I analyzed above might see Lovable moving towards YouWare’s direction (hiding more code) or YouWare moving towards a standard agent platform (increasing utility), I’m excited to see the outcome. I think YouWare’s current form is not its final form. At the same time, I increasingly find YouWare’s starting point fascinating, and it might just create something different. This team might understand creation, platforms, and consumers better than the coding folks, and understand AI coding better than the creator folks.</p>
<p>YouWare’s goal isn’t to maximize utility, but to <strong>unleash the creativity of ordinary people</strong>. Of course, the utility has to be at least good enough.</p>
<p>A harsh question is: as more and more people learn to use Cursor, will it eat up the market for these “dummy” tools? Perhaps it will be like how professional photographers with cameras and ordinary people with phone cameras coexist; programmers and vibe coders will coexist. Another thought I’ve been having recently is that current AI is exacerbating the Matthew effect (perhaps starting with the \$200 subscriptions). The gap between those who know how to use AI well and can afford it (I’ve seen people burn hundreds of dollars a day on Cursor) and the average person will widen. Will those less inclined to think critically, who can’t articulate their needs clearly, be “left behind”? That future is too cruel for me to imagine, and I’d rather join the resistance against that trend. From this perspective, I find attempts like YouWare, dedicated to serving the broad public, very valuable.</p>
<p>Of course, while YouWare is full of ideas, whether that vision can be successfully translated into a viable product and achieve commercial success remains uncertain.</p>
<h2 id="big-picture-industry-landscape--technical-directions">Big Picture: Industry Landscape &#x26; Technical Directions</h2>
<p>After examining the players at the table one by one, let’s take a step back and look at the entire AI coding landscape.</p>
<h3 id="market-segmentation">Market Segmentation</h3>
<p>AI coding can be broken down into several sub-fields. A single product might span multiple areas:</p>
<ul>
<li><strong>AI-assisted Coding:</strong> Represented by <strong>Cursor</strong> and <strong>GitHub Copilot</strong>, these are “enhancers” for existing development workflows, aimed at making professional developers faster and more productive.</li>
<li><strong>End-to-end Agents:</strong> Represented by <strong>Devin</strong>, <strong>Claude Code</strong>, and <strong>Amp</strong>, their goal is to become “junior engineers” who can complete tasks independently, elevating developers from executors to task assigners and reviewers. Agents can also be collaborators, especially CLI-based agents like Claude Code, with whom I can either pair program or delegate work.
A thought leader in a <a href="https://youtu.be/FzbkAy0DcQk?si=caXCcvDsm2tUbeTP">video</a> predicted that by Q3 2025, the consensus in Silicon Valley will be that Agents can reach or even replace mid-level software engineers. The comments section was mostly skeptical. My take is that Agents might not “replace” them entirely, but they are very likely to become powerful “partners” for mid-level engineers. Understood from this angle, I think the prediction is quite reasonable.</li>
<li><strong>Vibe Coding / UGS:</strong> Represented by <strong>v0</strong> and <strong>YouWare</strong>, these tools attempt to give the power of code to non-developers, allowing them to create applications and tools through natural language. One is more geared towards “product prototyping,” while the other takes a more radical step towards a “content community.”</li>
</ul>
<h3 id="the-awkward-half-baked-state-of-affairs">The Awkward “Half-Baked” State of Affairs</h3>
<p>We have to admit a reality: <strong>Agents are still a “half-baked” product.</strong> Their performance is not yet good enough to deliver a perfect result end-to-end, and sometimes it’s less hassle to just do it ourselves (like manually adjusting a button).</p>
<p>But we can also clearly see the evolutionary path of agents: from manually copying and pasting in ChatGPT, to single-turn conversations in an IDE, to today’s Cursor Background Agent and Claude Code. <strong>The length of time an agent can work independently is increasing, and the quantity and quality of its work are improving. This is an irreversible trend.</strong></p>
<p>Perhaps we should adopt a different mindset: think of it as an outsourced contractor. You assign it a task, let it work for a while, and then you come in to review and give feedback, rather than expecting it to get everything right in one go. This is no different from how we collaborate with human contractors (who are, in a sense, “Agents”).</p>
<h4 id="the-curse-of-cost-and-the-bet-on-models">The Curse of Cost, and the Bet on Models</h4>
<p>At the same time, Agents are very expensive. This not only discourages users from large-scale adoption but also puts agent application companies in a dilemma: should they continue to pursue performance at any cost, or should they turn to various “tricks” and “polishing” to reduce costs and improve efficiency? But there is a trade-off between performance and cost. I don’t know if it’s possible to have both, for example, by having one part of the team focus on performance and another on cost optimization. If cost control is completely ignored, the high price might scare away users. But are AI Agent companies really in such a hurry to acquire customers? Maybe not.</p>
<p>There’s a bigger variable at play here: if the upstream LLM providers drastically cut their prices, all the previous efforts in cost optimization, like painstakingly optimizing by 30-50%, could be rendered “wasted effort” by external factors. Of course, there’s also the possibility that the original providers’ optimizations are ineffective, or that they decide to develop their own Agent business. Therefore, for AI Agent startups, their decisions are filled with elements of a “gamble.”</p>
<h3 id="what-capabilities-does-an-agent-need-how-to-build-a-coding-agent">What Capabilities Does an Agent Need? How to Build a Coding Agent?</h3>
<p>From the explorations of various products, we can glimpse the capabilities a good Agent needs:</p>
<ul>
<li>
<p><strong>Memory/Knowledge Base:</strong> For example, the ability to automatically learn from <code>cursor.rule</code> files (Devin/Manus already have this).</p>
</li>
<li>
<p><strong>Long Context Capability:</strong> Indexing &#x26; RAG?</p>
<ul>
<li>I’m a bit skeptical about the effectiveness of this. Now that we’re in the Agent era, the agent can just <code>grep</code> the code to find context. This is very similar to my own development process. It’s still heavily reliant on string searching, which isn’t a very smart method. But <code>grep</code> is only useful when you know what to change. Vague questions like “how does xxx work?” are a different story.</li>
<li>But testing long context capability is very difficult; you need to use it very deeply to know its true level. I haven’t gotten a feel for it yet.</li>
</ul>
</li>
<li>
<p><strong>Task Management Capability:</strong>
I used to think an external to-do list was essential, but now it seems Claude is starting to internalize this capability — the model may output things like “Let me solve the problems one by one: 1. … 2. … 3. …” (though my gut feeling is that an external one is still better?).</p>
<!-- ![image5.png](/assets/img/ai-coding/image5.png) -->
</li>
<li>
<p><strong>Proactive Communication &#x26; Interaction:</strong> A good Agent shouldn’t just do what you say. It should be like a good contractor: it should ask clarifying questions, confirm intent, and assess risks (like Devin’s “confidence rating”). For example, if you say “I need to make a PowerPoint,” it should ask if you have existing materials or textbook resources to provide. Deep research products are also doing a good job with this.</p>
</li>
</ul>
<p>On that note, does building a good coding agent require you to be a great user of coding agents yourself?</p>
<h2 id="final-thoughts-our-relationship-with-ai">Final Thoughts: Our Relationship with AI</h2>
<p>The concepts of “natural language dispatching compute” and “User-Generated Software” may have somehow become an industry consensus, but their specific implementation is far from agreed upon.</p>
<p>After all this talk, let’s bring it back to ourselves.</p>
<h4 id="how-should-the-average-person-choose">How should the average person choose?</h4>
<p>In general, all tools are currently in a state of “still early, but already useful (if used correctly).” They perform well on small, simple tasks or for generating demos, but in complex scenarios, they heavily test the user’s <strong>“craft.”</strong></p>
<p>This “craft” includes both prompt engineering skills and an understanding of code and how agents work. “Knowing the boundaries of AI’s capabilities” is also a bit of a cliché by now. Therefore, the people who will use Agents best in the future will likely still be professionals. It’s like professional photographers versus casual phone photographers: the tools blur the lines between professions (e.g., engineers can do design, PMs can write demos), but ultimately, they raise the ceiling for experts.</p>
<p>Agents are likely something that gets better with use. Exploring best practices within a team, accumulating prompt techniques and a knowledge base—this is an investment in itself.</p>
<p>But I also often wonder if studying all this is futile. When model capabilities reach a certain singularity, we can just embrace the final form, and all the intermediate explorations and usage experiences will become obsolete. This might be true. There’s no point in arguing further, and I’m no longer going to force anyone to use AI. But I just can’t help playing with it. It’s fun! 😁🤪</p>
<h4 id="when-the-power-to-generate-becomes-infinite-what-should-we-generate-with-it">When the power to generate becomes infinite, what should we generate with it?</h4>
<p>A deeper question: What does the development of AI really have to do with me? It’s like how I don’t read many research papers; they feel distant. Although ChatGPT has made it much easier for me to learn anything—I’m constantly discussing things with it—I find myself more tired. Do I really need to know all this stuff?</p>
<p>The development of Coding Agents will allow me to write more and more code. Should I build all those things? When the power to generate becomes infinite, what should we actually generate with it?</p>
<p>Products like YouWare might be one answer.</p>
<p>Or perhaps, this is a non-existent problem, like asking what we should do after achieving controlled nuclear fusion. Will everyone get to pilot a Gundam?</p>]]></content><category term="AI Agent" /></entry><entry><title type="html">我对各种 AI Coding Agent 工具的看法</title><link href="https://xxchan.me/zh/blog/2025-06-08-ai-coding/" rel="alternate" type="text/html" title="我对各种 AI Coding Agent 工具的看法" /><id>https://xxchan.me/zh/blog/2025-06-08-ai-coding</id><published>2025-06-08T00:00:00+00:00</published><updated>2025-06-08T00:00:00+00:00</updated><author><name>xxchan</name></author><summary type="html"><![CDATA[Agentic coding 或许是当下最火（最卷）的方向，一万家公司在做。并且隔三差五就在社交媒体上看到又有什么新工具、谁家又出什么新功能了（又 blow 谁的 mind 了，又颠覆谁谁了）。这还挺让人困惑的，我发现很多人会问 “这些 AI coding 工具真有那么牛吗？”，或者 “XX 和 YY 到底有啥区别”。不少人自己试用了一下，感觉不过如此，于是迅速下头。同时，我还发现还有不少程序员连 Cursor 都没用过。]]></summary><content type="html" xml:base="https://xxchan.me/zh/blog/2025-06-08-ai-coding/"><![CDATA[<p>Agentic coding 或许是当下最火（最卷）的方向，一万家公司在做。并且隔三差五就在社交媒体上看到又有什么新工具、谁家又出什么新功能了（又 blow 谁的 mind 了，又颠覆谁谁了）。这还挺让人困惑的，我发现很多人会问 “这些 AI coding 工具真有那么牛吗？”，或者 “XX 和 YY 到底有啥区别”。不少人自己试用了一下，感觉不过如此，于是迅速下头。同时，我还发现还有不少程序员连 <strong>Cursor</strong> 都没用过。</p>
<p>我平时很喜欢把玩各种 agentic coding tool，因此忍不住想锐评一番。这个领域无疑充满了大量的 hype，但仔细看，还是能分辨出不同产品间的差异，甚至整个行业的发展方向。</p>
<p>Agent 能做什么不能做什么，以及如何用好它，这里面有很多**“手艺”**的成分。所以这事儿很难解释清楚，了解它们的最好方式还是得自己上手试。看再多别人的使用感受，都不如自己玩一把来得真切（但我还是忍不住想讲讲我的看法）。这篇文章，就是试图把我关于各种 AI coding 工具那些零散的观察和思考，整理成一篇比较长的文字。</p>
<h2 id="一些背景">一些背景</h2>
<p>总的来说，我很相信 “agent coding 能成” 这个未来。具体点说，我相信未来 AI agent 可以独立在一个大型项目中，端到端地完成复杂的开发任务（加功能、修 bug、重构）。</p>
<p>首先交代一下，我主要的工作是写开源流数据库 <a href="https://github.com/risingwavelabs/risingwave">RisingWave</a>，一个超过 60 万行代码的 Rust 项目，还算比较复杂。虽然一些上下文明确的小活儿，我已经习惯了让 AI 来干，但说实话，我暂时还没有大规模、严肃地用 AI coding 去做那些真正困难的开发任务。同时，我也没仔细想过未来模型的能力边界，以及实现 agent 的具体技术难点在哪。所以，这篇文章主要基于我的直觉，是对各个工具的感性分析。另外也不是一篇教你怎么用、怎么选的攻略。</p>
<p>不过给自己找补一下，我感觉之所以不敢大规模尝试，还是有原因的，主要还是**“穷人思维”**在作祟：Agent 还是太贵了！一个任务跑下来，随随便便就是 5 到 10 刀。这里可能存在一个杰文斯悖论：如果它变得更便宜，我反而会用得更多，最后花掉更多的钱……另外现在工具太多了，而要真正用出差异，可能得花上一周以上的时间去深度体验，但订阅和切换的成本又让人望而却步。</p>
<p>下面开始正题。我们先按工具逐一分析，最后再聊些更宏观的话题。</p>
<h2 id="具体产品分析">具体产品分析</h2>
<h3 id="cursor野心勃勃的领跑者"><strong>Cursor</strong>：野心勃勃的领跑者</h3>
<p><strong>Cursor</strong> 现在毋庸置疑是 AI Code Editor 这个赛道的老大哥。</p>
<h4 id="05010-版本里藏着的线索">0.50/1.0 版本里藏着的线索</h4>
<p>说起来，我动笔写这个，很大一个 trigger 是看了 <a href="https://www.cursor.com/changelog/0-50">cursor 0.50 的 changelog</a>（然而拖到今天他们 <a href="https://www.cursor.com/changelog/1-0">1.0</a> 都发了……），里面透露了很多有意思的点，有点暗示未来方向的意味：</p>
<ul>
<li>
<p><strong>Simpler, unified pricing</strong>
Cursor 之前的定价模式有点臭名昭著，它引入了一个定义模糊的“fast request”，不同模型还对应不同的数量。新版统一成了“Requests”（其实也没太大区别）。更重要的是很多人觉得一个月 20 刀很贵，我倒认为这一定价太低了，他们很可能在亏钱。按 request 计费本身就不太合理，尤其在 agent 时代，一个请求可能跑很久、烧很多 token。当然，这也可能是种**“健身房模式”**，让用量少或短对话的用户，来平衡高用量用户的成本。但另一个不合理之处在于，这会驱使它去优化 token 成本（比如压缩上下文），而用户想要的却是最大化的效果。</p>
</li>
<li>
<p><strong>Max mode</strong>
按照官方说法，“It’s ideal for your hardest problems”。在我看来，这有点吹牛。我的理解是，Max mode 就是不再精细化管理上下文，同时上了 token-based billing。在过去，模型长上下文能力不强时，精细控制或许能省钱且效果好（因为模型会被无效信息误导）。但现在模型能力提升太快，这种控制反而成了负优化。有趣的是，像 Roo Code 这样的开源 BYOK 方案，一直宣传的就是“Include full context for max performance”。所以 Cursor 这波操作，有点像开倒车，或者说是早期的优化成了现在的技术债。他们那句“If you’ve used any CLI-based coding tool, Max mode will feel like that - but right in Cursor”给人的感觉更微妙了。既然我可以用 CLI-based agent，为什么还要在 Cursor 里用一个要额外收 20% margin 的版本呢？</p>
</li>
<li>
<p><strong>Fast edits for long files with Agent:</strong>
这也是个有点像开倒车的改动。它给我的感觉是，开始使用基于文本的方法来直接应用大模型的输出。Cursor 之前一直吹嘘自己的 apply model，这事儿可能做得太早了。以前模型不够准，需要复杂的 apply 逻辑；以后模型越来越强，这种复杂性可能就没那么必要了。</p>
</li>
<li>
<p><strong>Background Agent &#x26; BugBot</strong>
总的来说 “Agent mode” 顶多算是辅助驾驶，真正的 Agent 是你能以更轻松的方式给他派活。Background Agent 是你派个活就不用管了，BugBot 是自动 code review。后面必定还会出例如在 GitHub 上 assign 个 issue 就开始干活了之类的功能，成为一个全能的合格牛马。</p>
<p>这个信号非常明确：<strong>Cursor 要和 Devin 硬碰硬了</strong>。这是个非常自然的方向，用过 Cursor agent mode 的人，很可能都想过能不能让它同时干两件事。在本地做这个有难度，放到云端就顺理成章了。</p>
<p><strong>Cursor vs Devin</strong>，有点像<strong>特斯拉 vs Waymo</strong>。后者一开始就直接做终极目标自动驾驶，前者则是发展成熟、用户规模大了以后逐渐转向更自动的方向。这条路的好处是用户期待会低一点，坏了可以自己动手改。依赖现有的其他做的好的体验还可以继续保持一定的用户黏性。相比之下，Devin 如果一开始的体验不及预期，用户很可能就流失了。（当然，对 pro user 来说，在本地 checkout 修改不是难事，但 Cursor 有大量相对小白的用户，为他们提供简单的 UIUX 也是一个点。）</p>
</li>
<li>
<p>还有一些 1.0 的小改进</p>
<ul>
<li>支持了 memory：我认为同样是所有 ai agent 的必备功能。</li>
<li>Richer Chat responses：支持了 mermaid，以及 markdown table 渲染。说明 chat 体验还是有东西卷的（提升一点用户粘性）</li>
<li>但总的来说 1.0 主要感觉是 marketing 为主的一个版本，并没有什么质变（相比之下 0.50 倒是更震惊我一点）</li>
</ul>
</li>
</ul>
<p>与 Cursor 的激进大动作相应的则是 <a href="https://techcrunch.com/2025/05/04/cursor-is-reportedly-raising-funds-at-9-billion-valuation-from-thrive-a16z-and-accel/">Anysphere, which makes Cursor, has reportedly raised <span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML"><semantics><mrow><mn>900</mn><mi>M</mi><mi>a</mi><mi>t</mi></mrow><annotation encoding="application/x-tex">900M at </annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.6833em;"></span><span class="mord">900</span><span class="mord mathnormal" style="margin-right:0.10903em;">M</span><span class="mord mathnormal">a</span><span class="mord mathnormal">t</span></span></span></span>9B valuation</a>。对应 OpenAI 想要收购 windsurf 的新闻，可见 Cursor 急切的想要一统江湖的野心。融了这么多钱，我猜他们下一步很可能就是训练自己的模型。除此以外，它也完全有可能会收购市场上的其他玩家，成为一个整合者的角色。</p>
<h4 id="回过头来说cursor-到底好在哪">回过头来说，Cursor 到底好在哪？</h4>
<p>其实我当初（2024/05）用 cursor 完全是为了它惊艳的 <strong>TAB 功能</strong>。在早期我几乎不用 AI chat，甚至能忍着很多非常影响体验的 editor bug 还要用。 相比 GitHub Copilot 的“append only”补全，想修改就得删了重来；Cursor 的生成“Edit”，帮你修改代码，显然是更“正确”的形态，而且准确率相当不错。它的补全还能在改完一处后，跳到后面同时修改多处，这在重构时极其有用。例如改一个类型签名的时候 IDE 不太能智能重构，要手动改很多地方，而 Cursor 解决了这个痛点。</p>
<p>就为了这个 TAB 功能，我心甘情愿地付了 20 刀。</p>
<p><img src="/assets/img/ai-coding/image.png" alt="image.png"></p>
<p>后来在我没意识到的时候 “Agent mode” 在 non-coder 中先火了。我才后知后觉地发现了 agent 的能力。（而且 Cursor 一直没涨价啊！所以现在在让用户逐渐适应 token based billing 了） 不知道这个火是不是偶然，因为在我看来其他的 AI IDE 或者 end-to-end 的 coding 平台或多或少都能做类似的事情，Cursor 现在在 Agent 上甚至是比较落后的。但或许是它做的早，抓住了时间窗口，在大众心里建立起了品牌。AI coding 平台的切换成本其实有点玄学，一方面真的要切的话并不难，体验没有质的差距，没有真正的壁垒；另一方面这个干活的东西，用顺手了也懒得换。</p>
<p>他们有一篇 <a href="https://www.cursor.com/blog/problems-2024">Our Problems</a>，看他之前画的饼其实都是 AI-assisted coding 的范畴，现在感觉在 agent 的时代稍微有点过时了。AI assisted coding 的 UX 感觉还是有很多可以做的事情的，但现在大力做 Agent 的话可能会没那么优先了。</p>
<p>所以，Cursor 的好在哪？它好在一种奇妙的组合拳上。它先用一个真正懂开发者的杀手级功能（那个无敌的 TAB Edit）抓住了最挑剔的核心用户，然后又敏锐地捕捉到了 Agent 的浪潮，在大众心中成功地将自己与“AI 编程”这个概念划上了等号，哪怕它的技术在现在并非最领先。这种**“硬核实力”<strong>和</strong>“抓风口能力”**的结合，再配上一点先发优势的“玄学”，最终成就了它现在的地位。</p>
<p>现在如果你不知道什么工具最适合自己，那 Cursor 可能是一个比较稳的选择：有充足资金，不一定是最强但肯定差不到哪去。</p>
<h4 id="cursor-的终局是什么">Cursor 的终局是什么？</h4>
<p>当初就有很多人说，Cursor 做的事为什么要 fork VS Code？我曾认为“为了 AI 特化的体验”是答案（例如 Cursor TAB）。但现在，VS Code、 <a href="https://www.augmentcode.com/">Augment Code</a> 也在追赶，Cursor 自己反而没有做出更多让人眼前一亮的独特 UX。</p>
<p>我现在对这件事的判断是：<strong>Cursor 想做一个大而全、ALL-in-one 平台，占据开发者的入口</strong>。（GitHub Copilot 或许也想，但它还是不够快。）之前提的“我能在 CLI 里用 agent”，实际上是说 Agent 并不需要 IDE 就能工作。但我在自己浅浅用了一下 Cursor 的 background Agent 之后，发现这个体验很自然。很多东西不必做在 IDE 里，但反过来说，也不是不能做在 IDE 里。既然 IDE 是工程师每天花时间最多的地方，那为何不把所有 coding 相关的东西都塞进来，成为一个一站式的 hub？</p>
<p>至于其他的 AI code editor（windsurf/trae，以及开源的 cline/roo code），我感觉比较难与 Cursor 有一战之力。我的观点是，Agent 是大趋势，而做好 Agent 之后，对 AI-assisted coding 的依赖反而小了。当工程师需要自己写代码时，最终还是会回归到传统的 IDE 体验。这些工具虽然可能在某些体验上有优势（比如 windsurf 据说对复杂项目的上下文管理更智能），但普通用户没那个耐心去深度比较。在资本的冲刷下，这些微小差异可能会被逐渐抹平，甚至收购整合。做 Agent 就更是烧钱了。反倒是像 <strong>Zed</strong> 这种完全重头再来的 code editor，说不定可以搞出点新花样来。</p>
<h4 id="关于壁垒">关于“壁垒”</h4>
<p>Cursor 的创始人曾谈过他们对“壁垒”的看法：在这个发展过快，未来的想象空间也仍然很大的领域，<strong>壁垒的本质就是“快”</strong>。只要你够快，就能领先。反之无论你当前的技术有多强、产品体验有多好，一旦你在某个阶段慢下来，就可能被超越、被取代，非常残酷。</p>
<p>我在这个事情上没完全想明白。我曾经觉得靠“体验”是可以成为壁垒的。但或许那只是你做的事情不够大。如果足够大，那么巨头一定会出手自己做，然后用技术（模型）和资源能力比你做的更好。</p>
<h3 id="vs-codegithub-copilot">VS Code/GitHub Copilot</h3>
<p><strong>Copilot</strong> 绝对是里程碑，是第一个让人感觉“能用”的 AI coding 工具。但后来，它的体验逐渐被后起之秀超越。我猜测可能的理由包括：</p>
<ol>
<li>OpenAI/微软重心转移（比如微软大力搞 copilot for office）</li>
<li>毕竟微软是个巨厂，层层审批，Github Copilot 拿不到太多资源</li>
<li>Copilot 本身当初可能是想着做做看，做出效果以后也没想好再往后能怎么做，而且 coding 模型的发展缓慢（Codex 是 GPT-3 的一个 finetune 版本），后面专注提升基座能力去了，没人/资源专门训练 coding 特化模型</li>
<li>Copilot 用户（特别是 enterprise 用户）多了以后不好大刀阔斧地改体验，领先占据市场反而成了包袱</li>
<li>受限于 VS Code 的壳，不像 fork 的 AI IDE 可以乱改，要往主分支里塞 AI 相关的东西可能还是要掂量一下，特别是在当年 AI coding 还原非共识，有很多程序员反感 AI</li>
</ol>
<p>但是 VS Code 最近逐渐把功能慢慢都加上了。甚至还发了一篇有意思的宣言： <strong><a href="https://code.visualstudio.com/blogs/2025/05/19/openSourceAIEditor">VS Code: Open Source AI Editor</a></strong></p>
<p>长远看 <strong>VS Code 可能还是会重回巅峰</strong>。理由很简单：大厂认真起来是很吓人的（比如 gemini）。当 AI coding 成为共识，微软投入足够资源，体验差异很可能被逐渐抹平（比如 Cursor TAB 这种东西 Copilot 没理由不做），除非他们持续在“AI Editor 的 UX 创新”上整新活。但是目前看来并没有。更重要的一点是，既然 agent 不需要 IDE 就能工作，那么程序员自己写代码时，还是会回归到功能扎实、bug 更少的传统 IDE。这也是 Cursor 的一大弱点，它在 IDE 本身的迭代上，似乎总比 VS Code 慢半拍。</p>
<p>未来，VS Code 和 Cursor 两分天下，感觉也挺有可能。喜欢古典和喜欢大而全的人，各取所需。</p>
<h3 id="claude-code">Claude Code</h3>
<p>接着聊聊真正的 CLI-based agent。</p>
<p><a href="https://xxchan.me/ai/2025/05/06/claude-code.html">上次的文章分析过</a>，<strong>Claude Code</strong> 是个做的很用心的产品。它给了我一种“确实应该能 work”的感觉，以及第一次让我思考 agent 好像并不需要 IDE。</p>
<p>相比于 IDE 或者浏览器里的 agent，CLI-based agent 本质上没太大差距，最主要的区别可能就是对 prompt 和 tool 的设计。但它的优点是可以 iterate faster。因为能做的事情更少，反而可以专注在 agent 最本质的地方。因此正如上次的文章分析的，claude code 的 prompt 包括 tool spec 写的都非常的长。我自己使用下来的体感是感觉 claude code 明显要比 Cursor 更“聪明”一点。这只是因为 prompt 调教的水平吗？还是说 Claude Code 有特供的模型？（感觉暂时不太像，但未来不好说）</p>
<p>Claude Code 其实并不只能跑在自己本地的 terminal 里，现在已经可以在 GitHub 上 @它，然后自己干活了（跑在 CI 里）。但它的思路并非深度集成，而更像是利用 CLI 无限的可组合性（所以非常第一性原理做事？）。</p>
<p><img src="/assets/img/ai-coding/image1.png" alt="image1"></p>
<p>在过去这一个月里，Anthropic 又有一些明显的动作，让人感觉想要力推 claude code：</p>
<ul>
<li>在 Code with Claude 大会上发布了 Claude Code 1.0，以及 4.0 新模型</li>
<li>断供 windsurf</li>
<li>Claude 20 块的 pro plan 也可以用 claude code 了，大大降低试用门槛。</li>
</ul>
<p>最后一点让我果断订阅了 Pro Plan。我试了一下，在达到 usage limit 之前（几个小时后刷新），我让 Claude Code 跑了一个比较复杂的重构任务，大概持续了三四十分钟。这个用量如果按 API token 计费，少说也得 10 刀。这或许就是 <strong>LLM 原厂做 agent 的一个优势</strong>：反正机器已经在那里了，可以把闲时资源充分利用起来。而做应用的公司，又不可能去整租机器。</p>
<h4 id="anthropic-做-claude-code-的真实意图是什么">Anthropic 做 Claude Code 的真实意图是什么？</h4>
<p>我其实还没完全看懂，Anthropic 做 Claude Code 的最终目的到底是什么？是想做一个好用的产品，还是想用它帮助模型训练本身？OpenAI 现在明显在花力气做 ChatGPT 这个产品，未来的想法大概是把 ChatGPT 作为一个入口，让它成为一个调度型的 agent。那 Claude Code 在这件事上的定位又是什么？</p>
<p>这一方面涉及对 Coding 这个市场的规模到底有多大的判断。从 Cursor 一开始的估值来看，大家普遍认为也就那样——因为开发者群体的体量就那么大。但现在 Vibe Coder 起来以后，整个故事又被撑大了不少。</p>
<p>不过，回到 Anthropic 这么一家大模型公司，直接下场卷应用层的东西，是否有点“不体面”？或许他们的目的并不是要把市场上其他人都吃掉，而是带着一定的试验心态，看看这种东西到底能做成什么样子。但说起应用层，Claude App 里面其实也有很多非常漂亮的功能，比如它的 Artifact，体验明显比 ChatGPT 好很多，虽然 Claude App 整体上很挫。</p>
<p>当然，更有可能的目的还是<strong>通过用户使用它的产品来收集数据，最终用于训练模型</strong>。 因为像 Cursor 这种合作伙伴的用户行为数据，它可能是拿不到的。所以它得自己做一个完整的产品，把整个链条打通。而且，Cursor 里那些乱七八糟的功能它可能也不太需要，它更关注的是训练模型过程中，真正与 Coding 直接相关的部分。</p>
<h4 id="从聪明到持久的进化">从“聪明”到“持久”的进化</h4>
<p>说回模型训练，Claude Code 宣称能独立跑七个小时，给我的感觉是：现在模型的“聪明程度”短期内好像有点提不上去了，于是大家开始发力做**“长期任务执行”**（所谓 Agent）——让模型持续工作得更久、更自主，并且能用工具来辅助提升自己。</p>
<p>在使用中，能很明显地观察到模型的一些新行为：</p>
<ul>
<li>它会先说：“我接下来要做 123”，体现出任务规划能力；（我原来觉得需要外化的 TODO list，但它似乎在内化这个能力）</li>
<li>它会先写一个方案，然后写到一半突然说：“让我想一想有没有更简单的方式”，然后重头来过。</li>
</ul>
<p>这些行为看着其实还挺好笑的，但也清晰地揭示了往 agent 这条路上走。</p>
<h3 id="amp"><a href="https://ampcode.com/">Amp</a></h3>
<p>他们整体上给我一种很有“产品 sense”，“很懂 agent 应该怎么 work”的感觉。但其实就是 claude code - like。我能想到他们的优势是 move （slightly） faster（？）；有 sourcegraph 这个 code search &#x26; indexing 后端能力（真的有用吗？）；不和 claude 一家强绑定，在别的模型追上的时候可以切；另外他们毫不掩饰、充满原则性的产品哲学可能可以赢得一批用户的深度信赖。他们是这么说的：</p>
<blockquote>
<ul>
<li>Amp is unconstrained in token usage (and therefore cost). <strong>Our sole incentive is to make it valuable</strong>, not to match the cost of a subscription.</li>
<li><strong>No model selector, always the best models.</strong> You don’t pick models, we do. Instead of offering selectors and checkboxes and building for the lowest common denominator, Amp is built to use the full capabilities of the best models.</li>
<li>Built to change. <strong>Products that are overfit on the capabilities of today’s models will be obsolete in a matter of months.</strong></li>
</ul>
</blockquote>
<p>他们的 <a href="https://ampcode.com/fif">“<strong>Frequently Ignored Feedback</strong>”</a> 也很有意思（用户：我要 xxx；amp：不，你不要），体现出他们对 Agent 的深刻理解：</p>
<blockquote>
<ul>
<li>Requiring edit-by-edit approval traps you in a <strong>local maximum</strong> by impeding the agentic feedback loop. You’re not giving the agent a chance to iterate on its first draft through review, diagnostics, compiler output, and test execution. If you find that the agent rarely produces good enough code on its own, <strong>instead of trying to “micro-manage” it,</strong> we recommend writing <strong>more detailed prompts</strong> and improving your <strong><code>AGENT.md</code> files</strong>.</li>
<li>Making the costs salient will make devs use it less than they should. Customers tell us they don’t want their devs worrying about 10 cents here and there. We all know the dev who buys $5 coffee daily but won’t pay for a tool that improves their productivity.</li>
</ul>
</blockquote>
<p>非常 Opinionated，有点**“果味”**。</p>
<p>除此以外，他们还做了个 leader board &#x26; share thread 功能，很有意思，可以在团队内激起一些奇妙的火花。</p>
<p>但短期内有点谨慎不看好，因为 Claude Code 已经足够好用了，而且绑定 Claude 订阅有巨大的成本优势……Amp 目前的收费模式是完全 pass-through 按 token 收费（没有 margin）。那虽然他们不盈利，可能也不会太烧钱。可以拭目以待一下。</p>
<h3 id="openai-codex-in-chatgpt">OpenAI Codex （in ChatGPT）</h3>
<p>上个月，OpenAI 也发布了自己的全自动 coding agent。是完全符合我对 agent 的想象的产品形态。我之前就在想，为什么我不能在手机上给 Cursor 派活？现在通过 ChatGPT 就能实现了。</p>
<p>但要看懂这个动作，就不能只盯着 coding。虽然他们收购了 Windsurf，但我认为 <strong>OpenAI 的野心远不止在 coding 市场上分一杯羹，他们更想做的是让 ChatGPT 成为未来的调度入口，甚至是一个操作系统</strong>。 Codex 的目的，或许只是为了比较专业的“高价值用户”能做更多事情，从而提高用户粘性。而收购 Windsurf，看中的可能是他们对 long context 的管理能力和宝贵的用户数据，从而赋能模型能力提升。</p>
<p>偏题说一嘴，ChatGPT 的整体体验远超其他官方 AI app，比如说</p>
<ul>
<li>memory：有一种很神奇的感觉，但对我个人而言提供的“价值”似乎还没那么大，真有偏个人思考的问题我还是更愿意问没有 memory，甚至更难用的 gemini。</li>
<li>o3 的 web search 体验过于好。相当于 mini 版 deep research</li>
<li>虽然也不能说非常丝滑，还是时不时有点 bug，但还是比其他家好太多了。</li>
</ul>
<h3 id="devin">Devin</h3>
<p>当年在 AI coding 还没那么普及的时候他们就打着 “First AI Software Engineer” 的旗号，要做全自动 end-to-end。初次发布后 500 刀/月的高价也是让人望而却步。并且试过的人也说它笨。</p>
<p>现在变成 20 刀起订，pay as you go 以后我立马试了试。</p>
<p>给我整体的感觉是，模型智力水平一般般。但他们的产品整体上也是一种“基本上能 work”的感觉。我有一种强烈的预感，在经过适当的 prompt engineering 之后，它能工作得很好。他们现在的说法也是很实在：“<strong>Treat Devin like a junior engineer</strong>”。（其实任何 Agent 产品目前大概都是这个状态。）</p>
<p>这是我第一次真切地感受到 agent 烧钱的威力。我让它处理一个 issue，它可以自主探索出一个框架（花了 2 个 ACU，每个 2.25 美元）。但后面让他改 bug，就有点改不对了，开始乱撞，很快就飙到了 4 个 ACU，20 刀迅速蒸发。或许现在的最佳用法是，先用它生成一个初版，然后手工或用 Cursor 精修。（当然，现在 Cursor 也有了 background agent，界限开始模糊了。）</p>
<p>对 devin（包括现在 Curosr remote agent）来说，还有一笔 vCPU 的钱。例如 m5.4xlarge（16C64G）ondeman $0.768/h。其实相比 token 并不算很贵……</p>
<p>在 Agent 成为大热门之后，<strong>Devin</strong> 直接受 Cursor、claude code、Codex 等各个方向的夹击了。</p>
<p>Devin 目前的优势在于 integration（能直接在 Slack、Linear &#x26; Jira 上派活）和较高的产品完成度（设计良好的 knowledge base、playbook 系统）。但这种“脏活累活”能撑起它的估值，能成为壁垒吗？直觉上，这些是任何一个好的 agent 都必须具备的功能。感觉 agent 这个领域确实需要大量时间去打磨体验，但资本似乎太急了。</p>
<p>他们最新版又出了一个 <a href="https://cognition.ai/blog/devin-2-1">Confidence rating</a> 功能很不错，可以避免用户因过高预期而烧钱搞出一堆垃圾。其实这也是 agent 挺有意思的一个地方，你用的不对的话就会效果又差又烧钱。换个角度说，一个好的程序员或者乙方不应该你说什么就做什么，而是会试图理解你的意图，为什么你想做这个，以及有什么潜在的坑。</p>
<p>他们的 deepwiki 也有点像是秀肌肉，可能体现了他们在 agent 上的技术积累。毕竟，他们是一开始就融巨资自研大模型、奔着超大上下文去的团队。或许他们有很多的卡，在成本上也有优势。</p>
<p>在写这篇文章的时候又看到一个新的平台 <a href="https://x.com/FactoryAI/status/1927754706014630357">Factory</a>，看起来也是叫板 devin。它的 release 感觉 too good to be true：“Factory integrates with your entire engineering system (GitHub, Slack, Linear, Notion, Sentry) and serves that context to your Droids as they autonomously build production-ready software.”。但我仔细看了一家这家公司成立甚至比 devin 还早一点。他们的 demo 视频中，一个有意思的地方是他所有的 integration 都是要跳回到它 factory 的页面上的（比如在 slack 里@它，它给一个链接）。它的体验其实是你在它的 portal 上完成所有事情，拉取 linear、GitHub、slack 的 context。（说个不恰当的比喻，这看着有点像 coding 领域的 Manus。）相比之下 devin 是让你在 Slack、Linear 上直接和它交互，更加的 in-context，in-flow。但 anyways，有竞争是好事。</p>
<h3 id="v0">v0</h3>
<p>上面其实说的都是比较偏为 engineer 设计的工具（不管是全自动还是半自动），下面开始聊聊更偏 “non-coder” 或者 “product” 向的平台。</p>
<p><strong>v0</strong> 是 coding 垂类赛道中更垂的一个，更偏前端 UI prototype。你可以把它想象成一个用自然语言驱动的 Figma，直接在 v0 里就能把界面“画”出来。另外一个讨巧的地方是利用 React/shadcn UI 的组件化能力，它生成的东西直接能整合到自己的代码里，是个能用的东西。</p>
<p>Vercel 这家公司一直很讲究“品味”，他们凭借在前端领域的深厚积累，把 v0 这个垂类的体验做得非常好。但可以想见，v0 的流畅体验背后，肯定有大量的工程优化，比如套用模板、专门微调模型，以及一套精心设计的 workflow 来保证生成效果。</p>
<p>一个有意思的动向是他们最近<a href="https://vercel.com/blog/v0-composite-model-family">发布了自己的模型</a>，并且开放了 API。他们对此的解释是：“Frontier models also have little reason to focus on goals unique to building web applications like fixing errors automatically or editing code quickly. You end up needing to prompt them through every change, even for small corrections.” 非常合理，但是这是不是属于雕花？当然对于 deliver 一个好用的产品来说，雕花是必须的。但我有一点看不懂他们为啥要出 api，可能一方面是回收训模型的成本，一方面是开始探索让自己成为一个“被调度的 agent”。</p>
<p>但感觉他们并不满足于只做 UI，他现在的定位已经是 “Full stack vibe coding platform” 了，另外一方面他们也在做 GitHub sync 等和现有代码整合的工作，而不再是只能在 v0 平台上生成。</p>
<h3 id="bolt--replit--lovable想法到应用-vibe-coding-platform">Bolt / Replit / Lovable：“想法到应用” Vibe Coding platform</h3>
<p>这一类的产品，其实有点大同小异。它们都是端到端的全栈平台，或者叫 app builder，有个更好听的名字叫**“idea to app”**。</p>
<p>相比 Cursor，他们解决的痛点一是部署（包括前后端以及数据库），二是更丝滑的 vibe coding 体验：我在 Cursor 里生成的代码反正也不看，为什么还要展示 code diff？直接 chat - live preview 才是更直接的体验。另外它们应该有一定的项目模板成分，让首条 prompt to app 的体验感受非常好。</p>
<p>虽然它们各自定位可能略有不同，比如开发者可能更喜欢 Bolt，非开发者更喜欢 Lovable（纯瞎说），但本质上做的事情是一样的：让用户在接近零手动改代码的情况下，搞出一个能用的产品来。</p>
<h4 id="vibe-coding-平台的困境">Vibe Coding 平台的困境</h4>
<p>这个事情的 tricky 之处在于，如果他们的目标是 deliver 最终产品给用户，那用户的期待会很高。在比较严肃的场景下，用户往往需要非常具体的修改，全权让 AI 来改不一定能达到效果，而且还很费钱。我在用 Cursor 糊前端的时候，感觉加功能很爽，但想微调按钮位置、布局、交互逻辑时，它往往就改不对了。</p>
<p>虽然有些 vibe coding 平台也提供一定的 online code editor 能力，但真到了需要精细控制的时候，会写代码的人可能还是会回到 Cursor，因为那里最顺手。可一旦回到了 Cursor，后续的开发可能就没必要再回到 vibe coding 平台了。部署的痛点是一次性的，CI/CD 搞好之后，改完代码 push 一下就行。</p>
<p>精细开发的话，Cursor 的 agent 或许能提供更精确的 context。这些 vibe coding 的平台或许也可以把 coding agent 的能力都提上去，但是他们要做的事情太多了，把一个平台打造好得花很多精力。他们在 coding 的技术积累肯定是不如 Cursor 等 for developer 的平台。</p>
<p>简言之，<strong>vibe coding 平台在严肃、复杂场景下的上限可能不足。</strong> 如果只做简单的小项目或者 demo，价值肯定是有的，但有多少用户愿意为此买单，我就不懂了。这个故事，其实在 Vercel/Neon 这类主打“开发者体验”的 PaaS 平台上已经发生过：大家都说体验好，但等项目做大以后，很多人还是默默地迁移到了 AWS。</p>
<p>再换个角度，我大胆猜想一下：未来，Cursor 完全可以把 vibe coding / app builder 的体验也做好。开屏界面搞成一个对话框，同时把 live preview、Supabase/Vercel 整合等功能都做了，到时这些平台就更危了。更何况，vibe coding 这个概念本来就是在 Cursor 上火起来的，对那些想 build product 的人来说，“看到代码”这件事或许并不是多大的阻碍。我大胆预测，一年后 Cursor 可能就会这么做。</p>
<p>也可以看看 Lovable 的 <a href="https://docs.lovable.dev/faq#what-is-the-difference-between-lovable-and-cursor">FAQ</a> 里自己写的和其他平台/Cursor 的比较：</p>
<ul>
<li>大部分的点都是 “just better”，“way more natural”，“Attention to detail”，比较虚的。在普通的产品上或许有说服力，但在 AI coding 竞争这么激烈的领域，想保持领先太难了。</li>
<li>他们有个 visual editor 其实挺有意思，可以直接所见即所得地修改 UI 元素，一定程度可以解决之前说的微调麻烦的问题。但我试了一下，目前效果还比较一般，只能改改字的内容、字号、margin 之类的，并不能实现拖拽等功能。这个故事长远看也很好听 - 甚至可以吃掉 figma？但是感觉技术难度极其大。（让我想到现在连个真正好用的 mermaid 图 visual editor 都没有）</li>
</ul>
<h3 id="youwareuser-generated-software-的激进实验">YouWare：User Generated Software 的激进实验</h3>
<p>AI coding 真正让人兴奋的地方，在于它所展现的“自然语言调度算力”的能力。这让普通人能使用代码这个工具去解决他们自己的之前无法被满足的需求：一个 <strong>User Generated Software (UGS)</strong> 的时代，正在到来。</p>
<p>在所有产品中，<strong>YouWare</strong> 仿佛是一个精准为此而生的平台，它把 UGS 作为了自己唯一的目标。</p>
<h4 id="把-ai-coding-做成内容社区这对吗">把 AI coding 做成内容社区，这对吗？</h4>
<p>我一开始对 YouWare 谨慎不看好。</p>
<p>它给我的感觉，是想把 UGC 时代那套（社区、流量、平台）的想法，生搬硬套到 UGS 上来。如果他做一个新的内容平台，是要和抖音、小红书竞争注意力的，但感觉不如他们好刷。个性化的娱乐需求已经被短视频充分满足了。（……吗？在我说完这句话之后，又突然感觉短视频还是没那么好刷，也总觉得也总找不到符合我偏好的游戏。）</p>
<p>我最初的想法是：UGS 的潜力在于满足海量的、未被满足的长尾工具需求。用户不缺动机，只缺能力。如果是为了解决自己的痛点，那用户干完活就走了，不一定有分享或分发的欲望（或者在 Twitter/小红书上发发就够了），更不会没事干去一个工具网站上“刷”来“刷”去。</p>
<p>YouWare 认为许多人并不知道自己可以做什么，因此需要一个平台来激发他们的思考和创造欲，社交元素在此便扮演了激发灵感的角色。</p>
<p>v0、Lovable 这些平台，虽然也号称小白可用，也做一点社区，但它们仍然会把代码展示给用户，会弹出 build error，会让你去连接 Supabase。它们的假设用户，依然是有一定技术背景的“专业人士”（如产品经理、设计师）。例如这段：“Lovable provides product managers, designers, and engineers with a shared workspace to build high-fidelity apps, collaborate effectively, and streamline the path to production-ready code.”</p>
<p>而 YouWare 的激进之处在于，它<strong>完全不给用户看代码</strong>。它面向的 non-coder 是更广泛的普通人。</p>
<p>这有点像小红书限制图文的字数，通过一种限制，反而最大化了目标用户的可用性。对于一个完全不懂技术的人来说，看到 build error 意味着终点，而在 YouWare 里，这个终点被隐藏了。</p>
<p>上面说工具需求和娱乐需求的区别，其实小红书也可以被看作是一个用户记录的工具，而且小红书火起来很大程度上是它“有用”。</p>
<p>在我自己试用过 YouWare 之后（<a href="https://www.youware.com/profile/uNYPe0WjpUVfW21IOleyYTlMIWf1">我生成的东西</a>），感受到了一些有趣的点</p>
<ul>
<li>
<p>确实有点毒性（以及免费额度非常重要）。比如我会有个脑洞就想扔上去看看行不行。如果用其他的平台搞正经项目的话我会更要掂量一下再做。（我心里预期包含了 debug 成本等，毕竟我是想要一个真的能用的东西。在 mental burden 上，YouWare &#x3C; Lovable &#x3C; Cursor，但有用性可能相反）。这种感觉和我用 cursor 的 background agent 时很像，都是“跑跑看，反正不亏”。</p>
</li>
<li>
<p>它真的隐藏了代码细节，包括失败。Lovable 在我试用的时候初次生成报错的概率还是挺大的（虽然点一下也就修了），而 YouWare 没出现过。</p>
<p><img src="/assets/img/ai-coding/image2.png" alt="image2.png"></p>
</li>
<li>
<p>它鼓励“玩耍”。YouWare 的 Remix 和 Boost 功能也挺有意思的（先不谈效果好不好）。很符合“用户并不知道他想 build 什么东西”的出发点，鼓励探索和再创作。</p>
<ul>
<li>
<p>但突然发现这东西很多家都有了，甚至连 claude artifact 都做了类似的功能，而且完成度高得惊人。）</p>
<p><img src="/assets/img/ai-coding/image3.png" alt="image3.png"></p>
<p><img src="/assets/img/ai-coding/image4.png" alt="image4.png"></p>
</li>
</ul>
</li>
</ul>
<h4 id="一堆关于-youware-的零散思考">一堆关于 YouWare 的零散思考</h4>
<ul>
<li>
<p><strong>Vibe Coder 是什么样的人？</strong> UGC 时代出现了一个新东西叫专业“创作者”，现在的“vibe coder”倒是有点像。但内容创作者的收入主要靠流量和商单，而 vibe coder 更接近独立开发者，他们想的是 build 自己的产品，然后靠卖软件或订阅赚钱。卖软件终究要靠解决实际需求，然后去各个平台推广，而不是等着别人在你的 UGS 平台上刷到你（例如去发小红书而不是等人在 GitHub 上刷到你）。。
……想到这里，我开了个脑洞：真要做的话，岂不是应该做 <strong>vibe coder 的 OnlyFans</strong>，而不是 YouTube/Instagram？🤣</p>
</li>
<li>
<p><strong>代码确实有娱乐需求</strong>（有个东西叫创意编程）…但还是那句话，娱乐需求是要竞争注意力的。再其中的一个小用法是把文章变成交互式网站，满足教育学习的需求，比如这些：</p>
<ul>
<li><a href="https://ciechanow.ski/bicycle/">https://ciechanow.ski/bicycle/</a></li>
<li><a href="https://garden.bradwoods.io/">https://garden.bradwoods.io/</a></li>
<li><a href="https://encore.dev/blog/queueing">https://encore.dev/blog/queueing</a></li>
<li><a href="https://www.redblobgames.com/pathfinding/a-star/introduction.html">https://www.redblobgames.com/pathfinding/a-star/introduction.html</a></li>
</ul>
</li>
<li>
<p><strong>Power User vs. 小白用户：</strong> 这两者的需求是矛盾的，一个平台很难同时满足。YouWare 显然选择了后者。</p>
</li>
<li>
<p><strong>输出形式的局限：</strong> 为什么目前这类 coding 平台（包括 Devin、Lovable 等）的最终产出大多是网站？对于许多小型工具性需求，命令行或桌面应用或许更直接、更高效。当然，从 UX 角度看，网站对普通用户最友好。</p>
</li>
<li>
<p><strong>成本问题</strong></p>
<ul>
<li>作为内容平台，有很大的合规风险和成本问题。但可能也没那么难，毕竟 deepseek 都能在国内上了。</li>
<li>host 网站的成本问题。以及不同形式的网站可能有不同的计算需求，对热门项目可能还得动态 scale。</li>
<li>Agent 的巨大算力成本。相比 UGC，用户生产内容时其实平台没什么成本，但 UGS 则不一样。相比 Amp 说我的优化目标就是最大效用，这里 YouWare 的账就更难算了，这里有很大的生成效果和成本之间的 tradeoff 要做。这就引到一个核心问题是它鼓励用户创造，那盈利模式是什么？如果沿用传统平台的流量广告模式，考虑到巨大的成本，盈利上限恐怕不高。</li>
</ul>
</li>
<li>
<p><strong>是否要对特定场景优化？</strong></p>
<ul>
<li>例如现在平台上可能有过半用户会用来写报告什么的。但其实这是类 deepresearch 功能，在 YouWare 里做效果会很一般。Manus/flowith 倒是估计会优化（Manus 最近还真特化了 slides 功能，让我有点无语，说好的通用 Agent 最后还是做这种东西去了）。</li>
</ul>
</li>
<li>
<p><strong>数据驱动平台演化？</strong></p>
<ul>
<li>我一开始很困惑于为何 YouWare（包括 Manus 等）在能力尚不完善的阶段就大力买流量推广。而不是先将产品效果打磨得更好再推广。可能是他们已获得充足融资，急于扩张。</li>
<li>但在产品成熟前就推出，可以帮助他们了解用户到底想 build 什么，然后针对性地优化。我之前可能低估了社交对于激发用户创造力的作用。这可能类似于一种进化算法，或者“伟大无法被计划”的理念：让用户自由探索，或许能裂变出意想不到的创新。YouWare 团队的字节背景，想必会沿用数据驱动的决策方式，通过用户行为来让平台演化，或许做着做着就能发现奇妙的突破点。</li>
</ul>
</li>
</ul>
<h4 id="youware-的未来">YouWare 的未来</h4>
<p>我相信一家公司是有它的基因的。YouWare 的字节剪映 PM 创始人背景，或许才能想出这么个玩意儿。</p>
<p>虽然上面分析的很多东西可能 Lovable 会往 YouWare 的方向靠，更加隐藏代码；或者 YouWare 往普通的 Agent 平台上靠，提高 utility，但期待未来的结果。我觉得 YouWare 的形态未来一定不是现在这样。同时我越来越觉得 YouWare 的出发点很有意思，或许能做出一些不一样的事情。这个团队可能比做 coding 的人更懂创作、平台和消费者，比懂创作者的人更懂 AI coding。</p>
<p>YouWare 的目标并非最大化 utility，而是<strong>激发普通人的 creativity</strong>。当然 utility 也要至少 good enough。</p>
<p>一个残酷的问题是未来会 Cursor 的人越来越多了，会不会就吃掉这种傻瓜工具了？可能会想摄影师用相机和普通人用手机拍照可以共存一样，程序员和 vibe coder 共存。另一个想法是我最近越来越觉得，当前的 AI 正在加剧马太效应（或许从 200 刀订阅就开始了）。懂得如何用好 AI、并能负担得起开销的人（比如真见过人用 Cursor 一天消耗好几百刀），与普通人的差距会越来越大。对于那些不那么乐于动脑、需求表达不清的普通用户，他们会被“淘汰”吗？这个未来太残忍，我有点不愿设想，宁愿投身对抗潮流。从这个角度看，YouWare 这种致力于服务广大普通人的尝试让我觉得很有价值。</p>
<p>当然虽然 YouWare 很有想法。但认知能否成功转化为可落地的产品并实现商业价值，尚存不确定性。</p>
<h2 id="big-picture行业格局技术方向分析">Big picture：行业格局/技术方向分析</h2>
<p>在逐一审视了牌桌上的这些玩家之后，让我们向后退一步，看看整个 AI coding 领域的全景。</p>
<h3 id="赛道细分">赛道细分</h3>
<p>AI coding 还可以细分为几个小方向。一个产品可能会跨多个方向</p>
<ul>
<li>
<p><strong>AI-assisted Coding:</strong> 以 <strong>Cursor</strong> 和 <strong>GitHub Copilot</strong> 为代表，它们是现有开发工作流的“增强器”，致力于让专业开发者写代码更快、更爽。</p>
</li>
<li>
<p><strong>End-to-end Agent</strong> 以 <strong>Devin</strong>、<strong>Claude Code</strong> 和 <strong>Amp</strong> 为代表，它们的目标是成为能独立完成任务的“初级工程师”，将开发者从执行者提升为任务的分配者和审查者。Agent 同时也可能是作为合作者，特别是 Claude Code 这样 CLI based agent，我既可以和他 pair programming，也可以请他帮我干活。</p>
<p><a href="https://youtu.be/FzbkAy0DcQk?si=caXCcvDsm2tUbeTP">课代表在视频里</a>讲到他预测 2025 年 Q3，硅谷将形成共识，认为 Agent 可以达到甚至替代 mid-level software engineer 的水平。评论区对此多持怀疑态度。我的看法是，Agent 或许不一定会完全“替代”，但它极有可能成为 mid-level 工程师的得力“合作伙伴”。从这个角度理解，我认为其预测是相当有道理的。</p>
</li>
<li>
<p><strong>Vibe Coding / UGS:</strong> 以 <strong>v0</strong> 和 <strong>YouWare</strong> 为代表，它们试图将代码的能力赋予非开发者，让他们通过自然语言创造应用和工具，一个更偏向“产品原型”，一个则更激进地走向“内容社区”。</p>
</li>
</ul>
<h3 id="半成品的尴尬现状">“半成品”的尴尬现状</h3>
<p>我们必须承认一个现实：<strong>Agent 依然是一个“半成品”</strong>。它的效果还不足以真正端到端地交付一个完美的结果，有时甚至不如我们自己动手来得省事。（比如还是手动调 button 爽）</p>
<p>但我们也能清晰地看到 agent 的进化路径：从最早在 ChatGPT 里手动复制粘贴，到后来在 IDE 里进行单轮对话，再到如今的 Cursor Background Agent 和 Claude Code，<strong>Agent 能够独立工作的时间越来越长，做事的数量和质量都越来越高，这无疑是一个不可逆转的趋势。</strong></p>
<p>或许我们应该换个心态：把它想象成一个外包合作方。你把任务派给它，让它干一段时间，然后你来检查、给反馈，而不是指望它一次性搞定。这和我们与人类外包商（也就是“Agent”）的合作模式，并无二致。</p>
<h4 id="成本的诅咒与模型的赌局">成本的诅咒，与模型的赌局</h4>
<p>与此同时，Agent 是个非常贵的东西。这除了让用户不敢大规模使用之外，也让 agent 应用公司陷入两难：是继续不计成本地提升效果，还是转而研究各种“奇技淫巧”“雕花”以降本增效？但存在性能和成本的 tradeoff。我不知道是否可能同时兼顾两者，比如团队的一部分专注于性能提升，另一部分研究成本优化。如果完全不考虑成本控制，高昂的价格也可能会吓退用户。但 AI Agent 公司是否真的那么急于获客？或许也不然。</p>
<p>这里存在的一个更大的变数：如果上游的 LLM 厂商大幅降价，那么之前在成本优化上所做的努力，比如辛辛苦苦优化了 30%-50%，就可能因为外部因素而显得“白费功夫”。当然，也存在原厂优化不力，或者他们转而发展自家 Agent 业务的可能性。因此，对于 AI Agent 创业公司而言，其决策中充满了需要“赌”的成分。</p>
<h3 id="agent-需要哪些能力怎么做-coding-agent">Agent 需要哪些能力？怎么做 coding agent？</h3>
<p>从各个产品的探索中，我们可以窥见一个好的 Agent 需要具备哪些能力：</p>
<ul>
<li>
<p><strong>Memory/知识库</strong>：例如能自动学习 Cursor rule。（devin/manus 都有了）</p>
</li>
<li>
<p><strong>Long context 能力</strong>：indexing &#x26; RAG？</p>
<ul>
<li>我对这点的作用是有点存疑的。现在进入 Agent 时代之后，Agent 可以自己去 grep 代码找到 context。而且这和我自己开发的流程也很像。还是大量依赖字符串搜索，并不是什么聪明的办法。但其实 grep 仅限于知道要改什么的时候。“xxx 是怎么 work 的”这种模糊的问题就不行了。</li>
<li>但对 long context 的考验其实挺难验证的，需要用的很深才能知道到底什么水平。我也还没有用出感觉来。</li>
</ul>
</li>
<li>
<p><strong>Task 能力</strong>
之前我觉得必须需要外化的 todo list，但是现在好像 claude 开始内化这种能力了（但直觉上还是外部的更好？）</p>
<p><img src="/assets/img/ai-coding/image5.png" alt="image5.png"></p>
</li>
<li>
<p><strong>主动沟通与 Interaction</strong>:  一个好的 Agent 不应该你说什么就做什么。它应该像一个好的乙方，会反问、会澄清意图、会评估风险（比如 Devin 的“置信度评级”）。例如“我要做一个 ppt”，就问用户你有没有已经有的素材，或者课本资料提供等。deep research 类产品在这这事情上做的也不错。</p>
</li>
</ul>
<p>话说回来，做好 coding agent 是不是需要你自己用 coding agent 用的很好？</p>
<h2 id="最后的思考我们与-ai-的关系">最后的思考：我们与 AI 的关系</h2>
<p>那自然语言调度算力与 User-Generated Software 这个概念，可能 somehow 已成为行业共识，但其具体的实现形式，则远未达成一致。</p>
<p>聊了这么多，最后还是回到我们自己身上。</p>
<h4 id="普通人到底该怎么选">普通人到底该怎么选？</h4>
<p>总的来说，现在所有的工具都处于一个“still early, but already useful (if used correctly)”的阶段。它们在简单的小活儿或生成 demo 上表现不错，但在复杂场景下，则非常考验使用者的**“手艺”**。</p>
<p>这门“手艺”既包括 prompt engineering 的技巧，也包括对代码和 Agent 工作原理的理解。“了解 ai 能力边界”也是个有点说烂了的东西。所以，未来能把 Agent 用得最好的，大概率还是专业人士。这就像专业摄影师和普通人的手机拍照，工具模糊了专业间的边界（比如工程师可以搞设计，PM 可以写 demo），但最终还是拉开了上限。</p>
<p>Agent 可能是越用越好用的东西，需在团队里一起探索最佳实践、积累 prompt 技巧和知识库，本身就是一种投资。</p>
<p>但我也时常怀疑，研究这些东西会不会是徒劳？等到模型能力到达某个奇点，我们直接拥抱最终形态就行了，中间的各种探索和使用经验都会过时。这或许是对的。多说无益，我也不再想按着别人的头让他用 AI，but I just can’t help playing with it, it’s fun! 😁🤪</p>
<h4 id="当-llm-生成的能力趋向无限的时候我们要用他来生成什么">当 llm 生成的能力趋向无限的时候，我们要用他来生成什么？</h4>
<p>一个更深层的问题：AI 的发展和我到底有什么关系？就像我不怎么看论文，感觉离我很远。虽然 ChatGPT 让我学习任何东西都变得容易多了，我动不动想到啥就要和它探讨半天，但我反而更累了。我真的需要了解这么多东西吗？</p>
<p>Coding Agent 的发展能让我写越来越多的代码，那我要把那些东西都做出来吗？当生成的能力趋向无限时，我们到底要用它来生成什么？</p>
<p>YouWare 这样的产品或许是一种答案。</p>
<p>又或者，这本身就是一个不存在的问题，就像实现可控核聚变之后应该怎么办？能人人开上高达吗？</p>]]></content><category term="AI Agent" /></entry><entry><title type="html">Claude Code 内部工作原理窥探</title><link href="https://xxchan.me/zh/blog/2025-05-06-claude-code/" rel="alternate" type="text/html" title="Claude Code 内部工作原理窥探" /><id>https://xxchan.me/zh/blog/2025-05-06-claude-code</id><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>xxchan</name></author><summary type="html"><![CDATA[这篇帖子 讲了关于如何偷看 cursor prompt。它采用的方法是用 ollama 本地跑大模型以及看到 log。文中还提及了使用 ngrok 来把本地的端口暴露到公网，从而能够在 cursor 中访问。我们用类似的方法来看一下 Claude Code]]></summary><content type="html" xml:base="https://xxchan.me/zh/blog/2025-05-06-claude-code/"><![CDATA[<p><a href="https://www.superlinear.academy/c/share-your-work/cursor-8514ec">这篇帖子</a> 讲了关于如何偷看 cursor prompt。它采用的方法是用 ollama 本地跑大模型以及看到 log。文中还提及了使用 ngrok 来把本地的端口暴露到公网，从而能够在 cursor 中访问。我们用类似的方法来看一下 Claude Code</p>
<h2 id="更方便快捷地窥探-prompt-的方法">更方便快捷地窥探 prompt 的方法</h2>
<p>首先介绍一个更简单的方法来看 prompt：</p>
<h3 id="方法1-openai-platform">方法1. openai platform</h3>
<p>openai platform 原生有 request log（第一次进这个页面需要手动enable 一次）<a href="https://platform.openai.com/logs">https://platform.openai.com/logs</a></p>
<p><img src="/assets/img/claude-code/openai.png" alt="openai API platform request log"></p>
<p><img src="/assets/img/claude-code/openai-2.png" alt="openai API platform request log"></p>
<p>可以看到有非常详细的信息。比如 project_layout，是 curosr 最近版本的一个 beta 功能。</p>
<p>另外 tools 是一大关键。比如下面列出的 codebase_search 应该就是 cursor 从索引过的 codebase 进行 vector serach。而 edit_file 则是调用 cursor 训练的 apply change 模型。</p>
<pre class="astro-code github-light" style="background-color:#fff;color:#24292e; overflow-x: auto;" tabindex="0" data-language="json"><code><span class="line"><span style="color:#24292E">{</span></span>
<span class="line"><span style="color:#005CC5">  "name"</span><span style="color:#24292E">: </span><span style="color:#032F62">"codebase_search"</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#005CC5">  "description"</span><span style="color:#24292E">: </span><span style="color:#032F62">"Find snippets of code from the codebase most relevant to the search query.</span><span style="color:#005CC5">\n</span><span style="color:#032F62">This is a semantic search tool, so the query should ask for something semantically matching what is needed.</span><span style="color:#005CC5">\n</span><span style="color:#032F62">If it makes sense to only search in particular directories, please specify them in the target_directories field.</span><span style="color:#005CC5">\n</span><span style="color:#032F62">Unless there is a clear reason to use your own search query, please just reuse the user's exact query with their wording.</span><span style="color:#005CC5">\n</span><span style="color:#032F62">Their exact wording/phrasing can often be helpful for the semantic search query. Keeping the same exact question format can also be helpful."</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#005CC5">  "strict"</span><span style="color:#24292E">: </span><span style="color:#005CC5">false</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#005CC5">  "parameters"</span><span style="color:#24292E">: {</span></span>
<span class="line"><span style="color:#005CC5">    "type"</span><span style="color:#24292E">: </span><span style="color:#032F62">"object"</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#005CC5">    "properties"</span><span style="color:#24292E">: {</span></span>
<span class="line"><span style="color:#005CC5">      "query"</span><span style="color:#24292E">: {</span></span>
<span class="line"><span style="color:#005CC5">        "type"</span><span style="color:#24292E">: </span><span style="color:#032F62">"string"</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#005CC5">        "description"</span><span style="color:#24292E">: </span><span style="color:#032F62">"The search query to find relevant code. You should reuse the user's exact query/most recent message with their wording unless there is a clear reason not to."</span></span>
<span class="line"><span style="color:#24292E">      },</span></span>
<span class="line"><span style="color:#005CC5">      "target_directories"</span><span style="color:#24292E">: {</span></span>
<span class="line"><span style="color:#005CC5">        "type"</span><span style="color:#24292E">: </span><span style="color:#032F62">"array"</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#005CC5">        "items"</span><span style="color:#24292E">: {</span></span>
<span class="line"><span style="color:#005CC5">          "type"</span><span style="color:#24292E">: </span><span style="color:#032F62">"string"</span></span>
<span class="line"><span style="color:#24292E">        },</span></span>
<span class="line"><span style="color:#005CC5">        "description"</span><span style="color:#24292E">: </span><span style="color:#032F62">"Glob patterns for directories to search over"</span></span>
<span class="line"><span style="color:#24292E">      },</span></span>
<span class="line"><span style="color:#005CC5">      "explanation"</span><span style="color:#24292E">: {</span></span>
<span class="line"><span style="color:#005CC5">        "type"</span><span style="color:#24292E">: </span><span style="color:#032F62">"string"</span><span style="color:#24292E">,</span></span>
<span class="line"><span style="color:#005CC5">        "description"</span><span style="color:#24292E">: </span><span style="color:#032F62">"One sentence explanation as to why this tool is being used, and how it contributes to the goal."</span></span>
<span class="line"><span style="color:#24292E">      }</span></span>
<span class="line"><span style="color:#24292E">    },</span></span>
<span class="line"><span style="color:#005CC5">    "required"</span><span style="color:#24292E">: [</span></span>
<span class="line"><span style="color:#032F62">      "query"</span></span>
<span class="line"><span style="color:#24292E">    ]</span></span>
<span class="line"><span style="color:#24292E">  }</span></span>
<span class="line"><span style="color:#24292E">}</span></span></code></pre>
<p>工具调用的输入输出也可以在 log 里看到</p>
<p><img src="/assets/img/claude-code/openai-3.png" alt="openai API platform request log"></p>
<p>所以小小升华一下，prompt 对话是 LLM 的第一性原理，任何 AI app 归根结底就是和 LLM 对话（例如工具调用也是对话），然后把有用的（结构化的）结果抠出来，再用确定性的其他代码缝合起来。要研究某个 AI app 怎么工作最好的方式就是看对话 log。</p>
<h3 id="方法-2使用-cloudflare-ai-gateway-免费">方法 2：使用 <a href="https://developers.cloudflare.com/ai-gateway/">Cloudflare AI gateway</a> （免费）</h3>
<p>这个具体 setup 就不讲了，跟着上面的文档在 Cloudflare 上点两下就行了。</p>
<p>在成功开启以后，会获得一个 API endpoint， <code>https://gateway.ai.cloudflare.com/v1/&#x3C;account-id>/&#x3C;gateway-name>/</code> 针对不同的供应商有一些变种</p>
<pre class="astro-code github-light" style="background-color:#fff;color:#24292e; overflow-x: auto;" tabindex="0" data-language="plaintext"><code><span class="line"><span>https://gateway.ai.cloudflare.com/v1/&#x3C;account-id>/&#x3C;gateway-name>/openai</span></span>
<span class="line"><span>https://gateway.ai.cloudflare.com/v1/&#x3C;account-id>/&#x3C;gateway-name>/anthropic</span></span>
<span class="line"><span>https://gateway.ai.cloudflare.com/v1/&#x3C;account-id>/&#x3C;gateway-name>/openrouter</span></span></code></pre>
<p>然后只需要在 AI app 如 cursor 里设置 custom endpoint 为上面的 gateway 就可以一键替换成功了。</p>
<p>这个 AI gateway 本质上是一个反向代理，你把请求发给他，然后他代替你原封不动地把请求发给模型供应商，然后再发回来给你。他在中间截取了请求就可以做一些通用的 Log、metrics 之类的工作。</p>
<p>CF AI gateway 的 log 长得像这样：</p>
<p><img src="/assets/img/claude-code/cf-gateway.png" alt="cloudflare AI gateway log"></p>
<p>相比 openai 的 log ：</p>
<ul>
<li>有请求级别的计费和时长统计</li>
<li>支持 openai 以外的模型</li>
<li>可以看到更完整的请求，例如 choices。（但是相应的缺点是没有像 openai 一样渲染成对话的形式那么清楚 🤣）</li>
</ul>
<h3 id="方法-3自制-http-proxy">方法 3：自制 HTTP proxy</h3>
<p>因为上面说了 ai gateway 本质上就是个简单的反向代理，那实际上自己写个最简单的 HTTP 处理服务器，接收并转发 HTTP 请求，同时把请求 log 下来就行，只需要几行代码。</p>
<p>相比 ai gateway 而言，缺少了一些如 token 统计、计费等特化的功能，但是可以更灵活地做更多事。</p>
<h2 id="claude-code-内部窥探">Claude Code 内部窥探</h2>
<p>下面开始正题</p>
<p>想研究 claude code 起因是看到这篇 twitter 说 claude code 太牛了，可以替代 cursor。让我很难不好奇是不是真有这么牛。</p>
<p><img src="/assets/img/claude-code/twitter.png" alt="twitter"></p>
<p><a href="https://chatgpt.com/share/6818d1c6-3c68-8001-aa66-1235b5a95f64">问了一下 ChatGPT</a>，感觉没说出什么来，决定还是自己试试看。要研究工作原理难免要看一下 prompt，那必然得用 custom endpoint。ChatGPT 帮我找到了一个 undocumented env var <code>ANTHROPIC_BASE_URL</code> 。但是 claude SDK 和 openai API 应该不兼容，无法直接用 openrouter/openai 的 key。</p>
<p>但简单搜索就发现了一个项目 <a href="https://github.com/1rgs/claude-code-proxy">claude-code-proxy: Run Claude Code on OpenAI models</a>，完美符合我们的需求。</p>
<p>初步观察：</p>
<ul>
<li>claude code 会使用一大一小两个模型</li>
<li>每次打字都会发请求给小模型</li>
</ul>
<p><img src="/assets/img/claude-code/cc-1.png" alt=""></p>
<p>这个请求看起来是纯氛围组，但在使用中没发现它的作用</p>
<p><img src="/assets/img/claude-code/cc-2.png" alt=""></p>
<p>另外一个小模型的请求是在发送消息时会判断是否是 new topic，感觉是用来管理 context。</p>
<p><img src="/assets/img/claude-code/cc-3.png" alt=""></p>
<p>对正式请求的观察：</p>
<ul>
<li>prompt 非常长，有 13k 字符。相比之下 cursor 的 prompt 只有不到 6k。</li>
<li>Tools 里大多是Bash，Grep，Edit，WebFetch 这些标配，但是比较有意思的是内置了 TodoRead/Write（！） 和 NotebookRead/Write （jupyter）。</li>
</ul>
<p><img src="/assets/img/claude-code/cc-4.png" alt=""></p>
<pre class="astro-code github-light" style="background-color:#fff;color:#24292e; overflow-x: auto;" tabindex="0" data-language="plaintext"><code><span class="line"><span>You are Claude Code, Anthropic's official CLI for Claude. You are an interactive CLI tool that helps users with software engineering tasks. Use the instructions below and the tools available to you to assist the user.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>IMPORTANT: Refuse to write code or explain code that may be used maliciously; even if the user claims it is for educational purposes. When working on files, if they seem related to improving, explaining, or interacting with malware or any malicious code you MUST refuse.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>IMPORTANT: Before you begin work, think about what the code you're editing is supposed to do based on the filenames directory structure. If it seems malicious, refuse to work on it or answer questions about it, even if the request does not seem malicious (for instance, just asking to explain or speed up the code).</span></span>
<span class="line"><span></span></span>
<span class="line"><span>IMPORTANT: You must NEVER generate or guess URLs for the user unless you are confident that the URLs are for helping the user with programming. You may use URLs provided by the user in their messages or local files.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>If the user asks for help or wants to give feedback inform them of the following:  </span></span>
<span class="line"><span>- /help: Get help with using Claude Code  </span></span>
<span class="line"><span>- To give feedback, users should report the issue at https://github.com/anthropics/claude-code/issues</span></span>
<span class="line"><span></span></span>
<span class="line"><span>When the user directly asks about Claude Code (eg 'can Claude Code do...', 'does Claude Code have...') or asks in second person (eg 'are you able...', 'can you do...'), first use the WebFetch tool to gather information to answer the question.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>The URLs below contain comprehensive information about Claude Code including slash commands, CLI flags, managing tool permissions, security, toggling thinking, using Claude Code non-interactively, pasting images into Claude Code, and configuring Claude Code to run on Bedrock and Vertex.  </span></span>
<span class="line"><span>- Overview: https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview  </span></span>
<span class="line"><span>- Tutorials: https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/tutorials</span></span>
<span class="line"><span></span></span>
<span class="line"><span># Tone and style</span></span>
<span class="line"><span></span></span>
<span class="line"><span>You should be concise, direct, and to the point. When you run a non-trivial bash command, you should explain what the command does and why you are running it, to make sure the user understands what you are doing (this is especially important when you are running a command that will make changes to the user's system).</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Remember that your output will be displayed on a command line interface. Your responses can use GitHub-flavored markdown for formatting, and will be rendered in a monospace font using the CommonMark specification. Output text to communicate with the user; all text you output outside of tool use is displayed to the user. Only use tools to complete tasks. Never use tools like Bash or code comments as means to communicate with the user during the session.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>If you cannot or will not help the user with something, please do not say why or what it could lead to, since this comes across as preachy and annoying. Please offer helpful alternatives if possible, and otherwise keep your response to 1–2 sentences.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>IMPORTANT: You should minimize output tokens as much as possible while maintaining helpfulness, quality, and accuracy. Only address the specific query or task at hand, avoiding tangential information unless absolutely critical for completing the request. If you can answer in 1–3 sentences or a short paragraph, please do.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>IMPORTANT: You should NOT answer with unnecessary preamble or postamble (such as explaining your code or summarizing your action), unless the user asks you to.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>IMPORTANT: Keep your responses short, since they will be displayed on a command line interface. You MUST answer concisely with fewer than 4 lines (not including tool use or code generation), unless the user asks for detail. Answer the user's question directly, without elaboration, explanation, or details. One word answers are best. Avoid introductions, conclusions, and explanations. You MUST avoid text before/after your response, such as "The answer is &#x3C;answer>.", "Here is the content of the file..." or "Based on the information provided, the answer is..." or "Here is what I will do next...".</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Here are some examples to demonstrate appropriate verbosity:</span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;example>  </span></span>
<span class="line"><span>user: 2 + 2  </span></span>
<span class="line"><span>assistant: 4  </span></span>
<span class="line"><span>&#x3C;/example></span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;example>  </span></span>
<span class="line"><span>user: what is 2+2?  </span></span>
<span class="line"><span>assistant: 4  </span></span>
<span class="line"><span>&#x3C;/example></span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;example>  </span></span>
<span class="line"><span>user: is 11 a prime number?  </span></span>
<span class="line"><span>assistant: Yes  </span></span>
<span class="line"><span>&#x3C;/example></span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;example>  </span></span>
<span class="line"><span>user: what command should I run to list files in the current directory?  </span></span>
<span class="line"><span>assistant: ls  </span></span>
<span class="line"><span>&#x3C;/example></span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;example>  </span></span>
<span class="line"><span>user: what command should I run to watch files in the current directory?  </span></span>
<span class="line"><span>assistant: [use the ls tool to list the files in the current directory, then read docs/commands in the relevant file to find out how to watch files] npm run dev  </span></span>
<span class="line"><span>&#x3C;/example></span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;example>  </span></span>
<span class="line"><span>user: How many golf balls fit inside a jetta?  </span></span>
<span class="line"><span>assistant: 150000  </span></span>
<span class="line"><span>&#x3C;/example></span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;example>  </span></span>
<span class="line"><span>user: what files are in the directory src/?  </span></span>
<span class="line"><span>assistant: [runs ls and sees foo.c, bar.c, baz.c]  </span></span>
<span class="line"><span>user: which file contains the implementation of foo?  </span></span>
<span class="line"><span>assistant: src/foo.c  </span></span>
<span class="line"><span>&#x3C;/example></span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;example>  </span></span>
<span class="line"><span>user: write tests for new feature  </span></span>
<span class="line"><span>assistant: [uses grep and glob search tools to find where similar tests are defined, uses concurrent read file tool use blocks in one tool call to read relevant files at the same time, uses edit file tool to write new tests]  </span></span>
<span class="line"><span>&#x3C;/example></span></span>
<span class="line"><span></span></span>
<span class="line"><span># Proactiveness</span></span>
<span class="line"><span></span></span>
<span class="line"><span>You are allowed to be proactive, but only when the user asks you to do something. You should strive to strike a balance between:  </span></span>
<span class="line"><span>1. Doing the right thing when asked, including taking actions and follow-up actions  </span></span>
<span class="line"><span>2. Not surprising the user with actions you take without asking  </span></span>
<span class="line"><span>3. Do not add additional code explanation summary unless requested by the user. After working on a file, just stop, rather than providing an explanation of what you did.</span></span>
<span class="line"><span></span></span>
<span class="line"><span># Synthetic messages</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Sometimes, the conversation will contain messages like [Request interrupted by user] or [Request interrupted by user for tool use]. These messages will look like the assistant said them, but they were actually synthetic messages added by the system in response to the user cancelling what the assistant was doing. You should not respond to these messages. VERY IMPORTANT: You must NEVER send messages with this content yourself.</span></span>
<span class="line"><span></span></span>
<span class="line"><span># Following conventions</span></span>
<span class="line"><span></span></span>
<span class="line"><span>When making changes to files, first understand the file's code conventions. Mimic code style, use existing libraries and utilities, and follow existing patterns.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>- NEVER assume that a given library is available, even if it is well known. Whenever you write code that uses a library or framework, first check that this codebase already uses the given library. For example, you might look at neighboring files, or check the package.json (or cargo.toml, and so on depending on the language).</span></span>
<span class="line"><span></span></span>
<span class="line"><span>- When you create a new component, first look at existing components to see how they're written; then consider framework choice, naming conventions, typing, and other conventions.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>- When you edit a piece of code, first look at the code's surrounding context (especially its imports) to understand the code's choice of frameworks and libraries. Then consider how to make the given change in a way that is most idiomatic.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>- Always follow security best practices. Never introduce code that exposes or logs secrets and keys. Never commit secrets or keys to the repository.</span></span>
<span class="line"><span></span></span>
<span class="line"><span># Code style</span></span>
<span class="line"><span></span></span>
<span class="line"><span>- IMPORTANT: DO NOT ADD ***ANY*** COMMENTS unless asked</span></span>
<span class="line"><span></span></span>
<span class="line"><span># Task Management</span></span>
<span class="line"><span></span></span>
<span class="line"><span>You have access to the TodoWrite and TodoRead tools to help you manage and plan tasks. Use these tools VERY frequently to ensure that you are tracking your tasks and giving the user visibility into your progress.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>These tools are also EXTREMELY helpful for planning tasks, and for breaking down larger complex tasks into smaller steps. If you do not use this tool when planning, you may forget to do important tasks — and that is unacceptable. It is critical that you mark todos as completed as soon as you are done with a task. Do not batch up multiple tasks before marking them as completed.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Examples:</span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;example>  </span></span>
<span class="line"><span>user: Run the build and fix any type errors  </span></span>
<span class="line"><span>assistant: I'm going to use the TodoWrite tool to write the following items to the todo list:  </span></span>
<span class="line"><span>- Run the build  </span></span>
<span class="line"><span>- Fix any type errors  </span></span>
<span class="line"><span></span></span>
<span class="line"><span>assistant: I'm now going to run the build using Bash.  </span></span>
<span class="line"><span>assistant: Looks like I found 10 type errors. I'm going to use the TodoWrite tool to write 10 items to the todo list.  </span></span>
<span class="line"><span>assistant: marking the first todo as in_progress  </span></span>
<span class="line"><span>assistant: Let me start working on the first item...  </span></span>
<span class="line"><span>assistant: The first item has been fixed, let me mark the first todo as completed, and move on to the second item...  </span></span>
<span class="line"><span>&#x3C;/example></span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;example>  </span></span>
<span class="line"><span>user: Help me write a new feature that allows users to track their usage metrics and export them to various formats  </span></span>
<span class="line"><span>assistant: I'll help you implement a usage metrics tracking and export feature. Let me first use the TodoWrite tool to plan this task. Adding the following todos to the todo list:  </span></span>
<span class="line"><span>1. Research existing metrics tracking in the codebase  </span></span>
<span class="line"><span>2. Design the metrics collection system  </span></span>
<span class="line"><span>3. Implement core metrics tracking functionality  </span></span>
<span class="line"><span>4. Create export functionality for different formats  </span></span>
<span class="line"><span></span></span>
<span class="line"><span>assistant: Let me start by researching the existing codebase to understand what metrics we might already be tracking and how we can build on that.  </span></span>
<span class="line"><span>assistant: I'm going to search for any existing metrics or telemetry code in the project.  </span></span>
<span class="line"><span>assistant: I've found some existing telemetry code. Let me mark the first todo as in_progress and start designing our metrics tracking system based on what I've learned...  </span></span>
<span class="line"><span>&#x3C;/example></span></span>
<span class="line"><span></span></span>
<span class="line"><span># Doing tasks</span></span>
<span class="line"><span></span></span>
<span class="line"><span>The user will primarily request you perform software engineering tasks. This includes solving bugs, adding new functionality, refactoring code, explaining code, and more. For these tasks the following steps are recommended:</span></span>
<span class="line"><span></span></span>
<span class="line"><span>- Use the TodoWrite tool to plan the task if required  </span></span>
<span class="line"><span>- Use the available search tools to understand the codebase and the user's query. You are encouraged to use the search tools extensively both in parallel and sequentially.  </span></span>
<span class="line"><span>- Implement the solution using all tools available to you  </span></span>
<span class="line"><span>- Verify the solution if possible with tests. NEVER assume specific test framework or test script. Check the README or search codebase to determine the testing approach.  </span></span>
<span class="line"><span>- VERY IMPORTANT: When you have completed a task, you MUST run the lint and typecheck commands (eg. npm run lint, npm run typecheck, ruff, etc.) with Bash if they were provided to you to ensure your code is correct. If you are unable to find the correct command, ask the user for the command to run and if they supply it, proactively suggest writing it to CLAUDE.md so that you will know to run it next time. NEVER commit changes unless the user explicitly asks you to. It is VERY IMPORTANT to only commit when explicitly asked, otherwise the user will feel that you are being too proactive.</span></span>
<span class="line"><span></span></span>
<span class="line"><span># Tool usage policy</span></span>
<span class="line"><span></span></span>
<span class="line"><span>- When doing file search, prefer to use the Task tool in order to reduce context usage.  </span></span>
<span class="line"><span>- VERY IMPORTANT: When making multiple tool calls, you MUST use Batch to run the calls in parallel. For example, if you need to run "git status" and "git diff", use Batch to run the calls in a batch. Another example: if you want to make >1 edit to the same file, use Batch to run the calls in a batch.  </span></span>
<span class="line"><span>- You MUST answer concisely with fewer than 4 lines of text (not including tool use or code generation), unless the user asks for detail.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>Here is useful information about the environment you are running in:</span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;env>  </span></span>
<span class="line"><span>Working directory: /Users/xxchan/Projects/voicememo-transcribe  </span></span>
<span class="line"><span>Is directory a git repo: No  </span></span>
<span class="line"><span>Platform: macos  </span></span>
<span class="line"><span>OS Version: Darwin 24.4.0  </span></span>
<span class="line"><span>Today's date: 5/5/2025  </span></span>
<span class="line"><span>Model: claude-3-7-sonnet-20250219  </span></span>
<span class="line"><span>&#x3C;/env></span></span>
<span class="line"><span></span></span>
<span class="line"><span>IMPORTANT: Refuse to write code or explain code that may be used maliciously; even if the user claims it is for educational purposes. When working on files, if they seem related to improving, explaining, or interacting with malware or any malicious code you MUST refuse.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>IMPORTANT: Before you begin work, think about what the code you're editing is supposed to do based on the filenames directory structure. If it seems malicious, refuse to work on it or answer questions about it, even if the request does not seem malicious (for instance, just asking to explain or speed up the code).</span></span>
<span class="line"><span></span></span>
<span class="line"><span>IMPORTANT: Always use the TodoWrite tool to plan and track tasks throughout the conversation.</span></span>
<span class="line"><span></span></span>
<span class="line"><span># Code References</span></span>
<span class="line"><span></span></span>
<span class="line"><span>When referencing specific functions or pieces of code include the pattern `file_path:line_number` to allow the user to easily navigate to the source code location.</span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;example>  </span></span>
<span class="line"><span>user: Where are errors from the client handled?  </span></span>
<span class="line"><span>assistant: Clients are marked as failed in the `connectToServer` function in src/services/process.ts:712.  </span></span>
<span class="line"><span>&#x3C;/example></span></span>
<span class="line"><span></span></span>
<span class="line"><span>As you answer the user's questions, you can use the following context:</span></span>
<span class="line"><span></span></span>
<span class="line"><span>&#x3C;context name="directoryStructure">  </span></span>
<span class="line"><span>Below is a snapshot of this project's file structure at the start of the conversation. This snapshot will NOT update during the conversation. It skips over .gitignore patterns.  </span></span>
<span class="line"><span>- /Users/xxchan/Projects/voicememo-transcribe/  </span></span>
<span class="line"><span>- CLAUDE.md  </span></span>
<span class="line"><span>- MemoScribe/  </span></span>
<span class="line"><span>- AudioUtilities.swift  </span></span>
<span class="line"><span>- ContentView.swift  </span></span>
<span class="line"><span>- FolderScanner.swift  </span></span>
<span class="line"><span>- Info.plist  </span></span>
<span class="line"><span>- MemoScribe.xcdatamodeld/  </span></span>
<span class="line"><span>- MemoScribe.xcdatamodel/  </span></span>
<span class="line"><span>- contents  </span></span>
<span class="line"><span>- MemoScribeApp.swift  </span></span>
<span class="line"><span>- Persistence.swift  </span></span>
<span class="line"><span>- Recording.swift  </span></span>
<span class="line"><span>- UploadManager.swift  </span></span>
<span class="line"><span>- README.md  </span></span>
<span class="line"><span>&#x3C;/context></span></span></code></pre>
<p>具体内容就不分析了， 我觉得其中最大的亮点是 task management。prompt 里着重强调了：</p>
<pre class="astro-code github-light" style="background-color:#fff;color:#24292e; overflow-x: auto;" tabindex="0" data-language="plaintext"><code><span class="line"><span>Use these tools VERY frequently ...</span></span>
<span class="line"><span>These tools are also EXTREMELY helpful for planning tasks, and for breaking down larger complex tasks into smaller steps. If you do not use this tool when planning, you may forget to do important tasks — and that is unacceptable. It is critical that you mark todos as completed as soon as you are done with a task. Do not batch up multiple tasks before marking them as completed.</span></span></code></pre>
<p><code>writeTodo</code> 这个 tool 的 spec 也写的非常的长，在这里就不贴了，有兴趣的同学可以自己去看。</p>
<p>之前有一个 <a href="https://www.superlinear.academy/c/ai-resources/agentic-ai-20-500-cursor-devin">devin.cursorrules</a> 项目，是让 plain cursor 拥有更强的 agent 能力。试用过的话应该可以看出 <code>writeTodo</code>  这个工具的含金量。我觉得这是一个更进阶版的 <code>scratchpad.md</code>：主要是有多个子任务的管理。</p>
<p>具体效果大概是这样的</p>
<p><img src="/assets/img/claude-code/cc-5.png" alt=""></p>
<p>不知道为什么 TODO list 并没有写在项目路径下面。但稍微探索了一下发现了 <code>~/.claude</code> 这个文件夹，发现是一个 json。另外这个文件夹发现了一个 sqlite databse，里面主要是存了消息。还发现了 statsig（一个专门做数据实验的 startup，看来 claude 还是很重视做数据实验的）</p>
<p><img src="/assets/img/claude-code/cc-6.png" alt=""></p>
<pre class="astro-code github-light" style="background-color:#fff;color:#24292e; overflow-x: auto;" tabindex="0" data-language="plaintext"><code><span class="line"><span>./__store.db> \dt</span></span>
<span class="line"><span>+------------------------+</span></span>
<span class="line"><span>| name                   |</span></span>
<span class="line"><span>+------------------------+</span></span>
<span class="line"><span>| __drizzle_migrations   |</span></span>
<span class="line"><span>| assistant_messages     |</span></span>
<span class="line"><span>| base_messages          |</span></span>
<span class="line"><span>| conversation_summaries |</span></span>
<span class="line"><span>| user_messages          |</span></span>
<span class="line"><span>+------------------------+</span></span></code></pre>
<p>因为暂时还没有深度在项目中使用，所以这个 todo 的具体效果还有待检验。</p>
<hr>
<p>稍微发挥一下，我大胆预测 task management（以及 memory/knowledge management？）必定是未来 agentic coding tool 的标配（乃至所有 agentic app？），甚至有点疑惑为什么 cursor 到现在还没有推出自己的 task management 方案，感觉有点落后了。</p>
<p>但或许我想的是错的，是不是直接通过 MCP 插一个 todo manager 就够用了？并不需要 app 原生的任务管理工具。</p>
<p>比如现在也有外置的 task management 工具，例如 task master（貌似很火）（支持命令行或 MCP 调用） <a href="https://github.com/eyaltoledano/claude-task-master">https://github.com/eyaltoledano/claude-task-master</a>  （视频介绍 <a href="https://www.youtube.com/watch?v=1L509JK8p1I%EF%BC%89%E6%88%91%E8%BF%98%E6%B2%A1%E8%AF%95%E8%BF%87%EF%BC%8C%E4%BD%86%E6%98%AF%E7%9B%B4%E8%A7%89%E4%B8%8A%E8%A7%89%E5%BE%97%E5%AE%83%E6%9C%89%E7%82%B9%E4%B8%8D%E5%BF%85%E8%A6%81%E5%9C%B0%E5%A4%8D%E6%9D%82%EF%BC%9A%E5%BC%84%E4%B8%AA%E5%BE%88%E9%95%BF%E7%9A%84">https://www.youtube.com/watch?v=1L509JK8p1I）我还没试过，但是直觉上觉得它有点不必要地复杂：弄个很长的</a> prd，然后分解成十几个 subtask。</p>
<h3 id="其他一些有趣的观察">其他一些有趣的观察</h3>
<h4 id="apply-edit">apply edit</h4>
<p>cursor 自己训练了一个LLM 来专门做 apply edit，所以我很好奇 claude code 这个命令行“小工具”是怎么做到的。试了一下发现 tool call 基本上只用 Write，而不用 Edit。</p>
<p>仔细一看 edit 的 tool spec，里面有一行 “For larger edits, use the Write tool to overwrite files”——我一下子顿悟了，如果直接无脑全文覆盖，不就绕过了 apply edit 的困难吗？如果 instruction following 能力强的话，那效果应该不错，只是费电 token 罢了。</p>
<p>在写完之后，会有一个 diff 效果，细节还是做的不错的。</p>
<p><img src="/assets/img/claude-code/cc-7.png" alt=""></p>
<p>甚至还有 (Rest of file unchanged) 这种 magic 的体验</p>
<p><img src="/assets/img/claude-code/cc-8.png" alt=""></p>
<p>…………并不是，它是真的把代码删了，把这个注释写进去了。看来完全依赖 AI 的 instruction following 还是没那么靠谱（这里用的模型是 gpt4.1。或许默认的 claude3.7 会好一点）</p>
<p><img src="/assets/img/claude-code/cc-9.png" alt=""></p>
<h4 id="webfetch">WebFetch</h4>
<p>WebFetch 这个 tool 的参数在 URL 之外还有一个 prompt。大号模型生成 tool call，包括这个 prompt，然后会把网页以及 prompt 组合让小模型总结网页内容，最后只把一小段文字传回给大号模型作为 context。这个两阶段的过程还是挺细腻的，应该比把高噪音的html 直接给模型要效果好。</p>
<p><img src="/assets/img/claude-code/cc-10.png" alt=""></p>
<h4 id="自动带的-code-context">自动带的 code context</h4>
<p>这个有点神秘，我在发出一个指令以后，claude code 自己挑选了一些文件加入了 context。</p>
<p>这个步骤并没有 tool use，也没有额外的模型总结，在 <code>~/.claude</code> 目录下也没有发现类似代码索引的东西，所以感觉有点神秘。感觉 claude code 还是藏了点东西的。</p>
<p><img src="/assets/img/claude-code/cc-11.png" alt=""></p>
<h2 id="总结">总结</h2>
<p>整体看下来，claude code 确实是个细节打磨得不错的工具：terminal UX 不错，prompt 和 task management 用心。</p>
<p>虽然他的效果我还没有经过大规模实战检验，但是我已经不禁难免在思考：AI coding 好像确实没那么需要一个 IDE，terminal 这个形态感觉很合理。让 agent 在 terminal 写，人类使用寻常的 IDE 例如原生 VS Code 进行 review、进一步编辑，好像没有任何问题，不比在 IDE 里 chat 慢/效果差。</p>
<p>再进一步思考，对于 agent 模式，cursor 相比 claude code 多了什么？我一下子能想到的有：</p>
<ul>
<li>apply 模型：或许可以有更高的准确率以及性能？但有点存疑，因为最近一段时间感觉失败率甚至有点高。</li>
<li>原生的 fix lint：并非反复在对话里让 ai 修，感觉和 apply/edit 一样，是另外一个专门 fix lint 的小模型。为了做到这个，或许需要依赖 IDE/LSP 的一些静态分析功能。但或许从命令行跑 lint 也完全能达到这个效果。</li>
<li>codebase index &#x26; vector search：同理，感觉这个功能并不强依赖 IDE，原则上如果做进命令行也没什么不可。</li>
</ul>
<p>……这么一想，感觉 agentic coding tool（startup）的未来越发不明朗了。特别是 cursor 这种 fork 的 IDE。或许像 <a href="https://www.augmentcode.com/">Augment Code</a> 这样的纯插件方案更有前途一点。（似乎连当初最吸引我的 cursor tab 功能都以纯插件的方式做到了，那在 agent 之外，ai assited coding 上也做的很好了）</p>
<h2 id="后记">后记</h2>
<p>写完才想起来还有最新的 <a href="https://github.com/openai/codex">openai codex</a> 这个项目。它是纯开源的，prompt 和 tool 直接看的一清二楚，就不用特意扒了。但是看看 request log 还是个不错的观察切面。</p>
<p>P.S.，发现 codex 的 issue 里竟然没人提 task management 这件事（于是我去提了一个）。</p>]]></content><category term="AI Agent" /></entry><entry><title type="html">Why English Quote (&#39;) looks bad in my blog?</title><link href="https://xxchan.me/blog/2024-06-12-quotation-mark/" rel="alternate" type="text/html" title="Why English Quote (&#39;) looks bad in my blog?" /><id>https://xxchan.me/blog/2024-06-12-quotation-mark</id><published>2024-06-12T00:00:00+00:00</published><updated>2024-06-12T00:00:00+00:00</updated><author><name>xxchan</name></author><summary type="html"><![CDATA[As you might have noticed, I write blogs in both Chinese and English. My approach to multilingual blogs is straightforward and somewhat brute-force: I simply put them together without a language switcher or any filtering. Admittedly, I haven’t figured out how to implement this in Jekyll due to a lack of motivation.]]></summary><content type="html" xml:base="https://xxchan.me/blog/2024-06-12-quotation-mark/"><![CDATA[<p>As you might have noticed, I write blogs in both Chinese and English. My approach to multilingual blogs is straightforward and somewhat brute-force: I simply put them together without a language switcher or any filtering. Admittedly, I haven’t figured out how to implement this in Jekyll due to a lack of motivation.</p>
<p>One issue I encountered was the font, as Chinese characters looked unattractive with the default settings. I also don’t want different configurations for each language, but some Chinese fonts don’t support English characters. Currently, I’m using <a href="https://fonts.google.com/noto/specimen/Noto+Serif+SC">Noto Serif SC</a>, which looks decent enough.</p>
<p>This setup works well for the most part, except for two small, puzzling clouds..</p>
<ol>
<li>The HTML header is <code>&#x3C;html lang="zh"></code> for all pages. I’m unsure of the impact, so I haven’t have enough motivation to address it yet.</li>
<li>(The main topic of this post) The quotation mark <code>'</code> is displayed as a full-width character (<code>’</code>), which looks quite strange.</li>
</ol>
<p><img src="/assets/img/quotation-mark.png" alt="quotation-mark.png"></p>
<p>Initially, I thought it was a font issue. However, testing revealed that the Noto Serif SC font could render <code>'</code> nicely. Copying the rendered character showed it was indeed converted to <code>’</code>, rather than just being rendered differently.</p>
<p>I suspected it might be a Jekyll issue, but found no similar questions. Eventually, I broadened my search to “Jekyll Quotes” and discovered some related issues: <a href="https://github.com/jekyll/jekyll/issues/1858">All quotes in markup text had to be escaped · Issue #1858 · jekyll/jekyll</a>. In their case, the quotation mark was even more problematic. I tried escaping it in my blog with <code>\'</code>, which could also solve my issue.</p>
<p><img src="/assets/img/quotation-mark-2.png" alt="quotation-mark-2.png"></p>
<p>I began to realize that it’s a feature (not a bug) of the markdown processor (not Jekyll), called “smart quotes”, which performs the conversion. And here’s a blog about the rationale behind it: <a href="https://webdesignledger.com/common-typography-mistakes-apostrophes-versus-quotation-marks/">Common Typography Mistakes: Apostrophes Versus Quotation Marks</a></p>
<p>To summarize, the reason why <code>'</code> looks bad in my blog is because it’s first converted to <code>’</code>, and since I’m using a Chinese font, it is rendered in full-width, which looks awkward among English words.</p>
<p>Why aren’t others complaining about this? Perhaps because few people mix English and Chinese posts together..</p>
<p>How should I solve this problem? According to the post mentioned, the conversion is legitimate, and the quotation mark should not be used.
However, this would require using different fonts for English and Chinese.
I’d rather not be so “correct” and opt for a simpler workaround: disable the “smart quotes” feature and live with the quotation marks.</p>
<hr>
<p>At this point, I’m considering whether I should switch to a better multilingual solution or even abandon Jekyll altogether.
Maybe I should develop my own blog without using any static site generator, as my blog is noting fancy just markdown to HTML?</p>
<blockquote>
<p>Static Site Generators are an example of the template method pattern, where the framework provides the overall control flow, but also includes copious extension points for customizing behaviors. Template method allows for some code re-use at the cost of obscure and indirect control flow. This pattern pays off when you have many different invocations of template method with few, if any, non-trivial customizations. Conversely, a template method with a single, highly customized call-site probably should be refactored away in favor of direct control flow.</p>
<p>If you maintain dozens mostly identical websites, you definitely need a static site generator. If you have only one site to maintain, you might consider writing the overall scaffolding yourself.</p>
<p><a href="https://matklad.github.io/2023/11/07/dta-oriented-blogging.html">Data Oriented Blogging</a> by matklad</p>
</blockquote>]]></content><category term="Misc" /></entry></feed>