10 Using Powerful AI to Write Evaluation Scripts to Grade Other AIs

In business models, high frequency beats low frequency; in AI, high intelligence beats low intelligence, and whoever is being beaten is completely helpless.

On the last day of the Spring Festival holiday, I used GPT-5.2 xhigh to add automated evaluation scripts to the AI Programming 2.0 evaluation task I had set up earlier, turning it into a formal benchmark. The scripts were correct on the first try. I had it revise the test cases once because they didn't meet my expectations, then asked it to add automatic leaderboard statistics and auto-push features. Both of those also worked on the first try.

This is what writing code with a powerful AI feels like now: regardless of the task's size, you basically only need to state your needs, set your requirements, and briefly verify the results, without looking at the code at all.

Below, I'll go through how I did it in detail. As usual, let's go straight to the chat logs. Before starting, use /model to switch to the strongest model, GPT-5.2 xhigh. Then I...