This article evaluates how fine-tuning affects AI reasoning on structured puzzle tasks. Using Open-LLaMA as a base, models were trained on datasets of varying sizes (1M, 10M, 100M). Results show clear scaling benefits: the 100M-sample model achieved the best pass@1 accuracy in both in-distribution and out-of-distribution tests. While smaller models struggled with limited reasoning steps or logical errors, larger fine-tuned models demonstrated deeper problem-solving ability, outperforming both base and prompt-engineered approaches.

Evaluating Fine-Tuned LLMs on Reasoning Puzzles

:::info Authors:

(1) Haolong Li, Tongji University and work done during internship at ByteDance (furlongli322@gmail.com);

(2) Yu Ma, Seed Foundation, ByteDance (mayu.1231@bytedance.com);

(3) Yinqi Zhang, East China Normal University and work done during internship at ByteDance (zhang.inch@gmail.com);

(4) Chen Ye (Corresponding Author), ESSC Lab, Tongji University (yechen@tongji.edu.cn);

(5) Jie Chen (Project Leader), Seed Foundation, ByteDance (chenjiexjtu@gmail.com).

:::

Abstract and 1 Introduction

2 Problem Definition

2.1 Arithmetical Puzzle Problem

2.2 Data Synthesizing

2.3 Dataset

3 Model

4 Experiments

4.1 Evaluation

4.2 Results

4.3 Case Studies

5 Conclusion and Acknowledgements

6 Limitations

7 Ethics Statement and References

A Appendix

A.1 Hyperparameter Settings

A.2 Evaluation of the Base Model

A.3 Case Study

A.4 Visualization of the Proposed Puzzle

4.1 Evaluation

For the fine-tuned model, we use a greedy decoding strategy in a zero-shot setting to generate responses. To measure the model's performance on the proposed puzzle, a corresponding verifier is designed to automatically evaluate the correctness of the responses; a minimal sketch of such a verifier follows the list below. Specifically, a solution is deemed correct if it satisfies all of the following rules:

• No extra or illegal characters.

• There are exactly N − 1 equations, and every corresponding calculation is correct.

• F(X1, …, XN | ops) = T.

• Each of {Xi | i ∈ {1, 2, …, N}} and each intermediate calculation result is used exactly once.
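To make the protocol concrete, here is a minimal verifier sketch in Python. It assumes responses are written as N − 1 lines of the form `a <op> b = c` over integers; the exact response format, operator set, and division semantics of the paper's verifier (Algorithm 2) are assumptions here, not taken from the paper.

```python
import re

def verify(response: str, candidates: list[int], target: int) -> bool:
    """Check a puzzle response against the four correctness rules above."""
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    # Rule 2 (count): exactly N - 1 equations.
    if len(lines) != len(candidates) - 1:
        return False
    pool = list(candidates)  # multiset of numbers still available for use
    # Rule 1: each line must match "a <op> b = c" with no extra characters
    # (this line format is an assumption about the response template).
    pattern = re.compile(r"^(-?\d+)\s*([+\-*/])\s*(-?\d+)\s*=\s*(-?\d+)$")
    result = None
    for ln in lines:
        m = pattern.match(ln)
        if m is None:
            return False
        a, op, b, c = int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4))
        # Rule 4: operands must come from the candidates or from earlier
        # intermediate results, and each may be used exactly once.
        for operand in (a, b):
            if operand not in pool:
                return False
            pool.remove(operand)
        # Rule 2 (correctness): the stated calculation must hold
        # (exact integer division is assumed here).
        if op == "+":
            value = a + b
        elif op == "-":
            value = a - b
        elif op == "*":
            value = a * b
        else:  # op == "/"
            if b == 0 or a % b != 0:
                return False
            value = a // b
        if value != c:
            return False
        pool.append(c)  # the intermediate result becomes available once
        result = c
    # Rule 3: the final result equals the target T, with nothing left unused.
    return result == target and pool == [result]
```

For example, with candidates (3, 5, 7) and target 22, the response `"3 * 5 = 15\n15 + 7 = 22"` satisfies all four rules and verifies as correct.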

Figure 1: Distributions of N and X for different training set sizes (1M / 10M / 100M samples). N denotes the total number of candidate integers of our puzzle, and X = (X1, X2, …, XN) denotes the candidate integers.

Figure 2: Distributions of the tokenized prompt and response lengths for different training set sizes (1M / 10M / 100M samples).

The detailed steps for evaluating a solution to this puzzle are described in Algorithm 2.

4.2 Results

As mentioned in Section 2.3, we generated three training datasets of different sizes to explore data scaling effects on the fine-tuned model. The pass@1 rates on the in-distribution and out-of-distribution test datasets are shown in Table 2. When fine-tuned on 100M samples, the model achieves the highest scores, with a zero-shot pass@1 of 0.44 on the in-distribution test dataset, and 0.33 and 0.35 on the two OOD datasets, respectively.
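Under this protocol, zero-shot pass@1 reduces to plain accuracy over a single greedy completion per test prompt. A minimal sketch follows, reusing the `verify()` helper above; `generate_greedy` and the test-record field names are hypothetical stand-ins, not the paper's evaluation harness.

```python
def pass_at_1(test_set: list[dict], generate_greedy) -> float:
    """Zero-shot pass@1: one greedy sample per prompt, scored by the verifier."""
    correct = 0
    for puzzle in test_set:
        response = generate_greedy(puzzle["prompt"])  # hypothetical decoder hook
        if verify(response, puzzle["candidates"], puzzle["target"]):
            correct += 1
    return correct / len(test_set)
```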

Furthermore, we show the training curves of the models fine-tuned on these three datasets in Figure 3. A faster decay of the training loss is clearly observed as the training data size increases, consistent with the rapid increase of the pass@1 rate evaluated on the in-distribution dataset. The same performance gain also appears on the two OOD test datasets, as shown in Table 2.

Additionally, we tested this puzzle on the base model (open-llama-3B) and several other open-source and closed-source models with both few-shot and CoT prompting. The results and some of the generated cases are shown in Appendix A.2, demonstrating the necessity of fine-tuning for solving such puzzle problems.
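For illustration, a few-shot CoT prompt of the kind used to probe the base models might look like the sketch below; the paper's actual prompt wording and exemplars are not reproduced here, so both the instruction text and the worked example are assumptions.

```python
# Hypothetical few-shot CoT prompt template; not taken from the paper.
FEW_SHOT_COT_PROMPT = """\
Use each number, and each intermediate result, exactly once with + - * / to reach the target.
Numbers: 3 5 7  Target: 22
Let's think step by step.
3 * 5 = 15
15 + 7 = 22
Numbers: {numbers}  Target: {target}
Let's think step by step.
"""

prompt = FEW_SHOT_COT_PROMPT.format(numbers="2 4 6 8", target=10)
```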

4.3 Case Studies

We further examine the solutions produced by the models trained on 1M / 10M / 100M samples for several challenging queries from the form OOD test dataset. As shown in Figure 4 in Appendix A.3, the model trained on 1M samples is still limited to a fixed number of reasoning steps, whereas the models trained on 10M / 100M samples exhibit a higher-level understanding of the problem and perform an adequate number of reasoning steps. However, compared to the model trained on 100M samples, the model trained on 10M samples may still make computational or logical errors in the final step of reasoning.

Figure 3: The training loss and zero-shot pass@1 on the ID dataset for different training set sizes (1M / 10M / 100M samples).

Table 2: Zero-shot pass@1 of the model fine-tuned with different training set sizes (1M / 10M / 100M samples) on the ID, numerical OOD, and form OOD test datasets. The best results are highlighted.


:::info This paper is available on arxiv under the CC BY-NC-SA 4.0 Deed (Attribution-NonCommercial-ShareAlike 4.0 International) license.

:::

