This article introduces a novel arithmetical puzzle dataset designed to test and enhance AI reasoning capabilities. The puzzles involve manipulating integers through arithmetic operations to reach a target, with each number used exactly once. A data synthesis pipeline generates large-scale datasets, with controlled parameters for training, in-distribution testing, and out-of-distribution evaluation. Using the LLaMA architecture with LoRA fine-tuning, the study achieves efficient parameter reduction while benchmarking AI’s ability to generalize across numerical scales and abstract puzzle forms.

A Framework for Synthesizing Arithmetical Puzzle Datasets for Large Language Models

2025/08/24 00:35

:::info Authors:

(1) Haolong Li, Tongji University and work done during internship at ByteDance (furlongli322@gmail.com);

(2) Yu Ma, Seed Foundation, ByteDance (mayu.1231@bytedance.com);

(3) Yinqi Zhang, East China Normal University and work done during internship at ByteDance (zhang.inch@gmail.com);

(4) Chen Ye (Corresponding Author), ESSC Lab, Tongji University (yechen@tongji.edu.cn);

(5) Jie Chen, Seed Foundation, ByteDance and a Project Leader (chenjiexjtu@gmail.com).

:::

Abstract and 1 Introduction

2 Problem Definition

2.1 Arithmetical Puzzle Problem

2.2 Data Synthesizing

2.3 Dataset

3 Model

4 Experiments

4.1 Evaluation

4.2 Results

4.3 Case Studies

5 Conclusion and Acknowledgements

6 Limitations

7 Ethics Statement and References

A Appendix

A.1 Hyperparameter Settings

A.2 Evaluation of the Base Model

A.3 Case Study

A.4 Visualization of the Proposed Puzzle

2.1 Arithmetical Puzzle Problem

An arithmetical puzzle problem is a mathematical puzzle that involves arithmetic operations and requires logical reasoning and numerical manipulation to derive a solution. The 24 Puzzle and the Arithmetic Grid Puzzle are well-known examples.

In this paper, we propose a challenging arithmetical puzzle. Its objective is to combine a given set of integers through a sequence of arithmetic operations so as to reach a predetermined target integer. Each given integer must be used exactly once. For example, for the integers 3, 6, 7, 51, 58 and the target integer 4, one possible solution is: 58 − 51 = 7, 6 − 7 = −1, 3 × (−1) = −3, −3 + 7 = 4, as shown in Figure 5 in Appendix A.4.
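The exactly-once rule can be checked mechanically. The following is a minimal sketch, not the authors' code: the function name `check_solution` and the step format (a list of `(a, op, b)` triples) are assumptions made for illustration, and division is assumed to be Python floor division. The checker keeps a multiset of available values, consumes both operands of each step, and returns the step's result to the multiset.

```python
from collections import Counter

def check_solution(numbers, target, steps):
    """Return True if `steps` uses every number exactly once and ends at `target`.

    `steps` is a list of (a, op, b) triples. This format is an assumption
    for illustration, not the paper's prompt or response format.
    """
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a // b,  # assumed: Python floor division
    }
    pool = Counter(numbers)  # multiset of values still available
    for a, op, b in steps:
        if op not in ops or (op == "/" and b == 0):
            return False
        # Both operands must be available; consume them.
        need = Counter([a, b])
        if any(pool[v] < c for v, c in need.items()):
            return False
        pool -= need
        pool[ops[op](a, b)] += 1  # the intermediate result becomes available
    # Exactly one value must remain, and it must be the target.
    return +pool == Counter([target])

# The worked example from the text.
example_steps = [(58, "-", 51), (6, "-", 7), (3, "*", -1), (-3, "+", 7)]
is_valid = check_solution([3, 6, 7, 51, 58], 4, example_steps)
```

Here `is_valid` is `True` for the example above, while a sequence that leaves any given integer unused is rejected because more than one value remains in the pool.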

2.2 Data Synthesizing

Given the arithmetical puzzle described in Section 2.1, we create a data synthesis pipeline to efficiently generate the proposed dataset.

Denote the set of candidate integers as X = {X1, X2, ..., XN} and the target number as T, where N is the total number of candidate integers in a puzzle sample. Each candidate integer Xi is independently sampled from a uniform distribution Xi ∼ U(1, V), where V is the upper bound of the sampled integers. To avoid overlapping data, we strictly ensure that the candidate integers of each puzzle are a set of distinct numbers. The arithmetic operators involved in this problem are ops = {+, −, ×, ÷}, and all operations are limited to integer operations. For example, a division step is evaluated as integer division, so 14 ÷ 3 = 4. The detailed steps for synthesizing data for this puzzle are described in Algorithm 1.
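Algorithm 1 is not reproduced in this excerpt, so the sketch below shows only one plausible way to realize the description above, under stated assumptions. The names `sample_candidates` and `synthesize_puzzle` are hypothetical, the target is assumed to be whatever value a random sequence of operations ends on, and integer division is assumed to be Python floor division (the text's example, 14 ÷ 3 = 4, only covers positive operands).

```python
import random

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b,  # assumed floor division; 14 // 3 == 4
}

def sample_candidates(n, v, rng):
    """Draw n distinct integers uniformly from 1..v (inclusive)."""
    return rng.sample(range(1, v + 1), n)

def synthesize_puzzle(n, v, rng):
    """Build one puzzle by combining the candidates at random.

    Returns (candidates, target, steps). The target is the value the random
    sequence of operations ends on, so a solution exists by construction.
    This is an illustrative procedure, not the paper's Algorithm 1.
    """
    candidates = sample_candidates(n, v, rng)
    pool = list(candidates)
    steps = []
    while len(pool) > 1:
        # Remove two operands from the pool at random positions.
        a = pool.pop(rng.randrange(len(pool)))
        b = pool.pop(rng.randrange(len(pool)))
        # Division is only offered when the divisor is non-zero.
        allowed = [op for op in OPS if not (op == "/" and b == 0)]
        op = rng.choice(allowed)
        result = OPS[op](a, b)
        steps.append((a, op, b, result))
        pool.append(result)
    return candidates, pool[0], steps

candidates, target, steps = synthesize_puzzle(5, 60, random.Random(0))
```

Each puzzle with N candidates yields N − 1 steps, because every step replaces two values in the pool with one.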

To construct the SFT dataset, the prompt is deliberately designed to exclude any natural language cues and to use purely symbolic language instead. See Table 1 for an example of the constructed prompt and response.

2.3 Dataset

We split the dataset into a training set, an in-distribution test set, and out-of-distribution test sets by controlling the total number of candidate integers N and the upper bound of the sampled integers V. We set V = 60 for the training dataset and sample the candidate integers with N = 5, 6, 7. Three training datasets of increasing size (1 million, 10 million, and 100 million) are generated. Another 7500 samples (2500 samples for each N) are generated under the same setting as the in-distribution test dataset. Figure 1 shows the distribution of N and X in the three training datasets, and Figure 2 shows the corresponding distribution of the tokenized prompt and response lengths.

To further evaluate the model’s ability to extrapolate, we also design two out-of-distribution (OOD) benchmarks:

Numerical OOD test datasets. The upper bound of the sampled integers V is raised to 100 and to 1000 to test the model’s ability to generalize to unseen larger numbers. Specifically, 6000 samples are generated for each value of V, with 2000 samples for each N. An additional filtering pipeline ensures that every sample contains at least one integer Xi that satisfies 60 < Xi < 100 for the dataset with V = 100, and 100 < Xi < 1000 for the dataset with V = 1000.

Form OOD test dataset. In mathematics, an abstract form can often be extended, for example from a linear equation in two variables to one in three. For the proposed arithmetic puzzle, the abstract form can be extrapolated by changing the number of candidate integers N. When N increases, the search space leading to a feasible solution expands exponentially, which increases the demand for precise reasoning steps. From another perspective, a change in the total number of candidate integers requires the model to absorb and adapt to the puzzle’s abstract form. To test the model’s generalization capability from this point of view, we create another OOD benchmark with 5000 samples generated with N = 8. To control variables, all candidate integers in this dataset are sampled with the same upper bound V = 60 as the training dataset.
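The split settings in this section can be gathered into one configuration, together with the numerical OOD filter. This is a sketch for orientation only: the dictionary layout and the name `passes_numerical_ood_filter` are hypothetical, and using N = 5, 6, 7 for the numerical OOD sets is inferred from "2000 samples for each N" rather than stated outright.

```python
# Split settings as described in Section 2.3; the layout is illustrative.
SPLITS = {
    "train": {"V": 60, "N": (5, 6, 7)},
    "in_distribution": {"V": 60, "N": (5, 6, 7), "samples_per_N": 2500},
    # N = (5, 6, 7) is an inference for the two numerical OOD sets.
    "numerical_ood_100": {"V": 100, "N": (5, 6, 7), "samples_per_N": 2000, "lower": 60},
    "numerical_ood_1000": {"V": 1000, "N": (5, 6, 7), "samples_per_N": 2000, "lower": 100},
    "form_ood": {"V": 60, "N": (8,), "samples_per_N": 5000},
}

def passes_numerical_ood_filter(candidates, lower, upper):
    """Keep a sample only if at least one candidate satisfies lower < x < upper.

    The strict inequalities follow the wording of the text.
    """
    return any(lower < x < upper for x in candidates)

def total_samples(split):
    """Number of test samples implied by a split's per-N count."""
    return split["samples_per_N"] * len(split["N"])
```

For the V = 100 dataset, a sample such as `[3, 61, 20]` passes the filter, while `[3, 60, 20]` does not, since none of its values exceeds the training upper bound.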

3 Model

3.1 Framework

We adopt the LLaMA architecture (Touvron et al., 2023a) and employ low-rank adaptation (LoRA) tuning (Hu et al., 2021) based on the implementation in the TRL full stack library (von Werra et al., 2020). LoRA reduces the number of trainable parameters by 89%, from roughly 3B to roughly 0.3B.
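To see where a reduction of this kind comes from, recall that LoRA replaces the update of a d_out × d_in weight matrix with the product of a d_out × r matrix and an r × d_in matrix, which has r · (d_out + d_in) trainable parameters. The sketch below only illustrates that arithmetic. The layer size and rank are hypothetical values, not the paper's settings, so the resulting ratio is not expected to match the 89% reported above.

```python
def full_param_count(d_out, d_in):
    """Parameters in one dense d_out x d_in weight matrix."""
    return d_out * d_in

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix."""
    return rank * (d_out + d_in)

# Hypothetical layer size and rank, chosen only to illustrate the formula.
d, r = 4096, 64
full = full_param_count(d, d)     # 4096 * 4096 = 16_777_216
lora = lora_param_count(d, d, r)  # 64 * (4096 + 4096) = 524_288
reduction = 1 - lora / full       # fraction of trainable parameters removed
```

The overall reduction for a whole model depends on the rank, on which modules receive adapters, and on whether any other weights remain trainable.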

3.2 Implementation Details

We train our model by fine-tuning open-llama-3B. We systematically apply left-padding to the query text and right-padding to the answer text to control the overall context length. All experiments are conducted with 8× NVIDIA A100-SXM4-80GB GPUs. The specific hyperparameter settings are listed in Table 3 in Appendix A.1.
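The padding scheme can be sketched on plain lists of token ids. This is an illustration under assumptions, not the authors' implementation: the function names, the pad token id `0`, and the length limits are all hypothetical. Left-padding the query and right-padding the answer gives every example the same total length and places the query's last token directly before the answer's first token.

```python
def pad_left(ids, length, pad_id):
    """Left-pad token ids to exactly `length`."""
    if len(ids) > length:
        raise ValueError("sequence longer than the allotted length")
    return [pad_id] * (length - len(ids)) + list(ids)

def pad_right(ids, length, pad_id):
    """Right-pad token ids to exactly `length`."""
    if len(ids) > length:
        raise ValueError("sequence longer than the allotted length")
    return list(ids) + [pad_id] * (length - len(ids))

def build_example(query_ids, answer_ids, max_query_len, max_answer_len, pad_id=0):
    """Concatenate a left-padded query with a right-padded answer.

    pad_id=0 and the length limits are hypothetical values.
    """
    return (pad_left(query_ids, max_query_len, pad_id)
            + pad_right(answer_ids, max_answer_len, pad_id))

example = build_example([11, 12], [21], max_query_len=4, max_answer_len=3)
# example == [0, 0, 11, 12, 21, 0, 0]
```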

:::info This paper is available on arXiv under the CC BY-NC-SA 4.0 Deed (Attribution-Noncommercial-Sharelike 4.0 International) license.

:::
