RECKONING demonstrates superior generalization capacity to longer reasoning chains unseen during training

Generalization and Robustness: RECKONING Excels on Longer Reasoning Chains Unseen During Training

Abstract and 1. Introduction

  2. Background

  3. Method

  4. Experiments

    4.1 Multi-hop Reasoning Performance

    4.2 Reasoning with Distractors

    4.3 Generalization to Real-World Knowledge

    4.4 Run-time Analysis

    4.5 Memorizing Knowledge

  5. Related Work

  6. Conclusion, Acknowledgements, and References

A. Dataset

B. In-context Reasoning with Distractors

C. Implementation Details

D. Adaptive Learning Rate

E. Experiments with Large Language Models

4.1 Multi-hop Reasoning Performance

Main Results We first evaluate whether RECKONING learns to perform reasoning in the base setting. A model is given a set of supporting facts (without distractors) and a question (or hypothesis) as input and begins by performing a few CLM learning steps on the facts. Then, the updated model reads only the question and generates an answer. To answer correctly, the model must reason over both facts and the question, meaning it must encode the facts during the inner loop such that multi-hop reasoning can be performed over them later.
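The inference procedure described above (a few CLM gradient steps on the facts, then answering from the question alone) can be illustrated with a minimal sketch. This is not the authors' implementation: the bigram logit table stands in for a full language model, and the names `clm_loss_and_grad`, `inner_loop`, `answer`, the example facts, and the step count and learning rate are all hypothetical choices made for illustration. The sketch shows only the knowledge-encoding step; it does not reproduce the outer-loop training or multi-hop reasoning.

```python
import math
from copy import deepcopy


def clm_loss_and_grad(logits, sequences, vocab):
    """Average next-token cross-entropy of a bigram logit table, plus its gradient."""
    grad = {p: {n: 0.0 for n in vocab} for p in vocab}
    loss, count = 0.0, 0
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            row = logits[prev]
            z = sum(math.exp(v) for v in row.values())
            for tok in vocab:
                prob = math.exp(row[tok]) / z
                # Softmax cross-entropy gradient: probability minus one-hot target.
                grad[prev][tok] += prob - (1.0 if tok == nxt else 0.0)
            loss -= math.log(math.exp(row[nxt]) / z)
            count += 1
    for p in vocab:
        for n in vocab:
            grad[p][n] /= count
    return loss / count, grad


def inner_loop(logits, facts, vocab, steps=4, lr=1.0):
    """Return a copy of the parameters after `steps` gradient steps on the facts."""
    updated = deepcopy(logits)  # the original parameters stay untouched
    for _ in range(steps):
        _, grad = clm_loss_and_grad(updated, facts, vocab)
        for p in vocab:
            for n in vocab:
                updated[p][n] -= lr * grad[p][n]
    return updated


def answer(logits, question):
    """Greedy next-token prediction conditioned on the last question token."""
    row = logits[question[-1]]
    return max(row, key=row.get)


# Hypothetical facts, tokenized as (subject, relation, object) triples.
facts = [["alice", "mother_of", "bob"], ["bob", "father_of", "carol"]]
vocab = sorted({tok for fact in facts for tok in fact})
init = {p: {n: 0.0 for n in vocab} for p in vocab}

updated = inner_loop(init, facts, vocab, steps=4, lr=1.0)
# The updated model sees only the question, never the facts themselves.
print(answer(updated, ["alice", "mother_of"]))  # prints "bob"
```

The facts never appear in the input at answer time; whatever the model knows about them is stored in the updated parameters, which is the property the base setting tests.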

Table 1: Label accuracy of RECKONING on ProofWriter and CLUTRR-SG, compared to FT-ICR baselines where the supporting facts are given as part of the input. MT marks models trained with the multi-task objective, which optimizes both question-answering and knowledge memorization.

We train our models and the fine-tuned ICR (FT-ICR) baselines with both the single-task ($\mathcal{L}_{\text{CE}}$) and multi-task ($\mathcal{L}_{\text{CE}} + \mathcal{L}_{\text{CLM}}$) objectives. For multi-task (MT) training, the model learns in the outer loop both to answer the question and to generate the relevant knowledge. Table 1 shows the evaluation results on question answering (or hypothesis classification). For all hop numbers in ProofWriter and CLUTRR-SG, multi-task RECKONING outperforms the best result of all baselines (consistently obtained by multi-task FT-ICR) by an average of 1%. We conclude that RECKONING can effectively solve reasoning problems through its updated parametric knowledge and does so better than existing baselines. The multi-task objective is crucial for this success: RECKONING's performance is consistently higher (by an average of 2.8% over the two datasets and their hop counts) with the multi-task objective than with the single-task (ST) objective, and single-task RECKONING underperforms both FT-ICR baselines. The multi-task objective also improves FT-ICR consistently (by an average of 1.8%), though not enough to beat multi-task RECKONING. In all further experiments, we consider only RECKONING and FT-ICR trained with the multi-task objective.
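Written out, the two outer-loop objectives named in this paragraph are as follows. The symbols $\mathcal{L}_{\text{ST}}$ and $\mathcal{L}_{\text{MT}}$ are labels introduced here for convenience; the text specifies only that the multi-task loss is the sum of the question-answering cross-entropy term and the CLM term, with no weighting mentioned.

```latex
\begin{align}
\mathcal{L}_{\text{ST}} &= \mathcal{L}_{\text{CE}} \\
\mathcal{L}_{\text{MT}} &= \mathcal{L}_{\text{CE}} + \mathcal{L}_{\text{CLM}}
\end{align}
```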

Generalizing to Longer Reasoning Chains Our first experiments assume that the number of reasoning hops in the questions is aligned between the training and test sets. However, we may not be able to train on all n-hop reasoning questions we encounter in the wild, and we rarely know the number of reasoning hops in a question a priori. Consequently, we also measure the generalization capacity of our model to questions with hop numbers unseen during training. We compile interpolation (fewer hops than the train set) and extrapolation (more hops than the train set) test sets from the CLUTRR-SG dataset. Again, we train models individually on 2-hop, 4-hop, and 6-hop examples and evaluate these three sets of models on the test sets, which contain 2-10-hop reasoning questions. Figure 3 shows that both RECKONING models and ICR baselines retain high performance on the interpolation test sets but exhibit decreasing performance as the number of hops increases. Importantly, though, RECKONING outperforms FT-ICR on all test sets regardless of the number of training hops, with the highest difference being more than 10% in every training setting (15%, 30%, 10%, respectively). These performance gains when testing on extrapolation data suggest that training with RECKONING generalizes better than in-context reasoning (ICR) to examples with high out-of-distribution (OOD) hop counts.
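As a small illustration of the evaluation setup, the sketch below groups the 2-10-hop test sets relative to each training hop count, following the definitions in the text. The function name `split_by_hops` and the `in_distribution` bucket (test hops equal to the training hops) are assumptions added here; the text names only the interpolation and extrapolation sets.

```python
def split_by_hops(train_hops, test_hops):
    """Group test hop counts relative to the hop count seen during training."""
    groups = {"interpolation": [], "in_distribution": [], "extrapolation": []}
    for hops in sorted(test_hops):
        if hops < train_hops:
            groups["interpolation"].append(hops)    # fewer hops than the train set
        elif hops == train_hops:
            groups["in_distribution"].append(hops)  # same hops as the train set
        else:
            groups["extrapolation"].append(hops)    # more hops than the train set
    return groups


# Models are trained on 2-, 4-, and 6-hop data and tested on 2-10 hops.
for train_hops in (2, 4, 6):
    print(train_hops, split_by_hops(train_hops, range(2, 11)))
```

Under these definitions, a model trained on 2-hop data has no interpolation set at all, so every test set above 2 hops measures extrapolation.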

Figure 3: System generalization evaluation on CLUTRR-SG. From left to right, the models are trained on 2-hop, 4-hop, and 6-hop CLUTRR-SG data portions. We evaluate the model on 2-10 hop test sets. The higher the hops, the more facts a question has, and the more difficult that question is.

Figure 5: Robustness under distractors for ProofWriter. Each of the three plots corresponds to training and testing on a subset of questions in ProofWriter with a different number of hops (2, 3, and 5 hops). Each bar corresponds to the number of distractors in the knowledge sets for those questions.

Does RECKONING's performance depend on the number of inner loop gradient steps? In RECKONING, the model performs multi-hop reasoning over facts by encoding them with multiple gradient steps in the inner loop optimization (§3). This naturally raises the question of whether the number of reasoning hops correlates with the number of gradient steps needed to reliably encode the knowledge (i.e., whether problems with more reasoning hops require more gradient steps in the inner loop to encode the facts). In Figure 4, we show for CLUTRR-SG that as the number of inner loop steps increases, the label accuracy of the outer-loop task also increases. Furthermore, when considering the performance gains for reasoning with 6 inner loop steps (i.e., knowledge encoding) as opposed to one, we observe that this gap is much more pronounced for 4-hop (42.3%) and 6-hop (34.7%) reasoning than it is for 2-hop reasoning (5.9%). These results show that problems requiring more hops of reasoning also greatly benefit from more steps of inner loop knowledge encoding.
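To put the reported gaps on a common scale, the short calculation below expresses each gain from 6 inner loop steps versus one as a multiple of the 2-hop gain. It uses only the three figures quoted in the paragraph; the helper name `relative_gain` is introduced here for illustration.

```python
# Accuracy gains (in %) from 6 inner-loop steps versus 1, as quoted in the text.
gain_by_hops = {2: 5.9, 4: 42.3, 6: 34.7}


def relative_gain(gains, reference_hops=2):
    """Express each gain as a multiple of the gain at `reference_hops`."""
    reference = gains[reference_hops]
    return {hops: gain / reference for hops, gain in gains.items()}


ratios = relative_gain(gain_by_hops)
for hops in sorted(ratios):
    print(f"{hops}-hop: +{gain_by_hops[hops]:.1f}% ({ratios[hops]:.1f}x the 2-hop gain)")
```

By this measure the 4-hop and 6-hop gains are roughly seven and six times the 2-hop gain, respectively.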

Figure 4: Multi-hop reasoning performance as a function of the number of inner loop steps (x-axis), with each line focusing (by training and testing) on CLUTRR-SG with a different number of hops.


:::info Authors:

(1) Zeming Chen, EPFL (zeming.chen@epfl.ch);

(2) Gail Weiss, EPFL (antoine.bosselut@epfl.ch);

(3) Eric Mitchell, Stanford University (eric.mitchell@cs.stanford.edu);

(4) Asli Celikyilmaz, Meta AI Research (aslic@meta.com);

(5) Antoine Bosselut, EPFL (antoine.bosselut@epfl.ch).

:::


:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::
