This ablation study on AdaMix highlights the factors driving its efficiency in parameter-efficient fine-tuning. Results show that adaptation merging consistently outperforms random or fixed routing, while consistency regularization proves essential to maintaining strong performance. Module sharing is particularly effective in low-resource tasks, boosting convergence speed and lowering training loss compared to models without sharing. Experiments with adaptation module count and bottleneck dimension reveal diminishing returns, stressing the importance of balance over brute force scaling. Overall, AdaMix demonstrates how thoughtful design choices yield superior results to full model tuning.

The Role of Consistency and Sharing in Efficient Fine-Tuning

Author: Hackernoon
2025/10/01 21:00
Abstract and 1. Introduction

  2. Background

    2.1 Mixture-of-Experts

    2.2 Adapters

  3. Mixture-of-Adaptations

    3.1 Routing Policy

    3.2 Consistency regularization

    3.3 Adaptation module merging and 3.4 Adaptation module sharing

    3.5 Connection to Bayesian Neural Networks and Model Ensembling

  4. Experiments

    4.1 Experimental Setup

    4.2 Key Results

    4.3 Ablation Study

  5. Related Work

  6. Conclusions

  7. Limitations

  8. Acknowledgment and References

Appendix

A. Few-shot NLU Datasets B. Ablation Study C. Detailed Results on NLU Tasks D. Hyper-parameter

4.3 Ablation Study

We perform all the ablation analysis on AdaMix with adapters for parameter-efficient fine-tuning.

Analysis of adaptation merging. In this ablation study, we do not merge adaptation modules and instead consider two routing strategies at inference time: (a) random routing, where each input is sent to a randomly chosen adaptation module, and (b) fixed routing, where all input is sent to the first adaptation module in AdaMix. Table 7 shows that AdaMix with adaptation merging outperforms every variant without the merging mechanism. Notably, all of the AdaMix variants outperform full model tuning.

Table 3: Results on E2E NLG Challenge with GPT-2 medium backbone. Best result on each task is in bold. We report AdaMix results with both adapters and LoRA as underlying PEFT method. AdaMix outperforms all competing methods as well as fully fine-tuned large model with only 0.1% tunable parameters. † denotes results reported from (Hu et al., 2021) and repr. denotes reproduced results. #Param. denotes the number of tunable adaptation parameters used during inference. Results on DART and WebNLG are presented in Tables 4 and 5 in the Appendix.

Table 4: Results on DART with GPT-2 backbone encoder. Best result on each task is in bold. We report AdaMix results with both adapters and LoRA as underlying PEFT method. AdaMix outperforms all competing methods as well as fully fine-tuned large model with only 0.1% tunable parameters. † denotes results reported from (Hu et al., 2021) and repr. denotes reproduced results. #Param. denotes the number of tunable adaptation parameters used during inference.

Moreover, Figure 5 shows that the merging mechanism performs consistently better than the average performance of random routing and comparably to the best performance of random routing.
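The three inference strategies compared above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: it assumes that merging means element-wise averaging of the module weights, and it stands in a single dot product for each adaptation module. The names `merge_modules`, `apply_module`, and `predict` are hypothetical.

```python
import random

def merge_modules(modules):
    """Average several adaptation modules element-wise (assumed merging rule)."""
    n = len(modules)
    return [sum(ws) / n for ws in zip(*modules)]

def apply_module(weights, x):
    """Toy stand-in for an adaptation module: a dot product with the input."""
    return sum(w * xi for w, xi in zip(weights, x))

def predict(modules, x, strategy, rng=None):
    """Run inference with one of the three strategies from the ablation."""
    if strategy == "merge":    # collapse all modules into one, then apply it
        return apply_module(merge_modules(modules), x)
    if strategy == "fixed":    # always route to the first module
        return apply_module(modules[0], x)
    if strategy == "random":   # route to a randomly chosen module
        rng = rng or random
        return apply_module(rng.choice(modules), x)
    raise ValueError(f"unknown strategy: {strategy}")

modules = [[1.0, 2.0], [3.0, 4.0]]   # two toy modules with two weights each
x = [1.0, 1.0]
for strategy in ("merge", "fixed", "random"):
    print(strategy, predict(modules, x, strategy))
```

Because the toy module is linear, the merged prediction here equals the mean of the per-module predictions. Real adapters contain a nonlinearity, so weight averaging and output averaging are not equivalent in general. Merging also leaves a single module at inference time, whereas random routing has to keep every module available.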

Table 5: Results on WebNLG with GPT-2 medium backbone. The results are based on all categories in the test set of WebNLG. Best result on each task is in bold. We report AdaMix results with both adapters and LoRA as underlying PEFT method. AdaMix outperforms all competing methods as well as fully fine-tuned large model with only 0.1% tunable parameters. † denotes results reported from (Hu et al., 2021) and repr. denotes reproduced results. #Param. denotes the number of tunable adaptation parameters used during inference.

Analysis of consistency regularization. We drop consistency regularization during training for ablation and observe significant performance degradation (Table 8).
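To make the ablated term concrete, the sketch below shows one common form of consistency regularization: a symmetric KL divergence between the predictions of two forward passes that were routed through different adaptation modules. The exact form and weighting used in AdaMix are defined in Section 3.2, so treat the 0.5 factor and the function names as assumptions.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with non-zero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(logits_a, logits_b):
    """Symmetric KL between two stochastic passes (assumed form of the regularizer)."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p))

# Two passes over the same input that took different routes through the modules.
print(consistency_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8]))
```

In training, a term like this would be added to the task loss so that different routing choices are pushed toward the same prediction. Dropping it corresponds to the ablation described above.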

Analysis of adaptation module sharing. We remove adaptation module sharing in AdaMix for ablation and keep four separate copies each of the project-down and project-up FFN layers. From Table 8, we observe that the performance gap between AdaMix and AdaMix w/o sharing widens as the dataset size decreases, demonstrating the importance of parameter sharing for low-resource tasks (e.g., RTE, MRPC). This is further demonstrated in Figure 7 in the Appendix, which shows faster convergence and lower training loss for AdaMix with sharing than without, given the same number of training steps. We explore which adaptation module to share (project-up vs. project-down) in Table 11 in the Appendix, which shows similar results.

Table 6: Average performance and standard deviation of several parameter-efficient fine-tuning strategies based on RoBERTa-large with |K| = 30 training labels. The best performance is shown in bold. Prompt-tuning, Head-only and BitFit tune 1M model parameters during inference. Houlsby Adapter, LiST Adapter and AdaMix Adapter tune 14M model parameters. * denotes that the results are taken from (Wang et al., 2021).

Table 7: AdaMix without adaptation merging and different routing and ensembling strategies. Average results are presented on GLUE development set with BERT-base encoder. Detailed task results in Table 14 of Appendix for BERT-base and RoBERTa-large encoders.

Figure 5: Violin plot of AdaMix-RandomRouting performance distribution with RoBERTa-large encoders. Red dot denotes the performance of AdaMix.

Table 8: Ablation study demonstrating the impact of consistency regularization and sharing in AdaMix.

Impact of the number of adaptation modules. In this study, we vary the number of adaptation modules in AdaMix as 2, 4 and 8 during training. Table 9 shows diminishing returns on aggregate task performance as the number of modules increases. Increasing the number of adaptation modules increases both sparsity and the number of tunable parameters, so low-resource tasks like RTE and SST-2 (with limited labeled data for fine-tuning) degrade in performance compared to high-resource tasks like MNLI and QNLI.
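A rough parameter count shows why sharing and the number of modules matter for low-resource tasks. The sketch below counts trainable parameters at a single adapter location, assuming each projection is a dense layer with a bias and that sharing keeps one copy of either the project-up or the project-down layer across all modules. The hidden size of 1024 matches RoBERTa-large; the bottleneck of 16 and the function name are illustrative assumptions, not values taken from the paper.

```python
def adapter_params(d_model, bottleneck, num_modules, share=None):
    """Trainable parameters at one adapter location during training.

    share: None (no sharing), "up" (one shared project-up layer),
           or "down" (one shared project-down layer).
    """
    down = d_model * bottleneck + bottleneck   # project-down weights + bias
    up = bottleneck * d_model + d_model        # project-up weights + bias
    n_down = 1 if share == "down" else num_modules
    n_up = 1 if share == "up" else num_modules
    return n_down * down + n_up * up

# Hypothetical setting: hidden size 1024, bottleneck 16.
for m in (2, 4, 8):
    print(m, adapter_params(1024, 16, m), adapter_params(1024, 16, m, share="up"))
```

Under these assumptions, training-time parameters grow linearly with the number of modules, and sharing one projection removes a large part of that growth. Each unshared module also sees only the inputs routed to it, so with little labeled data every copy is fit on fewer examples.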

Table 9: Varying the number of adaptation modules in AdaMix with RoBERTa-large encoder. * denotes the number of modules used in AdaMix with adapters.

Impact of adapter bottleneck dimension. Table 10 shows the impact of the adapter bottleneck dimension with different encoders in AdaMix. Model performance improves as a larger bottleneck dimension increases the number of trainable parameters, with diminishing returns after a certain point.
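For readers less familiar with adapters, the following sketch shows where the bottleneck dimension enters: the input is projected down to `bottleneck` units, passed through a nonlinearity, projected back up, and added to the input. It is a simplified illustration, with ReLU, no biases, and a zero-initialized project-up layer chosen as assumptions for brevity. It is not the configuration used in the paper.

```python
import random

def make_adapter(d_model, bottleneck, seed=0):
    """Create toy project-down and project-up weight matrices."""
    rng = random.Random(seed)
    down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(bottleneck)]             # bottleneck x d_model
    up = [[0.0] * bottleneck for _ in range(d_model)]  # d_model x bottleneck, zeros
    return down, up

def adapter_forward(x, down, up):
    """Project down, apply ReLU, project up, then add the residual."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in down]
    delta = [sum(w * hi for w, hi in zip(row, h)) for row in up]
    return [xi + di for xi, di in zip(x, delta)]

down, up = make_adapter(d_model=4, bottleneck=2)
x = [0.5, -1.0, 2.0, 0.0]
print(adapter_forward(x, down, up))  # equals x while the project-up layer is all zeros
```

The two matrices hold `2 * d_model * bottleneck` weights, so the trainable parameter count scales linearly with the bottleneck dimension. This is the quantity varied in the study above.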


:::info Authors:

(1) Yaqing Wang, Purdue University (wang5075@purdue.edu);

(2) Sahaj Agarwal, Microsoft (sahagar@microsoft.com);

(3) Subhabrata Mukherjee, Microsoft Research (submukhe@microsoft.com);

(4) Xiaodong Liu, Microsoft Research (xiaodl@microsoft.com);

(5) Jing Gao, Purdue University (jinggao@purdue.edu);

(6) Ahmed Hassan Awadallah, Microsoft Research (hassanam@microsoft.com);

(7) Jianfeng Gao, Microsoft Research (jfgao@microsoft.com).

:::

:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::


