This study explores how Mondrian Conformal Prediction (MCP) enhances traditional k-Nearest Neighbors (kNN) models in predicting hard drive failures. Using Baidu’s open-source dataset of over 23,000 Seagate HDDs, the experiment demonstrates that MCP increases accuracy for detecting failing disks, despite dataset imbalance. More importantly, it enables the selective scrubbing of only 22.7% of drives, drastically cutting energy use while maintaining reliability. The results highlight the value of confidence scoring in large-scale predictive maintenance systems.

Predicting Hard Drive Failures Using Mondrian Conformal Prediction

Abstract and 1. Introduction

  2. Motivation and design goals

  3. Related Work

  4. Conformal prediction

    4.1. Mondrian conformal prediction (MCP)

    4.2. Evaluation metrics

  5. Mondrian conformal prediction for Disk Scrubbing: our approach

    5.1. System and Storage statistics

    5.2. Which disk to scrub: Drive health predictor

    5.3. When to scrub: Workload predictor

  6. Experimental setting and 6.1. Open-source Baidu dataset

    6.2. Experimental results

  7. Discussion

    7.1. Optimal scheduling aspect

    7.2. Performance metrics and 7.3. Power saving from selective scrubbing

  8. Conclusion and References

6. Experimental setting

In this section, we describe the dataset used in our study, the experiments we conducted, and their results.

6.1. Open-source Baidu dataset

This dataset (DrTycoon, 2023) consists of samples collected from 23,395 Seagate ST31000524NS enterprise-level HDDs, with 13 features describing SMART attributes, as shown in Table 2. Each disk is labeled by its operational status as either functional or failed. Most disks, 22,962 in total, are classified as functional, while only 433 are marked as failed, which makes the dataset imbalanced. The SMART attribute values were recorded hourly for each disk, which yields 168 samples per week for an operational disk. The dataset contains 1,048,573 actual rows corresponding to the 23,395 disks (a sampling frequency of 1 hour over a period of 2 years); this row count reflects only the sample of operational disks provided in the dataset. The failed disks have varying numbers of samples, covering up to 20 days prior to failure.
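To make the degree of imbalance concrete, the following minimal sketch recomputes the class shares from the counts quoted above. The function and variable names are illustrative and not part of the original study.

```python
# Counts quoted in the dataset description.
FUNCTIONAL_DISKS = 22_962
FAILED_DISKS = 433
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7


def class_shares(functional, failed):
    """Total disk count, share of failed disks, and functional-to-failed ratio."""
    total = functional + failed
    return {
        "total": total,
        "failed_share": failed / total,
        "functional_per_failed": functional / failed,
    }


shares = class_shares(FUNCTIONAL_DISKS, FAILED_DISKS)
samples_per_week = HOURS_PER_DAY * DAYS_PER_WEEK  # hourly sampling

print(f"total disks: {shares['total']}")
print(f"failed share: {shares['failed_share']:.2%}")
print(f"functional disks per failed disk: {shares['functional_per_failed']:.1f}")
print(f"samples per week per operational disk: {samples_per_week}")
```

Failed disks make up under 2% of the units, which is why a per-class (Mondrian) treatment of the minority class matters in the experiments that follow.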

Table 2: Description of the features in the open-source Baidu dataset.

6.2. Experimental results

For our experiments, we used Python and the MAPIE[3] library (map) to implement Mondrian Conformal Prediction. The underlying algorithm was k-Nearest Neighbors (kNN).
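The idea of class-conditional calibration on top of kNN can be sketched without any library. The example below is not the MAPIE API and not the authors' code: it is a minimal standard-library illustration on synthetic two-feature data, and the data generator, the nonconformity score (one minus the share of the k nearest neighbours carrying the candidate label), and all function names are assumptions made for illustration. Label 0 denotes a failed disk and label 1 a functional one, as in the paper.

```python
import math
import random


def make_samples(rng, n, failed_share=0.1):
    """Synthetic two-feature samples; label 0 = failed, label 1 = functional."""
    samples = []
    for _ in range(n):
        label = 0 if rng.random() < failed_share else 1
        centre = 2.0 if label == 0 else 0.0
        samples.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))
    return samples


def nonconformity(train, point, label, k=5):
    """1 minus the share of the k nearest training neighbours that carry `label`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    return 1.0 - sum(1 for _, y in nearest if y == label) / k


def calibrate(train, calibration, k=5):
    """Collect calibration scores separately per class (the Mondrian categories)."""
    scores = {0: [], 1: []}
    for point, label in calibration:
        scores[label].append(nonconformity(train, point, label, k))
    return scores


def p_values(train, cal_scores, point, k=5):
    """Class-conditional p-value for each candidate label of one test point."""
    result = {}
    for label, scores in cal_scores.items():
        score = nonconformity(train, point, label, k)
        # Compare only against calibration samples of the same class.
        n_at_least = sum(1 for s in scores if s >= score)
        result[label] = (n_at_least + 1) / (len(scores) + 1)
    return result


rng = random.Random(0)
train = make_samples(rng, 400)
calibration = make_samples(rng, 400)
test = make_samples(rng, 200)

cal_scores = calibrate(train, calibration)
epsilon = 0.1  # significance level; the confidence level is 1 - epsilon
results = [(label, p_values(train, cal_scores, point)) for point, label in test]

for cls, name in ((0, "failed"), (1, "functional")):
    in_class = [p for label, p in results if label == cls]
    missed = sum(1 for p in in_class if p[cls] <= epsilon)
    print(f"{name}: {len(in_class)} test samples, {missed} true labels excluded")
```

Because each label is calibrated only against calibration samples of its own class, the minority (failed) class gets its own error control instead of being dominated by the functional class.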

The main goal of the experimental evaluation is to show the significant reduction in the number of disk drives to be scrubbed that the drive health predictor engine achieves by exploiting the Mondrian conformal predictor.

Table 3 compares the confusion matrices for the disk drive classification problem obtained with the underlying kNN algorithm alone and with Mondrian Conformal Prediction added, where label "0" indicates a disk failure and label "1" indicates a functional disk. With MCP, the number of disks correctly classified as failing increases from 51,314 to 51,669, a difference of 355. This shows that MCP helps identify more disks of the minority class. The drawback is that the number of disks correctly classified as healthy decreases from 296,689 to 268,616, a difference of 28,073.
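The trade-off can be put in relative terms with a short calculation on the counts quoted above. This is only a sketch over the reported numbers; the dictionary and function names are illustrative.

```python
# Correctly classified counts quoted in the text, before and after adding MCP.
KNN_CORRECT = {"failing": 51_314, "healthy": 296_689}
MCP_CORRECT = {"failing": 51_669, "healthy": 268_616}


def change(before, after):
    """Absolute and relative change in correctly classified samples per class."""
    return {
        label: {
            "absolute": after[label] - before[label],
            "relative": (after[label] - before[label]) / before[label],
        }
        for label in before
    }


delta = change(KNN_CORRECT, MCP_CORRECT)
for label, values in delta.items():
    print(f"{label}: {values['absolute']:+d} ({values['relative']:+.2%})")
```

In relative terms, the gain on the failing class is below 1%, while the loss on the healthy class is a little under 10%.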

This issue can be addressed by considering the confidence scores and their respective health status, as shown in Figure 5. Out of a total of 349,525 disks, 126,224 of those labeled as healthy (Figure 5, left) have a health score greater than 99.95%. When considering the relative health score, we categorize the 79,396 disk drives with a health score below 99.9% as less healthy. Consequently, as shown in Table 4, we select only these 79,396 disk drives for scrubbing and skip the remaining 270,129. This reduces the share of disks to be scrubbed to only 22.7%, which lowers power and energy consumption.
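The selection rule and the resulting share can be sketched as follows. The drive identifiers and their scores are hypothetical and used only for illustration; the totals are the counts reported in the text, and the 99.9% threshold is expressed as 0.999.

```python
def select_for_scrubbing(health_scores, threshold=0.999):
    """Return IDs of drives whose health score falls below the threshold."""
    return [drive for drive, score in health_scores.items() if score < threshold]


def scrub_summary(total, selected):
    """Number of drives skipped and the share of drives that get scrubbed."""
    return total - selected, selected / total


# Hypothetical health scores for four drives, for illustration only.
example_scores = {
    "disk-a": 0.9997,
    "disk-b": 0.9985,
    "disk-c": 0.9999,
    "disk-d": 0.9992,
}
to_scrub = select_for_scrubbing(example_scores)

# Counts reported in the text.
skipped, share = scrub_summary(total=349_525, selected=79_396)

print(f"drives selected in the toy example: {to_scrub}")
print(f"skipped: {skipped}, scrubbed share: {share:.1%}")
```

With the reported counts, the sketch reproduces the 270,129 skipped drives and the 22.7% scrubbed share stated above.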

Table 3: Comparison of confusion matrix results for disk drive classification using kNN and MCP.

Table 4: The number of relatively healthy drives based on the health score intervals.

:::info This paper is available on arXiv under the CC BY-NC-ND 4.0 Deed (Attribution-NonCommercial-NoDerivs 4.0 International) license.

:::

[3] https://github.com/adamzenith/MAPIE/tree/Mondrian


:::info Authors:

(1) Rahul Vishwakarma, California State University Long Beach, 1250 Bellflower Blvd, Long Beach, CA 90840, United States (rahuldeo.vishwakarma01@student.csullb.edu);

(2) Jinha Hwang, California State University Long Beach, 1250 Bellflower Blvd, Long Beach, CA 90840, United States (jinha.hwang01@student.csulb.edu);

(3) Soundouss Messoudi, HEUDIASYC - UMR CNRS 7253, Université de Technologie de Compiègne, 57 avenue de Landshut, 60203 Compiègne Cedex - France (soundouss.messoudi@hds.utc.fr);

(4) Ava Hedayatipour, California State University Long Beach, 1250 Bellflower Blvd, Long Beach, CA 90840, United States (ava.hedayatipour@csulb.edu).

:::
