|
|
Objective: Predict whether a hard drive will fail, based on several indicators of drive reliability, and detect categorical outliers in the data.
|
|
|
|
|
|
1. License: Free to use, subject to the following constraints.
|
|
|
A. We are encouraged to cite Backblaze as the source (not a mandatory requirement).
|
|
|
B. We accept that we are solely responsible for how we use the data.
|
|
|
C. We do not sell this data to anyone; it is free.
|
|
|
|
|
|
2. Data Source: https://www.backblaze.com/hard-drive-test-data.html
|
|
|
|
|
|
3. Field Meaning:
|
|
|
A. SMART_1_Raw: Read Error Rate -> Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface.
|
|
|
B. SMART_2_Raw: Reallocated Sectors Count -> The raw value typically represents a count of the bad sectors that have been found and remapped. Thus, the higher the attribute value, the more sectors the drive has had to reallocate.
|
|
|
C. SMART_3_Raw: Power-On Hours -> The raw value is the total number of hours the drive has spent in the power-on state.
|
|
|
D. SMART_4_Raw: Temperature Celsius -> Current internal temperature, in degrees Celsius.
|
|
|
E. SMART_5_Raw: Current Pending Sector Count -> Shows the count of "unstable" sectors that are waiting to be remapped.
|
|
|
|
|
|
F. DiskFailure: ‘Yes’ means the hard drive failed; ‘No’ means the hard drive is in working condition.
|
|
|
G. Date: The day on which the snapshot of the hard drive was taken.
|
|
|
H. Model: Manufacturer-assigned model number of the hard drive.
|
|
|
I. SerialNumber: Manufacturer-assigned serial number of the hard drive.
|
|
|
J. CapacityBytes: Hard drive capacity in bytes.
|
|
|
|
|
|
4. Parameter Selection:
|
|
|
A. Dashboard: Predict Categorical Fields
|
|
|
Settings:
|
|
|
1) Search Command: | inputlookup disk_failures.csv | eventstats max(SMART_1_Raw) as max1 min(SMART_1_Raw) as min1 | eventstats max(SMART_2_Raw) as max2 min(SMART_2_Raw) as min2 | eventstats max(SMART_3_Raw) as max3 min(SMART_3_Raw) as min3 | eventstats max(SMART_4_Raw) as max4 min(SMART_4_Raw) as min4 | eventstats max(SMART_5_Raw) as max5 min(SMART_5_Raw) as min5 | eval SMART_1_Transformed = (SMART_1_Raw - min1)/(max1-min1) | eval SMART_2_Transformed = (SMART_2_Raw - min2)/(max2-min2) | eval SMART_3_Transformed = (SMART_3_Raw - min3)/(max3-min3) | eval SMART_4_Transformed = (SMART_4_Raw - min4)/(max4-min4) | eval SMART_5_Transformed = (SMART_5_Raw - min5)/(max5-min5) | table Date Model CapacityBytes SerialNumber DiskFailure SMART_1_Raw SMART_1_Transformed SMART_2_Raw SMART_2_Transformed SMART_3_Raw SMART_3_Transformed SMART_4_Raw SMART_4_Transformed SMART_5_Raw SMART_5_Transformed
|
|
|
2) Field to predict: DiskFailure
|
|
|
3) Fields to use for predicting: Model, SMART_1_Transformed, SMART_2_Transformed, SMART_3_Transformed, SMART_4_Transformed, SMART_5_Transformed
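
The eval steps in the search command apply min-max scaling, (x - min) / (max - min), to each SMART field so that every transformed field falls between 0 and 1. The following Python sketch reproduces that arithmetic outside of Splunk. The function name and the sample readings are hypothetical and are not taken from the dataset; the handling of a constant field (max equal to min) is an assumption of this sketch, since the formula itself is undefined in that case.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The formula divides by zero for a constant field.
        # Returning 0.0 here is an arbitrary choice made for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical SMART_3_Raw readings (power-on hours), for illustration only.
smart_3_raw = [1200, 4800, 8400, 12000]
smart_3_transformed = min_max_scale(smart_3_raw)
print(smart_3_transformed)
```

The smallest reading maps to 0, the largest to 1, and the rest are spaced proportionally in between, which keeps fields with very different ranges (hours versus sector counts) on a comparable scale.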
|
|
|
|
|
|
B. Dashboard: Detect Categorical Outliers
|
|
|
Settings:
|
|
|
1) Fields to analyze: Model, CapacityBytes, DiskFailure, SerialNumber
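
The scoring method used by this dashboard is not described here. As a rough intuition only, a categorical outlier can be thought of as a row whose value in some field is rare compared with the rest of the data. The sketch below is a simplified frequency-based stand-in, not Splunk's algorithm; the function name, the threshold, and the sample rows are all hypothetical.

```python
from collections import Counter


def rare_value_flags(rows, fields, threshold=0.2):
    """For each row, list the fields whose value occurs in fewer than
    `threshold` (as a fraction) of all rows."""
    n = len(rows)
    counts = {f: Counter(r[f] for r in rows) for f in fields}
    return [
        [f for f in fields if counts[f][r[f]] / n < threshold]
        for r in rows
    ]


# Hypothetical rows for illustration; not taken from disk_failures.csv.
common = {"Model": "ModelA", "CapacityBytes": 4000787030016, "DiskFailure": "No"}
unusual = {"Model": "ModelB", "CapacityBytes": 3000592982016, "DiskFailure": "Yes"}
rows = [dict(common) for _ in range(5)] + [unusual]

flags = rare_value_flags(rows, ["Model", "CapacityBytes", "DiskFailure"])
for row, rare_fields in zip(rows, flags):
    print(row["Model"], rare_fields)
```

In this toy data the single ModelB row is flagged on all three fields because each of its values appears in only one of six rows, while the five ModelA rows are not flagged.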
|
|
|
|
|
|
C. Dashboard: Cluster Numeric Fields
|
|
|
Settings:
|
|
|
0) Search: | inputlookup disk_failures.csv | eventstats max(SMART_1_Raw) as max1 min(SMART_1_Raw) as min1 | eventstats max(SMART_2_Raw) as max2 min(SMART_2_Raw) as min2 | eventstats max(SMART_3_Raw) as max3 min(SMART_3_Raw) as min3 | eventstats max(SMART_4_Raw) as max4 min(SMART_4_Raw) as min4 | eventstats max(SMART_5_Raw) as max5 min(SMART_5_Raw) as min5 | eval SMART_1_Transformed = (SMART_1_Raw - min1)/(max1-min1) | eval SMART_2_Transformed = (SMART_2_Raw - min2)/(max2-min2) | eval SMART_3_Transformed = (SMART_3_Raw - min3)/(max3-min3) | eval SMART_4_Transformed = (SMART_4_Raw - min4)/(max4-min4) | eval SMART_5_Transformed = (SMART_5_Raw - min5)/(max5-min5) | table Date Model CapacityBytes SerialNumber DiskFailure SMART_1_Raw SMART_1_Transformed SMART_2_Raw SMART_2_Transformed SMART_3_Raw SMART_3_Transformed SMART_4_Raw SMART_4_Transformed SMART_5_Raw SMART_5_Transformed
|
|
|
1) Model name: disk_failures
|
|
|
2) Fields to preprocess: SMART_1_Raw, SMART_1_Transformed, SMART_2_Raw, SMART_2_Transformed, SMART_3_Raw, SMART_3_Transformed, SMART_4_Raw, SMART_4_Transformed, SMART_5_Raw, SMART_5_Transformed
|
|
|
3) Apply StandardScaler
|
|
|
4) Apply PCA to reduce to 4 fields
|
|
|
5) Algorithm: Birch
|
|
|
6) Fields to use: PC_1, PC_2, PC_3, PC_4
|
|
|
7) K: 3
|
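
The preprocessing list above passes both the raw and the min-max transformed version of each SMART field to StandardScaler. The sketch below checks what standardization does to such a pair, assuming the scaler centers each field on its mean and divides by the population standard deviation. The function names and the sample temperatures are hypothetical.

```python
from statistics import fmean, pstdev


def min_max_scale(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical SMART_4_Raw readings (temperature), for illustration only.
raw = [31.0, 28.0, 35.0, 40.0, 26.0]

z_raw = standardize(raw)
z_transformed = standardize(min_max_scale(raw))

# Largest difference between the two standardized columns.
max_gap = max(abs(a - b) for a, b in zip(z_raw, z_transformed))
print(max_gap)
```

Because min-max scaling only shifts and rescales a field, the standardized raw and standardized transformed columns agree up to floating-point error. Under that assumption, each Raw/Transformed pair carries the same information into PCA, so passing one field from each pair may be sufficient.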