Objective: Predict whether a hard drive is going to fail, based on several indicators of drive reliability, and detect categorical outliers in the data.
1. License: Free to use, with the following constraints.
A. We are encouraged, but not required, to cite Backblaze as the source.
B. We accept that we are solely responsible for how we use the data.
C. We do not sell this data to anyone; it is free.
2. Data Source: https://www.backblaze.com/hard-drive-test-data.html
3. Field Meaning:
A. SMART_1_Raw: Read Error Rate -> Rate of hardware read errors that occurred when reading data from the disk surface.
B. SMART_2_Raw: Reallocated Sectors Count -> The raw value normally represents a count of the bad sectors that have been found and remapped. Thus, the higher the value, the more sectors the drive has had to reallocate.
C. SMART_3_Raw: Power-On Hours -> The raw value is the total number of hours the drive has spent in the power-on state.
D. SMART_4_Raw: Temperature Celsius -> Current internal temperature in degrees Celsius.
E. SMART_5_Raw: Current Pending Sector Count -> Count of "unstable" sectors (sectors waiting to be remapped).
F. DiskFailure: Yes means the drive failed; No means the drive is in good condition.
G. Date: Date of the daily snapshot of the hard drive.
H. Model: Manufacturer-assigned model number of the hard drive.
I. SerialNumber: Manufacturer-assigned serial number of the hard drive.
J. CapacityBytes: Refers to hard drive capacity in bytes.
4. Parameter Selection:
A. Dashboard: Predict Categorical Fields
Settings:
1) Search Command: | inputlookup disk_failures.csv | eventstats max(SMART_1_Raw) as max1 min(SMART_1_Raw) as min1 | eventstats max(SMART_2_Raw) as max2 min(SMART_2_Raw) as min2 | eventstats max(SMART_3_Raw) as max3 min(SMART_3_Raw) as min3 | eventstats max(SMART_4_Raw) as max4 min(SMART_4_Raw) as min4 | eventstats max(SMART_5_Raw) as max5 min(SMART_5_Raw) as min5 | eval SMART_1_Transformed = (SMART_1_Raw - min1)/(max1-min1) | eval SMART_2_Transformed = (SMART_2_Raw - min2)/(max2-min2) | eval SMART_3_Transformed = (SMART_3_Raw - min3)/(max3-min3) | eval SMART_4_Transformed = (SMART_4_Raw - min4)/(max4-min4) | eval SMART_5_Transformed = (SMART_5_Raw - min5)/(max5-min5) | table Date Model CapacityBytes SerialNumber DiskFailure SMART_1_Raw SMART_1_Transformed SMART_2_Raw SMART_2_Transformed SMART_3_Raw SMART_3_Transformed SMART_4_Raw SMART_4_Transformed SMART_5_Raw SMART_5_Transformed
2) Field to predict: DiskFailure
3) Fields to use for predicting: Model, SMART_1_Transformed, SMART_2_Transformed, SMART_3_Transformed, SMART_4_Transformed, SMART_5_Transformed
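The Search Command above rescales each SMART_n_Raw field to the 0 to 1 range with (x - min)/(max - min). A minimal Python sketch of that same formula follows. The function name min_max_scale and the sample readings are made up for illustration, and the guard for a constant column is an added assumption, not part of the original search.

```python
from typing import List, Optional


def min_max_scale(values: List[float]) -> List[Optional[float]]:
    """Rescale values to [0, 1] as (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column has no range, so the formula would divide
        # by zero; None is returned for every row in that case.
        return [None for _ in values]
    return [(v - lo) / span for v in values]


# Made-up raw readings for one SMART attribute (not taken from the dataset).
smart_4_raw = [22.0, 25.0, 31.0, 40.0]
smart_4_transformed = min_max_scale(smart_4_raw)
print(smart_4_transformed)
```

The smallest reading maps to 0 and the largest to 1, so fields with very different raw ranges (hours versus sector counts) end up on a comparable scale.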
B. Dashboard: Detect Categorical Outliers
Settings:
1) Fields to analyze: Model, CapacityBytes, DiskFailure, SerialNumber
C. Dashboard: Cluster Numeric Fields
Settings:
0) Search: | inputlookup disk_failures.csv | eventstats max(SMART_1_Raw) as max1 min(SMART_1_Raw) as min1 | eventstats max(SMART_2_Raw) as max2 min(SMART_2_Raw) as min2 | eventstats max(SMART_3_Raw) as max3 min(SMART_3_Raw) as min3 | eventstats max(SMART_4_Raw) as max4 min(SMART_4_Raw) as min4 | eventstats max(SMART_5_Raw) as max5 min(SMART_5_Raw) as min5 | eval SMART_1_Transformed = (SMART_1_Raw - min1)/(max1-min1) | eval SMART_2_Transformed = (SMART_2_Raw - min2)/(max2-min2) | eval SMART_3_Transformed = (SMART_3_Raw - min3)/(max3-min3) | eval SMART_4_Transformed = (SMART_4_Raw - min4)/(max4-min4) | eval SMART_5_Transformed = (SMART_5_Raw - min5)/(max5-min5) | table Date Model CapacityBytes SerialNumber DiskFailure SMART_1_Raw SMART_1_Transformed SMART_2_Raw SMART_2_Transformed SMART_3_Raw SMART_3_Transformed SMART_4_Raw SMART_4_Transformed SMART_5_Raw SMART_5_Transformed
1) Model name: disk_failures
2) Fields to preprocess: SMART_1_Raw, SMART_1_Transformed, SMART_2_Raw, SMART_2_Transformed, SMART_3_Raw, SMART_3_Transformed, SMART_4_Raw, SMART_4_Transformed, SMART_5_Raw, SMART_5_Transformed
3) Apply StandardScaler
4) Apply PCA to reduce to 4 fields
5) Algorithm: Birch
6) Fields to use: PC_1, PC_2, PC_3, PC_4
7) K: 3
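
The Cluster Numeric Fields settings above (StandardScaler, PCA to 4 fields, Birch with K = 3) can be tried outside the dashboard. The sketch below assumes scikit-learn and NumPy as stand-ins for the dashboard's processing and uses synthetic data in place of disk_failures.csv; the variable names and the random values are illustrative only, and the dashboard's own defaults may differ.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the five SMART_n_Raw columns (not real Backblaze data).
raw = rng.normal(size=(300, 5)) * [100.0, 5.0, 2000.0, 4.0, 3.0]

# Min-max versions of the same columns, mirroring the SMART_n_Transformed fields.
transformed = (raw - raw.min(axis=0)) / (raw.max(axis=0) - raw.min(axis=0))

# Ten preprocessing inputs: raw and transformed columns side by side.
features = np.hstack([raw, transformed])

# Steps 3 and 4: standardize, then project onto 4 principal components.
preprocess = make_pipeline(StandardScaler(), PCA(n_components=4))
components = preprocess.fit_transform(features)

# Steps 5 to 7: Birch on the 4 components (PC_1..PC_4) with 3 clusters.
labels = Birch(n_clusters=3).fit_predict(components)

print(components.shape)
print(sorted(set(labels.tolist())))
```

One thing the sketch makes visible: min-max scaling is a linear rescaling, so after StandardScaler each Raw/Transformed pair becomes the same column. The ten preprocessing fields therefore carry only five distinct inputs before PCA reduces them to four.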