Objective: We want to predict whether a hard drive is going to fail, based on various indicators of drive reliability. We also want to detect categorical outliers in the data.

1. License: Free to use with the following constraints.
   A. We are encouraged to cite Backblaze as the source (not a mandatory requirement).
   B. We accept that we are solely responsible for how we use the data.
   C. We do not sell this data to anyone; it is free.

2. Data Source: https://www.backblaze.com/hard-drive-test-data.html

3. Field Meaning:
   A. SMART_1_Raw: Read Error Rate -> Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface.
   B. SMART_2_Raw: Reallocated Sectors Count -> The raw value typically represents a count of the bad sectors that have been found and remapped. Thus, the higher the value, the more sectors the drive has had to reallocate.
   C. SMART_3_Raw: Power-On Hours -> The raw value is the total number of hours the drive has spent in the power-on state.
   D. SMART_4_Raw: Temperature Celsius -> Current internal temperature.
   E. SMART_5_Raw: Current Pending Sector Count -> The count of "unstable" sectors.
   F. DiskFailure: "Yes" means the drive failed; "No" means the hard drive is in good condition.
   G. Date: The day on which the snapshot of the hard drive was taken.
   H. Model: Manufacturer-assigned model number of the hard drive.
   I. SerialNumber: Manufacturer-assigned serial number of the hard drive.
   J. CapacityBytes: Hard drive capacity in bytes.

4. Parameter Selection:
   A. Dashboard: Predict Categorical Fields
      Settings:
      1) Search Command:
           | inputlookup disk_failures.csv
           | eventstats max(SMART_1_Raw) as max1 min(SMART_1_Raw) as min1
           | eventstats max(SMART_2_Raw) as max2 min(SMART_2_Raw) as min2
           | eventstats max(SMART_3_Raw) as max3 min(SMART_3_Raw) as min3
           | eventstats max(SMART_4_Raw) as max4 min(SMART_4_Raw) as min4
           | eventstats max(SMART_5_Raw) as max5 min(SMART_5_Raw) as min5
           | eval SMART_1_Transformed = (SMART_1_Raw - min1)/(max1-min1)
           | eval SMART_2_Transformed = (SMART_2_Raw - min2)/(max2-min2)
           | eval SMART_3_Transformed = (SMART_3_Raw - min3)/(max3-min3)
           | eval SMART_4_Transformed = (SMART_4_Raw - min4)/(max4-min4)
           | eval SMART_5_Transformed = (SMART_5_Raw - min5)/(max5-min5)
           | table Date Model CapacityBytes SerialNumber DiskFailure
               SMART_1_Raw SMART_1_Transformed SMART_2_Raw SMART_2_Transformed
               SMART_3_Raw SMART_3_Transformed SMART_4_Raw SMART_4_Transformed
               SMART_5_Raw SMART_5_Transformed
      2) Field to predict: DiskFailure
      3) Fields to use for predicting: Model, SMART_1_Transformed, SMART_2_Transformed, SMART_3_Transformed, SMART_4_Transformed, SMART_5_Transformed

   B. Dashboard: Detect Categorical Outliers
      Settings:
      1) Fields to analyze: Model, CapacityBytes, DiskFailure, SerialNumber

   C. Dashboard: Cluster Numeric Fields
      Settings:
      0) Search:
           | inputlookup disk_failures.csv
           | eventstats max(SMART_1_Raw) as max1 min(SMART_1_Raw) as min1
           | eventstats max(SMART_2_Raw) as max2 min(SMART_2_Raw) as min2
           | eventstats max(SMART_3_Raw) as max3 min(SMART_3_Raw) as min3
           | eventstats max(SMART_4_Raw) as max4 min(SMART_4_Raw) as min4
           | eventstats max(SMART_5_Raw) as max5 min(SMART_5_Raw) as min5
           | eval SMART_1_Transformed = (SMART_1_Raw - min1)/(max1-min1)
           | eval SMART_2_Transformed = (SMART_2_Raw - min2)/(max2-min2)
           | eval SMART_3_Transformed = (SMART_3_Raw - min3)/(max3-min3)
           | eval SMART_4_Transformed = (SMART_4_Raw - min4)/(max4-min4)
           | eval SMART_5_Transformed = (SMART_5_Raw - min5)/(max5-min5)
           | table Date Model CapacityBytes SerialNumber DiskFailure
               SMART_1_Raw SMART_1_Transformed SMART_2_Raw SMART_2_Transformed
               SMART_3_Raw SMART_3_Transformed SMART_4_Raw SMART_4_Transformed
               SMART_5_Raw SMART_5_Transformed
      1) Model name: disk_failures
      2) Fields to preprocess: SMART_1_Raw, SMART_1_Transformed, SMART_2_Raw, SMART_2_Transformed, SMART_3_Raw, SMART_3_Transformed, SMART_4_Raw, SMART_4_Transformed, SMART_5_Raw, SMART_5_Transformed
      3) Apply StandardScaler
      4) Apply PCA to reduce to 4 fields
      5) Algorithm: Birch
      6) Fields to use: PC_1, PC_2, PC_3, PC_4
      7) K: 3
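
Note on the SMART_n_Transformed fields: the eventstats and eval steps in the searches above amount to min-max scaling, where each SMART_n_Raw value x is mapped to (x - min) / (max - min) and so falls between 0 and 1. As an illustration only, the sketch below reproduces that arithmetic in Python, outside Splunk. The function name min_max_scale, the sample readings, and the choice to return 0.0 for a constant column are assumptions made for this example; they are not part of the dataset or of the dashboards.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption for this sketch: a constant column maps to 0.0.
        # The SPL searches above have no such guard against max == min.
        return [0.0 for _ in values]
    return [(x - lo) / span for x in values]


# Hypothetical temperature readings standing in for SMART_4_Raw.
smart_4_raw = [20, 25, 30, 40]
smart_4_transformed = min_max_scale(smart_4_raw)
print(smart_4_transformed)  # [0.0, 0.25, 0.5, 1.0]
```

Scaling all five SMART fields to the same range keeps fields with large raw values, such as power-on hours, from dominating fields with small raw values, such as pending sector counts.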