|
|
1. Objective:
|
|
|
A. We want to detect outliers in the quantity of purchases at a supermarket.
|
|
|
B. We want to detect outliers in the whole transaction at a supermarket.
|
|
|
|
|
|
2. License: Free to use but requires citation of the following paper: Pennacchioli, D., Coscia, M., Rinzivillo, S., Pedreschi, D. and Giannotti, F., ‘Explaining the Product Range Effect in Purchase Data’. In BigData, 2013.
|
|
|
|
|
|
3. Data Source: http://www.michelecoscia.com/?page_id=379
|
|
|
|
|
|
4. DataSet Info: This dataset was obtained from ‘Coop’, one of the largest Italian retail distribution companies. The original dataset contains about 25 million purchase records from January 2007 to December 2011. We merged the three separate files that come with the original dataset and kept only the first 100,000 purchases.
|
|
|
|
|
|
5. Field Meanings:
|
|
|
A. customer_id: Unique customer ID.
|
|
|
B. shop_id: Unique shop ID.
|
|
|
C. product_id: Unique product ID.
|
|
|
D. quantity: Quantity in which the product was purchased.
|
|
|
E. price: Product price.
|
|
|
F. distance: Distance between the customer’s house and the shop location in meters.
|
|
|
G. probable_cause: Field that has the most influence in making the transaction an outlier.
|
|
|
H. isOutlier: 1 (Outlier) / 0 (Normal)
|
|
|
|
|
|
6. Parameter Selection:
|
|
|
A. Dashboard Usage: Detect Numerical Outlier
|
|
|
Settings:
|
|
|
1) Search command: | inputlookup supermarket.csv | head 1000
|
|
|
2) Field to analyze: quantity
|
|
|
3) Threshold method: Standard Deviation
|
|
|
4) Threshold multiplier: 5
|
|
|
5) Sliding window: N/A
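
To make the settings above concrete, here is a minimal Python sketch of a standard-deviation threshold with a multiplier of 5 and no sliding window. It is an illustration under stated assumptions, not the dashboard's own implementation: the function name flag_numeric_outliers, the use of the sample standard deviation, and the example quantities are all hypothetical.

```python
import statistics

def flag_numeric_outliers(values, multiplier=5.0):
    """Return 1/0 flags: 1 where a value lies outside mean +/- multiplier * stdev."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation over all rows (no sliding window).
    stdev = statistics.stdev(values)
    lower = mean - multiplier * stdev
    upper = mean + multiplier * stdev
    return [1 if (v < lower or v > upper) else 0 for v in values]

# Hypothetical quantities: 49 ordinary purchases and one very large one.
quantities = [1, 2, 1, 3, 2] * 10
quantities[-1] = 500

flags = flag_numeric_outliers(quantities, multiplier=5.0)
outlier_rows = [i for i, flag in enumerate(flags) if flag == 1]
```

With a multiplier as large as 5, only values far from the mean are flagged, so the sample needs enough rows for any single value to be able to exceed the bounds at all.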
|
|
|
|
|
|
B. Dashboard Usage: Detect Categorical Outlier
|
|
|
Settings:
|
|
|
1) Search command: | inputlookup supermarket.csv
|
|
|
2) Field(s) to analyze: customer_id, shop_id, product_id, quantity, price, distance
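
As a rough illustration of how a whole transaction could be flagged and how a probable_cause field could be derived, the following Python sketch marks a row as an outlier when one of its field values is rare across the data. This is a simple frequency-based stand-in, not the algorithm used by the dashboard: the function name flag_rare_value_rows, the min_share threshold, and the example rows are hypothetical.

```python
from collections import Counter

def flag_rare_value_rows(rows, fields, min_share=0.05):
    """Return (is_outlier, probable_cause) per row.

    A row is flagged when its rarest field value is shared by less than
    min_share of all rows; that field is reported as the probable cause.
    """
    total = len(rows)
    counts = {f: Counter(row[f] for row in rows) for f in fields}
    results = []
    for row in rows:
        # Share of rows that carry the same value, computed per field.
        shares = {f: counts[f][row[f]] / total for f in fields}
        rarest = min(shares, key=shares.get)
        if shares[rarest] < min_share:
            results.append((1, rarest))
        else:
            results.append((0, None))
    return results

# Hypothetical transactions: 30 rows, the last one from an unusual shop.
rows = [{"shop_id": "s1", "product_id": "p1" if i % 2 == 0 else "p2"}
        for i in range(30)]
rows[-1]["shop_id"] = "s9"

results = flag_rare_value_rows(rows, ["shop_id", "product_id"])
```

Numeric fields such as quantity, price, and distance are treated here as plain categories, so every distinct number counts as its own value.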
|