You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

27 lines
1.3 KiB

1. Objective:
We want to first hash the text dataset with HashingVectorizer or convert the text dataset to a matrix of TF-IDF features and then cluster the type of ssh login attempts by failure type.
2. License: Free to use with citation.
Security Repo (http://www.secrepo.com/) by Mike Sconzo is licensed under a Creative Commons Attribution 4.0 International License
3. Data Source:
http://www.secrepo.com/
"auth.log" under System header
4. Data Set Information:
The dataset contains about 30000 ssh login failure attempts.
5. Field Meanings:
A. Log: Includes the date, IP id, sshd/CRON and the error message all together as a string in one line of a log
6. Parameter Selection:
A. Dashboard: Detect clusters through Hashing Vectorizer
Settings:
1) Search: | inputlookup authorization.csv | fit HashingVectorizer Logs ngram_range=1-2 k=50 stop_words=english | fit KMeans Logs_hashed* k=3 | fields cluster* Logs | sample 5 by cluster | sort by cluster
2) Field to hash: Logs
B. Dashboard: Detect clusters through TFIDF
Settings:
1) Search: | inputlookup authorization.csv | fit TFIDF Logs ngram_range=1-2 ngram_range=1-2 max_df=0.6 min_df=0.2 stop_words=english | fit KMeans Logs_tfidf* k=3 | *fields cluster Logs | sample 6 by cluster | sort by cluster
2) Field to convert from text dataset to a matrix of TF-IDF features: Logs