1. Objective:
The goal is to vectorize the log text, either by hashing it with HashingVectorizer or by converting it to a matrix of TF-IDF features, and then to cluster the SSH login attempts by failure type.

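The clustering step can be illustrated with a small, dependency-free k-means sketch. This is not the implementation behind the searches in this README; the function name `kmeans`, the toy 2-D points, and the choice of the first k points as initial centroids are assumptions made only for illustration.

```python
import math

def kmeans(points, k, iterations=10):
    """Lloyd's algorithm on a list of equal-length numeric vectors."""
    # Assumption: the first k points seed the centroids, so runs are deterministic.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy feature vectors standing in for vectorized log lines (hypothetical data).
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
```

In practice the input vectors would be the hashed or TF-IDF features of each log line rather than 2-D points.
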
2. License: Free to use with citation.
Security Repo (http://www.secrepo.com/) by Mike Sconzo is licensed under a Creative Commons Attribution 4.0 International License.

3. Data Source:
http://www.secrepo.com/
File "auth.log", listed under the System header

4. Data Set Information:
The dataset contains about 30,000 SSH login failure attempts.

5. Field Meanings:
A. Logs: One line of the log as a single string, containing the date, IP id, process name (sshd/CRON), and the error message.

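To make the field layout concrete, here is a minimal sketch that splits one log line into its parts. It assumes a syslog-style layout of "<Mon> <day> <HH:MM:SS> <host> <process>[<pid>]: <message>"; the pattern, the function name `parse_log_line`, and the sample line are all hypothetical and are not taken from the dataset.

```python
import re

# Assumed layout: "<Mon> <day> <HH:MM:SS> <host> <process>[<pid>]: <message>".
LOG_PATTERN = re.compile(
    r"^(?P<date>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<process>[\w/-]+)(?:\[(?P<pid>\d+)\])?:\s+"
    r"(?P<message>.*)$"
)

def parse_log_line(line):
    """Split one log line into its parts, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Hypothetical example line, not taken from the dataset.
sample = "Dec  1 10:15:42 ip-10-0-0-1 sshd[1234]: Invalid user admin from 192.0.2.10"
parsed = parse_log_line(sample)
```

The searches in this README do not need this step, since they vectorize the whole Logs string; the sketch only shows what each line is made of.
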
6. Parameter Selection:
A. Dashboard: Detect clusters through HashingVectorizer
Settings:
1) Search: | inputlookup authorization.csv | fit HashingVectorizer Logs ngram_range=1-2 k=50 stop_words=english | fit KMeans Logs_hashed* k=3 | fields cluster* Logs | sample 5 by cluster | sort by cluster
2) Field to hash: Logs
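The idea behind hashing text into a fixed-length vector can be sketched in a few lines of standard-library Python. This is a simplified illustration of the hashing trick over word 1-grams and 2-grams, not the algorithm invoked by the search above: the names `ngrams` and `hash_vector`, the `n_buckets` parameter, and the sample message are hypothetical, `n_buckets` is not claimed to correspond to `k` in the search, and stop-word removal is left out.

```python
import hashlib
import re

def ngrams(text, low=1, high=2):
    """Yield word n-grams of length low..high from lowercased text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def hash_vector(text, n_buckets=16):
    """Map text to a fixed-length count vector via the hashing trick."""
    vector = [0] * n_buckets
    for gram in ngrams(text):
        # sha256 gives a hash that is stable across runs, unlike built-in hash().
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        vector[bucket] += 1
    return vector

# Hypothetical log message, not taken from the dataset.
vec = hash_vector("Invalid user admin from 192.0.2.10")
```

Because no vocabulary is stored, every line maps to a vector of the same length regardless of which words it contains, at the cost of occasional collisions between different n-grams.
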
B. Dashboard: Detect clusters through TFIDF
Settings:
1) Search: | inputlookup authorization.csv | fit TFIDF Logs ngram_range=1-2 max_df=0.6 min_df=0.2 stop_words=english | fit KMeans Logs_tfidf* k=3 | fields cluster Logs | sample 6 by cluster | sort by cluster
2) Field to convert to a matrix of TF-IDF features: Logs
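The effect of the max_df and min_df settings can be illustrated with a small standard-library sketch. It assumes both values are read as proportions of documents, uses a common smoothed IDF form, and handles single words only (no 2-grams, no stop-word list, no normalization). The function name `tfidf` and the sample messages are hypothetical; this is not the implementation behind the search above.

```python
import math
from collections import Counter

def tfidf(docs, min_df=0.2, max_df=0.6):
    """Return (vocabulary, rows) of TF-IDF weights for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep terms whose document share lies within [min_df, max_df].
    vocab = sorted(t for t, c in df.items() if min_df <= c / n_docs <= max_df)
    # Smoothed inverse document frequency (an assumed, commonly used form).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[t] * idf[t] for t in vocab])
    return vocab, rows

# Hypothetical messages, not taken from the dataset.
docs = [
    "sshd invalid user admin",
    "sshd invalid user test",
    "sshd invalid user guest",
    "sshd failed password root",
    "sshd failed password admin",
    "sshd connection closed",
]
vocab, rows = tfidf(docs)
```

Here "sshd" appears in every message and is dropped by max_df, while words that occur in only one message fall below min_df. What remains are the mid-frequency terms that tend to separate one failure type from another.
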