SH-Deployer/apps/Splunk_ML_Toolkit/licenses/lookups/authorization.txt

1. Objective:
  We want to first hash the text dataset with HashingVectorizer or convert the text dataset to a matrix of TF-IDF features and then cluster the type of ssh login attempts by failure type.

2. License: Free to use with citation.
Security Repo (http://www.secrepo.com/) by Mike Sconzo is licensed under a Creative Commons Attribution 4.0 International License

3. Data Source:
  http://www.secrepo.com/
  "auth.log" under System header

4. Data Set Information:
  The dataset contains about 30000 ssh login failure attempts.

5. Field Meanings:
  A. Log: Includes the date, IP id, sshd/CRON and the error message all together as a string in one line of a log

6. Parameter Selection:
  A. Dashboard: Detect clusters through Hashing Vectorizer
      Settings:
    1) Search: | inputlookup authorization.csv | fit HashingVectorizer Logs ngram_range=1-2 k=50 stop_words=english | fit KMeans Logs_hashed* k=3 | fields cluster* Logs | sample 5 by cluster | sort by cluster
    2) Field to hash: Logs
  B. Dashboard: Detect clusters through TFIDF
      Settings:
    1) Search: | inputlookup authorization.csv | fit TFIDF Logs ngram_range=1-2 ngram_range=1-2 max_df=0.6 min_df=0.2 stop_words=english | fit KMeans Logs_tfidf* k=3 | *fields cluster Logs | sample 6 by cluster | sort by cluster
    2) Field to convert from text dataset to a matrix of TF-IDF features: Logs