NOTE: This version of the ITSI health check dashboard has been deprecated.

This dashboard describes the operational status and configuration of an ITSI instance. Because it accesses sensitive indexes (_internal) and REST endpoints, reports will be incomplete if they are not run by an Admin user.

Default time range: -24h@h to now (panels follow the shared time picker).
Splunk Server Information

| rest splunk_server=local /services/server/info
| stats values(version) as splunk_version, values(server_roles) as server_roles, values(os_name) as os, values(numberOfCores) as cpu_cores, values(numberOfVirtualCores) as virtual_cpu_cores, values(physicalMemoryMB) as physical_mem_MB by splunk_server
| rename splunk_server as host
| join type=left host
    [| rest splunk_server=local /services/apps/local/itsi
    | stats values(version) as itsi_version by splunk_server
    | rename splunk_server as host]
| join type=left host
    [search index=_introspection sourcetype=splunk_disk_objects component=Indexes data.name="*itsi*"
    | stats dc(data.name) as index_count, values(data.name) as indexes by host]
| join type=left host
    [search index=_internal splunk_server=local sourcetype=splunkd "Linux transparent hugepage support" latest=now()
    | head 1
    | rex field=event_message "enabled= (?<enabled>\S+)"
    | eval THP_kernel_settings=if(enabled="always", "not ok", "ok")]
| table host splunk_version itsi_version os cpu_cores virtual_cpu_cores physical_mem_MB THP_kernel_settings server_roles index_count indexes

Color thresholds: cpu_cores 12/16, virtual_cpu_cores 24/32, physical_mem_MB 12288/16384 (red below the first value, yellow between, green above); THP_kernel_settings is green when "ok".
ITSI Migration Status

| rest splunk_server=local /services/apps/local/itsi
| stats values(version) as "Current ITSI version"
| join
    [| rest splunk_server=local /services/apps/local/SA-ITOA
    | stats values(version) as "Current SA-ITOA version"
    | join
        [| inputlookup itsi_migration_check
        | eval "Current KV Store version"=itsi_latest_version
        | fields - itsi_old_version, itsi_latest_version, is_migration_done]]
ITSI Upgrade Readiness

| inputlookup itsi_service_template_sync_status_lookup
| stats count(eval(sync_status=="syncing" OR (sync_status=="sync scheduled" AND isnull(scheduled_time)))) as my_count
| eval upgrade_ready=if(my_count > 0, "False", "True")
| fields - my_count
Basic ITSI Information

| rest splunk_server=local /services/server/info
| stats values(kvStoreStatus) as kvstore_status by splunk_server
| rename splunk_server as host
| join type=left host
    [search index=_introspection sourcetype=kvstore component=KVStoreCollectionStats data.ns="*itsi*"
    | stats dc(data.ns) as kvstore_collections, count(eval(data.ok="0")) as kvstore_data_not_ok by host]
| join type=left host
    [search index=_introspection sourcetype=http_event_collector_metrics data.token_name="Auto Generated ITSI Event Management Token"
    | stats sum(data.num_of_errors) as HEC_errors, sum(data.num_of_parser_errors) as HEC_parser_errors, sum(data.total_bytes_indexed) as HEC_bytes_indexed by host]
| join type=left host
    [| rest splunk_server=local /servicesNS/nobody/SA-ITOA/itoa_interface/vLatest/service/count report_as=text
    | spath input=value
    | rename splunk_server as host, count as service_count
    | table host service_count]
| join type=left host
    [| rest splunk_server=local /servicesNS/nobody/SA-ITOA/itoa_interface/vLatest/entity/count report_as=text
    | spath input=value
    | rename splunk_server as host, count as entity_count
    | table host entity_count]
| join type=left host
    [search index=_internal sourcetype=scheduler savedsearch_name="Indicator*"
    | stats count as run_count, count(eval(status="delegated_remote_error" OR status="skipped")) as failed_count, count(eval(suppressed!="0")) as suppressed_count, avg(run_time) as avg_runtime, max(run_time) as max_runtime, earliest(_time) as first, latest(_time) as last by host, savedsearch_name
    | eval KPI_search_type=if(savedsearch_name like "%Shared%", "base", "ad hoc")
    | stats count(eval(KPI_search_type="base")) as kpi_base_searches, count(eval(KPI_search_type="ad hoc")) as kpi_adhoc_searches by host]
| table host service_count kpi_base_searches kpi_adhoc_searches entity_count kvstore_status kvstore_collections kvstore_data_not_ok HEC_bytes_indexed HEC_errors HEC_parser_errors
KPI Base Search Usage Summary

| inputlookup service_kpi_sbs_lookup
| eval zipped=mvzip(mvzip('kpis.base_search', 'kpis.search_type', "==@@=="), 'kpis.title', "==@@==")
| fields - kpis._key, kpis.base_search, kpis.search_type, kpis.title, sec_grp, title
| eval sharedBaseZipped=mvfilter(match(zipped, "shared_base"))
| rename kpis.base_search_id as base_search_id
| fields - zipped
| eval t=mvzip(base_search_id, sharedBaseZipped, "==@@==")
| fields - sharedBaseZipped, base_search_id
| mvexpand t
| eval x=split(t, "==@@==")
| eval search_id=mvindex(x, 0)
| eval search_str=mvindex(x, -3)
| eval search_type=mvindex(x, -2)
| eval kpi_title=mvindex(x, -1)
| search search_type=shared_base
| stats count by search_id, search_str
| rename search_id as key
| join
    [| inputlookup kpi_base_search_title_lookup
    | eval key=_key]
| rename title as kpi_base_search_title
| table kpi_base_search_title, search_str, count

The maximum number of objects in each collection is 500,000. You may notice performance degradation as a collection approaches its limit.

KV Store Collections

| rest splunk_server=local /services/server/introspection/kvstore/collectionstats
| mvexpand data
| spath input=data
| rex field=ns "(?<App>.*)\.(?<Collection>.*)"
| eval dbsize=size/1024/1024
| eval indexsize=totalIndexSize/1024/1024
| stats first(count) as "Number of Objects", first(nindexes) as Accelerations, first(indexsize) as "Acceleration Size (MB)", first(dbsize) as "Collection Size (MB)" by App, Collection
| sort - "Number of Objects"

Color thresholds: "Number of Objects" is green below 430,000, yellow from 430,000 to 500,000, and red above.
Concurrent Searches

source="*/metrics.log" sourcetype=splunkd index=_internal active_hist_searches group=search_concurrency "system total"
| stats max(active_hist_searches) as max_historical_searches, avg(active_hist_searches) as avg_historical_searches, max(active_realtime_searches) as max_realtime_searches, avg(active_realtime_searches) as avg_realtime_searches by splunk_server
| rename splunk_server as host
| eval avg_historical_searches=round(avg_historical_searches,0)
| eval avg_realtime_searches=round(avg_realtime_searches,0)
| join type=left host
    [search source="*/metrics.log" sourcetype=splunkd index=_internal group=searchscheduler
    | stats max(skipped) as max_skipped, max(max_running) as max_running, max(total_runtime) as max_total_runtime, avg(total_runtime) as avg_total_runtime by splunk_server
    | rename splunk_server as host
    | eval max_total_runtime=round(max_total_runtime,0)
    | eval avg_total_runtime=round(avg_total_runtime,0)]
Interesting Indexes

| tstats count as entries, latest(_time) as most_recent where index=itsi* OR index=_internal by index, splunk_server
| stats sum(entries) as entries, max(most_recent) as most_recent, values(splunk_server) as indexers by index
| eval most_recent=strftime(most_recent,"%F %T")
Interesting Searches (if the real-time searches are not running, this could indicate a Java problem)

| rest splunk_server=local /services/search/jobs/
| search label=itsi*
| fields label dispatchState isFailed isRealTimeSearch runDuration
| rename label as search_name

KPI Performance. runtime_headroom_pct is 100 - (runtime / scheduled interval x 100). For a search scheduled to run every 60 sec with a runtime of 45 sec, runtime_headroom_pct = 25; 100 is good, 0 is bad. Your avg_result_count and max_result_count should not exceed max_action_results in the [scheduler] stanza of limits.conf (default: 50,000).
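The headroom example above can be reproduced with a quick standalone eval (a sketch; the 60-second interval and 45-second runtime are the hypothetical values from the text, not values read from this dashboard):

```spl
| makeresults
| eval scheduled_interval_sec=60, runtime_sec=45
| eval runtime_headroom_pct=round(100 - (runtime_sec / scheduled_interval_sec) * 100, 1)
```

This returns runtime_headroom_pct=25.0, matching the worked example.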

limit = (number of KPIs x number of entities associated with those KPIs) + (number of services x 2). Exceeding the limit may lead to inconsistent results for KPI aggregation. Increasing the limit can impact system performance, because more memory must be allocated to support the larger search result sets.
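As a worked example with hypothetical counts (50 KPIs, 200 entities per KPI, 100 services; none of these figures come from this dashboard), the formula gives 50 x 200 + 100 x 2 = 10,200, comfortably under the 50,000 default. The same arithmetic as a standalone SPL sketch:

```spl
| makeresults
| eval kpi_count=50, entities_per_kpi=200, service_count=100
| eval required_limit=(kpi_count * entities_per_kpi) + (service_count * 2)
```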

index=_internal sourcetype=scheduler savedsearch_name="Indicator*"
| stats dc(sid) as run_count, count(eval(status="delegated_remote_error" OR status="skipped")) as failed_count, count(eval(suppressed!="0")) as suppressed_count, avg(run_time) as avg_runtime, max(run_time) as max_runtime, earliest(_time) as first, latest(_time) as last, max(result_count) as max_result_count, avg(result_count) as avg_result_count by savedsearch_name
| eval KPI_search_type=if(savedsearch_name like "%Shared%", "base", "ad hoc")
| eval runtime_headroom_pct=round((100-(max_runtime/((last-first)/(run_count-1))*100)),1)
| eval avg_runtime=round(avg_runtime, 2)
| eval max_runtime=round(max_runtime, 2)
| eval avg_result_count=round(avg_result_count, 2)
| eval max_result_count=round(max_result_count, 2)
| table savedsearch_name KPI_search_type failed_count suppressed_count runtime_headroom_pct avg_runtime max_runtime avg_result_count max_result_count run_count
| sort +runtime_headroom_pct

Color thresholds: runtime_headroom_pct is red/yellow/green at 25/50; failed_count turns red at 1 or more; suppressed_count turns yellow at 1 or more.
Savedsearch Error Messages

index=_internal sourcetype=scheduler savedsearch_name="Indicator*"
| join sid
    [search index=_internal sourcetype=splunk_search_messages app="itsi" log_level=ERROR]
| stats count(savedsearch_name) as "count", avg(run_time) as "Avg Runtime(sec)", values(message_key) as "Message Key", values(message) as "Error Message" by savedsearch_name
| eval "Avg Runtime(sec)"=round('Avg Runtime(sec)', 3)
| rename savedsearch_name as "Savedsearch Name"
Not Executed Searches (in last 1 hour)

index=_internal source=*splunkd.log "search not executed" user="splunk-system-user"
| timechart count span=1h

Refresh Queue Statistics

The refresh queue ensures data integrity and eventual consistency of your ITSI configuration. It runs as a single instance.

Refresh Queue Runtimes

index=_internal sourcetype=itsi_internal_log source=*itsi_consumer* "Job Successful"
| stats avg(transaction_time) as "Average Job Time", avg(queue_time) as "Average Queue Time", max(transaction_time) as "Maximum Job Time", max(queue_time) as "Maximum Queue Time"
Refresh Queue Failed Jobs

index=_internal sourcetype=itsi_internal_log source=*itsi_consumer* "Job Failed"
| stats count as "Failed Jobs"

Color threshold: "Failed Jobs" turns red at 1 or more. Drilldown opens the underlying "Job Failed" events in search.
ITSI Log Messages (deduplicated)

Time range: -60m@m to now. A LogLevel dropdown offers Debug, Info, Warning, Error, and All (default: WAR*,ERROR), substituted into the search as a log_level= filter.

index=_internal sourcetype=itsi_internal_log $LogLevel$
| rex max_match=3 "\[(?<itsi_components>[^\]]+)"
| eval comp1=mvindex(itsi_components,0), comp2=mvindex(itsi_components,1), comp3=mvindex(itsi_components,2)
| fillnull value="none" comp3
| dedup comp1 comp2 comp3

If more than one entity uses the same alias field value, KPI base searches might produce incorrect statistical aggregation results. To remedy duplicate entity alias values, click Configure > Entities and edit the entity definitions for the entities with duplicate aliases. Keep the alias value for one of the entities and edit the others to remove the duplicate alias value.
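One way to surface such duplicates is a search like the sketch below, which assumes the stock itsi_entities lookup and its identifier.values field (both are ITSI-version dependent; verify the lookup and field names on your instance before relying on this):

```spl
| inputlookup itsi_entities
| fields title, identifier.values
| rename identifier.values as alias_value
| mvexpand alias_value
| stats dc(title) as entity_count, values(title) as entities by alias_value
| where entity_count > 1
```

Each remaining row is an alias value shared by two or more entities, listing the entities to edit.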

Check for Duplicate Entity Aliases