Hi Everyone,
The system_health session is an extended event session that is included by default with SQL Server. This blog post goes over how to easily read and interpret System Health extended event and sp_server_diagnostics with an example. For this post, we will focus on what sp_server_diagnostics reports during the period of 3/15/2023 at 4:00am to 3/15/2023 at 11:20am.
Let us say you are being approached to figure out the reason for a glitch during this time window, even if you did not have any monitoring solution in place, you can still tap into system_health session since this is available by default on your SQL Server instance.
1.1 What is inside system health xevent?
1.2 What is inside the output of the internal stored procedure: Sp_Server_Diagnostics?
Since the stored procedure sp_server_diagnostics is internal to SQL Server and cannot be scripted out, the section below expands on how to read the data column of each column from the output of this stored procedure and how to interpret the results.
The output of the stored procedure sp_server_diagnostics is logged in both system_health and sqldiag xevent sessions with different frequencies.
For System_health, the frequency is 5 minutes, and for SQLDiag, the default is 10 seconds with 30 seconds as the health check timeout.
Data captured by sp_server_diagnostics. (Reference: sp_server_diagnostics (Transact-SQL) - SQL Server | Microsoft Learn)
Column |
Data type |
Description |
create_time |
datetime |
Indicates the time stamp of row creation. Each row in a single rowset has the same time stamp. |
component_type |
sysname |
Indicates whether the row contains information for the SQL Server instance level component or for an Always On availability group: Collects data from a system perspective on spinlocks, severe processing conditions, non-yielding tasks, page faults, and CPU usage. This information is producing an overall health state recommendation.
Resource: Collects data from a resource perspective on physical and virtual memory, buffer pools, pages, cache, and other memory objects. This information produces an overall health state recommendation.
query_processing: Collects data from a query-processing perspective on the worker threads, tasks, wait types, CPU intensive sessions, and blocking tasks. This information produces an overall health state recommendation.
IO_subsystem: Collects data on IO. In addition to diagnostic data, this component produces a clean healthy or warning health state only for an IO subsystem.
Events: Collects data and surfaces through the stored procedure on the errors and events of interest recorded by the server, including details about ring buffer exceptions, ring buffer events about memory broker, out of memory, scheduler monitor, buffer pool, spinlocks, security, and connectivity . Events will always show 0 as the state.
<name of the availability group>: Collects data for the specified availability group (if component_type = "Always On:AvailabilityGroup").
|
component_name |
sysname |
Indicates the name of component or the name of the availability group: |
state |
int |
Indicates the health status of the component: |
state_desc |
sysname |
Describes the state column. Descriptions that correspond to the values in the state column are: 1: clean 2: warning 3: error |
data |
varchar (max) |
Specifies data that is specific to the component. |
2.0 How to open multiple extended events on SSMS (SQL Server Management Studio)?
3.0 How to analyze the merged extended event files?
Since the extended event has a running log of entries, the first thing to do is to apply a time filter to narrow the log entries down to the time of the issue.
So, when you open the xevent files on the ribbon click on filter.
A popup will show up, click on set time filter> set the start time and end time to the hours that you want to review.
Important!!! The xevent timestamp will show with your time zone, you will need to convert the time you can check the SQL error log to see the time zone. You also have the option to add the column timestamp (UTC). To add this column right click on the columns > click on choose column> look for timestamp (UTC)
For this demo, we will only focus on the output from sp_server_diagnostics and since the issue in this example happened at 6 am EST (UTC-4) and continue happening for sometime, we will start looking for events starting from 4 am (UTC -6) until 11:21 am as shown on the screenshot below. Next on the bottom we can use the filter “Name = sp_server_diagnostics_component_result” and “state not like clean”
After filtering it will give you a list of components that do not have a clean state. You can add the field as a column by right clicking the field and “show column in table.”
From here we will look at each component and what they are reporting. You can group by component and sort by time. To group by right click on the column and click on group by this column.
And it will look something like this.
4.0 What does each component report?
Now that we have the data that we want to look at least see what each component reports on the data column. This column is on varchar (max)
Some of the counters are very straight forward some are not, but here are some conditions that would cause warnings or errors.
4.1 Component: SYSTEM
Collects data from a system perspective on spinlocks, severe processing conditions, non-yielding tasks, page faults, and CPU usage.
The categories of information that are surfaced for the "system" component data are as follows:
<system spinlockBackoffs="0" sickSpinlockType="none" sickSpinlockTypeAfterAv="none" latchWarnings="0" isAccessViolationOccurred="0" writeAccessViolationCount="0" totalDumpRequests="6" intervalDumpRequests="0" nonYieldingTasksReported="0" pageFaults="1438916" systemCpuUtilization="19" sqlCpuUtilization="14" BadPagesDetected="0" BadPagesFixed="0" LastBadPageAddress="0x0"/> |
On the above example we can the following
4.2 Component: RESOURCE
Collects data from a resource perspective on physical and virtual memory, buffer pools, pages, cache, and other memory objects. The data column output is divided into three areas of interest.
The initial area of focus pertains to overall resource monitoring.
<resource lastNotification="RESOURCE_MEMPHYSICAL_LOW" outOfMemoryExceptions="0" isAnyPoolOutOfMemory="0" processOutOfMemoryPeriod="0">
The second segment of data derived from the "resource" component pertains to a memory report that encompasses Process and System counters and counts.
<memoryReport name="Process/System Counts" unit="Value"> <entry description="Available Physical Memory" value="49261801472"/> <entry description="Available Virtual Memory" value="139206615916544"/> <entry description="Available Paging File" value="48387641344"/> <entry description="Working Set" value="483305889792"/> <entry description="Percent of Committed Memory in WS" value="100"/> <entry description="Page Faults" value="1809733786"/> <entry description="System physical memory high" value="1"/> <entry description="System physical memory low" value="0"/> <entry description="Process physical memory low" value="0"/> <entry description="Process virtual memory low" value="0"/></memoryReport>
The third memory report sourced from the "resource" component data comprises Memory Manager counter values.
<memoryReport name="Memory Manager" unit="KB"> <entry description="VM Reserved" value="1488779088"/> <entry description="VM Committed" value="471862324"/> <entry description="Locked Pages Allocated" value="0"/> <entry description="Large Pages Allocated" value="0"/> <entry description="Emergency Memory" value="1024"/> <entry description="Emergency Memory In Use" value="16"/> <entry description="Target Committed" value="471859208"/> <entry description="Current Committed" value="471862328"/> <entry description="Pages Allocated" value="451123768"/> <entry description="Pages Reserved" value="361480"/> <entry description="Pages Free" value="1376480"/> <entry description="Pages In Use" value="106168864"/> <entry description="Page Alloc Potential" value="342097376"/> <entry description="NUMA Growth Phase" value="2"/> <entry description="Last OOM Factor" value="0"/> <entry description="Last OS Error" value="0"/></memoryReport></resource> |
Last Notification looks to be the memory manager’s latest report.
Some observations:
4.3 Component: Query
There are five helpful groupings of statistics into which the data can be categorized:
· Query processing statistics
· Top non-preemptive wait statistics by count
· Top non-preemptive wait statistics by duration
· Top preemptive waits by count
· Top preemptive waits by duration
The initial segment will comprise data related to query processing statistics.
Returning a point of time measurement of max workers, actual workers created, idle workers, pending tasks, oldest pending task waiting time (I’ll be paying attention to this one) and scheduler deadlock and non-yielding counts.
<queryProcessing maxWorkers="960" workersCreated="579" workersIdle="402" tasksCompletedWithinInterval="1284604" pendingTasks="1" oldestPendingTaskWaitingTime="0" hasUnresolvableDeadlockOccurred="0" hasDeadlockedSchedulersOccurred="0" trackingNonYieldingScheduler="0x0">
The second part we have waits. Preemptive and nonPreemptive
We have four sets of results for the Top 10 wait stats. Here is the output of the top non-preemptive waits, sorted by count based on the "waits" attribute.
<topWaits><nonPreemptive> <byCount> <wait waitType="MEMORY_ALLOCATION_EXT" waits="14771512961" averageWaitTime="0" maxWaitTime="2852"/> <wait waitType="RESERVED_MEMORY_ALLOCATION_EXT" waits="1058047487" averageWaitTime="0" maxWaitTime="2100"/> <wait waitType="ASYNC_NETWORK_IO" waits="493414963" averageWaitTime="0" maxWaitTime="5536"/> <wait waitType="HADR_WORK_QUEUE" waits="245076271" averageWaitTime="62" maxWaitTime="15388"/> <wait waitType="WRITELOG" waits="83482761" averageWaitTime="1" maxWaitTime="4632"/> <wait waitType="PAGELATCH_EX" waits="49895014" averageWaitTime="0" maxWaitTime="1795"/> <wait waitType="PARALLEL_REDO_WORKER_WAIT_WORK" waits="40554164" averageWaitTime="8" maxWaitTime="4790"/> <wait waitType="PAGEIOLATCH_SH" waits="26346072" averageWaitTime="1" maxWaitTime="3287"/> <wait waitType="UCS_SESSION_REGISTRATION" waits="17792927" averageWaitTime="0" maxWaitTime="161"/> <wait waitType="PAGEIOLATCH_EX" waits="11518408" averageWaitTime="2" maxWaitTime="5653"/></byCount>
Here is the output of the top non-preemptive waits, sorted by duration.
<byDuration> <wait waitType="HADR_WORK_QUEUE" waits="245076271" averageWaitTime="62" maxWaitTime="15388"/> <wait waitType="VDI_CLIENT_OTHER" waits="451646" averageWaitTime="13548" maxWaitTime="60009"/> <wait waitType="CLR_AUTO_EVENT" waits="35196" averageWaitTime="16700" maxWaitTime="10741940"/> <wait waitType="PARALLEL_REDO_WORKER_WAIT_WORK" waits="40554164" averageWaitTime="8" maxWaitTime="4790"/> <wait waitType="HADR_CLUSAPI_CALL" waits="2424632" averageWaitTime="121" maxWaitTime="868"/> <wait waitType="FT_IFTSHC_MUTEX" waits="188496" averageWaitTime="1560" maxWaitTime="1221459"/> <wait waitType="HADR_FILESTREAM_IOMGR_IOCOMPLETION" waits="585054" averageWaitTime="502" maxWaitTime="980"/> <wait waitType="HADR_TIMER_TASK" waits="663915" averageWaitTime="442" maxWaitTime="20004"/> <wait waitType="HADR_LOGCAPTURE_WAIT" waits="111501" averageWaitTime="2194" maxWaitTime="5044"/> <wait waitType="ASYNC_NETWORK_IO" waits="493414963" averageWaitTime="0" maxWaitTime="5536"/></byDuration></nonPreemptive>
Here is the output of the top preemptive sorted by waits count.
<preemptive> <byCount> <wait waitType="PREEMPTIVE_XE_CALLBACKEXECUTE" waits="28305259" averageWaitTime="0" maxWaitTime="7"/> <wait waitType="PREEMPTIVE_OS_AUTHENTICATIONOPS" waits="1286378" averageWaitTime="0" maxWaitTime="50"/> <wait waitType="PREEMPTIVE_OS_GENERICOPS" waits="1083014" averageWaitTime="0" maxWaitTime="15006"/> <wait waitType="PREEMPTIVE_OS_FILEOPS" waits="731347" averageWaitTime="0" maxWaitTime="2280"/> <wait waitType="PREEMPTIVE_OS_CLOSEHANDLE" waits="267390" averageWaitTime="0" maxWaitTime="47"/> <wait waitType="PREEMPTIVE_OS_AUTHORIZATIONOPS" waits="259924" averageWaitTime="0" maxWaitTime="46"/> <wait waitType="PREEMPTIVE_OS_REVERTTOSELF" waits="255969" averageWaitTime="0" maxWaitTime="39"/> <wait waitType="PREEMPTIVE_OS_QUERYCONTEXTATTRIBUTES" waits="255955" averageWaitTime="0" maxWaitTime="47"/> <wait waitType="PREEMPTIVE_OS_DELETESECURITYCONTEXT" waits="249887" averageWaitTime="0" maxWaitTime="409"/> <wait waitType="PREEMPTIVE_OS_LOOKUPACCOUNTSID" waits="181358" averageWaitTime="0" maxWaitTime="108"/></byCount>
Here is the output of the top preemptive sorted by duration.
<byDuration> <wait waitType="PREEMPTIVE_OS_WRITEFILEGATHER" waits="698" averageWaitTime="1460" maxWaitTime="105279"/> <wait waitType="PREEMPTIVE_OS_FILEOPS" waits="731347" averageWaitTime="0" maxWaitTime="2280"/> <wait waitType="PREEMPTIVE_OS_GENERICOPS" waits="1083014" averageWaitTime="0" maxWaitTime="15006"/> <wait waitType="PREEMPTIVE_OS_AUTHENTICATIONOPS" waits="1286378" averageWaitTime="0" maxWaitTime="50"/> <wait waitType="PREEMPTIVE_OS_DELETESECURITYCONTEXT" waits="249887" averageWaitTime="0" maxWaitTime="409"/> <wait waitType="PREEMPTIVE_OS_WRITEFILE" waits="57570" averageWaitTime="2" maxWaitTime="1211"/> <wait waitType="PREEMPTIVE_OS_LOOKUPACCOUNTSID" waits="181358" averageWaitTime="0" maxWaitTime="108"/> <wait waitType="PREEMPTIVE_OS_QUERYREGISTRY" waits="99563" averageWaitTime="0" maxWaitTime="76"/> <wait waitType="PREEMPTIVE_XE_CALLBACKEXECUTE" waits="28305259" averageWaitTime="0" maxWaitTime="7"/> <wait waitType="PREEMPTIVE_OS_CREATEFILE" waits="13676" averageWaitTime="2" maxWaitTime="314"/> </byDuration></preemptive></topWaits>
You will also get intensive request session.
<cpuIntensiveRequests> <request sessionId="1664" requestId="2" command="SELECT" taskAddress="0x2ce3f833848" cpuUtilization="99" cpuTimeMs="1266"/> <request sessionId="1309" requestId="2" command="SELECT" taskAddress="0x2e9153c3088" cpuUtilization="99" cpuTimeMs="906"/> <request sessionId="452" requestId="0" command="SELECT" taskAddress="0x30ec462cca8" cpuUtilization="99" cpuTimeMs="15"/> <request sessionId="552" requestId="2" command="SELECT" taskAddress="0x2f007dab848" cpuUtilization="98" cpuTimeMs="10219"/> <request sessionId="969" requestId="2" command="SELECT INTO" taskAddress="0x2f165f424e8" cpuUtilization="98" cpuTimeMs="875"/> </cpuIntensiveRequests>
Next will be pending task.
<pendingTasks><entryPoint name="Process Command" count="1"/></pendingTasks>
And in case there were any blocking happening it will show up here.
<blockingTasks></blockingTasks></queryProcessing>
|
For our example we can see that we have one pending task pendingTasks="1" This means the scheduler was hung and 1 task were waiting on it to get a scheduler and run.
SQL Server pending tasks refer to a set of activities that are currently waiting to be executed by the SQL Server but are temporarily blocked due to various reasons such as resource contention, I/O latency, or other dependencies. These tasks are queued in a pending state until the resources become available to process them further. The pending tasks can be monitored and managed using various performance monitoring and troubleshooting tools in SQL Server.
4.4 Component: IO
We have the I/O component here it will be very hard to get a warning, but we can see if there were any latch timeouts and long I/Os.
Long I/Os are considered when it takes 15 seconds or longer.
<ioSubsystem ioLatchTimeouts="0" intervalLongIos="0" totalLongIos="0"><longestPendingRequests></longestPendingRequests></ioSubsystem> |
4.5 Component: Event
On the event component, we will collect data on the errors and events of interest recorded by the server, including details about ring buffer exceptions, ring buffer events about memory broker, out of memory, scheduler monitor, buffer pool, spinlocks, security, and connectivity. Events will always show 0 as the state.
5.0 References:
Use the system_health Session - SQL Server | Microsoft Learn
Database Engine error severities - SQL Server | Microsoft Learn
Under the covers with sp_server_diagnostics data (Part I) - Joe Sack (sqlskills.com)
Under the covers with sp_server_diagnostics data (Part II) - Joe Sack (sqlskills.com)
SQL SERVER - PREEMPTIVE Waits - Simple Definition - SQL Authority with Pinal Dave
SQL Server, Memory Manager object - SQL Server | Microsoft Learn
sp_server_diagnostics (Transact-SQL) - SQL Server | Microsoft Learn
Use DBCC MEMORYSTATUS to monitor memory usage - SQL Server | Microsoft Learn
How It Works: sp_server_diagnostics – spinlock backoffs - Microsoft Community Hub
Using sp_server_diagnostics to Monitor SQL Server (mssqltips.com)
sys.dm_os_tasks (Transact-SQL) - SQL Server | Microsoft Learn
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.