Introduction
SQL Server Always On Availability Groups are a robust solution for achieving high availability and disaster recovery for SQL Server databases. However, simply configuring them is not enough—you also need a solid monitoring strategy to ensure data integrity and system reliability. One key aspect of this monitoring process is keeping an eye on lease timeouts, which can signal larger issues and help prevent potentially catastrophic split-brain scenarios.
In this post, we’ll walk through the various health checks available for Always On Availability Groups, discuss how lease timeouts work, and explore practical methods for monitoring and troubleshooting.
Understanding Health Checks in Always On Availability Groups
1. Database Health Checks
- Synchronization State: Confirms that secondary replicas are kept in sync with the primary.
- Data Loss Potential: Evaluates whether any transactions might be lost during a failover.
- Database Status: Verifies that databases are online and accessible.
2. Replica Health Checks
- Operational State: Ensures replicas are assigned the correct role (primary or secondary).
- Connected State: Monitors network connectivity between replicas.
- Synchronization Health: Assesses the overall health of data synchronization.
3. Cluster Health Checks
- Node Status: Checks if each cluster node is up and running.
- Quorum Configuration: Ensures the cluster can still make decisions if some nodes fail.
- Resource Status: Monitors the availability of cluster resources.
Lease Timeout in Always On
- Lease Mechanism: The lease acts like a heartbeat signal between SQL Server and the Windows Server Failover Cluster (WSFC).
- Lease Timeout: By default, if SQL Server cannot renew the lease within 20 seconds, the WSFC deems the instance unresponsive and begins a failover.
- Importance: Proper lease management prevents split-brain scenarios, where multiple nodes simultaneously believe they are the primary, which can lead to data corruption.
Monitoring Health Checks and Lease Timeouts
1. Using SQL Server Management Studio (SSMS)
- Open Object Explorer and expand Always On High Availability > Availability Groups.
- Right-click your availability group and select Show Dashboard.
- Review Dashboard Views for real-time information about replicas and databases.
2. Extended Events
- System Health Session: By default, SQL Server runs a
system_health
session to capture key events.
Querying Lease Timeout Events:
SELECT
event_data.value('(event/@name)[1]', 'varchar(50)') AS event_name,
event_data.value('(event/@timestamp)[1]', 'datetime') AS [timestamp],
event_data.value('(event/data[@name="message"]/value)[1]', 'varchar(max)') AS message
FROM
(SELECT CAST(target_data AS XML) AS event_data
FROM sys.dm_xe_session_targets t
JOIN sys.dm_xe_sessions s ON t.event_session_address = s.address
WHERE s.name = 'system_health' AND t.target_name = 'ring_buffer') AS x
WHERE
event_data.value('(event/@name)[1]', 'varchar(50)') = 'error_reported'
AND event_data.value('(event/data[@name="error_number"]/value)[1]', 'int') = 19407;
This query filters for the lease timeout error (error_number = 19407
) reported in Extended Events.
3. SQL Server Agent Alerts
- Open SQL Server Agent and create a New Alert.
- Set the Error Number to
19407
(lease timeout error). - Configure the Response to notify operators by email or run a job automatically.
4. Performance Monitor (PerfMon)
- Monitor Always On counters under SQLServer:Replica, such as Log Send Queue and Redo Queue Size.
- Check counters like Synchronization Health and Flow Control to get a sense of ongoing data replication health.
5. Failover Cluster Manager
- Open Failover Cluster Manager on the cluster node.
- Validate the status of cluster resources and nodes.
- Review recent cluster events to identify any issues or warnings.
6. PowerShell Cmdlets
- Retrieve Cluster Logs:
Get-ClusterLog -Destination "C:\ClusterLogs"
- Analyze the generated logs for lease timeout events, cluster resource failures, or other errors.
7. Custom Monitoring Scripts
- Check Replica Health:
SELECT ag.name AS AvailabilityGroupName, ar.replica_server_name, ar.availability_mode_desc, ar.failover_mode_desc, ar.synchronization_health_desc FROM sys.availability_replicas ar JOIN sys.availability_groups ag ON ar.group_id = ag.group_id;
- Monitor Database States:
SELECT db_name(database_id) AS DatabaseName, synchronization_state_desc, is_failover_ready FROM sys.dm_hadr_database_replica_states;
Best Practices for Monitoring
- Regular Health Checks: Automate frequent checks to detect issues early.
- Alerts Configuration: Set up immediate notifications for critical events like lease timeouts.
- Documentation: Maintain detailed records of your monitoring configurations and any incidents.
- Testing: Periodically test failover processes to validate your infrastructure’s resiliency.
- Updates: Keep SQL Server and Windows updated to benefit from the latest fixes and improvements.
Troubleshooting Lease Timeouts
Common Causes
- Network Latency: Delayed connectivity can prevent timely lease renewals.
- Resource Contention: High CPU or memory usage on the server can impact performance.
- Disk I/O Bottlenecks: Slow disk operations can degrade SQL Server responsiveness.
- Cluster Service Issues: Failures in the WSFC service can cause timeouts.
Steps to Resolve
- Review Error Logs: Check both SQL Server error logs and Cluster logs to pinpoint the issue.
- Analyze Performance Metrics: Use PerfMon or DMVs to spot bottlenecks.
- Check Network Stability: Ensure that the network connections between replicas remain healthy.
- Update Lease Timeout Settings: Adjust the lease timeout value if required. For instance:
Import-Module FailoverClusters Get-ClusterResource -Name "AG Resource Name" | Set-ClusterParameter -Name LeaseTimeout -Value 30000
(Note: The value is in milliseconds, e.g., 30000 ms = 30 seconds.)
Conclusion
Monitoring SQL Server Always On Availability Groups is critical for safeguarding high availability and data integrity. From health checks to lease timeouts, each aspect requires careful configuration and ongoing attention. By using the built-in tools (SSMS, Extended Events, SQL Server Agent alerts), third-party or custom scripts, and best practices outlined here, you can proactively identify and address issues—ensuring your Always On environment remains stable, performant, and ready to handle unexpected outages.