Posted by blakhani on May 20, 2016
Working with Root Cause Analysis (RCA) is also part of my work at Microsoft. In case of cluster failover RCA, it is very important to get cluster log. Sometimes there are situation where we want to generate cluster log for last few minutes for quicker analysis of live issues. This blog explains some common parameters which I used in my day-to-day troubleshooting.
I have 4 nodes cluster in my lab named SRV1, SRV2, SRV3, SRV4.
- Default command – generates Cluster.log file on ALL nodes in C:\Windows\Cluster\Reports folder. File name would be Cluster.log
- if we want the cluster log to be generated for specific node(s) then we can use –Node parameter. We can put comma separated node names as shown below.
Get-ClusterLog -Node SRV1, SRV3
- You might know that the time shown in cluster log is UTC be default. Sometimes its difficult to translate UTC time to local time, specially for time-zones which has daylight saving. Luckily, cluster log can be generated in local time using parameter UseLocalTime . Here is the sample code.
- Another useful parameter is to copy the files to specific location. This command would generate logs and also dump on specified location. in below example, I am dumping logs from all nodes to C:\Temp folder.
Get-ClusterLog –Destination “C:\Temp”
- TimeSpan is another parameter which can generate cluster log for last number of minutes specified. By default it would generate Cluster.log for complete time. I find it useful when I repro’ed a problem and I want to look at cluster log for last 2 to 3 minutes. Here is the command to generate log for last 3 minutes.
Get-ClusterLog –TimeSpan 3
So, this is my favorite command after reproducing cluster issue on local node.
Get-ClusterLog -Node SRV1 -TimeSpan 2 -UseLocalTime -Destination C:\
Hopefully it would be useful.
Posted in Cluster, Tips and Tricks, Troubleshooting | Tagged: get-clusterlog, parameters, Tips and Tricks | 2 Comments »
Posted by blakhani on April 7, 2015
Recently I have had a friend who reported below issues.
- When I expand database, it takes a lot of time.
- When I expand Jobs under SQL Server Agent node in SQL Server Management Studio, it freezes and finally it fails with error
“An exception occurred while executing a Transact-SQL statement or batch. (Microsoft.SqlServer.ConnectionInfo)
Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding”
The symptom on the server that we were working on was that from Management Studio, when trying to browse SQL agent job, the interface would hang. I have asked to capture Profiler trace to find out which query is taking time and what is the waits for those query which are stuck.
In profiler trace, and the query to capture currently running queries (refer this blog) I found that it was running sp_help_job stored procedure from MSDB database. This is the procedure to get high level details about all jobs in MSDB database using msdb.dbo.sysjobs_view. When we looked further, we found that SQL Server is running function dbo.SQLAGENT_SUSER_SNAME and got stuck at SELECT @ret = SUSER_SNAME(@user_sid) statement. The wait for the session is was PREEMPTIVE_OS_LOOKUPACCOUNTSID and wait time was increasing. This wait is related to the communication/validation from Active Directory.
When we debugged further, here is the chain of reaching to function.
>> Query having – owner = dbo.SQLAGENT_SUSER_SNAME(sjv.owner_sid)
>> SELECT @ret = SUSER_SNAME(@user_sid)
This is getting stuck at PREEMPTIVE_OS_LOOKUPACCOUNTSID. We was identified that the function is used to convert SIDs stored in SQL Server table to the name by making call to Active Directory. Now the challenge was to find why and also was to identify if its happening with particular logins or all login. The complexity here was that SQL Server stores SID in varbinary format not in the format which OS would understand.
Luckily, I have had a blog post having script to convert the varbinary to well known format. So I have used that to convert SIDs obtained from below query
where owner_sid <> 0x01
Once we have SID value in OS understandable format, I used PsGetSID tool from sysinternals to get Windows account name. While running that it was taking a long time and finally it failed with below error.
Error querying SID:
The trust relationship between the primary domain and the trusted domain failed
So, it was something to do with two domains trust which seems to be broken. I asked him to work with this Windows Domain Admin team and networking team to get the issue resolved.
Same issue might happen during database expand also as, database owner is a SID stored in sys.databases and that has to be converted to name.
Hope this helps.
Posted in SQL Server Management Studio, SSMS, Troubleshooting | Tagged: PREEMPTIVE_OS_LOOKUPACCOUNTSID, Waittype | 4 Comments »
Posted by blakhani on September 11, 2014
While deploying SQL Server 2014 cluster in my lab I ran into this problem when I was trying to add second node (Node 2) to SQL cluster.
Rule Check Result
Rule "SQL Server Database Services feature state" failed. The SQL Server Database Services feature failed when it was initially installed. The feature must be removed before the current scenario can proceed.
Above error means that when I installed SQL Server on first cluster node, it didn’t install properly. There were some errors while creating cluster and due to that AddNode is blocked. In my case also this is true because I had an issue with SQL Server Network Name Resource and it didn’t come online earlier. Below was the original error on Node 1
Cluster network name resource ‘SQL Network Name (Balmukund)’ cannot be brought online. The computer object associated with the resource could not be updated in domain ‘MyDomain.com’ for the following reason:
Unable to update password for computer account.
The text for the associated error code is: Access is denied.
The cluster identity ‘WinCluster$’ may lack permissions required to update the object. Please work with your domain administrator to ensure that the cluster identity can update computer objects in the domain.
I fixed above error by giving proper permission to cluster computer object on Balmukund (which was the network name of SQL) After fixing that I created SQL Server and SQL Agent manually. Sometime it may also happen that SQL Agent didn’t come online on Node 1 due to some failure and it was fixed after setup was completed.
So, if we see above rule failure then we should go to the first node and check setup log files to see if the installation succeeded or failed. If the first node installation had some failures which were fixed later then we should do a repair of the installation. Repair option can be found under the "Maintenance" page in SQL Server Installation Center (setup). This action will clear the failure state and allow the node addition action.
If you are lazy like me to do repair then you can also check registry key
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL12.SQL2014\ConfigurationState
If you see value as 2 for any component then there were some failures of those components. In most of the cases since issue would be with SQL Server or SQL Agent, we might see 2 for MPT_AGENT_CORE_CNI, SQL_Engine_Core_Inst, SQL_FullText_Adv, SQL_Replication_Core_Inst
Note: that we MUST fix the error on first node before taking shortcut of registry value. Once error is fixed, we can make the value of those component to 1 (means success)
The highlighted value MSSQL12.SQL2014 might vary based on SQL version and Instance name. The first piece for SQL 2008 it would be MSSQL10, for SQL 2008 R2 it would be MSSQL10_50, for SQL 2012 it would be MSSQL11 and for SQL 2014 it would be MSSQL12. Second piece is instance name (for me instance name is SQL2014)
Hope this helps!
Author: SQL Server 2012 AlwaysOn – Paperback, Kindle
Posted in Installation, Troubleshooting | Tagged: addnode, Rule, setup, SQL Server Database Services feature state | 2 Comments »