Saturday, June 27, 2009

tech: Deadlocks due collisions in "SQL Server's hashing algorithm" (June 2009 SQL Server UserGroup)

At the SQLServer usergroup there was a brief overview of a very technical problem - there appears to be a bug in SQLServer's "hashing algorithm" which under certain circumstances is triggering "deadlocks" to occur.

James' detailed blog:


gives plenty of technical details behind this intriguing problem, and also the extensive collaborative investigation it took to unearth and document this bug.

I have gone through this great blog piece, highlighting what I see as the key concepts and steps. I have add some further references to key background concepts. This shows the process I went through to understand this issue, hopefully you find my comments a helpful introduction into understand this complex topics.

To start with an overview of what "deadlocks" are and how to monitor for them:

So what then is a deadlock? This is when two processes each hold a lock or set of locks that is incompatible with a lock that the other process is trying to acquire. As neither one process can complete the two processes are said to be deadlocked. Rather than look on in morbid fascination SQL Server chooses a victim based on which of the two processes is easiest to roll back based on the amount of resources it has consumed. The victim is then rolled back.
If you have trace flags 1204, 1222 running or are using profiler/SQL Trace to capture deadlock events you can gather information on what processes were involved in the deadlock and which one or ones were chosen as the victim.

James then links to another SQL Server MVP blog which shows how to setup a simple deadlock:

Below are the steps to generate a deadlock so that the behaviour of a deadlock can be illustrated:


-- 1) Create Objects for Deadlock Example
USE TEMPDB

CREATE TABLE dbo.foo (col1 INT)
INSERT dbo.foo SELECT 1

CREATE TABLE dbo.bar (col1 INT)
INSERT dbo.bar SELECT 1

-- 2) Run in first connection
BEGIN TRAN
UPDATE tempdb.dbo.foo SET col1 = 1

-- 3) Run in second connection
BEGIN TRAN
UPDATE tempdb.dbo.bar SET col1 = 1
UPDATE tempdb.dbo.foo SET col1 = 1

-- 4) Run in first connection
UPDATE tempdb.dbo.bar SET col1 = 1


Connection two will be chosen as the deadlock victim

ie.

Server: Msg 1205, Level 13, State 50, Line 1
Transaction (Process ID 56) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.



Next James, returning to his "deadlock problem" uses standard Microsoft guidelines for analyzing deadlocks.

To understand James blog article you need to be aware of a hobtids. In the Microsoft guidelines hobtid's (the "hash value" for the "index key range" we are locking) are mentioned very very briefly:

[LOCK] KEY. Identifies the key range within an index on which a lock is held or requested. KEY is represented as KEY: db_id:hobt_id(index key hash value). For example, KEY: 6:72057594057457664 (350007a4d329).

Returning again to James' deadlock problem. Using the standard "deadlock chart" functionality in SQL Profiler, he can see two SPIDs (processes) are trying the same hobtid:

It is telling me that SPID 82 is deadlocked against SPID 85. It is saying that they were both trying to get exclusive locks for a ROW in hobt 72057594693222400.

How was this possible? Why should SPID 82 have a lock on any data being loaded by SPID 85? There was no overlapping data that would allow this to happen. Why was this only starting to happen now? It just didn’t make any sense. Initially you might think that this is due to page splits or some other aspect of the storage where the data ranges may overlap but you will see these are both “Key Locks”.

I ran this query hoping for some clarification.


SELECT OBJECT_NAME(p.[object_id]) AS ObjectName

,p.[object_id] AS ObjectID

,i.name AS IndexName

,hobt_id AS HoBTID

,partition_id AS PartitionID

,partition_number AS PartitionNo

FROM sys.partitions p

JOIN sys.indexes i

ON p.index_id = i.index_id

AND p.[object_id] = i.[object_id]

WHERE hobt_id = 72057594693222400;


This just returned the name of the target table and the primary key, so not much help to be found there. What I can tell you was that both queries were loading into the same partition in the same hobt_id on the same index.


i.e. there shouldn't be be any lock conflict with inserting two distinct records in parallel? As these conflicts were occurring during a large batch load process of million of rows, it was not obvious which inserted rows had conflicting "index key hash values".


Having hit this apparent dead-end, James asked around for help:


I was struggling to come up with an answer. So I thought I’d run the scenario past my good friend Simon Sabin to see if he had any ideas. Sure enough he came up with an undocumented feature in SQL Server 2005 and above (courtesy of the late Ken Henderson) that helped us understand this particular problem. The feature is called %%lockres%%.


By exporting his SQL Profiler deadlock charts into XML format he was able to see the two distinct "lock resource hash values":


waitresource="KEY: 7:72057594693222400 (0801a4a3d7f1)

waitresource="KEY: 7:72057594693222400 (0801bf3924dc)


Feeding the "lock resource hash values" into the following query using the "undocumented feature" %%lockres%%, helped James to reducing the problem to inserting just 4 rows:

By filtering our query for the day and also by the lock resource hashes that had given us this problem we could really cut into this dataset. As the Date field is the first key of the clustered index this meant a clustered index seek was used and so %%lockres%% returned the key hashes for our clustered index which was want we wanted.


select *,%%lockres%% from    dbo.Table(nolock) where    %%lockres%% IN('(0801a4a3d7f1)','(0801bf3924dc)') and        Date >= '2009-05-18'

Sure enough we had the same hash value appearing in different countries for the same data. Even though the data itself was unique SQL Server’s hashing algorithm was not able to achieve a unique result. When this occurs it is called a collision ... once you have the values you can see the problem with only 4 rows of data. Check this out:


CREATE TABLE #tmp 
(  [Date]       DATETIME                    , CountryID   TINYINT                    ,GroupID     SMALLINT                    ,CodeID      INT)

INSERT INTO #tmp 
SELECT '20090519'         ,2         ,4271         ,5835066 UNION ALL 
SELECT         '20090519'         ,2         ,4619         ,2546652 UNION ALL 
SELECT         '20090519'         ,3         ,4245         ,2651987 UNION ALL 
SELECT         '20090519'         ,3         ,4657         ,5744053;  

CREATE UNIQUE CLUSTERED INDEX tmp_PK  ON #tmp 
 (    [Date]         ,CountryID         ,GroupID         ,CodeID     ) 

SELECT   %%lockres%% as LockHash         ,* FROM    #tmp 
ORDER BY %%lockres%%         ,CountryID 

DROP TABLE #tmp

to see:image


CountryID 2 could deadlock with CountryID 3!


James then goes on to replicate an actual deadlock based on these inserting these 4 rows and also discusses mitigation strategies, if you have got through to this point the rest of his blog piece is easy reading ;-)


No comments: