Monday, June 22, 2009

tech: SQLServer - GUIDS and clustered index



One of the differences between Oracle and SQL Server, is the dominance of clustered tables within SQL Server applications.

Briefly the concept of a clustered table is that the data is phyisical stored in the order of the primary key. For example if your order table has a primary key on order_num in the format yyyymmddnnn i.e. if you take 37 orders on the 22rd June 2009 then the order numbers would be 20090622001, 20090623002, 20090623003,... 20090623037. In this example as new orders come through, they are added at the end of the table/index and all works well.

However one commonly reported problem within the Microsoft software development community is when the developer is using a GUID (a 16-byte unique identifier generated by Microsoft programs), that quite naturally they use their GUID field as the unique/primary key. Now as each new record / GUID is created, they are not in a sequential order and so this makes a big mess of your table storage (page splits and fragmentation). The solution in SQL2005 was the NEWSEQUENTIALID() function:

CREATE TABLE myTable (ColumnA uniqueidentifier DEFAULT NEWSEQUENTIALID())
Creates a GUID that is greater than any GUID previously generated by this function on a specified computer.

This is still an area of confusion amongst some developers, the following well-written question was posted on stackoverflow in Nov08:


Should I get rid of clustered indexes on Guid columns
I am working on a database that usually uses GUIDs as primary keys.
By default SQL Server places a clustered index on primary key columns. I understand that this is a silly idea for GUID columns, and that non-clustered indexes are better.
What do you think - should I get rid of all the clustered indexes and replace them with non-clustered indexes?
Why wouldn't SQL's performance tuner offer this as a recommendation?
http://stackoverflow.com/questions/277625/should-i-get-rid-of-clustered-indexes-on-guid-columns


The responses to the question are that (a) SQL Server works best with clustered indexes and (b) the old fragmentation problems are preventable via newsequentialid()

With SQL 2005 and newsequentialid(), the fragmentation problem goes away to a large extent. It's best to measure by looking at sys.dm_db_index_physical_stats and sys_indexes. – RoadWarrior

Great that answers how to use GUIDs and Clustered Indexes.

However I have one further question, why are clustered table the "default" but their equivalent structure within Oracle, the Index Orientated Table is rare:

Index-organized tables provide faster access to table rows by the primary key. Also, since rows are stored in primary key order, range access by the primary key involves minimum block accesses. In order to allow even faster access to frequently accessed columns, the row overflow storage option can be used to push out infrequently accessed non-key columns from the B-tree leaf block to an optional overflow storage area. This limits the size and content of the row portion actually stored in the B-tree leaf block, resulting in a smaller B-tree and faster access.

Intuitively the constraints of having add rows in the order of the primary key, makes these sort of table tricky to work with; this may explain why these tables are not so popular in the Oracle world.

One clue may also be that "SQL Server has no command for reorganizing a heap table":

Another benefit of clustered indexes is that they assist in reorganizing a table. Unlike other database engines, SQL Server has no command for reorganizing a heap table. It only works on indexes. So, if there is a badly fragmented table, the DBA is usually stuck with executing a table purge, copying all the data out, truncating the table, and copying all the data back in. This feat is not easy to accomplish in an environment with high amounts of enforced referential integrity.

If the table has a clustered index, the DBA can simply rebuild the clustered index, which reorganizes the data pages because the leaf level of the clustered index is the data.



No comments: