Tuesday, 5 February 2008

Linking 2 tables based on row positions - SQL2000 vs SQL2005

A simple (and common??) scenario - you have 2 tables of ordered integers and want to combine them into one recordset where the first row of the first table is paired with the first row of the second table and so on, like


Table1
val(int)
---
1
3
5
9

Table2
val(int)
---
4
9
14
17


we want to return


val val
--- ---
1   4
3   9
5   14
9   17


And we want to achieve this WITHOUT the use of cursors (obviously!).

If we could assign a row number to each table then we can join on the row number, row1 = row1, row2=row2, etc... now SQL 2005 gives us a couple of ways of doing this using either the RANK() function or the ROW_NUMBER() function.


if object_id('tempdb..#t1') is not null drop table #t1
if object_id('tempdb..#t2') is not null drop table #t2

create table #t1 (val int)
create table #t2 (val int)

insert into #t1 values (1)
insert into #t1 values (3)
insert into #t1 values (5)
insert into #t1 values (9)

insert into #t2 values (4)
insert into #t2 values (9)
insert into #t2 values (14)
insert into #t2 values (17)


using RANK()


select t3a.val, t3b.val
from
(select val, rank() over (order by val) as seq from #t1) t3a
join
(select val, rank() over (order by val) as seq from #t2) t3b
on t3a.seq=t3b.seq


or using ROW_NUMBER()


select t3a.val, t3b.val
from
(select val, row_number() over (order by val) as seq from #t1) t3a
join
(select val, row_number() over (order by val) as seq from #t2) t3b
on t3a.seq=t3b.seq


But in SQL2000 these functions do not exist. An elegant way to achieve this is to join the table to itself and determine the 'rank' by counting the number of elements that are less than or equal to the current i.e.


select t11.val, count(*) as seq
from #t1 t11, #t1 t12 where t11.val>=t12.val group by t11.val


returns


val seq
--- ---
1   1
3   2
5   3
9   4


hence we can use this as the ranking function


select t3a.val, t3b.val
from
(select t11.val, count(*) as seq
from #t1 t11, #t1 t12 where t11.val>=t12.val group by t11.val) t3a
join
(select t21.val, count(*) as seq
from #t2 t21, #t2 t22 where t21.val>=t22.val group by t21.val) t3b
on t3a.seq=t3b.seq


[Credit goes to Richard Dean for the original idea]

SQL Server scalability

A while back I was asked to look at the scalability of our database environment, and what could be done to achieve a 'doubling in capacity' for what was going to be a busy trading period. [For info : the db was supporting a load-balanced server farm of web servers (providing both UI and web services) - daily traffic was expected to hit the 1 million mark]

Here is a summary of what my research found....

Scalability of SQL server can be achieved 2 ways - either scaling up or scaling out. Scaling up involves increasing the performance of a single server, i.e. adding processors, replacing older processors with faster processors, adding memory etc.., and is essentially what can be acheived by replacing existing hardware with a newer, bigger, faster machine. Scaling up works in the short term as long as the budget allows, however there are ultimately levels of volume that one server cannot handle.

Scaling out is to distribute the load across many (potentially lower cost) servers - this concept works for the infinite long-term, and is constrained by the amount of physical space, power supply, etc..., far less than the constraints of budget and individual server performance.

There are 2 approaches to scaling out:

  1. multiple servers that appear to the website as a single database - known as federated databases - the workload is then spread across all servers in the federation.
  2. multiple servers that appear as individual servers, however the workload is load balanced or the servers are dedicated to a particular function or work stream (e.g. a set of 1 or more web servers)

The main concern at the heart of any scaled-out SQL solution, is how to ensure that data is consistent across all the database servers. With option 1, the data is consistent as there is only one database, and data is spread rather than replicated across each server. The main downfall of this approach is the poor availability - if one of the federated databases fails, the whole federation fails.

With option 2, some data needs to be replicated from all servers to all servers. There are many methods of implementing such an architecture, however it seems to be clear that this general approach is the best, long-term, for scalability in this environment. Some data can differ across each database server as long as the user remains 'sticky' to that database (for example, session information), investigations started into this idea, using the web servers to remain 'sticky' to a particular database. [The idea was to split the load-balanced web farm over 2 database servers and issue a cookie from the Content Services Switch to keep the users on one half of the web farm - we never actually got this to work successfully]

It is worth pointing out at this stage a common misunderstanding about clustering SQL servers using Microsoft Cluster Services. Clustering itself does NOT provide scalability - it only provides high-availability. The only true way to scale a clustered SQL solution is to scale up - i.e. increase the power of each server in the cluster.