Wednesday, March 21, 2012

How to find the median?

I've noticed that SQL Server (and other DBMSs I've looked at) doesn't seem to have a built-in function for finding the median of a range of numbers.
Gack!gack is right

see this thread (http://dbforums.com/showthread.php?t=1210382&highlight=median) from way back in the archives|||Your required solution may not be quite as complicated as the one linked to. That particular poster needed the median of some not-yet-calculated-values so these values are calculated in derived tables.

If you simply have a table with several values that you want to find the median of then the SQL is simpler.|||Median does not have a set based solution. You have to iterate in order to find the median (order matters to compute a median). Engines that process a row at a time do medians easily, engines that process sets have a real problem with it.

-PatP|||okay, which is it? "does not have a set based solution" or "have a real problem with it"?

what about this type of solution:SELECT
CASE WHEN COUNT(*)%2=1
THEN x.Hours
ELSE (x.Hours+MIN(CASE WHEN y.Hours>x.Hours
THEN y.Hours
END))/2.0
END median
FROM BulbLife x, BulbLife y
GROUP BY x.Hours
HAVING
SUM(CASE WHEN y.Hours <= x.Hours
THEN 1 ELSE 0 END)>=(count(*)+1)/2 AND
SUM(CASE WHEN y.Hours >= x.Hours
THEN 1 ELSE 0 END)>=(count(*)/2)+1

-- from Transact-SQL Cookbook Chapter 8 Statistics in SQL (http://www.oreilly.com/catalog/transqlcook/chapter/ch08.html)|||Or:


SELECT A.TheValue AS TheMedian
FROM (SELECT DISTINCT TheValue FROM MyTable) A, MyTable B
GROUP BY A.TheValue
HAVING sum(SIGN(A.TheValue - B.TheValue )) IN (0, -1)|||Beg your pardon - blummin' thing has failed.|||The Median depends on order in every case, and non-deterministic rows for data with an even number of elements. Because Median depends on element order, there can't be any set based solution, because sets have no order.

I never said that you can't solve it with SQL, that's a very different claim. There are several ways to do that. You still can't solve it declaratively, and you can't solve it with sets.

-PatP|||sets may have no order, but order can still be imposed with predicates

i ask you, are you familiar with the query to rank result rows? it's a theta self-join, and what it does is count the rows that are greater (or lesser) than, based on the value of a column, then adds 1 to get the rank

same thing with partitioning a single set into two -- those values that are higher, and those values that are lower

and those are sets

:)|||there can't be any set based solutionwhat do you call the solution in post #5, please?|||On the surface, I would call it a performance problem ;-). How many rows in BulbLife? Never knew you could declare a "join" in the having clause. Or does the optimizer see it that way?|||On the surface, I would call it a performance problem ;-). okay, i see the smiley, so i won't respond the way i would've if it weren't there, which would've been something along the lines of "oh, so for challenging queries, you dump the data and do it in excel, eh?" :)

btw, there is no join in the HAVING clause (if you are referring to post #5)|||what do you call the solution in post #5, please?I'd call that a Transact-SQL solution.

-PatP|||And here I thought you came up with this on your own...

Looks like a must have book though...|||I'd call that a Transact-SQL solution.you are a slippery eel today, but squirm all you want, buddy, i think i gotcha

what part of it makes it Transact-SQL? MIN? CASE? COUNT?

looks to me like it's all SQL

or are you referring to the fact that the solution doesn't use FLOOR?

and what about my questions regarding sets?

is that or is that not a set-based solution?

eel, i say

:)|||You two go ahead and argue terminology, while I go ahead and give the solution. ;)
My shot:
set nocount on

create table #MyData (MyKey int identity, MyValue int)

insert into #MyData (MyValue)
select 1
UNION ALL select 2
UNION ALL select 5
UNION ALL select 2
UNION ALL select 5
UNION ALL select 2
UNION ALL select 6
UNION ALL select 2
UNION ALL select 5
UNION ALL select 2
UNION ALL select 6
UNION ALL select 8
UNION ALL select 3
UNION ALL select 5
UNION ALL select 5
UNION ALL select 2
UNION ALL select 6
UNION ALL select 2
UNION ALL select 6
UNION ALL select 2
UNION ALL select 5
UNION ALL select 2
UNION ALL select 6
UNION ALL select 8
UNION ALL select 3
UNION ALL select 5
UNION ALL select 4
UNION ALL select 2
UNION ALL select 3
UNION ALL select 3

select sum(OrderedData.MyValue)/2.0
from --OrderedData
(select #MyData.MyValue,
#MyData.MyKey,
count(*) as Ordinal
from #MyData
inner join #MyData Ordinals
on #MyData.MyValue > Ordinals.MyValue
or (#MyData.MyValue = Ordinals.MyValue
and #MyData.MyKey > Ordinals.MyKey)
group by #MyData.MyValue,
#MyData.MyKey) OrderedData
inner join --MedianRecords
(select floor(0.5 + count(*)/2.0) as Record1,
ceiling(1.0 + count(*)/2.0) as Record2
from #MyData) MedianRecords
on OrderedData.Ordinal = MedianRecords.Record1
or OrderedData.Ordinal = MedianRecords.Record2

drop table #MyData|||way to go, blindman

your post reminds me of this quote (which i have on my spanking-new redesigned web site; comments welcomed):The person who says it can't be done should not interrupt the person doing it

Chinese proverb

no idea if your solution works, but i would definitely have to say, it shore looks like a set-based solution to me

:)|||okay, which is it? "does not have a set based solution" or "have a real problem with it"?

what about this type of solution:SELECT
CASE WHEN COUNT(*)%2=1
THEN x.Hours
ELSE (x.Hours+MIN(CASE WHEN y.Hours>x.Hours
THEN y.Hours
END))/2.0
END median
FROM BulbLife x, BulbLife y
GROUP BY x.Hours
HAVING
SUM(CASE WHEN y.Hours <= x.Hours
THEN 1 ELSE 0 END)>=(count(*)+1)/2 AND
SUM(CASE WHEN y.Hours >= x.Hours
THEN 1 ELSE 0 END)>=(count(*)/2)+1

-- from Transact-SQL Cookbook Chapter 8 Statistics in SQL (http://www.oreilly.com/catalog/transqlcook/chapter/ch08.html)

That is precisely the formula we used in the end and it worked great, though it takes a long time to run. I ended up throwing a lot of data into temp tables to make the query more efficient. :)|||you might also be interested in some of the solutions here -->
http://tek-tips.com/viewthread.cfm?qid=1193016|||you might also be interested in some of the solutions here -->
http://tek-tips.com/viewthread.cfm?qid=1193016

I may play around with some of those to get the query to run faster. The self-join from the O'Reilly book has to be done on a rather large table on our database so we run the query three times with different criteria to cut down the number of rows we're operating on and insert this data into temp tables. We then calculate the median individually on each table (since we're after more than one median value; we're going for the median value in each of three columns) and insert them into another temp table; we then append the target table with the info from the last temp table. If we do it any other way it takes an age; even this solution takes the SQL Server about 20 seconds to chew its way through it. Doesn't help that our SQL server is 200 miles away, connected to us by a vewy vewy slow network. :eek: The first time I ran the non-optimized solution, I got snarky e-mails from the DBAs griping that I was 'consuming resources'! Hahaha! They don't like the one we're using right now, but since we only run it once a week they can live with it. :D

Then again, optimizing queries isn't my strong suit. I can write them so they work, but I can see it's going to take some tweaking to get them to run quickly. We have these updates we have to run each week, appending between 1,000 and 10,000 new rows to base data tables on four separate DBs. Doesn't help that some of them have 50 columns, so they're both 'tall' and 'fat' if that makes sense. :shocked:

No comments:

Post a Comment