Skip to main content
Harm_Millaard
Inspiring
September 19, 2011
Answered

Another MySQL question about averages. Remove outliers.

  • September 19, 2011
  • 1 reply
  • 2957 views

I have a number of observations that I can get from a MySQL database.

Let's assume the values I have are, ordered ascending:

3, 5, 5, 6, 6, 6, 8, 9, 15.

Just calculating the AVG results in 63/9 = 7.00

If I want to repeat that but leave out both the minimum and the maximum values, in this case both the 3 and the 15, giving the following list:

5, 5, 6, 6, 6, 8, 9

then the average is only 45/7 =  6.43 to gain more statistical accuracy, how do I need to modify my query?

Currently the query is like this:

$query_Recordset1 = 'SELECT COUNT(Graphics_card) AS Obs, Graphics_card, AVG(MPE_Off/MPE_On) AS Gain , AVG(MPE_On) AS Speed FROM `Personal_data` WHERE MPE_On >0 GROUP BY Graphics_card HAVING COUNT(Graphics_card) > 4 ORDER BY Speed ASC, Obs DESC, Gain DESC LIMIT 0, 20' ;

In this query the minimum number of observations is 5 for each distinct 'Graphics_card'. In the future I may change that to another figure and that is easy, just change the bold 4 to another figure but then, I may want to recalculate the adjusted average by leaving out the two bottom observations and the two top observations, so the range in the calculation would look like this:

5, 6, 6, 6, 8

and then the adjusted average is only 31/5 = 6.20

How do I need to adjust my query to achieve that?

Or, to put it another way, how can I first exclude the top X and bottom X observations from the query and only then calculate the AVG?

Here is the page I want to apply it to: http://ppbm5.com/MPE%20Charts.php

This topic has been closed for replies.
Correct answer Harm_Millaard

Agreed. Also take a look at http://www.sqlservercentral.com/articles/Miscellaneous/sqlserverstatisticalfunctions/1255/

There are a few techniques that could work - look at Catering for Extremes and Inter-Quartile ranges.

EDIT: I also see that you are running MySQL. I know there are a few statistical analysis add-ons for MySQL  - these might help.


I solved it for the first part like this:

SELECT COUNT(Graphics_card) AS Obs, Graphics_card, (SUM(MPE_On) - MAX(MPE_On))/(COUNT(Graphics_card)-1) AS Speed, (SUM(MPE_Off/MPE_On) - MIN(MPE_Off/MPE_On))/(COUNT(Graphics_card)-1) AS Gain FROM `Personal_data` WHERE MPE_On >0 GROUP BY Graphics_card HAVING COUNT(Graphics_card) > 5 ORDER BY Speed ASC LIMIT 0, 20

This effectively removes the most significant outlier (MAX for Speed and MIN for Gain) and calculates the average based on the number of observations minus 1 for each category. In the same way I could remove the outlier at the other side of the spectrum, MIN for Speed and MAX for Gain, but then reduce the number of observations by 2 instead of 1.

SELECT COUNT(Graphics_card) AS Obs, Graphics_card, (SUM(MPE_On) - MAX(MPE_On) - MIN(MPE_On))/(COUNT(Graphics_card)-2) AS Speed, (SUM(MPE_Off/MPE_On) - MIN(MPE_Off/MPE_On) - MAX(MPE_Off/MPE_On))/(COUNT(Graphics_card)-2) AS Gain FROM `Personal_data` WHERE MPE_On >0 GROUP BY Graphics_card HAVING COUNT(Graphics_card) > 5 ORDER BY Speed ASC LIMIT 0, 20

I struggled with the Group function to use the suggestions in the link you gave me and of course the different functionality between MySQL and Oracle SQL.

I look forward to having a MEDIAN function in MySQL.

1 reply

Participating Frequently
September 20, 2011

>then the average is only 45/7 =  6.43 to gain more

>statistical accuracy, how do I need to modify my query?

In your example, you could easily reduce 3, 5, 5, 6, 6, 6, 8, 9, 15 to 5, 5, 6, 6, 6, 8, 9 by including a filter in the where clause to eliminate entries that are equal to the min, or equal to the max. However, this may not be what you want. That would eliminate all values equal to the min and max, so

3, 3, 3, 5, 5, 6, 6, 6, 8, 9, 15, 15, 15 would also turn into 5, 5, 6, 6, 6, 8, 9

Is that OK, or do you need to eliminate only the first and last value?

Harm_Millaard
Inspiring
September 20, 2011

Only the first X records and the last Y records from the ordered set . Usually X = Y and for now could be equal to 1. I want to avoid the exact problem you meant and the number of observations for each category is pretty small.

For instance when you have 3, 3, 3, 5, 5, 5, 6, 6, 6, 6, 99 then 99 is a clear outlier. The alternative would be to only use observations that are within + or - X standard deviations, where with X = 2, you would get 95% of the obeservations. I found this snippet of information http://jasonsouthwell.com/removing-outliers-in-a-sql-server-query but am confused by the w and d selections.

Participating Frequently
September 20, 2011

Agreed. Also take a look at http://www.sqlservercentral.com/articles/Miscellaneous/sqlserverstatisticalfunctions/1255/

There are a few techniques that could work - look at Catering for Extremes and Inter-Quartile ranges.

EDIT: I also see that you are running MySQL. I know there are a few statistical analysis add-ons for MySQL  - these might help.