Another MySQL question about averages. Remove outliers.

Question

I have a number of observations that I can get from a MySQL database.

Let's assume the values I have are, ordered ascending:

3, 5, 5, 6, 6, 6, 8, 9, 15.

Just calculating the AVG results in 63/9 = 7.00

If I want to repeat that but leave out both the minimum and the maximum values, in this case both the 3 and the 15, giving the following list:

5, 5, 6, 6, 6, 8, 9

then the average is only 45/7 = 6.43 to gain more statistical accuracy, how do I need to modify my query?

Currently the query is like this:

$query_Recordset1 = 'SELECT COUNT(Graphics_card) AS Obs, Graphics_card, AVG(MPE_Off/MPE_On) AS Gain , AVG(MPE_On) AS Speed FROM `Personal_data` WHERE MPE_On >0 GROUP BY Graphics_card HAVING COUNT(Graphics_card) > 4 ORDER BY Speed ASC, Obs DESC, Gain DESC LIMIT 0, 20' ;

In this query the minimum number of observations is 5 for each distinct 'Graphics_card'. In the future I may change that to another figure and that is easy, just change the bold 4 to another figure but then, I may want to recalculate the adjusted average by leaving out the two bottom observations and the two top observations, so the range in the calculation would look like this:

5, 6, 6, 6, 8

and then the adjusted average is only 31/5 = 6.20

How do I need to adjust my query to achieve that?

Or, to put it another way, how can I first exclude the top X and bottom X observations from the query and only then calculate the AVG?

Here is the page I want to apply it to: http://ppbm5.com/MPE%20Charts.php

Harm_Millaard · Accepted Answer

Agreed. Also take a look at http://www.sqlservercentral.com/articles/Miscellaneous/sqlserverstatisticalfunctions/1255/

There are a few techniques that could work - look at Catering for Extremes and Inter-Quartile ranges.

EDIT: I also see that you are running MySQL. I know there are a few statistical analysis add-ons for MySQL - these might help.

I solved it for the first part like this:

SELECT COUNT(Graphics_card) AS Obs, Graphics_card, (SUM(MPE_On) - MAX(MPE_On))/(COUNT(Graphics_card)-1) AS Speed, (SUM(MPE_Off/MPE_On) - MIN(MPE_Off/MPE_On))/(COUNT(Graphics_card)-1) AS Gain FROM `Personal_data` WHERE MPE_On >0 GROUP BY Graphics_card HAVING COUNT(Graphics_card) > 5 ORDER BY Speed ASC LIMIT 0, 20

This effectively removes the most significant outlier (MAX for Speed and MIN for Gain) and calculates the average based on the number of observations minus 1 for each category. In the same way I could remove the outlier at the other side of the spectrum, MIN for Speed and MAX for Gain, but then reduce the number of observations by 2 instead of 1.

SELECT COUNT(Graphics_card) AS Obs, Graphics_card, (SUM(MPE_On) - MAX(MPE_On) - MIN(MPE_On))/(COUNT(Graphics_card)-2) AS Speed, (SUM(MPE_Off/MPE_On) - MIN(MPE_Off/MPE_On) - MAX(MPE_Off/MPE_On))/(COUNT(Graphics_card)-2) AS Gain FROM `Personal_data` WHERE MPE_On >0 GROUP BY Graphics_card HAVING COUNT(Graphics_card) > 5 ORDER BY Speed ASC LIMIT 0, 20

I struggled with the Group function to use the suggestions in the link you gave me and of course the different functionality between MySQL and Oracle SQL.

I look forward to having a MEDIAN function in MySQL.

bregent · Answer

>then the average is only 45/7 = 6.43 to gain more

>statistical accuracy, how do I need to modify my query?

In your example, you could easily reduce 3, 5, 5, 6, 6, 6, 8, 9, 15 to 5, 5, 6, 6, 6, 8, 9 by including a filter in the where clause to eliminate entries that are equal to the min, or equal to the max. However, this may not be what you want. That would eliminate all values equal to the min and max, so

3, 3, 3, 5, 5, 6, 6, 6, 8, 9, 15, 15, 15 would also turn into 5, 5, 6, 6, 6, 8, 9

Is that OK, or do you need to eliminate only the first and last value?

Sign up

To post, reply, or follow discussions, please sign in with your Adobe ID.

Sign in to Adobe Community

To post, reply, or follow discussions, please sign in with your Adobe ID.

Scanning file for viruses.

This file cannot be downloaded