Fixing Least Squares' Greatest Flaw, Outliers

Least squares regression is a powerful tool, but as we learned it has its flaws. Chief among them is that the method is generally used under the assumption that the errors are normally distributed. As a result, outliers can pull the fitted line toward themselves and greatly reduce its accuracy. Even a single large outlier can invalidate a regression of hundreds of points. To counter this lack of robustness, variations on standard linear regression have been developed.
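
To make the effect of a single outlier concrete, here is a minimal sketch using NumPy. The data are made up for illustration (100 points exactly on the line y = 2x + 1, with one point then corrupted by a large error); none of these numbers come from the post or the paper.

```python
import numpy as np

# Hypothetical data: 100 points lying exactly on y = 2x + 1.
x = np.arange(100, dtype=float)
y = 2.0 * x + 1.0

# Ordinary least squares line; np.polyfit returns [slope, intercept] for degree 1.
slope_clean, intercept_clean = np.polyfit(x, y, 1)

# Corrupt a single point at the end of the range with a large error.
y_bad = y.copy()
y_bad[-1] += 1000.0
slope_bad, intercept_bad = np.polyfit(x, y_bad, 1)

print(f"clean fit:        slope={slope_clean:.3f}, intercept={intercept_clean:.3f}")
print(f"with one outlier: slope={slope_bad:.3f}, intercept={intercept_bad:.3f}")
```

One bad point out of a hundred is enough to move both the slope and the intercept noticeably away from the true values, because squaring the errors gives that one point a very large weight.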

One useful method is an extension of standard linear regression that eliminates data points lying beyond a certain cutoff from their expected value. As with the regular method, the data should be plotted and an ordinary least squares line fit to the points. Based on this line, the standard deviation of the points about the regression line can be calculated. If the standard deviation is 0 (or 0 to machine precision), the line fits perfectly and nothing more needs to be done. If it is greater than 0, there may be outliers, and a better fit may be possible by excluding some points. The question is which points are inaccurate or insignificant and should be omitted. For this, confidence intervals from general statistics are used: a data point that deviates from the line by more than 4 standard deviations is excluded. The choice of 4 standard deviations is not a hard rule, but if the errors are normally distributed it corresponds to a confidence level above 99%.
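
Written out, the exclusion rule might look like the following. The notation is my own (a and b for the fitted intercept and slope, n for the number of points, s for the standard deviation about the line), and the n - 2 divisor is an assumption: it is the usual choice for a two-parameter line, but the exact formula is not spelled out here.

```latex
s = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\bigl(y_i - (a + b x_i)\bigr)^2},
\qquad
\text{exclude point } i \text{ if } \bigl\lvert y_i - (a + b x_i) \bigr\rvert > 4s
```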

After calculating each point's deviation from the line and excluding the points whose deviation exceeds the specified cutoff, a new regression can be fit to the remaining points. The procedure is then repeated with the new line and its new standard deviation to see whether any further points should be excluded, and this continues until no more points are excluded. As shown below in a graph from Cornbleet and Gochman’s paper, with just two outliers and two iterations of the method, a regression line can be made significantly more accurate.
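
The whole procedure can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the code from the paper: the function name `fit_excluding_outliers`, the n - 2 divisor for the standard deviation, the rule of never dropping below three points, and the test data are all my own choices.

```python
import numpy as np

def fit_excluding_outliers(x, y, cutoff=4.0):
    """Refit a least squares line until no retained point deviates by more than cutoff * s.

    Returns (slope, intercept, keep), where keep is a boolean mask of retained points.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.size < 3:
        raise ValueError("need at least 3 points")
    keep = np.ones(x.size, dtype=bool)
    while True:
        # Fit only the points that are still retained.
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        resid = y - (slope * x + intercept)
        # Standard deviation about the line (assumed divisor: n - 2).
        s = np.sqrt(np.sum(resid[keep] ** 2) / (keep.sum() - 2))
        outliers = keep & (np.abs(resid) > cutoff * s)
        # Stop when nothing is excluded, or when too few points would remain.
        if not outliers.any() or keep.sum() - outliers.sum() < 3:
            return slope, intercept, keep
        keep &= ~outliers

# Hypothetical data: y = 2x + 1 with small bounded noise and two gross errors.
x = np.arange(100, dtype=float)
y = 2.0 * x + 1.0 + 0.5 * np.sin(x)
y[10] += 50.0
y[80] -= 40.0

slope, intercept, keep = fit_excluding_outliers(x, y)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, excluded={np.flatnonzero(~keep)}")
```

The loop always terminates, since each pass either removes at least one point or returns. Note that a very large outlier inflates s on the first pass, so with few data points a 4 standard deviation cutoff may fail to flag it.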

[Figure: graph from Cornbleet and Gochman’s paper]

While this method may help reduce error, it should be noted that there are cases where outliers are important and should not be ignored. This method is generally safe and commonly used in cases where machine measurements return inaccurate data.

Source

http://www.clinchem.org/cgi/reprint/25/3/432.pdf

One response to “Fixing Least Squares' Greatest Flaw, Outliers”

  1. predictor Says:

    Another alternative is least absolute errors regression (sometimes called “L-1 regression”). As the name implies, absolute errors, rather than squared ones, are minimized. Interestingly, this method of regression pre-dates the now-popular least squares variety. I wrote about this technique briefly, and about its implementation in MATLAB, in my Oct-23-2007 posting, “L-1 Linear Regression”:

    http://matlabdatamining.blogspot.com/2007/10/l-1-linear-regression.html

    Even more resistant to outliers are the various forms of robust regression, which weight, rather than exclude observations.
