# Robust regression

The ordinary least squares estimates for linear regression are optimal when all of the regression assumptions are valid; when some of these assumptions are invalid, least squares regression can perform poorly. Residual diagnostics can help locate where the breakdown in assumptions occurs, but they can be time consuming and sometimes difficult for the untrained eye to interpret. Robust regression methods provide an alternative to least squares regression by requiring less restrictive assumptions. These methods attempt to dampen the influence of outlying cases in order to provide a better fit to the majority of the data. Outliers tend to pull the least squares fit too far in their direction by receiving much more "weight" than they deserve.

As a result, outliers may receive considerably more weight than they should, leading to distorted estimates of the regression coefficients.

This distortion also makes outliers difficult to identify, since their residuals are much smaller than they would be if the distortion were not present. As we have seen, scatterplots may be used to assess outliers when a small number of predictors are present.

However, the complexity added by additional predictor variables can hide the outliers from view in these scatterplots. Robust regression down-weights the influence of outliers, which makes their residuals larger and easier to identify. For our first robust regression method, suppose we have a data set of size n with observations $(\mathbf{x}_i, y_i)$. Because least squares minimizes the sum of squared residuals, observations with large residuals (and hence very large squared residuals) pull the least squares fit strongly in their direction. The least absolute deviation (LAD) estimator instead minimizes the sum of absolute residuals:

$$\hat{\beta}_{\mathrm{LAD}} = \arg\min_{\beta} \sum_{i=1}^{n} \left| y_i - \mathbf{x}_i^{\top} \beta \right|.$$
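As an illustrative sketch (my own example, not from the original text), the LAD estimator can be fit numerically; here `scipy.optimize.minimize` is applied to simulated data containing one gross outlier in the response:

```python
import numpy as np
from scipy.optimize import minimize

def lad_fit(x, y):
    """Least absolute deviations: minimize sum_i |y_i - b0 - b1*x_i|."""
    X = np.column_stack([np.ones_like(x), x])
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS fit as a starting point
    obj = lambda b: np.abs(y - X @ b).sum()
    return minimize(obj, b_ols, method="Nelder-Mead").x

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 0.5, size=50)
y[-1] += 40                       # one gross outlier in the response
b_lad = lad_fit(x, y)             # slope stays close to the true value 3
```

Unlike the least squares slope, which is pulled noticeably toward the outlier, the LAD slope remains near the value used to generate the data.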

Another quite common robust regression method falls into a class of estimators called M-estimators (there are also related classes, such as R-estimators and S-estimators, whose properties we will not explore). Formally defined, M-estimators are given by

$$\hat{\beta}_{M} = \arg\min_{\beta} \sum_{i=1}^{n} \rho\left( y_i - \mathbf{x}_i^{\top} \beta \right),$$

where $\rho$ is a loss function that grows more slowly than the square.

Some M-estimators are influenced by the scale of the residuals, so a scale-invariant version of the M-estimator is used:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \rho\left( \frac{y_i - \mathbf{x}_i^{\top} \beta}{\tau} \right),$$

where $\tau$ is a robust estimate of the scale of the residuals (such as the median absolute deviation). Minimization of the above is accomplished primarily in two steps: estimate the scale $\tau$ from the current residuals, then minimize over $\beta$ with $\tau$ held fixed, iterating until convergence. Common choices of the loss function include Andrews' sine, Huber's method, and Tukey's biweight. This material is offered as an introduction to an advanced topic and, given its technical nature, could be considered optional in the context of this course.
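The two-step minimization is commonly carried out by iteratively reweighted least squares (IRLS). The following Python sketch (an illustration, not the course's code) uses Huber's weight function with the conventional tuning constant 1.345 and a MAD-based scale estimate:

```python
import numpy as np

def huber_irls(x, y, k=1.345, n_iter=50):
    """IRLS for a scale-invariant Huber M-estimate of a straight line."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]          # start from OLS
    for _ in range(n_iter):
        r = y - X @ b
        # robust scale estimate: median absolute deviation / 0.6745
        scale = np.median(np.abs(r - np.median(r))) / 0.6745
        u = r / (scale + 1e-12)
        # Huber weights: 1 inside [-k, k], k/|u| outside
        w = np.minimum(1.0, k / np.maximum(np.abs(u), 1e-12))
        sw = np.sqrt(w)
        b = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return b

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60)
y = 1 + 2 * x + rng.normal(0, 0.4, size=60)
y[:3] += 25                       # a few gross outliers
b_huber = huber_irls(x, y)        # close to the true (intercept, slope) = (1, 2)
```

Each pass re-estimates the scale from the current residuals and then solves a weighted least squares problem, which is exactly the two-step scheme described above.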

In robust statistics, robust regression is a form of regression analysis designed to overcome some limitations of traditional parametric and non-parametric methods. Regression analysis seeks to find the relationship between one or more independent variables and a dependent variable. Certain widely used methods of regression, such as ordinary least squares, have favourable properties if their underlying assumptions are true, but can give misleading results if those assumptions are not true; thus ordinary least squares is said to be not robust to violations of its assumptions.

Robust regression methods are designed to be not overly affected by violations of assumptions by the underlying data-generating process. In particular, least squares estimates for regression models are highly sensitive to outliers.

While there is no precise definition of an outlier, outliers are observations which do not follow the pattern of the other observations. This is not normally a problem if the outlier is simply an extreme observation drawn from the tail of a normal distribution, but if the outlier results from non-normal measurement error or some other violation of standard ordinary least squares assumptions, then it compromises the validity of the regression results if a non-robust regression technique is used.

One instance in which robust estimation should be considered is when there is a strong suspicion of heteroscedasticity. In the homoscedastic model, it is assumed that the variance of the error term is constant for all values of x.

Heteroscedasticity allows the variance to be dependent on x, which is more accurate for many real scenarios. For example, the variance of expenditure is often larger for individuals with higher income than for individuals with lower incomes. Software packages usually default to a homoscedastic model, even though such a model may be less accurate than a heteroscedastic model.
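As a concrete sketch (assuming, for illustration, that the error standard deviation grows linearly with x), a heteroscedastic model can be fit by weighted least squares, with weights inversely proportional to the assumed variance:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 200)
y = 2 + 3 * x + rng.normal(0, 0.5 * x)       # noise sd proportional to x
X = np.column_stack([np.ones_like(x), x])

w = 1.0 / x**2                                # assumed Var(e_i) proportional to x_i^2
sw = np.sqrt(w)
# weighted least squares: scale rows by sqrt(weight), then solve OLS
b_wls = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
```

With correctly specified weights, the WLS estimates are more efficient than ordinary least squares on the same heteroscedastic data.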

One simple approach (Tofallis) is to apply least squares to percentage errors, as this reduces the influence of the larger values of the dependent variable compared to ordinary least squares. Another common situation in which robust estimation is used occurs when the data contain outliers.

In the presence of outliers that do not come from the same data-generating process as the rest of the data, least squares estimation is inefficient and can be biased.

Because the least squares predictions are dragged towards the outliers, and because the variance of the estimates is artificially inflated, the result is that outliers can be masked. In many situations, including some areas of geostatistics and medical statistics, it is precisely the outliers that are of interest. Although it is sometimes claimed that least squares or classical statistical methods in general are robust, they are only robust in the sense that the type I error rate does not increase under violations of the model.

In fact, the type I error rate tends to be lower than the nominal level when outliers are present, and there is often a dramatic increase in the type II error rate.

The reduction of the type I error rate has been labelled the conservatism of classical methods; other labels might include inefficiency or inadmissibility. Despite their superior performance over least squares estimation in many situations, robust methods for regression are still not widely used. Several reasons may help explain their unpopularity (Hampel et al.).

One possible reason is that there are several competing methods and the field got off to many false starts. Also, computation of robust estimates is much more computationally intensive than least squares estimation; in recent years, however, this objection has become less relevant as computing power has increased greatly.

Another reason may be that some popular statistical software packages failed to implement the methods (Stromberg). The belief of many statisticians that classical methods are robust may be another reason. Although uptake of robust methods has been slow, modern mainstream statistics textbooks often include discussion of these methods (for example, the books by Seber and Lee, and by Faraway); for a good general description of how the various robust regression methods developed from one another, see Andersen's book. Also, modern statistical software packages such as R, Statsmodels, Stata, and S-PLUS include considerable functionality for robust estimation (see, for example, the books by Venables and Ripley, and by Maronna et al.). The simplest method of estimating parameters in a regression model that is less sensitive to outliers than the least squares estimates is to use least absolute deviations. Even then, gross outliers can still have a considerable impact on the model, motivating research into even more robust approaches.

In 1964, Huber introduced M-estimation for regression. The "M" in M-estimation stands for "maximum likelihood type". The method is robust to outliers in the response variable, but turned out not to be resistant to outliers in the explanatory variables (leverage points). In fact, when there are outliers in the explanatory variables, the method has no advantage over least squares.

In the 1980s, several alternatives to M-estimation were proposed as attempts to overcome this lack of resistance.

### Compare Robust Regression Techniques

This example compares the results among regression techniques that are and are not robust to influential outliers.

Influential outliers are extreme response or predictor observations that influence parameter estimates and inferences of a regression analysis. Responses that are influential outliers typically occur at the extremes of a domain.

For example, you might have an instrument that measures a response poorly or erratically at extreme levels of temperature. With enough evidence, you can remove influential outliers from the data. If removal is not possible, you can use regression techniques that are robust to outliers. Estimate the coefficients and error variance by using simple linear regression.

Plot the regression line. LSMdl is a fitted LinearModel model object. The intercept and slope appear to be respectively higher and lower than they should be. Create a Bayesian linear regression model with a diffuse joint prior for the regression coefficients and error variance.

Specify one predictor for the model. PriorDiffuseMdl is a diffuseblm model object that characterizes the joint prior distribution. PosteriorDiffuseMdl is a conjugateblm model object that characterizes the joint posterior distribution of the linear model parameters. The estimates of a Bayesian linear regression model with diffuse prior are almost equal to those of a simple linear regression model.

Both models represent a naive approach to influential outliers, that is, the techniques treat outliers like any other observation. Specify that the errors follow a t distribution with 3 degrees of freedom, but no lagged terms.

This specification is effectively a regression model with t distributed errors. It is a template for estimation. Because the t distribution is more diffuse, the regression line attributes more variability to the influential outliers than to the other observations.
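The demo above is Bayesian; a comparable frequentist sketch (my own illustration, not the documentation's code) fits a regression with t-distributed errors with 3 degrees of freedom by maximum likelihood, using `scipy`:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def t_regression(x, y, df=3):
    """MLE for y = b0 + b1*x + e, where e/sigma follows a t(df) distribution."""
    X = np.column_stack([np.ones_like(x), x])

    def nll(p):
        b, log_s = p[:2], p[2]                 # log-scale keeps sigma positive
        r = (y - X @ b) / np.exp(log_s)
        return -np.sum(stats.t.logpdf(r, df) - log_s)

    p0 = np.append(np.linalg.lstsq(X, y, rcond=None)[0], 0.0)  # start from OLS
    res = minimize(nll, p0, method="Nelder-Mead",
                   options={"maxfev": 4000})
    return res.x[:2]

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 80)
y = 1 + 2 * x + rng.normal(0, 0.5, size=80)
y[:4] -= 30                        # influential outliers
b_t = t_regression(x, y)           # heavy-tailed errors discount the outliers
```

Because the t likelihood assigns non-negligible probability to large residuals, the fit attributes the outlying points to the heavy tails rather than shifting the regression line toward them.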

Therefore, the regression line appears to be a better predictive model than the other models. QRMdl is a fitted TreeBagger model object. Predict median responses for all observed x values; that is, implement quantile regression.

Doubly robust estimation combines a form of outcome regression with a model for the exposure (i.e., the propensity score). When used individually to estimate a causal effect, both outcome regression and propensity score methods are unbiased only if the statistical model is correctly specified.

The doubly robust estimator combines these 2 approaches such that only 1 of the 2 models need be correctly specified to obtain an unbiased effect estimator. In this introduction to doubly robust estimators, the authors present a conceptual overview of doubly robust estimation, a simple worked example, results from a simulation study examining performance of estimated and bootstrapped standard errors, and a discussion of the potential advantages and limitations of this method.

Correct specification of the regression model is a fundamental assumption in epidemiologic analysis. When the goal is to adjust for confounding, the estimator is consistent, and therefore asymptotically unbiased, if the model reflects the true relations among exposure and confounders with the outcome. In practice, we can never know whether any particular model accurately depicts those relations. Doubly robust estimation combines outcome regression with weighting by the propensity score (PS) such that the effect estimator is robust to misspecification of one (but not both) of these models (1-4).

Many estimators with the doubly robust property have been described in the statistical literature (4). In this introduction, we present a conceptual overview of doubly robust estimation, sample calculations for a simple example, results from a simulation study examining performance of model-based and bootstrapped confidence intervals, and a discussion of the potential advantages and limitations of this method.

Doubly robust estimation combines 2 approaches to estimating the causal effect of an exposure or treatment on an outcome. We examine in greater detail the 2 component models before describing how they are combined such that the resulting estimator is doubly robust.
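One standard doubly robust (augmented inverse-probability-weighted) estimator of the mean difference can be sketched as follows. The notation here is my own, not the article's: $A_i$ is the exposure indicator, $Y_i$ the outcome, $\hat e_i$ the estimated PS, and $\hat m_1, \hat m_0$ the outcome-regression predictions under exposure and non-exposure:

```latex
\hat{\Delta}_{DR}
= \frac{1}{n}\sum_{i=1}^{n}\left[\frac{A_i Y_i}{\hat e_i}
   - \frac{A_i-\hat e_i}{\hat e_i}\,\hat m_1(Z_i)\right]
- \frac{1}{n}\sum_{i=1}^{n}\left[\frac{(1-A_i)Y_i}{1-\hat e_i}
   + \frac{A_i-\hat e_i}{1-\hat e_i}\,\hat m_0(Z_i)\right]
```

If the PS model is correct, the weighted terms are unbiased and the augmentation terms have mean zero; if instead the outcome model is correct, the augmentation terms correct the bias of the weighted terms. Either way, the estimator is consistent.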

We have k covariates Z1, Z2, …, Zk, measured prior to exposure, which may confound the relation between statin initiation and lipid levels at follow-up. Letting Z denote the collection of Z1, …, Zk, we specify a single model in which we simultaneously estimate the exposure-outcome association and the confounder-outcome associations.
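The single outcome regression model described above can be sketched (as an assumed linear form; the article's exact equation is not reproduced here) as:

```latex
E[Y \mid X, Z] = \beta_0 + \beta_1 X + \beta_2 Z_1 + \cdots + \beta_{k+1} Z_k
```

where $X$ is the exposure (statin initiation) and $\beta_1$ estimates the exposure-outcome association adjusted for the covariates.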

This estimate of the effect of exposure is unconfounded assuming no unmeasured confounders and assuming that the outcome regression model has been correctly specified.

If the confounders are misspecified in this model, the estimated effect of exposure may be biased. This effect estimate can be interpreted as a causal effect estimate under several key assumptions, detailed below.

Alternatively, we could use the estimated parameters from this model in conjunction with each individual's actual covariate values to calculate the predicted mean response lipid level at follow-up under each exposure condition one of which is counterfactual for each person in the cohort. The predicted responses can be used to calculate a mean marginal difference due to exposure. Note that this step is not actually necessary in the case of a linear model without interactions between the treatment indicator and the covariates because the parameter estimate already has a marginal interpretation.
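The predicted-response step can be sketched in Python (an illustration with simulated data, not the article's example; the true exposure effect is set to 2):

```python
import numpy as np

# g-computation sketch: fit the outcome model, then predict each person's
# response under exposure and under non-exposure, and average the difference.
rng = np.random.default_rng(4)
n = 500
z = rng.normal(size=n)                                    # one confounder
a = (rng.random(n) < 1 / (1 + np.exp(-z))).astype(float)  # exposure depends on z
y = 1.0 + 2.0 * a + 1.5 * z + rng.normal(size=n)          # true effect = 2

X = np.column_stack([np.ones(n), a, z])
beta = np.linalg.lstsq(X, y, rcond=None)[0]               # fit the outcome model

X1 = np.column_stack([np.ones(n), np.ones(n), z])         # everyone exposed
X0 = np.column_stack([np.ones(n), np.zeros(n), z])        # everyone unexposed
marginal_effect = np.mean(X1 @ beta - X0 @ beta)          # mean marginal difference
```

In this linear model without exposure-covariate interactions, the averaged difference equals the coefficient on the exposure exactly, matching the remark in the text.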

This approach is more formally known as estimation by maximum likelihood of the g-computation formula (6, 7) and is the equivalent of maximum likelihood estimation of the parameters of a marginal structural model (8).

As we discuss in more detail below, the doubly robust estimator uses the outcome regression models in this marginalized approach. This effect estimate is consistent and therefore asymptotically unbiased if there are no unmeasured confounders and the outcome regression models have been correctly specified. It is interpretable as a causal effect under the assumptions noted below. Rather than control confounding by adjusting for the association between covariates and the outcome, we could control confounding by using the PS, defined as the conditional probability of exposure given covariates.

The PS is typically estimated from the observed data with a model such as logistic regression of exposure on the covariates; alternatively, one could use, for example, a probit model or a machine learning approach (9). The estimated parameters from this model can be used in conjunction with each individual's actual covariate values to calculate the predicted probability of statin initiation conditional on those covariates (the PS) for each person in the cohort. The PS can be used to control for confounding in a variety of ways, one of which is to weight the observed data.
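A minimal Python sketch of these two steps (simulated data and a logistic PS model fit by maximum likelihood; my own illustration, not the article's code):

```python
import numpy as np
from scipy.optimize import minimize

def propensity_scores(z, a):
    """Fit logit P(A=1|Z) = b0 + b1*z by maximum likelihood; return the PS."""
    Z = np.column_stack([np.ones_like(z), z])
    # negative log-likelihood of the logistic model, written stably
    nll = lambda b: np.sum(np.logaddexp(0, Z @ b) - a * (Z @ b))
    b = minimize(nll, np.zeros(2), method="BFGS").x
    return 1 / (1 + np.exp(-(Z @ b)))

rng = np.random.default_rng(5)
z = rng.normal(size=1000)
a = (rng.random(1000) < 1 / (1 + np.exp(-z))).astype(float)
ps = propensity_scores(z, a)
# inverse-probability-of-treatment weights: 1/PS if exposed, 1/(1-PS) if not
w = np.where(a == 1, 1 / ps, 1 / (1 - ps))
```

Applying these weights creates the pseudopopulation described below, in which the confounder distribution no longer differs between the exposed and unexposed groups.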

Weighting by this quantity creates a pseudopopulation in which the distributions of confounders among the exposed and unexposed are the same as the overall distribution of those confounders in the original total population.

Robust regression is an alternative to least squares regression when data are contaminated with outliers or influential observations, and it can also be used for the purpose of detecting influential observations. This page uses the following packages; make sure that you can load them before trying to run the examples on this page. If you do not have a package installed, run `install.packages()` for it first. Version info: the code for this page was tested in R version 3. Please note: the purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do.

In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics, or potential follow-up analyses. Residual: the difference between the predicted value (based on the regression equation) and the actual, observed value.

Outlier: in linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its value on the predictor variables.

An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem. Leverage: an observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean.

High leverage points can have a great amount of effect on the estimates of the regression coefficients. Influence: an observation is said to be influential if removing it substantially changes the estimates of the regression coefficients.

Influence can be thought of as the product of leverage and outlierness. Robust regression can be used in any situation in which you would use least squares regression. When fitting a least squares regression, we might find some outliers or high leverage data points. Suppose we have decided that these data points are not data entry errors, nor are they from a different population than most of our data, so we have no compelling reason to exclude them from the analysis.

Robust regression might be a good strategy, since it is a compromise between excluding these points entirely from the analysis and including all the data points while treating them all equally in OLS regression. The idea of robust regression is to weight the observations differently based on how well behaved these observations are. Roughly speaking, it is a form of weighted and reweighted least squares regression.


If the distribution of errors is asymmetric or prone to outliers, model assumptions are invalidated, and parameter estimates, confidence intervals, and other computed statistics become unreliable.

Use fitlm with the RobustOpts name-value pair to create a model that is not much affected by outliers. The robust fitting method is less sensitive than ordinary least squares to large changes in small parts of the data. Robust regression works by assigning a weight to each data point. Weighting is done automatically and iteratively using a process called iteratively reweighted least squares. In the first iteration, each point is assigned equal weight and model coefficients are estimated using ordinary least squares.

At subsequent iterations, weights are recomputed so that points farther from model predictions in the previous iteration are given lower weight. Model coefficients are then recomputed using weighted least squares. The process continues until the values of the coefficient estimates converge within a specified tolerance. This example shows how to use robust regression. It compares the results of a robust fit to a standard least-squares fit. Load the moore data.

The predictor data is in the first five columns, and the response is in the sixth. The residuals from the robust fit (right half of the plot) are nearly all closer to the straight line, except for the one obvious outlier. The weight of the outlier in the robust fit is much less than the typical weight of an observation.
