
The concept of heteroscedasticity of residuals in a regression model

Heteroscedasticity

A random error is the deviation in a linear multiple regression model:

ε_i = y_i – β_0 – β_1·x_1i – … – β_m·x_mi

Because the random error of the regression model is an unknown quantity, its sample estimate is calculated from the fitted model as

e_i = y_i – ŷ_i,

where e_i are the residuals of the regression model.

The term heteroscedasticity, in the broad sense, refers to assumptions about the variance of the random errors of the regression model.

When constructing a normal linear regression model, the following conditions on the random error are assumed:

1) the mathematical expectation of the random error is zero in all observations: E(ε_i) = 0;

2) the variance of the random error is constant for all observations: D(ε_i) = σ² = const;

3) there is no systematic relationship between the random errors in any two observations, i.e., the random errors of the regression model are not correlated with each other (the covariance of the random errors of any two different observations is zero): Cov(ε_i, ε_j) = 0 for i ≠ j.

The second of these conditions means homoscedasticity (homogeneous spread): the variances of the random errors are the same in all observations.

Homoscedasticity is thus the assumption that the variance of the random error ε_i is a known constant for all observations.

In practice, however, the assumption of homoscedasticity of the random errors ε_i, or of the regression residuals e_i, is not always satisfied.

Heteroscedasticity (heterogeneous spread) is the assumption that the random error variances differ across observations, which violates the second condition of the normal linear multiple regression model: D(ε_i) = σ_i² ≠ const.

Heteroscedasticity can be expressed through the covariance matrix of the random errors of the regression model. The vector of random errors ε then obeys a normal law with zero mathematical expectation and covariance matrix σ²Ω:

ε ~ N(0, σ²Ω),

where Ω is the covariance matrix of the random errors.

If the random error variances of the regression model are known in advance, the problem of heteroscedasticity is easily eliminated: observations can simply be weighted by the inverses of their variances. In most cases, however, not only the variances of the random errors are unknown, but also the regression function y = f(x) itself, which still has to be specified and estimated.
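For illustration, here is a minimal sketch of this inverse-variance weighting (weighted least squares). The simulated data, the quadratic variance pattern, and the use of the statsmodels package are assumptions of the example, not part of the text:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1, 10, n)
sigma2 = x ** 2                                # suppose these error variances are known
y = 2 + 3 * x + rng.normal(0, np.sqrt(sigma2))

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()
wls = sm.WLS(y, X, weights=1 / sigma2).fit()   # weight each observation by 1/variance
print("OLS se:", ols.bse)                      # inefficient under heteroscedasticity
print("WLS se:", wls.bse)                      # typically smaller for the slope
```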

To detect heteroscedasticity of the regression model residuals, the residuals must be analyzed. The following hypotheses are tested.

The main hypothesis H0 assumes that the random error variances are constant, i.e., that the homoscedasticity condition holds in the model: σ_1² = σ_2² = … = σ_n² = σ².

The alternative hypothesis H1 assumes that the random error variances differ across observations, i.e., that heteroscedasticity is present in the model: σ_i² ≠ σ_j² for some i ≠ j.

Heteroscedasticity of regression model residuals can lead to negative consequences:

1) the estimates of the unknown coefficients of the normal linear regression model remain unbiased and consistent, but lose efficiency;

2) there is a high probability that the standard errors of the regression coefficients will be estimated incorrectly, which can ultimately lead to wrong conclusions about the significance of the regression coefficients and of the regression model as a whole; a simulation sketch below illustrates this effect.
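A small Monte Carlo experiment makes consequence 2) concrete: under heteroscedastic errors the standard error that OLS reports diverges from the actual sampling spread of the slope estimates. All data and names here are illustrative assumptions:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, reps = 100, 2000
x = rng.uniform(1, 10, n)
X = sm.add_constant(x)

slopes, reported_se = [], []
for _ in range(reps):
    y = 1 + 2 * x + rng.normal(0, x)        # error sd grows with x
    fit = sm.OLS(y, X).fit()
    slopes.append(fit.params[1])
    reported_se.append(fit.bse[1])

print("actual sd of slope estimates:", np.std(slopes))
print("average OLS-reported se     :", np.mean(reported_se))
# the reported se systematically misstates the true sampling variability,
# so t-tests based on it are misleading
```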

Homoscedasticity

Homoscedasticity of the residuals means that the variance of each deviation ε_i is the same for all values of x. If this condition is not met, heteroscedasticity occurs, and its presence can often be seen directly in the correlation field (scatter plot).

Since the variance characterizes the spread of the deviations, two typical pictures arise: in one, the spread of the residuals increases as x increases; in the other, the spread reaches its maximum at intermediate values of x and decreases at the smallest and largest values of x. Heteroscedasticity reduces the efficiency of the estimated parameters of the regression equation. Homo- or heteroscedasticity can also be detected by plotting the residuals against the fitted (theoretical) values.

One of the assumptions of OLS is that the variance of the residuals is homoscedastic: for each value of the factor x, the residuals e have the same variance. A 3D illustration makes the contrast vivid: under homoscedasticity the distribution of the residuals is the same for every value of x, whereas under heteroscedasticity it changes with x, as in the simulated sketch below.
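Since the original figures are not reproduced here, a flat simulated sketch conveys the same contrast. The data-generating processes and the use of matplotlib are assumptions of the example:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 300)
y_homo = 1 + 2 * x + rng.normal(0, 2.0, x.size)   # constant spread
y_het = 1 + 2 * x + rng.normal(0, 0.5 * x)        # spread grows with x

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].scatter(x, y_homo, s=8)
axes[0].set_title("Homoscedastic")
axes[1].scatter(x, y_het, s=8)
axes[1].set_title("Heteroscedastic")
for ax in axes:
    ax.set_xlabel("x")
axes[0].set_ylabel("y")
plt.tight_layout()
plt.show()
```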

For multiple regression, such plots are the most visual way to study homo- and heteroscedasticity.

Heteroscedasticity can in some cases lead to biased estimates of the regression coefficients, although unbiasedness chiefly depends on the second OLS premise, the independence of the residuals from the factor values. More importantly, heteroscedasticity reduces the efficiency of the estimates b. In particular, it undermines the usual formula for the standard error of a regression coefficient S_b, which assumes a uniform residual variance for all factor values.




Assessing the accuracy of regression models

Two indicators are most often used to assess accuracy; for linear and nonlinear models alike they have the form:

1. The average approximation error: A = (1/n)·Σ |(y_i – ŷ_i)/y_i| · 100%.

2. The mean square error of approximation: σ̂ = √( Σ (y_i – ŷ_i)² / n ). Both indicators are easy to compute directly, as the sketch after this list shows.
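A minimal computation sketch of the two indicators as defined above (numpy; the toy numbers are purely illustrative):

```python
import numpy as np

def mean_approximation_error(y, y_hat):
    # average relative approximation error, in percent
    return 100 * np.mean(np.abs((y - y_hat) / y))

def rmse_approximation(y, y_hat):
    # root mean square error of approximation
    return np.sqrt(np.mean((y - y_hat) ** 2))

y = np.array([10.0, 12.0, 15.0, 11.0])
y_hat = np.array([9.5, 12.4, 14.2, 11.3])
print(mean_approximation_error(y, y_hat), rmse_approximation(y, y_hat))
```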

8.1. The essence and causes of heteroscedasticity

The second Gauss–Markov condition, homoscedasticity, i.e., equal variance of the residuals, is one of the most important prerequisites of OLS.

Since the mathematical expectation of the residual in each observation is zero, the squared residuals can serve as estimates of their variances.

These squared residuals enter ESS (the sum of squares that OLS minimizes) with identical unit weights, which is not always justified, since in practice heteroscedasticity is far from rare.

For example, as income increases, not only the average level of consumption rises, but also the variation in consumption: high-income subjects have more scope for distributing their income. Heteroscedasticity is most common in spatial (cross-section) samples. Clearly, in the presence of heteroscedasticity, observations with larger variance should be given less weight in ESS, and vice versa, rather than being weighted equally as in classical OLS.

A point on the scatter plot that comes from an observation with smaller variance pins down the direction of the regression line more accurately than a point from an observation with larger variance.

The consequences of heteroskedasticity are:

1. The parameter estimates will not be efficient, that is, they will not have the smallest variance compared with other estimates; however, they will remain unbiased.

2. The estimated variances of the coefficients will be biased, since the residual-variance estimate (per degree of freedom) that enters the computation of the variances of all the coefficients is itself biased.

3. Conclusions based on the resulting F- and t-statistics, as well as interval estimates, will be unreliable.

8.2. Detecting heteroscedasticity

This is quite a difficult task: the variance σ²(ε_i) usually cannot be determined, since for a specific value of the explanatory variable x_i (or a specific vector x in multiple regression) we have only a single value of the dependent variable y_i and can compute only a single fitted value ŷ_i.

However, a number of methods and tests have now been developed to detect heteroscedasticity:

1. Graphical. We have already noted that M(ε_i) = 0, so the variance of a residual can be replaced by its estimate, and the squared residual e_i² can be taken as that estimate. One can then plot e_i² against x_i and study the nature of the dependence. If there are several explanatory variables, the dependence of e_i² on each variable x_j is examined in turn.

One can also examine the dependence of e_i² on the fitted values ŷ_i, since ŷ is a linear combination of all the explanatory variables; a minimal sketch follows.
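A minimal sketch of this residual plot (simulated data; statsmodels and matplotlib are assumptions of the example):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 200)
y = 1 + 2 * x + rng.normal(0, 0.5 * x)      # heteroscedastic errors
fit = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(fit.fittedvalues, fit.resid ** 2, s=10)
plt.xlabel("fitted values")
plt.ylabel("squared residuals")
plt.show()   # a visible trend in the cloud suggests heteroscedasticity
```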

2. Spearman's rank correlation test

The values x_i and ε_i are ordered in ascending order, and each observation is assigned a rank in the x series and in the ε series according to this ordering. The difference d_i between the ranks of x and ε is computed for each observation number.

Then the rank correlation coefficient is calculated:

ρ = 1 – 6·Σd_i² / (n(n² – 1)).

It is known that if the residuals do not correlate with the explanatory variables, then the statistic

t = ρ·√(n – 2) / √(1 – ρ²)

has a Student distribution with df = n – 2 degrees of freedom.

If the calculated value of the t-statistic exceeds the tabulated critical value at the chosen significance level γ, the hypothesis H0 of no heteroscedasticity is rejected and the heteroscedasticity is recognized as significant. The critical value of the t-statistic is determined from the Student table for significance level γ and df = n – 2.

If the regression model is multiple, the H0 test is carried out for each explanatory variable.
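A minimal sketch of this test (simulated data; scipy and statsmodels are assumptions of the example; the absolute residuals are used, as in the fuller description later in the text):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 60)
y = 1 + 2 * x + rng.normal(0, 0.5 * x)
resid = sm.OLS(y, sm.add_constant(x)).fit().resid

rho, pval = stats.spearmanr(x, np.abs(resid))   # rank correlation of x and |e|
print(f"rho = {rho:.3f}, p-value = {pval:.4f}")
# a small p-value leads to rejecting H0 of homoscedasticity
```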

3. Goldfeld–Quandt test

It is assumed that the variance of the residual in each observation is proportional (or inversely proportional) to the regressor of interest; it is also assumed that the residuals are normally distributed and free of autocorrelation.

In the case of multiple regression, it is advisable to conduct the test for each regressor separately.

Test sequence:

a) the observations (table rows) are ordered by increasing values of the regressor of interest;

b) the ordered sample is divided into three subsamples of sizes k, n – 2k, and k; the authors of the test suggest k = 11 for n = 30, k = 22 for n = 60, k = 36–38 for n = 100, k = 110 for n = 300, and so on;

c) the middle n – 2k observations are discarded and the regression is fitted separately to the first and third subsamples, giving their residual sums of squares;

d) the ratio of the larger residual sum of squares to the smaller is compared with the critical value of the F-criterion, as described in more detail below; a sketch using a library implementation follows the list.
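The sketch below uses the het_goldfeldquandt function from statsmodels, which fits the two end regressions and forms the F-ratio; the simulated data and the choice to drop a third of the central observations are assumptions of the example:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_goldfeldquandt

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 100)
y = 1 + 2 * x + rng.normal(0, 0.5 * x)
X = sm.add_constant(x)

order = np.argsort(x)                 # step a): sort by the regressor
# drop roughly a third of the central observations, per step b)
F, pval, _ = het_goldfeldquandt(y[order], X[order], drop=0.33)
print(f"F = {F:.2f}, p-value = {pval:.4f}")
# a small p-value indicates that the residual variance grows with x
```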

When estimating the parameters of the regression equation by the least squares method, we make certain assumptions about the random component ε. In the model

y = a + b_1·x + ε

the random component ε is an unobservable variable. After the model parameters have been estimated, by calculating the differences between the actual and theoretical (fitted) values of the dependent variable y, we can determine estimates of the random component, the residuals e_i = y_i – ŷ_i. When the model specification is changed or new observations are added, the sample residuals e_i may change. Therefore, regression analysis includes not only building the model itself but also studying the random deviations ε_i, i.e., the residual values.

The previous section examined formal tests of the statistical significance of regression and correlation coefficients: Student's t-test and the F-criterion. These criteria rest on assumptions about the behavior of the residuals ε_i: the residuals are independent random variables with zero mean, they have the same (constant) variance, and they follow a normal distribution.

Estimates of regression parameters must meet certain criteria: to be unbiased, consistent, and efficient.

Unbiasedness means that the mathematical expectation of the residuals is zero. Consequently, over a large number of sample estimates the residuals do not accumulate, and a found regression parameter b_i can be regarded as the average of a large number of possible unbiased estimates.

For practical purposes, not only unbiasedness is important, but also the efficiency of the estimates. Estimates count as efficient if they have the smallest variance.

Confidence intervals for the regression parameters are realistic only if the estimates are not only unbiased and efficient but also consistent. Consistency means that the accuracy of the estimates increases with the sample size.

Studying the residuals ε_i involves checking the following five OLS prerequisites (the Gauss–Markov conditions):

    Random nature of the residuals.

To check this, plot the residuals ε_i against the theoretical (fitted) values ŷ_i. If the arrangement of points on the graph shows no systematic pattern, the residuals ε_i are random variables, OLS is justified, and the theoretical values approximate the actual values of y well.

    Zero mean of the residuals, independent of x_i.

The second OLS assumption of a zero mean of the residuals means that Σ(y_i – ŷ_i) = 0. This holds for linear models and for models that are nonlinear in the included variables. For models that are nonlinear in the estimated parameters but reducible to linear form by taking logarithms, the mean error is zero for the logarithms of the original data: for a power model y = a·x^b·ε, for example, the zero-mean condition applies to ln ε in ln y = ln a + b·ln x + ln ε.

    Homoscedasticity: the variance of each deviation ε_i is the same for all values of x.

The third premise of the least squares method requires that the variance of the residuals be homoscedastic: for each value of the factor x_i, the residuals have the same variance. If this condition for applying the least squares method is not met, heteroscedasticity arises.

The presence of heteroscedasticity can in some cases lead to biased estimates of the regression coefficients, although unbiasedness mainly depends on compliance with the second OLS premise, the independence of the residuals from the factor values.

Heteroscedasticity reduces the efficiency of the estimates b_i. In particular, it becomes difficult to use the formula for the standard error of a regression coefficient S_b, which assumes a uniform variance of the residuals for all factor values.

Let us consider tests that allow us to analyze the model for homoscedasticity.

With a small sample size, which is most typical of econometric studies, heteroscedasticity can be assessed by the Goldfeld–Quandt method, developed in 1965. Goldfeld and Quandt considered a one-factor linear model in which the variance of the residuals increases in proportion to the square of the factor. To evaluate violations of homoscedasticity, they proposed a parametric test that includes the following steps:

    Order the n observations by increasing values of the variable x.

    Exclude from consideration the C central observations, requiring that (n – C)/2 > p, where p is the number of parameters being estimated.

From experimental calculations carried out by the authors of the method for the one-factor case, it is recommended to take C = 8 for n = 30 and C = 16 for n = 60.

    Divide the set of (n – C) observations into two groups (with small and large values of the factor x, respectively) and fit a regression equation to each group.

    Determine the residual sums of squares for the first (S_1) and second (S_2) groups and find their ratio R = S_1/S_2, where S_1 > S_2.

When the null hypothesis of homoscedasticity holds, the ratio R satisfies the F-criterion with (n – C – 2p)/2 degrees of freedom for each residual sum of squares. The more the value of R exceeds the tabulated value of the F-criterion, the more the premise of equal residual variances is violated.

The Goldfeld-Quandt criterion is also used to test multiple regression residuals for heteroscedasticity.

The presence of heteroscedasticity in the regression residuals can also be checked using Spearman's rank correlation. The essence of the test is that under heteroscedasticity the absolute residuals |ε_i| are correlated with the factor values x_i. This correlation can be measured with the Spearman rank correlation coefficient:

ρ = 1 – 6·Σd² / (n(n² – 1)), (31)

where d is the absolute difference between the ranks of the values x_i and |ε_i|.

The statistical significance of ρ can be assessed using the t-criterion:

t = ρ·√(n – 2) / √(1 – ρ²). (32)

The computed value is compared with the tabulated one for α = 0.05 and (n – m) degrees of freedom. It is generally accepted that if t > t_table, the correlation between ε_i and x_i is statistically significant, i.e., the residuals are heteroscedastic. Otherwise, the hypothesis of no heteroscedasticity of the residuals is accepted.

The criteria considered so far do not provide a quantitative assessment of how the variance of the regression errors depends on the values of the factors included in the regression; they only allow one to determine the presence or absence of heteroscedasticity in the residuals. Therefore, once heteroscedasticity of the residuals has been established, its dependence on the factor values can be quantified. For this purpose, the tests of White, Park, Glejser, and others can be used.

White's test assumes that the regression-error variance is a quadratic function of the factor values, i.e., with one factor σ² = a + b·x + c·x² + u, or with several factors:

σ² = a + b_1·x_1 + b_11·x_1² + b_2·x_2 + b_22·x_2² + b_12·x_1·x_2 + … + b_p·x_p + b_pp·x_p² + b_1p·x_1·x_p + b_2p·x_2·x_p + … + u.

Thus the model includes not only the values of the factors but also their squares and pairwise products. Since every parameter of the function σ² = f(x) must be estimated on a sufficient number of degrees of freedom, the smaller the sample, the fewer pairwise products the quadratic function can contain. For example, if the regression y_i = a + b_1·x + ε_i is built on 30 observations, the quadratic function for the residuals can subsequently be represented only as

σ² = a + b_1·x + b_11·x² + u,

since each estimated parameter requires at least 6–7 observations. White's test is included in the standard regression-analysis routines of the Econometric Views package. The presence or absence of heteroscedasticity of the residuals is judged by the F-criterion for the quadratic regression of the residuals: if the actual value of the F-criterion exceeds the tabulated one, there is a systematic relationship between the error variance and the values of the factors included in the regression, and the residuals are heteroscedastic. Otherwise (F_actual < F_table), it is concluded that there is no heteroscedasticity in the regression residuals.
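White's test is also available outside Econometric Views; a minimal sketch using the het_white function from statsmodels (the simulated data and package choice are assumptions of the example):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, 120)
X = sm.add_constant(x)
y = 1 + 2 * x + rng.normal(0, 0.5 * x)
fit = sm.OLS(y, X).fit()

lm, lm_pval, fval, f_pval = het_white(fit.resid, X)
print(f"F = {fval:.2f}, p-value = {f_pval:.4f}")
# a small p-value: the error variance depends on the regressors
```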

The Park test also belongs to the formal tests of heteroscedasticity. It assumes that the residual variance is related to the factor values by the function ln σ² = a + b·ln x + u. This regression is built for each factor of a multifactor model. The significance of the regression coefficient b is checked by Student's t-test. If the coefficient b in the equation for ln σ² turns out to be statistically significant, then ln σ² depends on ln x, i.e., the residuals are heteroscedastic.
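A minimal sketch of the Park regression (simulated data; statsmodels is an assumption of the example):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, 120)
y = 1 + 2 * x + rng.normal(0, 0.5 * x)
e = sm.OLS(y, sm.add_constant(x)).fit().resid

# regress ln(e^2) on ln(x) and inspect the slope b
park = sm.OLS(np.log(e ** 2), sm.add_constant(np.log(x))).fit()
print("b =", park.params[1], " t =", park.tvalues[1], " p =", park.pvalues[1])
# a significant b indicates that ln(sigma^2) depends on ln(x)
```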

While the White and Park tests assess heteroscedasticity via the squared residuals ε², the Glejser test is based on a regression of the absolute values of the residuals |ε|, i.e., the function |ε_i| = a + b·x_i^c + u_i is considered. The regression of |ε_i| on x_i is built for different values of the parameter c, and the function for which the coefficient b turns out to be most significant, i.e., has the largest t-statistic (or F-criterion and R²), is then selected.
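A minimal sketch of the Glejser search over c (simulated data; the grid of c values and package choice are assumptions of the example):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(1, 10, 120)
y = 1 + 2 * x + rng.normal(0, 0.5 * x)
resid = sm.OLS(y, sm.add_constant(x)).fit().resid

# regress |e| on x^c for several c; pick the most significant slope
for c in (-1.0, -0.5, 0.5, 1.0):
    g = sm.OLS(np.abs(resid), sm.add_constant(x ** c)).fit()
    print(f"c = {c:+.1f}: t(b) = {g.tvalues[1]:6.2f}, R2 = {g.rsquared:.3f}")
```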

When heteroscedasticity in regression residuals is detected, the goal is to eliminate it, which is achieved by using the generalized least squares method (see below).

    No autocorrelation of the residuals: the values of the residuals ε_i are distributed independently of one another.

Autocorrelation of the residuals means the presence of correlation between the residuals of the current and previous (or subsequent) observations.

When building regression models, compliance with this condition is extremely important. The correlation coefficient between ε_i and ε_{i–1}, where ε_i are the residuals of the current observations and ε_{i–1} the residuals of the previous observations, can be defined as

r = Σ ε_i·ε_{i–1} / √( Σ ε_i² · Σ ε_{i–1}² ), (33)

which corresponds to the formula for the linear correlation coefficient. If this coefficient turns out to be significantly different from zero, the residuals are autocorrelated, and the probability density function f(ε) depends on the observation point considered and on the distribution of the residual values at the other observation points.
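Formula (33) is straightforward to compute directly; a minimal sketch (numpy; the residual series is made up for illustration):

```python
import numpy as np

def first_order_autocorrelation(e):
    # correlation between e_i and e_{i-1}, in the spirit of formula (33)
    cur, prev = e[1:], e[:-1]
    return np.sum(cur * prev) / np.sqrt(np.sum(cur ** 2) * np.sum(prev ** 2))

e = np.array([0.5, 0.7, 0.6, -0.2, -0.4, -0.3, 0.1, 0.4])
print(first_order_autocorrelation(e))  # values near +1 or -1 signal autocorrelation
```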

The absence of autocorrelation of the residuals ensures the consistency and efficiency of the regression-coefficient estimates. Compliance with this OLS premise is especially important when building regression models on time series, where, in the presence of a trend, subsequent levels of the series generally depend on their previous levels.

    The residuals follow a normal distribution.

The assumption of normally distributed residuals makes it possible to test the regression and correlation parameters using the t and F criteria. At the same time, regression estimates obtained by OLS retain good properties even when the residuals are not normally distributed, i.e., when the fifth premise of the least squares method is violated.

Along with the prerequisites of the least squares method as a method for estimating the regression parameters, certain requirements on the variables included in the model must be met when constructing regression models. First of all, the number of variables m must not be too large for the available sample: as noted above, each estimated parameter should rest on at least 6–7 observations, otherwise the regression parameters turn out to be statistically insignificant. In general, the use of OLS is possible only if the number of observations n exceeds the number of estimated parameters m, i.e., the system of normal equations has a solution only when n > m.

If the basic assumptions of OLS are not met, the model has to be adjusted: its specification changed, factors added or excluded, or the original data transformed, in order to obtain estimates of the regression coefficients that are unbiased, have a smaller residual variance, and therefore allow more effective statistical testing of the significance of the regression parameters. As already indicated, this is the purpose of the generalized least squares method.