
Multiple regression example. Introduction to Multiple Regression

The material will be illustrated with a running example: sales forecasting for OmniPower. Imagine that you are the marketing manager for a large national grocery chain. In recent years, nutrition bars containing large amounts of fat, carbohydrates, and calories have become popular. They allow runners, climbers, and other athletes to quickly restore the energy spent in grueling workouts and competitions. Food bar sales have exploded, and OmniPower's management has concluded that this market segment is very promising. Before introducing a new type of bar to the national market, the company would like to evaluate the impact of its price and advertising spending on sales. 34 stores were selected for the marketing study. You need to create a regression model that allows you to analyze the data obtained during the study. Can the simple linear regression model discussed in the previous note be used for this? How should it be changed?

Multiple regression model

For the market study, OmniPower created a sample of 34 stores with approximately the same sales volume. Consider two independent variables: the price of an OmniPower bar in cents (X1) and the monthly budget of the advertising campaign run in the store, expressed in dollars (X2). This budget includes the cost of signage and window displays, as well as the distribution of coupons and free samples. The dependent variable Y is the number of OmniPower bars sold per month (Fig. 1).

Fig. 1. Monthly sales volume of OmniPower bars, their price, and advertising costs


Interpretation of regression coefficients. If several explanatory variables are examined in a problem, the simple linear regression model can be extended by assuming a linear relationship between the response and each of the independent variables. For example, if there are k explanatory variables, the multiple linear regression model takes the form:

(1) Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

where β0 is the intercept, β1 is the slope of Y with respect to the variable X1 when the variables X2, X3, …, Xk are held constant, β2 is the slope of Y with respect to X2 when X1, X3, …, Xk are held constant, βk is the slope of Y with respect to Xk when X1, X2, …, Xk–1 are held constant, and εi is the random error of Y in the i-th observation.

Specifically, a multiple regression model with two explanatory variables has the form:

(2) Yi = β0 + β1X1i + β2X2i + εi

where β0 is the intercept, β1 is the slope of Y with respect to the variable X1 when the variable X2 is held constant, β2 is the slope of Y with respect to X2 when X1 is held constant, and εi is the random error of Y in the i-th observation.

Let's compare this multiple linear regression model with the simple linear regression model: Yi = β0 + β1Xi + εi. In the simple linear regression model, the slope β1 represents the change in the mean value of Y when the variable X changes by one unit, and it does not take into account the influence of other factors. In the multiple regression model with two independent variables (2), the slope β1 represents the change in the mean value of Y when X1 changes by one unit, taking into account the influence of the variable X2. This value is called the net (or partial) regression coefficient.

As in the simple linear regression model, the sample regression coefficients b0, b1, and b2 are estimates of the corresponding population parameters β0, β1, and β2.

Multiple regression equation with two independent variables:

(3) Ŷi = b0 + b1X1i + b2X2i

The regression coefficients are calculated by the least squares method. In Excel, you can use the Analysis ToolPak, Regression option. Unlike simple linear regression, specify as the Input X Range an area that includes all the independent variables (Fig. 2). In our example, this is $C$1:$D$35.

Fig. 2. Regression dialog of the Excel Analysis ToolPak

The Analysis ToolPak results are shown in Fig. 3. As we can see, b0 = 5,837.52, b1 = –53.2173 and b2 = 3.6131. Hence Ŷi = 5,837.52 – 53.2173 X1i + 3.6131 X2i, where Ŷi is the predicted sales of OmniPower nutrition bars in the i-th store (pieces), X1i is the bar price (in cents) in the i-th store, and X2i is the monthly advertising spend in the i-th store (in dollars).
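If you want to reproduce this calculation outside Excel, here is a minimal Python sketch using ordinary least squares. The file name omnipower.csv and its column order are assumptions; the printed coefficients will match Fig. 3 only when the actual 34-store data from Fig. 1 are supplied.

    import numpy as np

    # Assumed column order: price (cents), advertising ($), sales (bars/month) for 34 stores
    data = np.loadtxt("omnipower.csv", delimiter=",", skiprows=1)
    X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])  # add an intercept column
    y = data[:, 2]

    # Ordinary least squares: minimize ||y - X b||^2
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    b0, b1, b2 = b
    print(f"Yhat = {b0:.2f} + {b1:.4f}*X1 + {b2:.4f}*X2")

    # Predicted sales for a 79-cent bar with $400 of monthly advertising
    x_new = np.array([1.0, 79.0, 400.0])
    print("predicted sales:", x_new @ b)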

Fig. 3. Multiple regression analysis of OmniPower bar sales volume

The sample intercept b0 equals 5,837.52 and estimates the average number of OmniPower bars sold per month at zero price and zero advertising spending. Since these conditions are meaningless, in this situation the value of b0 has no reasonable interpretation.

The sample slope b1 equals –53.2173. This means that, for a given monthly amount of advertising spend, a one-cent increase in the price of a bar reduces expected sales by 53.2173 units. Similarly, the sample slope b2, equal to 3.6131, means that, at a fixed price, a $1 increase in monthly advertising spend is accompanied by an increase in expected bar sales of 3.6131 units. These estimates provide a better understanding of the impact of price and advertising on sales. For example, with a fixed amount of advertising spending, a 10-cent decrease in the price of a bar increases expected sales by 532.173 units, and with a fixed price, a $100 increase in advertising spend increases expected sales by 361.31 units.

Interpretation of slopes in a multiple regression model. The coefficients in a multiple regression model are called net regression coefficients. They estimate the average change in the response Y when one X changes by one unit while all other explanatory variables are held fixed. For example, in the OmniPower bar problem, a store with a fixed monthly amount of advertising spend will sell 53.2173 fewer bars if it raises the price by one cent. Another interpretation of these coefficients is possible. Imagine identical stores with the same amount of advertising spend: if the price of a bar decreases by one cent, sales in these stores will increase by 53.2173 bars. Consider now two stores where the bars cost the same but the advertising spend differs: if this spend increases by one dollar, sales will increase by 3.6131 units. As we can see, a reasonable interpretation of the slopes is possible only under certain restrictions imposed on the explanatory variables.

Predicting the values of the dependent variable Y. Once we have established that the accumulated data allow us to use a multiple regression model, we can predict the monthly sales of OmniPower bars and build confidence intervals for the mean and predicted sales. To predict the average monthly sales of OmniPower bars priced at 79 cents in a store that spends $400 per month on advertising, use the multiple regression equation: Ŷ = 5,837.52 – 53.2173*79 + 3.6131*400 = 3,079. Therefore, the expected sales volume for stores selling OmniPower bars at 79 cents and spending $400 per month on advertising is 3,079 bars.

By calculating the predicted value of Y and evaluating the residuals, one can construct confidence intervals containing the expected value and the predicted response value. We considered this procedure within the framework of the simple linear regression model. However, constructing similar estimates for the multiple regression model involves considerable computational difficulty and is not presented here.

Coefficient of multiple determination. Recall that a regression model allows you to calculate the coefficient of determination r². Because a multiple regression model contains at least two explanatory variables, the coefficient of multiple determination is the fraction of the variance of Y explained by the given set of explanatory variables:

r² = SSR / SST, where SSR is the regression sum of squares and SST is the total sum of squares.

For example, in the OmniPower bar problem SSR = 39,472,731, SST = 52,093,677 and k = 2. Thus, r² = 39,472,731 / 52,093,677 = 0.758.

This means that 75.8% of the variation in sales volumes is due to price changes and fluctuations in advertising spend.

Residual analysis for a multiple regression model

Residual analysis allows you to determine whether a multiple regression model with two (or more) explanatory variables can be applied. The following residual plots are usually examined: residuals versus the predicted values, and residuals versus each of the explanatory variables.

The first plot (Fig. 4) lets us analyze how the residuals depend on the predicted values Ŷ. If the residuals do not depend on the predicted values and take both positive and negative values (as in our example), the condition of a linear relationship between Y and both explanatory variables is satisfied. Unfortunately, the Analysis ToolPak does not create this plot. You can, however, enable Residuals in the Regression dialog (see Fig. 2); this outputs a table of residuals, from which you can build the scatter plot yourself (Fig. 4).

Fig. 4. Residuals versus the predicted values
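As an alternative to building the scatter plot by hand from the ToolPak's residual table, a small matplotlib sketch produces the same residuals-versus-predicted plot; it reuses the assumed omnipower.csv layout from the earlier sketch.

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.loadtxt("omnipower.csv", delimiter=",", skiprows=1)  # assumed file, as above
    X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])
    y = data[:, 2]
    b, *_ = np.linalg.lstsq(X, y, rcond=None)

    residuals = y - X @ b
    plt.scatter(X @ b, residuals)        # residuals should scatter randomly around zero
    plt.axhline(0, linewidth=1)
    plt.xlabel("Predicted sales")
    plt.ylabel("Residual")
    plt.show()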

The second and third plots show how the residuals depend on the explanatory variables. These plots can reveal a quadratic effect; in that case, the squared explanatory variable should be added to the multiple regression model. These plots are produced by the Analysis ToolPak (see Fig. 2) if you enable the Residual Plots option (Fig. 5).

Fig. 5. Residuals versus price and advertising costs

Testing the significance of a multiple regression model.

After residual analysis confirms that the linear multiple regression model is adequate, we can determine whether there is a statistically significant relationship between the dependent variable and the set of explanatory variables. Since the model includes several explanatory variables, the null and alternative hypotheses are formulated as follows: H0: β1 = β2 = … = βk = 0 (there is no linear relationship between the response and the explanatory variables); H1: at least one βj ≠ 0 (there is a linear relationship between the response and at least one explanatory variable).

To test the null hypothesis, we use the F-test: the test F-statistic equals the regression mean square (MSR) divided by the error mean square (MSE):

F = MSR / MSE, where MSR = SSR / k, MSE = SSE / (n – k – 1), the F-statistic has an F-distribution with k and n – k – 1 degrees of freedom, and k is the number of independent variables in the regression model.

The decision rule is: at significance level α, the null hypothesis H0 is rejected if F > F_U(k, n – k – 1); otherwise H0 is not rejected (Fig. 6).

Fig. 6. Summary ANOVA table for testing the hypothesis about the statistical significance of the multiple regression coefficients

The ANOVA summary table produced by the Excel Analysis ToolPak for the OmniPower bar problem is shown in Fig. 3 (area A10:F14). At a significance level of 0.05, the critical value of the F-distribution with 2 and 31 degrees of freedom is F_U(2, 31) = F.INV(1-0.05; 2; 31) = 3.305 (Fig. 7).

Fig. 7. Testing the hypothesis about the significance of the regression coefficients at significance level α = 0.05, with 2 and 31 degrees of freedom

As shown in Fig. 3, the F-statistic is 48.477 > F_U(2, 31) = 3.305, and the p-value is close to 0.000 < 0.05. Therefore, the null hypothesis H0 is rejected, and sales volume is linearly related to at least one of the explanatory variables (price and/or advertising spend).
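The same F-test can be reproduced from the sums of squares quoted above; a short sketch with scipy.stats (values taken from Fig. 3) might look like this.

    from scipy import stats

    SSR, SST, n, k = 39_472_731, 52_093_677, 34, 2
    SSE = SST - SSR
    MSR = SSR / k
    MSE = SSE / (n - k - 1)
    F = MSR / MSE
    F_crit = stats.f.ppf(0.95, k, n - k - 1)   # critical value at alpha = 0.05
    p_value = stats.f.sf(F, k, n - k - 1)      # upper-tail probability
    print(F, F_crit, p_value)                  # F ~ 48.5, F_crit ~ 3.305, p ~ 0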

Statistical inferences about the population regression coefficients

To identify a statistically significant relationship between the variables X and Y in the simple linear regression model, we tested a hypothesis about the slope. In addition, to estimate the population slope, we built a confidence interval.

Hypothesis testing. To test the hypothesis that the population slope β1 in the simple linear regression model equals zero, the formula t = (b1 – β1)/Sb1 is used. It can be extended to the multiple regression model:

where t is a test statistic with a t-distribution with n – k – 1 degrees of freedom, bj is the slope of variable Xj with respect to Y when all other explanatory variables are held constant, Sbj is the standard error of the regression coefficient bj, k is the number of explanatory variables in the regression equation, and βj is the hypothesized population slope of the response with respect to the j-th variable when all other variables are fixed.

Fig. 3 (bottom table) shows the results of applying the t-test (obtained using the Analysis ToolPak) to each of the independent variables included in the regression model. Thus, to determine whether the variable X2 (advertising spend) has a significant effect on sales at a fixed price of the OmniPower bar, the null and alternative hypotheses are formulated as H0: β2 = 0, H1: β2 ≠ 0. In accordance with formula (6), we obtain:

At a significance level of 0.05, the critical values of the t-distribution with 31 degrees of freedom are t_L = T.INV(0.025; 31) = –2.0395 and t_U = T.INV(0.975; 31) = 2.0395 (Fig. 8). The p-value = 1 – T.DIST(5.27; 31; TRUE) is close to 0.0000. Based on either inequality, t = 5.27 > 2.0395 or p = 0.0000 < 0.05, the null hypothesis H0 is rejected. Therefore, at a fixed bar price there is a statistically significant relationship between the variable X2 (advertising spend) and sales volume. Thus, there is an extremely small chance of rejecting the null hypothesis when there is no linear relationship between advertising spend and sales.
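A sketch of the same t-test in Python, using the t-statistic reported in Fig. 3 and a two-sided p-value rather than the one-sided Excel formula above:

    from scipy import stats

    n, k = 34, 2
    df = n - k - 1                      # 31 degrees of freedom
    t_stat = 5.27                       # t for the advertising coefficient, from Fig. 3
    t_crit = stats.t.ppf(0.975, df)     # +/- 2.0395 at alpha = 0.05, two-sided
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    print(t_crit, p_value)              # 2.0395, p close to 0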

Fig. 8. Testing the hypothesis about the significance of the regression coefficients at significance level α = 0.05, with 31 degrees of freedom

Testing the significance of a particular regression coefficient is in fact a test of the significance of the corresponding variable included in the regression model together with the others. Hence, the t-test for the significance of a regression coefficient is equivalent to testing the hypothesis about the influence of each explanatory variable.

Confidence intervals. Instead of testing the hypothesis about the slope of the population, you can estimate the value of this slope. In a multiple regression model, the following formula is used to build a confidence interval:

(7) bj ± t(n–k–1) Sbj

We use this formula to construct a 95% confidence interval for the population slope β1 (the effect of price X1 on sales volume Y at a fixed amount of advertising spend X2). According to formula (7): b1 ± t(n–k–1) Sb1. Since b1 = –53.2173 (see Fig. 3), Sb1 = 6.8522, and the critical t value at the 95% confidence level with 31 degrees of freedom is t(n–k–1) = T.INV(0.975; 31) = 2.0395, we get:

–53.2173 ± 2.0395*6.8522

–53.2173 ± 13.9752

–67.1925 ≤ β 1 ≤ –39.2421

Thus, taking into account the effect of advertising costs, it can be argued that with an increase in the price of a bar by one cent, the sales volume decreases by an amount that ranges from 39.2 to 67.2 units. There is a 95% chance that this interval correctly estimates the relationship between the two variables. Since this confidence interval does not contain zero, it can be argued that the regression coefficient β 1 has a statistically significant effect on sales.

Assessing the Significance of Explanatory Variables in a Multiple Regression Model

A multiple regression model should include only those explanatory variables that help predict the value of the dependent variable. If an explanatory variable does not meet this requirement, it should be removed from the model. To estimate the contribution of an explanatory variable, the partial F-test is generally used. It assesses the change in the regression sum of squares after the next variable is included in the model. A new variable is included in the model only when it leads to a significant increase in prediction accuracy.

To apply the partial F-test to the OmniPower bar sales problem, we need to evaluate the contribution of variable X2 (advertising spend) after variable X1 (bar price) has been included in the model. If the model includes several explanatory variables, the contribution of an explanatory variable Xj can be determined by excluding it from the model and evaluating the regression sum of squares (SSR) computed from the remaining variables. If the model includes two variables, the contribution of each of them is determined by the formulas:

Estimating the contribution of variable X1, given that variable X2 is included in the model:

(8a) SSR(X 1 |X 2) = SSR(X 1 and X 2) – SSR(X 2)

Estimating the contribution of variable X2, given that variable X1 is included in the model:

(8b) SSR(X 2 |X 1) = SSR(X 1 and X 2) – SSR(X 1)

The quantities SSR(X2) and SSR(X1) are, respectively, the regression sums of squares computed with only one of the explanatory variables (Fig. 9).

Fig. 9. Coefficients of simple linear regression models: (a) sales volume versus bar price, giving SSR(X1); (b) sales volume versus advertising costs, giving SSR(X2) (obtained using the Excel Analysis ToolPak)

The null and alternative hypotheses about the contribution of variable X1 are formulated as follows: H0: including variable X1 does not significantly improve the accuracy of the model that already includes variable X2; H1: including variable X1 significantly improves the accuracy of the model that already includes variable X2. The statistic underlying the partial F-test for two variables is calculated by the formula:

(9) F = SSR(X1 | X2) / MSE, where MSE is the error (residual) mean square of the model containing both factors. By definition, this F-statistic has an F-distribution with 1 and n – k – 1 degrees of freedom.

So, SSR(X2) = 14,915,814 (Fig. 9) and SSR(X1 and X2) = 39,472,731 (Fig. 3, cell C12). Therefore, by formula (8a): SSR(X1 | X2) = SSR(X1 and X2) – SSR(X2) = 39,472,731 – 14,915,814 = 24,556,917. With SSR(X1 | X2) = 24,556,917 and MSE(X1 and X2) = 407,127 (Fig. 3, cell D13), formula (9) gives F = 24,556,917 / 407,127 = 60.32. At a significance level of 0.05, the critical value of the F-distribution with 1 and 31 degrees of freedom is F.INV(0.95; 1; 31) = 4.16 (Fig. 10).

Fig. 10. Testing the hypothesis about the significance of the regression coefficients at significance level 0.05, with 1 and 31 degrees of freedom

Since the calculated F-statistic exceeds the critical value (60.32 > 4.16), hypothesis H0 is rejected; hence adding variable X1 (price) significantly improves the regression model that already includes variable X2 (advertising spend).

Similarly, one can evaluate the influence of variable X2 (advertising spend) on a model that already includes variable X1 (price). Do the calculations yourself. The decision rule gives 27.8 > 4.16, and hence including variable X2 also significantly improves the accuracy of the model that already includes variable X1. So including each of the variables improves the accuracy of the model, and both variables should be included in the multiple regression model: price and advertising spend.
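A short sketch of the partial F-test using the sums of squares quoted above (scipy is assumed to be available):

    from scipy import stats

    SSR_X1X2 = 39_472_731      # regression sum of squares with both variables (Fig. 3)
    SSR_X2   = 14_915_814      # model with price excluded (Fig. 9)
    MSE      = 407_127         # error mean square of the full model (Fig. 3)
    n, k = 34, 2

    F_partial = (SSR_X1X2 - SSR_X2) / MSE      # contribution of X1 given X2
    F_crit = stats.f.ppf(0.95, 1, n - k - 1)   # ~ 4.16
    print(F_partial, F_crit)                   # ~ 60.3 > 4.16, so keep X1 in the model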

Curiously, the t-statistic calculated by formula (6) and the partial F-statistic given by formula (9) are uniquely related:

t²(a) = F(1, a), where a is the number of degrees of freedom.

Dummy variable regression models and interaction effects

When discussing multiple regression models, we have assumed that each independent variable is numeric. However, in many situations categorical variables must be included in the model. For example, in the OmniPower problem, price and advertising spend were used to predict average monthly sales. In addition to these numeric variables, you can try to account for the location of the product inside the store (for example, displayed in the window or not). To account for categorical variables in a regression model, dummy variables must be included in it. For example, if a categorical explanatory variable has two categories, one dummy variable Xd is enough to represent them: Xd = 0 if the observation belongs to the first category, Xd = 1 if the observation belongs to the second category.

To illustrate dummy variables, consider a model predicting the average assessed value of real estate based on a sample of 15 houses. As explanatory variables, we choose the living area of the house (thousands of square feet) and the presence of a fireplace (Fig. 11). The dummy variable X2 (presence of a fireplace) is defined as follows: X2 = 0 if the house has no fireplace, X2 = 1 if the house has a fireplace.

Fig. 11. Assessed value predicted from living area and presence of a fireplace

Let us assume that the slope of the estimated value, depending on the living area, is the same for houses with and without a fireplace. Then the multiple regression model looks like this:

Y i = β 0 + β 1 X 1i + β 2 X 2i + ε i

where Yi is the assessed value of the i-th house, measured in thousands of dollars, β0 is the intercept, X1i is the living area of the i-th house, measured in thousands of square feet, β1 is the slope of the assessed value with respect to living area at a constant value of the dummy variable, X2i is a dummy variable indicating the presence or absence of a fireplace, β2 is the effect of a fireplace on the assessed value of the house at a constant living area, and εi is the random error in the assessed value of the i-th house. The results of estimating the regression model are shown in Fig. 12.

Fig. 12. Results of estimating the regression model for the assessed value of houses, obtained with the Excel Analysis ToolPak; a table similar to Fig. 11 was used for the calculation, with the only change that "Yes" was replaced by 1 and "No" by 0

In this model, the regression coefficients are interpreted as follows:

  1. If the dummy variable is held constant, an increase in living area by 1,000 sq. feet results in a $16.2 thousand increase in the predicted average assessed value.
  2. If living area is held constant, the presence of a fireplace increases the predicted average assessed value by $3,900.

Note (Fig. 12) that the t-statistic for living area is 6.29 with a p-value of almost zero, while the t-statistic for the dummy variable is 3.1 with a p-value of 0.009. Thus, each of these two variables contributes significantly to the model at the 0.01 significance level. In addition, the coefficient of multiple determination means that 81.1% of the variation in assessed value is explained by the variability of living area and the presence of a fireplace.
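A minimal sketch of fitting this dummy-variable model in Python; the file name houses.csv and its column names are assumptions, and the coefficients of Fig. 12 will only be reproduced with the actual 15-house data from Fig. 11.

    import numpy as np
    import pandas as pd

    # Assumed columns: "value" ($000), "area" (000 sq. ft), "fireplace" ("Yes"/"No")
    df = pd.read_csv("houses.csv")
    fireplace = (df["fireplace"] == "Yes").astype(float)   # dummy variable: 1 = fireplace, 0 = none

    X = np.column_stack([np.ones(len(df)), df["area"], fireplace])
    b, *_ = np.linalg.lstsq(X, df["value"].to_numpy(), rcond=None)
    print(b)   # [intercept, area slope (~16.2 per 1,000 sq. ft), fireplace effect (~3.9)]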

Interaction effect. In all the regression models discussed above, the effect of an explanatory variable on the response was assumed to be statistically independent of the effects of the other explanatory variables. If this condition is not met, there is an interaction between the independent variables. For example, advertising probably has a large impact on sales of low-priced products, whereas if the price of a product is too high, increased advertising spending cannot raise sales appreciably. In this case, there is an interaction between the price of the product and the spending on its advertising. In other words, one cannot make a general statement about how sales depend on advertising spend: the impact of advertising spend on sales depends on the price. This influence is captured in the multiple regression model by an interaction term. To illustrate this concept, let us return to the house value problem.

In the regression model we developed, it was assumed that the effect of house size on its value does not depend on whether the house has a fireplace. In other words, it was believed that the slope of the estimated value, depending on the living area of ​​the house, was the same for houses with and without a fireplace. If these slopes differ from each other, there is an interaction between the size of the house and the presence of a fireplace.

Testing the hypothesis of equal slopes comes down to assessing the contribution that the product of the explanatory variable X1 and the dummy variable X2 makes to the regression model. If this contribution is statistically significant, the original regression model cannot be applied. The results of a regression analysis involving the variables X1, X2 and X3 = X1 * X2 are shown in Fig. 13.

Fig. 13. Results obtained with the Excel Analysis ToolPak for a regression model that includes living area, the presence of a fireplace, and their interaction

To test the null hypothesis H0: β3 = 0 against the alternative H1: β3 ≠ 0, using the results shown in Fig. 13, note that the t-statistic for the interaction term equals 1.48. Since the p-value is 0.166 > 0.05, the null hypothesis is not rejected. Therefore, the interaction does not have a significant effect on the regression model that includes living area and the presence of a fireplace.
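A sketch of the same model extended with the interaction term X3 = X1 * X2, under the same assumed houses.csv layout:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("houses.csv")                          # same assumed layout as above
    fireplace = (df["fireplace"] == "Yes").astype(float)
    interaction = df["area"] * fireplace                    # X3 = X1 * X2

    X = np.column_stack([np.ones(len(df)), df["area"], fireplace, interaction])
    b, *_ = np.linalg.lstsq(X, df["value"].to_numpy(), rcond=None)
    print(b)   # b[3] is the interaction slope; its t-statistic (Fig. 13) decides significance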

Summary. This note showed how a marketing manager can use multiple regression analysis to predict sales volume from price and advertising spend. Various multiple regression models were considered, including quadratic models, models with dummy variables, and models with interaction effects (Fig. 14).

Fig. 14. Block diagram of the note

Based on materials from Levin et al. Statistics for Managers. Moscow: Williams, 2004, pp. 873–936.

Suppose a developer is valuing a group of small office buildings in a traditional business district.

A developer can use multiple regression analysis to estimate the price of an office building in a given area based on the following variables.

y is the estimated price of an office building;

x 1 - total area in square meters;

x 2 - number of offices;

x 3 - the number of entrances (0.5 means an entrance for mail delivery only);

x 4 - the age of the building in years.

This example assumes that there is a linear relationship between each independent variable (x 1 , x 2 , x 3 and x 4) and the dependent variable (y), i.e. the price of an office building in a given area. The initial data is shown in the figure.

The settings for solving the problem are shown in the figure of the Regression dialog. The calculation results are placed on a separate sheet in three tables.

As a result, we got the following mathematical model:

y = 52318 + 27.64*x1 + 12530*x2 + 2553*x3 - 234.24*x4.

The developer can now determine the assessed value of an office building in the same area. If this building has an area of 2,500 square meters, three offices, two entrances, and is 25 years old, its value can be estimated with the following formula:

y = 27.64*2500 + 12530*3 + 2553*2 – 234.24*25 + 52318 ≈ 158,261 c.u.
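The same arithmetic as a small Python sketch; the tiny discrepancy from 158,261 comes from using the rounded coefficients quoted above.

    # Fitted coefficients from the Regression output above (rounded)
    b0, b1, b2, b3, b4 = 52318, 27.64, 12530, 2553, -234.24

    area, offices, entrances, age = 2500, 3, 2, 25
    price = b0 + b1*area + b2*offices + b3*entrances + b4*age
    print(round(price))   # ~ 158,258 with these rounded coefficients; the note quotes 158,261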

In regression analysis, the most important results are:

  • the coefficients of the variables and the Y-intercept, which are the desired parameters of the model;
  • multiple R, characterizing the accuracy of the model for the available input data;
  • the Fisher F-test (in the example considered, it significantly exceeds the critical value of 4.06);
  • t-statistics, values characterizing the degree of significance of the individual coefficients of the model.

Special attention should be paid to the t-statistics. Very often, when building a regression model, it is not known whether a particular factor x influences y. Including factors that do not affect the output value degrades the quality of the model, and computing the t-statistics helps to detect such factors. A rough rule is: if, for n >> k, the absolute value of the t-statistic is well above three, the corresponding coefficient should be considered significant and the factor kept in the model; otherwise it should be excluded. Thus, one can propose a two-stage procedure for building a regression model (a sketch follows the list below):

1) run the Regression tool on all available data and analyze the t-statistic values;

2) remove from the table of initial data the columns of those factors whose coefficients are insignificant and run the Regression tool on the new table.
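A minimal sketch of this two-stage procedure in Python with statsmodels; the file name data.csv, the response column "y", and the |t| < 3 threshold follow the rule of thumb above and are otherwise assumptions.

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("data.csv")                  # assumed file: response in column "y"
    y, X = df["y"], sm.add_constant(df.drop(columns="y"))

    model = sm.OLS(y, X).fit()                    # stage 1: fit with all factors
    tvals = model.tvalues.drop("const")
    weak = tvals[tvals.abs() < 3].index           # factors whose |t| is below the threshold
    print("dropping:", list(weak))

    model2 = sm.OLS(y, X.drop(columns=weak)).fit()   # stage 2: refit without weak factors
    print(model2.summary())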

In the previous notes, the focus has often been on a single numerical variable, such as mutual fund returns, Web page load time, or soft drink consumption. In this and the following notes, we will consider methods for predicting the values ​​of a numeric variable depending on the values ​​of one or more other numeric variables.

The material will be illustrated with a running example: forecasting sales volume for a clothing store. The Sunflowers chain of discount clothing stores has been expanding for 25 years. However, the company currently has no systematic approach to selecting new outlets. The location where the company intends to open a new store is determined from subjective considerations: favorable rental terms or the manager's idea of an ideal store location. Imagine that you are the head of the Special Projects and Planning Department. You have been assigned to develop a strategic plan for opening new stores. The plan should contain a forecast of annual sales in newly opened stores. You believe that selling area is directly related to revenue and want to take this fact into account in the decision-making process. How do you develop a statistical model that predicts annual sales based on the size of a new store?

Typically, regression analysis is used to predict the values of a variable. Its goal is to develop a statistical model that predicts the values of the dependent variable, or response, from the values of at least one independent, or explanatory, variable. In this note, we consider simple linear regression, a statistical method for predicting the values of the dependent variable Y from the values of the independent variable X. Subsequent notes describe the multiple regression model, designed to predict the values of the dependent variable Y from the values of several independent variables (X1, X2, …, Xk).


Types of regression models

The Durbin-Watson statistic D is approximately related to the first-order autocorrelation coefficient of the residuals by D ≈ 2(1 – ρ1), where ρ1 is the autocorrelation coefficient; if ρ1 = 0 (no autocorrelation), D ≈ 2; if ρ1 ≈ 1 (positive autocorrelation), D ≈ 0; if ρ1 = –1 (negative autocorrelation), D ≈ 4.

In practice, the Durbin-Watson test compares the value of D with critical theoretical values dL and dU for a given number of observations n, number of independent variables k (for simple linear regression k = 1), and significance level α. If D < dL, the hypothesis of independent random deviations is rejected (positive autocorrelation is present); if D > dU, the hypothesis is not rejected (there is no autocorrelation); if dL < D < dU, there is not enough evidence to decide. When the calculated value of D exceeds 2, it is not D itself but the expression (4 – D) that is compared with dL and dU.

To calculate the Durbin-Watson statistic in Excel, turn to the bottom table in Fig. 14, Residual Output. The numerator in expression (10) is calculated with the function =SUMXMY2(array1; array2), and the denominator with =SUMSQ(array) (Fig. 16).

Fig. 16. Formulas for calculating the Durbin-Watson statistic
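The same statistic is easy to compute directly from the residuals; a minimal sketch:

    import numpy as np

    def durbin_watson(residuals):
        # D = sum of squared successive differences / sum of squared residuals
        e = np.asarray(residuals, dtype=float)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    # e = y - y_hat from the fitted regression; D close to 2 means no autocorrelation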

In our example, D = 0.883. The main question is: how small must the Durbin-Watson statistic be to conclude that positive autocorrelation is present? The value of D must be compared with the critical values dL and dU, which depend on the number of observations n and the significance level α (Fig. 17).

Fig. 17. Critical values of the Durbin-Watson statistic (table fragment)

Thus, in the problem of sales volume for a store delivering goods to the home, there is one independent variable (k = 1), 15 observations (n = 15), and a significance level α = 0.05. Hence dL = 1.08 and dU = 1.36. Since D = 0.883 < dL = 1.08, there is positive autocorrelation between the residuals and the least squares method cannot be applied.

Testing Hypotheses about Slope and Correlation Coefficient

So far, regression has been used solely for forecasting. To determine the regression coefficients and predict the value of the variable Y for a given value of X, the least squares method was used. In addition, we considered the standard error of the estimate and the coefficient of determination. If the residual analysis confirms that the applicability conditions of the least squares method are not violated and the simple linear regression model is adequate, it can be argued, based on the sample data, that there is a linear relationship between the variables in the population.

Applying the t-test for the slope. By testing whether the population slope β1 equals zero, one can determine whether there is a statistically significant relationship between the variables X and Y. If this hypothesis is rejected, it can be argued that there is a linear relationship between X and Y. The null and alternative hypotheses are formulated as follows: H0: β1 = 0 (no linear relationship), H1: β1 ≠ 0 (there is a linear relationship). By definition, the t-statistic equals the difference between the sample slope and the hypothesized population slope, divided by the standard error of the slope estimate:

(11) t = (b1 – β1) / Sb1

where b1 is the slope of the regression line estimated from the sample data, β1 is the hypothesized slope of the population regression line, Sb1 is the standard error of the slope, and the test statistic t has a t-distribution with n – 2 degrees of freedom.

Let's check whether there is a statistically significant relationship between store size and annual sales at α = 0.05. The t-test results are displayed along with other parameters by the Analysis ToolPak (Regression option). The full output is shown in Fig. 4; the fragment related to the t-statistic is in Fig. 18.

Fig. 18. Results of applying the t-test for the slope

Because the number of stores is n = 14 (see Fig. 3), the critical values of the t-statistic at significance level α = 0.05 are found as t_L = T.INV(0.025; 12) = –2.1788, where 0.025 is half the significance level and 12 = n – 2, and t_U = T.INV(0.975; 12) = +2.1788.

Since the t-statistic = 10.64 > t_U = 2.1788 (Fig. 19), the null hypothesis H0 is rejected. Likewise, the p-value for t = 10.6411, calculated by the formula =1-T.DIST(D3; 12; TRUE), is approximately zero, so hypothesis H0 is again rejected. The fact that the p-value is almost zero means that if there were no real linear relationship between store size and annual sales, it would be almost impossible to detect one with linear regression. Therefore, there is a statistically significant linear relationship between average annual store sales and store size.

Fig. 19. Testing the hypothesis about the population slope at significance level 0.05 with 12 degrees of freedom

Applying the F-test for the slope. An alternative approach to testing hypotheses about the slope of a simple linear regression is to use the F-test. Recall that the F-test is used to test the ratio of two variances. When testing the slope hypothesis, the measure of random error is the error variance (the sum of squared errors divided by its degrees of freedom), so the F-test uses the ratio of the variance explained by the regression (SSR divided by the number of independent variables k) to the error variance (MSE = S²YX).

By definition, the F-statistic equals the mean square due to regression (MSR) divided by the error variance (MSE): F = MSR/MSE, where MSR = SSR/k, MSE = SSE/(n – k – 1), and k is the number of independent variables in the regression model. The test statistic F has an F-distribution with k and n – k – 1 degrees of freedom.

For a given significance level α, the decision rule is formulated as follows: if F > F_U, the null hypothesis is rejected; otherwise it is not rejected. The results, presented as an ANOVA summary table, are shown in Fig. 20.

Fig. 20. ANOVA table for testing the hypothesis about the statistical significance of the regression coefficient

Like the t-test, the F-test output appears in the table produced by the Analysis ToolPak (Regression option). The full output is shown in Fig. 4; the fragment related to the F-statistic is in Fig. 21.

Fig. 21. F-test results obtained using the Excel Analysis ToolPak

The F-statistic is 113.23 and the p-value is close to zero (cell Significance F). At significance level α = 0.05, the critical value of the F-distribution with 1 and 12 degrees of freedom is F_U = F.INV(1-0.05; 1; 12) = 4.7472 (Fig. 22). Since F = 113.23 > F_U = 4.7472 and the p-value is close to 0 < 0.05, the null hypothesis H0 is rejected, i.e. the size of a store is closely related to its annual sales volume.

Fig. 22. Testing the hypothesis about the population slope at significance level 0.05, with 1 and 12 degrees of freedom

Confidence interval containing the slope β1. To test the hypothesis of a linear relationship between the variables, you can also build a confidence interval for the slope β1 and check whether the hypothesized value β1 = 0 falls inside this interval. The center of the confidence interval for β1 is the sample slope b1, and its boundaries are b1 ± t(n–2) Sb1.

As shown in Fig. 18, b1 = +1.670, n = 14, Sb1 = 0.157. With t12 = T.INV(0.975; 12) = 2.1788, we get b1 ± t(n–2) Sb1 = +1.670 ± 2.1788 * 0.157 = +1.670 ± 0.342, or +1.328 ≤ β1 ≤ +2.012. Thus, with probability 0.95 the population slope lies between +1.328 and +2.012 (i.e., between $1,328,000 and $2,012,000 per 1,000 sq. feet). Because these values are greater than zero, there is a statistically significant linear relationship between annual sales and store area. If the confidence interval contained zero, there would be no relationship between the variables. In addition, the confidence interval means that each additional 1,000 sq. feet of store area increases average annual sales by between $1,328,000 and $2,012,000.

Using the t-test for the correlation coefficient. Earlier we introduced the correlation coefficient r, which measures the relationship between two numeric variables. It can be used to determine whether there is a statistically significant relationship between two variables. Let us denote the population correlation coefficient by ρ. The null and alternative hypotheses are formulated as follows: H0: ρ = 0 (no correlation), H1: ρ ≠ 0 (there is a correlation). The test for the existence of a correlation is:

where r = +√r² if b1 > 0 and r = –√r² if b1 < 0. The test statistic t has a t-distribution with n – 2 degrees of freedom.

In the Sunflowers store chain problem, r² = 0.904 and b1 = +1.670 (see Fig. 4). Since b1 > 0, the correlation coefficient between annual sales and store size is r = +√0.904 = +0.951. Let's test the null hypothesis of no correlation between these variables using the t-statistic:

At a significance level of α = 0.05, the null hypothesis should be rejected because t= 10.64 > 2.1788. Thus, it can be argued that there is a statistically significant relationship between annual sales and store size.
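A sketch of this correlation t-test using the values quoted above (r = 0.951, n = 14):

    import math
    from scipy import stats

    r, n = 0.951, 14
    t = r * math.sqrt((n - 2) / (1 - r ** 2))   # t ~ 10.6
    t_crit = stats.t.ppf(0.975, n - 2)          # ~ 2.1788
    print(t, t_crit)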

When discussing inferences about population slopes, confidence intervals and criteria for testing hypotheses are interchangeable tools. However, the calculation of the confidence interval containing the correlation coefficient turns out to be more difficult, since the form of the sampling distribution of the statistic r depends on the true correlation coefficient.

Estimating the Expected Value and Predicting Individual Values

This section discusses methods for estimating the expected response Y and predicting individual values of Y for given values of the variable X.

Construction of a confidence interval. In Example 2 (see the section Least squares method above), the regression equation made it possible to predict the value of Y for a given value of X. In the retail outlet location problem, the average annual sales volume of a store with an area of 4,000 sq. feet was 7.644 million dollars. However, this is only a point estimate of the population mean. Earlier, the confidence interval was proposed for estimating the population mean; similarly, one can introduce a confidence interval for the expected response at a given value of the variable X:

where Ŷi = b0 + b1Xi is the predicted value of Y at X = Xi, SYX is the standard error of the estimate, n is the sample size, Xi is the given value of the variable X, µY|X=Xi is the expected value of Y at X = Xi, and SSX = Σ(Xi – X̄)².

Analysis of formula (13) shows that the width of the confidence interval depends on several factors. For a given significance level, an increase in the scatter around the regression line, measured by the standard error of the estimate, widens the interval. On the other hand, as expected, an increase in the sample size narrows it. In addition, the width of the interval changes with the value Xi: if Y is predicted for values of X close to the mean X̄, the confidence interval is narrower than when the response is predicted for values far from the mean.

Let's say that, when choosing a store location, we want to build a 95% confidence interval for the average annual sales of all stores with an area of 4,000 sq. feet:

Therefore, the average annual sales volume of all stores with an area of 4,000 sq. feet lies, with 95% confidence, between 6.971 and 8.317 million dollars.

Computing the confidence interval for a predicted value. In addition to the confidence interval for the expected response at a given value of X, it is often necessary to know the interval for a predicted value. Although the formula for such an interval is very similar to formula (13), this interval contains a predicted value rather than an estimate of a parameter. The interval for the predicted response Y at X = Xi, for a specific value Xi, is determined by the formula:

Let's assume that, when choosing a store location, we want to build a 95% prediction interval for the annual sales volume of a store with an area of 4,000 sq. feet:

Therefore, the predicted annual sales volume of a 4,000 sq. foot store lies, with 95% confidence, between 5.433 and 9.854 million dollars. As you can see, the prediction interval for an individual response is much wider than the confidence interval for its expected value. This is because the variability in predicting individual values is much greater than in estimating an expected value.
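A sketch that computes both intervals for a simple linear regression from raw data; it follows the confidence-interval formula (13) and the analogous prediction-interval formula, and the Sunflowers numbers above will only be reproduced when the actual 14-store data are supplied.

    import numpy as np
    from scipy import stats

    def intervals(x, y, x0, conf=0.95):
        # Confidence interval for the mean response and prediction interval at x0
        # for a simple linear regression fitted by least squares.
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        ssx = np.sum((x - x.mean()) ** 2)
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / ssx
        b0 = y.mean() - b1 * x.mean()
        y_hat = b0 + b1 * x0
        s_yx = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # standard error of estimate
        h = 1 / n + (x0 - x.mean()) ** 2 / ssx
        t = stats.t.ppf((1 + conf) / 2, n - 2)
        ci = (y_hat - t * s_yx * np.sqrt(h), y_hat + t * s_yx * np.sqrt(h))
        pi = (y_hat - t * s_yx * np.sqrt(1 + h), y_hat + t * s_yx * np.sqrt(1 + h))
        return y_hat, ci, pi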

Pitfalls and ethical issues associated with the use of regression

Difficulties associated with regression analysis:

  • Ignoring the conditions of applicability of the method of least squares.
  • An erroneous estimate of the conditions for applicability of the method of least squares.
  • Wrong choice of an alternative method when the conditions of applicability of the least squares method are violated.
  • Application of regression analysis without in-depth knowledge of the subject of study.
  • Extrapolation of the regression beyond the range of the explanatory variable.
  • Confusion between statistical and causal relationships.

The spread of spreadsheets and statistical software has eliminated the computational problems that once prevented the use of regression analysis. However, this has also meant that regression analysis is now used by people who lack sufficient qualifications and knowledge. How can such users know about alternative methods if many of them have no idea at all of the conditions of applicability of the least squares method and do not know how to verify them?

The researcher should not get carried away with number crunching: calculating the intercept, the slope, and the coefficient of determination. Deeper knowledge is needed. Let's illustrate this with a classic textbook example. Anscombe showed that all four data sets shown in Fig. 23 have the same regression parameters (Fig. 24).

Fig. 23. Four artificial data sets

Fig. 24. Regression analysis of the four artificial data sets, performed with the Analysis ToolPak

So, from the point of view of regression analysis, all these data sets are completely identical. If the analysis stopped here, we would lose a lot of useful information, as the scatter plots (Fig. 25) and residual plots (Fig. 26) built for these data sets show.

Fig. 25. Scatter plots for the four data sets

The scatter plots and residual plots show that these data sets differ from one another. The only set distributed along a straight line is set A; its residual plot shows no pattern. The same cannot be said for sets B, C, and D. The scatter plot for set B shows a pronounced quadratic pattern, a conclusion confirmed by its residual plot, which has a parabolic shape. The scatter plot and residual plot for set C show that it contains an outlier. In this situation, the outlier should be excluded from the data set and the analysis repeated. The technique for detecting and eliminating outliers from observations is called influence analysis; after eliminating the outlier, the re-estimated model can look completely different. The scatter plot for set D illustrates an unusual situation in which the empirical model depends heavily on a single response (X8 = 19, Y8 = 12.5). Such regression models need to be estimated especially carefully. So scatter plots and residual plots are an essential tool of regression analysis and should be an integral part of it; without them, regression analysis is not credible.

Fig. 26. Residual plots for the four data sets

How to avoid pitfalls in regression analysis:

  • Always start the analysis of a possible relationship between the variables X and Y with a scatter plot.
  • Before interpreting the results of a regression analysis, check the conditions for its applicability.
  • Plot the residuals versus the independent variable. This makes it possible to see how well the empirical model fits the observations and to detect violations of the constancy of variance.
  • Use histograms, stem and leaf plots, box plots, and normal distribution plots to test the assumption of a normal distribution of errors.
  • If the applicability conditions of the least squares method are not met, use alternative methods (for example, quadratic or multiple regression models).
  • If the applicability conditions of the least squares method are met, it is necessary to test the hypothesis about the statistical significance of the regression coefficients and build confidence intervals containing the mathematical expectation and the predicted response value.
  • Avoid predicting values ​​of the dependent variable outside the range of the independent variable.
  • Keep in mind that statistical dependencies are not always causal. Remember that correlation between variables does not mean that there is a causal relationship between them.

Summary. As shown in the block diagram (Fig. 27), this note describes the simple linear regression model, the conditions for its applicability, and ways to test those conditions. The t-test for the statistical significance of the regression slope was considered, and a regression model was used to predict the values of the dependent variable. An example related to choosing a location for a retail outlet examined the dependence of annual sales volume on store area; the information obtained allows a store location to be selected more accurately and its annual sales to be predicted. The following notes continue the discussion of regression analysis and cover multiple regression models.

Fig. 27. Block diagram of the note

Based on materials from Levin et al. Statistics for Managers. Moscow: Williams, 2004, pp. 792–872.

If the dependent variable is categorical, logistic regression should be applied.

The purpose of multiple regression is to analyze the relationship between one dependent and several independent variables.

Example: there are data on the per-seat cost (when buying 50 seats) of various PDM systems. Required: to evaluate the relationship between the price of a PDM system seat and the number of characteristics implemented in it, shown in Table 2.

Table 2 - Characteristics of PDM systems

Item number | PDM system | Price | Product configuration management | Product models | Teamwork | Product change management | Document flow | Archives | Document search | Project planning | Product manufacturing management
iMAN Yes Yes
PartY Plus Yes Yes
PDM STEP Suite Yes Yes
Search Yes Yes
Windchill Yes Yes
Compass Manager Yes Yes
T-Flex Docs Yes Yes
TechnoPro No No

The numerical values of the characteristics (except "Cost", "Product models" and "Teamwork") indicate the number of implemented requirements for each characteristic.

Let's create and fill in a spreadsheet with initial data (Figure 27).

The value "1" of the variables "Mod. ed. " and "Collect. r-ta.” corresponds to the value "Yes" of the source data, and the value "0" to the value "No" of the source data.

Let's build a regression between the dependent variable "Cost" and the independent variables "Ex. conf.", "Mod. ed.", "Collect. r-ta", "Ex. rev.", "Doc.", "Archives", "Search", "Plan-e", "Ex. made".

To start the statistical analysis of the initial data, call the "Multiple Regression" module (Figure 22).

In the dialog box that appears (Figure 23), specify the variables for which the statistical analysis will be performed.

Figure 27 - Initial data

To do this, press the Variables button and in the dialog box that appears (Figure 28) in the part corresponding to dependent variables (Dependent var.) select "1-Cost", and in the part corresponding to independent variables (Independent variable list) select all other variables. The selection of several variables from the list is carried out using the "Ctrl" or "Shift" keys, or by specifying the numbers (range of numbers) of the variables in the corresponding field.



Figure 28 - Dialog box for setting variables for statistical analysis

After the variables are selected, click the "OK" button in the dialog box for setting the parameters of the "Multiple Regression" module. In the window that appears with the inscription "No of indep. vars. >=(N-1); cannot invert corr. matrix." (Figure 29) press the "OK" button.

This message appears when the system cannot build a regression for all the declared independent variables, because the number of variables is greater than or equal to the number of observations minus 1.

In the window that appears (Figure 30), on the “Advanced” tab, you can change the method for constructing the regression equation.

Figure 29 - Error message

To do this, in the "Method" (method) field, select "Forward stepwise" (step-by-step with inclusion).

Figure 30 - Window for choosing a method and setting parameters for constructing a regression equation

Stepwise regression means that at each step some independent variable is included in or excluded from the model. In this way a set of the most "significant" variables is selected, which reduces the number of variables describing the dependence.

Stepwise analysis with exclusion ("Backward stepwise"). In this case, all variables are first included in the model, and then at each step variables that contribute little to the predictions are eliminated. As a result of a successful analysis, only the "important" variables remain in the model, that is, those whose contribution is greater than that of the rest.

Stepwise analysis with inclusion ("Forward stepwise"). With this method, independent variables are included in the regression equation one at a time until the equation describes the original data satisfactorily. The inclusion of variables is determined using the F-test: at each step all remaining variables are examined, the one that makes the greatest contribution is found, that variable is included in the model, and the procedure moves to the next step.
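Outside STATISTICA, the same idea can be sketched in a few lines of Python with statsmodels; this simplified version uses coefficient p-values as the entry criterion instead of the raw F-to-enter threshold.

    import pandas as pd
    import statsmodels.api as sm

    def forward_stepwise(X: pd.DataFrame, y, p_enter=0.05):
        # Greedy forward selection: at each step add the predictor with the smallest
        # p-value, as long as that p-value stays below p_enter.
        selected, remaining = [], list(X.columns)
        while remaining:
            pvals = {}
            for col in remaining:
                model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
                pvals[col] = model.pvalues[col]
            best = min(pvals, key=pvals.get)
            if pvals[best] > p_enter:
                break
            selected.append(best)
            remaining.remove(best)
        return selected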

In the "Intercept" field (free regression term), you can choose whether to include it in the equation ("Include in model") or ignore it and consider it equal to zero ("Set to zero").

The "Tolerance" parameter is the tolerance of the variables. Defined as 1 minus the square of the coefficient multiple correlation this variable with all other independent variables in the regression equation. Therefore, the smaller the tolerance of a variable, the more redundant is its contribution to the regression equation. If the tolerance of any of the variables in the regression equation is equal to or close to zero, then the regression equation cannot be evaluated. Therefore, it is desirable to set the tolerance parameter to 0.05 or 0.1.

The parameter "Ridge regression; lambda:" is used when the independent variables are highly intercorrelated and robust estimates for the coefficients of the regression equation cannot be obtained through least squares. The specified constant (lambda) will be added to the diagonal of the correlation matrix, which will then be re-normalized (so that all diagonal elements are equal to 1.0). In other words, this parameter artificially reduces the correlation coefficients so that more robust (yet biased) estimates of the regression parameters can be computed. In our case, this parameter is not used.

The "Batch processing/printing" option is used when it is necessary to immediately prepare several tables for the report, reflecting the results and the process of regression analysis. This option is very useful when you want to print or analyze the results of a stepwise regression analysis at each step.

On the “Stepwise” tab (Figure 31), you can set the parameters of the inclusion (“F to enter”) or exclusion (“F to remove”) conditions for variables when constructing the regression equation, as well as the number of steps for constructing the equation (“Number of steps”).

Figure 31 - Tab “Stepwise” of the window for choosing a method and setting parameters for constructing a regression equation

F is the value of the F-criterion.

If, during stepwise analysis with inclusion, it is necessary that all or almost all variables enter the regression equation, then it is necessary to set the “F to enter” value to the minimum (0.0001), and set the “F to remove” value to the minimum as well.

If, during stepwise analysis with an exception, it is necessary to remove all variables (one by one) from the regression equation, then it is necessary to set the value of "F to enter" very large, for example 999, and set the value of "F to remove" close to "F to enter".

It should be remembered that the value of the "F to remove" parameter must always be less than "F to enter".

The "Display results" option has two options:

2) At each step - display the results of the analysis at each step.

After clicking the "OK" button in the window for selecting methods of regression analysis, a window of analysis results will appear (Figure 32).

Figure 32 - Analysis results window

Figure 33 - Summary of regression analysis results

According to the results of the analysis, the coefficient of determination is 0.99987. This means that the constructed regression explains 99.987% of the scatter of values about the mean, i.e. it explains almost all of the variability in the variables.

The large value of the F-statistic and its significance level show that the constructed regression is highly significant.

To view the summary regression results, click the "Summary: Regression result" button. A spreadsheet with the results of the analysis will appear on the screen (Figure 33).

The third column ("B") displays estimates of the unknown parameters of the model, i.e. coefficients of the regression equation.

Thus, the desired regression equation has the form:

The constructed regression equation can be interpreted qualitatively as follows:

1) The cost of a PDM system increases with an increase in the number of implemented functions for change management, workflow and planning, and also if the product model support function is included in the system;

2) The cost of a PDM system decreases with the increase in configuration management functions implemented and with the increase in search capabilities.

The task of multiple linear regression is to build a linear model of the relationship between a set of continuous predictors and a continuous dependent variable. The following regression equation is often used:

Y = b 0 + a 1 X 1 + a 2 X 2 + … + a k X k + e

Here the a i are the regression coefficients, b 0 is the free term (if it is used), and e is the error term, about which various assumptions are made; most often these reduce to a normal distribution with zero mean vector and a given covariance matrix.
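As a minimal illustration of fitting such an equation, the Python sketch below generates hypothetical data with two predictors and estimates b 0, a 1, a 2 by least squares; the true coefficients and the noise level are made up for the example.

```python
import numpy as np

# hypothetical data: two continuous predictors and a continuous response
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 3.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=50)

# least-squares fit of y = b0 + a1*x1 + a2*x2 + e
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, a1, a2 = coef
print(f"b0 = {b0:.2f}, a1 = {a1:.2f}, a2 = {a2:.2f}")
```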

Such a linear model describes many problems well in various subject areas, e.g., economics, industry, and medicine. This is because some problems are linear in nature.

Let's take a simple example. Suppose we need to predict the cost of laying a road from its known parameters, and we have data on roads already laid, with their length, the depth of the bedding, the amount of working material, the number of workers, and so on.

It is clear that the cost of the road will ultimately be the sum of the contributions of all these factors taken separately. It will take a certain amount of, say, crushed stone with a known cost per ton, and a certain amount of asphalt, also with a known cost.

It is possible that forest will have to be cleared for the road, which will also lead to additional costs. All of this together gives the cost of building the road.

In this case, the model will include a free term which, for example, accounts for organizational costs (which are approximately the same for all construction and installation works of a given level) or taxes.

The error will include factors that we did not take into account when building the model (for example, the weather during construction - it cannot be taken into account at all).

Example: Multiple Regression Analysis

In this example, we will analyze several possible correlates of poverty, that is, variables that predict the percentage of families below the poverty line. The variable characterizing the percentage of families below the poverty line will therefore be treated as the dependent variable, and the remaining variables as continuous predictors.

Regression coefficients

To find out which of the explanatory variables contributes more to the prediction of poverty, we examine the standardized (Beta) regression coefficients.

Fig. 1. Estimates of the regression coefficients.

The Beta coefficients are the coefficients you would obtain if all variables were standardized to a mean of 0 and a standard deviation of 1. The magnitudes of these Beta coefficients therefore allow you to compare the relative contribution of each independent variable to the prediction of the dependent variable. As can be seen from the table above, population change since 1960 (Pop_Chng), the percentage of the population living in rural areas (Pt_Rural), and the number of people employed in agriculture (N_Empld) are the most important predictors of poverty rates; only they are statistically significant (their 95% confidence intervals do not include 0). The regression coefficient for population change since 1960 (Pop_Chng) is negative, so the smaller the population growth, the more families live below the poverty line in the corresponding county. The regression coefficient for the percentage of the population living in rural areas (Pt_Rural) is positive: the greater the percentage of rural residents, the higher the poverty level.
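A small Python sketch of how such Beta coefficients can be obtained: every variable is rescaled to mean 0 and standard deviation 1, and an ordinary least-squares fit (without an intercept, which is unnecessary after centering) is run on the standardized data. The arrays X and y stand for the predictor columns (e.g., Pop_Chng, Pt_Rural, N_Empld) and the poverty variable; this is an illustration, not the exact output of the package.

```python
import numpy as np

def beta_coefficients(X, y):
    """Standardized (Beta) regression coefficients: OLS on variables
    rescaled to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta
```

Because every variable is on the same scale, the absolute sizes of the returned coefficients can be compared directly, which is exactly how the table above is read.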

Significance of predictor effects

Let's look at the Table with the significance criteria.

Fig. 2. Simultaneous results for each given variable.

As this table shows, only the effects of two variables are statistically significant: the change in population since 1960 (Pop_Chng) and the percentage of the population living in rural areas (Pt_Rural), p < .05.

Residual analysis. After fitting a regression equation, it is almost always worth checking the predicted values and the residuals. For example, large outliers can greatly skew the results and lead to erroneous conclusions.

Line plot of outliers

It is usually necessary to check the original or standardized residuals for large outliers.

Fig. 3. Observation numbers and residuals.

The vertical axis of this plot is scaled in units of sigma, i.e., the standard deviation of the residuals. If one or more observations fall outside ±3 sigma, it may be worth excluding those observations (this is easy to do through the case selection conditions) and running the analysis again to make sure the results are not changed by these outliers.
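A minimal sketch of this check in Python: standardize the residuals and report the indices of observations falling outside ±3 sigma (the conventional threshold mentioned above); the residuals array is assumed to come from a previously fitted model.

```python
import numpy as np

def flag_outliers(residuals, threshold=3.0):
    """Standardize the residuals and flag observations outside +/- threshold sigma."""
    r = np.asarray(residuals, dtype=float)
    z = (r - r.mean()) / r.std()
    return np.where(np.abs(z) > threshold)[0]   # indices of suspect observations
```

The returned indices can then be used as a case selection condition to rerun the analysis without the flagged observations.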

Mahalanobis Distances

Most statistical textbooks devote a lot of attention to outliers and residuals in the dependent variable, but the role of outliers in the predictors often goes unnoticed. On the predictor side there is a set of variables that participate with different weights (regression coefficients) in predicting the dependent variable. You can think of the independent variables as defining a multidimensional space in which each observation can be plotted. For example, if you had two independent variables with equal regression coefficients, you could construct a scatterplot of these two variables and place each observation on it. You could then mark the mean point on this plot and compute the distance from each observation to this mean (the so-called center of gravity) in the two-dimensional space. This is the main idea behind the Mahalanobis distance. Now look at the histogram of the variable for population change since 1960.
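The same idea in a short Python sketch: the distance of each observation from the centroid of the predictors, weighted by the inverse covariance matrix (this assumes the predictor matrix X has more rows than columns, so the covariance matrix is invertible).

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each observation from the centroid of the predictors,
    taking their covariance into account."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # d_i^2 = (x_i - mean)' S^{-1} (x_i - mean)
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    return np.sqrt(d2)
```

Plotting a histogram of these distances is what produces a figure like the one below; a single very large value corresponds to an observation far from the bulk of the predictor space.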

Fig. 4. Histogram of the distribution of Mahalanobis distances.

It follows from the plot that there is one outlier in the Mahalanobis distances.

Fig. 5. Observed, predicted, and residual values.

Notice how Shelby County (in the first row) stands out from the rest of the counties. If you look at the raw data, you will find that Shelby County actually has the largest number of people employed in agriculture (variable N_Empld). It might be wiser to express this as a percentage rather than an absolute number, in which case Shelby County's Mahalanobis distance would probably not be as large compared with the other counties. Clearly, Shelby County is an outlier.

Deleted residuals

Another very important statistic for judging the severity of the outlier problem is the deleted residuals. These are the standardized residuals of the respective cases obtained when that case is removed from the analysis. Remember that the multiple regression procedure fits a regression surface to show the relationship between the dependent variable and the predictors. If one observation is an outlier (like Shelby County), the surface tends to be "pulled" toward that outlier, so removing the observation would produce a different surface (and different Beta coefficients). Therefore, if the deleted residuals differ greatly from the standardized residuals, you have reason to believe that the regression analysis is seriously distorted by the corresponding observation. In this example, the deleted residual for Shelby County shows that it is an outlier that severely skews the analysis. The scatterplot clearly shows the outlier.
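For illustration, deleted residuals can be computed without refitting the model n times by using the hat matrix; the sketch below returns the raw leave-one-out residuals e_i / (1 - h_ii), omitting the further standardization step that packages usually apply.

```python
import numpy as np

def deleted_residuals(X, y):
    """Residual each case would have if it were left out of the fit,
    computed from the hat matrix instead of refitting n times."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.inv(A.T @ A) @ A.T   # hat (projection) matrix
    e = y - H @ y                          # ordinary residuals
    h = np.diag(H)                         # leverages
    return e / (1 - h)                     # leave-one-out (deleted) residuals
```

Comparing these values with the ordinary standardized residuals, as in the scatterplot below, makes influential cases such as Shelby County stand out.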

Fig. 6. Raw residuals and deleted residuals for the variable indicating the percentage of families living below the poverty line.

Most of these statistics have more or less clear interpretations; let us now turn to normal probability plots.

As already mentioned, multiple regression assumes a linear relationship between the variables in the equation and a normal distribution of the residuals. If these assumptions are violated, the conclusions may be inaccurate. A normal probability plot of the residuals shows whether there are serious violations of these assumptions.

Fig. 7. Normal probability plot; raw residuals.

This plot is constructed as follows. First, the standardized residuals are ranked in order. From these ranks, z-values (i.e., standard values of the normal distribution) are computed under the assumption that the data follow a normal distribution. These z-values are plotted along the y-axis of the plot.

If the observed residuals (plotted along the x-axis) are normally distributed, all values will lie on a straight line on the plot; on our plot all points lie very close to the line. If the residuals are not normally distributed, they deviate from this line. Outliers also become noticeable on this plot.
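A sketch of this construction in Python, using the conventional plotting positions (i - 0.5)/n for the ranks; matplotlib and scipy are assumed to be available, and the residuals array comes from a previously fitted model.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def normal_probability_plot(residuals):
    """Rank the residuals, convert the ranks to expected normal z-values,
    and plot them against the observed residuals."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = len(r)
    probs = (np.arange(1, n + 1) - 0.5) / n   # plotting position for each rank
    z = stats.norm.ppf(probs)                 # expected z-values under normality
    plt.scatter(r, z)
    plt.xlabel("Observed residuals")
    plt.ylabel("Expected normal value (z)")
    plt.show()
```

If the points fall close to a straight line, the normality assumption looks reasonable; systematic curvature signals the kind of violation discussed next.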

If there is a lack of agreement and the data appear to form a clear curve (e.g., an S shape) around the line, the dependent variable can be transformed in some way (e.g., a logarithmic transformation to "pull in" the tail of the distribution, etc.). A discussion of such methods is outside the scope of this example (Neter, Wasserman, and Kutner, 1985, pp. 134-141, discuss transformations that remove non-normality and non-linearity of the data). However, researchers very often run analyses directly without testing the relevant assumptions, which can lead to erroneous conclusions.