

Paired Linear Regression: Workshop

The study of econometrics gives students experience in building econometric models: making decisions on the specification and identification of a model, choosing a method for estimating its parameters, assessing its quality, interpreting the results, obtaining predictive estimates, and so on. This workshop helps students acquire practical skills in these matters.

Approved by the editorial and publishing council

Compiled by: M.B. Perova, Doctor of Economics, Professor

General provisions

Econometric research begins with a theory that postulates relationships between phenomena. From the whole range of factors influencing the resultant attribute, the most significant ones are singled out. Once the presence of a relationship between the studied attributes has been established, the exact form of this relationship is determined using regression analysis.

The purpose of regression analysis is to determine an analytical expression (a function) in which the change of one variable (the resultant attribute) is attributed to the influence of an independent variable (the factor attribute). This relationship can be quantified by constructing a regression equation, or regression function.

The basic regression model is the paired (one-factor) regression model. Paired regression is an equation relating two variables y and x:

y = f(x),

where y is the dependent variable (resultant attribute);

x is the independent, explanatory variable (factor attribute).

Depending on how y changes as x changes, one distinguishes linear and non-linear regressions.

Linear Regression

The linear regression function, y = a + b·x, is a polynomial of the first degree and is used to describe processes that develop uniformly in time.

The presence of a random term (the regression error) reflects the impact on the dependent variable of factors not accounted for in the equation, possible nonlinearity of the model, and measurement errors. The appearance of a random error in the regression equation may thus be due to the following objective reasons:

1) non-representativeness of the sample. The paired regression model includes one factor, which cannot fully explain the variation of the resultant attribute; the latter may be influenced to a much greater extent by many other factors (omitted variables). For example, employment and wages may depend not only on qualification but also on the level of education, work experience, gender, and so on;

2) the variables involved in the model may be measured with error. For example, data on family food expenditures are compiled from the records of survey participants, who are expected to carefully record their daily expenses; this naturally leads to errors.

Based on the sample observations, the sample regression equation (regression line) is estimated:

ŷ = a + b·x,

where a and b are estimates of the parameters of the regression equation (α and β).

The analytical form of the dependence between the studied pair of attributes (the regression function) is determined using the following methods:

    Theoretical and logical analysis of the nature of the studied phenomena and their socio-economic essence. For example, if the relationship between the income of the population and the size of bank deposits is studied, it is obvious that the relationship is direct.

    The graphic method, in which the nature of the relationship is assessed visually.

The dependence can be clearly seen if you build a graph, plotting the values of the attribute x on the horizontal axis and the values of the attribute y on the vertical axis. Plotting the points corresponding to the pairs (x, y), we obtain a correlation field:

a) if the points are scattered randomly over the whole field, this indicates the absence of a relationship between the attributes;

b) if the points are concentrated around an axis running from the lower left corner to the upper right, there is a direct relationship between the attributes;

c) if the points are concentrated around an axis running from the upper left corner to the lower right, the relationship between the attributes is inverse.

If we connect the points of the correlation field with line segments, we obtain a broken line with some upward trend. This is the empirical regression line; from its appearance one can judge not only the presence but also the form of the relationship between the studied attributes.

Building a Paired Regression Equation

Constructing the regression equation reduces to estimating its parameters. These estimates can be found in various ways; one of them is the method of least squares (OLS). The essence of the method is as follows. Each value x_i corresponds to an empirical (observed) value y_i. Once a regression equation is constructed, for example the equation of a straight line, each value x_i also corresponds to a theoretical (calculated) value ŷ_i. The observed values do not lie exactly on the regression line, i.e. they do not coincide with ŷ_i. The difference between the actual and calculated values of the dependent variable is called the residual:

e_i = y_i − ŷ_i.

OLS yields parameter estimates for which the sum of the squared deviations of the actual values y of the resultant attribute from the theoretical values ŷ, i.e. the sum of squared residuals, is minimal:

S = Σ(y_i − ŷ_i)² → min.

For linear equations, and for nonlinear equations reducible to linear form, the following system is solved with respect to a and b:

Σy = n·a + b·Σx,
Σxy = a·Σx + b·Σx²,

where n is the sample size.

Solving this system of equations gives the values of a and b, which allows us to write the regression equation

ŷ = a + b·x,

where x is the explanatory (independent) variable;

ŷ is the explained (dependent) variable.

The regression line passes through the point (x̄, ȳ), so the equality ȳ = a + b·x̄ is fulfilled.

You can also use ready-made formulas that follow from this system of equations:

b = cov(x, y) / σ²_x = (mean(xy) − x̄·ȳ) / (mean(x²) − x̄²),   a = ȳ − b·x̄,

where ȳ is the average value of the dependent attribute;

x̄ is the average value of the independent attribute;

mean(xy) is the arithmetic mean of the product of the dependent and independent attributes;

σ²_x is the variance of the independent attribute;

cov(x, y) is the covariance between the dependent and independent attributes.

The sample covariance of two variables x and y is the average value of the product of the deviations of these variables from their means:

cov(x, y) = mean((x − x̄)(y − ȳ)) = mean(xy) − x̄·ȳ.

The parameter b is of great practical value and is called the regression coefficient. The regression coefficient shows by how many units, on average, the value of y changes when x changes by one unit of its measurement.
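As an illustration, here is a minimal Python sketch of these moment formulas; the sample values are made up purely for demonstration:

```python
import numpy as np

# Hypothetical sample: factor x and resultant attribute y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])

# Moment formulas from the text: b = cov(x, y) / var(x), a = mean(y) - b * mean(x).
cov_xy = np.mean(x * y) - x.mean() * y.mean()   # sample covariance
var_x = np.mean(x ** 2) - x.mean() ** 2         # sample variance of x
b = cov_xy / var_x
a = y.mean() - b * x.mean()
print(f"y_hat = {a:.3f} + {b:.3f}*x")
```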

The sign of the parameter b in the paired regression equation indicates the direction of the relationship:

if b > 0, the relationship between the studied indicators is direct, i.e. as the factor attribute x increases, the resultant attribute y increases, and vice versa;

if b < 0, the relationship between the studied indicators is inverse, i.e. as the factor attribute x increases, the resultant attribute y decreases, and vice versa.

The value of the parameter a in the paired regression equation can in some cases be interpreted as the initial value of the resultant attribute y. This interpretation of the parameter a is possible only if the value x = 0 is meaningful.

After building the regression equation, the observed values of y can be represented as

y_i = ŷ_i + e_i.

The residuals e_i, like the errors ε_i, are random variables, but unlike the errors they are observable. The residual is that part of the dependent variable y which cannot be explained by the regression equation.

Based on the regression equation, one can calculate the theoretical values of y for any values of x.

In economic analysis, the concept of the elasticity of a function is often used. The elasticity of a function is calculated as the ratio of the relative change in y to the relative change in x. It shows by how many percent the function changes when the independent variable changes by 1%.

Since the elasticity of a linear function, E(x) = b·x / (a + b·x), is not constant but depends on x, the elasticity coefficient is usually calculated as an average elasticity index.

The average elasticity coefficient shows by how many percent, on average over the population, the resultant attribute y changes when the factor attribute x changes by 1% of its average value:

Ē = b · x̄ / ȳ,

where x̄ and ȳ are the average values of the variables x and y in the sample.
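Continuing the sketch above, the average elasticity is a one-liner:

```python
# Average elasticity of the fitted line: % change in y per 1% change in x at the means.
E = b * x.mean() / y.mean()
print(f"Average elasticity: {E:.3f}")
```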

Evaluation of the quality of the constructed regression model

The quality of the regression model is the adequacy of the constructed model to the initial (observed) data.

To measure the tightness of the relationship, i.e. to measure how close it is to a functional one, one needs to determine the variances that measure the deviations of y from ŷ and characterize the residual variation due to other factors. These variances underlie the indicators that characterize the quality of the regression model.

The quality of a paired regression is assessed using coefficients characterizing:

1) the tightness of the relationship: the correlation index and the paired linear correlation coefficient;

2) the approximation error;

3) the quality of the regression equation and of its individual parameters: the mean square errors of the regression equation as a whole and of its individual parameters.

For regression equations of any kind, the correlation index is defined; it characterizes only the tightness of the correlation dependence, i.e. the degree of its approximation to a functional relationship:

R = sqrt( σ²_factor / σ²_total ),

where σ²_factor is the factor (theoretical) variance;

σ²_total is the total variance.

The correlation index takes values 0 ≤ R ≤ 1, where:

if R = 0, there is no correlation between the attributes x and y;

if R = 1, the relationship between x and y is functional. The closer R is to 1, the closer the relationship between the studied attributes is considered to be; if R exceeds roughly 0.7 (a common convention), the relationship can be considered close.

The variances required to calculate the indicators of the tightness of the relationship are computed as follows.

Total variance, measuring the overall variation due to the action of all factors:

σ²_total = Σ(y − ȳ)² / n.

Factor (theoretical) variance, measuring the variation of the resultant attribute y due to the action of the factor attribute x:

σ²_factor = Σ(ŷ − ȳ)² / n.

Residual variance, characterizing the variation of y due to all factors other than x (i.e. with x excluded):

σ²_resid = Σ(y − ŷ)² / n.

Then, according to the rule of addition of variances:

σ²_total = σ²_factor + σ²_resid.
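Continuing the running sketch, the decomposition and the correlation index can be checked numerically:

```python
# Variance decomposition for the fitted line (continues the earlier sketch).
y_hat = a + b * x
var_total = np.mean((y - y.mean()) ** 2)        # total variance
var_factor = np.mean((y_hat - y.mean()) ** 2)   # factor (theoretical) variance
var_resid = np.mean((y - y_hat) ** 2)           # residual variance
assert np.isclose(var_total, var_factor + var_resid)  # rule of addition of variances
R = np.sqrt(var_factor / var_total)             # correlation index
```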

The quality of a paired linear regression can also be assessed using the paired linear correlation coefficient:

r = cov(x, y) / (σ_x · σ_y) = b · σ_x / σ_y,

where cov(x, y) is the covariance of the variables x and y;

σ_x is the standard deviation of the independent attribute;

σ_y is the standard deviation of the dependent attribute.

The linear correlation coefficient characterizes the tightness and the direction of the relationship between the studied attributes. It is measured within [−1; +1]:

if r > 0, the relationship between the attributes is direct;

if r < 0, the relationship between the attributes is inverse;

if r = 0, there is no linear relationship between the attributes;

if r = +1 or r = −1, the relationship between the attributes is functional, i.e. characterized by a perfect correspondence between x and y. The closer |r| is to 1, the closer the relationship between the studied attributes is considered to be.

Squaring the correlation index (or the paired linear correlation coefficient) gives the coefficient of determination.

The coefficient of determination represents the share of the factor variance in the total variance and shows what percentage of the variation of the resultant attribute y is explained by the variation of the factor attribute x:

R² = σ²_factor / σ²_total.

It does not cover all of the variation of y due to the factor attribute x, but only that part of it which corresponds to the linear regression equation, i.e. it shows the share of the variation of the resultant attribute that is linearly related to the variation of the factor attribute.

The value 1 − R² is the proportion of the variation of the resultant attribute which the regression model could not account for.

The spread of points in the correlation field can be very large, and the calculated regression equation can give a large error in estimating the analyzed indicator.

The average approximation error shows the average deviation of the calculated values from the actual ones:

Ā = (1/n) · Σ |(y − ŷ) / y| · 100%.

The maximum allowable value is 12–15%.
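In the same Python setting, the average approximation error is, for example:

```python
# Average relative approximation error, in percent (continues the earlier sketch).
A = 100 * np.mean(np.abs((y - y_hat) / y))
print(f"Average approximation error: {A:.1f}%")  # values within ~12-15% are acceptable
```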

A measure of the spread of the dependent variable around the regression line is the standard error. For the entire set of observed values, the standard (root-mean-square) error of the regression equation is computed; it is the standard deviation of the actual values of y relative to the theoretical values ŷ calculated from the regression equation:

S = sqrt( Σ(y − ŷ)² / (n − m) ),

where n − m is the number of degrees of freedom;

m is the number of parameters of the regression equation (for the straight-line equation m = 2).

The value of the mean square error can be assessed by comparing it:

a) with the average value of the resultant attribute y;

b) with the standard deviation of the attribute y:

if S < σ_y, then the use of this regression equation is appropriate.

The standard (root-mean-square) errors of the equation parameters and of the correlation index are evaluated separately:

m_a = S · sqrt( Σx² / (n · Σ(x − x̄)²) );

m_b = S / ( σ_x · sqrt(n) );

m_r = sqrt( (1 − r²) / (n − 2) ),

where σ_x is the standard deviation of x.
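A sketch of these error formulas, continuing the running example (the formulas follow the textbook forms given above; verify them against your course materials):

```python
# Standard errors of the equation and of its parameters (continues the sketch).
n, m = len(x), 2                                     # observations, parameters (a and b)
S = np.sqrt(np.sum((y - y_hat) ** 2) / (n - m))      # standard error of the equation
sum_dev2 = np.sum((x - x.mean()) ** 2)               # sum of squared deviations of x
m_b = S / np.sqrt(sum_dev2)                          # standard error of b
m_a = S * np.sqrt(np.sum(x ** 2) / (n * sum_dev2))   # standard error of a
r = b * x.std() / y.std()                            # linear correlation coefficient
m_r = np.sqrt((1 - r ** 2) / (n - 2))                # standard error of r
```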

Checking the significance of the regression equation and of the indicators of tightness of the relationship

For the constructed model to be used in further economic calculations, checking its quality is not enough. It is also necessary to check the significance (importance) of the estimates of the regression equation and of the indicator of tightness of the relationship obtained by least squares, i.e. to check how well they correspond to the true parameters of the relationship.

This is necessary because indicators calculated for a limited population retain the element of randomness inherent in the individual values of the attribute; they are therefore only estimates of a certain statistical regularity. One must assess the degree of accuracy and significance (reliability, materiality) of the regression parameters. Significance is understood as the probability that the value of the checked parameter is not zero and that its confidence interval does not include values of the opposite sign.

A significance test checks the assumption that the parameters differ from zero.

Assessing the significance of the paired regression equation comes down to testing hypotheses about the significance of the regression equation as a whole, of its individual parameters (a, b), and of the paired coefficient of determination or correlation index.

In this case, the following main hypotheses H₀ can be put forward:

1) a = b = 0: the regression coefficients are insignificant and the regression equation is also insignificant;

2) R² = 0: the paired coefficient of determination is insignificant and the regression equation is also insignificant.

The alternative (reverse) hypotheses are:

1) a ≠ 0, b ≠ 0: the regression coefficients are significantly different from zero, and the constructed regression equation is significant;

2) R² ≠ 0: the paired coefficient of determination is significantly different from zero, and the constructed regression equation is significant.

Testing the hypothesis about the significance of the paired regression equation

To test the hypothesis of statistical insignificance of the regression equation as a whole and of the coefficient of determination, we use the F-criterion (Fisher's criterion):

F = ( σ²_factor / k₁ ) / ( σ²_resid / k₂ )   or   F = ( R² / (1 − R²) ) · ( k₂ / k₁ ),

where k₁ = m − 1 and k₂ = n − m are the numbers of degrees of freedom;

n is the number of population units;

m is the number of parameters of the regression equation;

σ²_factor is the factor variance;

σ²_resid is the residual variance.

The hypothesis is tested as follows:

1) if the actual (observed) value of the F-criterion is greater than the critical (tabular) value F_crit, then with probability 1 − α the main hypothesis about the insignificance of the regression equation or of the paired coefficient of determination is rejected, and the regression equation is recognized as significant;

2) if the actual (observed) value of the F-criterion is less than the critical value, then with probability 1 − α the main hypothesis about the insignificance of the regression equation or of the paired coefficient of determination is accepted, and the constructed regression equation is recognized as insignificant.

The critical value of the F-criterion is found from the corresponding tables, depending on the significance level α and the numbers of degrees of freedom k₁ and k₂.

The number of degrees of freedom is an indicator defined as the difference between the sample size n and the number of parameters estimated from this sample, m. For a paired regression model the number of degrees of freedom is calculated as n − 2, since two parameters (a and b) are estimated from the sample.

The significance level α is the value defined as

α = 1 − γ,

where γ is the confidence probability that the estimated parameter falls within the confidence interval. Usually γ = 0.95 is taken. Thus α is the probability that the estimated parameter does not fall into the confidence interval, equal to 0.05 (5%).

Then, when assessing the significance of the paired regression equation, the critical value of the F-criterion is calculated as

F_crit = F(α; k₁; k₂) = F(α; 1; n − 2).
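Continuing the running example, the F-test can be performed with scipy (assumed available):

```python
from scipy import stats

# Overall significance of the equation (continues the earlier sketch).
R2 = r ** 2
F_obs = R2 / (1 - R2) * (n - m) / (m - 1)       # observed F
F_crit = stats.f.ppf(0.95, m - 1, n - m)        # critical value at alpha = 0.05
print("equation significant" if F_obs > F_crit else "equation not significant")
```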

Testing the hypothesis about the significance of the parameters of the paired regression equation and of the correlation index

When checking the significance of the parameters of the equation (the assumption that the parameters differ from zero), the main hypothesis of insignificance of the obtained estimates is put forward (H₀: a = 0, b = 0). As the alternative (reverse) hypothesis, the significance of the parameters of the equation is put forward (H₁: a ≠ 0, b ≠ 0).

To test these hypotheses, Student's t-criterion (t-statistic) is used. The observed value of the t-criterion is compared with the critical value determined from the table of Student's distribution. The critical value t_crit depends on two parameters: the significance level α and the number of degrees of freedom n − 2.

The proposed hypotheses are tested as follows:

1) if the modulus of the observed value of the t-criterion is greater than the critical value, i.e. |t_obs| > t_crit, then with probability 1 − α the main hypothesis about the insignificance of the regression parameters is rejected, i.e. the regression parameters are not equal to 0;

2) if the modulus of the observed value of the t-criterion is less than or equal to the critical value, i.e. |t_obs| ≤ t_crit, then with probability 1 − α the main hypothesis about the insignificance of the regression parameters is accepted, i.e. the regression parameters hardly differ from 0 or are equal to 0.

The significance of the regression coefficients is assessed by Student's criterion by comparing their estimates with the values of their standard errors:

t_a = a / m_a;   t_b = b / m_b.

To assess the statistical significance of the correlation index (linear correlation coefficient), Student's t-criterion is likewise used: t_r = r / m_r.
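Continuing the running example (m_a, m_b, m_r from the earlier sketch, stats from scipy):

```python
# t-tests for a, b and r (two-sided, alpha = 0.05); continues the earlier sketches.
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)
for name, t_obs in [("a", a / m_a), ("b", b / m_b), ("r", r / m_r)]:
    verdict = "significant" if abs(t_obs) > t_crit else "not significant"
    print(f"parameter {name}: t = {t_obs:.2f} -> {verdict}")
```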

Paired regression characterizes the relationship between two attributes: the resultant and the factor one. An important and non-trivial step in building a regression model is the choice of the regression equation. This choice is based on theoretical knowledge about the phenomenon under study and on a preliminary analysis of the available statistical data.

The paired linear regression equation has the form

ŷ = a₀ + a₁·x,

where ŷ are the theoretical values of the resultant attribute obtained from the regression equation, and a₀, a₁ are the coefficients (parameters) of the regression equation.

The regression model is built on statistical data; both individual values of the attributes and grouped data can be used. To identify the relationship between the attributes when the number of observations is large, the statistical data are first grouped by both attributes and a correlation table is built. The correlation table displays only a pairwise correlation, i.e. the relationship of the resultant attribute with one factor. The parameters of the regression equation are estimated by the method of least squares, which is based on the assumption that the observations of the studied population are independent and on the requirement that the sum of squared deviations of the empirical data from the aligned (fitted) values of the resultant attribute be minimal:

S = Σ(y − ŷ)² → min.

For the linear regression equation we have

S(a₀, a₁) = Σ(y − a₀ − a₁·x)² → min.

To find the minimum of this function, we equate its partial derivatives to zero and obtain a system of two linear equations, called the system of normal equations:

n·a₀ + a₁·Σx = Σy,
a₀·Σx + a₁·Σx² = Σxy,

where n is the volume of the studied population (the number of units of observation).

Solving the system of normal equations yields the parameters of the regression equation.

The intercept a₀ of the paired linear regression is the average value of y at the point x = 0, so its economic interpretation is often difficult; its meaning can be read as the average influence on the resultant attribute of factors not singled out for research. The coefficient a₁ shows by how much, on average, the value of the resultant attribute changes when the factor attribute changes by one unit.

After obtaining the regression equation, its adequacy must be checked, that is, its compliance with the actual statistical data. For this purpose, the significance of the regression coefficients is checked: it is determined to what extent these indicators are typical of the entire population, and whether they are the result of a random combination of circumstances.

To test the significance of the coefficients of a simple linear regression for a population of fewer than 30 units, Student's t-test is used. Comparing the value of each parameter with its average error, the value of the criterion is determined:

t = a / S_a,

where S_a is the average (standard) error of the parameter a.

The average errors of the parameters a₀ and a₁ are calculated by the following formulas:

S_a0 = S_e · sqrt( Σx² / (n · Σ(x − x̄)²) );   S_a1 = S_e / ( σ_x · sqrt(n) ),

where n is the sample size;

S_e is the standard deviation of the resultant attribute from the aligned values ŷ:

S_e = sqrt( Σ(y − ŷ)² / (n − 2) );

σ_x is the standard deviation of the factor attribute from its overall mean:

σ_x = sqrt( Σ(x − x̄)² / n ).

Then the calculated (actual) values of the criterion are, respectively:

t_a0 = a₀ / S_a0 for the parameter a₀;

t_a1 = a₁ / S_a1 for the parameter a₁.

The calculated values of the criterion are compared with the critical value, determined from the Student table for the accepted significance level α and the number of degrees of freedom ν = n − k − 1, where n is the sample size and k is the number of factor attributes (k = 1 for paired regression). In socio-economic studies, the significance level is usually taken as 0.05 or 0.01. A parameter is recognized as significant if t_calc > t_crit; the hypothesis that the parameter took the obtained value only due to random circumstances, while in reality it is zero, is then rejected.

The adequacy of the regression model as a whole can be assessed using Fisher's F-test. The calculated value of the criterion is determined by the formula

F_calc = ( R² / (1 − R²) ) · ( (n − m) / (m − 1) ),

where m is the number of model parameters;

n is the sample size.

The critical value of Fisher's criterion for the accepted significance level and the numbers of degrees of freedom k₁ = m − 1, k₂ = n − m is determined from the table. If F_calc > F_crit, the regression model is recognized as adequate by this criterion (the hypothesis that the relationships inherent in the equation do not correspond to the really existing relationships is rejected).

The second task of correlation-regression analysis is to measure the tightness of the dependence between the resultant and the factor attribute.

For all types of relationship, the problem of measuring the closeness of the dependence can be solved by calculating the theoretical correlation ratio:

η = sqrt( δ² / σ² ),

where δ² is the variance in the series of aligned values ŷ of the resultant attribute, due to the factor attribute x;

σ² is the variance in the series of actual values y. This is the total variance, equal to the sum of the variance due to the factor (the factor variance) and the residual variance (deviations of the empirical values of the attribute from the aligned theoretical ones).

Based on the rule of addition of variances, the theoretical correlation ratio can be expressed in terms of the residual variance:

η = sqrt( 1 − σ²_resid / σ² ).

Since the variance δ² reflects the variation in the series due only to the variation of the factor, while the variance σ² reflects the variation due to all factors, their ratio, called the theoretical coefficient of determination, shows what share of the total variance of the series is occupied by the variance caused by the variation of the factor x. The square root of the ratio of these variances gives the theoretical correlation ratio. With non-linear relationships, the theoretical correlation ratio is called the correlation index and is denoted by R.

If η = 1, the role of other factors in the variation is absent, the residual variance is zero, and the ratio means complete dependence of the variation of y on x. If η = 0, the variation of x does not affect the variation of y in any way, and in this case δ² = 0. Therefore, the correlation ratio takes values from 0 to 1. The closer the correlation ratio is to 1, the closer the relationship between the attributes.

In addition, with a linear form of the equation of the relationship, another indicator of the tightness of the relationship is used: the linear correlation coefficient

r = Σ(x − x̄)(y − ȳ) / sqrt( Σ(x − x̄)² · Σ(y − ȳ)² ).

The linear correlation coefficient takes values from −1 to 1. Negative values indicate an inverse relationship, positive values a direct one. The closer the modulus of the correlation coefficient is to unity, the closer the relationship between the attributes.

The following boundary estimates of the linear correlation coefficient are commonly accepted (the exact thresholds vary slightly between textbooks):

|r| < 0.3: there is practically no connection;

0.3 ≤ |r| < 0.5: the connection is weak;

0.5 ≤ |r| < 0.7: the connection is mediocre;

0.7 ≤ |r| < 0.9: the connection is strong;

0.9 ≤ |r| ≤ 1: the connection is very strong.

The square of the linear correlation coefficient is called the linear coefficient of determination.

The coincidence or divergence of the theoretical correlation ratio and the linear correlation coefficient is used to evaluate the form of the dependence. Their values coincide only in the case of a linear relationship; a discrepancy between them indicates nonlinearity of the relationship between the attributes. It is commonly assumed that if η − |r| < 0.1, the hypothesis of a linear relationship can be considered confirmed.

Indicators of the closeness of the relationship, especially those calculated from a relatively small statistical population, may be distorted by the action of random causes. This makes it necessary to check their reliability (significance), which makes it possible to extend conclusions obtained from sample data to the general population.

For this, the average error of the correlation coefficient is calculated:

m_r = sqrt( (1 − r²) / (n − 2) ),

where n − 2 is the number of degrees of freedom for a linear relationship.

Then the ratio of the correlation coefficient to its average error is found, i.e. t_r = r / m_r, which is compared with the tabular value of Student's t-test.

If the actual (calculated) value t_r is greater than the tabular (critical, threshold) value, the linear correlation coefficient is considered significant, and the relationship between x and y real.

After checking the adequacy of the constructed model (regression equation), it must be analyzed. For convenience of interpreting the parameter a₁, the coefficient of elasticity is used. It shows the average percentage change in the resultant attribute when the factor attribute changes by 1% and is calculated by the formula

E = a₁ · x̄ / ȳ.

The accuracy of the resulting model can be estimated from the value of the average approximation error:

Ā = (1/n) · Σ |(y − ŷ) / y| · 100%.

In addition, in some cases the residuals, characterizing the deviations of the observations from the calculated values, are informative in themselves. Of particular economic interest are the observations whose residuals have the largest positive or negative deviations from the expected level of the analyzed indicator.

Linear paired regression is widely used in econometrics owing to the clear economic interpretation of its parameters. Linear regression reduces to finding an equation of the form

ŷ = a + b·x   or   y = a + b·x + ε. (3.6)

An equation of this type allows one, for given values of the factor x, to obtain theoretical values of the resultant attribute by substituting the actual values of the factor into it.

The construction of a paired linear regression reduces to estimating its parameters a and b. These estimates can be found by different methods, for example by the method of least squares (OLS).

According to the least squares method, the parameter estimates a and b are chosen in such a way that the sum of the squared deviations of the actual values of the resultant attribute y from the calculated (theoretical, model) values ŷ is minimal. In other words, out of the entire set of lines, the regression line is chosen on the graph so that the sum of the squared vertical distances between the points and this line is minimal (Fig. 3.2):

S = Σ(y_i − ŷ_i)² → min. (3.7)

Fig. 3.2. Regression line minimizing the sum of squared vertical distances between the points and the line

For further derivations, we substitute the model value ŷ = a + b·x into expression (3.7) and get:

S(a, b) = Σ(y − a − b·x)² → min. (3.8)

To find the minimum of function (3.8), we calculate the partial derivatives with respect to each of the parameters a and b and equate them to zero:

∂S/∂a = −2·Σ(y − a − b·x) = 0,
∂S/∂b = −2·Σ x·(y − a − b·x) = 0.

Transforming this system, we obtain the following system of normal equations for estimating the parameters a and b:

n·a + b·Σx = Σy,
a·Σx + b·Σx² = Σxy. (3.9)

The matrix form of this system is:

| n    Σx  |   | a |   | Σy  |
| Σx   Σx² | · | b | = | Σxy |. (3.10)

Solving the system of normal equations (3.10) in matrix form, we obtain the estimates (3.11); in algebraic form, the solution of system (3.11) can be written as (3.12). After simple transformations, formula (3.12) can be written in a convenient form:

b = ( mean(xy) − x̄·ȳ ) / ( mean(x²) − x̄² ),   a = ȳ − b·x̄. (3.13)
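Continuing the running example, the same estimates can be obtained by solving the normal equations in matrix form with numpy:

```python
# Normal equations (3.10) in matrix form (continues the earlier sketch).
A_mat = np.array([[n, x.sum()], [x.sum(), np.sum(x ** 2)]])
rhs = np.array([y.sum(), np.sum(x * y)])
a_hat, b_hat = np.linalg.solve(A_mat, rhs)       # matches the moment-formula estimates
assert np.isclose(a_hat, a) and np.isclose(b_hat, b)
```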

Note that the estimates of the parameters of the regression equation can also be obtained by other formulas, for example:

b = r_xy · σ_y / σ_x,   a = ȳ − b·x̄, (3.14)

where r_xy is the sample paired linear correlation coefficient.

After calculating the regression parameters, we can write the equation of the mathematical model of the regression.

Note that the parameter b shows the average change in the result when the factor changes by one unit. For example, suppose the cost function is ŷ = a + 2·x, where y is costs (thousand rubles) and x is the number of units of production. Then, with an increase in the volume of production x by 1 unit, production costs rise by an average of 2 thousand rubles, i.e. an additional increase in output by 1 unit will require an increase in costs by an average of 2 thousand rubles.

The possibility of a clear economic interpretation of the regression coefficient has made the linear regression equation quite common in econometric studies.

Formally, a is the value of y at x = 0. If the factor attribute does not and cannot have a zero value, this interpretation of the free term makes no sense. The parameter a may have no economic content; attempts to interpret it economically can lead to absurdity, especially when a < 0.

Example 3.2. Suppose that for a group of enterprises producing the same type of product, the cost function ŷ = a + b·x is considered. The information needed to calculate the estimates of the parameters a and b is presented in Table 3.1.

Table 3.1

Calculation table

Enterprise number | Output, thousand units (x) | Production costs, million rubles (y)

(the observation rows and the computed sums are omitted in this copy)

The system of normal equations (with the sums computed from Table 3.1 substituted) will look like:

n·a + b·Σx = Σy,
a·Σx + b·Σx² = Σxy.

Solving this system by formula (3.13) gives the estimates of a and b. Writing out the model regression equation and substituting the values of x into it, we find the theoretical (model) values of y (see the last column of Table 3.1).

In this case, the value of the parameter a makes no economic sense.

In this example, we have:

The regression equation is always supplemented with an indicator of the tightness of the relationship. When linear regression is used, the linear correlation coefficient acts as such an indicator. There are various modifications of the formula for the linear correlation coefficient, for example:

r = b·σ_x/σ_y = cov(x, y)/(σ_x·σ_y) = Σ(x − x̄)(y − ȳ) / sqrt( Σ(x − x̄)²·Σ(y − ȳ)² ).

As is known, the linear correlation coefficient lies within the limits −1 ≤ r ≤ 1.

If the regression coefficient b > 0, then r > 0, and vice versa: if b < 0, then r < 0.

According to the data of Table 3.1, the value of the linear correlation coefficient is 0.993, which is quite close to 1 and means that there is a very close dependence of production costs on the volume of output.

It should be borne in mind that the linear correlation coefficient evaluates the closeness of the relationship of the considered attributes in its linear form. Therefore, the proximity of the absolute value of the linear correlation coefficient to zero does not yet mean that there is no connection between the attributes; with a different specification of the model, the relationship may turn out to be quite close.

To assess the quality of the fit of the linear function, the square of the linear correlation coefficient, called the coefficient of determination, is calculated. The coefficient of determination characterizes the proportion of the variance of the resultant attribute y explained by the regression in the total variance of the resultant attribute.

Accordingly, the value 1 − r² characterizes the proportion of the variance caused by the influence of other factors not taken into account in the model.

In our example, r² = 0.993² ≈ 0.986. Consequently, the regression equation explains 98.6% of the variance of the resultant attribute, and only 1.4% of its variance (the residual variance) falls to the share of other factors. The value of the coefficient of determination serves as one of the criteria for assessing the quality of a linear model. The larger the share of explained variation, the smaller the role of other factors; hence the linear model approximates the original data well and can be used to forecast values of the resultant attribute. Thus, assuming that the output of an enterprise may reach 6 thousand units, the forecast value of production costs will be 221.01 thousand rubles.

The simplest in terms of understanding, interpretation and calculation technique is the linear form of regression.

The linear paired regression equation is y_i = a₀ + a₁·x_i + ε_i, where

a₀, a₁ are the model parameters and ε_i is a random variable (the residual term).

Model parameters and their content: a₀ is the intercept, formally the average value of y at x = 0; a₁ is the regression coefficient, showing the average change in the resultant attribute when the factor attribute changes by one unit.
The regression equation is supplemented with an indicator of the tightness of the relationship. Such an indicator is the linear correlation coefficient, calculated by the formula

r = b·σ_x/σ_y   or   r = cov(x, y)/(σ_x·σ_y).

To assess the quality of the fit of the linear function, the square of the linear correlation coefficient, called the coefficient of determination, is calculated. The coefficient of determination characterizes the proportion of the variance of the resultant attribute explained by the regression in its total variance:

r² = 1 − σ²_resid / σ²_total,

where σ²_resid = Σ(y − ŷ)² / n and σ²_total = Σ(y − ȳ)² / n.

Accordingly, the value 1 − r² characterizes the proportion of the variance caused by the influence of other factors not taken into account in the model.

After the regression equation is built, its adequacy and accuracy are checked. These properties of the model are studied on the basis of the analysis of the series of residuals ε_i (the deviations of the calculated values from the actual ones).

The levels of the residual series are ε_i = y_i − ŷ_i.

Correlation and regression analysis is carried out for a limited population. In this regard, the indicators of regression, correlation and determination can be distorted by the action of random factors. To check how typical these indicators are of the entire population, and whether they are the result of a coincidence of random circumstances, it is necessary to check the adequacy of the constructed model.

Checking the adequacy of the model consists in determining the significance of the model and establishing the presence or absence of a systematic error.

The values y_i corresponding to the data x_i for the theoretical (true) values of a₀ and a₁ are random; hence the coefficients a₀ and a₁ calculated from them are also random.

The significance of the individual regression coefficients is checked by Student's t-test, by testing the hypothesis that each regression coefficient equals zero. In doing so, it is found out how characteristic the calculated parameters are of the displayed set of conditions, and whether the obtained parameter values are the result of the action of random variables. Appropriate formulas are used for the corresponding regression coefficients.

Formulas for determining Student's t-test:

t_a0 = a₀ / S_a0;   t_a1 = a₁ / S_a1,

where S_a0 and S_a1 are the standard deviations (standard errors) of the free term and of the regression coefficient:

S_a0 = S_ε · sqrt( Σx² / (n · Σ(x − x̄)²) );   S_a1 = S_ε / ( σ_x · sqrt(n) ),

and S_ε is the standard deviation of the model residuals (the standard error of the estimate), determined by the formula

S_ε = sqrt( Σ(y − ŷ)² / (n − k − 1) ).

The calculated values of the t-criterion are compared with the tabular value t_αγ, determined for n − k − 1 degrees of freedom and the corresponding significance level α. If the calculated value of the t-criterion exceeds its tabular value t_αγ, the parameter is recognized as significant; in this case it is almost unbelievable that the found values of the parameters are due only to random coincidences.

The assessment of the significance of the regression equation as a whole is made on the basis of Fisher's F-criterion, which is preceded by an analysis of variance.

The total sum of squared deviations of the variable y from its mean value is decomposed into two parts, "explained" and "unexplained":

Q = Q_R + Q_e,

where Q = Σ(y − ȳ)² is the total sum of squared deviations;

Q_R = Σ(ŷ − ȳ)² is the sum of squared deviations explained by the regression (the factor sum of squared deviations);

Q_e = Σ(y − ŷ)² is the residual sum of squared deviations, characterizing the influence of factors not taken into account in the model.

The scheme of the analysis of variance has the form presented in Table 35 (n is the number of observations, k is the number of parameters with the variable x).

Table 35 - Scheme of analysis of variance

Variance components | Sum of squares    | Degrees of freedom | Dispersion per degree of freedom
Total               | Q = Σ(y − ȳ)²     | n − 1              | Q / (n − 1)
Factor              | Q_R = Σ(ŷ − ȳ)²   | k                  | Q_R / k
Residual            | Q_e = Σ(y − ŷ)²   | n − k − 1          | Q_e / (n − k − 1)

Determining the dispersion per one degree of freedom brings the dispersions to a comparable form. Comparing the factor and residual dispersions per one degree of freedom, we obtain the value of Fisher's F-criterion:

F = ( Q_R / k ) / ( Q_e / (n − k − 1) ).

Fisher's F-test is used to check the significance of the regression equation as a whole. In the case of paired linear regression, the significance of the regression model is determined by the formula

F = ( r² / (1 − r²) ) · (n − 2).

If, at a given significance level, the calculated value of the F-criterion with γ₁ = k and γ₂ = n − k − 1 degrees of freedom is greater than the tabular one, the model is considered significant: the hypothesis about the random nature of the estimated characteristics is rejected, and their statistical significance and reliability are recognized. Checking for the presence or absence of a systematic error (fulfilment of the prerequisites of the method of least squares, OLS) is carried out on the basis of the analysis of the series of residuals. The random errors of the parameters of the linear regression and of the correlation coefficient are calculated by the formulas given above (S_a0, S_a1, m_r).

To test the randomness of the series of residuals, one can use the turning-points (peaks) criterion. A point is considered a turning point if the following conditions are met: ε_{i−1} < ε_i > ε_{i+1} or ε_{i−1} > ε_i < ε_{i+1}.

Next, the number of turning points p is calculated. The randomness test at the 5% significance level, i.e. with a confidence level of 95%, is the fulfilment of the inequality

p > [ 2(n − 2)/3 − 1.96 · sqrt( (16n − 29)/90 ) ].

The square brackets mean that the integer part of the number enclosed in brackets is taken. If the inequality is satisfied, the model is considered adequate.
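A sketch of this test, assuming numpy is imported as np (the bound follows the formula as reconstructed above):

```python
import numpy as np

def turning_points_random(resid, z=1.96):
    """Turning-points randomness check for a residual series (5% level)."""
    e = np.asarray(resid)
    n = len(e)
    mid, left, right = e[1:-1], e[:-2], e[2:]
    p = np.sum(((mid > left) & (mid > right)) | ((mid < left) & (mid < right)))
    bound = int(2 * (n - 2) / 3 - z * np.sqrt((16 * n - 29) / 90))  # integer part
    return p > bound   # True -> the residuals pass the randomness test
```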

To test that the mathematical expectation of the residual sequence equals zero, the average value of the series of residuals is calculated:

ē = Σε_i / n.

If ē = 0, the model is considered to contain no constant systematic error and to be adequate by the zero-mean criterion.

If ē ≠ 0, the null hypothesis that the mathematical expectation is zero is tested. To do this, Student's t-test is calculated by the formula

t = |ē| · sqrt(n) / S_ε,

where S_ε is the standard deviation of the model residuals (the standard error).

The value of the t-criterion is compared with the tabular t_αγ. If the inequality t > t_αγ holds, the model is inadequate by this criterion.

The variance of the levels of the series of residuals must be the same for all values of x (the property of homoscedasticity). If this condition is not met, heteroscedasticity is present.

To assess heteroscedasticity with a small sample size, one can use the Goldfeld–Quandt method, the essence of which is as follows (a sketch of the procedure follows below):

Arrange the values of the variable x in ascending order;

Divide the set of ordered observations into two groups;

Construct a regression equation for each group of observations;

Determine the residual sums of squares for the first and the second group: S₁ = Σε₁², S₂ = Σε₂², where

n₁ is the number of observations in the first group;

n₂ is the number of observations in the second group;

Calculate the criterion F_calc = S₁/S₂ or F_calc = S₂/S₁ (the numerator must contain the larger sum of squares). When the null hypothesis of homoscedasticity holds, F_calc will satisfy the F-criterion with degrees of freedom γ₁ = n₁ − m, γ₂ = n − n₁ − m for each residual sum of squares (where m is the number of estimated parameters in the regression equation). The more F_calc exceeds the tabular value of the F-criterion, the more the premise of equality of the residual variances is violated.
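A sketch of the procedure for a paired model (hypothetical helper, assuming numpy arrays):

```python
import numpy as np

def goldfeld_quandt(x, y):
    """Sketch of the Goldfeld-Quandt check for a paired linear model."""
    order = np.argsort(x)                      # 1) order observations by x
    xs, ys = x[order], y[order]
    half = len(xs) // 2                        # 2) split into two groups

    def rss(xg, yg):                           # 3) per-group OLS fit, 4) its RSS
        bg = (np.mean(xg * yg) - xg.mean() * yg.mean()) / np.var(xg)
        ag = yg.mean() - bg * xg.mean()
        return np.sum((yg - ag - bg * xg) ** 2)

    s1, s2 = rss(xs[:half], ys[:half]), rss(xs[half:], ys[half:])
    return max(s1, s2) / min(s1, s2)           # larger sum of squares in the numerator
```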

The independence of the sequence of residuals (the absence of autocorrelation) is checked using the Durbin–Watson d-test, determined by the formula

d = Σ(ε_i − ε_{i−1})² / Σε_i².

The calculated value of the criterion is compared with the lower d₁ and upper d₂ critical values of the Durbin–Watson statistic. The following cases are possible (sketches of d and r₁ follow below):

1) if d < d₁, the hypothesis of independence of the residuals is rejected and the model is recognized as inadequate by the criterion of independence of residuals;

2) if d₁ < d < d₂ (including these values themselves), it is considered that there are insufficient grounds for drawing either conclusion. An additional criterion must be used, for example the first autocorrelation coefficient:

r₁ = Σ ε_i·ε_{i−1} / Σ ε_i².

If the calculated value of the coefficient is less in modulus than the tabular value r₁,crit, the hypothesis of the absence of autocorrelation is accepted; otherwise this hypothesis is rejected;

3) if d₂ < d < 2, the hypothesis of independence of the residuals is accepted and the model is recognized as adequate by this criterion;

4) if d > 2, this indicates negative autocorrelation of the residuals. In this case, the calculated value of the criterion is converted by the formula d′ = 4 − d, and it is d′ that is compared with the critical values, not d.
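Both statistics are one-liners in numpy:

```python
import numpy as np

def durbin_watson(e):
    """d statistic: values near 2 suggest no first-order autocorrelation."""
    e = np.asarray(e)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def first_autocorrelation(e):
    """First autocorrelation coefficient r1 of the residual series."""
    e = np.asarray(e)
    return np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)
```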

The conformity of the distribution of the residual sequence to the normal law can be checked using the R/S criterion, determined by the formula

R/S = (ε_max − ε_min) / S_ε,

where S_ε is the standard deviation of the model residuals (the standard error). The calculated value of the R/S criterion is compared with the tabular values (the lower and upper bounds of this ratio). If the value does not fall within the interval between the critical bounds, then at the given significance level the hypothesis of normal distribution is rejected; otherwise the hypothesis is accepted.

To assess the quality of regression models, it is also advisable to use the correlation index (multiple correlation coefficient).

The formula for determining the correlation index is

R = sqrt( 1 − Q_e / Q ) = sqrt( Q_R / Q ),

where Q = Σ(y − ȳ)² is the total sum of squared deviations of the dependent variable from its mean;

Q_R = Σ(ŷ − ȳ)² is the sum of squared deviations explained by the regression;

Q_e = Σ(y − ŷ)² is the residual sum of squared deviations.

The decomposition can be represented as Q = Q_R + Q_e.

The correlation index takes values from 0 to 1. The higher the index value, the closer the calculated values of the resultant attribute are to the actual ones. The correlation index is used for any form of relationship between the variables; with paired linear regression it equals the paired correlation coefficient.

The following characteristics are used as measures of model accuracy:

- maximum error: ε_max = max|y_i − ŷ_i|, corresponding to the largest deviation of the calculated values from the actual ones;

- average absolute error: MAE = Σ|y_i − ŷ_i| / n; it shows by how much, on average, the actual values deviate from the model;

- variance of the series of residuals (residual variance): σ²_ε = Σ(ε_i − ē)² / n, where ē = Σε_i / n is the average value of the series of residuals;

- mean square error: S_ε = sqrt(σ²_ε), the square root of the variance; the smaller the error value, the more accurate the model;

- average relative approximation error: Ā = (1/n) · Σ |(y_i − ŷ_i) / y_i| · 100%.

The average approximation error should not exceed 8–10%.

If the regression model is recognized as adequate and the model parameters are significant, one proceeds to building a forecast.

The predicted value of the variable y is obtained by substituting the expected value of the independent variable x_progn into the regression equation.

This prediction is called a point forecast. The probability of exactly realizing a point forecast is practically zero, so the confidence interval of the forecast is calculated with high reliability.

The confidence interval of the forecast depends on the standard error, on the distance of x_progn from its mean x̄, on the number of observations n, and on the significance level of the forecast α. It is calculated by the formula

ŷ_progn ± t_tab · S_ε · sqrt( 1 + 1/n + (x_progn − x̄)² / Σ(x − x̄)² ),

where t_tab is determined from the Student distribution table for the significance level α and the number of degrees of freedom γ = n − k − 1.
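Continuing the running example, a point forecast and its interval (x_progn = 7.0 is a hypothetical value; t_crit and S come from the earlier sketches):

```python
# Point forecast and prediction interval (continues the earlier sketches).
x_progn = 7.0                                   # hypothetical value of the factor
y_point = a + b * x_progn                       # point forecast
half_width = t_crit * S * np.sqrt(
    1 + 1 / n + (x_progn - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
print(f"forecast: {y_point:.2f} +/- {half_width:.2f}")
```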

Example 13.

According to a survey of eight groups of families, data on the relationship between the population's spending on food and the level of family income are known (Table 36).

Table 36 - Relationship between household spending on food and family income

Expenditure on food, thousand rubles: 0.9  1.2  1.8  2.2  2.6  2.9  3.3  3.8
Family income, thousand rubles:       1.2  3.1  5.3  7.4  9.6  11.8  14.5  18.7

Assume that the relationship between family income and food expenditure is linear. To confirm this assumption, we construct a correlation field (Figure 8).

The graph shows that the points line up along a certain straight line.

For the convenience of further calculations, we compile Table 37.

We calculate the parameters of the linear paired regression equation ŷ = a₀ + a₁·x. To do this, we use the formulas

a₁ = ( mean(xy) − x̄·ȳ ) / ( mean(x²) − x̄² ),   a₀ = ȳ − a₁·x̄.

Figure 8 - Correlation field

We obtain a regression equation with slope a₁ ≈ 0.168, i.e. with an increase in family income by 1000 rubles, food costs increase by about 168 rubles.
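The fit can be reproduced with the data of Table 36; the computed slope is about 0.169, consistent with the roughly 168 rubles quoted above (the small difference is rounding):

```python
import numpy as np

# Table 36 data: x = family income, y = food expenditure (thousand rubles).
income = np.array([1.2, 3.1, 5.3, 7.4, 9.6, 11.8, 14.5, 18.7])
food = np.array([0.9, 1.2, 1.8, 2.2, 2.6, 2.9, 3.3, 3.8])

a1 = (np.mean(income * food) - income.mean() * food.mean()) / np.var(income)
a0 = food.mean() - a1 * income.mean()
print(f"y_hat = {a0:.3f} + {a1:.3f}*x")  # slope ~0.169: +1000 rub. income -> ~169 rub. food
```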

Calculation of the linear correlation coefficient.


Paired regression is an equation of the relationship between two variables y and x, of the form y = f(x),

where y is the dependent variable (resultant attribute);

x is the independent, explanatory variable (factor attribute).

There are linear and non-linear regressions.

Least squares method

To estimate the parameters of regressions that are linear in those parameters, the method of least squares (OLS) is used. OLS makes it possible to obtain parameter estimates under which the sum of the squared deviations of the actual values of the resultant attribute y from the theoretical values ŷ_x, for the same values of the factor x, is minimal:

Σ(y − ŷ_x)² → min.

5. Evaluation of the statistical significance of correlation indicators, of the parameters of the paired linear regression equation, and of the regression equation as a whole.

6. Assessment of the degree of closeness of the relationship between quantitative variables. Covariance. Correlation measures: the linear correlation coefficient and the correlation index (the theoretical correlation ratio).

Covariance

If the conditional mean M(y | x) changes as x changes, we obtain a correlation dependence. The presence of a correlation dependence cannot answer the question about the cause of the relationship; correlation establishes only the measure of this relationship, i.e. a measure of consistent variation.

A measure of the relationship between two variables can be found using the covariance:

cov(x, y) = Σ(x − x̄)(y − ȳ) / n.

The value of the covariance depends on the units in which the variables are measured. Therefore, to assess the degree of consistent variation, the correlation coefficient is used: a dimensionless characteristic with definite limits of variation.

7. Coefficient of determination. Standard error of the regression equation.

The coefficient of determination r²_xy characterizes the proportion of the variance of the resultant attribute y explained by the regression in the total variance of the resultant attribute. The closer r²_xy is to 1, the better the regression model, i.e. the better the model approximates the original data.

8. Evaluation of the statistical significance of the correlation indicators, of the parameters of the paired linear regression equation, and of the regression equation as a whole: Student's t-criterion and Fisher's F-criterion.

9. Nonlinear regression models and their linearization.

Nonlinear regressions are divided into two classes: regressions that are nonlinear with respect to the explanatory variables included in the analysis but linear with respect to the estimated parameters, and regressions that are nonlinear with respect to the estimated parameters.

Examples of regressions that are nonlinear in the explanatory variables but linear in the estimated parameters are the polynomial y = a + b·x + c·x² and the equilateral hyperbola y = a + b/x.
Nonlinear regression models and their linearization

With a nonlinear dependence between attributes that is reducible to a linear form, the parameters of the regression are also determined by OLS, with the only difference that the method is applied not to the original data but to the transformed data. Thus, considering the power function

y = a·x^b,

we convert it to a linear form:

ln y = ln a + b·ln x,

where the variables are expressed in logarithms.

Further, the OLS treatment is the same: a system of normal equations is constructed and the unknown parameters are determined. By taking the antilogarithm of ln a, we find the parameter a and, accordingly, the general form of the power function equation.

Generally speaking, regression that is nonlinear in the included variables presents no difficulties in parameter estimation: the estimates are determined, as in linear regression, by least squares. Thus, in a two-factor nonlinear regression equation such as y = a + b·x₁ + c·x₁² + d·x₂ + e·x₂², linearization can be carried out by introducing new variables z₁ = x₁, z₂ = x₁², z₃ = x₂, z₄ = x₂². The result is a four-factor linear regression equation

y = a + b·z₁ + c·z₂ + d·z₃ + e·z₄.
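A sketch of the log-linearization of the power function on made-up positive data:

```python
import numpy as np

# Linearize y = a * x**b by taking logs: ln y = ln a + b * ln x, then apply OLS.
x_pos = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # hypothetical data (must be positive)
y_pos = np.array([1.9, 3.1, 4.2, 4.9, 5.8])

X, Y = np.log(x_pos), np.log(y_pos)             # transformed variables
b_pow = (np.mean(X * Y) - X.mean() * Y.mean()) / np.var(X)
a_pow = np.exp(Y.mean() - b_pow * X.mean())     # antilog of ln(a) recovers a
print(f"y_hat = {a_pow:.3f} * x^{b_pow:.3f}")
```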

10. Multicollinearity. Methods for eliminating multicollinearity.

The greatest difficulties in using the apparatus of multiple regression arise in the presence of multicollinearity of the factors, when two or more factors are related by a linear dependence. The presence of multicollinearity of factors may mean that some factors always act in unison. As a result, the variation in the original data ceases to be fully independent, and it becomes impossible to assess the impact of each factor separately.

The stronger the multicollinearity of the factors, the less reliable is the estimate of the distribution of the sum of the explained variation over the individual factors obtained by the method of least squares (OLS).

The inclusion of multicollinear factors in the model is undesirable for the following reasons:

- it is difficult to interpret the parameters of the multiple regression; the linear regression parameters lose their economic sense;

- the parameter estimates are unreliable, show large standard errors, and change with the volume of observations, which makes the model unsuitable for analysis and forecasting.

Methods for eliminating multicollinearity

- exclusion of the variable(s) from the model.

However, some caution is needed when applying this method: in this situation, specification errors are possible.

- obtaining additional data or constructing a new sample.

Sometimes it is enough to increase the sample size to reduce multicollinearity. For example, with yearly data one can switch to quarterly data. Increasing the amount of data reduces the variances of the regression coefficients and thus increases their statistical significance. However, obtaining a new sample or expanding the old one is not always possible, or it may involve significant costs. Moreover, this approach can increase autocorrelation.

- changing the model specification.

In some cases, the problem of multicollinearity can be solved by changing the specification of the model: either the form of the model is changed, or new explanatory variables that are not taken into account in the model are added.

- using preliminary information about some parameters.

11. The classical linear model of multiple regression (CLMMR). Determination of the parameters of the multiple regression equation by the method of least squares.