
Assessment of the statistical significance of the regression equation and its parameters

In socio-economic research one often has to work with a limited population or with sample data. Therefore, after the parameters of the regression equation have been estimated, both the parameters and the equation as a whole must be evaluated for statistical significance, i.e. one must make sure that the resulting equation and its parameters were formed under the influence of non-random factors.

First, the statistical significance of the equation as a whole is evaluated, usually with Fisher's F-test. The calculation of the F-statistic is based on the rule of addition of variances: total variance of the result attribute = factor variance + residual variance.

$y$ is the actual value of the result attribute.

$\hat{y}$ is the theoretical value of the result attribute.

Having built the regression equation, one can calculate the theoretical value of the result attribute, i.e. the value computed from the regression equation with its estimated parameters.

These values characterize the result attribute as formed under the influence of the factors included in the analysis.

There are always discrepancies (residuals) between the actual values of the result attribute and those calculated from the regression equation; they are due to the influence of other factors not included in the analysis.

The difference between the theoretical and actual values of the result attribute is called the residual. The total variation of the result attribute is:

$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$

The variation in the result attribute that is due to the variation of the factors included in the analysis is estimated by comparing the theoretical values of the result attribute with its mean value: $\sum (\hat{y} - \bar{y})^2$. The residual variation is estimated by comparing the theoretical and actual values of the result attribute: $\sum (y - \hat{y})^2$. The total, factor and residual variances have different numbers of degrees of freedom:

total: $n - 1$, where $n$ is the number of units in the studied population;

factor: $m$, where $m$ is the number of factors included in the analysis;

residual: $n - m - 1$.

Fisher's F-test is calculated as the ratio of the factor variance to the residual variance, each calculated per one degree of freedom:

$F = \dfrac{\sum (\hat{y} - \bar{y})^2 / m}{\sum (y - \hat{y})^2 / (n - m - 1)}$

Using Fisher's F-test to assess the statistical significance of a regression equation is quite logical. The factor variance is the variation of the result attribute due to the factors included in the analysis, i.e. the explained part of the variation of the result attribute. The residual variance is the variation of the result attribute due to factors whose influence is not taken into account, i.e. factors not included in the analysis.

Thus, the F-test is designed to evaluate whether the factor variance significantly exceeds the residual variance. If the residual variance is only insignificantly lower than the factor variance, and even more so if it exceeds it, then the analysis does not include the factors that really affect the result attribute.

Fisher's F-test is tabulated: the actual (calculated) value is compared with the tabular one. If $F_{fact} > F_{table}$, the regression equation is considered statistically significant. Otherwise, the equation is not statistically significant and cannot be used in practice. The significance of the equation as a whole also indicates the statistical significance of the correlation indicators.
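The decomposition of variation and the F-ratio described above can be sketched in a few lines of Python. This is a minimal illustration for a one-factor linear regression; the function name `f_statistic` and the four data points are hypothetical and are not taken from the text.

```python
def f_statistic(x, y):
    """Return (F, ss_total, ss_factor, ss_resid) for a one-factor linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                 # regression coefficient (slope)
    a = mean_y - b * mean_x       # free term (intercept)
    y_hat = [a + b * xi for xi in x]  # theoretical values
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_factor = sum((yh - mean_y) ** 2 for yh in y_hat)
    ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    m = 1                         # number of factors in the model
    f_value = (ss_factor / m) / (ss_resid / (n - m - 1))
    return f_value, ss_total, ss_factor, ss_resid


# Hypothetical data, for illustration only
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
f_value, ss_total, ss_factor, ss_resid = f_statistic(x, y)
print(f_value, ss_total, ss_factor, ss_resid)
```

The calculated F would then be compared with the tabular value for $m$ and $n - m - 1$ degrees of freedom at the chosen significance level.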

After evaluating the equation as a whole, the statistical significance of its parameters must be evaluated. This is done using Student's t-statistic, calculated as the ratio of the absolute value of a parameter to its standard error. For a one-factor model, two statistics are calculated: one for the free term and one for the regression coefficient.

In all statistical software, the standard errors and t-statistics of the parameters are calculated together with the parameters themselves. The t-statistic is tabulated. If $t_{fact} > t_{table}$, the parameter is considered statistically significant, i.e. formed under the influence of non-random factors.

Calculating the t-statistic essentially means testing the null hypothesis that the parameter is insignificant, i.e. equal to zero. For a one-factor model $\hat{y} = a + bx$, two hypotheses are tested: $H_0: a = 0$ and $H_0: b = 0$.

The significance level for accepting the null hypothesis depends on the chosen confidence level. If the researcher sets a probability level of 95%, the threshold is 0.05: if the calculated significance level is ≥ 0.05, $H_0$ is accepted and the parameter is considered statistically insignificant. If the calculated significance level is < 0.05, $H_0$ is rejected and the alternative is accepted: $H_1: a \neq 0$ and $H_1: b \neq 0$.
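As a rough sketch of the parameter test just described, the following Python code computes the t-statistics of the free term and the regression coefficient of a one-factor model. It assumes the usual least-squares standard error formulas, which the text itself does not spell out; the function name `t_statistics` and the data are hypothetical.

```python
import math


def t_statistics(x, y):
    """Return (t_intercept, t_slope) for the one-factor model y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = ss_resid / (n - 2)       # residual variance per degree of freedom
    se_b = math.sqrt(s2 / sxx)    # standard error of the slope (assumed formula)
    se_a = math.sqrt(s2 * sum(xi ** 2 for xi in x) / (n * sxx))  # standard error of the intercept
    return abs(a) / se_a, abs(b) / se_b


# Hypothetical data, for illustration only
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
t_a, t_b = t_statistics(x, y)
print(t_a, t_b)
```

Each statistic is compared with the tabular Student value for $n - 2$ degrees of freedom; in a one-factor model the square of the slope statistic equals the F-statistic of the equation.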

Statistical software packages also report the significance level for accepting the null hypotheses. An assessment of the significance of the regression equation and its parameters can give the following results:

First, the equation as a whole is significant (according to the F-test) and all parameters of the equation are also statistically significant. This means that the resulting equation can be used both for making management decisions and for forecasting.

Secondly, the equation is statistically significant according to the F-test, but at least one of its parameters is not significant. The equation can be used to make management decisions regarding the analyzed factors, but cannot be used for forecasting.

Thirdly, the equation is not statistically significant, or the equation is significant according to the F-test but none of its parameters is significant. The equation cannot be used for any purpose.

For the regression equation to be recognized as a model of the relationship between the result attribute and the factor attributes, it must include all the critical factors that determine the result, and the meaningful interpretation of its parameters must correspond to theoretically substantiated relationships in the phenomenon under study. The coefficient of determination $R^2$ must be > 0.5.

When building a multiple regression equation, it is advisable to also assess it by the so-called adjusted coefficient of determination. The value of $R^2$ (like the correlation coefficients) increases as the number of factors included in the analysis grows. The coefficients are overestimated especially strongly in small populations. To offset this negative effect, $R^2$ and the correlation coefficients are adjusted for the number of degrees of freedom, i.e. the number of freely varying elements when certain factors are included.

Adjusted coefficient of determination:

$\bar{R}^2 = 1 - (1 - R^2)\,\dfrac{n - 1}{n - k - 1}$

$n$ is the population size (number of observations);

$k$ is the number of factors included in the analysis;

$n - 1$ is the number of degrees of freedom;

$(1 - R^2)$ is the residual (unexplained) share of the variance of the result attribute.

$\bar{R}^2$ is always less than $R^2$. On the basis of $\bar{R}^2$, one can compare the estimates of equations with different numbers of analyzed factors.
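A small Python sketch of this adjustment, assuming the common form $\bar{R}^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)$. The function name `adjusted_r2` and the input values (R² = 0.465, 30 observations, 2 factors) are chosen for illustration only.

```python
def adjusted_r2(r2, n, k):
    """Adjust R^2 for n observations and k factors (assumed standard formula)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Illustrative values: R^2 = 0.465, n = 30 observations, k = 2 factors
print(adjusted_r2(0.465, 30, 2))
```

With these inputs the adjusted value comes out noticeably below the unadjusted 0.465, and the gap widens as more factors are added for the same number of observations.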

34. Problems of studying time series.

Series of dynamics are also called time series. A time series is a time-ordered sequence of indicators characterizing a particular phenomenon (for example, the volume of GDP from 1990 to 1998). The purpose of studying a time series is to identify patterns in the development of the phenomenon under study (its main trend) and to forecast on this basis. It follows from the definition that any time series consists of two elements: time t and the level of the series (the specific values of the indicator from which the series is built).

Time series can be: 1) moment series, whose indicators are fixed at a point in time, on a certain date (for example, the population of St. Petersburg); 2) interval series, whose indicators are obtained for a certain period of time (for example, GDP for a period). The division into moment and interval series is necessary because it determines how some indicators of the series are calculated. Summing the levels of an interval series gives a meaningfully interpretable result, which cannot be said of summing the levels of a moment series, since the latter contain repeated counting.

The most important problem in the analysis of time series is the comparability of the levels of the series. This concept has many aspects. The levels must be comparable in terms of calculation methods, territory and coverage of population units. If the series is built in value terms, all levels must be presented or recalculated in comparable prices. In interval series, the levels must characterize periods of the same length. In moment series, the levels must be fixed on the same date. Series can be complete or incomplete. Incomplete series are used in official publications (1980, 1985, 1990, 1995, 1996, 1997, 1998, 1999…).

A comprehensive analysis of a time series includes the following points:

1. calculation of indicators of change in the levels of the series

2. calculation of average indicators of the series

3. identification of the main trend of the series, construction of trend models

4. estimation of autocorrelation in the series, construction of autoregressive models

5. correlation of time series

6. forecasting of the series.

35. Indicators of change in the levels of time series.

In general form, a time series can be represented as:

$y_1, y_2, \dots, y_t, \dots, y_n$

where y is the level of the series, t is the moment or period of time to which the level (indicator) refers, and n is the length of the series (number of periods). When studying a time series, the following indicators are calculated: 1. absolute increase, 2. growth factor (growth rate), 3. acceleration, 4. rate of increase, 5. absolute value of a 1% increase. The calculated indicators can be: 1. chain, obtained by comparing each level of the series with the immediately preceding one; 2. base, obtained by comparing each level with the level chosen as the comparison base (unless otherwise specified, the 1st level of the series is taken as the base).

1. Chain absolute increase: $\Delta_t = y_t - y_{t-1}$. It shows by how much each level is greater or less than the preceding one. Chain absolute increases are called indicators of the rate of change of the levels of the time series. Base absolute increase: $\Delta_t^{b} = y_t - y_1$. If the levels of the series are relative indicators expressed in %, the absolute increase is expressed in percentage points.

2. Growth factor (growth rate). It is calculated as the ratio of each level of the series to the immediately preceding one (chain growth factors), $K_t = y_t / y_{t-1}$, or to the level taken as the comparison base (base growth factors), $K_t^{b} = y_t / y_1$. It shows how many times each level of the series is greater or less than the preceding or base level. Growth rates are calculated from the growth factors; they are growth factors expressed in %: $T_t = K_t \cdot 100\%$.

3. On the basis of absolute increases, the acceleration of absolute increase is calculated: $\Delta'_t = \Delta_t - \Delta_{t-1}$. Acceleration is the absolute increase of absolute increases. It shows how the increments themselves change: whether they are stable or accelerating (growing).

4. The rate of increase is the ratio of the absolute increase to the comparison base, expressed in %: $T^{inc}_t = \dfrac{\Delta_t}{y_{t-1}} \cdot 100\%$; $T^{inc}_t = T_t - 100\%$. The rate of increase is the growth rate minus 100%. It shows by what percentage a given level of the series is greater or less than the preceding or base level.

5. Absolute value of a 1% increase. It is calculated as the ratio of the absolute increase to the rate of increase, i.e. $A_t = \dfrac{\Delta_t}{T^{inc}_t} = 0.01\,y_{t-1}$, one hundredth of the previous level.

All these indicators are calculated to assess the degree of change in the levels of the series. Chain growth factors and growth rates are called indicators of the intensity of change of the levels of time series.
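The indicators listed above can be computed mechanically from the levels of a series. Below is a minimal Python sketch; the function name `change_indicators`, the dictionary keys and the four-level series are hypothetical, and the first level is taken as the comparison base, as the text suggests by default.

```python
def change_indicators(levels):
    """Chain and base indicators of change for a list of series levels."""
    base = levels[0]
    pairs = list(zip(levels, levels[1:]))  # (previous level, current level)
    chain_increase = [cur - prev for prev, cur in pairs]
    chain_growth = [cur / prev for prev, cur in pairs]
    return {
        "chain_increase": chain_increase,
        "base_increase": [cur - base for cur in levels[1:]],
        "chain_growth": chain_growth,
        "base_growth": [cur / base for cur in levels[1:]],
        # rate of increase = growth rate minus 100%
        "chain_rate_of_increase": [(k - 1) * 100 for k in chain_growth],
        # acceleration = increase of the chain absolute increases
        "acceleration": [b - a for a, b in zip(chain_increase, chain_increase[1:])],
        # absolute value of a 1% increase = one hundredth of the previous level
        "one_percent_value": [prev / 100 for prev, _ in pairs],
    }


# Hypothetical series of four levels
levels = [200, 210, 231, 242]
indicators = change_indicators(levels)
for name, values in indicators.items():
    print(name, values)
```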

2. Calculation of average indicators of time series. The average level of the series, the average absolute increase, the average growth rate and the average rate of increase are calculated. Average indicators are calculated in order to summarize information and to make it possible to compare the levels and the indicators of their change in different series.

1. Average level of the series: a) for interval time series it is calculated as the simple arithmetic mean: $\bar{y} = \dfrac{\sum y_t}{n}$, where n is the number of levels in the time series; b) for moment series with equally spaced dates, the average level is calculated by a specific formula called the chronological average: $\bar{y} = \dfrac{\frac{1}{2} y_1 + y_2 + \dots + y_{n-1} + \frac{1}{2} y_n}{n - 1}$.

2. The average absolute increase is calculated from the chain absolute increases as a simple arithmetic mean: $\bar{\Delta} = \dfrac{\sum \Delta_t}{n - 1} = \dfrac{y_n - y_1}{n - 1}$.

3. The average growth factor is calculated from the chain growth factors using the geometric mean formula: $\bar{K} = \sqrt[n-1]{K_2 \cdot K_3 \cdots K_n} = \sqrt[n-1]{y_n / y_1}$. When commenting on the average indicators of a time series, two points must be indicated: the period that the analyzed indicator characterizes and the time interval for which the series is built.

4. Average growth rate: $\bar{T} = \bar{K} \cdot 100\%$.

5. Average rate of increase: $\bar{T}^{inc} = \bar{T} - 100\%$.
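The average indicators can be sketched in Python as follows. The function names and the three-level series are hypothetical; the chronological mean is written for equally spaced dates, which is an assumption of this sketch.

```python
def mean_level_interval(levels):
    """Simple arithmetic mean, used for interval series."""
    return sum(levels) / len(levels)


def mean_level_moment(levels):
    """Chronological mean for a moment series with equally spaced dates."""
    n = len(levels)
    return (levels[0] / 2 + sum(levels[1:-1]) + levels[-1] / 2) / (n - 1)


def mean_absolute_increase(levels):
    """Mean of the chain absolute increases: (y_n - y_1) / (n - 1)."""
    return (levels[-1] - levels[0]) / (len(levels) - 1)


def mean_growth_factor(levels):
    """Geometric mean of the chain growth factors: (y_n / y_1) ** (1 / (n - 1))."""
    return (levels[-1] / levels[0]) ** (1 / (len(levels) - 1))


# Hypothetical series of three levels
levels = [100, 110, 121]
k_mean = mean_growth_factor(levels)
print(mean_level_interval(levels))      # average level, interval series
print(mean_level_moment(levels))        # average level, moment series
print(mean_absolute_increase(levels))   # average absolute increase
print(k_mean * 100, k_mean * 100 - 100) # average growth rate and rate of increase, %
```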

To assess the significance of the correlation coefficient, Student's t-test is used.

The mean error of the correlation coefficient is found by the formula:

and based on the error, the t-test is calculated:

The calculated value of the t-test is compared with the tabular value found in the Student's distribution table at a significance level of 0.05 or 0.01 and the number of degrees of freedom n-1. If the calculated value of the t-test is greater than the tabulated one, then the correlation coefficient is recognized as significant.

With a curvilinear relationship, the F-test is used to assess the significance of the correlation relationship and the regression equation. It is calculated by the formula:

$F = \dfrac{\eta^2}{1 - \eta^2} \cdot \dfrac{n - m}{m - 1}$

where η is the correlation ratio; n is the number of observations; m is the number of parameters in the regression equation.

The calculated value of F is compared with the tabular value for the accepted level of significance α (0.05 or 0.01) and the numbers of degrees of freedom $k_1 = m - 1$ and $k_2 = n - m$. If the calculated value of F exceeds the tabular value, the relationship is recognized as significant.
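As a hedged illustration, the F-statistic can be computed from the correlation ratio in Python, assuming the common form $F = \frac{\eta^2}{1 - \eta^2} \cdot \frac{n - m}{m - 1}$, which agrees with the degrees of freedom $m - 1$ and $n - m$ named above. The function name `f_from_correlation_ratio` and the input values are illustrative.

```python
def f_from_correlation_ratio(eta_squared, n, m):
    """F-statistic from the squared correlation ratio (assumed standard form).

    n is the number of observations, m the number of parameters in the equation.
    """
    return eta_squared / (1 - eta_squared) * (n - m) / (m - 1)


# Illustrative values: eta^2 = 0.73, n = 30 observations, m = 2 parameters
f_value = f_from_correlation_ratio(0.73, 30, 2)
print(f_value)
```

The result is then compared with the tabular F for $k_1 = m - 1$ and $k_2 = n - m$ degrees of freedom.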

The significance of the regression coefficient is established using Student's t-test, which is calculated by the formula:

$t_{a_i} = \dfrac{|a_i|}{\sqrt{\sigma^2_{a_i}}}$

where $\sigma^2_{a_i}$ is the variance of the regression coefficient.

It is calculated by the formula:

where k is the number of factor features in the regression equation.

The regression coefficient is recognized as significant if $t_{a_1} \ge t_{cr}$. The value $t_{cr}$ is found in the table of critical points of Student's distribution at the accepted level of significance and the number of degrees of freedom k = n - 1.

4.3 Correlation-regression analysis in Excel

Let us carry out a correlation-regression analysis of the relationship between grain yield and labor costs per 1 quintal of grain. Open an Excel sheet and enter the values of the factor attribute (grain crop yield) in cells A1:A30 and the values of the result attribute (labor costs per 1 quintal of grain) in cells B1:B30. From the Tools menu, select Data Analysis and choose the Regression tool in the list that opens. Click OK; the Regression dialog box appears on the screen. In the Input Y Range field, enter the values of the result attribute (select cells B1:B30); in the Input X Range field, enter the values of the factor attribute (select cells A1:A30). Set the confidence level to 95% and select New Worksheet. Click OK. The "RESULTS" table appears on the worksheet. It contains the parameters of the regression equation, the correlation coefficient and other indicators that make it possible to determine the significance of the correlation coefficient and of the parameters of the regression equation.

RESULTS

Regression statistics: Multiple R; R-square; Adjusted R-square; Standard error; Observations.

Analysis of variance: Regression; Significance F.

Coefficient table columns: Coefficients; Standard error; t-statistic; P-value; Lower 95%; Upper 95%; Lower 95.0%; Upper 95.0%.

Coefficient table rows: Y-intercept; Variable X 1.

In this table, "Multiple R" is the correlation coefficient and "R-square" is the coefficient of determination. "Coefficients: Y-intercept" is the free term of the regression equation, 2.836242; "Variable X 1" is the regression coefficient, -0.06654. The table also gives the value of Fisher's F-test, 74.9876, Student's t-test, 14.18042, and the standard error, 0.112121, which are needed to assess the significance of the correlation coefficient, the parameters of the regression equation and the equation as a whole.

Based on the data in the table, we construct the regression equation: $\hat{y}_x = 2.836 - 0.067x$. The regression coefficient $a_1 = -0.067$ means that with an increase in grain yield of 1 quintal/ha, labor costs per 1 quintal of grain decrease by 0.067 man-hours.
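The interpretation of the regression coefficient can be checked with a short Python sketch that evaluates the fitted equation. The function name `labor_cost_per_quintal` and the yield values of 20 and 21 quintals/ha are hypothetical; only the coefficients 2.836 and -0.067 come from the text.

```python
def labor_cost_per_quintal(yield_q_per_ha):
    """Predicted labor costs (man-hours per quintal) from the fitted equation."""
    return 2.836 - 0.067 * yield_q_per_ha


# Hypothetical yields: raising the yield by 1 quintal/ha changes the
# prediction by the regression coefficient, -0.067 man-hours.
cost_20 = labor_cost_per_quintal(20)
cost_21 = labor_cost_per_quintal(21)
print(cost_20, cost_21, cost_21 - cost_20)
```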

The correlation coefficient r = 0.85 > 0.7; therefore, the relationship between the studied attributes in this population is close. The coefficient of determination $r^2 = 0.73$ shows that 73% of the variation of the result attribute (labor costs per 1 quintal of grain) is caused by the action of the factor attribute (grain yield).

Using the table of critical points of the Fisher-Snedecor distribution, we find the critical value of the F-test at a significance level of 0.05 and the numbers of degrees of freedom $k_1 = m - 1 = 2 - 1 = 1$ and $k_2 = n - m = 30 - 2 = 28$; it is equal to 4.20. Since the calculated value of the test is greater than the tabular value (F = 74.9896 > 4.20), the regression equation is recognized as significant.

To assess the significance of the correlation coefficient, we calculate the Student's t-test:

In the table of critical points of Student's distribution, we find the critical value of the t-test at a significance level of 0.05 and the number of degrees of freedom n - 1 = 30 - 1 = 29; it is equal to 2.0452. Since the calculated value is greater than the tabular one, the correlation coefficient is significant.

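To check the significance of a regression coefficient, the ratio of the coefficient to its standard deviation is analyzed. This ratio follows Student's distribution, so significance is determined with the t-test:

$t_{calc} = \dfrac{|b_i|}{S_{b_i}}$

where $S_{b_i}$ is the standard deviation of the coefficient $b_i$, determined from:

- the standard deviation obtained from the residual variance;

- the sum of squared deviations from the mean value.

If $t_{calc} > t_{table}$, then the coefficient $b_i$ is significant.

The confidence interval is determined by the formula:

$b_i \pm t_{table} \cdot S_{b_i}$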

WORK PROCEDURE

    Take the initial data according to your variant of the work (the student's number in the class register). A static control object with two inputs X 1 , X 2 and one output Y is specified. A passive experiment was carried out on the object, yielding a sample of 30 points containing the values of X 1 , X 2 and Y for each experiment.

    Open a new file in Excel 2007. Enter the initial data into the columns of the source table: the values of the input variables X 1 , X 2 and the output variable Y.

    Prepare two additional columns for the calculated values of Y and the residuals.

    Run the "Regression" tool: Data / Data Analysis / Regression.

Fig. 1. Dialog box "Data Analysis".

    Enter the addresses of the source data in the "Regression" dialog box:

    Input Y Range, Input X Range (2 columns),

    set the confidence level to 95%,

    in the "Output Range" option, specify the upper left cell of the area where the regression analysis results will be output (the first cell on the second worksheet),

    enable the "Residuals" and "Residual Plots" options,

    press the OK button to run the regression analysis.

Fig. 2. Dialog box "Regression".

    Excel will display 4 tables and 2 plots of residuals versus the variables X1 and X2.

    Format the "Summary Output" table: widen the column with the names of the output data and show 3 digits after the decimal point in the second column.

    Format the "ANOVA" table: choose a number of digits after the decimal point that is easy to read and understand, shorten the names of the variables and adjust the width of the columns.

    Format the table of coefficients of the equation: shorten the names of the variables and adjust the width of the columns if necessary, choose a number of significant figures that is easy to read and understand, and delete the last 2 columns (both the values and the table markup).

    Transfer the data from the "Residual Output" table to the prepared columns of the source table (using "Paste Special"), then delete the "Residual Output" table.

    Enter the resulting estimates of the coefficients in the source table.

    Move the results tables as close to the top of the page as possible.

    Below the tables, build charts of Yexp, Ycalc and the forecast errors (residuals).

    Format the residual plots. Based on the obtained plots, evaluate the correctness of the model with respect to the inputs X1, X2.

    Print the results of the regression analysis.

    Interpret the results of the regression analysis.

    Prepare a report on the work.

WORK EXAMPLE

The method of performing regression analysis in the EXCEL package is shown in Figures 3-5.

Fig. 3. An example of regression analysis in the EXCEL package.

Fig. 4. Plots of residuals for the variables X1, X2.

Fig. 5. Plots of Yexp, Ycalc and forecast errors (residuals).

According to the regression analysis, we can say:

1. The regression equation obtained using Excel has the form:

    Coefficient of determination:

The variation of the factors explains 46.5% of the variation of the result.

    The overall F-test tests the hypothesis about the statistical significance of the regression equation. The analysis is performed by comparing the actual and tabular values of Fisher's F-test.

Since the actual value exceeds the tabular one, we conclude that the resulting regression equation is statistically significant.

    Multiple correlation coefficient:

    Determine the confidence interval for the coefficient b 0 :

t table (29; 0.975) = 2.05

Significance check for coefficient b 0 :

Confidence interval:

    Determine the confidence interval for the coefficient b 1 :

Significance check for coefficient b 1 :

t calc > t table, so coefficient b 1 is significant

Confidence interval:

    Determine the confidence interval for the coefficient b 2 :

Significance check for coefficient b 2 :

Determine the confidence interval:

ASSIGNMENT OPTIONS

Table 2. Task options

Rows of the table: option number; result attribute Y i; numbers of the two factors X i.

Result attribute Y i, in option order: Y 1 (10 options), Y 2 (10 options), Y 3 (10 options).

Table 3. Initial data

Columns of the table: Y 1 , Y 2 , Y 3 , X 1 , X 2 , X 3 , X 4 , X 5 .

QUESTIONS FOR SELF-CHECKING

    Problems of regression analysis.

    Prerequisites for regression analysis.

    The basic equation of analysis of variance.

    What does Fisher's F-ratio show?

    How is the tabular value of Fisher's test determined?

    What does the coefficient of determination show?

    How is the significance of the regression coefficients determined?

    How is the confidence interval of the regression coefficients determined?

    How is the calculated value of the t-test determined?

    How is the tabular value of the t-test determined?

    Formulate the main idea of analysis of variance. For which tasks is it most effective?

    What are the main theoretical premises of analysis of variance?

    Decompose the total sum of squared deviations into its components in analysis of variance.

    How are variance estimates obtained from sums of squared deviations?

    How are the required degrees of freedom obtained?

    How is the standard error determined?

    Explain the scheme of two-way analysis of variance.

    How does cross-classification differ from hierarchical classification?

    What distinguishes balanced data?

The report is drawn up in the Word text editor on A4 paper, GOST 6656-76 (210x297 mm), and contains:

    Name of the laboratory work.

    Objective.

    Calculation results.

TIME ALLOWED FOR COMPLETING THE LABORATORY WORK

Preparation for work - 0.5 acad. hours.

Performance of work - 0.5 acad. hours.

Computer calculations - 0.5 acad. hours.

Write-up of the work - 0.5 acad. hours.



    You can check the significance of the parameters of the regression equation using t-statistics.

    Exercise:
    For a group of enterprises producing the same type of product, the following cost functions are considered:
    y = α + βx;
    y = α·x^β;
    y = α·β^x;
    y = α + β / x;
    where y is the cost of production, thousand c.u.;
    x is the output, thousand units.

    Required:
    1. Build the paired regression equations of y on x:

    • linear;
    • power;
    • exponential;
    • equilateral hyperbola.
    2. Calculate the linear pair correlation coefficient and the coefficient of determination. Draw conclusions.
    3. Assess the statistical significance of the regression equation as a whole.
    4. Assess the statistical significance of the regression and correlation parameters.
    5. Forecast production costs for a forecast output of 195% of the average level.
    6. Assess the accuracy of the forecast; calculate the forecast error and its confidence interval.
    7. Evaluate the model through the mean approximation error.
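The four cost functions in the exercise can all be fitted with ordinary least squares after a suitable change of variables. The sketch below shows one common way to do this in Python; the function names and the test data are hypothetical, and the power and exponential fits minimize the error in logarithms of y rather than in y itself, which is an assumption of this linearization approach.

```python
import math


def fit_line(u, v):
    """Least-squares (intercept, slope) of v = c0 + c1*u."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    c1 = (sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
          / sum((ui - mean_u) ** 2 for ui in u))
    return mean_v - c1 * mean_u, c1


def fit_linear(x, y):
    """y = alpha + beta*x"""
    return fit_line(x, y)


def fit_power(x, y):
    """y = alpha * x**beta, via ln y = ln alpha + beta * ln x"""
    c0, c1 = fit_line([math.log(xi) for xi in x], [math.log(yi) for yi in y])
    return math.exp(c0), c1


def fit_exponential(x, y):
    """y = alpha * beta**x, via ln y = ln alpha + x * ln beta"""
    c0, c1 = fit_line(x, [math.log(yi) for yi in y])
    return math.exp(c0), math.exp(c1)


def fit_hyperbola(x, y):
    """y = alpha + beta / x, via the substitution z = 1 / x"""
    return fit_line([1 / xi for xi in x], y)


# Hypothetical data generated from y = 3 * x**0.5, so the power fit should
# recover alpha close to 3 and beta close to 0.5.
x = [1, 2, 4, 8]
y = [3 * xi ** 0.5 for xi in x]
print(fit_power(x, y))
```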

    Solution:

    1. The equation has the form y = α + βx
    1. Parameters of the regression equation.
    Means

    Variance

    Standard deviation

    Correlation coefficient

    The relationship between the attribute Y and the factor X is strong and direct
    Regression equation

    Coefficient of determination
    R² = 0.94² = 0.89, i.e. 88.9774% of the variation in y is explained by the variation in x. In other words, the fit of the regression equation is good

    x      y      x²      y²       x·y      y(x)     (y-ȳ)²     (y-y(x))²   (x-x̄)²
    78     133    6084    17689    10374    142.16   115.98     83.83       1
    82     148    6724    21904    12136    148.61   17.9       0.37        9
    87     134    7569    17956    11658    156.68   95.44      514.26      64
    79     154    6241    23716    12166    143.77   104.67     104.67      0
    89     162    7921    26244    14418    159.9    332.36     4.39        100
    106    195    11236   38025    20670    187.33   2624.59    58.76       729
    67     139    4489    19321    9313     124.41   22.75      212.95      144
    88     158    7744    24964    13904    158.29   202.51     0.08        81
    73     152    5329    23104    11096    134.09   67.75      320.84      36
    87     162    7569    26244    14094    156.68   332.36     28.33       64
    76     159    5776    25281    12084    138.93   231.98     402.86      9
    115    173    13225   29929    19895    201.86   854.44     832.66      1296
    0      0      0       0        0        16.3     20669.59   265.73      6241
    1027   1869   89907   294377   161808   1869     25672.31   2829.74     8774

    Note: the y(x) values are found from the resulting regression equation, for example:
    y(78) = 1.61352*78 + 16.3014 ≈ 142.16
    y(82) = 1.61352*82 + 16.3014 ≈ 148.61
    ... ... ...
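The parameters behind the y(x) column can be recovered from the x and y columns of the table with a short least-squares sketch in Python. It assumes that all 13 rows behind the totals row are used, including the all-zero row; variable names are illustrative.

```python
import math

# x and y columns of the table, including the all-zero row counted in the totals
x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115, 0]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173, 0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx                    # regression coefficient
intercept = mean_y - slope * mean_x  # free term
r = sxy / math.sqrt(sxx * syy)       # linear pair correlation coefficient
r2 = r ** 2                          # coefficient of determination

print(slope, intercept, r, r2)
print(intercept + slope * 78)        # theoretical value for x = 78
```

Under this assumption the fit reproduces the table: the theoretical value for x = 78 agrees with the 142.16 in the y(x) column, and the coefficient of determination agrees with the 88.9774% quoted in the solution.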

    2. Estimating the parameters of the regression equation
    Significance of the correlation coefficient

    Using Student's table, we find Ttable
    Ttable(n-m-1; α/2) = (11; 0.05/2) = 1.796
    Since Tobs > Ttable, we reject the hypothesis that the correlation coefficient is equal to 0. In other words, the correlation coefficient is statistically significant.

    Analysis of the accuracy of determining estimates of regression coefficients





    Sa = 0.1712
    Confidence intervals for the dependent variable

    Let us calculate the boundaries of the interval in which 95% of the possible values of Y will be concentrated for an unlimitedly large number of observations and X = 1
    (-20.41;56.24)
    Testing hypotheses about the coefficients of the linear regression equation
    1) t-statistic

    The statistical significance of the regression coefficient a is confirmed

    The statistical significance of the regression coefficient b is not confirmed
    Confidence intervals for the coefficients of the regression equation
    Let us determine the confidence intervals of the regression coefficients, which with 95% reliability will be as follows:
    (a - t·S a ; a + t·S a)
    (1.306;1.921)
    (b - t·S b ; b + t·S b)
    (-9.2733;41.876)
    where t = 1.796
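The standard errors and confidence intervals quoted above can be reproduced with the following Python sketch. It assumes the usual least-squares standard error formulas, uses all 13 rows behind the table's totals row (including the all-zero row), and takes the tabular value t = 1.796 exactly as quoted in the text; variable names are illustrative.

```python
import math

x = [78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115, 0]
y = [133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173, 0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx                    # coefficient a in the text's notation
intercept = mean_y - slope * mean_x  # coefficient b in the text's notation

ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
s2 = ss_resid / (n - 2)              # residual variance per degree of freedom
se_slope = math.sqrt(s2 / sxx)       # S a
se_intercept = math.sqrt(s2 * sum(xi ** 2 for xi in x) / (n * sxx))  # S b

t_table = 1.796                      # tabular value quoted in the text
t_slope = abs(slope) / se_slope
t_intercept = abs(intercept) / se_intercept
ci_slope = (slope - t_table * se_slope, slope + t_table * se_slope)
ci_intercept = (intercept - t_table * se_intercept,
                intercept + t_table * se_intercept)

print(se_slope, t_slope, ci_slope)
print(se_intercept, t_intercept, ci_intercept)
```

With these inputs the slope's t-statistic exceeds the tabular value while the free term's does not, which matches the two conclusions stated in the text.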
    2) F-statistics


    Fkp = 4.84
    Since F > Fkp, the coefficient of determination is statistically significant
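As a worked check, the F-statistic can be recovered from the coefficient of determination quoted earlier in the solution (88.9774%) and the 11 residual degrees of freedom used for the tabular values. The relation below is the standard one for a one-factor model and is an assumption of this sketch, since the source formula was not preserved:

```latex
F = \frac{R^2}{1 - R^2}\,(n - m - 1)
  = \frac{0.889774}{1 - 0.889774} \cdot 11
  \approx 88.8 > F_{kp} = 4.84
```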