Regression Analysis

Methods of regression analysis. Regression analysis is a statistical method for studying how a random variable depends on one or more other variables.

When a correlation exists between a factor feature and a resultant feature, doctors often need to determine by how much the value of one feature changes when the other changes by one unit of measurement (a generally accepted unit or one established by the researcher).

For example, how will the body weight of first-grade schoolchildren (girls or boys) change if their height increases by 1 cm? The regression analysis method is used for such purposes.

Most often, the regression analysis method is used to develop normative scales and standards for physical development.

  1. Definition of regression. Regression is a function that allows, based on the average value of one attribute, to determine the average value of another attribute that is correlated with the first one.

    For this purpose, the regression coefficient and a number of other parameters are used. For example, you can calculate the number of colds on average for certain values ​​of the average monthly air temperature in the autumn-winter period.

  2. Definition of the regression coefficient. The regression coefficient is the absolute value by which the value of one attribute changes on average when another attribute associated with it changes by the established unit of measurement.
  3. Regression coefficient formula: R y/x = r xy × (σ y / σ x),
    where R y/x is the regression coefficient;
    r xy is the correlation coefficient between features x and y;
    σ y and σ x are the standard deviations of features y and x.

    In our example:
    r xy = -0.96 (correlation coefficient between air temperature and the number of colds);
    σ x = 4.6 (standard deviation of air temperature in the autumn-winter period);
    σ y = 8.65 (standard deviation of the number of infectious colds).
    Thus, R y/x = -0.96 × (8.65 / 4.6) ≈ -1.8, i.e. when the average monthly air temperature (x) decreases by 1 degree, the average number of infectious colds (y) in the autumn-winter period will increase by 1.8 cases.
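The calculation in point 3 can be sketched in a few lines (a minimal illustration; function and variable names are my own, values are taken from the example above):

```python
def regression_coefficient(r_xy, sigma_y, sigma_x):
    """R y/x = r_xy * (sigma_y / sigma_x): average change in y
    when x changes by one unit of measurement."""
    return r_xy * (sigma_y / sigma_x)

# Values from the colds-vs-temperature example in the text
R_yx = regression_coefficient(r_xy=-0.96, sigma_y=8.65, sigma_x=4.6)
print(round(R_yx, 1))  # -1.8 (magnitude 1.8; the sign reflects the inverse relationship)
```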

  4. Regression equation: y = M y + R y/x × (x - M x),
    where y is the average value of the feature to be determined when the average value of the other feature (x) changes;
    x is the known average value of the other feature;
    R y/x is the regression coefficient;
    M x and M y are the known average values of features x and y.

    For example, the average number of infectious colds (y) can be determined without special measurements for any average monthly air temperature (x). So, if x = -9°, R y/x = -1.8 cases, M x = -7°, and M y = 20 cases, then y = 20 + (-1.8) × (-9 - (-7)) = 20 + 3.6 = 23.6 cases.
    This equation is applied in the case of a straight-line relationship between two features (x and y).
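A sketch of the regression equation from point 4 (names are illustrative; the signed coefficient -1.8 is used, which reproduces the result in the text):

```python
def predict(x, M_x, M_y, R_yx):
    """Regression equation: y = M_y + R_y/x * (x - M_x)."""
    return M_y + R_yx * (x - M_x)

# Example values from the text: M_x = -7 deg, M_y = 20 cases
y = predict(x=-9, M_x=-7, M_y=20, R_yx=-1.8)
print(round(y, 1))  # 23.6
```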

  5. Purpose of the regression equation. The regression equation is used to plot the regression line. The latter allows, without special measurements, to determine any average value (y) of one attribute, if the value (x) of another attribute changes. Based on these data, a graph is built - regression line, which can be used to determine the average number of colds at any value of the average monthly temperature within the range between the calculated values ​​of the number of colds.
  6. Regression sigma (formula): σ Ry/x = σ y × √(1 - r xy²),
    where σ Ry/x is the sigma (standard deviation) of the regression;
    σ y is the standard deviation of the feature y;
    r xy is the correlation coefficient between features x and y.

    So, if σ y (the standard deviation of the number of colds) = 8.65 and r xy (the correlation coefficient between the number of colds (y) and the average monthly air temperature in the autumn-winter period (x)) = -0.96, then σ Ry/x = 8.65 × √(1 - (-0.96)²) = 8.65 × 0.28 = 2.42 cases.

  7. Purpose of the regression sigma. It characterizes the measure of diversity of the resultant feature (y).

    For example, it characterizes the diversity of the number of colds at a given average monthly air temperature in the autumn-winter period. So, at x 1 = -6° the average number of colds can range from 15.78 to 20.62 cases.
    At x 2 = -9°, the average number of colds can range from 21.18 to 26.02 cases, and so on.

    The regression sigma is used in the construction of a regression scale, which reflects the deviation of the values ​​of the effective attribute from its average value plotted on the regression line.
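The regression sigma and the resulting ranges can be checked with a short sketch (names are illustrative; 18.2 is the mean implied by the 15.78-20.62 range quoted in the text):

```python
import math

def regression_sigma(sigma_y, r_xy):
    """sigma_Ry/x = sigma_y * sqrt(1 - r_xy**2)."""
    return sigma_y * math.sqrt(1 - r_xy ** 2)

s = regression_sigma(sigma_y=8.65, r_xy=-0.96)
print(round(s, 2))                             # 2.42
print(round(18.2 - s, 2), round(18.2 + s, 2))  # 15.78 20.62
```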

  8. Data required to calculate and plot the regression scale
    • regression coefficient - R y/x;
    • regression equation - y = M y + R y/x (x - M x);
    • regression sigma - σ Ry/x.
  9. The sequence of calculations and graphic representation of the regression scale.
    • determine the regression coefficient using the formula (see paragraph 3). For example, determine by how much the average body weight (at a certain age, depending on sex) changes when the average height changes by 1 cm;
    • using the regression equation (see paragraph 4), determine the average body weight (y 1, y 2, y 3 ...)* for certain height values (x 1, x 2, x 3 ...).
      ________________
      * The value of "y" should be calculated for at least three known values ​​of "x".

      At the same time, the average values of body weight and height (M x and M y) for the given age and sex are known.

    • calculate the sigma of the regression, knowing the corresponding values ​​of σ y and r xy and substituting their values ​​into the formula (see paragraph 6).
    • based on the known values x 1, x 2, x 3 and the corresponding average values y 1, y 2, y 3, as well as the smallest (y - σ Ry/x) and largest (y + σ Ry/x) values of y, construct the regression scale.

      For a graphical representation of the regression scale, the values x 1, x 2, x 3 are first marked on the x-axis and the corresponding values y 1, y 2, y 3 on the y-axis, i.e. a regression line is built, for example, for the dependence of body weight (y) on height (x).

      Then, at the corresponding points y 1 , y 2 , y 3 the numerical values ​​of the regression sigma are marked, i.e. on the graph find the smallest and largest values ​​of y 1 , y 2 , y 3 .

  10. Practical use of the regression scale. Normative scales and standards are developed, in particular for physical development. Using the standard scale, one can give an individual assessment of a child's development. Physical development is assessed as harmonious if, for example, at a certain height the child's body weight lies within one regression sigma of the calculated average body weight (y) for that height (x), i.e. y ± 1 σ Ry/x.

    Physical development is considered disharmonious in terms of body weight if the child's body weight for a certain height is within the second regression sigma: (y ± 2 σ Ry/x)

    Physical development will be sharply disharmonious both due to excess and insufficient body weight if the body weight for a certain height is within the third sigma of the regression (y ± 3 σ Ry/x).

According to the results of a statistical study of the physical development of 5-year-old boys, it is known that their average height (x) is 109 cm, and their average body weight (y) is 19 kg. The correlation coefficient between height and body weight is +0.9, standard deviations are presented in the table.

Required:

  • calculate the regression coefficient;
  • using the regression equation, determine what the expected body weight of 5-year-old boys will be with a height equal to x1 = 100 cm, x2 = 110 cm, x3 = 120 cm;
  • calculate the regression sigma, build a regression scale, present the results of its solution graphically;
  • draw the appropriate conclusions.

The condition of the problem and the results of its solution are presented in the summary table.

Table 1

Conditions of the problem:
  Height (x): M = 109 cm; σ = ±4.4 cm
  Body weight (y): M = 19 kg; σ = ±0.8 kg
  Correlation coefficient: r xy = +0.9

Results of the solution:
  Regression coefficient: R y/x = 0.16
  Regression sigma: σ Ry/x = ±0.35 kg
  Regression scale (expected body weight, kg):
    x = 100 cm: y = 17.56; y - σ Ry/x = 17.21; y + σ Ry/x = 17.91
    x = 110 cm: y = 19.16; y - σ Ry/x = 18.81; y + σ Ry/x = 19.51
    x = 120 cm: y = 20.76; y - σ Ry/x = 20.41; y + σ Ry/x = 21.11

Solution.
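The entire solution of the boys' height-weight problem can be reproduced with a short script (a sketch; the regression coefficient is rounded to 0.16 as in the table, so the figures match exactly):

```python
import math

# Problem data for 5-year-old boys (from the text)
M_x, M_y = 109, 19            # mean height (cm) and mean body weight (kg)
sigma_x, sigma_y = 4.4, 0.8
r_xy = 0.9

R_yx = round(r_xy * sigma_y / sigma_x, 2)       # regression coefficient, 0.16
sigma_R = sigma_y * math.sqrt(1 - r_xy ** 2)    # regression sigma, ~0.35 kg

scale = {}
for x in (100, 110, 120):
    y = M_y + R_yx * (x - M_x)                  # expected weight at height x
    scale[x] = (round(y, 2), round(y - sigma_R, 2), round(y + sigma_R, 2))

print(scale[100])  # (17.56, 17.21, 17.91)
print(scale[110])  # (19.16, 18.81, 19.51)
print(scale[120])  # (20.76, 20.41, 21.11)
```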

Conclusion. Thus, within the range of calculated body-weight values, the regression scale allows one to determine body weight for any other height value, or to assess a child's individual development. To do this, a perpendicular is erected from the height value to the regression line.


A) Graphical analysis of a simple linear regression.

Simple linear regression equation: y = a + bx. If there is a correlation between the random variables Y and X, then y = ý + ε,

where ý is the theoretical value of y obtained from the equation ý = f(x),

ε is the deviation (error) of the theoretical equation ý from the actual (experimental) data.

The equation for the dependence of the average value ý on x, i.e. ý = f(x), is called the regression equation. Regression analysis consists of four stages:

1) setting the task and establishing the reasons for the connection.

2) limitation of the object of research, collection of statistical information.

3) selection of the link equation based on the analysis and nature of the collected data.

4) calculation of numerical values, characteristics of correlation.

If two variables are related so that a change in one corresponds to a systematic change in the other, then regression analysis is used to estimate and select the equation relating them, once data on these variables are available. Unlike regression analysis, correlation analysis is used to assess the closeness (tightness) of the relationship between X and Y.

Consider finding a straight line in regression analysis:

Theoretical regression equation.

The term "simple regression" indicates that the value of one variable is estimated from knowledge of a single other variable. In contrast, multiple (multivariate) regression estimates a variable from knowledge of two, three, or more variables. Consider a graphical analysis of a simple linear regression.

Let's assume that we have the results of pre-employment and labor productivity screening tests.

Table: selection test results (out of 100 points), x; labor productivity (out of 20 points), y.

Plotting the points on a graph, we obtain a scatter diagram (scatter field). We use it to analyze the relationship between the selection test results and labor productivity.

Let's analyze the regression line using the scatterplot. In regression analysis, at least two variables are always specified, and a systematic change in one variable is associated with a change in the other. The primary goal of regression analysis is to estimate the value of one variable when the value of the other is known. For our task, what matters is the estimate of labor productivity.

The independent variable in regression analysis is the quantity used as the basis for analyzing the other variable; in this case, it is the selection test results (along the X axis).

The dependent variable is the estimated quantity (along the Y axis). In regression analysis, there can be only one dependent variable but several independent variables.

For a simple regression analysis, the dependence can be represented in a two-coordinate system (x and y): the independent variable along the x-axis, the dependent variable along the y-axis. The points are plotted so that each pair of values appears on the graph. The graph is called a scatterplot. Its construction is the second stage of regression analysis, the first being the choice of the analyzed quantities and the collection of sample data. Thus, regression analysis is applied for statistical analysis. In our example, the relationship between the sample data on the chart is linear.

To estimate the value of the variable y from the variable x, it is necessary to determine the position of the line that best represents the relationship between x and y based on the location of the scatterplot points; in our example, this is the productivity analysis. The line drawn through the scatter points is the regression line. One way to build a regression line is the freehand method, based on visual judgment; our regression line can then be used to estimate labor productivity. When finding the equation of the regression line, the least squares method is usually used: the most suitable line is the one for which the sum of squared deviations is minimal.

The mathematical equation of a straight line expresses a law of change in arithmetic progression:

y = a + bx.

The reduced equation with one parameter, y = a, is the simplest form of the relationship equation and is acceptable only for average values. To better express the relationship between X and Y, the proportionality coefficient b is introduced, which indicates the slope of the regression line.

B) Construction of a theoretical regression line.

The process of finding it consists in choosing and justifying the type of curve and calculating the parameters a, b, c, etc. The construction process is called fitting (leveling), and the stock of curves offered by mathematical analysis is varied. Most often in economic problems, a family of curves is used whose equations are expressed by polynomials of positive integer degrees.

1) ý = a + bx - the equation of a straight line,

2) ý = a + b/x - the hyperbola equation,

3) ý = a + bx + cx² - the parabola equation,

where ý are the ordinates of the theoretical regression line.

Having chosen the type of equation, it is necessary to find the parameters on which this equation depends. For example, the nature of the location of points in the scatter field showed that the theoretical regression line is straight.

The scatterplot allows you to represent labor productivity using regression analysis. In economics, regression analysis predicts many characteristics that affect the final product (taking into account pricing).

C) The least squares criterion for finding a straight line.

One of the criteria we could apply for a suitable regression line in a scatterplot is based on choosing a line for which the sum of the squared errors will be minimal.

The proximity of the scatter points to the straight line is measured by the ordinates of the segments. The deviations of these points can be positive or negative, but the sum of the squared deviations of the theoretical line from the experimental data is always positive and should be minimal. The fact that not all scatter points coincide with the regression line indicates a discrepancy between the experimental and theoretical data. Thus, no regression line other than the one found can give a smaller sum of squared deviations between the experimental and theoretical data. Therefore, having found the theoretical equation ý and the regression line, we satisfy the least squares requirement.

This is done using the relationship equation ý = a + bx and formulas for finding the parameters a and b. Taking the theoretical value ý = a + bx and denoting the sum of squared deviations by f, we get the function f(a, b) = Σ(y - a - bx)² of the unknown parameters a and b. The values of a and b satisfy the minimum of the function f and are found from the partial derivative equations ∂f/∂a = 0 and ∂f/∂b = 0. This is a necessary condition; however, for a positive quadratic function it is also a sufficient condition for finding a and b.

From the partial derivative equations we derive the formulas for the parameters a and b as the system of normal equations:

na + bΣx = Σy,
aΣx + bΣx² = Σxy,

from which b = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) and a = ȳ - bx̄, where x̄ and ȳ are the arithmetic means of x and y.

Substituting numerical values, we find the parameters a and b.
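The normal equations can be solved directly; a minimal sketch (illustrative data, not from the text):

```python
def fit_line(xs, ys):
    """Least squares fit of y = a + b*x via the normal equations:
         n*a + b*sum(x)        = sum(y)
         a*sum(x) + b*sum(x^2) = sum(x*y)
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points lying on y = 1 + 2x
print(a, b)  # 1.0 2.0
```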

There is also the approximation coefficient e = (1/n) Σ(|y - ý| / y) × 100%.

If e < 33%, the model is acceptable for further analysis;

if e > 33%, then we take a hyperbola, a parabola, etc. This gives the right to analyze various situations.
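A sketch of the approximation-coefficient check (the 33% threshold is from the text; the data and names are illustrative):

```python
def approximation_error(ys, ys_hat):
    """Average relative approximation error, in percent:
       e = (1/n) * sum(|y - y_hat| / y) * 100"""
    n = len(ys)
    return 100.0 / n * sum(abs(y - yh) / y for y, yh in zip(ys, ys_hat))

e = approximation_error([10, 20, 30], [11, 19, 33])
print(round(e, 1))  # 8.3
print(e < 33)       # True: the model is acceptable for further analysis
```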

Conclusion: by the criterion of the approximation coefficient, the most suitable line is the one for which e is minimal, and no other regression line for our problem gives a smaller sum of deviations.

D) Quadratic error of the estimate; checking the typicality of the parameters.

For a population with fewer than 30 observations (n < 30), Student's t-test is used to check the typicality (significance) of the regression equation parameters. The actual values of the t-criterion are calculated:

t a = a / m a,  t b = b / m b,

where m a and m b are the residual root-mean-square (standard) errors of the parameters a and b. The obtained t a and t b are compared with the critical value t k from Student's table, taking into account the accepted significance level (α = 0.01, i.e. 99%, or α = 0.05, i.e. 95%) and the degrees of freedom: k 1 = m is the number of parameters of the equation under study (for example, for y = a + bx, m = 2), and k 2 = n - (m + 1), where n is the number of studied features.

The parameters are considered typical (significant) if their t values exceed t k.

Conclusion: using the regression equation parameters checked for typicality, a mathematical model of the relationship ý = f(x) is constructed. The parameters of the mathematical function used in the analysis (straight line, hyperbola, parabola) receive their corresponding quantitative values. The semantic content of models obtained in this way is that they characterize the average value of the resultant feature ý as a function of the factor feature x.

E) Curvilinear regression.

Quite often a curvilinear relationship occurs, in which the relationship between the variables changes. The intensity of the increase (decrease) depends on the level of X. A curvilinear dependence can be of different types. For example, consider the relationship between yield and rainfall. With increasing precipitation under otherwise equal natural conditions, the yield increases intensively, but only up to a certain limit. Beyond the critical point, rainfall becomes excessive and the yield drops sharply. The example shows that at first the relationship was positive and then negative. The critical point is the optimal level of feature X, corresponding to the maximum or minimum value of feature Y.

In economics, such a relationship is observed between price and consumption, productivity and length of service.

Parabolic dependence.

If the data show that an increase in the factor feature leads to an increase in the resultant feature up to a maximum (or minimum), then a second-order equation (a parabola) is taken as the regression equation:

ý = a + bx + cx².

The coefficients a, b, c are found from the partial derivative equations ∂f/∂a = ∂f/∂b = ∂f/∂c = 0, which give the system of normal equations:

na + bΣx + cΣx² = Σy,
aΣx + bΣx² + cΣx³ = Σxy,
aΣx² + bΣx³ + cΣx⁴ = Σx²y.

Types of curvilinear equations include, for example, the parabola ý = a + bx + cx² and the hyperbola ý = a + b/x.
It is reasonable to assume a curvilinear relationship between labor productivity and selection test scores: as the scores grow, productivity may begin to decrease at some level, so the straight-line model may turn out to be curvilinear.

The third model will be a hyperbola: in its equation, the variable x is replaced by the expression 1/x.

In the previous notes, the focus has often been on a single numerical variable, such as mutual fund returns, Web page load time, or soft drink consumption. In this and the following notes, we will consider methods for predicting the values ​​of a numeric variable depending on the values ​​of one or more other numeric variables.

The material will be illustrated with a through example. Forecasting sales volume in a clothing store. The Sunflowers chain of discount clothing stores has been constantly expanding for 25 years. However, the company does not currently have a systematic approach to selecting new outlets. The location where the company intends to open a new store is determined based on subjective considerations. The selection criteria are favorable rental conditions or the manager's idea of ​​the ideal location of the store. Imagine that you are the head of the Special Projects and Planning Department. You have been tasked with developing a strategic plan for opening new stores. This plan should contain a forecast of annual sales in newly opened stores. You believe that selling space is directly related to revenue and want to factor that fact into your decision making process. How do you develop a statistical model that predicts annual sales based on new store size?

Typically, regression analysis is used to predict the values of a variable. Its goal is to develop a statistical model that predicts the values of the dependent variable, or response, from the values of at least one independent, or explanatory, variable. In this note, we consider simple linear regression, a statistical method for predicting the values of the dependent variable Y from the values of the independent variable X. Subsequent notes will describe the multiple regression model, designed to predict the values of the dependent variable Y from the values of several independent variables (X 1, X 2, …, X k).


Types of regression models

(10) D = Σ(e i - e i-1)² / Σ e i² ≈ 2(1 - ρ 1),

where ρ 1 is the autocorrelation coefficient of the residuals; if ρ 1 = 0 (no autocorrelation), D ≈ 2; if ρ 1 ≈ 1 (positive autocorrelation), D ≈ 0; if ρ 1 = -1 (negative autocorrelation), D ≈ 4.

In practice, the Durbin-Watson criterion is applied by comparing the value of D with the critical theoretical values d L and d U for a given number of observations n, number of independent variables k (for simple linear regression k = 1), and significance level α. If D < d L, the hypothesis of independence of the random deviations is rejected (hence, there is positive autocorrelation); if D > d U, the hypothesis is not rejected (that is, there is no autocorrelation); if d L < D < d U, there is not enough evidence to decide. When the calculated value of D exceeds 2, it is the expression (4 - D), rather than D itself, that is compared with d L and d U.

To calculate the Durbin-Watson statistic in Excel, we turn to the bottom table in Fig. 14, Residual output. The numerator in expression (10) is calculated with the function =SUMXMY2(array1, array2), and the denominator with =SUMSQ(array) (Fig. 16).

Fig. 16. Formulas for calculating the Durbin-Watson statistic

In our example, D = 0.883. The main question is: what value of the Durbin-Watson statistic should be considered small enough to conclude that positive autocorrelation exists? It is necessary to compare the value of D with the critical values d L and d U, which depend on the number of observations n and the significance level α (Fig. 17).

Fig. 17. Critical values of the Durbin-Watson statistic (table fragment)

Thus, in the problem of the sales volume of a store delivering goods to homes, there is one independent variable (k = 1), 15 observations (n = 15), and a significance level α = 0.05. Hence, d L = 1.08 and d U = 1.36. Since D = 0.883 < d L = 1.08, there is positive autocorrelation between the residuals, and the least squares method cannot be applied.
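The Durbin-Watson statistic itself is easy to compute from the residuals; a sketch with illustrative residuals (not the store data):

```python
def durbin_watson(e):
    """D = sum((e_i - e_{i-1})^2) / sum(e_i^2); D near 2 means no autocorrelation,
    near 0 positive autocorrelation, near 4 negative autocorrelation."""
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    den = sum(x * x for x in e)
    return num / den

# Slowly drifting residuals -> strong positive autocorrelation, D near 0
d = durbin_watson([1.0, 0.9, 0.8, 0.7, 0.6, 0.5])
print(round(d, 3))  # 0.014
```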

Testing Hypotheses about Slope and Correlation Coefficient

The above regression was applied solely for forecasting. The least squares method was used to determine the regression coefficients and to predict the value of the variable Y for a given value of the variable X. In addition, we considered the standard error of the estimate and the coefficient of mixed correlation (determination). If the residual analysis confirms that the applicability conditions of the least squares method are not violated and the simple linear regression model is adequate, then, based on the sample data, it can be argued that a linear relationship exists between the variables in the population.

Application of the t-test for the slope. By testing whether the population slope β 1 equals zero, one can determine whether there is a statistically significant relationship between the variables X and Y. If this hypothesis is rejected, it can be argued that a linear relationship exists between X and Y. The null and alternative hypotheses are formulated as follows: H 0: β 1 = 0 (no linear relationship), H 1: β 1 ≠ 0 (there is a linear relationship). By definition, the t-statistic equals the difference between the sample slope and the hypothetical population slope, divided by the standard error of the slope estimate:

(11) t = (b 1 - β 1) / S b1

where b 1 is the slope of the regression line based on the sample data, β 1 is the hypothetical slope of the population regression line, S b1 is the standard error of the slope estimate, and the test statistic t has a t-distribution with n - 2 degrees of freedom.

Let's check whether there is a statistically significant relationship between store size and annual sales at α = 0.05. The t-test results are displayed along with other parameters by the Analysis ToolPak (Regression option). The full Analysis ToolPak results are shown in Fig. 4; the fragment related to the t-statistic is in Fig. 18.

Fig. 18. Results of applying the t-test

Since the number of stores is n = 14 (see Fig. 3), the critical values of the t-statistic at significance level α = 0.05 can be found from the formulas: t L = T.INV(0.025, 12) = -2.1788, where 0.025 is half the significance level and 12 = n - 2; t U = T.INV(0.975, 12) = +2.1788.

Since the t-statistic = 10.64 > t U = 2.1788 (Fig. 19), the null hypothesis H 0 is rejected. On the other hand, the p-value for t = 10.6411, calculated by the formula =1-T.DIST(D3, 12, TRUE), is approximately zero, so the hypothesis H 0 is rejected again. The fact that the p-value is almost zero means that if there were no real linear relationship between store size and annual sales, it would be almost impossible to detect it using linear regression. Therefore, there is a statistically significant linear relationship between average annual store sales and store size.

Fig. 19. Testing the hypothesis about the population slope at a significance level of 0.05 with 12 degrees of freedom
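The t-statistic for the slope can be verified directly from the numbers in the text (b 1 = 1.670, S b1 = 0.157, hypothesized β 1 = 0):

```python
# t = (b1 - beta1) / S_b1 with the hypothesized beta1 = 0
b1, s_b1 = 1.670, 0.157
t = b1 / s_b1
print(round(t, 2))           # 10.64
# Compare with the critical value 2.1788 at alpha = 0.05, df = n - 2 = 12
print(t > 2.1788)            # True: reject H0
```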

Application of the F-test for the slope. An alternative approach to testing hypotheses about the slope of a simple linear regression is to use the F-test. Recall that the F-test is used to test the ratio of two variances (see details elsewhere). When testing the slope hypothesis, the measure of random errors is the error variance (the sum of squared errors divided by the number of degrees of freedom), so the F-test uses the ratio of the variance explained by the regression (i.e., SSR divided by the number of independent variables k) to the error variance (MSE = S YX ²).

By definition, the F-statistic equals the mean square due to regression (MSR) divided by the error variance (MSE): F = MSR / MSE, where MSR = SSR / k, MSE = SSE / (n - k - 1), and k is the number of independent variables in the regression model. The test statistic F has an F-distribution with k and n - k - 1 degrees of freedom.

For a given significance level α, the decision rule is formulated as follows: if F > F U, the null hypothesis is rejected; otherwise, it is not rejected. The results, presented as a summary analysis-of-variance table, are shown in Fig. 20.

Fig. 20. Analysis-of-variance table for testing the hypothesis of the statistical significance of the regression coefficient

Similarly to the t-test, the F-test is displayed in a table by the Analysis ToolPak (Regression option). The full Analysis ToolPak results are shown in Fig. 4; the fragment related to the F-statistic is in Fig. 21.

Fig. 21. Results of applying the F-test, obtained using the Excel Analysis ToolPak

The F-statistic is 113.23 and the p-value is close to zero (the Significance F cell). If the significance level α is 0.05, the critical value of the F-distribution with one and 12 degrees of freedom can be obtained from the formula F U = F.INV(1-0.05, 1, 12) = 4.7472 (Fig. 22). Since F = 113.23 > F U = 4.7472 and the p-value ≈ 0 < 0.05, the null hypothesis H 0 is rejected, i.e. the size of a store is closely related to its annual sales volume.

Fig. 22. Testing the hypothesis about the population slope at a significance level of 0.05, with one and 12 degrees of freedom

Confidence interval containing the slope β 1. To test the hypothesis of a linear relationship between the variables, one can build a confidence interval containing the slope β 1 and check whether the hypothetical value β 1 = 0 belongs to this interval. The center of the confidence interval containing the slope β 1 is the sample slope b 1, and its boundaries are the quantities b 1 ± t n-2 S b1.

As shown in Fig. 18, b 1 = +1.670, n = 14, S b1 = 0.157. t 12 = T.INV(0.975, 12) = 2.1788. Hence, b 1 ± t n-2 S b1 = +1.670 ± 2.1788 × 0.157 = +1.670 ± 0.342, or +1.328 ≤ β 1 ≤ +2.012. Thus, with probability 0.95 the population slope lies in the range from +1.328 to +2.012 (i.e., from $1,328,000 to $2,012,000). Because these values are greater than zero, there is a statistically significant linear relationship between annual sales and store area. If the confidence interval contained zero, there would be no relationship between the variables. In addition, the confidence interval means that each additional 1,000 sq. feet of area increases average sales by $1,328,000 to $2,012,000.
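The slope confidence interval can be recomputed from the same numbers:

```python
b1, s_b1, t_crit = 1.670, 0.157, 2.1788   # t_crit = T.INV(0.975, 12), from the text
half = t_crit * s_b1
lo, hi = round(b1 - half, 3), round(b1 + half, 3)
print(lo, hi)  # 1.328 2.012
```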

Using the t-test for the correlation coefficient. Earlier, the correlation coefficient r was introduced as a measure of the relationship between two numeric variables. It can be used to determine whether a statistically significant relationship exists between two variables. Let us denote the population correlation coefficient by the symbol ρ. The null and alternative hypotheses are formulated as follows: H 0: ρ = 0 (no correlation), H 1: ρ ≠ 0 (there is a correlation). The test for the existence of a correlation is:

t = r √(n - 2) / √(1 - r²),

where r = +√r², if b 1 > 0, and r = -√r², if b 1 < 0. The test statistic t has a t-distribution with n - 2 degrees of freedom.

In the problem of the Sunflowers store chain, r² = 0.904 and b 1 = +1.670 (see Fig. 4). Since b 1 > 0, the correlation coefficient between annual sales and store size is r = +√0.904 = +0.951. Let's test the null hypothesis that there is no correlation between these variables using the t-statistic:

At a significance level of α = 0.05, the null hypothesis should be rejected, since t = 10.64 > 2.1788. Thus, it can be argued that there is a statistically significant relationship between annual sales and store size.
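The same t value (up to rounding of r) follows from the formula t = r √(n - 2) / √(1 - r²):

```python
import math

r, n = 0.951, 14
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
print(round(t, 2))  # close to the 10.64 quoted in the text; small rounding of r shifts it slightly
```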

When discussing inferences about population slopes, confidence intervals and criteria for testing hypotheses are interchangeable tools. However, the calculation of the confidence interval containing the correlation coefficient turns out to be more difficult, since the form of the sampling distribution of the statistic r depends on the true correlation coefficient.

Estimation of mathematical expectation and prediction of individual values

This section discusses methods for estimating the expected response Y and for predicting individual values of Y at given values of the variable X.

Construction of a confidence interval. In Example 2 (see the Least squares method section above), the regression equation made it possible to predict the value of the variable Y for a given value of the variable X. In the problem of choosing a location for a retail outlet, the average annual sales in a store with an area of 4,000 sq. feet equaled 7.644 million dollars. However, this estimate of the population mathematical expectation is a point estimate. Earlier, the concept of a confidence interval was proposed for estimating the population mathematical expectation. Similarly, one can introduce the concept of a confidence interval for the mathematical expectation of the response at a given value of the variable X:

Ŷi ± t(n−2) · S_YX · √hi, where hi = 1/n + (Xi − X̄)² / SSX,   (13)

where Ŷi = b0 + b1·Xi is the predicted value of the variable Y at X = Xi, S_YX is the mean square error, n is the sample size, Xi is the given value of the variable X, µY|X=Xi is the mathematical expectation of the variable Y at X = Xi, and SSX = Σ(Xi − X̄)².

Analysis of formula (13) shows that the width of the confidence interval depends on several factors. At a given level of significance, an increase in the amplitude of fluctuations around the regression line, measured by the mean square error, leads to an increase in the width of the interval. On the other hand, as expected, an increase in the sample size is accompanied by a narrowing of the interval. In addition, the width of the interval changes depending on the value Xi. If the value of the variable Y is predicted for values of X close to the mean X̄, the confidence interval turns out to be narrower than when predicting the response for values far from the mean.

Let's say that when choosing a location for a store, we want to build a 95% confidence interval for the average annual sales in all stores with an area of 4,000 sq. feet:

Therefore, the average annual sales volume in all stores with an area of 4,000 sq. feet lies, with 95% confidence, in the range from 6.971 to 8.317 million dollars.

Computing the confidence interval for a predicted value. In addition to the confidence interval for the mathematical expectation of the response at a given value of the variable X, it is often necessary to know the confidence interval for a predicted individual value. Although the formula for such an interval is very similar to formula (13), it contains a predicted value rather than an estimate of a parameter. The interval for the predicted response Y at X = Xi for a specific value of the variable Xi is determined by the formula:

Ŷi ± t(n−2) · S_YX · √(1 + hi)

Let's assume that when choosing a location for a retail outlet, we want to build a 95% confidence interval for the predicted annual sales volume in a store with an area of 4,000 sq. feet:

Therefore, the predicted annual sales volume for a store with an area of 4,000 sq. feet lies, with 95% confidence, in the range from 5.433 to 9.854 million dollars. As you can see, the confidence interval for a predicted individual response is much wider than the confidence interval for its mathematical expectation. This is because the variability in predicting individual values is much greater than in estimating the expected value.
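The two interval formulas can be compared numerically. This sketch uses made-up data (the store sample is not reproduced in the text) and takes the critical t value as an argument:

```python
import math
import statistics

def regression_intervals(xs, ys, x0, t_crit):
    """Half-widths of the confidence interval for the mean response and of
    the prediction interval for an individual response at x0, for simple
    linear regression. t_crit is the two-tailed t quantile with n - 2 df."""
    n = len(xs)
    xbar = statistics.mean(xs)
    ybar = statistics.mean(ys)
    ssx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ssx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(sse / (n - 2))            # mean square error S_YX
    h = 1 / n + (x0 - xbar) ** 2 / ssx         # the h term of formula (13)
    return t_crit * s_yx * math.sqrt(h), t_crit * s_yx * math.sqrt(1 + h)

# made-up data; 2.776 is the 95% two-tailed t quantile for 4 degrees of freedom
ci, pi = regression_intervals([1, 2, 3, 4, 5, 6],
                              [2.1, 2.9, 4.2, 4.8, 6.1, 6.9],
                              x0=4, t_crit=2.776)
print(pi > ci)        # True: the prediction interval is always wider
```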

Pitfalls and ethical issues associated with the use of regression

Difficulties associated with regression analysis:

  • Ignoring the conditions of applicability of the method of least squares.
  • An erroneous estimate of the conditions for applicability of the method of least squares.
  • Choosing the wrong alternative method when the conditions of applicability of the least squares method are violated.
  • Application of regression analysis without in-depth knowledge of the subject of study.
  • Extrapolation of the regression beyond the range of the explanatory variable.
  • Confusion between statistical and causal relationships.

The widespread use of spreadsheets and statistical software has eliminated the computational problems that once prevented the use of regression analysis. However, this has also meant that regression analysis is now applied by users who lack sufficient qualifications and knowledge. How can users know about alternative methods if many of them have no idea at all about the conditions of applicability of the least squares method and do not know how to check whether these conditions hold?

The researcher should not get carried away with number crunching, calculating the intercept, the slope and the correlation coefficient. Deeper knowledge is needed. Let us illustrate this with a classic example taken from textbooks. Anscombe showed that all four data sets shown in Fig. 23 have the same regression parameters (Fig. 24).

Fig. 23. Four artificial data sets

Fig. 24. Regression analysis of the four artificial data sets, performed with the Analysis Package (click on the image to enlarge)

So, from the point of view of regression analysis, all these data sets are completely identical. If the analysis ended there, we would lose a lot of useful information. This is evidenced by the scatter plots (Fig. 25) and residual plots (Fig. 26) constructed for these data sets.

Fig. 25. Scatter plots for the four data sets

Scatter plots and residual plots show that these data sets differ from one another. The only set distributed along a straight line is set A; the plot of the residuals calculated from set A shows no pattern. The same cannot be said of sets B, C and D. The scatter plot for set B shows a pronounced quadratic pattern; this conclusion is confirmed by the residual plot, which has a parabolic shape. The scatter plot and residual plot for set C show that it contains an outlier. In this situation, it is necessary to exclude the outlier from the data set and repeat the analysis. The technique for detecting and eliminating outliers from observations is called influence analysis. After eliminating the outlier, the result of re-estimating the model may be completely different. The scatter plot for data set D illustrates an unusual situation in which the empirical model depends heavily on a single response (X8 = 19, Y8 = 12.5). Such regression models must be calculated especially carefully. Thus, scatter and residual plots are an essential tool of regression analysis and should be an integral part of it; without them, regression analysis is not credible.
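The parabolic residual pattern described for set B is easy to reproduce with synthetic data (this is not Anscombe's actual quartet):

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept (a minimal sketch)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    return ybar - b1 * xbar, b1

xs = [1, 2, 3, 4, 5]
ys = [x * x for x in xs]              # a deliberately quadratic relationship
b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print([round(r, 1) for r in residuals])   # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals rise, fall and rise again, the parabolic shape that a residual plot would reveal at a glance.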

Fig. 26. Residual plots for the four data sets

How to avoid pitfalls in regression analysis:

  • Always start the analysis of a possible relationship between the variables X and Y with a scatter plot.
  • Before interpreting the results of a regression analysis, check the conditions for its applicability.
  • Plot the residuals versus the independent variable. This makes it possible to determine how well the empirical model fits the observations and to detect violations of the constancy of the variance.
  • Use histograms, stem-and-leaf plots, box plots, and normal probability plots to test the assumption that the errors are normally distributed.
  • If the applicability conditions of the least squares method are not met, use alternative methods (for example, quadratic or multiple regression models).
  • If the applicability conditions of the least squares method are met, it is necessary to test the hypothesis about the statistical significance of the regression coefficients and build confidence intervals containing the mathematical expectation and the predicted response value.
  • Avoid predicting values ​​of the dependent variable outside the range of the independent variable.
  • Keep in mind that statistical dependencies are not always causal. Remember that correlation between variables does not mean that there is a causal relationship between them.

Summary. As shown in the block diagram (Fig. 27), this note describes the simple linear regression model, the conditions for its applicability, and ways to test these conditions. The t-criterion for testing the statistical significance of the slope of the regression was considered. A regression model was used to predict the values of the dependent variable. An example related to choosing a location for a retail outlet was considered, in which the dependence of annual sales volume on store area was studied. This information makes it possible to select a location for a store more accurately and to predict its annual sales. The following notes will continue the discussion of regression analysis and will also cover multiple regression models.

Fig. 27. Block diagram of the note

Materials from the book Levin et al. Statistics for managers are used. - M.: Williams, 2004. - p. 792–872

If the dependent variable is categorical, logistic regression should be applied.

Regression analysis is one of the most popular methods of statistical research. It can be used to determine the degree of influence of independent variables on the dependent variable. The functionality of Microsoft Excel has tools designed to carry out this type of analysis. Let's take a look at what they are and how to use them.

But in order to use the function that allows you to conduct regression analysis, you first need to activate the Analysis Package (Analysis ToolPak). Only then will the tools necessary for this procedure appear on the Excel ribbon.


Now, when we go to the "Data" tab, we will see a new button, "Data Analysis", in the "Analysis" toolbox on the ribbon.

Types of regression analysis

There are several types of regressions:

  • parabolic;
  • power;
  • logarithmic;
  • exponential;
  • demonstrative (base-b exponential);
  • hyperbolic;
  • linear regression.

We will talk in more detail about the implementation of the last type of regression analysis in Excel later.

Linear Regression in Excel

Below, as an example, is a table showing the average daily outdoor air temperature and the number of store customers for the corresponding working day. Let us use regression analysis to find out exactly how weather conditions, in the form of air temperature, can affect the attendance of a retail establishment.

The general linear regression equation looks like this: Y = a0 + a1x1 + … + akxk. In this formula, Y is the variable whose behavior we are trying to study; in our case, it is the number of buyers. The values x are the various factors that affect the variable. The parameters a are the regression coefficients; that is, they determine the significance of a particular factor. The index k denotes the total number of these factors.


Analyzing the results

The results of the regression analysis are displayed in the form of a table in the place specified in the settings.

One of the main indicators is R-square. It indicates the quality of the model. In our case, this coefficient is 0.705, or about 70.5%. This is an acceptable level of quality; values below 0.5 indicate a poor fit.

Another important indicator is located in the cell at the intersection of the row "Y-intercept" and the column "Coefficients". It indicates what value Y (in our case, the number of buyers) will have when all other factors are equal to zero. In this table, this value is 58.04.

The value at the intersection of the row "Variable X1" and the column "Coefficients" shows the level of dependence of Y on X; in our case, the dependence of the number of store customers on temperature. A coefficient of 1.31 is considered a fairly strong indicator of influence.

As you can see, it is quite easy to create a regression analysis table using Microsoft Excel. However, only a trained person can work with the output data and understand its essence.
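The same slope, intercept and R-square that Excel reports can be computed directly. A sketch in Python with made-up temperature and customer figures (the article's worksheet data are not reproduced here):

```python
def linear_regression(xs, ys):
    """Slope, intercept and R-square by least squares (a sketch)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ssx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / ssx
    b0 = ybar - b1 * xbar
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    ss_res = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    return b0, b1, 1 - ss_res / ss_tot

temps = [14, 16, 19, 21, 23, 25]      # made-up daily temperatures
buyers = [78, 82, 85, 84, 88, 91]     # made-up customer counts
b0, b1, r2 = linear_regression(temps, buyers)
print(round(b1, 2), round(r2, 2))     # slope and R-square of the fit
```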

Lecture 3

Regression analysis.

1) Numerical characteristics of regression

2) Linear Regression

3) Nonlinear regression

4) Multiple Regression

5) Using MS EXCEL to perform regression analysis

Control and evaluation tool - test tasks

1. Numerical characteristics of regression

Regression analysis is a statistical method for studying the influence of one or more independent variables on a dependent variable. Independent variables are otherwise called regressors or predictors, and dependent variables are called criteria. The terminology of dependent and independent variables reflects only the mathematical dependence of the variables, and not the relationship of cause and effect.

Goals of regression analysis

  • Determination of the degree of determinism of the variation of the criterion (dependent) variable by predictors (independent variables).
  • Predicting the value of the dependent variable using the independent variable(s).
  • Determination of the contribution of individual independent variables to the variation of the dependent one.

Regression analysis cannot be used to determine whether there is a relationship between variables, since the existence of such a relationship is a prerequisite for applying the analysis.

To conduct regression analysis, you first need to get acquainted with the basic concepts of statistics and probability theory.

The basic numerical characteristics of discrete and continuous random variables are the mathematical expectation, the variance and the standard deviation.

Random variables are divided into two types:

  • discrete, which can take only specific, predetermined values (for example, the number on the upper face of a thrown die or the ordinal number of the current month);
  • continuous (most often the values of some physical quantities: weights, distances, temperatures, etc.), which, by the laws of nature, can take any value, at least within a certain interval.

The distribution law of a random variable is the correspondence between the possible values of a discrete random variable and their probabilities; it is usually written as a table:

The statistical definition of probability is expressed in terms of the relative frequency of a random event; that is, the probability is found as the ratio of the number of trials in which the event occurred to the total number of trials.

The mathematical expectation of a discrete random variable X is the sum of the products of the values of X and the probabilities of these values. The mathematical expectation is denoted by M(X).

M(X) = x1·p1 + x2·p2 + … + xn·pn = Σ xi·pi (the sum is taken over i = 1, …, n)

The spread of a random variable about its mathematical expectation is measured by a numerical characteristic called the variance (dispersion). Simply put, the variance is the scatter of a random variable around its mean. To understand the essence of the variance, consider an example. The average salary in the country is about 25 thousand rubles. Where does this number come from? Most likely, all salaries are added up and divided by the number of employees. In this case, the spread is very large (the minimum salary is about 4 thousand rubles, and the maximum is about 100 thousand rubles). If everyone had the same salary, the variance would be zero, and there would be no spread.

The variance of a discrete random variable X is the mathematical expectation of the squared difference between the random variable and its mathematical expectation:

D = M[(X − M(X))²]

Using the definition of mathematical expectation to calculate the variance, we obtain the formula:

D = Σ (xi − M(X))² · pi

The variance has the dimension of the square of a random variable. In cases where it is necessary to have a numerical characteristic of the dispersion of possible values ​​in the same dimension as the random variable itself, the standard deviation is used.

The standard deviation of a random variable is the square root of its variance.

The standard deviation is a measure of the dispersion of the values of a random variable around its mathematical expectation.

Example.

The distribution law of a random variable X is given by the following table:

X: 1, 2, 4, 5
p: 0.1, 0.4, 0.4, 0.1

Find its mathematical expectation, variance and standard deviation.

We use the above formulas:

M(X) = 1·0.1 + 2·0.4 + 4·0.4 + 5·0.1 = 3

D = (1 − 3)²·0.1 + (2 − 3)²·0.4 + (4 − 3)²·0.4 + (5 − 3)²·0.1 = 1.6

Example.

In a money lottery, 1 win of 1000 rubles, 10 wins of 100 rubles and 100 wins of 1 ruble are played, with a total of 10,000 tickets. Draw up the distribution law of the random win X for the owner of one lottery ticket, and determine the mathematical expectation, variance and standard deviation of this random variable.

X1 = 1000, X2 = 100, X3 = 1, X4 = 0;
P1 = 1/10000 = 0.0001, P2 = 10/10000 = 0.001, P3 = 100/10000 = 0.01, P4 = 1 − (P1 + P2 + P3) = 0.9889.

We put the results in a table:

X: 1000, 100, 1, 0
p: 0.0001, 0.001, 0.01, 0.9889

The mathematical expectation is the sum of the pairwise products of the values of the random variable and their probabilities. For this problem, it is convenient to calculate it by the formula

M(X) = 1000·0.0001 + 100·0.001 + 1·0.01 + 0·0.9889 = 0.21 rubles.

We have obtained the real, "fair" price of a ticket.

D = Σ (xi − M(X))² · pi = (1000 − 0.21)²·0.0001 + (100 − 0.21)²·0.001 + (1 − 0.21)²·0.01 + (0 − 0.21)²·0.9889 ≈ 109.97
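Both worked examples can be checked with a short function; a sketch in Python using the distribution tables above:

```python
import math

def discrete_stats(values, probs):
    """M(X), D(X) and the standard deviation of a discrete random
    variable given by its distribution table."""
    m = sum(x * p for x, p in zip(values, probs))
    d = sum((x - m) ** 2 * p for x, p in zip(values, probs))
    return m, d, math.sqrt(d)

# first example: X = 1, 2, 4, 5 with p = 0.1, 0.4, 0.4, 0.1
m, d, s = discrete_stats([1, 2, 4, 5], [0.1, 0.4, 0.4, 0.1])
print(round(m, 1), round(d, 1))       # 3.0 1.6, as in the text

# lottery example: the "fair" ticket price and the variance
m2, d2, s2 = discrete_stats([1000, 100, 1, 0], [0.0001, 0.001, 0.01, 0.9889])
print(round(m2, 2), round(d2, 2))     # 0.21 and about 109.97
```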

Distribution function of continuous random variables

A quantity that, as a result of a trial, takes one of its possible values (which one is not known in advance) is called a random variable. As mentioned above, random variables are divided into discrete (discontinuous) and continuous.

A discrete random variable is one that takes separate possible values, with certain probabilities, which can be enumerated.

A continuous variable is a random variable that can take on all values ​​from some finite or infinite interval.

Up to this point, we have limited ourselves to only one variety of random variable: discrete ones, which take separate, countable values.

But the theory and practice of statistics require the use of the concept of a continuous random variable - allowing any numerical values ​​from any interval.

The distribution law of a continuous random variable is conveniently specified using the so-called probability density function f(x). The probability P(a < X < b) that the value taken by the random variable X falls in the interval (a, b) is determined by the equality

P(a < X < b) = ∫ f(x) dx, integrated from a to b.

The graph of the function f(x) is called the distribution curve. Geometrically, the probability that the random variable falls in the interval (a, b) is equal to the area of the corresponding curvilinear trapezoid bounded by the distribution curve, the Ox axis and the straight lines x = a, x = b.

P(a ≤ X ≤ b) = P(a < X < b): if a finite or countable set of points is removed from an event, the probability of the new event remains unchanged, so for a continuous random variable the endpoints of the interval do not affect the probability.

The function f(x), a numerical scalar function of the real argument x, is called the probability density; it exists at a point x if the limit

f(x) = lim (Δx → 0) P(x ≤ X < x + Δx) / Δx

exists at this point.

Probability Density Properties:

  1. The probability density is a non-negative function, i.e. f(x) ≥ 0.
  2. The integral of the probability density over the whole real line equals one: ∫ f(x) dx = 1 (if all values of the random variable X lie in the interval (a, b), then this equality can be written as the integral from a to b of f(x) dx = 1).

Consider now the function F(x) = P(X < x). This function is called the probability distribution function of the random variable X. The function F(x) exists both for discrete and for continuous random variables. If f(x) is the probability density function of a continuous random variable X, then

F(x) = ∫ f(t) dt, integrated from −∞ to x.

It follows from the last equality that f(x) = F′(x).

Sometimes the function f(x) is called the differential probability distribution function, and the function F(x) is called the cumulative probability distribution function.

We note the most important properties of the probability distribution function:

  1. F(x) is a non-decreasing function.
  2. F(-∞)=0.
  3. F (+∞) = 1.

The concept of a distribution function is central to probability theory. Using this concept, one can give another definition of a continuous random variable: a random variable is called continuous if its cumulative distribution function F(x) is continuous.

Numerical characteristics of continuous random variables

The mathematical expectation, variance and other parameters of any random variables are almost always calculated using formulas that follow from the distribution law.

For a continuous random variable, the mathematical expectation is calculated by the formula:

M(X) = ∫ x f(x) dx

Dispersion:

D(X) = ∫ (x − M(X))² f(x) dx, or D(X) = ∫ x² f(x) dx − (M(X))²
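These integrals can be approximated numerically; a minimal sketch for the uniform density on (0, 1), whose expectation is 1/2 and variance is 1/12:

```python
def integrate(f, a, b, n=100_000):
    """Midpoint-rule quadrature (a simple sketch, not a library call)."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

f = lambda x: 1.0                                     # uniform density on (0, 1)
m = integrate(lambda x: x * f(x), 0, 1)               # M(X) = ∫ x f(x) dx
d = integrate(lambda x: x * x * f(x), 0, 1) - m * m   # D(X) = ∫ x² f(x) dx − (M(X))²
print(round(m, 3), round(d, 3))                       # 0.5 0.083
```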

2. Linear regression

Let the components X and Y of a two-dimensional random variable (X, Y) be dependent. We will assume that one of them can be approximately represented as a linear function of the other, for example Y ≈ g(X) = α + βX, and determine the parameters α and β using the least squares method.

Definition. The function g(X) = α + βX is called the best approximation of Y in the sense of the least squares method if the mathematical expectation M(Y − g(X))² takes the smallest possible value; the function g(X) is then called the mean square regression of Y on X.

Theorem. The linear mean square regression of Y on X has the form

g(X) = m_Y + r·(σ_Y/σ_X)·(X − m_X),

where m_X = M(X), m_Y = M(Y), σ_X and σ_Y are the standard deviations of X and Y, and r is the correlation coefficient of X and Y.

Coefficients of the equation. The parameters are found by minimizing F(α, β) = M(Y − α − βX)²; setting the partial derivatives ∂F/∂α and ∂F/∂β to zero gives α = m_Y − β·m_X and β = r·σ_Y/σ_X. One can check that for these values the function F(α, β) has a minimum, which proves the assertion of the theorem.

Definition. The coefficient β = r·σ_Y/σ_X is called the regression coefficient of Y on X, and the straight line y = m_Y + r·(σ_Y/σ_X)·(x − m_X) is called the direct mean square regression line of Y on X.

Substituting the coordinates of the stationary point into the equality, we find the minimum value of the function F(α, β), equal to σ_Y²(1 − r²). This value is called the residual dispersion of Y relative to X and characterizes the size of the error made when Y is replaced by g(X) = α + βX. At r = ±1, the residual variance is 0; that is, the equality is not approximate but exact, so for r = ±1 the variables Y and X are connected by a linear functional dependence. Similarly, one can obtain the straight line of mean square regression of X on Y,

x = m_X + r·(σ_X/σ_Y)·(y − m_Y),

and the residual variance σ_X²(1 − r²) of X with respect to Y. For r = ±1 both regression lines coincide. Comparing the regression equations of Y on X and X on Y and solving the system of these equations, one can find the point of intersection of the regression lines: the point with coordinates (m_X, m_Y), called the center of the joint distribution of X and Y.

We will follow the algorithm for compiling regression equations from the textbook by V. E. Gmurman, Probability Theory and Mathematical Statistics, p. 256.

1) Compile a calculation table in which the numbers of the sample elements, the sample values, their squares and their products are recorded.

2) Calculate the sums over all columns except the number column.

3) Calculate the mean values, the dispersions and the standard deviations of each quantity.

4) Calculate the sample correlation coefficient.

5) Test the hypothesis about the existence of a relationship between X and Y.

6) Compose the equations of both regression lines and plot the graphs of these equations.

The slope of the regression line of Y on X is the sample regression coefficient ρ_yx:

b = 0.202.

We obtain the desired equation of the regression line of Y on X:

Y = 0.202·X + 1.024

Similarly, for the regression equation of X on Y: the slope of the regression line of X on Y is the sample regression coefficient ρ_xy:

b = 4.119.

X = 4.119·Y − 3.714

3. Nonlinear regression

If there are non-linear relationships between economic phenomena, then they are expressed using the corresponding non-linear functions.

There are two classes of non-linear regressions:

1. Regressions that are non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, for example:

polynomials of various degrees: y = a + b·x + c·x² + …;

the equilateral hyperbola: y = a + b/x;

the semilogarithmic function: y = a + b·ln x.

2. Regressions that are non-linear in terms of the estimated parameters, for example:

the power function: y = a·x^b;

the demonstrative (base-b exponential) function: y = a·b^x;

the exponential function: y = e^(a + b·x).

Non-linear regressions with respect to the included variables are reduced to a linear form by a simple change of variables, and further parameter estimation is performed using the least squares method. Let's consider some functions.

The parabola of the second degree, y = a + b·x + c·x², is reduced to a linear form using the replacement x1 = x, x2 = x². As a result, we arrive at a two-factor equation, the estimation of whose parameters by the least squares method leads to a system of normal equations.

A parabola of the second degree is usually used in cases where, for a certain interval of factor values, the nature of the relationship of the characteristics under consideration changes: a direct relationship changes to an inverse one or an inverse one to a direct one.

An equilateral hyperbola can be used to characterize the relationship between the specific costs of raw materials, materials, fuel and the volume of output, the time of circulation of goods and the value of turnover. Its classic example is the Phillips curve, which characterizes the non-linear relationship between the unemployment rate x and percentage increase in wages y.

The hyperbola is reduced to a linear equation by the simple replacement z = 1/x. After that, the least squares method can be used to build the system of linear equations.

Other dependences are reduced to a linear form in a similar way.

An equilateral hyperbola and a semi-logarithmic curve are used to describe the Engel curve (a mathematical description of the relationship between the share of spending on durable goods and total spending (or income)). The equations in which they are included are used in studies of productivity, labor intensity of agricultural production.
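The substitution z = 1/x can be sketched in code; the data below are synthetic, generated from y = 2 + 3/x, so the fit should recover a ≈ 2 and b ≈ 3:

```python
def fit_hyperbola(xs, ys):
    """Fit y = a + b/x via the substitution z = 1/x, followed by
    ordinary least squares on (z, y): the linearization described above."""
    zs = [1 / x for x in xs]
    n = len(zs)
    zbar, ybar = sum(zs) / n, sum(ys) / n
    b = (sum((z - zbar) * (y - ybar) for z, y in zip(zs, ys))
         / sum((z - zbar) ** 2 for z in zs))
    return ybar - b * zbar, b

xs = [1, 2, 3, 4, 5]
ys = [2 + 3 / x for x in xs]          # synthetic data from y = 2 + 3/x
a, b = fit_hyperbola(xs, ys)
print(round(a, 3), round(b, 3))       # recovers 2.0 3.0
```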

4. Multiple Regression

Multiple regression is an equation relating the dependent variable to several independent variables:

y = f(x1, x2, …, xm),

where y is the dependent variable (the resultant sign), and x1, x2, …, xm are the independent variables (factors).

To build a multiple regression equation, the following functions are most often used:

linear: y = a + b1·x1 + b2·x2 + … + bm·xm;

power: y = a · x1^b1 · x2^b2 · … · xm^bm;

exponential: y = e^(a + b1·x1 + … + bm·xm);

hyperbolic: y = 1 / (a + b1·x1 + … + bm·xm).

You can use other functions that can be reduced to a linear form.

To estimate the parameters of the multiple regression equation, the least squares method (LSM) is used. For linear equations, and for non-linear equations reduced to linear ones, the following system of normal equations is constructed; its solution makes it possible to obtain estimates of the regression parameters:

To solve it, the method of determinants (Cramer's rule) can be applied:

a_j = Δ_j / Δ,

where Δ is the determinant of the system, and the Δ_j are the partial determinants, obtained by replacing the corresponding column of the matrix of the system with the column of free terms of the system.

Another type of multiple regression equation is the regression equation on a standardized scale; the least squares method is applicable to the multiple regression equation on a standardized scale as well.
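A sketch of estimating a two-predictor linear equation by solving the normal equations with the method of determinants (Cramer's rule); the data are synthetic and generated from known coefficients:

```python
def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_two_predictors(x1, x2, y):
    """Estimate y = a0 + a1*x1 + a2*x2 by solving the normal equations
    with Cramer's rule (the method of determinants)."""
    n = len(y)
    sx1, sx2, sy = sum(x1), sum(x2), sum(y)
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    A = [[n, sx1, sx2], [sx1, s11, s12], [sx2, s12, s22]]
    rhs = [sy, s1y, s2y]
    delta = det3(A)
    coefs = []
    for j in range(3):                 # replace column j with the free terms
        Aj = [row[:] for row in A]
        for i in range(3):
            Aj[i][j] = rhs[i]
        coefs.append(det3(Aj) / delta)
    return coefs

x1 = [1, 2, 3, 4, 5, 6]               # synthetic data from y = 1 + 2*x1 + 3*x2
x2 = [2, 1, 4, 3, 6, 5]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]
coefs = fit_two_predictors(x1, x2, y)
print([round(c, 3) for c in coefs])   # [1.0, 2.0, 3.0]
```

For more than a few predictors, Cramer's rule becomes inefficient, and a linear solver (Gaussian elimination or a library routine) is preferable.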

5. Using MS EXCEL to perform regression analysis

Regression analysis establishes the form of the relationship between the random variable Y (dependent) and the values ​​of one or more variables (independent), and the values ​​of the latter are considered to be exactly given. Such dependence is usually determined by some mathematical model (regression equation) containing several unknown parameters. In the course of regression analysis, on the basis of sample data, estimates of these parameters are found, statistical errors of estimates or boundaries of confidence intervals are determined, and the compliance (adequacy) of the accepted mathematical model with experimental data is checked.

In linear regression analysis, the relationship between random variables is assumed to be linear. In the simplest case, in a paired linear regression model, there are two variables X and Y. And it is required for n pairs of observations (X1, Y1), (X2, Y2), ..., (Xn, Yn) to build (select) a straight line, called the regression line, which "best" approximates the observed values. The equation of this line y=ax+b is a regression equation. Using a regression equation, you can predict the expected value of the dependent variable y corresponding to a given value of the independent variable x. In the case when the dependence between one dependent variable Y and several independent variables X1, X2, ..., Xm is considered, one speaks of multiple linear regression.

In this case, the regression equation has the form

y = a 0 +a 1 x 1 +a 2 x 2 +…+a m x m ,

where a0, a1, a2, …, am are the regression coefficients to be determined.

The coefficients of the regression equation are determined using the least squares method, achieving the minimum possible sum of squared differences between the real values ​​of the variable Y and those calculated using the regression equation. Thus, for example, a linear regression equation can be constructed even when there is no linear correlation.

A measure of the effectiveness of the regression model is the coefficient of determination R² (R-square). The coefficient of determination can take values between 0 and 1; it determines the degree of accuracy with which the resulting regression equation describes (approximates) the original data. The significance of the regression model is also investigated using the F-test (Fisher), and the significance of the difference of the coefficients a0, a1, a2, …, am from zero is checked using Student's t-test.

In Excel, the experimental data can be approximated by a linear equation of up to the 16th order:

y = a0 + a1x1 + a2x2 + … + a16x16

To obtain linear regression coefficients, the "Regression" procedure from the analysis package can be used. Also, the LINEST function provides complete information about the linear regression equation. In addition, the SLOPE and INTERCEPT functions can be used to obtain the parameters of the regression equation, and the TREND and FORECAST functions can be used to obtain the predicted Y values ​​at the required points (for pairwise regression).

Let us consider in detail the application of the LINEST function: LINEST(known_y, [known_x], [constant], [statistics]):

  • known_y - the range of known values of the dependent parameter Y. In pairwise regression analysis it can have any form; in multiple regression it must be either a row or a column;
  • known_x - the range of known values of one or more independent parameters. It must have the same shape as the Y range (for multiple parameters, several columns or rows, respectively);
  • constant - a boolean argument. If, based on the practical meaning of the regression analysis task, the regression line must pass through the origin, that is, the free coefficient must equal 0, the value of this argument should be set to 0 (or "false"). If the value is set to 1 (or "true") or omitted, the free coefficient is calculated in the usual way;
  • statistics - a boolean argument. If the value is set to 1 (or "true"), additional regression statistics (see the table) are returned, used to evaluate the effectiveness and significance of the model.

In the general case, for pairwise regression y = ax + b, the result of applying the LINEST function looks like this:

Table. Output Range of LINEST for Pairwise Regression Analysis

In the case of multiple regression analysis for the equation y=a0+a1x1+a2x2+…+amxm, the coefficients am,…,a1,a0 are displayed in the first line, and the standard errors for these coefficients are displayed in the second line. Rows 3-5, except for the first two columns filled with regression statistics, will yield #N/A.

The LINEST function should be entered as an array formula, first selecting an array of the desired size for the result (m+1 columns and 5 rows if regression statistics are required) and completing the formula entry by pressing CTRL+SHIFT+ENTER.
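For readers without Excel at hand, the ten numbers of that output range can be reproduced directly. This is a sketch of the layout described above, not a call to LINEST itself, and the sample data are made up:

```python
import math

def linest_pairwise(xs, ys):
    """The 5x2 output range for pairwise regression y = a*x + b,
    laid out as described above (a sketch, not a call to LINEST)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ssx = sum((x - xbar) ** 2 for x in xs)
    a = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ssx
    b = ybar - a * xbar
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    ss_resid = sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))
    ss_reg = ss_tot - ss_resid
    df = n - 2
    s_ey = math.sqrt(ss_resid / df)            # standard error of the estimate
    se_a = s_ey / math.sqrt(ssx)               # standard error of the slope
    se_b = s_ey * math.sqrt(1 / n + xbar ** 2 / ssx)
    f = ss_reg / (ss_resid / df)               # F statistic
    return [[a, b], [se_a, se_b],
            [ss_reg / ss_tot, s_ey], [f, df], [ss_reg, ss_resid]]

table = linest_pairwise([1, 2, 3, 4, 5], [2.2, 2.8, 4.5, 3.7, 5.5])  # made-up
print(round(table[0][0], 3), round(table[2][0], 3))   # slope and R-square
```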

The result for our example:

In addition, the program has a built-in function - Data Analysis on the Data tab.

It can also be used to perform regression analysis:

On the slide - the result of the regression analysis performed using Data Analysis.

RESULTS (the Data Analysis output). The output consists of three blocks: regression statistics (multiple R, R-square, normalized R-square, standard error, number of observations); an analysis-of-variance table with the significance F of the regression; and a table of coefficients that gives, for the Y-intercept and for Variable X 1, the coefficient, its standard error, the t-statistic, the P-value, and the lower and upper 95% (and 95.0%) confidence bounds.

The regression equations that we looked at earlier can also be built in MS Excel. To do this, first build a scatter plot, then through the context menu select Add trend line. In the new window, check the boxes Show equation on chart and Display R-squared value on chart.

Literature:

  1. Gmurman V. E. Probability Theory and Mathematical Statistics: Textbook for universities. 10th ed. - M.: Vysshaya Shkola, 2010. - 479 p.
  2. Danko P. E., Popov A. G., Kozhevnikova T. Ya., Danko S. P. Higher Mathematics in Exercises and Problems: Textbook for universities. In 2 parts. 6th ed. - M.: Oniks: Mir i Obrazovanie, 2007. - 416 p.
  3. http://www.machinelearning.ru/wiki/index.php?title=%D0%A0%D0%B5%D0%B3%D1%80%D0%B5%D1%81%D1%81%D0%B8%D1%8F - some information about regression analysis