Biometric Characteristics Analysis

Linear regression analysis. Methods of mathematical statistics

RESULTS

Table 8.3a. Regression statistics

Multiple R             0.998364
R-square               0.99673
Normalized R-square    0.996321
Standard error         0.42405
Observations           10

Let's first look at the upper part of the calculations presented in Table 8.3a, the regression statistics.

The R-square value, also called the measure of certainty (coefficient of determination), characterizes the quality of the resulting regression line. This quality is expressed by the degree of correspondence between the original data and the regression model (calculated data). The measure of certainty always lies within the interval [0; 1].

In most cases the R-square value lies between these extreme values, i.e. between zero and one.

If the value of the R-square is close to one, this means that the constructed model explains almost all the variability of the corresponding variables. Conversely, an R-squared value close to zero means poor quality of the constructed model.

In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.

Multiple R, the coefficient of multiple correlation R, expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).

Multiple R is equal to the square root of the coefficient of determination and likewise takes values in the range from zero to one.

In a simple linear regression analysis, the multiple R is equal to the Pearson correlation coefficient. Indeed, the multiple R in our case is equal to the Pearson correlation coefficient from the previous example (0.998364).

Table 8.3b. Regression coefficients*

               Coefficients   Standard error   t-statistic
Y-intercept    2.694545455    0.33176878       8.121757129
Variable X 1   2.305454545    0.04668634       49.38177965

* A truncated version of the calculations is given.

Now consider the middle part of the calculations, presented in Table 8.3b. Here the regression coefficient b (2.305454545) and the offset along the y-axis, i.e. the constant a (2.694545455), are given.

Based on the calculations, we can write the regression equation as follows:

Y = 2.305454545·X + 2.694545455

The direction of the relationship between the variables is determined by the sign (positive or negative) of the regression coefficient (coefficient b).

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable is positive. In our case the sign of the regression coefficient is positive, so the relationship is also positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

Table 8.3c presents the residual output. For these results to appear in the report, the "Residuals" checkbox must be activated when launching the "Regression" tool.

RESIDUAL OUTPUT

Table 8.3c. Residuals

Observation   Predicted Y    Residuals      Standard residuals
1             9.610909091    -0.610909091   -1.528044662
2             7.305454545    -0.305454545   -0.764022331
3             11.91636364     0.083636364    0.209196591
4             14.22181818     0.778181818    1.946437843
5             16.52727273     0.472727273    1.182415512
6             18.83272727     0.167272727    0.418393181
7             21.13818182    -0.138181818   -0.34562915
8             23.44363636    -0.043636364   -0.109146047
9             25.74909091    -0.149090909   -0.372915662
10            28.05454545    -0.254545455   -0.636685276

Using this part of the report, we can see the deviation of each point from the constructed regression line. The residual with the greatest absolute value belongs to observation 4 (0.778181818; standard residual 1.946437843).
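The whole report is easy to reproduce outside Excel. Below is a minimal sketch in plain Python; the x values are an assumption, recovered by inverting the fitted equation on the Predicted Y column, and each y is the predicted value plus its residual.

    # Data reconstructed from Tables 8.3b-8.3c (x values are an assumption).
    x = [3, 2, 4, 5, 6, 7, 8, 9, 10, 11]
    y = [9.0, 7.0, 12.0, 15.0, 17.0, 19.0, 21.0, 23.4, 25.6, 27.8]

    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n

    # Least squares estimates of the slope b and the intercept a.
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    print(a, b)        # 2.6945..., 2.3054... as in Table 8.3b

    predicted = [a + b * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]
    print(residuals)   # matches the "Residuals" column of Table 8.3c

    # R-square (the measure of certainty) from Table 8.3a.
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    print(1 - ss_res / ss_tot)   # ≈ 0.99673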

Lecture 3

Regression analysis.

1) Numerical characteristics of regression

2) Linear regression

3) Nonlinear regression

4) Multiple Regression

5) Using MS EXCEL to perform regression analysis

Control and evaluation tool - test tasks

1. Numerical characteristics of regression

Regression analysis is a statistical method for studying the influence of one or more independent variables on a dependent variable. Independent variables are otherwise called regressors or predictors, and dependent variables are called criteria. The terminology of dependent and independent variables reflects only the mathematical dependence of the variables, and not the relationship of cause and effect.

Goals of regression analysis

  • Determination of the degree of determinism of the variation of the criterion (dependent) variable by predictors (independent variables).
  • Predicting the value of the dependent variable using the independent variable(s).
  • Determination of the contribution of individual independent variables to the variation of the dependent one.

Regression analysis cannot be used to determine whether there is a relationship between variables, since the existence of such a relationship is a prerequisite for applying the analysis.

To conduct regression analysis, you first need to get acquainted with the basic concepts of statistics and probability theory.

Basic numerical characteristics of discrete and continuous random variables: mathematical expectation, variance and standard deviation.

Random variables are divided into two types:

  • discrete, which can take only specific, predetermined values (for example, the number on the upper face of a thrown die or the ordinal number of the current month);
  • continuous (most often the values of physical quantities: weight, distance, temperature, etc.), which by the laws of nature can take any value, at least within a certain interval.

The distribution law of a random variable is the correspondence between the possible values of a discrete random variable and their probabilities, usually written as a table:

X: x1   x2   …   xn
p: p1   p2   …   pn

The statistical definition of probability is expressed in terms of the relative frequency of a random event, i.e. the probability is found as the ratio of the number of trials in which the event occurred to the total number of trials.

The mathematical expectation of a discrete random variable X is the sum of the products of the values of X and the probabilities of these values. The mathematical expectation is denoted by M(X):

M(X) = x1·p1 + x2·p2 + … + xn·pn = Σ xi·pi,  i = 1, …, n

How strongly a random variable is scattered around its mathematical expectation is described by a numerical characteristic called the variance (dispersion). Simply put, the variance is the spread of a random variable around its mean value. To understand the essence of variance, consider an example. The average salary in the country is about 25 thousand rubles. Where does this number come from? Most likely, all salaries are added up and divided by the number of employees. In this case there is a very large spread (the minimum salary is about 4 thousand rubles, the maximum about 100 thousand rubles). If everyone had the same salary, the variance would be zero and there would be no spread.

The variance of a discrete random variable X is the mathematical expectation of the squared difference between the random variable and its mathematical expectation:

D(X) = M[(X − M(X))²]

Using the definition of mathematical expectation to calculate the variance, we obtain the formula:

D = Σ (xi − M(X))²·pi

The variance has the dimension of the square of the random variable. When a numerical characteristic of the scatter of possible values is needed in the same units as the random variable itself, the standard deviation is used.

The standard deviation of a random variable is the square root of its variance.

The standard deviation is a measure of the scatter of the values of a random variable around its mathematical expectation.

Example.

The distribution law of a random variable X is given by the following table:

X: 1    2    4    5
p: 0.1  0.4  0.4  0.1

Find its mathematical expectation, variance, and standard deviation.

We use the above formulas:

M(X) = 1·0.1 + 2·0.4 + 4·0.4 + 5·0.1 = 3

D = (1 − 3)²·0.1 + (2 − 3)²·0.4 + (4 − 3)²·0.4 + (5 − 3)²·0.1 = 1.6

σ = √D = √1.6 ≈ 1.26
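This computation is easy to check programmatically. A minimal sketch in Python, using the distribution table above:

    from math import sqrt

    def m_d_sigma(xs, ps):
        """Mathematical expectation, variance, and standard deviation of a
        discrete random variable given by value/probability lists."""
        m = sum(x * p for x, p in zip(xs, ps))
        d = sum((x - m) ** 2 * p for x, p in zip(xs, ps))
        return m, d, sqrt(d)

    print(m_d_sigma([1, 2, 4, 5], [0.1, 0.4, 0.4, 0.1]))
    # (3.0, 1.6, 1.264...)

The same function also reproduces the lottery example that follows: m_d_sigma([1000, 100, 1, 0], [0.0001, 0.001, 0.01, 0.9889]) returns (0.21, ≈109.97, ≈10.49).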

Example.

In a money lottery, 1 win of 1000 rubles, 10 wins of 100 rubles, and 100 wins of 1 ruble each are drawn among a total of 10,000 tickets. Construct the distribution law of the random winning X for the owner of one lottery ticket and determine the mathematical expectation, variance, and standard deviation of this random variable.

x1 = 1000, x2 = 100, x3 = 1, x4 = 0;

p1 = 1/10000 = 0.0001, p2 = 10/10000 = 0.001, p3 = 100/10000 = 0.01, p4 = 1 − (p1 + p2 + p3) = 0.9889.

We put the results in a table:

X: 1000    100    1     0
p: 0.0001  0.001  0.01  0.9889

The mathematical expectation is the sum of the pairwise products of the values of the random variable and their probabilities. For this problem it is convenient to calculate it by the formula

M(X) = 1000·0.0001 + 100·0.001 + 1·0.01 + 0·0.9889 = 0.21 rubles.

We got a real "fair" ticket price.

D = Σ (xi − M(X))²·pi = (1000 − 0.21)²·0.0001 + (100 − 0.21)²·0.001 + (1 − 0.21)²·0.01 + (0 − 0.21)²·0.9889 ≈ 109.97

σ = √D ≈ 10.49 rubles.

Distribution function of continuous random variables

A quantity that, as a result of a trial, takes one of its possible values (which one is not known in advance) is called a random variable. As mentioned above, random variables are divided into discrete (discontinuous) and continuous.

A discrete random variable is one that takes separate possible values, which can be enumerated, with certain probabilities.

A continuous variable is a random variable that can take on all values ​​from some finite or infinite interval.

Up to this point, we have limited ourselves to only one "variety" of random variables: discrete ones, which take finitely many (or countably many) values.

But the theory and practice of statistics require the use of the concept of a continuous random variable, which admits any numerical value from some interval.

The distribution law of a continuous random variable is conveniently specified by means of the so-called probability density function f(x). The probability P(a < X < b) that the value taken by the random variable X falls in the interval (a; b) is determined by the equality

P(a < X < b) = ∫[a, b] f(x) dx

The graph of the function f(x) is called the distribution curve. Geometrically, the probability of the random variable falling into the interval (a; b) is equal to the area of the corresponding curvilinear trapezoid bounded by the distribution curve, the Ox axis, and the straight lines x = a and x = b.

Since removing a finite or countable set of points from an event does not change its probability, P(a ≤ X ≤ b) = P(a < X < b) for a continuous random variable.

The function f(x), a numerical scalar function of the real argument x, is called the probability density; it exists at a point x if the following limit exists at this point:

f(x) = lim (Δx→0) P(x < X < x + Δx) / Δx

Probability Density Properties:

  1. The probability density is a non-negative function, i.e. f(x) ≥ 0.
  2. The integral of the probability density over the entire real line equals one: ∫[−∞, +∞] f(x) dx = 1

(if all values of the random variable X lie in the interval (a; b), the last equality can be written as ∫[a, b] f(x) dx = 1).

Consider now the function F(x) = P(X < x). This function is called the probability distribution function of the random variable X. The function F(x) exists for both discrete and continuous random variables. If f(x) is the probability density function of a

continuous random variable X, then F(x) = ∫[−∞, x] f(t) dt.

It follows from the last equality that f(x) = F′(x).

Sometimes the function f(x) is called the differential probability distribution function, and the function F(x) is called the cumulative probability distribution function.

We note the most important properties of the probability distribution function:

  1. F(x) is a non-decreasing function.
  2. F(-∞)=0.
  3. F (+∞) = 1.

The concept of a distribution function is central to the theory of probability. Using this concept, one can give another definition of a continuous random variable. A random variable is called continuous if its integral distribution function F(x) is continuous.

Numerical characteristics of continuous random variables

The mathematical expectation, variance and other parameters of any random variables are almost always calculated using formulas that follow from the distribution law.

For a continuous random variable, the mathematical expectation is calculated by the formula:

M(X) = ∫[−∞, +∞] x·f(x) dx

Dispersion:

D(X) = ∫[−∞, +∞] (x − M(X))²·f(x) dx, or D(X) = ∫[−∞, +∞] x²·f(x) dx − (M(X))²
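These integrals can be checked with any quadrature routine. A minimal sketch, assuming SciPy is available and using the uniform density on [0, 1] purely as an illustration (for which M(X) = 1/2 and D(X) = 1/12):

    from scipy.integrate import quad

    f = lambda x: 1.0   # density of the uniform distribution on [0, 1]

    m, _ = quad(lambda x: x * f(x), 0, 1)             # M(X) = ∫ x f(x) dx
    d, _ = quad(lambda x: (x - m) ** 2 * f(x), 0, 1)  # D(X) = ∫ (x - M(X))² f(x) dx
    print(m, d)   # 0.5, 0.08333... (= 1/12)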

2. Linear regression

Let the components X and Y of a two-dimensional random variable (X, Y) be dependent. We will assume that one of them can be approximately represented as a linear function of the other, for example

Y ≈ g(X) = α + βX, and determine the parameters α and β using the least squares method.

Definition. The function g(X) = α + βX is called the best approximation of Y in the sense of the least squares method if the mathematical expectation M(Y − g(X))² takes the smallest possible value; the function g(X) is then called the mean square regression of Y on X.

Theorem. The linear mean square regression of Y on X has the form

g(X) = m_y + r·(σ_y/σ_x)·(X − m_x),

where m_x = M(X), m_y = M(Y), σ_x and σ_y are the standard deviations of X and Y, and r is their correlation coefficient.

Coefficients of the equation. One can check that for these values of α and β the function

F(α, β) = M(Y − α − βX)²

has a minimum, which proves the assertion of the theorem.

Definition. The coefficient β = r·(σ_y/σ_x) is called the regression coefficient of Y on X, and the straight line y = m_y + r·(σ_y/σ_x)·(x − m_x) is called the line of mean square regression of Y on X.

Substituting the coordinates of the stationary point into the equality, we find that the minimum value of the function F(α, β) equals σ_y²·(1 − r²). This value is called the residual variance of Y relative to X and characterizes the size of the error committed when Y is replaced by g(X) = α + βX. At r = ±1 the residual variance is 0, that is, the equality is exact rather than approximate: Y and X are then connected by a linear functional dependence. Similarly, one can obtain the line of mean square regression of X on Y:

x = m_x + r·(σ_x/σ_y)·(y − m_y),

and the residual variance σ_x²·(1 − r²) of X relative to Y. At r = ±1 both regression lines coincide. Comparing the regression equations of Y on X and X on Y and solving the system of equations, one can find the intersection point of the regression lines: the point with coordinates (m_x, m_y), called the center of the joint distribution of X and Y.

We will consider the algorithm for compiling regression equations from the textbook by V. E. Gmurman “Probability Theory and Mathematical Statistics” p. 256.

1) Compile a calculation table, recording in it the numbers of the sample elements, the sample values, their squares, and the products of the paired values.

2) Calculate the sums of all columns except the numbering column.

3) Calculate the average values, dispersions, and standard deviations for each quantity.

4) Calculate the sample correlation coefficient.

5) Test the hypothesis that a relationship between X and Y exists.

6) Compose the equations of both regression lines and plot the graphs of these equations.

The slope of the regression line of Y on X is the sample regression coefficient b_yx.

Coefficient b_yx = 0.202.

We obtain the desired equation of the regression line of Y on X:

Y = 0.202·X + 1.024

Similarly for the regression equation of X on Y. The slope of the regression line of X on Y is the sample regression coefficient b_xy.

Coefficient b_xy = 4.119.

X = 4.119·Y − 3.714
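The following sketch runs steps 1)-6) in Python on made-up data (the textbook's own sample is not reproduced here, so the coefficients will not equal 0.202 and 4.119); it also shows that the correlation coefficient equals the geometric mean of the two regression coefficients.

    from math import sqrt

    x = [1, 2, 3, 4, 5, 6, 7, 8]            # illustrative data, an assumption
    y = [2.1, 2.7, 2.8, 3.4, 3.6, 4.3, 4.4, 5.0]
    n = len(x)

    mx, my = sum(x) / n, sum(y) / n                 # step 3: means
    sx = sqrt(sum((xi - mx) ** 2 for xi in x) / n)  # standard deviations
    sy = sqrt(sum((yi - my) ** 2 for yi in y) / n)
    r = sum((xi - mx) * (yi - my)
            for xi, yi in zip(x, y)) / (n * sx * sy)  # step 4

    b_yx = r * sy / sx   # slope of the regression of Y on X
    b_xy = r * sx / sy   # slope of the regression of X on Y

    # Step 6: both regression lines pass through the center (mx, my).
    print(f"Y = {b_yx:.3f}*X + {my - b_yx * mx:.3f}")
    print(f"X = {b_xy:.3f}*Y + {mx - b_xy * my:.3f}")
    print(f"r = {r:.3f}, sqrt(b_yx*b_xy) = {sqrt(b_yx * b_xy):.3f}")  # coincide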

3. Nonlinear regression

If there are non-linear relationships between economic phenomena, then they are expressed using the corresponding non-linear functions.

There are two classes of non-linear regressions:

1. Regressions that are non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, for example:

polynomials of various degrees: y = a + b·x + c·x² + …;

the equilateral hyperbola: y = a + b/x;

the semilogarithmic function: y = a + b·ln(x).

2. Regressions that are non-linear in the estimated parameters, for example:

the power function: y = a·x^b;

the exponential function: y = a·b^x;

the e-based exponential function: y = e^(a + b·x).

Non-linear regressions with respect to the included variables are reduced to a linear form by a simple change of variables, and further parameter estimation is performed using the least squares method. Let's consider some functions.

The parabola of the second degree is reduced to a linear form using the substitution x1 = x, x2 = x². As a result, we arrive at a two-factor equation, the estimation of whose parameters by the least squares method leads to a system of normal equations.

A parabola of the second degree is usually used in cases where, for a certain interval of factor values, the nature of the relationship of the characteristics under consideration changes: a direct relationship changes to an inverse one or an inverse one to a direct one.

An equilateral hyperbola can be used to characterize the relationship between the specific costs of raw materials, materials, fuel and the volume of output, the time of circulation of goods and the value of turnover. Its classic example is the Phillips curve, which characterizes the non-linear relationship between the unemployment rate x and percentage increase in wages y.

The hyperbola is reduced to a linear equation by the simple substitution z = 1/x, after which the least squares method can be used to build the system of linear equations.

Other dependences are reduced to a linear form in a similar way.
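A minimal sketch of this linearization in Python; the data are illustrative (generated roughly as y ≈ 2 + 8/x), not taken from the text.

    x = [1.0, 2.0, 4.0, 5.0, 8.0, 10.0]
    y = [10.2, 6.1, 3.9, 3.7, 3.1, 2.8]

    z = [1.0 / xi for xi in x]          # the substitution z = 1/x
    n = len(z)
    mz, my = sum(z) / n, sum(y) / n

    # Ordinary least squares on the linearized variables (z, y).
    b = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y)) / \
        sum((zi - mz) ** 2 for zi in z)
    a = my - b * mz
    print(f"y = {a:.2f} + {b:.2f}/x")   # close to y = 2 + 8/x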

An equilateral hyperbola and a semi-logarithmic curve are used to describe the Engel curve (a mathematical description of the relationship between the share of spending on durable goods and total spending (or income)). The equations in which they are included are used in studies of productivity, labor intensity of agricultural production.

4. Multiple Regression

Multiple regression is an equation relating the dependent variable to several independent variables:

y = f(x1, x2, …, xm),

where y is the dependent variable (resultant attribute) and x1, …, xm are the independent variables (factors).

To build a multiple regression equation, the following functions are most often used:

linear: y = a + b1·x1 + b2·x2 + … + bm·xm;

power: y = a · x1^b1 · x2^b2 · … · xm^bm;

exponential: y = e^(a + b1·x1 + … + bm·xm);

hyperbola: y = 1 / (a + b1·x1 + … + bm·xm).

You can use other functions that can be reduced to a linear form.

To estimate the parameters of the multiple regression equation, the least squares method (LSM) is used. For linear equations and non-linear equations reduced to linear ones, the following system of normal equations is constructed, the solution of which makes it possible to obtain estimates of the regression parameters:

To solve it, the method of determinants (Cramer's rule) can be applied:

a = Δa/Δ, b1 = Δb1/Δ, …, bm = Δbm/Δ,

where Δ is the determinant of the system, and Δa, Δb1, …, Δbm are the partial determinants, obtained by replacing the corresponding column of the matrix of the system's determinant with the column of free terms of the system.

Another form of the multiple regression equation is the regression equation on a standardized scale; the least squares method is applicable to it as well.
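As a numerical illustration of estimating such an equation, the sketch below solves the OLS normal equations (XᵀX)·b = Xᵀ·y for a linear two-factor model. Python with NumPy and the made-up data are assumptions, not part of the lecture.

    import numpy as np

    x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    x2 = np.array([2, 1, 4, 3, 6, 5], dtype=float)
    y  = np.array([4.9, 4.1, 8.9, 8.1, 13.0, 12.1])

    # Design matrix with an intercept column.
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # Solve the normal equations directly.
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta)   # estimates of a, b1, b2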

5. Using MS EXCEL to perform regression analysis

Regression analysis establishes the form of the relationship between the random variable Y (dependent) and the values ​​of one or more variables (independent), and the values ​​of the latter are considered to be exactly given. Such dependence is usually determined by some mathematical model (regression equation) containing several unknown parameters. In the course of regression analysis, estimates of these parameters are found on the basis of sample data, statistical errors of estimates or boundaries of confidence intervals are determined, and the compliance (adequacy) of the accepted mathematical model with experimental data is checked.

In linear regression analysis, the relationship between random variables is assumed to be linear. In the simplest case, in a paired linear regression model, there are two variables X and Y. And it is required for n pairs of observations (X1, Y1), (X2, Y2), ..., (Xn, Yn) to build (select) a straight line, called the regression line, which "best" approximates the observed values. The equation of this line y=ax+b is a regression equation. Using a regression equation, you can predict the expected value of the dependent variable y corresponding to a given value of the independent variable x. In the case when the dependence between one dependent variable Y and several independent variables X1, X2, ..., Xm is considered, one speaks of multiple linear regression.

In this case, the regression equation has the form

y = a0 + a1·x1 + a2·x2 + … + am·xm,

where a0, a1, a2, …, am are the regression coefficients to be determined.

The coefficients of the regression equation are determined using the least squares method, achieving the minimum possible sum of squared differences between the real values ​​of the variable Y and those calculated using the regression equation. Thus, for example, a linear regression equation can be constructed even when there is no linear correlation.

A measure of the effectiveness of the regression model is the coefficient of determination R² (R-square). The coefficient of determination can take values between 0 and 1 and determines the degree of accuracy with which the resulting regression equation describes (approximates) the original data. The significance of the regression model is investigated using the F-test (Fisher), and the significance of the difference of the coefficients a0, a1, a2, …, am from zero is checked using Student's t-test.

In Excel, the experimental data are approximated by a linear equation up to the 16th order:

y = a0 + a1·x1 + a2·x2 + … + a16·x16

To obtain linear regression coefficients, the "Regression" procedure from the analysis package can be used. Also, the LINEST function provides complete information about the linear regression equation. In addition, the SLOPE and INTERCEPT functions can be used to obtain the parameters of the regression equation, and the TREND and FORECAST functions can be used to obtain the predicted Y values ​​at the required points (for pairwise regression).

Let us consider in detail the application of the LINEST function (known_y, [known_x], [constant], [statistics]):

  • known_y - the range of known values of the dependent parameter Y. In pairwise regression analysis it can have any form; in multiple regression it must be either a row or a column;
  • known_x - the range of known values of one or more independent parameters. It must have the same shape as the Y range (for multiple parameters, several columns or rows, respectively);
  • constant - a boolean argument. If, based on the practical meaning of the regression analysis task, the regression line must pass through the origin, i.e. the free coefficient must equal 0, the value of this argument should be set to 0 (or "false"). If it is set to 1 (or "true") or omitted, the free coefficient is calculated in the usual way;
  • statistics - a boolean argument. If it is set to 1 (or "true"), additional regression statistics (see the table) are returned, used to evaluate the effectiveness and significance of the model.

In the general case, for pairwise regression y = ax + b, the result of applying the LINEST function looks like this:

Table. Output Range of LINEST for Pairwise Regression Analysis

In the case of multiple regression analysis for the equation y = a0 + a1·x1 + a2·x2 + … + am·xm, the coefficients am, …, a1, a0 are displayed in the first row, and the standard errors of these coefficients in the second row. In rows 3-5, all cells except the first two columns, which are filled with regression statistics, return #N/A.

The LINEST function should be entered as an array formula, first selecting an array of the desired size for the result (m+1 columns and 5 rows if regression statistics are required) and completing the formula entry by pressing CTRL+SHIFT+ENTER.

The result for our example is shown on the slide.
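For readers working outside Excel, an equivalent of the pairwise LINEST computation can be sketched in Python (NumPy assumed; the data are illustrative). It returns the same slope, intercept, and R² that LINEST places in its output range.

    import numpy as np

    known_x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    known_y = np.array([2.9, 5.1, 7.0, 9.2, 10.9])

    # Degree-1 least squares fit: returns [slope, intercept].
    slope, intercept = np.polyfit(known_x, known_y, 1)

    predicted = slope * known_x + intercept
    ss_res = np.sum((known_y - predicted) ** 2)
    ss_tot = np.sum((known_y - known_y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot

    print(slope, intercept, r2)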

In addition, the program has a built-in function - Data Analysis on the Data tab.

It can also be used to perform regression analysis:

On the slide - the result of the regression analysis performed using Data Analysis.

RESULTS

The Data Analysis report contains three blocks:

  • regression statistics: Multiple R, R-square, Normalized R-square, standard error, number of observations;
  • analysis of variance: including the Significance F of the regression;
  • coefficients: for each term (Y-intercept, Variable X 1) the coefficient, its standard error, t-statistic, P-value, and the lower and upper 95% (and 95.0%) confidence limits.

The regression equations that we considered earlier can also be built in MS Excel. To do this, first build a scatter plot, then select Add Trendline through the context menu. In the dialog that opens, check the boxes "Show equation on chart" and "Display R-squared value on chart".

Literature:

  1. Gmurman V. E. Probability Theory and Mathematical Statistics: Textbook for universities. 10th ed. Moscow: Vysshaya Shkola, 2010. 479 p.
  2. Danko P. E., Popov A. G., Kozhevnikova T. Ya., Danko S. P. Higher Mathematics in Exercises and Problems: Textbook for universities. In 2 parts. 6th ed. Moscow: Oniks; Mir i Obrazovanie, 2007. 416 p.
  3. http://www.machinelearning.ru/wiki/index.php?title=%D0%A0%D0%B5%D0%B3%D1%80%D0%B5%D1%81%D1%81%D0%B8%D1%8F - some information about regression analysis

The concept of regression. The relationship between the variables x and y can be described in different ways. In particular, any form of connection can be expressed by the general equation y = f(x), where y is treated as a dependent variable, or a function, of another, independent variable x, called the argument. The correspondence between an argument and a function can be given by a table, a formula, a graph, and so on. The change of a function depending on a change in one or more arguments is called regression. All the means used to describe correlations make up the content of regression analysis.

Regression is expressed by correlation equations, or regression equations, by empirical and theoretically calculated regression series, by their graphs, called regression lines, and by linear and non-linear regression coefficients.

Regression indicators express the correlation in both directions: they take into account the change in the average values of the characteristic Y for changing values x_i of the characteristic X and, conversely, show the change in the average values of the characteristic X for changed values y_i of the characteristic Y. The exception is time series, or series of dynamics, which show the change of characteristics over time; the regression of such series is one-sided.

There are many different forms and types of correlations. The task comes down to identifying the form of the connection in each specific case and expressing it by the corresponding correlation equation, which makes it possible to foresee possible changes in one characteristic Y based on known changes in X, correlated with the first.

12.1 Linear regression

Regression equation. The results of observations carried out on a particular biological object with respect to the correlated characteristics x and y can be represented by points on a plane in a system of rectangular coordinates. The result is a scatter diagram, which makes it possible to judge the form and tightness of the relationship between the varying characteristics. Quite often this relationship looks like a straight line or can be approximated by a straight line.

The linear relationship between the variables x and y is described by the general equation y = a + b·x1 + c·x2 + d·x3 + …, where a, b, c, d, … are parameters of the equation that determine the relationship between the arguments x1, x2, x3, …, xm and the function y.

In practice, not all possible arguments are taken into account, but only some of them; in the simplest case, only one:

y = a + b·x  (1)

In the linear regression equation (1), a is the free term, and the parameter b determines the slope of the regression line with respect to the rectangular coordinate axes. In analytic geometry this parameter is called the slope, and in biometrics the regression coefficient. A visual representation of this parameter and of the position of the regression lines of Y on X and of X on Y in the system of rectangular coordinates is given in Fig. 1.

Fig. 1. Regression lines of Y on X and of X on Y in the system of rectangular coordinates

The regression lines, as shown in Fig. 1, intersect at the point O(x̄, ȳ), corresponding to the arithmetic mean values of the mutually correlated characteristics Y and X. When plotting regression graphs, the values of the independent variable X are plotted along the abscissa, and the values of the dependent variable, or function, Y along the ordinate. The line AB passing through the point O(x̄, ȳ) corresponds to the complete (functional) relationship between the variables Y and X, when the correlation coefficient r = ±1. The stronger the connection between Y and X, the closer the regression lines are to AB; conversely, the weaker the connection between these values, the more distant the regression lines are from AB. In the absence of a connection between the characteristics, the regression lines are at right angles to each other, and r = 0.

Since the regression indicators express the correlation in both directions, the regression equation (1) should be written in two forms:

ŷ_x = ȳ + b_yx·(x − x̄) and x̂_y = x̄ + b_xy·(y − ȳ)

The first formula determines the averaged values of Y when the characteristic X changes by a unit of measure; the second determines the averaged values of X when the characteristic Y changes by a unit of measure.

Regression coefficient. The regression coefficient shows by how much, on average, the value of one characteristic y changes when the other characteristic X, correlated with Y, changes by a unit of measure. This indicator is determined by the formula

b_yx = r·(s_y/s_x)

Here the values of s are multiplied by the size of the class interval λ if they were found from variation series or correlation tables.

The regression coefficient can also be calculated bypassing the standard deviations s_y and s_x, for example by the formula

b_yx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²,

which also serves when the correlation coefficient has not been calculated beforehand.

Relationship between the regression and correlation coefficients. Comparing formulas (11.1) (topic 11) and (12.5), we see that their numerators contain the same value, Σ(x − x̄)(y − ȳ), which indicates a connection between these indicators. This relationship is expressed by the equality

r_xy = √(b_yx · b_xy)  (6)

Thus, the correlation coefficient is equal to the geometric mean of the coefficients b_yx and b_xy. Formula (6) allows, firstly, the correlation coefficient r_xy to be determined from the known values of the regression coefficients b_yx and b_xy and, secondly, the correctness of the calculation of this correlation indicator r_xy between the varying characteristics X and Y to be checked.

Like the correlation coefficient, the regression coefficient characterizes only a linear relationship and is accompanied by a plus sign for a positive relationship and a minus sign for a negative relationship.

Determination of linear regression parameters. It is known that the sum of squared deviations of the values x_i from their mean is the smallest possible, i.e. Σ(x_i − x̄)² = min. This theorem forms the basis of the least squares method. With respect to linear regression [see formula (1)], the requirement of this theorem is satisfied by a certain system of equations called normal:

n·a + b·Σx = Σy
a·Σx + b·Σx² = Σxy

The joint solution of these equations with respect to the parameters a and b leads to the following results:

b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²), whence a = ȳ − b·x̄.

Given the two-way nature of the relationship between the variables Y and X, the formula for determining the parameter a should be expressed in both directions:

a_yx = ȳ − b_yx·x̄ and a_xy = x̄ − b_xy·ȳ  (7)

The parameter b, or regression coefficient, is determined by the formulas for b_yx and b_xy given above.

Construction of empirical regression series. In the presence of a large number of observations, regression analysis begins with the construction of empirical regression series. An empirical regression series is formed by calculating, for the values of one varying characteristic X, the average values of the other characteristic Y correlated with X. In other words, the construction of empirical regression series comes down to finding the group averages ȳ and x̄ from the corresponding values of the characteristics Y and X.

An empirical regression series is a double series of numbers that can be represented by points on a plane; by connecting these points with straight-line segments, an empirical regression line is obtained. Empirical regression series, and especially their plots, called regression lines, give a visual representation of the form and tightness of the correlation between varying characteristics.

Equalization of empirical regression series. Graphs of empirical regression series are, as a rule, broken rather than smooth lines. This is explained by the fact that, along with the main causes that determine the general pattern in the variability of the correlated characteristics, their values are affected by numerous secondary causes that produce random fluctuations in the nodal points of the regression. To identify the main tendency (trend) of the conjugate variation of the correlated characteristics, the broken lines need to be replaced with smooth, smoothly running regression lines. The process of replacing broken lines with smooth ones is called the alignment of empirical series and regression lines.

Graphic alignment method. This is the simplest method that does not require computational work. Its essence is as follows. The empirical regression series is plotted as a graph in a rectangular coordinate system. Then, the midpoints of the regression are visually outlined, along which a solid line is drawn using a ruler or pattern. The disadvantage of this method is obvious: it does not exclude the influence of the individual characteristics of the researcher on the results of the alignment of empirical regression lines. Therefore, in cases where higher accuracy is required when replacing broken regression lines with smooth ones, other methods of aligning the empirical series are used.

Moving average method. The essence of this method is reduced to the sequential calculation of the arithmetic mean of two or three neighboring members of the empirical series. This method is especially convenient in cases where the empirical series is represented by a large number of terms, so that the loss of two of them - the extreme ones, which is inevitable with this method of equalization, will not noticeably affect its structure.

Least squares method. This method was proposed at the beginning of the 19th century by A. M. Legendre and, independently of him, by C. Gauss. It allows the empirical series to be aligned most accurately. This method, as shown above, is based on the condition that the sum of squared deviations of the values x_i from their mean is minimal, i.e. Σ(x_i − x̄)² = min; hence the name of the method, which is used not only in ecology but also in technology. The method of least squares is objective and universal; it is used in a wide variety of cases when finding empirical equations of regression series and determining their parameters.

The requirement of the least squares method is that the theoretical points of the regression line must be obtained in such a way that the sum of the squared deviations of the empirical observations y_i from these points is minimal, i.e. Σ(y_i − ŷ_i)² = min.

Calculating the minimum of this expression in accordance with the principles of mathematical analysis and transforming it in a certain way, one can obtain a system of so-called normal equations, in which the unknown values ​​are the desired parameters of the regression equation, and the known coefficients are determined by the empirical values ​​of the features, usually the sums of their values ​​and their cross products.

Multiple linear regression. The relationship between several variables is usually expressed by a multiple regression equation, which can be linear or non-linear. In its simplest form, multiple regression is expressed by an equation with two independent variables (x, z):

ŷ = a + b·x + c·z  (10)

where a is the free term of the equation, and b and c are its parameters. To find the parameters of equation (10) by the least squares method, the following system of normal equations is used:

Σy = n·a + b·Σx + c·Σz
Σxy = a·Σx + b·Σx² + c·Σxz
Σzy = a·Σz + b·Σxz + c·Σz²

Rows of dynamics. Row alignment. The change of characteristics over time forms so-called time series, or series of dynamics. A characteristic feature of such series is that the time factor always acts here as the independent variable X, and the changing characteristic is the dependent variable Y. Unlike regression series, the relationship between the variables X and Y here is one-sided, since the time factor does not depend on the variability of the characteristics. Despite these features, time series can be likened to regression series and processed by the same methods.

Like regression series, empirical time series are influenced not only by the main, but also by numerous secondary (random) factors that obscure the main trend in the variability of features, which in the language of statistics is called trend.

Analysis of time series begins with identifying the shape of the trend. To do this, the time series is depicted as a line graph in a rectangular coordinate system, with time points (years, months, and other units of time) plotted along the abscissa axis and the values of the dependent variable Y along the ordinate axis. With a linear trend, the regression equation is written in the form of deviations of the terms of the series of the dependent variable Y from the arithmetic mean of the series of the independent variable X:

ŷ = ȳ + b·(x − x̄)

Here b is the linear regression parameter.

Numerical characteristics of the series of dynamics. The main generalizing numerical characteristics of series of dynamics include the geometric mean and the arithmetic mean, which is close to it. They characterize the average rate at which the value of the dependent variable changes over certain periods of time.

The variability of the terms of a dynamics series is estimated by the standard deviation. When choosing regression equations to describe a time series, the form of the trend is taken into account, which can be linear (or reduced to linear) or non-linear. The correctness of the choice of the regression equation is usually judged by the similarity of the empirically observed and calculated values of the dependent variable. More accurate in solving this problem is the method of regression analysis of variance (topic 12, p. 4).

Correlation of series of dynamics. It is often necessary to compare the dynamics of parallel time series that are related to each other by some general conditions, for example, to find out the relationship between agricultural production and the growth of livestock over a certain period of time. In such cases, the relationship between variables X and Y is characterized by correlation coefficient R xy (in the presence of a linear trend).

It is known that the trend of time series, as a rule, is obscured by fluctuations in the terms of the series of the dependent variable Y. Hence, a two-fold problem arises: measuring the relationship between compared series, without excluding the trend, and measuring the relationship between adjacent members of the same series, excluding the trend. In the first case, an indicator of the closeness of the connection between the compared series of dynamics is correlation coefficient(if the relationship is linear), in the second - autocorrelation coefficient. These indicators have different values, although they are calculated using the same formulas (see topic 11).

It is easy to see that the value of the autocorrelation coefficient is affected by the variability of the members of the series of the dependent variable: the less the members of the series deviate from the trend, the higher the autocorrelation coefficient, and vice versa.

In the presence of a correlation between factor and resultant signs, doctors often have to determine by what amount the value of one sign can change when another is changed by a unit of measurement generally accepted or established by the researcher himself.

For example, how will the body weight of schoolchildren of the 1st grade (girls or boys) change if their height increases by 1 cm. For this purpose, the regression analysis method is used.

Most often, the regression analysis method is used to develop normative scales and standards for physical development.

  1. Definition of regression. Regression is a function that allows, based on the average value of one attribute, to determine the average value of another attribute that is correlated with the first one.

    For this purpose, the regression coefficient and a number of other parameters are used. For example, you can calculate the number of colds on average at certain values ​​of the average monthly air temperature in the autumn-winter period.

  2. Definition of the regression coefficient. The regression coefficient is the absolute value by which the value of one attribute changes on average when another attribute associated with it changes by a specified unit of measurement.
  3. Regression coefficient formula.
    R_y/x = r_xy · (σ_y / σ_x)
    where R_y/x is the regression coefficient;
    r_xy is the correlation coefficient between the characteristics x and y;
    σ_y and σ_x are the standard deviations of the characteristics y and x.

    In our example r_xy = −0.96;
    σ_x = 4.6 (standard deviation of the air temperature in the autumn-winter period);
    σ_y = 8.65 (standard deviation of the number of infectious colds).
    Thus, R_y/x is the regression coefficient:
    R_y/x = −0.96 × (8.65 / 4.6) ≈ −1.8, i.e. with a decrease in the average monthly air temperature (x) by 1 degree, the average number of infectious colds (y) in the autumn-winter period will increase by 1.8 cases.

  4. Regression equation. y = M_y + R_y/x · (x − M_x)
    where y is the average value of the attribute, which should be determined when the average value of another attribute (x) changes;
    x - known average value of another feature;
    R y/x - regression coefficient;
    M x, M y - known average values ​​of features x and y.

    For example, the average number of infectious colds (y) can be determined without special measurements for any average value of the average monthly air temperature (x). So, if x = −9°, R_y/x = −1.8, M_x = −7°, M_y = 20 diseases, then y = 20 + (−1.8) × (−9 − (−7)) = 20 + 3.6 = 23.6 diseases.
    This equation is applied in the case of a straight-line relationship between two features (x and y).

  5. Purpose of the regression equation. The regression equation is used to plot the regression line. The latter allows, without special measurements, to determine any average value (y) of one attribute, if the value (x) of another attribute changes. Based on these data, a graph is built - regression line, which can be used to determine the average number of colds at any value of the average monthly temperature within the range between the calculated values ​​of the number of colds.
  6. Regression sigma (formula).
    σ_Ry/x = σ_y · √(1 − r_xy²)
    where σ_Ry/x is the sigma (standard deviation) of the regression;
    σ_y is the standard deviation of the characteristic y;
    r_xy is the correlation coefficient between the characteristics x and y.

    So, if σ_y, the standard deviation of the number of colds, is 8.65, and r_xy, the correlation coefficient between the number of colds (y) and the average monthly air temperature in the autumn-winter period (x), is −0.96, then σ_Ry/x = 8.65 × √(1 − (−0.96)²) ≈ 2.42.

  7. Purpose of sigma regression. Gives a characteristic of the measure of the diversity of the resulting feature (y).

    For example, it characterizes the diversity of the number of colds at a certain value of the average monthly air temperature in the autumn-winter period. So, the average number of colds at air temperature x 1 \u003d -6 ° can range from 15.78 diseases to 20.62 diseases.
    At x 2 = -9°, the average number of colds can range from 21.18 diseases to 26.02 diseases, etc.

    The regression sigma is used in the construction of a regression scale, which reflects the deviation of the values ​​of the effective attribute from its average value plotted on the regression line.

  8. Data required to calculate and plot the regression scale
    • regression coefficient, R_y/x;
    • regression equation, y = M_y + R_y/x · (x − M_x);
    • regression sigma, σ_Ry/x.
  9. The sequence of calculations and graphic representation of the regression scale.
    • determine the regression coefficient by the formula (see paragraph 3). For example, one should determine how much the body weight will change on average (at a certain age depending on gender) if the average height changes by 1 cm.
    • according to the formula of the regression equation (see paragraph 4), determine the average value, for example, of body weight (y1, y2, y3 …)* for certain values of height (x1, x2, x3 …).
      ________________
      * The value of "y" should be calculated for at least three known values ​​of "x".

      At the same time, the average values of body weight and height (M_x and M_y) for the given age and sex are known.

    • calculate the sigma of the regression, knowing the corresponding values ​​of σ y and r xy and substituting their values ​​into the formula (see paragraph 6).
    • based on the known values x1, x2, x3 and their corresponding average values y1, y2, y3, as well as the smallest (y − σ_Ry/x) and largest (y + σ_Ry/x) values of y, construct the regression scale.

      For a graphical representation of the regression scale, the values x1, x2, x3 are first marked on the abscissa and the corresponding values y1, y2, y3 on the ordinate; i.e., a regression line is built, for example, for the dependence of body weight (y) on height (x).

      Then, at the corresponding points y1, y2, y3, the numerical values of the regression sigma are marked, i.e. the smallest and largest values of y1, y2, y3 are found on the graph.

  10. Practical use of the regression scale. Normative scales and standards are developed, in particular for physical development. Using the standard scale, one can give an individual assessment of children's development. Physical development is assessed as harmonious if, for example, at a certain height the child's body weight is within one regression sigma of the calculated average body weight (y) for the given height (x), i.e. within (y ± 1·σ_Ry/x).

    Physical development is considered disharmonious in terms of body weight if the child's body weight for a certain height is within the second regression sigma: (y ± 2 σ Ry/x)

    Physical development will be sharply disharmonious both due to excess and insufficient body weight if the body weight for a certain height is within the third sigma of the regression (y ± 3 σ Ry/x).

According to the results of a statistical study of the physical development of 5-year-old boys, it is known that their average height (x) is 109 cm, and their average body weight (y) is 19 kg. The correlation coefficient between height and body weight is +0.9, standard deviations are presented in the table.

Required:

  • calculate the regression coefficient;
  • using the regression equation, determine what the expected body weight of 5-year-old boys will be with a height equal to x1 = 100 cm, x2 = 110 cm, x3 = 120 cm;
  • calculate the regression sigma, build a regression scale, present the results of its solution graphically;
  • draw the appropriate conclusions.

The condition of the problem and the results of its solution are presented in the summary table.

Table 1

Conditions of the problem:
  Height (x):      M = 109 cm, σ = ±4.4 cm
  Body weight (y): M = 19 kg,  σ = ±0.8 kg
  r_xy = +0.9

Results of the solution:
  Regression coefficient R_y/x = 0.16; regression sigma σ_Ry/x = ±0.35 kg.

  Regression scale (expected body weight, in kg):
  X        Y          y − σ_Ry/x   y + σ_Ry/x
  100 cm   17.56 kg   17.21 kg     17.91 kg
  110 cm   19.16 kg   18.81 kg     19.51 kg
  120 cm   20.76 kg   20.41 kg     21.11 kg

Solution.

Conclusion. Thus, the regression scale, within the calculated values of body weight, makes it possible to determine the expected body weight for any other value of height, or to assess a child's individual development. To do this, a perpendicular is erected from the given height value to the regression line.
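Table 1 is easy to reproduce programmatically. A minimal sketch in Python, assuming only the formulas of paragraphs 3, 4, and 6 above:

    from math import sqrt

    Mx, My = 109.0, 19.0   # mean height (cm) and mean body weight (kg)
    sx, sy = 4.4, 0.8      # standard deviations
    r = 0.9                # correlation between height and weight

    R = round(r * sy / sx, 2)                   # regression coefficient, 0.16
    sigma_R = round(sy * sqrt(1 - r ** 2), 2)   # regression sigma, 0.35

    for x in (100, 110, 120):
        y = My + R * (x - Mx)                   # expected body weight
        print(f"x = {x} cm: y = {y:.2f} kg, "
              f"range {y - sigma_R:.2f} .. {y + sigma_R:.2f} kg")
    # x = 100 cm: y = 17.56 kg, range 17.21 .. 17.91 - as in Table 1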


In statistical modeling, regression analysis is a study used to evaluate the relationship between variables. This mathematical method includes many other methods for modeling and analyzing multiple variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps you understand how the typical value of the dependent variable changes if one of the independent variables changes while the other independent variables remain fixed.

In all cases, the estimation target is a function of the independent variables, called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

Tasks of regression analysis

This statistical research method is widely used for forecasting, where it offers significant advantages, but it can sometimes produce illusory or false relationships, so it is recommended to apply it with care; for example, correlation does not imply causation.

A large number of methods have been developed for performing regression analysis, such as linear and ordinary least squares regression, which are parametric. Their essence is that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression allows its function to lie in a certain set of functions, which can be infinite-dimensional.

As a statistical research method, regression analysis in practice depends on the form of the data-generating process and on how it relates to the regression approach. Since the true form of the data-generating process is typically unknown, regression analysis of data often depends to some extent on assumptions about this process. These assumptions are sometimes testable if enough data are available. Regression models are often useful even when the assumptions are moderately violated, although they may not perform at their best.

In a narrower sense, regression can refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. The case of a continuous output variable is also called metric regression to distinguish it from related problems.

History

The earliest form of regression is the well-known method of least squares. It was published by Legendre in 1805 and Gauss in 1809. Legendre and Gauss applied the method to the problem of determining from astronomical observations the orbits of bodies around the Sun (mainly comets, but later also newly discovered minor planets). Gauss published a further development of the theory of least squares in 1821, including a variant of the Gauss-Markov theorem.

The term "regression" was coined by Francis Galton in the 19th century to describe a biological phenomenon. The bottom line was that the growth of descendants from the growth of ancestors, as a rule, regresses down to the normal average. For Galton, regression had only this biological meaning, but later his work was taken up by Udni Yoley and Karl Pearson and taken to a more general statistical context. In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is considered to be Gaussian. This assumption was rejected by Fischer in the papers of 1922 and 1925. Fisher suggested that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this regard, Fisher's suggestion is closer to Gauss's 1821 formulation. Prior to 1970, it sometimes took up to 24 hours to get the result of a regression analysis.

Regression analysis methods continue to be an area of ​​active research. In recent decades, new methods have been developed for robust regression; regressions involving correlated responses; regression methods that accommodate various types of missing data; nonparametric regression; Bayesian regression methods; regressions in which predictor variables are measured with error; regressions with more predictors than observations; and causal inferences with regression.

Regression Models

Regression analysis models include the following variables:

  • Unknown parameters, denoted as beta, which can be a scalar or a vector.
  • Independent variables, X.
  • Dependent variables, Y.

In different areas of science where regression analysis is applied, different terms are used instead of dependent and independent variables, but in all cases the regression model relates Y to a function of X and β.

The approximation is usually formalized as E(Y | X) = f(X, β). To perform regression analysis, the form of the function f must be determined. Sometimes the form of this function is based on knowledge about the relationship between Y and X that does not rely on the data. If such knowledge is not available, a flexible or convenient form for f is chosen.

Dependent variable Y

Let us now assume that the vector of unknown parameters β has length k. To perform a regression analysis, the user must provide information about the dependent variable Y:

  • If N data points of the form (Y, X) are observed, where N < k, most classical approaches to regression analysis cannot be performed, since the system of equations defining the regression model is underdetermined and there is not enough data to recover β.
  • If exactly N = k points are observed and the function f is linear, then the equation Y = f(X, β) can be solved exactly rather than approximately. This reduces to solving a set of N equations with N unknowns (the elements of β), which has a unique solution as long as the X values are linearly independent. If f is non-linear, a solution may not exist, or there may be many solutions.
  • The most common situation is N > k data points. In this case, there is enough information in the data to estimate a unique value for β that best fits the data, and the regression model applied to the data can be seen as an overdetermined system in β.

In the latter case, regression analysis provides tools for:

  • Finding a solution for unknown parameters β, which will, for example, minimize the distance between the measured and predicted value of Y.
  • Under certain statistical assumptions, regression analysis uses excess information to provide statistical information about the unknown parameters β and the predicted values ​​of the dependent variable Y.

Required number of independent measurements

Consider a regression model that has three unknown parameters: β0, β1, and β2. Suppose the experimenter makes 10 measurements, all at the same value of the independent variable vector X. In this case, regression analysis does not give a unique set of estimates: the best one can do is estimate the mean and standard deviation of the dependent variable Y. Similarly, by measuring at two different values of X, one can get enough data for a regression with two unknowns, but not for three or more unknowns.

If the experimenter's measurements were taken at three different values ​​of the independent vector variable X, then the regression analysis would provide a unique set of estimates for the three unknown parameters in β.

In the case of general linear regression, the above statement is equivalent to the requirement that the matrix X T X is invertible.
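A small numeric illustration of this requirement (Python with NumPy assumed): with three unknown parameters (a quadratic in x), measurements taken at a single value of x leave XᵀX singular, while three distinct x values make it invertible.

    import numpy as np

    def xtx_rank(xs):
        # Design matrix for beta0 + beta1*x + beta2*x^2.
        X = np.column_stack([np.ones(len(xs)), xs, np.asarray(xs) ** 2])
        return np.linalg.matrix_rank(X.T @ X)

    print(xtx_rank([2.0] * 10))        # rank 1: parameters not identifiable
    print(xtx_rank([1.0, 2.0, 3.0]))   # rank 3: X^T X is invertible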

Statistical Assumptions

When the number of measurements N is greater than the number of unknown parameters k, and the measurement errors ε_i are normally distributed, then, as a rule, the excess information contained in the measurements is used for statistical predictions regarding the unknown parameters. This excess of information is called the degrees of freedom of the regression.

Underlying Assumptions

Classic assumptions for regression analysis include:

  • The sample is representative of the population for which the inference and prediction are made.
  • The error is a random variable with a mean of zero conditional on the explanatory variables.
  • The independent variables are measured without error.
  • The independent variables (predictors) are linearly independent, that is, no predictor can be expressed as a linear combination of the others.
  • The errors are uncorrelated, that is, the error covariance matrix is diagonal, and each non-zero element is the variance of the error.
  • The error variance is constant across observations (homoscedasticity). If not, weighted least squares or other methods can be used.

These are sufficient conditions for the least squares estimator to possess the required properties; in particular, these assumptions imply that the parameter estimates will be unbiased, consistent, and efficient, particularly in the class of linear unbiased estimates. It is important to note that real data rarely satisfy all the conditions; the method is used even when the assumptions are not exactly correct. Deviation from the assumptions can sometimes be used as a measure of how useful the model is. Many of these assumptions can be relaxed in more advanced methods. Reports of statistical analyses typically include analyses of tests on the sample data and of the methodology, for the fit and usefulness of the model.

In addition, variables in some cases refer to values measured at point locations. There may be spatial trends and spatial autocorrelation in the variables that violate the statistical assumptions. Geographically weighted regression is one method that deals with such data.

The defining feature of linear regression is that the dependent variable Y_i is a linear combination of the parameters. For example, simple linear regression uses one independent variable, x_i, and two parameters, β0 and β1, to model n points: y_i = β0 + β1·x_i + ε_i.

In multiple linear regression, there are several independent variables or their functions.

When a random sample is taken from a population, estimating the parameters from it yields the sample linear regression model.

In this aspect, the least squares method is the most popular. It provides parameter estimates that minimize the sum of the squared residuals. This kind of minimization (typical of linear regression) leads to a set of normal equations, a system of linear equations in the parameters, which is solved to obtain the parameter estimates.

Assuming further that the population error is normally distributed, the researcher can use these estimated standard errors to create confidence intervals and conduct hypothesis tests about the parameters.

Nonlinear Regression Analysis

When the function is not linear in the parameters, the sum of squares must be minimized by an iterative procedure. This introduces many complications, which define the differences between the linear and non-linear least squares methods. Consequently, the results of regression analysis using a non-linear method are sometimes unpredictable.

Calculation of power and sample size

Here, as a rule, there are no universally agreed methods relating the number of observations to the number of independent variables in the model. One rule of thumb was proposed by Good and Hardin and looks like N = m^n, where N is the sample size, n is the number of explanatory variables, and m is the number of observations needed to achieve the desired accuracy if the model had only one explanatory variable. For example, suppose a researcher builds a linear regression model using a dataset that contains 1000 patients (N). If the researcher decides that five observations are needed to accurately determine a line (m = 5), then the maximum number of explanatory variables the model can support is 4, since log(1000) / log(5) ≈ 4.29.

Other Methods

Although the parameters of a regression model are usually estimated using the least squares method, there are other methods that are used much less often. For example, these are the following methods:

  • Bayesian methods (for example, Bayesian linear regression).
  • Percentage regression, used in situations where reducing percentage errors is considered more appropriate.
  • Least absolute deviations, which is more robust in the presence of outliers and leads to quantile regression.
  • Nonparametric regression, which requires a large number of observations and calculations.
  • Distance metric learning, in which a meaningful distance metric is sought for a given input space.

Software

All major statistical software packages can perform least squares regression analysis. Simple linear regression and multiple regression analysis can be performed in some spreadsheet applications, as well as on some calculators. While many statistical software packages can perform various types of nonparametric and robust regression, these methods are less standardized; different software packages implement different methods. Specialized regression software has been developed for use in areas such as survey analysis and neuroimaging.