
Least squares solution. LSM in the case of a linear model

Example.

Experimental data on the values of the variables x and y are given in the table.

As a result of smoothing these data, a second approximating function was obtained.

Using the least squares method, approximate these data with a linear dependence y=ax+b (find the parameters a and b). Find out which of the two lines better (in the sense of the least squares method) fits the experimental data. Make a drawing.

The essence of the least squares method (LSM).

The problem is to find the coefficients of the linear dependence at which the function of two variables a and b takes its smallest value, that is, for which the sum of the squared deviations of the experimental data from the found straight line is smallest. This is the whole point of the least squares method.

Thus, solving the example reduces to finding the extremum of a function of two variables.

Derivation of formulas for finding coefficients.

A system of two equations with two unknowns is compiled and solved: taking the partial derivatives of the function with respect to the variables a and b, we set these derivatives equal to zero.

We solve the resulting system of equations by any method (for example, substitution or Cramer's rule) and obtain formulas for finding the coefficients by the least squares method (LSM).

For these values of a and b, the function takes its smallest value. A proof of this fact is given below, at the end of the page.

That is the whole least squares method. The formula for the parameter a contains the sums Σxi, Σyi, Σxi·yi, Σxi^2 and the parameter n, the number of experimental data points:

a = (n·Σxi·yi − Σxi·Σyi) / (n·Σxi^2 − (Σxi)^2)

The values of these sums are best calculated separately. The coefficient b is found after a:

b = (Σyi − a·Σxi) / n
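These closed-form expressions can be sketched in a few lines of code (a Python illustration with made-up data, not the article's table):

```python
def least_squares_line(xs, ys):
    """Fit y = a*x + b by the least squares method.

    Implements the standard closed-form solution of the normal
    equations described above; the sums are computed separately,
    as the text recommends.
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Points lying exactly on y = 2x + 1 recover a = 2, b = 1:
print(least_squares_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0)
```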

It's time to remember the original example.

Solution.

In our example, n=5. We fill in a table for the convenience of calculating the sums that enter the formulas for the required coefficients.

The values in the fourth row of the table are obtained by multiplying the values of the 2nd row by the values of the 3rd row for each index i.

The values in the fifth row of the table are obtained by squaring the values of the 2nd row for each index i.

The values in the last column of the table are the sums of the values across the rows.

We use the formulas of the least squares method to find the coefficients a and b, substituting into them the corresponding values from the last column of the table:

Hence, y=0.165x+2.184 is the desired approximating straight line.

It remains to find out which of the two lines, y=0.165x+2.184 or the second candidate, better approximates the original data, i.e., to make an estimate using the least squares method.

Estimating the error of the least squares method.

To do this, calculate the sums of squared deviations of the original data from each of the two lines; the smaller value corresponds to the line that better approximates the original data in the sense of the least squares method.

Since the sum for the line y=0.165x+2.184 turns out to be smaller, this line approximates the original data better.
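This comparison can be sketched directly (the data and the alternative line below are made up for illustration; only y=0.165x+2.184 comes from the article):

```python
def sse(xs, ys, a, b):
    """Sum of squared deviations of the data from the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data; the article's own table is not reproduced here.
xs = [1, 2, 3, 4, 5]
ys = [2.4, 2.5, 2.7, 2.8, 3.0]

err1 = sse(xs, ys, 0.165, 2.184)  # the line found in the example
err2 = sse(xs, ys, 0.1, 2.4)      # some alternative line
print(err1 < err2)  # True: the line with the smaller sum fits better
```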

Graphical illustration of the least squares method (LSM).

Everything is clearly visible on the chart: the red line is the found line y=0.165x+2.184, the blue line is the second candidate, and the pink dots are the original data.

In practice, when modeling various processes (in particular, economic, physical, technical, and social ones), one or another method of calculating approximate values of functions from their known values at certain fixed points is widely used.

Problems of approximation of functions of this kind often arise:

    when constructing approximate formulas for calculating the values of the characteristic quantities of the process under study from tabular data obtained as a result of an experiment;

    in numerical integration and differentiation, in solving differential equations, etc.;

    when it is necessary to calculate the values of functions at intermediate points of the interval under consideration;

    when determining the values of the characteristic quantities of the process outside the interval under consideration, in particular, when forecasting.

If, in order to model a certain process given by a table, a function is constructed that approximately describes this process on the basis of the least squares method, it is called an approximating function (a regression), and the task of constructing approximating functions itself is called an approximation problem.

This article discusses the capabilities of the MS Excel package for solving such problems; in addition, it presents methods and techniques for constructing (creating) regressions for tabulated functions (which is the basis of regression analysis).

There are two options for building regressions in Excel.

    adding selected regressions (trendlines) to a chart built from the data table for the studied process characteristic (available only if a chart has been built);

    using the built-in statistical functions of the Excel worksheet, which allow regressions (trend lines) to be obtained directly from the source data table.

Adding Trendlines to a Chart

For a table of data describing a certain process and represented by a diagram, Excel has an effective regression analysis tool that allows you to:

    build, based on the least squares method, and add to the diagram five types of regression which model the process under study with varying degrees of accuracy;

    add an equation of the constructed regression to the diagram;

    determine the degree of compliance of the selected regression with the data displayed on the chart.

Based on the chart data, Excel allows you to obtain linear, polynomial, logarithmic, power, and exponential types of regression, which are given by the equation:

y = y(x)

where x is the independent variable, which often takes the values of a sequence of natural numbers (1; 2; 3; ...) and represents, for example, a count of the time periods of the process under study (the characteristic).

1. Linear regression is good at modeling characteristics that increase or decrease at a constant rate. It is the simplest model of the process under study. It is built according to the equation:

y=mx+b

where m is the tangent of the angle of inclination of the linear regression to the x-axis, and b is the coordinate of the point of intersection of the linear regression with the y-axis.

2. A polynomial trendline is useful for describing characteristics that have several distinct extrema (highs and lows). The choice of the degree of the polynomial is determined by the number of extrema of the characteristic under study. Thus, a polynomial of the second degree can describe well a process that has only one maximum or minimum; a polynomial of the third degree, no more than two extrema; a polynomial of the fourth degree, no more than three extrema, and so on.

In this case, the trend line is built in accordance with the equation:

y = c0 + c1x + c2x^2 + c3x^3 + c4x^4 + c5x^5 + c6x^6

where the coefficients c0, c1, c2, ..., c6 are constants whose values are determined during construction.
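Evaluating such a polynomial trend equation at a given x can be sketched with Horner's scheme (the coefficients here are illustrative, not fitted to any data):

```python
def poly_trend(coeffs, x):
    """Evaluate y = c0 + c1*x + ... + cN*x**N by Horner's scheme.

    coeffs are listed from c0 upward, matching the equation above.
    """
    y = 0.0
    for c in reversed(coeffs):
        y = y * x + c
    return y

# y = 1 + 2x + 3x^2 at x = 2 gives 1 + 4 + 12 = 17:
print(poly_trend([1, 2, 3], 2))  # 17.0
```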

3. The logarithmic trend line is successfully used to model characteristics whose values change rapidly at first and then gradually stabilize. It is built according to the equation:

y = c ln(x) + b

4. The power trend line gives good results if the values of the studied dependence are characterized by a constant change in the growth rate. An example of such a dependence is the graph of the uniformly accelerated motion of a car. If the data contain zero or negative values, a power trend line cannot be used.

It is built in accordance with the equation:

y = cx^b

where the coefficients b, c are constants.

5. An exponential trendline should be used if the rate of change in the data is continuously increasing. This kind of approximation is likewise not applicable to data containing zero or negative values.

It is built in accordance with the equation:

y = ce^(bx)

where the coefficients b, c are constants.

When selecting a trend line, Excel automatically calculates the value of R2, which characterizes the accuracy of the approximation: the closer the R2 value is to one, the more reliably the trend line approximates the process under study. If necessary, the value of R2 can always be displayed on the diagram.

It is determined by the formula R2 = 1 - SSres/SStot, where SSres is the sum of squared deviations of the data from the trend line and SStot is the sum of squared deviations of the data from their mean.
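R2 can also be computed by hand from its usual definition (the standard 1 - SSres/SStot form is assumed here, since the article's formula image is not reproduced):

```python
def r_squared(ys, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# A perfect fit gives exactly 1; a worse fit gives a smaller value:
print(r_squared([1, 2, 3], [1, 2, 3]))  # 1.0
print(r_squared([1, 2, 3], [1, 2, 4]))  # 0.5
```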

To add a trend line to a data series:

    activate the chart built on the basis of the data series, i.e., click within the chart area. The Chart item will appear in the main menu;

    after clicking on this item, a menu will appear on the screen, in which you should select the Add trend line command.

The same actions are easily implemented if you hover over the graph corresponding to one of the data series and right-click; in the context menu that appears, select the Add trend line command. The Trendline dialog box will appear on the screen with the Type tab opened (Fig. 1).

After that you need:

On the Type tab, select the required trend line type (Linear is selected by default). For the Polynomial type, in the Degree field, specify the degree of the selected polynomial.

The Built on Series field lists all the data series of the chart in question. To add a trendline to a specific data series, select its name in the Built on series field.

If necessary, by going to the Parameters tab (Fig. 2), you can set the following parameters for the trend line:

    change the name of the trend line in the Name of the approximating (smoothed) curve field.

    set the number of periods (forward or backward) for the forecast in the Forecast field;

    display the equation of the trend line in the chart area, for which you should enable the checkbox show the equation on the chart;

    display the value of the approximation reliability R2 in the diagram area, for which you should enable the checkbox put the value of the approximation reliability (R^2) on the diagram;

    set the point of intersection of the trend line with the Y-axis, for which you should enable the checkbox Intersection of the curve with the Y-axis at a point;

    click the OK button to close the dialog box.

There are three ways to start editing an already built trend line:

    use the Selected trend line command from the Format menu, after selecting the trend line;

    select the Format Trendline command from the context menu, which is called by right-clicking on the trendline;

    by double clicking on the trend line.

The Format Trendline dialog box will appear on the screen (Fig. 3), containing three tabs: View, Type, Parameters, and the contents of the last two completely coincide with the similar tabs of the Trendline dialog box (Fig. 1-2). On the View tab, you can set the line type, its color and thickness.

To delete an already constructed trend line, select the trend line to be deleted and press the Delete key.

The advantages of the considered regression analysis tool are:

    the relative ease of plotting a trend line on charts without creating a data table for it;

    a fairly wide list of types of proposed trend lines, and this list includes the most commonly used types of regression;

    the possibility of predicting the behavior of the process under study for an arbitrary (within reason) number of steps forward as well as backward;

    the possibility of obtaining the equation of the trend line in an analytical form;

    the possibility, if necessary, of obtaining an assessment of the reliability of the approximation.

The disadvantages include the following points:

    the construction of a trend line is carried out only if there is a chart built on a series of data;

    the process of generating data series for the characteristic under study from the trend line equations obtained for it is somewhat cluttered: the regression equations are updated with each change in the values of the original data series, but only within the chart area, while a data series formed from an old trend line equation remains unchanged;

    In PivotChart reports, when you change the chart view or the associated PivotTable report, existing trendlines are not preserved, so you must ensure that the layout of the report meets your requirements before you draw trendlines or otherwise format the PivotChart report.

Trend lines can be added to data series presented on charts such as line, column, flat non-stacked area, bar, scatter, bubble, and stock charts.

You cannot add trendlines to data series on 3-D, stacked, radar, pie, and doughnut charts.

Using Built-in Excel Functions

Excel also provides a regression analysis tool for plotting trendlines outside the chart area. A number of statistical worksheet functions can be used for this purpose, but all of them allow you to build only linear or exponential regressions.

Excel has several functions for building linear regression, in particular:

    TREND;

    LINEST;

    SLOPE and INTERCEPT.

As well as several functions for constructing an exponential trend line, in particular:

    GROWTH;

    LOGEST (referred to as LGRFPRIBL in Russian-language versions of Excel).

It should be noted that the techniques for constructing regressions using the TREND and GROWTH functions are practically the same. The same can be said about the pair of functions LINEST and LGRFPRIBL. For these four functions, creating a table of values uses Excel's array formulas, which somewhat clutters the process of building regressions. We also note that a linear regression is, in our opinion, easiest to build using the SLOPE and INTERCEPT functions, where the first determines the slope of the linear regression and the second the intercept it cuts off on the y-axis.

The advantages of the built-in functions tool for regression analysis are:

    a fairly simple process of the same type of formation of data series of the characteristic under study for all built-in statistical functions that set trend lines;

    a standard technique for constructing trend lines based on the generated data series;

    the possibility of predicting the behavior of the process under study for the required number of steps forward or backward.

The disadvantages include the fact that Excel has no built-in functions for creating other types of trend lines (apart from linear and exponential). This circumstance often prevents choosing a sufficiently accurate model of the process under study, as well as obtaining forecasts close to reality. In addition, when the TREND and GROWTH functions are used, the equations of the trend lines are not known.

It should be noted that the authors did not set the goal of the article to present the course of regression analysis with varying degrees of completeness. Its main task is to show the capabilities of the Excel package in solving approximation problems using specific examples; demonstrate what effective tools Excel has for building regressions and forecasting; illustrate how relatively easily such problems can be solved even by a user who does not have deep knowledge of regression analysis.

Examples of solving specific problems

Consider the solution of specific problems using the listed tools of the Excel package.

Task 1

Given a table of data on the profit of a motor transport enterprise for 1995-2002, you need to do the following.

    Build a chart.

    Add linear and polynomial (quadratic and cubic) trend lines to the chart.

    Using the trend line equations, obtain tabular data on the profit of the enterprise for each trend line for 1995-2004.

    Make a profit forecast for the enterprise for 2003 and 2004.

The solution of the problem

    In the range of cells A4:C11 of the Excel worksheet, we enter the worksheet shown in Fig. 4.

    Having selected the range of cells B4:C11, we build a chart.

    We activate the constructed chart and, according to the method described above, after selecting the type of trend line in the Trend Line dialog box (see Fig. 1), we alternately add linear, quadratic and cubic trend lines to the chart. In the same dialog box, open the Parameters tab (see Fig. 2), in the Name of the approximating (smoothed) curve field, enter the name of the added trend, and in the Forecast forward for: periods field, set the value 2, since it is planned to make a profit forecast for two years ahead. To display the regression equation and the approximation reliability value R2 in the diagram area, enable the checkboxes Show the equation on the screen and place the approximation reliability value (R^2) on the diagram. For better visual perception, we change the type, color and thickness of the plotted trend lines, for which we use the View tab of the Trend Line Format dialog box (see Fig. 3). The resulting chart with added trend lines is shown in fig. 5.

    To obtain tabular data on the profit of the enterprise for each trend line for 1995-2004, we use the equations of the trend lines presented in fig. 5. To do this, in the cells of the range D3:F3, enter textual labels for the selected trend line types: Linear trend, Quadratic trend, Cubic trend. Next, enter the linear regression formula in cell D4 and, using the fill marker, copy this formula with relative references to the range of cells D5:D13. Note that each cell with a linear regression formula in the range D4:D13 has the corresponding cell from the range A4:A13 as its argument. Similarly, the range E4:E13 is filled for the quadratic regression and the range F4:F13 for the cubic regression. Thus, a profit forecast for 2003 and 2004 was made with all three trends. The resulting table of values is shown in fig. 6.

Task 2

    Build a chart.

    Add logarithmic, power, and exponential trend lines to the chart.

    Derive the equations of the obtained trend lines, as well as the values ​​of the approximation reliability R2 for each of them.

    Using the trend line equations, obtain tabular data on the profit of the enterprise for each trend line for 1995-2002.

    Make a profit forecast for the business for 2003 and 2004 using these trend lines.

The solution of the problem

Following the methodology given in the solution of task 1, we obtain a diagram with added logarithmic, power, and exponential trend lines (Fig. 7). Then, using the obtained trend line equations, we fill in the table of profit values, including the predicted values for 2003 and 2004 (Fig. 8).

From fig. 5 and the chart obtained in this task, it can be seen that the lowest value of the approximation reliability corresponds to the model with a logarithmic trend:

R2 = 0.8659

The highest values ​​of R2 correspond to models with a polynomial trend: quadratic (R2 = 0.9263) and cubic (R2 = 0.933).

Task 3

With a table of data on the profit of a motor transport enterprise for 1995-2002, given in task 1, you must perform the following steps.

    Get data series for the linear and exponential trendlines using the TREND and GROWTH functions.

    Using the TREND and GROWTH functions, make a profit forecast for the enterprise for 2003 and 2004.

    For the initial data and the received data series, construct a diagram.

The solution of the problem

Let's use the worksheet of task 1 (see Fig. 4). We start with the TREND function:

    select the range of cells D4:D11, which should be filled with the values ​​of the TREND function corresponding to the known data on the profit of the enterprise;

    call the Function command from the Insert menu. In the Function Wizard dialog box that appears, select the TREND function from the Statistical category, and then click the OK button. The same operation can be performed by pressing the Insert Function button on the standard toolbar;

    In the Function Arguments dialog box that appears, enter the range of cells C4:C11 in the Known_values_y field; in the Known_values_x field - the range of cells B4:B11;

    to make the entered formula an array formula, use the key combination Ctrl + Shift + Enter.

The formula entered in the formula bar will look like: {=TREND(C4:C11;B4:B11)}.

As a result, the range of cells D4:D11 is filled with the corresponding values ​​of the TREND function (Fig. 9).

To forecast the company's profit for 2003 and 2004, it is necessary to:

    select the range of cells D12:D13, where the values ​​predicted by the TREND function will be entered.

    call the TREND function and in the Function Arguments dialog box that appears, enter in the Known_values_y field - the range of cells C4:C11; in the Known_values_x field - the range of cells B4:B11; and in the field New_values_x - the range of cells B12:B13.

    turn this formula into an array formula using the keyboard shortcut Ctrl + Shift + Enter.

    The entered formula will look like: {=TREND(C4:C11;B4:B11;B12:B13)}, and the range of cells D12:D13 will be filled with the predicted values of the TREND function (see Fig. 9).
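Outside Excel, the effect of TREND with a New_values_x argument can be imitated in a few lines (a rough Python analogue of what the function computes, not Excel's actual implementation):

```python
def trend(known_y, known_x, new_x):
    """Rough analogue of Excel's TREND with New_values_x: fit
    y = m*x + b to the known points by least squares, then
    evaluate the fitted line at the new x values."""
    n = len(known_x)
    sx, sy = sum(known_x), sum(known_y)
    sxy = sum(x * y for x, y in zip(known_x, known_y))
    sxx = sum(x * x for x in known_x)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return [m * x + b for x in new_x]

# Two-step forecast for data lying exactly on y = 3x:
print(trend([3, 6, 9], [1, 2, 3], [4, 5]))  # [12.0, 15.0]
```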

Similarly, a data series is filled using the GROWTH function, which is used in the analysis of nonlinear dependencies and works exactly the same as its linear counterpart TREND.

Figure 10 shows the table in formula display mode.

For the initial data and the obtained data series, the diagram shown in fig. 11 is constructed.

Task 4

With a table of data on the receipt of applications for services by the dispatching service of a motor transport enterprise for the period from the 1st to the 11th day of the current month, the following actions must be performed.

    Obtain data series for linear regression: using the SLOPE and INTERCEPT functions; using the LINEST function.

    Retrieve a data series for the exponential regression using the LGRFPRIBL function.

    Using the above functions, make a forecast about the receipt of applications to the dispatch service for the period from the 12th to the 14th day of the current month.

    For the original and received data series, construct a diagram.

The solution of the problem

Note that, unlike the TREND and GROWTH functions, none of the functions listed above (SLOPE, INTERCEPT, LINEST, LGRFPRIBL) return the regression series itself. These functions play only an auxiliary role, determining the necessary regression parameters.

For linear and exponential regressions built using the SLOPE, INTERCEPT, LINEST, and LGRFPRIBL functions, the form of the equations is always known, in contrast to the linear and exponential regressions corresponding to the TREND and GROWTH functions.

1 . Let's build a linear regression that has the equation:

y=mx+b

using the SLOPE and INTERCEPT functions, with the slope of the regression m being determined by the SLOPE function, and the constant term b - by the INTERCEPT function.

To do this, we perform the following actions:

    enter the source table in the range of cells A4:B14;

    the value of the parameter m will be determined in cell C19. Select from the Statistical category the Slope function; enter the range of cells B4:B14 in the known_values_y field and the range of cells A4:A14 in the known_values_x field. The formula will be entered into cell C19: =SLOPE(B4:B14;A4:A14);

    the value of the parameter b in cell D19 is determined in a similar way, and its content will look like this: =INTERCEPT(B4:B14;A4:A14). Thus, the values of the parameters m and b needed to construct the linear regression are stored in cells C19 and D19, respectively;

    then we enter the linear regression formula into cell C4 in the form: =$C$19*A4+$D$19. In this formula, cells C19 and D19 are written with absolute references (the cell addresses must not change when the formula is copied). The absolute reference sign $ can be typed either from the keyboard or using the F4 key after placing the cursor on the cell address. Using the fill handle, copy this formula to the range of cells C4:C17 to obtain the desired data series (Fig. 12). Because the number of requests is an integer, you should set the number format on the Number tab of the Format Cells window with 0 decimal places.

2. Now let's build a linear regression given by the equation:

y=mx+b

using the LINEST function.

For this:

    enter the LINEST function as an array formula into the range of cells C20:D20: {=LINEST(B4:B14;A4:A14)}. As a result, we get the value of the parameter m in cell C20 and the value of the parameter b in cell D20;

    enter the formula =$C$20*A4+$D$20 in cell D4;

    copy this formula using the fill marker to the range of cells D4:D17 and get the desired data series.

3. We build an exponential regression with the equation:

y = bm^x

Using the LGRFPRIBL function, the construction is similar:

    in the range of cells C21:D21, enter the LGRFPRIBL function as an array formula: {=LGRFPRIBL(B4:B14;A4:A14)}. The value of the parameter m will be placed in cell C21 and the value of the parameter b in cell D21;

    the formula =$D$21*$C$21^A4 is entered into cell E4;

    using the fill marker, this formula is copied to the range of cells E4:E17, where the data series for exponential regression will be located (see Fig. 12).
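The exponential fit that LGRFPRIBL (LOGEST) performs can be sketched via the classic log-transform (this textbook approach is assumed here to match Excel's behavior; the data are illustrative):

```python
import math

def logest(known_y, known_x):
    """Sketch of the exponential fit y = b * m**x.

    Taking logarithms turns this into an ordinary linear least
    squares problem: ln(y) = ln(b) + x*ln(m). Requires strictly
    positive y values, as the article notes for exponential trends.
    """
    ly = [math.log(y) for y in known_y]
    n = len(known_x)
    sx, sy = sum(known_x), sum(ly)
    sxy = sum(x * y for x, y in zip(known_x, ly))
    sxx = sum(x * x for x in known_x)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return math.exp(slope), math.exp(intercept)  # m, b

# Data lying exactly on y = 5 * 2**x recover m = 2, b = 5:
m, b = logest([10, 20, 40], [1, 2, 3])
print(round(m, 6), round(b, 6))
```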

Fig. 13 shows the table in formula display mode, where the functions used, the necessary cell ranges, and the formulas can be seen.

The value R2 is called the coefficient of determination.

The task of constructing a regression dependence is to find the vector of coefficients m of model (1) for which the coefficient R takes its maximum value.

To assess the significance of R, Fisher's F-test is used, calculated by the formula

where n is the sample size (number of experiments);

k is the number of model coefficients.

If F exceeds some critical value for the data n and k and the accepted confidence level, then the value of R is considered significant. Tables of critical values ​​of F are given in reference books on mathematical statistics.

Thus, the significance of R is determined not only by its value but also by the ratio between the number of experiments and the number of coefficients (parameters) of the model. Indeed, the correlation ratio for n=2 for a simple linear model equals 1: through 2 points on the plane you can always draw a single straight line. However, if the experimental data are random variables, such a value of R should be trusted with great care. Usually, to obtain a significant R and a reliable regression, one aims to have the number of experiments significantly exceed the number of model coefficients (n>k).
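A numerical sketch of the significance test (the article's formula image is missing, so a common form of the F-statistic is assumed here):

```python
def f_statistic(r2, n, k):
    """F-statistic for testing the significance of R^2.

    Assumed form (one common convention, since the source formula
    is not reproduced):
        F = (R^2 / (k - 1)) / ((1 - R^2) / (n - k)),
    with n the number of experiments and k the number of model
    coefficients (including the intercept).
    """
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))

# A straight line (k = 2) fitted to 10 points with R^2 = 0.9:
print(f_statistic(0.9, 10, 2))  # approximately 72
```

The resulting value would then be compared against the tabulated critical value of F for the chosen confidence level, as the text describes.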

To build a linear regression model, you must:

1) prepare a list of n rows and m columns containing the experimental data (the column containing the output value Y must be either first or last in the list); for example, let's take the data of the previous task, adding a column named "period number" with the period numbers from 1 to 12 (these will be the X values);

2) go to menu Data/Data Analysis/Regression

If the "Data Analysis" item in the "Tools" menu is missing, then you should go to the "Add-Ins" item of the same menu and check the "Analysis Package" box.

3) in the "Regression" dialog box, set:

input interval Y;

input interval X;

output interval - the upper left cell of the interval in which the calculation results will be placed (it is recommended to place it on a new worksheet);

4) click "Ok" and analyze the results.

The least squares method has many uses, since it allows a given function to be represented approximately by simpler ones. LSM can be extremely useful in processing observations, and it is actively used to estimate some quantities from the results of measurements of others that contain random errors. In this article, you will learn how to implement least squares calculations in Excel.

Statement of the problem on a specific example

Suppose there are two indicators, X and Y, where Y depends on X. Since OLS interests us from the point of view of regression analysis (in Excel, its methods are implemented using built-in functions), let's proceed straight to a specific problem.

So let X be the retail space of a grocery store, measured in square meters, and Y the annual turnover, measured in millions of rubles.

It is required to make a forecast of what turnover (Y) the store will have if it has one or another retail space. Obviously, the function Y = f (X) is increasing, since the hypermarket sells more goods than the stall.

A few words about the correctness of the initial data used for prediction

Let's say we have a table built with data for n stores.

According to mathematical statistics, the results will be more or less correct if data on at least 5-6 objects are examined. Also, "anomalous" results cannot be used. In particular, an elite small boutique can have a turnover many times greater than the turnover of large mass-market outlets.

The essence of the method

The table data can be plotted on the Cartesian plane as points M1(x1, y1), ..., Mn(xn, yn). Now the solution of the problem reduces to choosing an approximating function y = f(x) whose graph passes as close as possible to the points M1, M2, ..., Mn.

Of course, you could use a polynomial of high degree, but this option is not only difficult to implement but simply incorrect, since it would not reflect the main trend that needs to be detected. The most reasonable solution is to look for the straight line y = ax + b that best approximates the experimental data, that is, to find the coefficients a and b.

Accuracy score

For any approximation, the assessment of its accuracy is of particular importance. Denote by ei the difference (deviation) between the experimental and fitted values at the point xi, i.e., ei = yi − f(xi).

Obviously, to assess the accuracy of the approximation you could use the sum of the deviations: when choosing a straight line for an approximate representation of the dependence of Y on X, preference would be given to the one with the smallest sum of ei over all points under consideration. However, not everything is so simple: alongside positive deviations there will almost always be negative ones, and they can cancel out.

The problem can be solved using the absolute values of the deviations or their squares. The latter approach is by far the most widely used; it is applied in many areas, including regression analysis (implemented in Excel using two built-in functions), and has long proved its effectiveness.
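A two-line numerical sketch of the cancellation effect that motivates squaring (made-up values):

```python
# Signed deviations can cancel even for a poor fit, which is why
# LSM sums their squares instead. Illustrative numbers only.
ys_actual = [1.0, 3.0]
ys_fitted = [2.0, 2.0]   # off by -1 and +1

devs = [y - f for y, f in zip(ys_actual, ys_fitted)]
print(sum(devs))                  # 0.0: looks deceptively perfect
print(sum(d * d for d in devs))   # 2.0: the misfit is now visible
```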

The least squares method

Excel, as you know, has a built-in AutoSum function that calculates the sum of the values in a selected range. Thus, nothing prevents us from calculating the value of the expression (e1^2 + e2^2 + e3^2 + ... + en^2).

In mathematical notation, this looks like:

Since the decision was initially made to approximate using a straight line, we have:

Thus, the task of finding a straight line that best describes a specific relationship between X and Y amounts to calculating the minimum of a function of two variables:

To do this, we equate to zero the partial derivatives with respect to the variables a and b and solve a simple system of two equations with 2 unknowns of the form:

After simple transformations, including dividing by 2 and manipulating the sums, we get:

Solving it, for example by Cramer's method, we obtain a stationary point with coefficients a* and b*. This is the minimum: that is, to predict what turnover the store will have for a given area, the straight line y = a*x + b*, which is the regression model for the example in question, will serve. Of course, it will not give an exact result, but it will help you judge whether buying the store on credit for a particular area will pay off.
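The system and its Cramer-rule solution can be sketched as follows (illustrative data, not the store table):

```python
def solve_normal_equations(xs, ys):
    """Solve the 2x2 system of normal equations by Cramer's rule,
    as the text suggests:
        a*Sxx + b*Sx = Sxy
        a*Sx  + b*n  = Sy
    where Sx, Sy, Sxy, Sxx are the usual sums over the data."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    det = sxx * n - sx * sx           # determinant of the system
    a = (sxy * n - sx * sy) / det     # first column replaced by RHS
    b = (sxx * sy - sxy * sx) / det   # second column replaced by RHS
    return a, b

# Points on y = 2x give a* = 2, b* = 0:
print(solve_normal_equations([1, 2, 3], [2, 4, 6]))  # (2.0, 0.0)
```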

How to implement the least squares method in Excel

Excel has a function for calculating least squares values. It has the following form: TREND(known Y values; known X values; new X values; const). Let's apply the OLS formula in Excel to our table.

To do this, enter the "=" sign in the cell where the result of the least squares calculation should appear and select the "TREND" function. In the window that opens, fill in the appropriate fields, selecting:

  • the range of known Y values (in this case, the turnover data);
  • the range x 1 , …, x n , i.e., the retail floor areas;
  • both the known and the unknown values of x for which the turnover needs to be found (for their location on the worksheet, see below).

In addition, the formula has a logical argument "Const". If you enter 0 (FALSE) in its field, the calculation is carried out assuming that b = 0.

If you need a forecast for more than one x value, then after entering the formula do not press "Enter"; instead type the combination "Ctrl" + "Shift" + "Enter" to enter it as an array formula.

Some Features

Regression analysis is accessible even to beginners. The Excel formula for predicting an array of unknown values, TREND, can be used even by those who have never heard of the least squares method. It is enough to know a few features of how it works. In particular:

  • If the range of known y values lies in a single row or column, each row (column) of known x values will be treated by the program as a separate variable.
  • If the range of known x values is not specified in the TREND window, Excel treats it as an array of consecutive integers whose count matches the range of given y values.
  • To output an array of "predicted" values, the TREND expression must be entered as an array formula.
  • If no new x values are specified, the TREND function takes them equal to the known ones; if those are not specified either, the array 1; 2; 3; 4; … is used, sized to match the range of given y values.
  • The range containing the new x values must have the same number (or more) of rows or columns as the range of given y values; in other words, it must be commensurate with the independent variables.
  • An array of known x values can contain several variables. If there is only one, the ranges of given x and y values must be commensurate; with several variables, the range of given y values must fit in a single column or row.
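The behaviour of TREND described above can be imitated with a small Python sketch. This is a rough analogue, not Excel's actual implementation; the function name `trend` and its signature are chosen here for illustration:

```python
import numpy as np

def trend(known_y, known_x, new_x, const=True):
    """Rough analogue of Excel's TREND: fit y = a*x + b by least squares
    and return predictions for new_x. const=False forces b = 0,
    mirroring TREND's fourth argument."""
    known_x = np.asarray(known_x, dtype=float)
    known_y = np.asarray(known_y, dtype=float)
    if const:
        # Design matrix with a column of ones for the intercept b.
        A = np.column_stack([known_x, np.ones_like(known_x)])
    else:
        A = known_x.reshape(-1, 1)
    coef, *_ = np.linalg.lstsq(A, known_y, rcond=None)
    new_x = np.asarray(new_x, dtype=float)
    if const:
        return coef[0] * new_x + coef[1]
    return coef[0] * new_x
```

For data lying exactly on a line, e.g. y = 2x + 1 at x = 1, 2, 3, the prediction at x = 4 is exactly 9, just as TREND would return.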

FORECAST function

It is implemented by several functions. One of them is FORECAST. It is similar to TREND, i.e., it returns the result of a least squares calculation, but for a single X only, for which the value of Y is unknown.

Now you know Excel formulas that let even a beginner predict the future value of an indicator along a linear trend.

Linear regression is widely used in econometrics because of the clear economic interpretation of its parameters.

Linear regression reduces to finding an equation of the form

y = a + b·x + ε, or ŷ = a + b·x.

An equation of this type makes it possible to obtain theoretical values of the dependent variable for given values of the factor x by substituting the actual values of x into it.

Building a linear regression comes down to estimating its parameters a and b. These estimates can be found by different methods.

The classical approach to estimating linear regression parameters is based on the method of least squares (LSM).

LSM yields estimates of the parameters a and b that minimize the sum of squared deviations of the actual values of the dependent variable y from the calculated (theoretical) values ŷ: S = ∑(y - ŷ)² → min.

To find the minimum of a function, it is necessary to calculate the partial derivatives with respect to each of the parameters a and b and equate them to zero.

Denoting this sum by S, we have S = ∑(y - a - b·x)².

Transforming this, we obtain the following system of normal equations for estimating the parameters a and b: a·n + b·∑x = ∑y, a·∑x + b·∑x² = ∑xy.

Solving the system of normal equations (3.5) either by successive elimination of variables or by the method of determinants, we find the required estimates of the parameters a and b.
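The "method of determinants" (Cramer's rule) applied to the normal equations can be sketched as follows; the data are illustrative:

```python
# Solve the normal equations
#   a*n     + b*sum_x  = sum_y
#   a*sum_x + b*sum_x2 = sum_xy
# by Cramer's rule (the "method of determinants" mentioned above).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

det = n * sum_x2 - sum_x * sum_x             # main determinant
a = (sum_y * sum_x2 - sum_x * sum_xy) / det  # intercept
b = (n * sum_xy - sum_x * sum_y) / det       # regression coefficient
```

For these data the line y ≈ 0.05 + 1.99·x results, close to the y ≈ 2x pattern the points were generated around.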

The parameter b is called the regression coefficient. Its value shows the average change in the result when the factor changes by one unit.

The regression equation is always supplemented with a measure of the closeness of the relationship. For linear regression, this measure is the linear correlation coefficient. There are different versions of the formula for the linear correlation coefficient; some of them are given below:

As is known, the linear correlation coefficient lies within the limits -1 ≤ r ≤ 1.

To assess the quality of fit of the linear function, the square of the linear correlation coefficient is calculated.

The square of the linear correlation coefficient is called the coefficient of determination. The coefficient of determination characterizes the proportion of the variance of the dependent variable y explained by the regression in the total variance of y.

Accordingly, the value 1 - R² characterizes the proportion of the variance of y caused by other factors not accounted for in the model.
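A minimal sketch of computing the linear correlation coefficient and the coefficient of determination; the data are invented for illustration and only the standard library is used:

```python
import math

# Illustrative data, roughly y = 2x with small noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# One standard form of the linear correlation coefficient:
# r = cov(x, y) / (s_x * s_y).
cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
var_x = sum((xi - mean_x) ** 2 for xi in x) / n
var_y = sum((yi - mean_y) ** 2 for yi in y) / n

r = cov / math.sqrt(var_x * var_y)  # always between -1 and 1
r2 = r ** 2                         # coefficient of determination
```

For these nearly collinear points r is very close to 1, and r² tells us what share of the variance of y the regression on x explains.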

Questions for self-control

1. What is the essence of the method of least squares?

2. How many variables does pairwise regression involve?

3. Which coefficient characterizes the closeness of the relationship between the changes?

4. Within what limits does the coefficient of determination lie?

5. How is the parameter b estimated in correlation-regression analysis?


Nonlinear economic models. Nonlinear regression models. Variable transformation.

Nonlinear economic models.

Variable transformation.

Elasticity coefficient.

If the relationships between economic phenomena are non-linear, they are expressed by the corresponding nonlinear functions: for example, an equilateral hyperbola, a second-degree parabola, etc.

There are two classes of non-linear regressions:

1. Regressions that are non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, for example:

Polynomials of various degrees: y = a + b1·x + b2·x², y = a + b1·x + b2·x² + b3·x³;

Equilateral hyperbola: y = a + b/x;

Semilogarithmic function: y = a + b·ln x.

2. Regressions that are non-linear in the estimated parameters, for example:

Power: y = a·x^b;

Exponential: y = a·b^x;

Exponential (base e): y = e^(a + b·x).
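A model that is non-linear in its parameters can often be linearized before applying OLS. For example, taking logarithms of the power model y = a·x^b gives ln y = ln a + b·ln x, which is linear in ln a and b. A sketch with made-up data:

```python
import math

# Fit the power model y = a * x**b, which is non-linear in its
# parameters, by taking logarithms: ln y = ln a + b * ln x is linear,
# so ordinary least squares applies to (ln x, ln y).
x = [1.0, 2.0, 4.0, 8.0]
y = [3.0, 6.0, 12.0, 24.0]  # exactly y = 3 * x**1, for illustration

lx = [math.log(v) for v in x]
ly = [math.log(v) for v in y]
n = len(x)
mx = sum(lx) / n
my = sum(ly) / n

# OLS slope and intercept on the log-transformed data.
b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) \
    / sum((u - mx) ** 2 for u in lx)
a = math.exp(my - b * mx)  # back-transform the intercept: a = e**(ln a)
```

Since the sample data satisfy y = 3x exactly, the recovered parameters are a = 3 and b = 1.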

The total sum of squared deviations of the individual values of the dependent variable y from its mean is caused by the influence of many factors. We conditionally divide the entire set of causes into two groups: the studied factor x and other factors.

If the factor has no effect on the result, the regression line on the graph is parallel to the Ox axis and ŷ = ȳ.

Then the entire variance of the dependent variable is due to the influence of other factors, and the total sum of squared deviations coincides with the residual sum. If other factors do not affect the result, then y is functionally related to x and the residual sum of squares is zero. In this case, the sum of squared deviations explained by the regression coincides with the total sum of squares.

Since not all points of the correlation field lie on the regression line, their scatter is always due both to the influence of the factor x (the regression of y on x) and to the action of other causes (unexplained variation). The suitability of the regression line for forecasting depends on what share of the total variation of y falls on the explained variation.

Obviously, if the sum of squared deviations due to regression is greater than the residual sum of squares, the regression equation is statistically significant and the factor x has a significant impact on the result y.


The significance of the regression equation as a whole is assessed using Fisher's F-test. A null hypothesis is put forward that the regression coefficient is zero, i.e., b = 0, and hence the factor x does not affect the result y.

The direct calculation of the F-statistic is preceded by analysis of variance. Central to it is the decomposition of the total sum of squared deviations of the variable y from its mean ȳ into two parts, "explained" and "unexplained":

∑(y - ȳ)² is the total sum of squared deviations;

∑(ŷ - ȳ)² is the sum of squared deviations explained by the regression;

∑(y - ŷ)² is the residual sum of squared deviations.

Any sum of squared deviations is related to its number of degrees of freedom, i.e., the number of independent variations of the feature. The number of degrees of freedom is related to the number of population units n and to the number of constants determined from them. In relation to the problem under study, the number of degrees of freedom shows how many independent deviations out of n are required to form the given sum of squares.

Dispersion per degree of freedom, D.

F-ratio (F-test): F = D_fact / D_resid.

If the null hypothesis is true, the factor and residual variances do not differ from one another. To reject H 0, the factor variance must exceed the residual variance several times over. The English statistician Snedecor developed tables of critical values of the F-ratio for various significance levels of the null hypothesis and various numbers of degrees of freedom. The tabulated F value is the maximum ratio of the variances that can occur by chance at a given probability level under the null hypothesis. The computed F-ratio is considered reliable if it is greater than the tabulated one.

In that case the null hypothesis of no relationship between the features is rejected and the relationship is concluded to be significant: if F fact > F table, H 0 is rejected.

If the value is less than the tabulated one, F fact < F table, the probability of the null hypothesis is higher than the specified level, and it cannot be rejected without serious risk of drawing a wrong conclusion about the presence of a relationship. In that case the regression equation is considered statistically insignificant and H 0 is not rejected.
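The F-test procedure above can be sketched numerically. The data are illustrative, and the critical value 10.13 is the standard table entry for a 0.05 significance level with (1, 3) degrees of freedom:

```python
# Sketch of the F-test for the overall significance of a paired regression:
# F = (explained sum of squares / 1) / (residual sum of squares / (n - 2)).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# OLS estimates of the paired regression y = a + b*x.
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x
y_hat = [a + b * xi for xi in x]

ss_explained = sum((yh - mean_y) ** 2 for yh in y_hat)
ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

f_fact = (ss_explained / 1) / (ss_residual / (n - 2))
F_TABLE = 10.13  # critical value for alpha = 0.05, df = (1, 3)
significant = f_fact > F_TABLE
```

For these nearly collinear points the computed F-ratio is in the hundreds, far above the tabulated value, so H 0 (no relationship) is rejected.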

Standard error of the regression coefficient

To assess the significance of the regression coefficient, its value is compared with its standard error, i.e., the actual value of Student's t-statistic is calculated and then compared with the table value at a given significance level and n - 2 degrees of freedom.

Standard error of the parameter a:

The significance of the linear correlation coefficient is checked on the basis of the magnitude of the error of the correlation coefficient r:

Total variance of a feature X:

Multiple Linear Regression

Model building

Multiple regression is a regression of the dependent variable on two or more factors, i.e., a model of the form y = f(x 1 , x 2 , …, x p ).

Pairwise regression can give a good result in modeling if the influence of other factors on the object of study can be neglected. The behavior of individual economic variables cannot be controlled, i.e., it is impossible to hold all other conditions equal while assessing the influence of one factor under study. In this case, one should try to capture the influence of other factors by introducing them into the model, i.e., build a multiple regression equation: y = a + b 1 x 1 + b 2 x 2 + … + b p x p + ε.
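A multiple regression of the form y = a + b1·x1 + b2·x2 can be estimated by least squares with numpy's `lstsq`. The data below are constructed to satisfy the relationship exactly, purely for illustration:

```python
import numpy as np

# Estimate y = a + b1*x1 + b2*x2 + e by least squares.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 1.0 + 2.0 * x1 + 0.5 * x2  # exact relationship, for illustration

# Design matrix: a column of ones for the constant, then the factors.
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
```

Each estimated coefficient measures the influence of its factor with the other factor held fixed, which is exactly the point of moving from pairwise to multiple regression.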

The main purpose of multiple regression is to build a model with a large number of factors and to determine the influence of each of them individually, as well as their cumulative impact on the modeled indicator. Specifying the model involves two sets of questions: the selection of factors and the choice of the form of the regression equation.

The least squares method is used to estimate the parameters of the regression equation.

One of the methods for studying stochastic relationships between features is regression analysis.
Regression analysis is the derivation of a regression equation used to find the average value of a random variable (the result feature) when the value of another variable or variables (factor features) is known. It includes the following steps:

  1. choosing the form of the relationship (the type of analytical regression equation);
  2. estimating the parameters of the equation;
  3. assessing the quality of the analytical regression equation.
Most often, a linear form is used to describe the statistical relationship between features. The attention paid to the linear relationship is explained by the clear economic interpretation of its parameters, by the limited variation of the variables, and by the fact that in most cases nonlinear forms of relationship are converted (by taking logarithms or substituting variables) into linear form for calculation.
In the case of a linear pairwise relationship, the regression equation takes the form y i = a + b·x i + u i . The parameters a and b of this equation are estimated from statistical observations of x and y. The result of such estimation is the equation ŷ i = â + b̂·x i , where â, b̂ are estimates of the parameters a and b, and ŷ is the value of the dependent variable obtained from the regression equation (the calculated value).

The most commonly used method of parameter estimation is the method of least squares (LSM).
The least squares method gives the best (consistent, efficient and unbiased) estimates of the parameters of the regression equation, but only if certain assumptions about the random term (u) and the independent variable (x) are met (see the OLS assumptions).

The problem of estimating the parameters of a linear pairwise equation by least squares is as follows: to obtain estimates â, b̂ at which the sum of squared deviations of the actual values of the dependent variable y i from the calculated values ŷ i is minimal.
Formally, the OLS criterion can be written as: S = ∑(y i - ŷ i )² → min.

Classification of least squares methods

  1. Least squares method (OLS).
  2. Maximum likelihood method (for the normal classical linear regression model, normality of the regression residuals is postulated).
  3. Generalized least squares (GLS), used in the case of error autocorrelation and in the case of heteroscedasticity.
  4. Weighted least squares (a special case of GLS with heteroscedastic residuals).

Let us illustrate the essence of the classical least squares method graphically. To do this, we build a scatter plot of the observations (x i , y i ), i = 1…n, in a rectangular coordinate system (such a scatter plot is called a correlation field). Let us try to find the straight line closest to the points of the correlation field. According to the least squares method, the line is chosen so that the sum of squared vertical distances between the points of the correlation field and this line is minimal.

The mathematical notation of this problem: S(â, b̂) = ∑(y i - â - b̂·x i )² → min.
The values y i and x i , i = 1…n, are known to us; they are observational data. In the function S they are constants. The variables in this function are the required parameter estimates â, b̂. To find the minimum of a function of two variables, we calculate the partial derivatives of this function with respect to each of the parameters and equate them to zero: ∂S/∂â = 0, ∂S/∂b̂ = 0.
As a result, we obtain a system of two linear normal equations. Solving this system, we find the required parameter estimates:

b̂ = (∑x i ·y i - n·x̄·ȳ) / (∑x i ² - n·x̄²), â = ȳ - b̂·x̄.

The correctness of the calculated regression parameters can be checked by comparing the sums ∑y i and ∑ŷ i (some discrepancy is possible due to rounding).
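The check just mentioned, comparing ∑y with ∑ŷ, can be sketched as follows (illustrative data):

```python
# The fitted line passes through the point (x-bar, y-bar), so sum(y_hat)
# must equal sum(y) up to rounding; this is the correctness check above.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 2.9, 4.8, 7.1, 9.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
a = mean_y - b * mean_x

y_hat = [a + b * xi for xi in x]
check = abs(sum(y_hat) - sum(y))  # should be ~0 up to floating-point error
```

A noticeable discrepancy here would signal an arithmetic slip in the sums used to compute â and b̂.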
To calculate the parameter estimates, you can build Table 1.
The sign of the regression coefficient b indicates the direction of the relationship (if b > 0 the relationship is direct; if b < 0, it is inverse). The value of b shows by how many units, on average, the result feature y changes when the factor feature x changes by one unit of its measurement.
Formally, the value of the parameter a is the average value of y when x equals zero. If the factor feature does not and cannot have a zero value, this interpretation of the parameter a is meaningless.

The closeness of the relationship between the features is assessed using the linear pairwise correlation coefficient r x,y . It can be calculated by the formula r x,y = cov(x, y)/(s x ·s y ). In addition, it can be determined through the regression coefficient b: r x,y = b·s x /s y .
The range of admissible values of the linear pairwise correlation coefficient is from -1 to +1. The sign of the correlation coefficient indicates the direction of the relationship: if r x,y > 0, the relationship is direct; if r x,y < 0, it is inverse.
If this coefficient is close to one in absolute value, the relationship between the features can be interpreted as a fairly close linear one. If its absolute value equals one, |r x,y | = 1, the relationship between the features is functional and linear. If the features x and y are linearly independent, r x,y is close to 0.
Table 1 can also be used to calculate r x,y.

Table 1

No. | x_i | y_i | x_i·y_i
1 | x_1 | y_1 | x_1·y_1
2 | x_2 | y_2 | x_2·y_2
… | … | … | …
n | x_n | y_n | x_n·y_n
Column sum | ∑x | ∑y | ∑xy
Mean | x̄ | ȳ |
To assess the quality of the obtained regression equation, the theoretical coefficient of determination R² yx is calculated:

R² yx = d²/s² y = 1 - e²/s² y ,

where d² is the variance of y explained by the regression equation;
e² is the residual variance of y (unexplained by the regression equation);
s² y is the total variance of y.
The coefficient of determination characterizes the proportion of variation (variance) of the dependent variable y explained by the regression (and hence by the factor x) in the total variation (variance) of y. The coefficient of determination R² yx takes values from 0 to 1; accordingly, 1 - R² yx characterizes the proportion of the variance of y caused by other factors not included in the model and by specification errors.
For paired linear regression, R² yx = r² yx .


The method of least squares is a mathematical (mathematical-statistical) technique used to smooth time series, identify the form of correlation between random variables, etc. It consists in approximating the function describing a given phenomenon by a simpler function. The latter is chosen so that the standard deviation (see Variance) of the actual levels of the function at the observed points from the smoothed levels is the smallest.

For example, from the available data (x i , y i ) (i = 1, 2, …, n), a line y = a + bx is constructed on which the minimum of the sum of squared deviations is reached,

i.e., a function of two parameters is minimized: a, the intercept on the y-axis, and b, the slope of the line.

The equations giving the necessary conditions for minimizing the function S(a, b) are called normal equations. Not only linear functions (fitting a straight line) are used as approximating functions, but also quadratic, parabolic, exponential, and others. The line is chosen so that the sum of squared distances (y 1 - ȳ 1 )² + (y 2 - ȳ 2 )² + … is the smallest; the resulting straight line then best reflects the trend of the time series of observations of an indicator over time.

For OLS estimates to be unbiased, it is necessary and sufficient that the most important condition of regression analysis hold: the mathematical expectation of the random error, conditional on the factors, must be zero. This condition is met, in particular, if: 1. the mathematical expectation of the random errors is zero, and 2. the factors and the random errors are independent random variables. The first condition can be considered always satisfied in models with a constant, since the constant absorbs any non-zero mathematical expectation of the errors. The second condition, exogeneity of the factors, is fundamental. If this property does not hold, almost any estimates will be extremely unsatisfactory: they will not even be consistent (that is, even a very large amount of data does not allow obtaining qualitative estimates in this case).

The method of least squares is the most common in the practice of statistical estimation of regression equation parameters. It rests on a number of assumptions about the nature of the data and the results of model building. The main ones are a clear separation of the initial variables into dependent and independent, the uncorrelatedness of the factors included in the equations, the linearity of the relationship, the absence of autocorrelation of the residuals, zero mathematical expectation of the residuals, and their constant variance.

One of the main hypotheses of LSM is that the variances of the deviations e i are equal, i.e., their spread around the mean (zero) value of the series should be stable. This property is called homoscedasticity. In practice, the variances of the deviations are often unequal, that is, heteroscedasticity is observed. This may have various causes: for example, errors in the original data. Random inaccuracies in the source information, such as errors in the order of magnitude of numbers, can significantly affect the results. A larger spread of the deviations e i is often observed at large values of the dependent variable(s). If the data contain a significant error, the deviation of the model value calculated from the erroneous data will naturally also be large. To reduce the influence of this error, its contribution to the calculation results is given a lower weight than all the rest. This idea is implemented in weighted least squares.
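The weighting idea can be sketched as follows: multiplying each observation by the square root of its weight turns weighted least squares into an ordinary least squares problem on the transformed data. The data and weights below are invented for illustration:

```python
import numpy as np

# Weighted least squares: observations with larger error variance get
# smaller weights, so noisy points contribute less to the fit.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.1, 8.0, 30.0])  # last point is badly measured
w = np.array([1.0, 1.0, 1.0, 1.0, 0.01])  # down-weight the suspect point

# Scaling each row of the design matrix and of y by sqrt(w_i) converts
# the weighted problem into an ordinary least squares problem.
sw = np.sqrt(w)
A = np.column_stack([x, np.ones_like(x)]) * sw[:, None]
coef, *_ = np.linalg.lstsq(A, y * sw, rcond=None)
slope, intercept = coef
```

Without the weights the outlier drags the fitted slope far above 2 (an unweighted fit on these data gives a slope of about 6); with the down-weighting, the line stays close to the y ≈ 2x pattern of the reliable points.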