
Multiple linear regression example. Solution using Excel spreadsheet

I have a big bookshelf holding many books of many kinds. On the top shelf are religious books: Fiqh books, Tauhid books, Tasawuf books, Nahwu books, and so on. They are lined up neatly in many rows, and some are arranged by author. On the second level are my study books, such as grammar books, writing books, and TOEFL books; these are arranged by size. On the next shelves are many kinds of scholarly books, for example on philosophy, politics, and history; these take up three levels. Finally, at the bottom of my bookshelf are dictionaries: Arabic dictionaries and English dictionaries, as well as Indonesian dictionaries. In all, there are six levels in my big bookshelf, and the books are lined up in many rows: the first level holds religious books, the second my study books, the next three levels scholarly books of many kinds, and the last level dictionaries. In short, I love my bookshelf.

Specific-to-general order

The skills needed to write range from making the appropriate graphic marks, through utilizing the resources of the chosen language, to anticipating the reactions of the intended readers. The first skill area involves acquiring a writing system, which may be alphabetic (as in European languages) or nonalphabetic (as in many Asian languages). The second skill area requires selecting the appropriate grammar and vocabulary to form acceptable sentences and then arranging them in paragraphs. Third, writing involves thinking about the purpose of the text to be composed and about its possible effects on the intended readership. One important aspect of this last feature is the choice of a suitable style. Unlike speaking, writing is a complex sociocognitive process that has to be acquired through years of training or schooling. (Swales and Feak, 1994, p. 34)

General-to-specific order

"Working part-time as a cashier at the Piggly Wiggly has given me a great opportunity to observe human behavior. Sometimes I think of the shoppers as white rats in a lab experiment, and the aisles as a maze designed by a psychologist. Most of the rats--customers, I mean--follow a routine pattern, strolling up and down the aisles, checking through my chute, and then escaping through the exit hatch. abnormal customer: the amnesiac, the super shopper, and the dawdler. . ."

There are many factors that contribute to student success in college. The first factor is having a goal in mind before establishing a course of study. The goal may be as general as wanting to better educate oneself for the future. A more specific goal would be to earn a teaching credential. A second factor related to student success is self-motivation and commitment. A student who wants to succeed and works towards this desire will find success easily as a college student. A third factor linked to student success is using college services. Most beginning college students fail to realize how important it can be to see a counselor or consult with a librarian or financial aid officer.

There are three reasons why Canada is one of the best countries in the world. First, Canada has an excellent health care system. All Canadians have access to medical services at a reasonable price. Second, Canada has a high standard of education. Students are taught by well-trained teachers and are encouraged to continue studying at university. Finally, Canada's cities are clean and efficiently organized. Canadian cities have many parks and lots of space for people to live. As a result, Canada is a desirable place to live.

York was charged by six German soldiers who came at him with fixed bayonets. He drew a bead on the sixth man, fired, and then on the fifth. He worked his way down the line, and before he knew it, the first man was all by himself. York killed him with a single shot.

As he looked around campus, which had hardly changed, he relived those moments he had spent with Nancy. He recalled how the two of them would sit by the pond, chatting endlessly as they fed the fish, and how they would take walks together, lost in their own world. Yes, Nancy was one of the few friends he had ever had. … He was suddenly filled with nostalgia as he recalled the afternoon he had bid farewell to Nancy. He sniffed loudly as his eyes filled with tears.

Examples of solving problems on multiple regression

Example 1. The regression equation, built on 17 observations, has the form:

Determine the missing values and construct a confidence interval for b2 with a probability of 0.99.

Solution. The missing values are determined using the formulas:

Thus, the regression equation with its statistical characteristics looks like this:

The confidence interval for b2 is constructed according to the corresponding formula. Here the significance level is 0.01, and the number of degrees of freedom is n - p - 1 = 17 - 3 - 1 = 13, where n = 17 is the sample size and p = 3 is the number of factors in the regression equation. From here

This confidence interval covers the true value of the parameter with a probability of 0.99.
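For readers who want to check such an interval numerically, here is a minimal Python sketch; the estimate b2 and its standard error are hypothetical placeholders, since the problem's numeric values are not reproduced above:

from scipy import stats

n, p = 17, 3                 # sample size and number of factors
df = n - p - 1               # 13 degrees of freedom
alpha = 0.01                 # significance level for a 0.99 interval

b2, se_b2 = 1.50, 0.25       # hypothetical estimate and standard error of b2
t_cr = stats.t.ppf(1 - alpha / 2, df)            # two-sided critical value
lower, upper = b2 - t_cr * se_b2, b2 + t_cr * se_b2
print("t_cr = %.3f, interval = (%.3f, %.3f)" % (t_cr, lower, upper))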

Example 2. The regression equation in standardized variables looks like this:

In this case, the variances of all variables are equal to the following values:

Compare the factors by their degree of influence on the resulting feature and determine the values of the partial elasticity coefficients.

Solution. A standardized regression equation allows one to compare factors by the strength of their influence on the result: the greater the absolute value of the coefficient of a standardized variable, the more strongly that factor affects the resulting trait. In the equation under consideration, the factor with the strongest influence on the result is x1, whose coefficient is 0.82, and the weakest is x3, with a coefficient of -0.43.

In a linear multiple regression model, the generalized (average) partial elasticity coefficient is determined by an expression that involves the average values of the variables and the coefficient on the corresponding factor in the natural-scale regression equation. These quantities are not given in the conditions of the problem, so we use the relations between the variances of the variables:

The coefficients bj are related to the standardized coefficients βj by the corresponding relation, which we substitute into the formula for the average elasticity coefficient. In this case, the sign of the elasticity coefficient coincides with the sign of βj:
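The two relations used here convert between scales mechanically; a small Python sketch (all numbers are hypothetical placeholders for the values in the problem's table):

def natural_coef(beta_j, s_y, s_xj):
    # b_j = beta_j * s_y / s_xj: from the standardized to the natural scale
    return beta_j * s_y / s_xj

def avg_elasticity(b_j, mean_xj, mean_y):
    # E_j = b_j * mean(x_j) / mean(y); its sign coincides with the sign of beta_j
    return b_j * mean_xj / mean_y

b1 = natural_coef(beta_j=0.82, s_y=2.0, s_xj=1.5)     # hypothetical standard deviations
print(avg_elasticity(b1, mean_xj=4.0, mean_y=10.0))   # hypothetical means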

Example 3. Based on 32 observations, the following data were obtained:

Determine the values of the adjusted coefficient of determination, the partial elasticity coefficients, and the parameter a.

Solution. The value of the adjusted coefficient of determination is found from one of the formulas for its calculation:

The partial elasticity coefficients (averaged over the population) are calculated using the appropriate formulas:

Since the linear multiple regression equation holds exactly when the average values of all variables are substituted into it, we determine the parameter a:
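The formulas meant here are presumably the standard ones; in the notation of this section they read:

\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1},
\qquad
\bar{E}_j = b_j\,\frac{\bar{x}_j}{\bar{y}},
\qquad
a = \bar{y} - \sum_{j=1}^{p} b_j\,\bar{x}_j .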

Example 4. For some variables, the following statistics are available:

Build the regression equation in standardized and natural scales.

Solution. Since the pair correlation coefficients between the variables are known initially, one should start by constructing the regression equation on a standardized scale. To do this, we solve the corresponding system of normal equations, which in the case of two factors has the form

β1 + r12·β2 = ry1,
r12·β1 + β2 = ry2,

where r12 is the pair correlation between x1 and x2, and ry1, ry2 are the correlations of y with the respective factors,

or, after substituting the initial data:

Solving this system in any convenient way, we get β1 = 0.3076 and β2 = 0.62.

Let's write the regression equation on a standardized scale:

Now let us move on to the natural-scale regression equation, using the formulas that express the regression coefficients through the beta coefficients and the fact that the regression equation holds at the average values of the variables:

The natural-scale regression equation is:
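Mechanically, the standardized system is just a small linear solve; a sketch with hypothetical pair correlations standing in for the problem's table values:

import numpy as np

# beta1 + r12*beta2 = r_y1
# r12*beta1 + beta2 = r_y2
r12, r_y1, r_y2 = 0.45, 0.59, 0.76          # hypothetical correlations
A = np.array([[1.0, r12],
              [r12, 1.0]])
rhs = np.array([r_y1, r_y2])
beta1, beta2 = np.linalg.solve(A, rhs)
print(beta1, beta2)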

Example 5. When building a linear multiple regression on 48 measurements, the coefficient of determination was 0.578. After eliminating the factors x3, x7 and x8, the coefficient of determination decreased to 0.495. Was the decision to change the composition of the influencing variables justified at significance levels of 0.1, 0.05 and 0.01?

Solution. Let R1² be the coefficient of determination of the regression equation with the initial set of factors, and R2² the coefficient of determination after the exclusion of the three factors. We put forward the hypotheses:

H0: R1² - R2² = 0;  H1: R1² - R2² > 0.

The main hypothesis suggests that the decrease in the coefficient of determination was not significant, so the decision to exclude the group of factors was correct; the alternative hypothesis indicates that the exclusion was not justified.

To test the null hypothesis, we use the statistic

F = ((R1² - R2²)/k) / ((1 - R1²)/(n - p - 1)),

where n = 48, p = 10 is the initial number of factors, and k = 3 is the number of excluded factors. Then

F_obs = ((0.578 - 0.495)/3) / ((1 - 0.578)/37) ≈ 2.43.

Let us compare the obtained value with the critical values F(α; 3; 37) at the levels 0.1, 0.05 and 0.01:

F(0.1; 3; 37) = 2.238;

F(0.05; 3; 37) = 2.86;

F(0.01; 3; 37) = 4.36.

At the level α = 0.1, F_obs > F_cr, so the null hypothesis is rejected and the exclusion of this group of factors is not justified; at the levels 0.05 and 0.01, F_obs < F_cr, the null hypothesis cannot be rejected, and the exclusion of the factors can be considered justified.
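The same test can be scripted directly from the numbers given in the problem; a short sketch:

from scipy import stats

R2_full, R2_red = 0.578, 0.495
n, p, k = 48, 10, 3
df2 = n - p - 1                                            # 37

F_obs = ((R2_full - R2_red) / k) / ((1 - R2_full) / df2)   # ≈ 2.43
for alpha in (0.10, 0.05, 0.01):
    F_cr = stats.f.ppf(1 - alpha, k, df2)
    print(alpha, round(F_obs, 2), round(F_cr, 2),
          "reject H0" if F_obs > F_cr else "cannot reject H0")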

Example 6. Based on quarterly data from 2000 to 2004, a regression equation was obtained with ESS = 110.3 and RSS = 21.4 (ESS is the explained sum of squares, RSS the residual sum of squares). Three dummy variables, corresponding to the first three quarters of the year, were then added to the equation, and the ESS value increased to 120.2. Is there seasonality in this equation?

Solution. This is a task of checking the validity of including a group of factors in a multiple regression equation. Three variables representing the first three quarters of the year were added to the original three-factor equation.

Let us determine the coefficients of determination of the two equations. The total sum of squares is the sum of the explained and residual sums of squares:

TSS = ESS1 + RSS1 = 110.3 + 21.4 = 131.7

We test the hypotheses. To check the null hypothesis, we use the statistic

F = ((ESS2 - ESS1)/k) / ((TSS - ESS2)/(n - p - 1)).

Here n = 20 (20 quarters over five years, from 2000 to 2004), p = 6 (the total number of factors in the regression equation after including the new ones), and k = 3 (the number of included factors). Thus:

F_obs = ((120.2 - 110.3)/3) / ((131.7 - 120.2)/13) ≈ 3.73.

Let us determine the critical values of the Fisher statistic at the various significance levels: F(0.1; 3; 13) ≈ 2.56, F(0.05; 3; 13) ≈ 3.41, F(0.01; 3; 13) ≈ 5.74.

At significance levels of 0.1 and 0.05, F_obs > F_cr, so the null hypothesis is rejected in favor of the alternative: the seasonality in the regression is significant (the addition of the three new factors is justified). At the 0.01 level, F_obs < F_cr, the null hypothesis cannot be rejected; the addition of the new factors is not justified and the seasonality is not significant.
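A sketch of the same computation, using the ESS/RSS values given in the problem:

from scipy import stats

ESS1, RSS1 = 110.3, 21.4
TSS = ESS1 + RSS1               # 131.7
ESS2 = 120.2
RSS2 = TSS - ESS2               # 11.5
n, p, k = 20, 6, 3
df2 = n - p - 1                 # 13

F_obs = ((ESS2 - ESS1) / k) / (RSS2 / df2)   # ≈ 3.73
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(stats.f.ppf(1 - alpha, k, df2), 2))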

Example 7. When analyzing data for heteroscedasticity, the entire sample was divided into three subsamples after ordering by one of the factors. Based on the results of three separate regressions, the residual sum of squares in the first subsample was 180, and in the third it was 63. Is the presence of heteroscedasticity confirmed if the data volume in each subsample is 20?

Solution. We calculate the statistic for testing the null hypothesis of homoscedasticity using the Goldfeld–Quandt test:

F_obs = 180 / 63 ≈ 2.86.

We find the critical values of the Fisher statistic:

At significance levels of 0.1 and 0.05, F_obs > F_cr, so heteroscedasticity takes place; at the 0.01 level, F_obs < F_cr, and the homoscedasticity hypothesis cannot be rejected.
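A sketch of the Goldfeld–Quandt computation; the degrees of freedom assume one ordering regressor per subsample regression, which the problem does not state explicitly:

from scipy import stats

RSS_first, RSS_third = 180, 63
F_obs = RSS_first / RSS_third          # ≈ 2.86

m = 20                                 # observations per subsample
df = m - 1 - 1                         # assumption: one regressor plus an intercept
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(F_obs, 2), round(stats.f.ppf(1 - alpha, df, df), 2))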

Example 8. Based on quarterly data, a multiple regression equation was obtained for which ESS = 120.32 and RSS = 41.4. For the same model, regressions were then run separately on the data for 1991 Q1 to 1995 Q1 and for 1995 Q2 to 1996 Q4; in these regressions the residual sums of squares were 22.25 and 12.32, respectively. Test the hypothesis about the presence of structural changes in the sample.

Solution. The question of the presence of structural changes in the sample is resolved using the Chow test.

The hypotheses have the form H0: there are no structural changes; H1: structural changes are present. Here s0, s1 and s2 denote the residual sums of squares for the single equation over the entire sample and for the regression equations over the two subsamples, respectively; the main hypothesis denies the presence of structural changes. To test the null hypothesis, we calculate the statistic (n = 24, p = 3):

F = ((s0 - s1 - s2)/(p + 1)) / ((s1 + s2)/(n - 2p - 2)) = ((41.4 - 22.25 - 12.32)/4) / ((22.25 + 12.32)/16) ≈ 0.79.

Since the F statistic is less than one, the null hypothesis cannot be rejected at any reasonable significance level, for example 0.05.
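The Chow statistic above, computed in a few lines:

from scipy import stats

RSS_pooled, RSS1, RSS2 = 41.4, 22.25, 12.32
n, p = 24, 3

num = (RSS_pooled - RSS1 - RSS2) / (p + 1)
den = (RSS1 + RSS2) / (n - 2 * (p + 1))
F_obs = num / den                                  # ≈ 0.79
F_cr = stats.f.ppf(0.95, p + 1, n - 2 * (p + 1))   # critical value at alpha = 0.05
print(round(F_obs, 2), round(F_cr, 2))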

Regression analysis is a statistical research method that shows how a parameter depends on one or more independent variables. In the pre-computer era its use was quite difficult, especially for large amounts of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in a couple of minutes. Below are concrete examples from the field of economics.

Types of regression

The concept itself was introduced into mathematics by Francis Galton in 1886. Regression can be:

  • linear;
  • parabolic;
  • power;
  • exponential (y = a·e^(bx));
  • hyperbolic;
  • exponential with base b (y = a·b^x);
  • logarithmic.

Example 1

Consider the problem of determining how the number of employees who quit depends on the average salary at six industrial enterprises.

Task. At six enterprises, the average monthly salary and the number of employees who left of their own accord were analyzed. In tabular form we have:

Number of people who left    Salary
-                            30,000 rubles
-                            35,000 rubles
-                            40,000 rubles
-                            45,000 rubles
-                            50,000 rubles
-                            55,000 rubles
-                            60,000 rubles

For this problem, the regression model has the form of the equation Y = a0 + a1x1 + … + akxk, where xi are the influencing variables, ai are the regression coefficients, and k is the number of factors.

For this task, Y is the indicator of employees who left, and the influencing factor is the salary, which we denote by X.

Using the capabilities of the spreadsheet "Excel"

Regression analysis in Excel should be preceded by applying built-in functions to the available tabular data, but for these purposes it is better to use the very useful "Analysis ToolPak" add-in. To activate it you need to:

  • from the "File" tab, go to the "Options" section;
  • in the window that opens, select the line "Add-ins";
  • click on the "Go" button located at the bottom, to the right of the "Manage" line;
  • check the box next to the name "Analysis ToolPak" and confirm your actions by clicking "OK".

If everything is done correctly, the desired button will appear on the right side of the Data tab, located above the Excel worksheet.

Regression in Excel

Now that we have at hand all the necessary virtual tools for performing econometric calculations, we can begin to solve our problem. For this:

  • click on the "Data Analysis" button;
  • in the window that opens, click on the "Regression" button;
  • in the tab that appears, enter the range of values ​​for Y (the number of employees who quit) and for X (their salaries);
  • We confirm our actions by pressing the "Ok" button.

As a result, the program will automatically populate a new sheet of the workbook with the regression analysis output. Note that Excel lets you manually set a preferred location for this output: for example, the same sheet where the Y and X values are, or even a new workbook specially designed for storing such data.

Analysis of regression results for R-square

In Excel, the output obtained by processing the data of the example under consideration looks like this:

First of all, you should pay attention to the value of R-square, the coefficient of determination. In this example R-square = 0.755 (75.5%), i.e., the calculated parameters of the model explain the relationship between the considered parameters by 75.5%. The higher the value of the coefficient of determination, the more applicable the chosen model is for the particular task. It is believed that a model describes the real situation correctly when the R-square value is above 0.8; if R-square < 0.5, such a regression analysis in Excel cannot be considered reasonable.

Ratio Analysis

The number 64.1428 shows what the value of Y would be if all the variables xi in the model under consideration were set to zero. In other words, the value of the analyzed parameter is also influenced by factors not described in this particular model.

The next coefficient, -0.16285, located in cell B18, shows the weight of the influence of variable X on Y. It means that within the model under consideration, the average monthly salary affects the number of quitters with a weight of -0.16285, i.e., the degree of its influence is quite small. The minus sign indicates that the coefficient is negative. This is as expected: everyone knows that the higher the salary at an enterprise, the fewer people express a desire to terminate their employment contract and quit.
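The same fit can be reproduced outside Excel. A sketch with scikit-learn; the quit counts are hypothetical placeholders (the article's table shows only the salary column), so the output will not exactly match the 64.1428 and -0.16285 above:

import numpy as np
from sklearn.linear_model import LinearRegression

salary = np.array([30, 35, 40, 45, 50, 55, 60], dtype=float).reshape(-1, 1)  # thousand rubles
quits = np.array([64, 59, 58, 55, 54, 52, 50], dtype=float)                  # hypothetical counts

model = LinearRegression().fit(salary, quits)
print(model.intercept_, model.coef_[0])   # analogues of the intercept and the B18 coefficient
print(model.score(salary, quits))         # R-square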

Multiple regression

This term refers to a relationship equation with several independent variables of the form:

y = f(x1, x2, …, xm) + ε, where y is the effective feature (dependent variable) and x1, x2, …, xm are the factors (independent variables).

Parameter Estimation

For multiple regression (MR), it is carried out using the method of least squares (OLS). For linear equations of the form Y = a + b1x1 + … + bmxm + ε, we construct a system of normal equations (see below).

To understand the principle of the method, consider the two-factor case. Then we have a situation described by the formula

From here we get:

where σ is the variance of the corresponding feature indicated in the subscript.

OLS is also applicable to the MR equation on a standardized scale. In this case we obtain the equation:

where ty, tx1, …, txm are standardized variables, with mean value 0 and standard deviation 1, and βi are the standardized regression coefficients.

Note that all βi in this case are normalized and centered, so comparing them with each other is considered correct and admissible. In addition, it is customary to screen factors, discarding those with the smallest values of βi.

Problem using linear regression equation

Suppose there is a table of the price dynamics of a certain product N over the last 8 months. It is necessary to decide on the advisability of purchasing a batch of it at a price of 1850 rubles/t.

Month number    Month name    Price of item N
1               -             1750 rubles per ton
2               -             1755 rubles per ton
3               -             1767 rubles per ton
4               -             1760 rubles per ton
5               -             1770 rubles per ton
6               -             1790 rubles per ton
7               -             1810 rubles per ton
8               -             1840 rubles per ton

To solve this problem in Excel, use the "Data Analysis" tool already known from the example above. Select the "Regression" section and set the parameters. Remember that in the "Input Y Range" field you must enter the range of values of the dependent variable (in this case, the prices of the product in specific months), and in the "Input X Range" field the values of the independent variable (the month number). Confirm by clicking "OK". On a new sheet (if so indicated), we get the regression output.

Based on it, we build a linear equation of the form y = ax + b, where the parameters a and b are the coefficients from the row named for the month number and from the "Y-intercept" row of the sheet with the regression results. Thus, the linear regression (LR) equation for this problem is written as:

Product price N = 11.714 * month number + 1727.54,

or in algebraic notation

y = 11.714 x + 1727.54
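The same line, and an answer to the purchase question, can be obtained in a few lines of Python using the prices from the table above; month 9 is taken as the next month:

import numpy as np

months = np.arange(1, 9)
price = np.array([1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840])

a, b = np.polyfit(months, price, 1)   # a ≈ 11.714, b ≈ 1727.54
forecast = a * 9 + b                  # ≈ 1833 rubles per ton
print(round(a, 3), round(b, 2), round(forecast, 2))

The trend value for month 9 is about 1833 rubles/t, which suggests that the offered price of 1850 rubles/t is above the trend.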

Analysis of results

To decide whether the resulting linear regression equation is adequate, the multiple correlation coefficient (MCC) and the coefficient of determination are used, as well as Fisher's test and Student's t-test. In the Excel table with the regression results, they appear under the names multiple R, R-square, F-statistic and t-statistic, respectively.

The MCC R makes it possible to assess the closeness of the probabilistic relationship between the independent and dependent variables. Its high value indicates a fairly strong relationship between the variables "month number" and "price of product N in rubles per 1 ton". However, the nature of this relationship remains unknown.

The coefficient of determination R² is a numerical characteristic of the share of the total scatter that is explained: it shows what part of the experimental data, i.e., of the values of the dependent variable, corresponds to the linear regression equation. In the problem under consideration this value equals 84.8%, i.e., the statistical data are described with high accuracy by the obtained LR equation.

The F-statistic, also called Fisher's test, is used to assess the significance of the linear relationship, refuting or confirming the hypothesis of its existence.

The t-statistic (Student's criterion) helps to evaluate the significance of the coefficient on the unknown and of the free (intercept) term of the linear relationship. If the value of the t-statistic > t_cr, the hypothesis of the insignificance of the free term of the linear equation is rejected.

In the problem under consideration, the Excel tools give t = 169.20903 and p = 2.89E-12 for the free term, i.e., there is practically zero probability that the correct hypothesis of the insignificance of the free term would be rejected. For the coefficient on the unknown, t = 5.79405 and p = 0.001158; in other words, the probability that the correct hypothesis of the insignificance of this coefficient would be rejected is about 0.12%.

Thus, it can be argued that the resulting linear regression equation is adequate.

The problem of the expediency of buying a block of shares

Multiple regression in Excel is performed using the same Data Analysis tool. Consider a specific applied problem.

The management of NNN must decide on the advisability of purchasing a 20% stake in MMM. The cost of the package (SP) is 70 million US dollars. NNN's specialists collected data on similar transactions. It was decided to evaluate the value of the block of shares by the following parameters, expressed in millions of US dollars:

  • accounts payable (VK);
  • annual turnover (VO);
  • accounts receivable (VD);
  • cost of fixed assets (SOF).

In addition, the parameter payroll arrears of the enterprise (VZP), in thousands of US dollars, is used.

Solution using Excel spreadsheet

First of all, you need to create a table of the initial data, and then:

  • call the "Data Analysis" window;
  • select the "Regression" section;
  • in the "Input Y Range" box, enter the range of values of the dependent variable from column G;
  • click on the icon with the red arrow to the right of the "Input X Range" window and select the range of all values from columns B, C, D and F on the sheet.

Select "New Worksheet" and click "Ok".

Get the regression analysis for the given problem.

Examination of the results and conclusions

We assemble the regression equation from the rounded data presented above on the Excel worksheet:

SP = 0.103*SOF + 0.541*VO - 0.031*VK + 0.405*VD + 0.691*VZP - 265.844

In the more familiar mathematical form it can be written as:

y = 0.103*x1 + 0.541*x2 - 0.031*x3 +0.405*x4 +0.691*x5 - 265.844
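Evaluating this equation for a candidate company is then a one-line function; the inputs below are hypothetical placeholders, since the table with MMM's data is not reproduced here:

def estimate_sp(sof, vo, vk, vd, vzp):
    # the regression equation assembled from the Excel output above
    return 0.103 * sof + 0.541 * vo - 0.031 * vk + 0.405 * vd + 0.691 * vzp - 265.844

print(estimate_sp(sof=250.0, vo=420.0, vk=95.0, vd=80.0, vzp=25.0))  # hypothetical inputs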

Data for JSC "MMM" are presented in the table:

Substituting them into the regression equation gives a figure of 64.72 million US dollars. This means that the shares of MMM should not be purchased, since their price of 70 million US dollars is rather overstated.

As you can see, the use of the Excel spreadsheet and the regression equation made it possible to make an informed decision regarding the feasibility of a very specific transaction.

Now you know what regression is. The examples in Excel discussed above will help you solve practical problems from the field of econometrics.

Questions:

4. Estimation of the parameters of the linear model of multiple regression.

5. Evaluation of the quality of multiple linear regression.

6. Analysis and forecasting based on multifactorial models.

Multiple regression is a generalization of pairwise regression. It is used to describe the relationship between the explained (dependent) variable Y and the explanatory (independent) variables X1, X2, …, Xk. Multiple regression can be either linear or nonlinear, but linear multiple regression is the most widely used in economics.

The theoretical linear multiple regression model looks like:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε, (1)

and the corresponding sample regression is written:

ŷ = a0 + a1X1 + a2X2 + … + akXk. (2)

As in pairwise regression, the random term ε must satisfy the basic assumptions of regression analysis; then OLS gives the best unbiased and efficient estimates of the theoretical regression parameters. In addition, the variables X1, X2, …, Xk must be mutually uncorrelated (linearly independent). In order to write the formulas for the estimates of the regression coefficients (2) obtained by OLS, we introduce the following notation:

Then the theoretical model can be written in vector-matrix form:

and the sample regression as:

OLS leads to the following formula for the estimate of the vector of sample regression coefficients:

a = (XᵀX)⁻¹XᵀY. (3)

To estimate the coefficients of a multiple linear regression with two independent variables, we can instead solve the system of normal equations:

n·a0 + a1·Σx1 + a2·Σx2 = Σy,
a0·Σx1 + a1·Σx1² + a2·Σx1x2 = Σx1y,
a0·Σx2 + a1·Σx1x2 + a2·Σx2² = Σx2y. (4)

As in paired linear regression, the standard error of the regression S is calculated for multiple regression:

S = √( Σ(yi - ŷi)² / (n - k - 1) ), (5)

and the standard errors of the regression coefficients:

S_aj = S·√([(XᵀX)⁻¹]jj). (6)

The significance of the coefficients is checked using the t-test

t = aj / S_aj, (7)

which has a Student distribution with ν = n - k - 1 degrees of freedom.

To assess the quality of the regression, the coefficient (index) of determination is used:

R² = 1 - Σ(yi - ŷi)² / Σ(yi - ȳ)², (8)

the closer R² is to 1, the higher the quality of the regression.

To check the significance of the coefficient of determination, Fisher's criterion (the F-statistic) is used:

F = (R²/k) / ((1 - R²)/(n - k - 1)), (9)

with ν1 = k and ν2 = n - k - 1 degrees of freedom.

In multivariate regression, adding additional explanatory variables increases the coefficient of determination. To compensate for this growth, an adjusted (normalized) coefficient of determination is introduced:

R̄² = 1 - (1 - R²)·(n - 1)/(n - k - 1). (10)

If the increase in the share of the regression explained by adding a new variable is small, then R̄² may decrease; in that case adding the new variable is inappropriate.

Example 4:

Let us consider the dependence of an enterprise's profit on the cost of new equipment and machinery and on the cost of improving employees' skills. Statistical data were collected on 6 enterprises of the same type. The data, in millions of monetary units, are given in Table 1.

Table 1

Construct a two-factor linear regression and evaluate its significance. We introduce the notation:

We transpose the matrix X:

The inverse of this matrix is:

Thus, the dependence of profit on the cost of new equipment and machinery and on the cost of improving the skills of employees can be described by the following regression:

Using formula (5), where k=2, we calculate the standard error of regression S=0.636.

We calculate the standard errors of the regression coefficients using formula (6):

Similarly:

Let us check the significance of the regression coefficients a1 and a2 by calculating t_calc.

We choose the significance level; the number of degrees of freedom is ν = n - k - 1 = 6 - 2 - 1 = 3.

Hence the coefficient a1 is significant.

Let us assess the significance of the coefficient a2:

The coefficient a2 is insignificant.

Let us calculate the coefficient of determination by formula (8). The enterprise's profit depends for 96% on the cost of new equipment and machinery and on advanced training, and for 4% on other, random factors. Let us check the significance of the coefficient of determination by calculating F_calc:

Therefore, the coefficient of determination is significant and the regression equation is significant.
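Formulas (3), (5), (6) and (7) translate directly into matrix code. A sketch with hypothetical data standing in for Table 1, which is not reproduced here:

import numpy as np
from scipy import stats

# hypothetical design matrix: a column of ones, x1 (equipment), x2 (training)
X = np.array([[1, 3.0, 1.0], [1, 4.0, 2.0], [1, 5.0, 2.5],
              [1, 6.0, 3.0], [1, 7.0, 3.5], [1, 8.0, 5.0]])
y = np.array([10.0, 12.0, 14.5, 16.0, 18.0, 20.5])
n, k = X.shape[0], X.shape[1] - 1

XtX_inv = np.linalg.inv(X.T @ X)
a = XtX_inv @ X.T @ y                                  # formula (3)
S = np.sqrt(((y - X @ a) ** 2).sum() / (n - k - 1))    # formula (5)
se = S * np.sqrt(np.diag(XtX_inv))                     # formula (6)
t = a / se                                             # formula (7)
t_cr = stats.t.ppf(0.975, n - k - 1)                   # two-sided, alpha = 0.05
print(t, t_cr)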

Of great importance in analysis based on multivariate regression is comparing the influence of the factors on the dependent indicator y. The regression coefficients are not used for this purpose because of differences in units of measurement and in degrees of variability. Free of these shortcomings are the elasticity coefficients:

Ej = aj · x̄j / ȳ. (11)

Elasticity shows by how many percent the dependent indicator y changes on average when variable xj changes by 1%, provided the values of the other variables remain unchanged. The larger |Ej| is, the greater the influence of the corresponding variable. As in paired regression, a distinction is made between a point forecast and an interval forecast for multiple regression. A point forecast (a number) is obtained by substituting the predicted values of the independent variables into the multiple regression equation. Denote by

x_p = (1, x1p, x2p, …, xkp)ᵀ (12)

the vector of predictive values of the independent variables; then the point forecast is ŷp = x_pᵀ·a.

The standard error of prediction in the case of multiple regression is defined as follows:

S_ŷp = S·√(1 + x_pᵀ(XᵀX)⁻¹x_p). (15)

For the significance level α and the number of degrees of freedom ν = n - k - 1, we find t_cr from the Student distribution table. Then the true value of yp falls with probability 1 - α into the interval

(ŷp - t_cr·S_ŷp ; ŷp + t_cr·S_ŷp).
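Continuing the matrix notation, a point forecast and its interval by formulas (12) and (15), again with hypothetical data:

import numpy as np
from scipy import stats

X = np.array([[1, 2.0, 3.0], [1, 2.5, 2.0], [1, 3.0, 4.0],
              [1, 3.5, 3.5], [1, 4.0, 5.0], [1, 4.5, 4.0]])
y = np.array([5.0, 5.5, 7.0, 7.5, 9.0, 9.5])
n, k = X.shape[0], X.shape[1] - 1

XtX_inv = np.linalg.inv(X.T @ X)
a = XtX_inv @ X.T @ y
S = np.sqrt(((y - X @ a) ** 2).sum() / (n - k - 1))

x_p = np.array([1, 5.0, 5.5])                  # predictive values, formula (12)
y_p = x_p @ a                                  # point forecast
se_p = S * np.sqrt(1 + x_p @ XtX_inv @ x_p)    # standard error, formula (15)
t_cr = stats.t.ppf(0.975, n - k - 1)           # alpha = 0.05
print(y_p - t_cr * se_p, y_p + t_cr * se_p)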


Topic 5:

Time series.

Questions:

4. Basic concepts of time series.

5. The main development trend is a trend.

6. Building an additive model.

A time series is a set of values of some indicator for several consecutive moments or periods of time.

The moment (or period) of time is denoted by t, and the value of the indicator at moment t by y(t); it is called a level of the series.

Each level of a time series is formed under the influence of a large number of factors, which can be divided into 3 groups:

Long-term, permanent factors that have a decisive influence on the phenomenon under study and form the main tendency of the series, the trend T(t).

Short-term periodic factors that form the seasonal fluctuations S(t) of the series.

Random factors that form the random changes ε(t) in the levels of the series.

An additive time series model is a model in which each level of the series is the sum of the trend, seasonal and random components:

Y(t) = T(t) + S(t) + ε(t).

A multiplicative model is a model in which each level of the series is the product of these components:

Y(t) = T(t) · S(t) · ε(t).

The choice between the two models is based on an analysis of the structure of the seasonal fluctuations: if the oscillation amplitude is approximately constant, an additive model is built; if the amplitude increases, a multiplicative one.

The main task of econometric analysis is to identify each of the listed components.

The main development tendency (trend) is a smooth and stable change in the levels of the series over time, free from random and seasonal fluctuations.

The task of identifying the main development trend is called time series alignment (smoothing).

Time series alignment methods include:

1) the method of enlargement of intervals,

2) the moving average method,

3) analytical alignment.

1) The periods of time to which the levels of the series refer are enlarged, and the levels are summed over the enlarged intervals. Fluctuations in the levels due to random causes cancel each other out, and the general trend is revealed more clearly.

2) The average value is calculated for a set number of the first levels of the series. Then the average is calculated over the same number of levels starting from the second level, and so on: the average value slides along the series of dynamics, advancing by 1 period (point in time) each step. The number of levels over which the average is calculated can be even or odd. An odd-order moving average is referred to the middle of the sliding period; for an even period, the average does not line up with a moment t, so a centering procedure is applied, i.e., the average of two consecutive moving averages is calculated.

3) Construction of an analytic function characterizing the dependence of the level of the series on time. The following functions are used to build trends:

The trend parameters are determined by the least squares method, and the best function is chosen on the basis of the coefficient R².

We will build an additive model using an example.

Example 7:

There are quarterly data on the volume of electricity consumption in a certain area for 4 years. The data, in million kWh, are given in Table 1.

Table 1

Build a time series model.

In this example, we consider the quarter number t as the independent variable and the quarterly electricity consumption y(t) as the dependent variable.

From the scatterplot, you can see that the trend is linear. The presence of seasonal fluctuations (period = 4) of constant amplitude is also visible, so we will build an additive model.

Model building includes the following steps:

1. We align the original series using a moving average over 4 quarters and perform centering:

1.1. We sum the levels of the series sequentially for every 4 quarters with a shift of 1 point in time.

1.2. Dividing the resulting sums by 4, we find the moving averages.

1.3. We bring these values into line with the actual points in time by finding the average of two consecutive moving averages: the centered moving averages.

2. We calculate the seasonal variation: seasonal variation(t) = y(t) - centered moving average. Let us build Table 2.

table 2

Quarter number t    Electricity consumption y(t)    4-quarter moving average    Centered moving average    Seasonal variation estimate
1        6.0      -        -        -
2        4.4      6.1      -        -
3        5.0      6.4      6.25     -1.25
4        9.0      6.5      6.45     2.55
5        7.2      6.75     6.625    0.575
...      ...      ...      ...      ...
14       6.6      8.35     8.375    -1.775
15       7.0      -        -        -
16       10.8     -        -        -

3. Based on the seasonal variation, the seasonal component is calculated in Table 3.

Indicator            Quarter I    Quarter II    Quarter III    Quarter IV
Year 1               -            -             -1.250         2.550
Year 2               0.575        -2.075        -1.100         2.700
Year 3               0.550        -2.025        -1.475         2.875
Year 4               0.675        -1.775        -              -
Total                1.8          -5.875        -3.825         8.125
Average              0.6          -1.958        -1.275         2.708     (sum = 0.075)
Seasonal component   0.581        -1.977        -1.294         2.690     (sum = 0)

4. We eliminate the seasonal component from the initial levels of the series:
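Steps 1 through 4 are easy to reproduce with pandas. In the sketch below, the first five and last three levels are the ones from Table 2, while the middle values are hypothetical placeholders for the rows omitted there:

import pandas as pd

y = pd.Series([6.0, 4.4, 5.0, 9.0, 7.2, 4.8, 6.0, 10.0,
               8.2, 5.6, 6.4, 11.0, 9.0, 6.6, 7.0, 10.8])
quarter = pd.Series(range(16)) % 4                   # 0..3 correspond to quarters I..IV

ma4 = y.rolling(4).mean()                            # 4-quarter moving averages
centered = (ma4.shift(-1) + ma4.shift(-2)) / 2       # centered moving averages
variation = y - centered                             # seasonal variation estimates

means = variation.groupby(quarter).mean()
seasonal = means - means.mean()                      # components adjusted to sum to zero
deseasonalized = y - quarter.map(seasonal)           # step 4: remove the seasonal component
print(seasonal.round(3))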

Conclusion:

The additive model explains 98.4% of the total variation in the levels of the original time series.



Good afternoon, dear readers.
In previous articles, using practical examples, I showed how to solve classification problems (a credit-scoring problem) and the basics of text analysis (a passport problem). Today I would like to touch on another class of problems, namely regression, which is typically used for forecasting.
As an example of a forecasting problem, I took the Energy efficiency dataset from the large UCI repository. As tools, we will traditionally use Python with the pandas and scikit-learn analytic packages.

Description of the data set and problem statement

A data set is given that describes the following attributes of the room:

It contains the characteristics of the room on the basis of which the analysis will be carried out (attributes X1–X8), and the load values (Y1 and Y2) that need to be predicted.

Preliminary data analysis

First, let's load our data and look at it:

from pandas import read_csv, DataFrame
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation in older versions
import matplotlib.pyplot as plt

dataset = read_csv("EnergyEfficiency/ENB2012_data.csv", ";")
dataset.head()

X1 X2 X3 X4 X5 X6 X7 X8 Y1 Y2
0 0.98 514.5 294.0 110.25 7 2 0 0 15.55 21.33
1 0.98 514.5 294.0 110.25 7 3 0 0 15.55 21.33
2 0.98 514.5 294.0 110.25 7 4 0 0 15.55 21.33
3 0.98 514.5 294.0 110.25 7 5 0 0 15.55 21.33
4 0.90 563.5 318.5 122.50 7 2 0 0 20.84 28.28

Now let's see if any attributes are related. This can be done by calculating the correlation coefficients for all columns. How to do this was described in a previous article:

dataset.corr()

X1 X2 X3 X4 X5 X6 X7 X8 Y1 Y2
X1 1.000000e+00 -9.919015e-01 -2.037817e-01 -8.688234e-01 8.277473e-01 0.000000 1.283986e-17 1.764620e-17 0.622272 0.634339
X2 -9.919015e-01 1.000000e+00 1.955016e-01 8.807195e-01 -8.581477e-01 0.000000 1.318356e-16 -3.558613e-16 -0.658120 -0.672999
X3 -2.037817e-01 1.955016e-01 1.000000e+00 -2.923165e-01 2.809757e-01 0.000000 -7.969726e-19 0.000000e+00 0.455671 0.427117
X4 -8.688234e-01 8.807195e-01 -2.923165e-01 1.000000e+00 -9.725122e-01 0.000000 -1.381805e-16 -1.079129e-16 -0.861828 -0.862547
X5 8.277473e-01 -8.581477e-01 2.809757e-01 -9.725122e-01 1.000000e+00 0.000000 1.861418e-18 0.000000e+00 0.889431 0.895785
X6 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.000000 0.000000e+00 0.000000e+00 -0.002587 0.014290
X7 1.283986e-17 1.318356e-16 -7.969726e-19 -1.381805e-16 1.861418e-18 0.000000 1.000000e+00 2.129642e-01 0.269841 0.207505
X8 1.764620e-17 -3.558613e-16 0.000000e+00 -1.079129e-16 0.000000e+00 0.000000 2.129642e-01 1.000000e+00 0.087368 0.050525
Y1 6.222722e-01 -6.581202e-01 4.556712e-01 -8.618283e-01 8.894307e-01 -0.002587 2.698410e-01 8.736759e-02 1.000000 0.975862
Y2 6.343391e-01 -6.729989e-01 4.271170e-01 -8.625466e-01 8.957852e-01 0.014290 2.075050e-01 5.052512e-02 0.975862 1.000000

As you can see from our matrix, the following columns correlate with each other (the value of the correlation coefficient is greater than 95%):
  • y1 --> y2
  • x1 --> x2
  • x4 --> x5
Now let's choose which column of each pair we can remove from our sample: in each pair, we keep the column that has the greater impact on the predicted values Y1 and Y2 and delete the other.
As the correlation matrix shows, X2 and X5 have larger correlation values with Y1 and Y2 than X1 and X4 do, so we can remove the latter columns.

dataset = dataset.drop(["X1", "X4"], axis=1)
dataset.head()

In addition, it can be seen that the fields Y1 and Y2 correlate very closely with each other. But since we need to predict both values, we leave them "as is".

Model selection

Separate the forecast values ​​from our sample:

trg = dataset[["Y1", "Y2"]]
trn = dataset.drop(["Y1", "Y2"], axis=1)
After processing the data, you can proceed to building the model. We will use the following methods, all imported above:
  • LinearRegression (ordinary least squares);
  • RandomForestRegressor (random forest);
  • KNeighborsRegressor (nearest neighbors);
  • SVR (support vector regression);
  • LogisticRegression (logistic regression).

The theory behind these methods can be found in K.V. Vorontsov's lectures on machine learning.
We will evaluate the models using the coefficient of determination (R-square), defined as follows:

R² = 1 - D[y|x] / D[y],

where D[y|x] is the conditional variance of the dependent variable y given the factor x.
The coefficient takes values on the interval [0, 1]; the closer it is to 1, the stronger the dependence.
Well, now we can go directly to building and choosing a model. Let's put all our models in one list for the convenience of further analysis:

models = [LinearRegression(),        # ordinary least squares
          RandomForestRegressor(),   # random forest
          KNeighborsRegressor(),     # nearest neighbors
          SVR(),                     # support vector regression
          LogisticRegression()]      # logistic regression
So the models are ready; now we will split our original data into 2 subsamples: test and training. Those who have read my previous articles know that this can be done using the train_test_split() function from the scikit-learn package:

Xtrn, Xtest, Ytrn, Ytest = train_test_split(trn.values, trg.values, test_size=0.4)
Now, since we need to predict 2 parameters, we have to build a regression for each of them. In addition, for further analysis, we can record the results in a temporary DataFrame. This can be done like this:

# create temporary structures
TestModels = DataFrame()
tmp = {}
# for each model in the list
for model in models:
    # get the model name
    m = str(model)
    tmp["Model"] = m[:m.index("(")]
    # for each column of the result set
    for i in range(Ytrn.shape[1]):
        # train the model
        model.fit(Xtrn, Ytrn[:, i])
        # calculate the coefficient of determination
        tmp["R2_Y%s" % str(i + 1)] = r2_score(Ytest[:, i], model.predict(Xtest))
    # append the results to the final DataFrame
    TestModels = TestModels.append([tmp])
# index by model name
TestModels.set_index("Model", inplace=True)
As you can see from the code above, the r2_score() function is used to calculate the coefficient.
So, the data for analysis is received. Let's now build graphs and see which model showed the best result:

fig, axes = plt.subplots(ncols=2, figsize=(10, 4))
TestModels.R2_Y1.plot(ax=axes[0], kind="bar", title="R2_Y1")
TestModels.R2_Y2.plot(ax=axes[1], kind="bar", color="green", title="R2_Y2")

Analysis of results and conclusions

From the graphs above, we can conclude that the Random Forest method coped with the task better than the others: its coefficients of determination are higher than the rest for both variables.
For further analysis, let's retrain our model:

model = models[1]   # RandomForestRegressor, the second model in the list above
model.fit(Xtrn, Ytrn)
On closer examination, the question may arise why last time we split the dependent sample Ytrn into separate variables (by columns), while now we don't do that.
The fact is that some methods, such as RandomForestRegressor, can work with several predicted variables at once, while others (for example, SVR) can work with only one variable. Therefore, in the previous training step we used a column-by-column split to avoid errors while building some of the models.
Choosing a model is, of course, good, but it would also be nice to know how each factor affects the predicted value. For this, the model has the feature_importances_ property.
With it, you can see the weight of each factor in the final model:

Model.feature_importances_
array([ 0.40717901, 0.11394948, 0.34984766, 0.00751686, 0.09158358,
0.02992342])

In our case, it can be seen that the overall height and surface area affect the heating and cooling loads the most; their total contribution to the predictive model is about 72%.
It should also be noted that, with the scheme above, you can see the influence of each factor on heating and on cooling separately; but since these two targets are very closely correlated with each other, we drew a single conclusion for both of them above.

Conclusion

In this article, I tried to show the main stages of regression analysis of data using Python with the pandas and scikit-learn analytic packages.
It should be noted that the data set was deliberately chosen to be as well-formed as possible, so that the primary processing of the input data would be minimal. In my opinion, the article will be useful to those who are just starting out in data analysis, as well as to those who have a good theoretical base but are choosing tools for their work.