

FEDERAL AGENCY FOR EDUCATION

STATE EDUCATIONAL INSTITUTION

HIGHER PROFESSIONAL EDUCATION

"YUGORSK STATE UNIVERSITY"

INSTITUTE OF ADDITIONAL EDUCATION

PROFESSIONAL RETRAINING PROGRAM

"STATE AND MUNICIPAL MANAGEMENT"

ESSAY

Subject: "Statistics"

"Statistical research methods"

Prepared by:

Khanty-Mansiysk

Introduction

1. Methods of statistical research.

1.1. Statistical observation method

1.2. Summary and grouping of statistical observation materials

1.3. Absolute and relative statistical values

1.4. Variation series

1.5. Sampling method

1.6. Correlation and regression analysis

1.7. Series of dynamics

1.8. Statistical Indices

Conclusion

List of used literature


Complete and reliable statistical information is the necessary basis on which the process of economic management is based. All information of national economic significance is ultimately processed and analyzed using statistics.

It is statistical data that make it possible to determine the volume of gross domestic product and national income, to identify the main trends in the development of economic sectors, to assess the level of inflation, to analyze the state of financial and commodity markets, to study the standard of living of the population and other socio-economic phenomena and processes. Mastering statistical methodology is one of the conditions for understanding market conditions, studying trends and forecasting, and making optimal decisions at all levels of activity.

Statistical science is a branch of knowledge that studies the phenomena of social life from their quantitative side inextricably linked with their qualitative content in specific conditions of place and time. Statistical practice is the activity of collecting, accumulating, processing and analyzing digital data characterizing all phenomena in the life of society.

Speaking about statistics, it should be remembered that the figures in statistics are not abstract, but express a deep economic meaning. Every economist must be able to use statistical figures, analyze them, and be able to use them to substantiate their conclusions.

Statistical laws operate within the time and place in which they are found.

The surrounding world consists of mass phenomena. While an individual fact may be governed by chance, mass phenomena are subject to regular patterns, and the law of large numbers is used to detect these patterns.

To obtain statistical information, state and departmental statistics bodies, as well as commercial structures, conduct various kinds of statistical research. The process of statistical research includes three main stages: data collection, their summary and grouping, analysis and calculation of generalizing indicators.

The results and quality of all subsequent work largely depend on how the primary statistical material is collected, processed and grouped; violations at this stage can lead to completely erroneous conclusions.

The final, analytical stage of the study is the most complicated, time-consuming and demanding. At this stage, average indicators and distribution indicators are calculated, the structure of the population is analyzed, and the dynamics of and relationships between the studied phenomena and processes are examined.

At all stages of research, statistics uses different methods. The methods of statistics are special techniques and devices for studying mass social phenomena.

At the first stage of the study, methods of mass observation are applied and primary statistical material is collected. The main condition is mass character, because the laws of social life manifest themselves only in a sufficiently large array of data owing to the operation of the law of large numbers, i.e. in summary statistical characteristics random deviations cancel each other out.

At the second stage of the study, when the collected information is subjected to statistical processing, the grouping method is used. The use of the grouping method requires an indispensable condition - the qualitative homogeneity of the population.

At the third stage of the study, statistical information is analyzed using such methods as the method of generalizing indicators, tabular and graphical methods, methods for assessing variation, the balance method, and the index method.

Analytical work should contain elements of foresight, indicate the possible consequences of emerging situations.

The management of statistics in the country is carried out by the State Committee of the Russian Federation on Statistics. As a federal executive body, it exercises general management of statistics in the country, provides official statistical information to the President, the Government, the Federal Assembly, federal executive authorities, public and international organizations, develops statistical methodology, coordinates the statistical activities of federal and regional executive organizations, and analyzes economic and statistical information, draws up national accounts and makes balance calculations.

The system of statistical bodies in the Russian Federation is formed in accordance with the administrative-territorial division of the country. In the republics that are part of the Russian Federation, there are Republican committees. In autonomous districts, territories, regions, in Moscow and St. Petersburg, there are State Committees on Statistics.

In districts and cities there are departments (offices) of state statistics. In addition to state statistics, there is also departmental statistics (at enterprises, in departments and ministries), which serves internal needs for statistical information.

The purpose of this work is to consider statistical research methods.

1. Methods of statistical research

There is a close relationship between the science of statistics and practice: statistics uses practical data and generalizes and develops methods for conducting statistical research. In turn, the theoretical provisions of statistical science are applied in practice to solve specific management problems. Knowledge of statistics is necessary for a modern specialist in order to make decisions under stochastic conditions (when the analyzed phenomena are influenced by chance), to analyze the elements of a market economy, to collect information in view of the growing number and variety of business units, and for audit, financial management and forecasting.

To study the subject of statistics, specific techniques have been developed and applied, the totality of which forms the methodology of statistics (methods of mass observations, groupings, generalizing indicators, time series, index method, etc.). The use of specific methods in statistics is predetermined by the tasks set and depends on the nature of the initial information. At the same time, statistics is based on such dialectical categories as quantity and quality, necessity and chance, causality, regularity, individual and mass, individual and general. Statistical methods are used comprehensively (systemically). This is due to the complexity of the process of economic and statistical research, which consists of three main stages: the first is the collection of primary statistical information; the second - statistical summary and processing of primary information; the third is the generalization and interpretation of statistical information.

The general methodology for studying statistical populations consists in applying the basic principles that guide any science. These principles include the following:

1. objectivity of the studied phenomena and processes;

2. identifying the relationship and consistency in which the content of the studied factors is manifested;

3. goal setting, i.e. achievement of the set goals on the part of the researcher studying the relevant statistical data.

This is expressed in obtaining information about trends, patterns and possible consequences of the development of the processes under study. Knowledge of the patterns of development of socio-economic processes that are of interest to society is of great practical importance.

Among the features of statistical data analysis are the method of mass observation, the scientific validity of the qualitative content of groupings and its results, the calculation and analysis of generalized and generalizing indicators of the objects under study.

As for the specific methods of economic or industrial statistics, or the statistics of culture, population, national wealth, etc., each may have its own specific methods for collecting, grouping and analyzing the corresponding aggregates (sets of facts).

In economic statistics, for example, the balance method is widely used as the most common method of linking individual indicators in a single system of economic relations in social production. The methods used in economic statistics also include the compilation of groupings, the calculation of relative indicators (percentage ratio), comparisons, the calculation of various types of averages, indices, etc.

The method of connecting links consists in comparing two volumetric (quantitative) indicators on the basis of the relationship that exists between them, for example labor productivity in physical terms and hours worked, or the volume of traffic in tons and the average transportation distance in kilometers.

When analyzing the dynamics of the development of the national economy, the main method for identifying this dynamics (movement) is the index method, methods of analyzing time series.

In the statistical analysis of the main economic patterns of the development of the national economy, an important statistical method is the calculation of the closeness of relationships between indicators using correlation and dispersion analysis, etc.

In addition to these methods, mathematical and statistical methods of research have become widespread, and their use is expanding with the growing use of computers and the creation of automated systems.

Stages of statistical research:

1. Statistical observation - mass scientifically organized collection of primary information about individual units of the phenomenon under study.

2. Grouping and summary of material - generalization of observational data to obtain absolute values (accounting and estimated indicators) of the phenomenon.

3. Processing of statistical data and analysis of the results to obtain reasonable conclusions about the state of the phenomenon under study and the patterns of its development.

All stages of statistical research are closely related to each other and are equally important. The shortcomings and errors that occur at each stage affect the entire study as a whole. Therefore, the correct use of special methods of statistical science at each stage makes it possible to obtain reliable information as a result of statistical research.

Methods of statistical research:

1. Statistical observation

2. Summary and grouping of data

3. Calculation of generalizing indicators (absolute, relative and average values)

4. Statistical distributions (variation series)

5. Sampling method

6. Correlation and regression analysis

7. Series of dynamics

8. Statistical indices

The task of statistics is the calculation of statistical indicators and their analysis, thanks to which the governing bodies receive a comprehensive description of the managed object, whether it be the entire national economy or its individual sectors, enterprises and their divisions. It is impossible to manage socio-economic systems without having operational, reliable and complete statistical information.


Statistical observation is a planned, scientifically organized and, as a rule, systematic collection of data on the phenomena of social life. It is carried out by registering predetermined essential features in order to obtain generalizing characteristics of these phenomena.

For example, when conducting a population census, information about each resident of the country is recorded about his gender, age, marital status, education, etc., and then the statistical authorities determine, based on this information, the country's population, its age structure, location within the country, family composition and other indicators.

The following requirements are imposed on statistical observation: completeness of coverage of the studied population, reliability and accuracy of data, their uniformity and comparability.

Forms, types and methods of statistical observation

Statistical observation is carried out in two forms: reporting and specially organized statistical observation.

Reporting is an organizational form of statistical observation in which information is received by the statistical authorities from enterprises, institutions and organizations in the form of mandatory reports on their activities.

Reporting can be national and intradepartmental.

Nationwide - goes to the higher authorities and to the state statistics bodies. It is necessary for the purposes of generalization, control, analysis and forecasting.

Intradepartmental - used in ministries and departments for operational needs.

Reporting is approved by the State Statistics Committee of the Russian Federation. Reporting is compiled on the basis of primary accounting. The peculiarity of reporting is that it is mandatory, documented and legally confirmed by the signature of the head.

Specially organized statistical observation is observation organized for some special purpose to obtain information that is not contained in the reporting, or to verify and clarify reporting data. Examples are censuses of the population, livestock and equipment, and all kinds of one-time surveys, such as household budget surveys, opinion polls, etc.

Types of statistical observation can be grouped according to two criteria: by the nature of the registration of facts and by the coverage of population units.

By the nature of the registration of facts, statistical observation can be current (systematic) or discontinuous.

Current monitoring is a continuous accounting, for example, of production, release of material from a warehouse, etc., i.e. registration is carried out as the fact occurs.

Discontinuous monitoring can be periodic, i.e. repeating at regular intervals. For example, a livestock census on January 1 or registration of market prices on the 22nd of each month. One-time observation is organized as needed, i.e. without observance of periodicity or in general once. For example, the study of public opinion.

By coverage of population units, observation can be continuous or non-continuous.

In continuous observation, all units of the population are examined, as in a census.

In non-continuous observation, only a part of the units of the population is examined. Non-continuous observation can be divided into subtypes: selective (sample) observation, monographic observation, and the main array method.

Selective observation is an observation based on the principle of random selection. With its proper organization and conduct, selective observation provides sufficiently reliable data on the population under study. In some cases, they can replace continuous accounting, because the results of a sample observation with a well-defined probability can be extended to the entire population. For example, quality control of products, the study of livestock productivity, etc. In a market economy, the scope of selective observation is expanding.

Monographic observation is a detailed, in-depth study and description of units of the population that are characteristic in some respect. It is carried out in order to identify existing and emerging trends in the development of the phenomenon (identifying shortcomings, studying best practices, new forms of organization, etc.).

Main Array Method consists in the fact that the largest units are subjected to the survey, which, taken together, have a predominant share in the totality according to the main feature (features) for this study. So, when studying the work of markets in cities, the markets of large cities are examined, where 50% of the total population lives, and the turnover of the markets is 60% of the total turnover.

By source of information, a distinction is made between direct observation, documentary observation and surveys.

Direct observation is observation in which the registrars themselves, by measuring, weighing or counting, establish the fact and record it in the observation form.

Documentary observation involves recording answers on the basis of relevant documents.

A survey is an observation in which answers to questions are recorded from the words of the respondent, as, for example, in a population census.

In statistics, information about the phenomenon under study can be collected in various ways: reporting, expeditionary, self-calculation, questionnaire, correspondent.

The essence of the reporting method is that reports are submitted in a strictly mandatory manner.

The expeditionary method consists in specially recruited and trained workers recording information in the observation form (as in a population census).

In self-calculation (self-registration), the forms are filled in by the respondents themselves. This method is used, for example, in the study of pendulum migration (the movement of the population from the place of residence to the place of work and back).

The questionnaire method is the collection of statistical data using special questionnaires sent to a certain circle of people or published in the periodical press. This method is used very widely, especially in various sociological surveys, but it involves a large share of subjectivity.

The essence of the correspondent method is that the statistical authorities reach agreement with certain persons (voluntary correspondents) who undertake to observe particular phenomena within the established time frame and report the results to the statistical authorities. For example, expert assessments on specific issues of the country's socio-economic development are obtained in this way.

1.2. Summary and grouping of statistical observation materials

Essence and tasks of summary and grouping

A summary is an operation for processing the specific individual facts that form the population and were collected as a result of observation. As a result of the summary, the many individual indicators relating to each unit of the object of observation turn into a system of statistical tables and totals, and typical features and patterns of the phenomenon under study as a whole emerge.

According to the depth and accuracy of processing, a distinction is made between a simple and a complex summary.

A simple summary is an operation for calculating totals over the set of observation units.

A complex summary is a set of operations that includes the grouping of observation units, the calculation of totals for each group and for the object as a whole, and the presentation of the results in the form of statistical tables.

The summary process includes the following steps:

Selection of a grouping attribute;

Determining the order of group formation;

Development of a system of indicators to characterize groups and the object as a whole;

Design table layouts to present summary results.

By the form of processing, a summary can be:

Centralized (all primary material goes to one higher organization, for example, the State Statistics Committee of the Russian Federation, and is completely processed there);

Decentralized (the processing of the collected material goes in an ascending line, i.e. the material is summarized and grouped at each stage).

In practice, both forms of summary are usually combined. For example, in a census, preliminary results are obtained by means of a decentralized summary, while consolidated final results are obtained as a result of centralized processing of the census forms.

According to the execution technique, the summary is mechanized and manual.

Grouping is the division of the studied population into homogeneous groups according to certain essential features.

On the basis of the grouping method, the central tasks of the study are solved, and the correct application of other methods of statistical and statistical-mathematical analysis is ensured.

The work of grouping is complex and difficult. Grouping techniques are diverse, which is due to the variety of grouping characteristics and various research objectives. The main tasks solved with the help of groupings include:

Identification of socio-economic types;

The study of the structure of the population, structural changes in it;

Revealing the connection between phenomena and interdependence.

Grouping types

Depending on the tasks solved with the help of groupings, there are 3 types of groupings: typological, structural and analytical.

A typological grouping solves the problem of identifying socio-economic types. When constructing a grouping of this kind, the main attention should be paid to the identification of types and to the choice of the grouping feature, proceeding from the essence of the phenomenon under study.

Structural grouping solves the problem of studying the composition of individual typical groups on some basis. For example, the distribution of the resident population by age groups.

An analytical grouping makes it possible to identify the relationship between phenomena and their features, i.e. to identify the influence of some attributes (factor attributes) on others (resultant attributes). The relationship is manifested in the fact that as the factor attribute increases, the value of the resultant attribute increases or decreases. An analytical grouping is always based on the factor attribute, and each group is characterized by the average value of the resultant attribute.

For example, the dependence of the volume of retail turnover on the size of the retail space of the store. Here, the factorial (grouping) sign is the sales area, and the resultant sign is the average turnover per store.

By complexity, the grouping can be simple and complex (combined).

A simple grouping is based on one attribute, while a complex (combined) grouping is based on two or more attributes taken in combination. In this case, groups are first formed according to one (main) attribute, and then each of them is divided into subgroups according to a second attribute, and so on.

1.3. Absolute and relative statistics

Absolute statistics

The initial, primary form of expression of statistical indicators are absolute values. Absolute values characterize the size of phenomena in terms of mass, area, volume, length, time, etc.

Individual absolute indicators are obtained, as a rule, directly in the process of observation as a result of measurement, weighing, counting and evaluation. In some cases, individual absolute indicators are obtained as the difference between two values.

Summary, final volumetric absolute indicators are obtained as a result of summary and grouping.

Absolute statistical indicators are always named numbers, i.e. have units. There are 3 types of units of measurement of absolute values: natural, labor and cost.

Natural units of measurement express the magnitude of a phenomenon in physical terms, i.e. in measures of weight, volume, length, time or counts: kilograms, cubic meters, kilometers, hours, pieces, etc.

A variety of natural units are conditionally natural units of measurement, which are used to bring together several varieties of the same use value. One of them is taken as a standard, and the others are converted into units of this standard using special coefficients. For example, soap with different fatty acid content is converted to a 40% fatty acid content.

In some cases, one unit of measurement is not enough to characterize a phenomenon, and the product of two units of measurement is used.

An example is the freight turnover in ton-kilometers, the production of electricity in kilowatt-hours, etc.

In a market economy, the most important units are cost (monetary) units of measurement (ruble, dollar, mark, etc.). They make it possible to obtain a monetary assessment of any socio-economic phenomenon (volume of production, turnover, national income, etc.). However, it should be remembered that under conditions of high inflation, indicators in monetary terms become incomparable. This should be taken into account when analyzing cost indicators over time; to achieve comparability, the indicators must be recalculated into comparable prices.

Labor units of measurement (man-hours, man-days) are used to determine the cost of labor in the production of products, in the performance of some work, etc.

Relative statistical quantities, their essence and forms of expression

Relative values in statistics are quantities that express the quantitative relationship between phenomena of social life. They are obtained by dividing one value by another.

The value with which comparison is made (denominator) is called the base, the base of comparison; and the one that is compared (numerator) is called the compared, reporting or current value.

The relative value shows how many times the compared value is greater or less than the base value, or what proportion the first is from the second; and in some cases - how many units of one quantity are per unit (or per 100, per 1000, etc.) of another (basic) quantity.

As a result of comparing absolute values of the same name, abstract, unnamed relative values are obtained, showing how many times the given value is greater or less than the base value. In this case, the base value is taken as one (the result is a coefficient).

In addition to the coefficient, a widely used form of expressing relative values is the percentage (%). In this case, the base value is taken as 100 units.

Relative values can also be expressed in ppm (‰) or in prodecimille (‱). In these cases, the comparison base is taken as 1,000 and 10,000, respectively. In some cases, the comparison base can also be taken as 100,000.

Relative values can be named numbers. The name is a combination of the names of the compared and base indicators, for example, population density in people per square kilometer (how many people per 1 square kilometer).

Types of relative values

Types of relative values are subdivided depending on their content. These are the relative values of: the planned task, plan fulfilment, dynamics, structure, coordination, intensity and level of economic development, and comparison.

The relative value of the planned task is the ratio of the indicator level set for the planned period to its actual level achieved before the planned period.

The relative value of plan fulfilment is the value expressing the ratio between the actual and the planned level of the indicator.

The relative value of dynamics is the ratio of the level of an indicator in a given period to the level of the same indicator in the past.

The above three relative values are interconnected: the relative value of dynamics is equal to the product of the relative values of the planned task and of plan fulfilment.
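As a small illustration of this relationship (the figures below are hypothetical and not taken from the essay), the three relative values can be checked in a few lines of Python:

import math

# Hypothetical output of an enterprise, in thousand units.
base_actual = 200      # actually achieved in the base period
planned = 220          # planned for the reporting period
reported_actual = 231  # actually achieved in the reporting period

planned_task = planned / base_actual          # 1.10 -> the plan is 110% of the base level
plan_fulfilment = reported_actual / planned   # 1.05 -> the plan is fulfilled by 105%
dynamics = reported_actual / base_actual      # 1.155 -> growth of 15.5% over the base period

# The relative value of dynamics equals the product of the other two.
assert math.isclose(dynamics, planned_task * plan_fulfilment)
print(planned_task, plan_fulfilment, dynamics)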

The relative value of structure is the ratio of a part to the whole. It characterizes the structure and composition of a particular population.

Expressed as percentages, these values are called shares (specific weights).

The relative value of coordination is the ratio of parts of the whole to each other. It shows how many times one part is larger than the base part, what percentage of it it makes up, or how many units of the given structural part fall on 1 (or 100, 1,000, etc.) units of the base structural part.

The relative value of intensity characterizes the degree of development of the studied phenomenon or process in a different environment. It is the ratio of two interrelated but different phenomena and can be expressed as a percentage, in ppm or prodecimille, or as a named number. A variant of the relative intensity value is the indicator of the level of economic development, which characterizes output per capita.

The relative value of comparison is the ratio of identically named absolute indicators for different objects (enterprises, districts, regions, countries, etc.). It can be expressed as a coefficient or as a percentage.

Average values, their essence and types

Statistics, as you know, studies mass socio-economic phenomena. Each of these phenomena can have a different quantitative expression of the same feature. For example, the wages of the same profession of workers or the prices on the market for the same product, etc.

To study any population according to varying (quantitatively changing) characteristics, statistics uses averages.

An average value is a generalizing quantitative characteristic of a set of similar phenomena with respect to one varying attribute.

The most important property of the average value is that it represents the value of a certain attribute in the entire population as a single number, despite its quantitative differences in individual units of the population, and expresses the common thing that is inherent in all units of the population under study. Thus, through the characteristic of a unit of the population, it characterizes the entire population as a whole.

Averages are related to the law of large numbers. The essence of this connection lies in the fact that when averaging random deviations of individual values, due to the operation of the law of large numbers, they cancel each other out and in the average the main development trend, necessity, regularity is revealed, however, for this, the average must be calculated on the basis of a generalization of the mass of facts.

Average values allow the comparison of indicators relating to populations with different numbers of units.

The most important condition for the scientific use of averages in the statistical analysis of social phenomena is the homogeneity of the population for which the average is calculated. An average that is identical in form and calculation technique is fictitious under some conditions (for a heterogeneous population) and corresponds to reality under others (for a homogeneous population). The qualitative homogeneity of the population is determined on the basis of a comprehensive theoretical analysis of the essence of the phenomenon. For example, when calculating the average yield, the input data must refer to the same crop (average wheat yield) or group of crops (average cereal yield); an average cannot be calculated for heterogeneous crops.

Mathematical techniques used in various sections of statistics are directly related to the calculation of averages.

Averages in social phenomena have a relative constancy, i.e. over a certain period of time, phenomena of the same type are characterized by approximately the same averages.

Average values are closely related to the grouping method, since characterizing phenomena requires calculating not only overall averages (for the entire phenomenon) but also group averages (for typical groups of the phenomenon with respect to the attribute under study).

Types of averages

The formula by which an average is determined depends on the form in which the initial data are presented. The types of averages most commonly used in statistics are:

the arithmetic mean;

the harmonic mean;

the geometric mean;

the quadratic (root mean square) mean.
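As a sketch only (the data set is hypothetical), the four averages can be computed with Python's standard library (statistics.geometric_mean requires Python 3.8 or later):

import math
import statistics

data = [4, 5, 5, 6, 10]  # hypothetical values of a varying attribute

arithmetic = statistics.mean(data)                     # (4 + 5 + 5 + 6 + 10) / 5 = 6.0
harmonic = statistics.harmonic_mean(data)              # n / sum of reciprocals
geometric = statistics.geometric_mean(data)            # (4 * 5 * 5 * 6 * 10) ** (1/5)
quadratic = math.sqrt(sum(x * x for x in data) / len(data))  # root mean square

print(arithmetic, harmonic, geometric, quadratic)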

1.4. Variation Series

Essence and causes of variation

Information about the average levels of the studied indicators is usually insufficient for a deep analysis of the process or phenomenon being studied.

It is also necessary to take into account the spread, or variation, in the values of individual units, which is an important characteristic of the studied population. Each individual value of a trait is formed under the combined influence of many factors. Socio-economic phenomena tend to show great variation, the reasons for which lie in the essence of the phenomenon itself.

Variation measures determine how the trait values are grouped around the mean. They are used to characterize ordered statistical aggregates: groupings, classifications, distribution series. Stock prices, volumes of supply and demand, and interest rates in different periods and in different places are subject to the greatest variation.

Absolute and relative indicators of variation

By definition, variation is measured by the degree of deviation of the individual values of a trait from their average level, i.e. by the differences x - x̄. Most of the indicators used in statistics to measure the variation of a trait in a population are built on deviations from the mean.

The simplest absolute measure of variation is the range of variation R = x_max - x_min. The range of variation is expressed in the same units of measurement as x. It depends only on the two extreme values of the trait and therefore does not sufficiently characterize its variability.

Absolute rates of variation depend on the units of measure of the trait and make it difficult to compare two or more different variation series.

Relative measures of variation are calculated as the ratio of various absolute indicators of variation to the arithmetic mean. The most common of these is the coefficient of variation.

The coefficient of variation characterizes the variability of the trait relative to the mean. Values of up to 10% are considered very good, up to 50% acceptable, and over 50% poor. If the coefficient of variation does not exceed 33%, the population can be considered homogeneous with respect to the trait under consideration.
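A minimal sketch of these measures, computed for a small hypothetical sample with Python's standard library:

import statistics

values = [12, 15, 11, 14, 13, 16, 12]  # hypothetical trait values

mean = statistics.mean(values)
variation_range = max(values) - min(values)   # R = x_max - x_min
std_dev = statistics.pstdev(values)           # population standard deviation
coef_variation = std_dev / mean * 100         # coefficient of variation, in %

# A coefficient of variation below 33% suggests a homogeneous population.
print(variation_range, round(std_dev, 2), round(coef_variation, 1))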

1.5. Sampling method

The essence of the sampling method is to judge the numerical characteristics of the whole (general population) by the properties of a part (sample), by individual groups of options for their total population, which is sometimes thought of as a collection of an infinitely large volume. The basis of the sampling method is the internal connection that exists in populations between the individual and the general, the part and the whole.

The sampling method has obvious advantages over a continuous study of the general population, since it reduces the amount of work (by reducing the number of observations), allows you to save effort and money, obtain information about such populations, a complete examination of which is practically impossible or impractical.

Experience has shown that a correctly made sample represents, or is representative of (from the Latin represento, "I represent"), the structure and state of the general population quite well. However, as a rule, there is no complete coincidence between sample data and the results of processing the entire general population. This is the disadvantage of the sampling method, against which the advantages of a continuous description of the general population stand out.

Since the sample does not fully reproduce the statistical characteristics (parameters) of the general population, an important task arises for the researcher: firstly, to take into account and observe the conditions under which the sample best represents the general population, and secondly, in each specific case, to establish with what certainty the results of the sample observation can be transferred to the entire population from which the sample was taken.

The representativeness of the sample depends on a number of conditions and, above all, on how it is carried out: either systematically (i.e. according to a pre-planned scheme) or by unplanned selection of units from the general population. In any case, the sample should be typical and completely objective. These requirements must be met strictly, as the most essential conditions for the representativeness of the sample. Before processing the sample material, it must be carefully checked and freed from everything superfluous that violates the conditions of representativeness. At the same time, when forming a sample, one cannot act arbitrarily, including only those units that seem typical and rejecting all the rest. A good sample must be objective, that is, it must be made without biased motives and with the exclusion of subjective influences on its composition. The fulfilment of this condition of representativeness corresponds to the principle of randomization (from the English random), or random selection of units from the general population.

This principle underlies the theory of the sampling method and must be observed in all cases of the formation of a representative sample, not excluding cases of planned or deliberate selection.

There are various selection methods. Depending on the selection method, the following types of samples are distinguished:

Random sample with return;

Random sampling without return;

Mechanical;

Typical;

Serial.

Consider the formation of random samples with and without return. If the sample is made from a mass of products (for example, from a box), then after thorough mixing, objects should be taken randomly, that is, so that they all have the same probability of being included in the sample. Often, to form a random sample, the elements of the general population are pre-numbered, and each number is recorded on a separate card. The result is a pack of cards, the number of which coincides with the size of the general population. After thorough mixing, one card is taken from this pack. An object that has the same number with a card is considered to be in the sample. In this case, two fundamentally different ways of forming a sample population are possible.

The first way - the card taken out after fixing its number is returned to the pack, after which the cards are thoroughly mixed again. By repeating such samples on one card, it is possible to form a sample of any size. The sample set formed according to this scheme is called a random sample with a return.

The second way - each card taken out after its recording is not returned back. By repeating the sample according to this scheme for one card, you can get a sample of any given size. The sample set formed according to this scheme is called a random sample without a return. A random sample without return is formed if the required number of cards is taken from a thoroughly mixed pack at once.

However, with a large general population, the method of forming a random sample with or without return described above turns out to be very laborious. In this case, tables of random numbers are used, in which the numbers are arranged in random order. To select, for example, 50 objects from a numbered general population, one opens any page of a table of random numbers and writes out 50 random numbers in a row; the sample includes those objects whose numbers coincide with the random numbers written out. If a random number in the table turns out to be greater than the size of the general population, that number is skipped.

Note that the distinction between random samples with and without reversal is blurred if they are an insignificant part of a large population.
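As a small illustration of the two schemes (the numbered population below is hypothetical), Python's standard random module can draw both kinds of samples:

import random

population = list(range(1, 101))  # hypothetical general population of 100 numbered objects

# Random sample WITH return: an object may be drawn more than once.
sample_with_return = random.choices(population, k=10)

# Random sample WITHOUT return: every object appears at most once.
sample_without_return = random.sample(population, k=10)

print(sample_with_return)
print(sample_without_return)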

With the mechanical method of forming a sample population, the elements of the general population to be surveyed are selected at a certain interval. So, for example, if the sample should be 50% of the general population, then every second element of the general population is selected. If the sample is ten percent, then every tenth element is selected, and so on.

It should be noted that sometimes mechanical selection may not provide a representative sample. For example, if every twelfth turning roller is selected, and immediately after the selection, the cutter is replaced, then all the rollers turned with blunt cutters will be selected. In this case, it is necessary to eliminate the coincidence of the selection rhythm with the rhythm of the replacement of the cutter, for which at least every tenth roller out of twelve turned ones should be selected.
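A sketch of mechanical (systematic) selection, taking every tenth element of a hypothetical numbered population for a ten percent sample:

population = list(range(1, 101))  # hypothetical population of 100 units

k = 10                            # a ten percent sample -> every tenth element
mechanical_sample = population[k - 1::k]

print(mechanical_sample)          # [10, 20, 30, ..., 100]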

With a large number of homogeneous products produced, when various machines and even workshops take part in its manufacture, a typical selection method is used to form a representative sample. In this case, the general population is preliminarily divided into non-overlapping groups. Then, from each group, according to the scheme of random sampling with or without return, a certain number of elements are selected. They form a sample set, which is called typical.

Let, for example, selectively examine the products of a workshop in which there are 10 machines that produce the same products. Using a random sampling scheme with or without return, products are selected, first from products made on the first, then on the second, etc. machines. This method of selection allows you to form a typical sample.
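A sketch of typical selection, assuming the population is already divided into non-overlapping groups (here, hypothetical lists of products made on three machines); from each group a random sample without return is drawn:

import random

# Hypothetical non-overlapping groups: product numbers from each of three machines.
groups = {
    "machine_1": list(range(1, 41)),
    "machine_2": list(range(41, 81)),
    "machine_3": list(range(81, 121)),
}

typical_sample = []
for name, items in groups.items():
    typical_sample.extend(random.sample(items, k=5))  # 5 products per machine, without return

print(typical_sample)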

Sometimes in practice it is advisable to use a serial selection method, the idea of which is that the general population is divided into a certain number of non-overlapping series, the series are chosen according to a random sampling scheme with or without return, and all elements of the selected series are then inspected. For example, if products are manufactured by a large group of automatic machines, then the products of only a few machines are subjected to a continuous examination. Serial selection is used if the examined trait fluctuates little from series to series.

Which method of selection should be preferred in a given situation should be judged on the basis of the requirements of the task and the conditions of production. Note that in practice, when compiling a sample, several methods of selection are often used simultaneously in combination.

1.6. Correlation and regression analysis

Regression and correlation analyzes are powerful methods that allow you to analyze large amounts of information in order to investigate the likely relationship of two or more variables.

The tasks of correlation analysis are reduced to measuring the closeness of the known relationship between varying features, identifying unknown causal relationships (whose causal nature must be clarified with the help of theoretical analysis), and evaluating the factors that have the greatest influence on the resultant feature.

The tasks of regression analysis are the choice of the type of model (form of the relationship), establishing the degree of influence of the independent variables on the dependent variable, and determining the calculated values of the dependent variable (the regression function).

The solution of all these problems leads to the need for the integrated use of these methods.
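A compact sketch of both tasks using NumPy, with hypothetical paired observations (for example, a factor x and a resultant feature y):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical factor values
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 7.0])   # hypothetical resultant values

# Correlation analysis: closeness of the relationship between x and y.
r = np.corrcoef(x, y)[0, 1]

# Regression analysis: fit a linear model y = a + b*x by least squares.
b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x                               # calculated values of the dependent variable

print(round(r, 3), round(a, 3), round(b, 3))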

1.7. Series of dynamics

The concept of time series and types of time series

A series of dynamics (time series) is a series of statistical indicators arranged sequentially in time which, in their changes, reflect the course of development of the phenomenon under study.

A series of dynamics consists of two elements: the moment or period of time to which the data refer, and the statistical indicators (levels). Together they form the members of the series. The levels of the series are usually denoted by "y" and the time period by "t".

Depending on the length of time to which the levels of the series refer, series of dynamics are divided into moment and interval series.

In a moment series, each level characterizes the phenomenon at a point in time, for example, the number of household deposits in institutions of the Savings Bank of the Russian Federation at the end of the year.

In an interval series of dynamics, each level characterizes the phenomenon over a period of time, for example, watch production in Russia by year.

In the interval series of dynamics, the levels of the series can be summed up and the total value for a series of successive periods can be obtained. In moment series, this sum does not make sense.
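A small sketch contrasting the two kinds of series with hypothetical data: the levels of an interval series (output per year) may be summed, whereas summing a moment series (deposits at year-end) has no economic meaning; chain growth rates are one common way to measure the speed of development mentioned below:

# Hypothetical interval series: output per year; the sum over years is meaningful.
production_by_year = {2001: 120, 2002: 135, 2003: 150}
total_output = sum(production_by_year.values())   # combined output for 2001-2003

# Hypothetical moment series: deposits at the end of each year; the sum of the
# levels has no meaning, but chain growth rates between adjacent levels do.
deposits_year_end = [400, 440, 506]
chain_growth = [deposits_year_end[i] / deposits_year_end[i - 1]
                for i in range(1, len(deposits_year_end))]

print(total_output, [round(g, 2) for g in chain_growth])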

Depending on the way the levels are expressed, series of dynamics of absolute values, relative values and average values are distinguished.

Time series can be with equal and unequal intervals. The concept of interval in moment and interval series is different. The interval of a moment series is the period of time from one date to another date for which the data is given. If this is data on the number of deposits at the end of the year, then the interval is from the end of one year to the end of another year. The interval of the interval series is the period of time for which the data are summarized. If this is the production of watches by years, then the interval is one year.

The interval of the series can be equal and unequal both in the moment and in the interval series of dynamics.

With the help of time series one can determine the speed and intensity of the development of phenomena, identify the main trend in their development, isolate seasonal fluctuations, compare the development of individual indicators in different countries over time, and identify relationships between phenomena that develop over time.

1.8. Statistical Indices

The concept of indices

The word "index" is Latin and means "indicator", "pointer". In statistics, an index is understood as a generalizing quantitative indicator that expresses the ratio of two sets consisting of elements that are not directly summable. For example, the volume of production of an enterprise in physical terms cannot be summed up (except for a homogeneous one), but this is necessary for a generalizing characteristic of the volume. It is impossible to summarize the prices for certain types of products, etc. Indices are used to generalize the characteristics of such aggregates in dynamics, in space and in comparison with the plan. In addition to the summary characteristics of phenomena, indices make it possible to assess the role of individual factors in changing a complex phenomenon. Indexes are also used to identify structural shifts in the national economy.

Indices are calculated both for a complex phenomenon (general or summary) and for its individual elements (individual indices).

In indices characterizing the change in a phenomenon over time, a distinction is made between the base and reporting (current) periods. Basic period - this is the period of time to which the value, taken as the basis of comparison, refers. It is denoted by the subscript "0". Reporting period is the period of time to which the value being compared belongs. It is denoted by a subscript "1".

Individual indices are ordinary relative values.

A composite (general) index characterizes the change in an entire complex population as a whole, i.e. a population consisting of elements that cannot be directly summed. Therefore, in order to calculate such an index, the non-summability of the elements of the population must be overcome.

This is achieved by introducing an additional indicator (component). The composite index consists of two elements: indexed value and weight.

The indexed value is the indicator for which the index is calculated. The weight (commensurator) is an additional indicator introduced in order to make the indexed values commensurable. In a composite index, the numerator and denominator are always complex aggregates expressed as sums of products of the indexed value and the weight.
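As an illustration of the aggregate form (assuming, as one common convention, that reporting-period quantities serve as the weights for a price index), a short sketch with hypothetical prices and quantities for two non-summable products:

# Hypothetical data for two products that cannot be summed directly.
p0 = [10.0, 40.0]   # base-period prices
p1 = [12.0, 44.0]   # reporting-period prices
q1 = [100, 30]      # reporting-period quantities used as weights

# Aggregate price index: sum(p1 * q1) / sum(p0 * q1).
numerator = sum(p * q for p, q in zip(p1, q1))
denominator = sum(p * q for p, q in zip(p0, q1))
price_index = numerator / denominator

print(round(price_index, 3))   # 1.145 -> prices rose by about 14.5% on average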

Depending on the object of study, both general and individual indices are divided into indices of volumetric (quantitative) indicators (physical volume of production, sown area, number of workers, etc.) and indices of qualitative indicators (prices, cost, labor productivity, wages, etc.).

Depending on the base of comparison, individual and general indices can be chain or base indices.

Depending on the calculation methodology, general indices have two forms: the aggregate form and the average form of the index.

Conclusion

Properly conducted data collection, analysis and statistical calculations make it possible to provide interested bodies and the public with information on the development of the economy and the direction of its development, to show the efficiency of resource use, to take account of the employment of the population and its ability to work, and to determine the rate of price growth and the impact of trade on the market as a whole or on a particular sphere.

List of used literature

1. Glinsky V.V., Ionin V.G. Statistical analysis: Textbook. Moscow: FILIN, 1998. 264 p.

2. Eliseeva I.I., Yuzbashev M.M. General theory of statistics: Textbook. Moscow: Finance and Statistics, 1995. 368 p.

3. Efimova M.R., Petrova E.V., Rumyantsev V.N. General theory of statistics: Textbook. Moscow: INFRA-M, 1996. 416 p.

4. Kostina L.V. Technique for constructing statistical graphs: Methodological guide. Kazan: TISBI, 2000. 49 p.

5. Course of socio-economic statistics: Textbook / ed. prof. M.G. Nazarova. Moscow: Finstatinform, UNITI-DIANA, 2000. 771 p.

6. General theory of statistics: statistical methodology in the study of commercial activity: Textbook / ed. A.A. Spirina, O.E. Bashenoy. Moscow: Finance and Statistics, 1994. 296 p.

7. Kharchenko L.P., Dolzhenkova V.G., Ionin V.G. et al. Statistics: a course of lectures. Novosibirsk: NGAEiU; Moscow: INFRA-M, 1997. 310 p.

8. Statistical dictionary / ch. ed. M.A. Korolev. Moscow: Finance and Statistics, 1989. 623 p.

9. Theory of statistics: Textbook / ed. prof. R.A. Shmoylova. Moscow: Finance and Statistics, 1996. 464 p.

Observation, as the initial stage of a study, is associated with the collection of initial data on the issue under study. It is characteristic of many sciences; however, each science has its own specifics, and its observations differ accordingly. Therefore, not every observation is statistical.

A statistical study is the collection, summary and analysis of data (facts) on socio-economic, demographic and other phenomena and processes of public life in the state, scientifically organized according to a single program, with registration of their most significant features in accounting documentation.

Distinctive features (specifics) of statistical research are: purposefulness, organization, mass character, consistency (complexity), comparability, documentation, controllability, practicality.

In general, a statistical study should:

  • To have a socially useful goal and universal (state) significance;
  • Relate to the specific conditions of its place and time;
  • Express the statistical type of record-keeping (as opposed to bookkeeping or operational records);
  • Carried out according to a pre-developed program with its scientifically based methodological and other support;
  • To carry out the collection of mass data (facts), which reflect the entire set of causal and other factors that characterize the phenomenon in many ways;
  • Register in the form of accounting documents of the established form;
  • Guarantee the absence of observational errors or reduce them to the minimum possible;
  • Provide for certain quality criteria and ways to control the collected data, ensuring their reliability, completeness and content;
  • Focus on cost-effective technology for collecting and processing data;
  • To be a reliable information base for all subsequent stages of statistical research and all users of statistical information.

Studies that do not meet these requirements are not statistical. The following, for example, are not statistical studies: a mother observing her playing child (a personal matter); spectators at a theatrical performance (no accounting documentation of the spectacle); a researcher conducting physical and chemical experiments with measurements, calculations and documentary registration (not mass, public data); a doctor keeping medical records of patients (operational records); an accountant tracking the movement of funds in the enterprise's bank account (bookkeeping); journalists covering the public and private life of government officials or other celebrities (not the subject of statistics).

A statistical population is a set of units possessing mass character, typicality, qualitative uniformity and variation.

The statistical population consists of materially existing objects (employees, enterprises, countries, regions) and is the object of statistical research.


Statistical observation is the first stage of statistical research, which is a scientifically organized collection of data on the studied phenomena and processes of social life.

Stages of statistical research

Any statistical study consists of the following successive stages.

Stage 1. Statistical research begins with the formation of a primary statistical information base for the selected set of indicators.
  • Conducting one's own statistical observations and surveys.
  • Use of official state and corporate (branded) sources.
  • Use of scientific statistical research in journals, newspapers, monographs, etc.
  • Use of electronic media (Internet, CD, floppy disks, etc.).
Stage 2. Primary generalization and grouping of statistical data.
  • Construction of distribution series, cumulates (ogives), and graphs of frequency distributions (relative frequencies).
  • Formation of time series and their primary analysis. Graphical forecasting (with "optimistic", "pessimistic" and "realistic" variants).
  • Calculation of moments of the k-th order (means, variances, measures of skewness and kurtosis) in order to determine the indicators of the center of the distribution, of skewness (asymmetry) and of kurtosis (peakedness).
  • Formation and primary calculations of complex statistical indicators (relative, summary multilevel).
  • Formation and primary calculations of index indicators.
Stage 3. The next stage of statistical research includes the economic interpretation of the primary generalization.
  • Economic and financial evaluation of the object of analysis.
  • Assessment of whether the economic and financial situation gives cause for concern or satisfaction.
  • Warnings about the approach of threshold statistical values in applied, as a rule macroeconomic, problems.
  • Dissemination of the primary statistical generalization of the applied results obtained along the hierarchy of government, partnership and business.
Stage 4. Computer analysis of primary and generalized extended (volumetric) statistical data.
  • Variation Analysis of Extended Statistical Data.
  • Analysis of the dynamics of extended statistical data.
  • Analysis of extended statistical data links.
  • Multidimensional summaries and groupings.
Stage 5. Computer forecasting in the selected most important areas.
  • Method of Least Squares (LSM).
  • Moving averages.
  • Technical analysis.
  • Summary analysis of the forecast variants, with recommendations for adjusting management and investment decisions.
Stage 6. Generalized analysis of the obtained results and checking them for reliability according to statistical criteria.
Stage 7. The final stage of the statistical study is the adoption of decisions on the basis of the results obtained.
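A minimal sketch of the two forecasting techniques named in Stage 5 (moving averages and the method of least squares), applied to a short hypothetical series of levels:

import numpy as np

series = np.array([100.0, 104.0, 109.0, 115.0, 118.0, 124.0])  # hypothetical levels y_t
t = np.arange(len(series))

# Moving average with a window of 3, to smooth the series.
window = 3
moving_avg = np.convolve(series, np.ones(window) / window, mode="valid")

# Least squares (LSM): fit a linear trend y = a + b*t and extrapolate one step ahead.
b, a = np.polyfit(t, series, deg=1)
forecast_next = a + b * len(series)

print(moving_avg.round(1), round(forecast_next, 1))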

There are five main types of statistical analysis used in marketing research: descriptive analysis, inferential analysis, difference analysis, relationship analysis and predictive analysis. Sometimes these types of analysis are used separately, sometimes together.

Descriptive analysis is based on the use of such statistical measures as the average value (mean), mode, standard deviation, range or amplitude of variation.

Analysis, which is based on the use of statistical procedures (for example, hypothesis testing) in order to generalize the results obtained to the entire population, is called inferential analysis.

Difference analysis is used to compare the results of a study of two groups (two market segments) to determine the degree of real difference in their behavior, in response to the same advertisement, etc.

Relationship analysis is aimed at determining the systematic relationships (their directionality and strength) of variables. For example, determining how an increase in advertising costs affects the increase in sales.

Predictive analysis is used to predict future developments, for example through time series analysis. Statistical forecasting methods are discussed in Section 7.

Descriptive analysis tools

To describe information obtained from sample measurements, two groups of measures are widely used. The first includes "central tendency" measures, or measures that describe a typical respondent or a typical response. The second includes measures of variation, or measures that describe the degree to which respondents or responses are similar or dissimilar to "typical" respondents or responses.

There are other descriptive measures, such as measures of asymmetry (how far the found distribution curves differ from normal distribution curves). However, they are not used as often as the above, and are not of particular interest to the customer.

Only a brief description of these measures is given below. More detailed information can be obtained from textbooks on mathematical statistics.

Measures of central tendency include mode, median, and mean.

The mode characterizes the value of the feature that occurs most often compared with other values of that feature. The mode is relative: it is not necessary that the majority of respondents indicate exactly this value of the feature.

The median characterizes the value of the attribute, which occupies a middle place in an ordered series of values ​​of this attribute.

The third measure of central tendency is the mean, which is most often calculated as the arithmetic mean. When it is calculated, the total volume of the attribute is equally distributed among all units of the population.

It can be seen that the information content of the mean is greater than that of the median, and that of the median is greater than that of the mode.
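
As a minimal illustration (not part of the original text, with made-up survey responses), the three measures of central tendency can be computed with Python's standard library:

    import statistics

    # Hypothetical set of responses (e.g., number of purchases per month)
    responses = [2, 3, 3, 4, 5, 3, 6, 2, 3, 7]

    print("mode  :", statistics.mode(responses))    # most frequent value
    print("median:", statistics.median(responses))  # middle value of the ordered series
    print("mean  :", statistics.mean(responses))    # arithmetic mean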

However, the considered measures do not characterize the variation of answers to a certain question or, in other words, the dissimilarity, difference of respondents or measured characteristics. Obviously, in addition to knowing the values ​​of the measures of the central tendency, it is important to establish how close the rest of the obtained estimates are to these values. Three measures of variation are commonly used: frequency distribution, range of variation, and standard deviation.

The frequency distribution presents, in tabular or graphical form, the number of occurrences of each value of the measured characteristic (attribute) within each selected range of its values. The frequency distribution allows one to quickly draw conclusions about how the measurement results are distributed.

The range of variation is the absolute difference between the maximum and minimum values of the measured feature. In other words, it is the difference between the endpoints in the distribution of the ordered values of the measured trait. This measure defines the interval over which the characteristic's values are spread.

The standard deviation is a generalizing statistical characteristic of the variation of the trait's values. If this measure is small, the distribution curve has a narrow, compressed form (the measurement results have a high degree of similarity); if the measure is large, the distribution curve has a wide, stretched form (the degree of difference between estimates is large).
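
A short sketch, with the same illustrative data, showing how the three measures of variation described above could be computed:

    import statistics
    from collections import Counter

    responses = [2, 3, 3, 4, 5, 3, 6, 2, 3, 7]

    # Frequency distribution: how often each value occurs
    print("frequencies       :", dict(Counter(responses)))

    # Range of variation: difference between the maximum and minimum values
    print("range of variation:", max(responses) - min(responses))

    # Standard deviation (sample formula, n - 1 in the denominator)
    print("standard deviation:", round(statistics.stdev(responses), 2))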

It was previously noted that the choice of measurement scale, and hence the type of questions in the questionnaire, predetermines the amount of information received. Similarly, the amount of information obtained by using the measures discussed above differs. The general rule is that statistical measures yield more information when the most informative measurement scales are used, so the choice of measurement scale predetermines the choice of statistical measures. For example, one of the questions in a demographic survey that used a nominal scale concerned nationality. Russians were assigned code 1, Ukrainians 2, Tatars 3, and so on. In this case one can, of course, calculate an average value, but how is one to interpret an average nationality equal to, say, 5.67? To calculate averages, an interval or ratio scale must be used. In this example, however, the mode can be used.

As for the measures of variation, the frequency distribution is applied on the nominal scale, the cumulative frequency distribution on the ordinal scale, and the standard deviation on the interval and ratio scales.

Statistical inference

Inference is a type of logical analysis aimed at obtaining general conclusions about the entire population based on observations of a small group of units in this population.

Conclusions are drawn on the basis of the analysis of a small number of facts. For example, if two of your friends who have the same brand of car complain about its quality, then you can conclude that the quality of this brand of car as a whole is low.

Statistical inference is based on a statistical analysis of the results of sample studies and is aimed at assessing the parameters of the population as a whole. In this case, the results of selective studies are only the starting point for obtaining general conclusions.

For example, an automaker conducted two independent surveys to measure customer satisfaction with their cars. The first sample included 100 consumers who bought this model within the last six months. The second sample included 1000 consumers. During telephone interviews, respondents answered the question: “Are you satisfied or not satisfied with the car model you bought?” The first survey revealed 30% of the dissatisfied, the second - 35%.

Since there are sampling errors in both the first and second cases, the following conclusion can be drawn. For the first case: about 30% of respondents expressed dissatisfaction with the purchased car model. For the second case, about 35% of respondents expressed dissatisfaction with the purchased car model. What general conclusion can be drawn in this case? How to get rid of the term "about"? To do this, we introduce an error indicator: 30% ± x% and 35% ± y% and compare x and y. Using logical analysis, we can conclude that a large sample contains a smaller error and that on its basis it is possible to draw more correct conclusions about the opinion of the entire population of consumers. It can be seen that the sample size is the decisive factor for obtaining correct conclusions. This indicator is present in all formulas that determine the content of various methods of statistical inference.

When conducting marketing research, the following methods of statistical inference are most often used: parameter estimation and hypothesis testing.

Parameter estimation for a population is the process of determining, on the basis of sample data, the interval in which one of the parameters of the population, such as the mean, is located. For this purpose the following statistics are used: the sample mean, the standard error, and the desired confidence level (usually 95% or 99%).

Their role in parameter estimation will be discussed below.

The mean square error is, as noted above, a measure of variation in the sample distribution under the theoretical assumption that many independent samples of the same general population were studied.

It is determined by the following formula:

s_x = s / √n,

where s_x is the standard error of the sample mean;

s is the standard deviation from the mean value in the sample;

n is the sample size.

If percentage measures are used, expressing the alternative variability of qualitative traits, then

s_p = √(p · q / n),

where s_p is the standard error of the sample percentage;

p is the percentage of respondents in the sample who supported the first alternative;

q = 100 - p is the percentage of respondents in the sample who supported the second alternative;

n is the sample size.

It can be seen that the sampling error is larger the greater the variation, and smaller the larger the sample size.
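
Under these definitions, the two standard-error formulas can be written as a small helper sketch (the numbers are only illustrative and match the examples used later in the text):

    import math

    def se_mean(s, n):
        """Standard error of the sample mean: s / sqrt(n)."""
        return s / math.sqrt(n)

    def se_percent(p, n):
        """Standard error of a sample percentage: sqrt(p * (100 - p) / n)."""
        q = 100 - p
        return math.sqrt(p * q / n)

    print(se_mean(20, 100))      # 2.0  (s = 20, n = 100)
    print(se_percent(50, 100))   # 5.0  (p = 50 %, n = 100)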

Since there is always a sampling error, it is necessary to estimate the spread of the values of the studied parameter of the general population. Suppose the researcher has chosen a confidence level of 99%. It follows from the properties of the normal distribution curve that the value Z = ±2.58 corresponds to it. The mean for the general population as a whole is then estimated by the formula

x̄ ± Z · s_x.

If percentages are used, then

p ± Z · s_p.

This means that if you want the range of estimates at 99% confidence to include the true value for the population, you multiply the standard error by 2.58 and add the result to the sample percentage p to obtain the upper bound; subtracting this product gives the lower bound.

How do these formulas relate to statistical inference?

Since the population parameter is being estimated, the range in which the true value of the population parameter falls is indicated here. For this purpose, a statistical measure of the central tendency, the magnitude of the variance and the sample size are taken for the sample. Next, an assumption is made about the level of confidence and the range of dispersion of the parameter for the general population is calculated.

For example, for the sample members (100 readers of a newspaper) it was found that the average reading time of the newspaper is 45 minutes with a standard deviation of 20 minutes, so the standard error of the mean is 20 / √100 = 2 minutes. With a confidence level of 95% we get

45 ± 1.96 · 2 = 45 ± 3.9, i.e. roughly from 41 to 49 minutes.

At a 99% confidence level, we get

45 ± 2.58 · 2 = 45 ± 5.2, i.e. roughly from 40 to 50 minutes.

It can be seen that the confidence interval is wider for 99% compared to the 95% confidence level.

If percentages are used and it turned out that, out of a sample of 100 people, 50% of the respondents drink coffee in the morning, then at a confidence level of 99% we get the following range of estimates: the standard error is √(50 · 50 / 100) = 5, so the interval is 50% ± 2.58 · 5 = 50% ± 12.9%, i.e. roughly from 37% to 63%.
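
The two worked examples above (the newspaper-reading time and the share of coffee drinkers) can be reproduced with a short sketch; the Z values 1.96 and 2.58 correspond to the 95% and 99% confidence levels:

    import math

    def ci_mean(mean, s, n, z):
        se = s / math.sqrt(n)
        return mean - z * se, mean + z * se

    def ci_percent(p, n, z):
        se = math.sqrt(p * (100 - p) / n)
        return p - z * se, p + z * se

    print(ci_mean(45, 20, 100, 1.96))   # about (41.1, 48.9) minutes, 95 %
    print(ci_mean(45, 20, 100, 2.58))   # about (39.8, 50.2) minutes, 99 %
    print(ci_percent(50, 100, 2.58))    # about (37.1, 62.9) %, 99 %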

Thus, the logic of statistical inference is aimed at obtaining final conclusions about the studied parameter of the general population on the basis of a sample study carried out according to the laws of mathematical statistics. If a simple conclusion not based on statistical measurements is used, the final conclusions are subjective; on the basis of the same facts, different specialists may draw different conclusions.

When using statistical inference, formulas are used that are objective in nature, based on generally accepted statistical concepts. As a result, the final conclusions are much more objective.

In some cases, judgments are made regarding some parameter of the general population (the value of the mean, variance, the nature of the distribution, the form and closeness of the relationship between variables) based only on some assumptions, reflections, intuition, incomplete knowledge. Such judgments are called hypotheses.

A statistical hypothesis is an assumption about a property of the population that can be tested based on sample data.

Hypothesis testing refers to the statistical procedure used to confirm or reject a hypothesis on the basis of the results of sample studies. Hypothesis testing is carried out by assessing the consistency of the empirical data with the hypothetical ones. If the discrepancy between the compared values does not go beyond the limits of random error, the hypothesis is accepted. No conclusion is thereby drawn about the correctness of the hypothesis itself; the matter concerns only the consistency of the compared data.

Hypothesis testing is carried out in five stages:

1. Some assumption is made about some characteristic of the general population, for example, about the average value of a certain parameter.

2. A random sample is formed, a selective study is carried out and statistical indicators of the sample are determined.

3. The hypothetical and statistical values ​​of the studied characteristic are compared.

4. It is determined whether or not the results of the sample study correspond to the accepted hypothesis.

5. If the results of the sample study do not confirm the hypothesis, the latter is revised - it must correspond to the data of the sample study.

Due to the variation in the results of sample studies, it is impossible to draw an absolutely accurate conclusion about the reliability of the hypothesis by conducting a simple arithmetic comparison of the values ​​of the characteristics. Therefore, statistical hypothesis testing involves the use of: the sample value of the characteristic, the standard deviation, the desired level of confidence, and the hypothetical value of the characteristic for the population as a whole.

The following formula is used to test hypotheses about means:

z = (x̄ - μ_H) / s_x,     (4.2)

where x̄ is the sample mean, μ_H is the hypothetical mean for the population, and s_x = s / √n is the standard error of the sample mean.

For example, when advertising a college sales training program, the program manager estimated that graduates of the program were earning an average of $1,750 per month. So the hypothetical population average is $1,750. To test this hypothesis, a telephone survey of sales agents of different firms was conducted.

The sample was 100 people, the sample mean was $1,800, and the standard deviation was $350. The question arises whether the difference ($50) between the hypothetical salary and its sample mean is large. We carry out the calculation according to formula (4.2): the standard error of the mean is 350 / √100 = $35, and z = (1800 - 1750) / 35 = 1.43.

It can be seen that the standard error of the mean was $35, and the quotient of 50 divided by 35 is 1.43 (the normalized deviation), which is less than ±1.96, the value corresponding to the 95% confidence level. In this case the proposed hypothesis can be considered reliable.
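
The same calculation, written out as a sketch (the salary figures are taken from the example above):

    import math

    def z_test_mean(sample_mean, hyp_mean, s, n):
        """Normalized deviation of the sample mean from the hypothetical mean."""
        se = s / math.sqrt(n)            # standard error of the mean
        return (sample_mean - hyp_mean) / se

    z = z_test_mean(1800, 1750, 350, 100)
    print(round(z, 2))                   # 1.43 < 1.96, the hypothesis is not rejected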

When using a percentage measure, the hypothesis is tested as follows. Suppose that, based on his own experience, a motorist hypothesized that only 10% of motorists use seat belts. However, a national sample study of 1,000 motorists showed that 80% of them use seat belts. The calculation in this case is carried out as follows:

z = (p - π_H) / s_p,

where p is the percentage found in the sample study;

π_H is the hypothetical percentage;

s_p = √(p · q / n) is the standard error of the sample percentage; here s_p = √(80 · 20 / 1000) ≈ 1.26.

It can be seen that the value found (80%) differs from the initial hypothesis by 55.3 standard errors, far beyond any reasonable critical value, so the hypothesis cannot be considered reliable.
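
The seat-belt example can be checked in the same way; the standard error here is computed from the sample percentage:

    import math

    def z_test_percent(p, hyp_percent, n):
        """Normalized deviation of a sample percentage from a hypothetical one."""
        se = math.sqrt(p * (100 - p) / n)   # standard error of the percentage
        return (p - hyp_percent) / se

    z = z_test_percent(80, 10, 1000)
    print(round(z, 1))                      # about 55.3 >> 2.58, hypothesis rejected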

In some cases it is advisable to use directional hypotheses. A directional hypothesis specifies the direction of the possible values of a population parameter, for example, that the salary is more than $1,750. In this case only one side of the distribution curve is used, which is reflected in the use of the "+" or "-" sign in the calculation formulas.

More detailed information on this issue can be found in the specialized literature.

Here, however, a question arises. If sample studies can be conducted, why put forward hypotheses at all? Processing the results of sample studies makes it possible to obtain mean values and their statistical characteristics without putting forward any hypotheses. Therefore, hypothesis testing is used mainly in cases where it is impossible or extremely difficult to conduct full-scale studies and when the results of several studies (for different groups of respondents or conducted at different times) have to be compared. Problems of this kind usually arise in social statistics. The labor intensity of statistical and sociological research means that almost all of it is based on non-exhaustive (sample) observation. The problem of well-founded conclusions is therefore particularly acute in social statistics.

When applying the hypothesis testing procedure, it should be remembered that it can guarantee results with a certain probability only for “unbiased” samples, based on objective data.

Difference Analysis

Checking the significance of differences consists in comparing the answers to the same question received for two or more independent groups of respondents. In addition, in some cases it is of interest to compare the answers to two or more independent questions for the same sample.

An example of the first case is the study of the question: what do the inhabitants of a certain region prefer to drink in the morning: coffee or tea. Initially, 100 respondents were interviewed on the basis of a random sample, 60% of whom prefer coffee; a year later, the study was repeated, and only 40% of the 300 people surveyed were in favor of coffee. How can the results of these two studies be compared? Direct arithmetic comparison of 40% and 60% is impossible due to different sampling errors. Although in the case of large differences in numbers, say 20 and 80%, it is easier to conclude that there is a change in tastes in favor of coffee. However, if there is confidence that this large difference is due primarily to the fact that in the first case a very small sample was used, then such a conclusion may turn out to be doubtful. Thus, when conducting such a comparison, two critical factors must be taken into account: the degree of significance of differences between the parameter values ​​for two samples and the mean square errors of two samples, determined by their volumes.

The null hypothesis is used to test whether the difference between the measured means is significant. The null hypothesis assumes that the two populations being compared on one or more characteristics do not differ from each other. In this case it is assumed that the actual difference between the compared values is equal to zero, and that the difference from zero revealed by the data is of a random nature.

To check if the difference between the two measured means (percentages) is significant, they are first compared, and then the resulting difference is converted to the value of standard errors, and it is determined how far they deviate from the hypothetical zero value.

Once the standard errors are determined, the area under the normal distribution curve becomes known and it becomes possible to draw a conclusion about the probability of fulfilling the null hypothesis.

Consider the following example. Let us try to answer the question: "Is there a difference in the consumption of soft drinks between girls and boys?" The survey asked about the number of cans of soft drinks consumed during the week. Descriptive statistics showed that, on average, boys consume 9 and girls 7.5 cans of soft drinks. The standard deviations were 2 and 1.2, respectively. The sample size in both cases was 100 people. The check for a statistically significant difference between the estimates was carried out as follows:

z = (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂),

where x̄₁ and x̄₂ are the means of the two samples;

s₁ and s₂ are the standard deviations of the two samples;

n₁ and n₂ are the sizes of the first and second samples, respectively.

The numerator of this formula characterizes the difference between the means. In addition, the difference in the shape of the two distribution curves must be taken into account; this is done in the denominator. The sampling distribution is now considered as the sampling distribution of the difference between means (or percentage measures). If the null hypothesis is true, the distribution of the difference is a normal curve with a mean of zero and, after normalization, a standard error of one.

Substituting the values gives z = (9 - 7.5) / √(2²/100 + 1.2²/100) ≈ 6.43. This value significantly exceeds both ±1.96 (95% confidence level) and ±2.58 (99% confidence level), which means that the null hypothesis is not true.

Fig. 4.6 shows the distribution curves for the two compared samples and the distribution of the difference between the means; the mean of this difference distribution is 0. Because the observed difference amounts to so many standard errors, the probability that the null hypothesis of no difference between the two means is valid is less than 0.001.
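
The soft-drink example can be recalculated with the following sketch of a two-sample z-test for the difference between means (figures taken from the example above):

    import math

    def z_diff_means(m1, m2, s1, s2, n1, n2):
        """Difference between two means expressed in standard errors."""
        se_diff = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
        return (m1 - m2) / se_diff

    z = z_diff_means(9, 7.5, 2, 1.2, 100, 100)
    print(round(z, 2))   # about 6.43 > 2.58, the null hypothesis is rejected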

Fundamentals of statistical data analysis

statistics"biostatistics".

1. nominal;
2. ordinal;
3. interval;

samples

representative

sample frame simple random sample interval sampling

stratified sampling

cluster and sampling quota

null hypothesis

alternative hypothesis power

confidence level."


Title: Fundamentals of statistical data analysis
Detailed description:

After the completion of any scientific research, fundamental or experimental, statistical analysis of the data obtained is carried out. For statistical analysis to be carried out successfully and to solve the tasks set, the study must be properly planned. Therefore, without understanding the basics of statistics, it is impossible to plan and process the results of a scientific experiment. However, medical education provides neither knowledge of statistics nor even the basics of higher mathematics. Therefore, one can often come across the opinion that only a statistician should deal with statistical processing in biomedical research, while the medical researcher should focus on the medical issues of his scientific work. Such a division of labor, implying assistance in data analysis, is fully justified. However, an understanding of the principles of statistics is necessary at least in order to avoid setting the task incorrectly for the specialist, communication with whom before the start of the study is as important as at the stage of data processing.

Before talking about the basics of statistical analysis, it is necessary to clarify the meaning of the term "statistics". There are many definitions, but the most complete and concise, in our opinion, is the definition of statistics as "the science of collecting, presenting and analyzing data". In turn, the use of statistics in application to the living world is called "biometrics" or "biostatistics".

It should be noted that very often statistics is reduced only to the processing of experimental data, without paying attention to the stage of obtaining them. However, statistical knowledge is necessary already during the planning of the experiment, so that the indicators obtained during it can provide the researcher with reliable information. Therefore, we can say that the statistical analysis of the results of the experiment begins even before the start of the study.

Already at the stage of developing a plan, the researcher should clearly understand what type of variables will be in his work. All variables can be divided into two classes: qualitative and quantitative. What range a variable can take depends on the scale of measurement. There are four main scales:

1. nominal;
2. ordinal;
3. interval;
4. rational (scale of relations).

In the nominal scale (the scale of "names") there are only symbols for describing some classes of objects, for example, "gender" or "profession of the patient". The nominal scale implies that the variable will take values, quantitative relationships between which cannot be determined. Thus, it is impossible to establish a mathematical relationship between the male and female sexes. Conventional numerical designations (women - 0, men - 1, or vice versa) are given absolutely arbitrarily and are intended only for computer processing. The nominal scale is qualitative in its purest form; individual categories in this scale are expressed by frequencies (the number or proportion of observations, percentages).

The ordinal (ordinal) scale provides that individual categories in it can be arranged in ascending or descending order. In medical statistics, a classic example of an ordinal scale is the gradation of the severity of a disease. In this case, we can build the severity in ascending order, but still do not have the ability to specify quantitative relationships, i.e. the distance between the values ​​measured in the ordinal scale is unknown or does not matter. It is easy to establish the order of the values ​​of the “severity” variable, but it is impossible to determine how many times a severe condition differs from a moderate condition.

The ordinal scale refers to semi-quantitative data types, and its gradations can be described both by frequencies (as in a qualitative scale) and by measures of central values, which we will discuss below.

Interval and rational scales are purely quantitative data types. In the interval scale, we can already determine how much one value of a variable differs from another. Thus, an increase in body temperature by 1 degree Celsius always means an increase in the heat released by a fixed number of units. However, the interval scale has both positive and negative values ​​(there is no absolute zero). In this regard, it is impossible to say that 20 degrees Celsius is twice as warm as 10. We can only state that 20 degrees is as much warmer than 10 as 30 is warmer than 20.

The rational scale (the ratio scale) has one reference point and only positive values. In medicine, most rational scales are concentrations. For example, a glucose level of 10 mmol/L is twice the concentration compared to 5 mmol/L. For temperature, the rational scale is the Kelvin scale, where there is absolute zero (absence of heat).

It should be added that any quantitative variable can be continuous, as in the case of measuring body temperature (this is a continuous interval scale), or discrete, if we count the number of blood cells or the offspring of laboratory animals (this is a discrete rational scale).

These differences are of decisive importance for the choice of methods for statistical analysis of experimental results. So, for nominal data, the chi-square test is applicable, and the well-known Student's test requires that the variable (interval or rational) be continuous.

After the question of the type of variable has been resolved, it is necessary to start forming samples. A sample is a small group of objects drawn from a certain class (in statistics, the general population). To obtain absolutely accurate data it would be necessary to study all the objects of the given class; however, for practical (often financial) reasons, only a part of the population, called the sample, is studied. Statistical analysis then allows the researcher to extend the patterns obtained to the entire population with a certain degree of accuracy. In fact, all biomedical statistics is aimed at obtaining the most accurate results from the smallest possible number of observations, because in research on humans the ethical issue is also important: we cannot afford to put more patients at risk than necessary.

The creation of a sample is regulated by a number of mandatory requirements, the violation of which can lead to erroneous conclusions from the results of the study. First, sample size is important. The accuracy of estimating the studied parameters depends on the sample size. The word "accuracy" should be taken into account here. The larger the size of the study groups, the more accurate (but not necessarily correct) results the scientist receives. In order for the results of sampling studies to be transferable to the entire population as a whole, the sample must be representative. The representativeness of the sample implies that it reflects all the essential properties of the population. In other words, in the studied groups, persons of different sex, age, professions, social status, etc. are found with the same frequency as in the entire population.

However, before starting the selection of the study group, one should decide on the need to study a particular population. An example of a population can be all patients with a certain nosology or people of working age, etc. Thus, the results obtained for a population of young people of military age can hardly be extrapolated to postmenopausal women. The set of characteristics that the study group will have determines the "generalizability" of the study data.

Samples can be formed in various ways. The simplest of them is to select, using a random number generator, the required number of objects from the population, or sampling frame. This method is called a "simple random sample". If you randomly select a starting point in the sampling frame and then take every second, fifth, or tenth object (depending on the group sizes required in the study), you get an interval (systematic) sample. Interval sampling is not strictly random, since the possibility of periodic repetition of data within the sampling frame is never excluded.
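
A minimal sketch (with a made-up sampling frame) contrasting a simple random sample and an interval (systematic) sample:

    import random

    sampling_frame = list(range(1, 1001))   # IDs of 1000 objects in the population

    # Simple random sample: 100 objects chosen by a random number generator
    simple_random = random.sample(sampling_frame, 100)

    # Interval (systematic) sample: a random starting point, then every tenth object
    start = random.randrange(10)
    interval_sample = sampling_frame[start::10]

    print(len(simple_random), len(interval_sample))   # 100 objects in each case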

It is also possible to create a so-called "stratified sample", which assumes that the population consists of several different groups and that this structure should be reproduced in the study group. For example, if the ratio of men to women in the population is 30:70, then in a stratified sample their ratio should be the same. With this approach it is important not to balance the sample excessively, that is, to avoid making its characteristics homogeneous, otherwise the researcher may miss the chance to find differences or relationships in the data.

In addition to the described methods of forming groups, there are also cluster sampling and quota sampling. The first is used when obtaining complete information about the sampling frame is difficult because of its size; the sample is then formed from several groups (clusters) included in the population. The second, quota sampling, is similar to a stratified sample, but here the distribution of objects does not correspond to that in the population.

Returning to the sample size, it should be said that it is closely related to the probability of statistical errors of the first and second kind. Statistical errors can arise because the study examines not the entire population but a part of it. A type I error is the erroneous rejection of the null hypothesis. In turn, the null hypothesis is the assumption that all the studied groups are taken from the same general population, which means that any differences or relationships between them are random. If we draw an analogy with diagnostic tests, a type I error is a false positive result.

A type II error is the erroneous rejection of the alternative hypothesis, the meaning of which is that the differences or relationships between groups are due not to random coincidence but to the influence of the studied factors. And again the analogy with diagnostics: a type II error is a false negative result. Related to this error is the notion of power, which describes how effective a given statistical method is under the given conditions, i.e. its sensitivity. Power is calculated by the formula 1 - β, where β is the probability of a type II error. This indicator depends mainly on the sample size: the larger the group sizes, the lower the probability of a type II error and the higher the power of the statistical tests. This dependence is at least quadratic, that is, halving the sample size will reduce the power at least fourfold. The minimum allowable power is considered to be 80%, and the maximum allowable level of a type I error is 5%. However, it should always be remembered that these boundaries are conventional and may change depending on the nature and objectives of the study. As a rule, the scientific community accepts an arbitrary change in power, but in the overwhelming majority of cases the level of a type I error may not exceed 5%.
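
The dependence of power on sample size can be illustrated by a simple simulation (an illustrative sketch, not a formal power calculation; all parameters are invented): two normally distributed groups with a true difference in means are generated many times, and the share of simulated experiments in which a t-test detects the difference at the 5% level serves as an estimate of power.

    import numpy as np
    from scipy.stats import ttest_ind

    def estimated_power(n, true_diff=0.5, sd=1.0, alpha=0.05, trials=2000, seed=0):
        """Share of simulated experiments in which the difference is detected."""
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(trials):
            a = rng.normal(0.0, sd, n)
            b = rng.normal(true_diff, sd, n)
            if ttest_ind(a, b).pvalue < alpha:
                hits += 1
        return hits / trials

    for n in (20, 40, 80):
        print(n, round(estimated_power(n), 2))   # power grows with the group size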

All of the above is directly related to the research planning stage. However, many researchers mistakenly refer to statistical data processing only as some kind of manipulation performed after the completion of the main part of the work. Often, after the end of an unplanned experiment, there is an irresistible desire to order the analysis of statistical data on the side. But it will be very difficult even for a statistician to extract the result expected by the researcher from the “heap of garbage”. Therefore, with insufficient knowledge of biostatistics, it is necessary to seek help in statistical analysis even before the start of the experiment.

Turning to the analysis procedure itself, two main types of statistical techniques should be pointed out: descriptive and evidence-based (analytical). Descriptive techniques include techniques to present data in a compact and easy-to-understand manner. These include tables, graphs, frequencies (absolute and relative), measures of central tendency (mean, median, mode) and measures of data spread (variance, standard deviation, interquartile interval, etc.). In other words, descriptive methods characterize the studied samples.

The most popular (though often misleading) way of describing available quantitative data is to define the following indicators:

  • the number of observations in the sample or its size;
  • average value (arithmetic mean);
  • standard deviation is a measure of how widely the values ​​of variables change.

It is important to remember that the arithmetic mean and standard deviation are measures of central tendency and scatter in a fairly small number of samples. In such samples, the values ​​of most objects deviate from the mean with equal probability, and their distribution forms a symmetrical “bell” (Gaussian or Gauss-Laplace curve). Such a distribution is also called “normal”, but in the practice of a medical experiment it occurs only in 30% of cases. If the values ​​of the variable are distributed asymmetrically about the center, then the groups are best described using the median and quantiles (percentiles, quartiles, deciles).

Having completed the description of the groups, it is necessary to answer the question about their relationships and the possibility of generalizing the results of the study to the entire population. For this, evidence-based methods of biostatistics are used. It is about them that researchers first of all remember when it comes to statistical data processing. Usually this stage of work is called "testing statistical hypotheses".

Hypothesis testing tasks can be divided into two large groups. The first group answers the question of whether there are differences between groups in terms of the level of some indicator, for example, differences in the level of hepatic transaminases in patients with hepatitis and healthy people. The second group allows you to prove the existence of a relationship between two or more indicators, for example, the function of the liver and the immune system.

In practical terms, tasks from the first group can be divided into two subtypes:

  • comparison of the indicator in only two groups (healthy and sick, men and women);
  • comparison of three or more groups (study of different doses of the drug).

It should be taken into account that statistical methods differ significantly for qualitative and quantitative data.

In a situation where the variable being studied is qualitative and only two groups are being compared, the chi-square test can be used. This is a fairly powerful and widely known criterion, however, it is not effective enough if the number of observations is small. To solve this problem, there are several methods, such as the Yates correction for continuity and Fisher's exact method.
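
A sketch of how the chi-square test (with the Yates correction) and Fisher's exact test might be applied to a 2x2 table of qualitative data using scipy; the counts are hypothetical:

    from scipy.stats import chi2_contingency, fisher_exact

    # Hypothetical 2x2 table: outcome (improved / not improved) in two groups
    table = [[18, 7],
             [11, 14]]

    chi2, p_chi2, dof, expected = chi2_contingency(table, correction=True)  # Yates correction
    odds_ratio, p_fisher = fisher_exact(table)

    print("chi-square p-value   :", round(p_chi2, 3))
    print("Fisher exact p-value :", round(p_fisher, 3))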

If the variable under study is quantitative, then one of two types of statistical tests can be used. Criteria of the first type are based on a specific type of distribution of the general population and operate with the parameters of this population. Such criteria are called "parametric", and they are usually based on the assumption of a normal distribution of values. Non-parametric tests are not based on the assumption about the type of distribution of the general population and do not use its parameters. Sometimes such criteria are called "distribution-free tests". To a certain extent, this is erroneous, since any non-parametric test assumes that the distributions in all compared groups will be the same, otherwise false positive results may be obtained.

There are two parametric tests applied to data drawn from a normally distributed population: Student's t-test for comparing two groups and Fisher's F-test for testing the equality of variances (which underlies analysis of variance, ANOVA). There are many more nonparametric criteria. Different tests differ from each other in the assumptions on which they are based, in the complexity of the calculations, in statistical power, and so on. The most commonly used, however, are the Wilcoxon test (for related groups) and the Mann-Whitney test, also known as the Wilcoxon test for independent samples. These tests are convenient in that they do not require assumptions about the nature of the data distribution. But if it turns out that the samples are taken from a normally distributed general population, their statistical power will not differ significantly from that of Student's test.
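
For quantitative data, the parametric and non-parametric comparisons described above could look like this (the data are randomly generated for illustration; the Welch variant of the t-test, equal_var=False, is used here as one common choice, not as a prescription of the text):

    import numpy as np
    from scipy.stats import ttest_ind, mannwhitneyu

    rng = np.random.default_rng(1)
    healthy = rng.normal(30, 5, 40)    # e.g., transaminase level in healthy subjects
    patients = rng.normal(42, 8, 40)   # e.g., transaminase level in patients

    t_res = ttest_ind(healthy, patients, equal_var=False)                    # parametric
    u_res = mannwhitneyu(healthy, patients, alternative="two-sided")         # non-parametric

    print("t-test p-value       :", t_res.pvalue)
    print("Mann-Whitney p-value :", u_res.pvalue)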

A complete description of statistical methods can be found in the specialized literature; the key point, however, is that each statistical test has its own set of rules (assumptions) and conditions of use, and mechanically running through several methods in search of the "desired" result is absolutely unacceptable from a scientific point of view. In this sense, statistical tests are similar to drugs: each has indications and contraindications, side effects and a probability of being ineffective. And the uncontrolled use of statistical tests is just as dangerous, because hypotheses and conclusions are based on them.

For a more complete understanding of the accuracy of statistical analysis, it is necessary to define and analyze the concept of the "confidence level". The confidence probability is a value taken as the boundary between probable and improbable events. Traditionally it is denoted by the letter "p". For many researchers, the sole purpose of performing statistical analysis is to calculate the coveted p value, which seems to place the commas in the well-known phrase "execution cannot be pardoned". The maximum allowable value is 0.05. It should be remembered that the confidence level is not the probability of some event, but a matter of confidence. By setting the confidence probability before starting the analysis, we thereby determine the degree of trust in the results of our research. And, as is well known, excessive gullibility and excessive suspicion affect the results of any work equally negatively.

The level of confidence indicates the maximum probability of a Type I error that the researcher considers acceptable. Decreasing the level of confidence, in other words, tightening the conditions for testing hypotheses, increases the likelihood of type II errors. Therefore, the choice of the confidence level should be carried out taking into account the possible damage from the occurrence of errors of the first and second kind. For example, the strict limits adopted in biomedical statistics, which determine the proportion of false positive results of no more than 5%, is a severe necessity, because new treatments are introduced or rejected based on the results of medical research, and this is a matter of life for many thousands of people.

It must be borne in mind that the p value itself is not very informative for the doctor, since it only tells about the probability of erroneous rejection of the null hypothesis. This indicator says nothing, for example, about the size of the therapeutic effect when using the study drug in the general population. Therefore, there is an opinion that instead of the level of confidence, it would be better to evaluate the results of the study by the size of the confidence interval. A confidence interval is a range of values ​​within which the true population value (for mean, median, or frequency) is contained with a certain probability. In practice, it is more convenient to have both of these values, which makes it possible to more confidently judge the applicability of the results obtained to the population as a whole.

In conclusion, a few words should be said about the tools used by a statistician or a researcher who independently analyzes data. Manual calculations are long gone. The statistical computer programs that exist today make it possible to carry out statistical analysis without having a serious mathematical background. Such powerful systems as SPSS, SAS, R, etc. enable the researcher to use complex and powerful statistical methods. However, this is not always a good thing. Without knowing the degree of applicability of the statistical tests used to specific experimental data, the researcher can make calculations and even get some numbers at the output, but the result will be very doubtful. Therefore, a prerequisite for statistical processing of the results of the experiment should be a good knowledge of the mathematical foundations of statistics.


Statistical methods are described in sufficient detail in the domestic literature. In the practice of Russian enterprises, meanwhile, only some of them are used. Let us next consider some methods of statistical processing.

General information

In the practice of domestic enterprises, mainly statistical methods of control are used; statistical regulation of the technological process is encountered extremely rarely. The application of statistical methods presupposes that a group of appropriately qualified specialists is formed at the enterprise.

Meaning

According to the ISO 9000 series, the supplier needs to determine the need for the statistical methods that are applied in developing, regulating and verifying the capabilities of the production process and the characteristics of the products. The methods used are based on probability theory and mathematical statistics. Statistical methods of data analysis can be applied at any stage of the product life cycle. They provide an assessment of, and account for, the degree of heterogeneity of products or the variability of their properties relative to the established nominal or required values, as well as the variability of the process by which they are created. Statistical methods are methods by which one can judge the state of the phenomena being studied with a given accuracy and reliability. They make it possible to predict certain problems and to develop optimal solutions on the basis of the factual information, trends and patterns studied.

Directions of use

The main areas in which statistical methods are widely used are:


Practice of developed countries

Statistical methods are a foundation that ensures the creation of products with high consumer characteristics. These techniques are widely used in industrialized countries. Statistical methods are, in effect, a guarantee that consumers receive products meeting the established requirements. The effect of their use has been proven by the practice of industrial enterprises in Japan; it was these methods that contributed to the achievement of the highest production level in that country. The long experience of foreign countries shows how effective these techniques are. In particular, it is known that Hewlett-Packard, using statistical methods, was able in one of its operations to reduce the number of defective units per month from 9,000 to 45.

Difficulties of implementation

In domestic practice, there are a number of obstacles that prevent the use of statistical methods for studying indicators. Difficulties arise due to:


Program development

It must be said that determining the need for particular statistical methods in the field of quality, and selecting and mastering specific techniques, is a rather complicated and lengthy job for any domestic enterprise. For its effective implementation it is advisable to develop a special long-term program. The program should provide for the formation of a service whose tasks will include the organization and methodological guidance of the application of statistical methods. Within the framework of the program, provision must also be made for equipping the enterprise with appropriate technical means, training specialists, and determining the set of production tasks to be solved with the selected methods. It is recommended to start mastering them with the simplest approaches, for example the well-known elementary methods of statistical production control, and subsequently to move on to other methods, such as analysis of variance, sample-based processing of information, statistical regulation of processes, planning of factorial studies and experiments, and so on.

Classification

Statistical methods of economic analysis include a variety of techniques; needless to say, there are quite a few of them. However, a leading expert in the field of quality management in Japan, K. Ishikawa, recommends using seven basic methods:

  1. Pareto charts.
  2. Grouping of information according to common features.
  3. Control charts.
  4. Cause-and-effect diagrams.
  5. Histograms.
  6. Check sheets.
  7. Scatter diagrams.

Based on his own experience in the field of management, Ishikawa claims that 95% of all issues and problems in the enterprise can be solved using these seven approaches.

Pareto chart

This chart is based on a well-known ratio called the "Pareto principle": 20% of the causes give rise to 80% of the consequences. The Pareto chart shows, in a visual and understandable form, the relative influence of each circumstance on the overall problem in descending order. This influence can be studied in terms of the number of losses or defects caused by each factor. The relative influence is illustrated by bars, and the cumulative influence of the factors by a cumulative line.
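
A sketch of how a Pareto chart could be built with matplotlib; the defect categories and counts are invented for illustration:

    import matplotlib.pyplot as plt

    causes = ["scratches", "misalignment", "cracks", "stains", "other"]
    counts = [120, 80, 30, 15, 5]                      # defects per cause, sorted descending

    total = sum(counts)
    cumulative = [sum(counts[:i + 1]) / total * 100 for i in range(len(counts))]

    fig, ax1 = plt.subplots()
    ax1.bar(causes, counts)                                 # relative influence of each cause
    ax2 = ax1.twinx()
    ax2.plot(causes, cumulative, marker="o", color="red")   # cumulative influence, %
    ax2.set_ylim(0, 110)
    ax1.set_ylabel("number of defects")
    ax2.set_ylabel("cumulative share, %")
    plt.title("Pareto chart")
    plt.show()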

Cause-and-effect diagram

On it, the problem under study is conventionally depicted in the form of a horizontal straight arrow, and the conditions and factors that indirectly or directly affect it are in the form of oblique arrows. When building, even seemingly insignificant circumstances should be taken into account. This is due to the fact that in practice there are quite often cases in which the solution of the problem is ensured by the exclusion of several seemingly insignificant factors. The reasons that influence the main circumstances (of the first and subsequent orders) are depicted on the diagram by horizontal short arrows. The detailed diagram will be in the form of a fish skeleton.

Grouping information

This economic-statistical method is used to organize a set of indicators that were obtained by evaluating and measuring one or more parameters of an object. As a rule, such information is presented in the form of an unordered sequence of values. These can be the linear dimensions of the workpiece, the melting point, the hardness of the material, the number of defects, and so on. Based on such a system, it is difficult to draw conclusions about the properties of the product or the processes of its creation. Ordering is carried out using line graphs. They clearly show changes in observed parameters over a certain period.

Check sheet

As a rule, it is presented in the form of a table of the frequency with which the measured values of the object's parameters fall into the corresponding intervals. Check sheets are compiled depending on the purpose of the study. The range of indicator values is divided into equal intervals, whose number is usually chosen equal to the square root of the number of measurements taken. The form should be simple, so as to avoid problems when filling it out, reading it and checking it.

Histogram

It is presented in the form of a stepped polygon and clearly illustrates the distribution of the measured values. The range of values is divided into equal intervals, which are plotted along the x-axis. A rectangle is built over each interval, its height equal to the frequency with which values fall into that interval.
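
The rule of thumb mentioned above (the number of intervals roughly equal to the square root of the number of measurements) can be applied directly when building a histogram; the measurement data here are simulated for illustration:

    import math
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    measurements = rng.normal(10.0, 0.2, 100)       # e.g., 100 measurements of a part dimension

    bins = round(math.sqrt(len(measurements)))      # number of equal intervals ~ sqrt(n) = 10

    plt.hist(measurements, bins=bins, edgecolor="black")
    plt.xlabel("measured value")
    plt.ylabel("frequency")
    plt.title("Histogram of measurement results")
    plt.show()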

Scatter diagrams

They are used when testing a hypothesis about the relationship between two variables. The model is built as follows: the value of one parameter is plotted on the abscissa axis and the value of the other on the ordinate axis, giving a point on the graph. These actions are repeated for all values of the variables. If there is a relationship, the correlation field is elongated and its direction does not coincide with the direction of either axis. If there is no relationship, the field will be parallel to one of the axes or will have the shape of a circle.

Control charts

They are used to evaluate a process over a specific period. The construction of control charts is based on the following provisions:

  1. All processes deviate from the set parameters over time.
  2. An unstable process does not change randomly; deviations that go beyond the expected limits are non-random.
  3. Individual changes cannot be predicted.
  4. A stable process can randomly deviate within the expected limits.

Use in the practice of Russian enterprises

It should be said that domestic and foreign experience shows that the most effective statistical method for assessing the stability and accuracy of equipment and technological processes is the construction of control charts. This method is also used in regulating the capabilities of production processes. When constructing the charts, it is necessary to choose the monitored parameter correctly. Preference should be given to indicators that are directly related to the intended use of the product, that can be measured easily and that can be influenced through process control. If such a choice is difficult or unjustified, one can evaluate values correlated (interrelated) with the controlled parameter.
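
As an illustration only (the text does not give a concrete construction procedure, and the three-sigma limits below are a common simplification rather than the document's own formula), a mean (X-bar) control chart for simulated subgroup data could be sketched as follows:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    # 25 subgroups of 5 measurements of the controlled parameter (simulated data)
    subgroups = rng.normal(50.0, 2.0, size=(25, 5))

    means = subgroups.mean(axis=1)
    center = means.mean()
    sigma = subgroups.std(ddof=1)                      # overall sample standard deviation
    limit = 3 * sigma / np.sqrt(subgroups.shape[1])    # approximate 3-sigma limits for the mean

    plt.plot(means, marker="o")
    plt.axhline(center, color="green", label="center line")
    plt.axhline(center + limit, color="red", linestyle="--", label="upper control limit")
    plt.axhline(center - limit, color="red", linestyle="--", label="lower control limit")
    plt.legend()
    plt.title("X-bar control chart (sketch)")
    plt.show()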

Nuances

If measuring the indicators with the accuracy required for charting by a quantitative criterion is not economically or technically feasible, an alternative (attribute) criterion is used. The terms "reject" and "defect" are associated with it. A defect is understood as each individual non-conformity of the product with the established requirements. A reject is a product that may not be supplied to consumers because of the defects it contains.

Peculiarities

Each type of chart has its own specifics, which must be taken into account when choosing one for a particular case. Charts based on a quantitative criterion are considered more sensitive to process changes than those based on an alternative (attribute) criterion. However, the former are more labor-intensive. They are used for:

  1. Process debugging.
  2. Assessing the possibilities of introducing technology.
  3. Checking the accuracy of the equipment.
  4. Tolerance definitions.
  5. Comparing several acceptable ways of manufacturing a product.

Additionally

If a process disturbance manifests itself as a shift of the controlled parameter, X charts should be used. If it manifests itself as an increase in the dispersion of values, R or S charts should be chosen. A number of features must be taken into account, however. In particular, S charts make it possible to detect a process disturbance more accurately and quickly than R charts with the same subgroup sizes, although the latter do not require complex calculations.

Conclusion

In economics, it is possible to study, in space and over time, the factors identified in the course of qualitative assessment, and to use them for predictive calculations. Statistical methods of economic analysis do not, however, include methods for assessing the cause-and-effect relationships of economic processes and events or for identifying promising and untapped reserves for improving performance; in other words, factorial techniques are not included in the approaches considered.