
Maximum sampling error. Sample observation: concept, types, sampling errors, evaluation of results

The discrepancies between the value of an indicator found through statistical observation and its actual value are called observation errors. Depending on their causes, registration errors and representativeness errors are distinguished.

Registration errors arise from incorrect establishment of facts or erroneous recording during observation or interviewing. They can be random or systematic. Random registration errors can be made both by respondents in their answers and by interviewers. Systematic errors can be intentional or unintentional. Intentional errors are conscious, tendentious distortions of the actual state of affairs. Unintentional errors are caused by various accidental reasons (negligence, inattention).

Representativeness errors arise because the survey is incomplete and the surveyed set of units does not fully reproduce the general population. They can be random or systematic. Random representativeness errors are deviations that arise during incomplete observation because the set of selected units (the sample) does not fully reproduce the population as a whole. Systematic representativeness errors are deviations that arise when the principles of random selection are violated. Representativeness errors are inherent in sample observation and cannot be avoided entirely. However, using methods of probability theory based on the limit theorems of the law of large numbers, these errors can be reduced to minimum values whose bounds are established with sufficiently high accuracy.

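Sampling error is the difference between a characteristic of the sample and the corresponding characteristic of the general population. For the mean, the error is determined by the formula

$$\Delta_{\tilde{x}} = |\tilde{x} - \bar{x}|,$$

where $\tilde{x}$ is the sample mean and $\bar{x}$ is the general mean.

The quantity $\Delta_{\tilde{x}}$ is called the maximum (marginal) sampling error.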

The maximum sampling error is a random value. Limit theorems of the law of large numbers are devoted to studying the patterns of random sampling errors. These patterns are most fully revealed in the theorems of P. L. Chebyshev and A. M. Lyapunov.

P. L. Chebyshev's theorem, as applied to the method under consideration, can be formulated as follows: with a sufficiently large number of independent observations, it can be asserted with a probability close to one (i.e., almost with certainty) that the deviation of the sample mean from the general mean will be as small as desired. The theorem proves that the magnitude of the error should not exceed $t\mu$. In turn, the value $\mu$, which expresses the standard deviation of the sample mean from the general mean, depends on the variability of the characteristic in the population $\sigma$ and on the number of selected units $n$. This dependence is expressed by the formula

$$\mu = \frac{\sigma}{\sqrt{n}}, \quad (7.2)$$

where $\mu$ also depends on the sampling method.

The quantity $\mu = \sqrt{\sigma^2 / n}$ is called the average sampling error. In this expression $\sigma^2$ is the general variance and $n$ is the size of the sample population.

Let us consider how the number of selected units $n$ affects the average error. It is easy to see that when more units are selected, the differences between the sample and general means become smaller, that is, there is an inverse relationship between the average sampling error and the number of selected units. Moreover, this is not simply an inverse relationship: the square of the discrepancy between the means is inversely proportional to the number of selected units.
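The inverse-square relationship can be checked numerically. The sketch below assumes the repeated-selection formula $\mu = \sigma/\sqrt{n}$; the function name and the value of `sigma` are hypothetical and chosen only for illustration.

```python
import math


def average_sampling_error(sigma: float, n: int) -> float:
    """Average sampling error mu = sigma / sqrt(n) for repeated selection."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / math.sqrt(n)


# Hypothetical standard deviation, used only to illustrate the relationship.
sigma = 10.0
for n in (100, 400, 1600):
    # Each fourfold increase in n halves the average error.
    print(n, average_sampling_error(sigma, n))
```

Quadrupling the sample size halves $\mu$, so halving the error requires four times as many observations.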

An increase in the variability of a characteristic entails an increase in the standard deviation and, consequently, in the error. If all units had the same value of the characteristic, the standard deviation would be zero and the sampling error would disappear; sampling would then be unnecessary. However, the variability of the characteristic in the general population is unknown, since the values of the characteristic for its units are unknown. Only the variability in the sample population can be calculated. The relationship between the variances of the general and sample populations is expressed by the formula

$$\sigma^2 = \sigma_s^2 \cdot \frac{n}{n-1},$$

where $\sigma^2$ is the general variance and $\sigma_s^2$ is the sample variance.

Since the value $\frac{n}{n-1}$ is close to unity for sufficiently large $n$, we can approximately assume that the sample variance is equal to the general variance, i.e. $\sigma_s^2 \approx \sigma^2$.

Consequently, the average sampling error shows the possible deviations of the characteristics of the sample population from the corresponding characteristics of the general population. However, the magnitude of this error can be judged only with a certain probability. The probability is indicated by the multiplier $t$.

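Theorem of A. M. Lyapunov. A. M. Lyapunov proved that the distribution of sample means (and therefore of their deviations from the general mean) is approximately normal for a sufficiently large number of independent observations, provided that the general population has a finite mean and a bounded variance.

Mathematically, Lyapunov's theorem can be written as follows:

$$P\left(|\tilde{x} - \bar{x}| \le t\mu\right) = \Phi(t), \quad (7.3)$$

where $\tilde{x}$ is the sample mean, $\bar{x}$ is the general mean, $\mu$ is the average sampling error and

$$\Phi(t) = \frac{2}{\sqrt{2\pi}} \int_0^t e^{-z^2/2}\,dz, \quad (7.4)$$

where $\pi$ is the mathematical constant;

$\Delta = t\mu$ is the marginal sampling error, which makes it possible to find out within what limits the value of the general mean lies.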

The values of this integral for various values of the confidence coefficient $t$ have been calculated and are given in special mathematical tables. In particular:

$t = 1$: $\Phi(t) = 0.683$;
$t = 2$: $\Phi(t) = 0.954$;
$t = 3$: $\Phi(t) = 0.997$.

Because $t$ determines the probability that the discrepancy between the sample mean $\tilde{x}$ and the general mean $\bar{x}$ satisfies $|\tilde{x} - \bar{x}| \le t\mu$, this can be read as follows: with a probability of 0.683 it can be stated that the difference between the sample and general means does not exceed one average sampling error. In other words, in 68.3% of cases the representativeness error will not exceed $\pm\mu$. With a probability of 0.954 it can be stated that the representativeness error does not exceed $\pm 2\mu$ (i.e. in about 95% of cases). With a probability of 0.997, i.e. quite close to unity, we can expect that the difference between the sample and general means will not exceed three average sampling errors ($\pm 3\mu$), and so on.

The logic here is clear: the wider the limits within which a possible error is allowed, the higher the probability with which its magnitude can be asserted.
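The tabulated probabilities for $t = 1, 2, 3$ can be reproduced without tables. The sketch below assumes the normal approximation and uses the identity $\Phi(t) = \operatorname{erf}(t/\sqrt{2})$ for the two-sided probability; the function name is hypothetical.

```python
import math


def confidence_probability(t: float) -> float:
    """Probability that the error does not exceed t average sampling errors,
    assuming sample means are normally distributed."""
    return math.erf(t / math.sqrt(2.0))


for t in (1, 2, 3):
    print(t, round(confidence_probability(t), 3))
```

The same function gives the probability for intermediate values of $t$ that are not quoted in the text.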

Knowing the sample mean of the characteristic $\tilde{x}$ and the marginal sampling error $\Delta$, it is possible to determine the boundaries (limits) within which the general mean $\bar{x}$ is contained:

$$\tilde{x} - \Delta \le \bar{x} \le \tilde{x} + \Delta.$$

1. Purely random sampling: this method selects units from the general population without any division into parts or groups. To comply with the basic principle of sampling, an equal opportunity for every unit of the general population to be selected, units are drawn at random by lot (lottery) or using a table of random numbers. Both repeated and non-repetitive selection of units are possible.

The average error of a purely random sample is the standard deviation of the possible values of the sample mean from the general mean. The average sampling errors for purely random sampling are presented in Table 7.2.

Table 7.2

| Average sampling error $\mu$ | Repeated selection | Non-repetitive selection |
|---|---|---|
| For the mean | $\sqrt{\dfrac{\sigma^2}{n}}$ | $\sqrt{\dfrac{\sigma^2}{n}\left(1 - \dfrac{n}{N}\right)}$ |
| For the share | $\sqrt{\dfrac{w(1-w)}{n}}$ | $\sqrt{\dfrac{w(1-w)}{n}\left(1 - \dfrac{n}{N}\right)}$ |

The following notations are used in the table:

$\sigma^2$: variance of the sample population;

$n$: sample size;

$N$: size of the general population;

$w = m/n$: sample proportion of units possessing the studied characteristic;

$m$: the number of units possessing the studied characteristic.

To increase accuracy, instead of the multiplier $\left(1 - \frac{n}{N}\right)$ one should take the multiplier $\frac{N-n}{N-1}$, but with a large $N$ the difference between these expressions has no practical significance.

The maximum error of a purely random sample $\Delta$ is calculated by the formula

$$\Delta = t\mu, \quad (7.6)$$

where $t$ is the confidence coefficient, which depends on the probability value.

Example. When one hundred product samples selected at random from a batch were examined, 20 turned out to be non-standard. With a probability of 0.954, determine the limits within which the share of non-standard products in the batch lies.

Solution. The general share ($P$) is estimated as

$$P = w \pm \Delta_w.$$

Share of non-standard products in the sample:

$$w = \frac{m}{n} = \frac{20}{100} = 0.2.$$

The maximum error of the sample share with a probability of 0.954 ($t = 2$) is calculated by formula (7.6), using the formula for the share from Table 7.2:

$$\Delta_w = t\sqrt{\frac{w(1-w)}{n}} = 2\sqrt{\frac{0.2 \cdot 0.8}{100}} = 0.08.$$

With a probability of 0.954, it can be stated that the share of non-standard products in the batch lies within 12% ≤ P ≤ 28%.
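The calculation in this example can be sketched in a few lines. The code assumes repeated selection, so that $\mu = \sqrt{w(1-w)/n}$ and $\Delta = t\mu$; the function name is hypothetical.

```python
import math


def share_limits(m: int, n: int, t: float) -> tuple[float, float]:
    """Limits for the general share, assuming repeated random selection."""
    w = m / n                          # sample share
    mu = math.sqrt(w * (1 - w) / n)    # average sampling error of the share
    delta = t * mu                     # maximum sampling error
    return w - delta, w + delta


lower, upper = share_limits(m=20, n=100, t=2)
print(f"{lower:.2f} <= P <= {upper:.2f}")
```

For the data of the example this gives the limits 0.12 and 0.28 stated in the text.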

In designing a sample observation, it is necessary to determine the sample size needed to ensure a given accuracy in estimating general means. The maximum sampling error and its probability are specified in advance. From the formula $\Delta = t\mu$ and the formulas for average sampling errors, the required sample size is established. The formulas for determining the sample size ($n$) depend on the selection method. The calculation of the sample size for a purely random sample is given in Table 7.3.

Table 7.3

| Selection | Sample size for the mean |
|---|---|
| Repeated | $n = \dfrac{t^2\sigma^2}{\Delta^2}$ |
| Non-repetitive | $n = \dfrac{t^2\sigma^2 N}{N\Delta^2 + t^2\sigma^2}$ |

2. Mechanical sampling: this method relies on a certain ordering of the objects in the general population (by list, number, alphabet). Mechanical sampling is carried out by selecting individual objects of the general population at a fixed interval (every 10th or 20th). The interval is determined from the ratio $n/N$, where $n$ is the sample size and $N$ is the size of the general population. So, if a 2% sample is to be obtained from a population of 500,000 units, i.e. 10,000 units are to be selected, the selection proportion will be $\frac{n}{N} = \frac{10\,000}{500\,000} = \frac{1}{50}$, so every 50th unit is selected. Units are selected in accordance with the established proportion at regular intervals. If the arrangement of objects in the general population is random, mechanical sampling is equivalent in content to random selection. Mechanical selection uses only non-repetitive sampling.

The average error and sample size during mechanical selection are calculated using the formulas for proper random sampling (see Tables 7.2 and 7.3).
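A minimal sketch of mechanical selection for the 2% sample described above. The random starting position within the first interval is an assumption added here (the text only fixes the interval); the function name is hypothetical, and units are represented by their positions in the ordered list.

```python
import random


def mechanical_sample(population_size: int, sample_size: int, seed=None) -> list[int]:
    """Return positions of units selected at a fixed interval N // n.

    Assumption: the first unit is chosen at random within the first interval.
    """
    if not 0 < sample_size <= population_size:
        raise ValueError("sample size must be between 1 and the population size")
    step = population_size // sample_size           # selection interval
    start = random.Random(seed).randrange(step)     # random start, 0 .. step - 1
    return [start + k * step for k in range(sample_size)]


positions = mechanical_sample(500_000, 10_000, seed=1)
print(len(positions), positions[1] - positions[0])
```

Each unit can appear only once, which matches the remark that mechanical selection is always non-repetitive.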

3. Typical sample: the general population is divided according to some essential characteristics into typical groups, and units are selected from these groups. With this method, the general population is divided into groups that are homogeneous in some respect, and the question comes down to determining the size of the sample from each group. One option is uniform sampling, in which the same number of units is selected from each typical group: $n_i = n/k$, where $k$ is the number of typical groups.
This approach is justified only if the sizes of the original typical groups are equal. With typical selection that is disproportionate to the size of the groups, the total number of selected units is divided by the number of typical groups, and the resulting value gives the number of units to be selected from each typical group.

A more advanced form of selection is proportional sampling. A scheme for forming a sample is called proportional when the number of units taken from each typical group is proportional to the group sizes, to the group variances, or to a combination of both. Suppose the sample size is set at 100 units; units can then be selected from the groups in one of three ways.

The first way is selection in proportion to the size of the groups in the general population (Table 7.4). The table indicates:

$N_i$: size of the typical group;

$d_i$: share of the group ($N_i/N$);

$N$: size of the general population;

$n_i$: the sample size from a typical group, calculated as

$$n_i = n \cdot d_i = n\,\frac{N_i}{N}, \quad (7.7)$$

$n$: size of the sample from the general population.

Table 7.4 (columns: $N_i$, $d_i$, $n_i$)

The second way is selection proportional to the standard deviation (Table 7.5). Here $\sigma_i$ is the standard deviation of the typical group, and the sample size from a typical group $n_i$ is calculated using the formula

$$n_i = n\,\frac{\sigma_i}{\sum_i \sigma_i}. \quad (7.8)$$

Table 7.5 (columns: $N_i$, $\sigma_i$, $n_i$)

The third way is combined selection (Table 7.6). The sample size is calculated using the formula

$$n_i = n\,\frac{\sigma_i N_i}{\sum_i \sigma_i N_i}. \quad (7.9)$$

Table 7.6 (columns include $\sigma_i N_i$)

When conducting a typical sample, direct selection from each group is carried out using random sampling.

Average sampling errors are calculated using the formulas in Table. 7.7 depending on the method of selection from typical groups.

Table 7.7. Average sampling error of a typical sample

Columns: repeated selection (for the mean, for the share); non-repetitive selection (for the mean, for the share).

Rows: selection disproportional to group size; proportional to group size; proportional to the variability within groups (the most advantageous).

Here

$\bar{\sigma}^2$: the average of the within-group variances of the typical groups;

$w$: the proportion of units possessing the studied characteristic;

$\overline{w(1-w)}$: the average of the within-group variances for the share;

$\sigma_i$: standard deviation in the sample from the $i$-th typical group;

$n_i$: sample size from a typical group;

$n$: total sample size;

$N_i$: size of a typical group;

$N$: size of the general population.

The sample size from each typical group should be proportional to the standard deviation in this group ($\sigma_i$). The numbers $n_i$ are calculated according to the formulas given in Table 7.8.

Table 7.8

4. Serial sampling is convenient when population units are combined into small groups or series. In serial sampling, the general population is divided into groups of equal size (series), and whole series are selected into the sample. The essence of serial sampling is the random or mechanical selection of series, within which every unit is examined. The average error of a serial sample with equal series depends only on the between-group (inter-series) variance. The average errors are summarized in Table 7.9.

Table 7.9

| Series selection method | For the mean | For the share |
|---|---|---|
| Repeated | $\sqrt{\dfrac{\delta^2}{r}}$ | $\sqrt{\dfrac{\delta_w^2}{r}}$ |
| Non-repetitive | $\sqrt{\dfrac{\delta^2}{r}\left(1 - \dfrac{r}{R}\right)}$ | $\sqrt{\dfrac{\delta_w^2}{r}\left(1 - \dfrac{r}{R}\right)}$ |

Here $R$ is the number of series in the general population;

$r$: number of selected series;

$\delta^2$: inter-series (intergroup) variance of means;

$\delta_w^2$: inter-series (intergroup) variance of the share.

With serial selection, the required number of selected series is determined in the same way as with the purely random selection method.

The number of serial samples is calculated using the formulas given in table. 7.10.

Table 7.10

Example. In the mechanical shop of the plant, 100 workers work in ten teams. In order to study the qualifications of workers, a 20% serial non-repetitive sampling was carried out, which included two teams. The following distribution of surveyed workers by category was obtained:

Categories of workers in brigade 1

Categories of workers in brigade 2


It is necessary to determine with a probability of 0.997 the limits within which the average category of workers in a machine shop lies.

Solution. Let us define sample averages for teams and the overall average as a weighted average of group averages:

Let us determine the inter-series dispersion using formulas (5.25):

Let's calculate the average sampling error using the formula in Table. 7.9:

Let's calculate the maximum sampling error with a probability of 0.997:

With a probability of 0.997, it can be stated that the average category of workers in a machine shop is within the range
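The steps of this solution can be sketched in code. The brigade data of the example are not available here, so the worker categories below are hypothetical values invented purely for illustration; the correction factor $(1 - r/R)$ for non-repetitive selection of series is likewise an assumption, and the function name is hypothetical.

```python
import math


def serial_mean_limits(series: list[list[float]], total_series: int, t: float):
    """Limits for the general mean from a non-repetitive serial sample
    with series of equal size."""
    r = len(series)
    series_means = [sum(s) / len(s) for s in series]
    overall_mean = sum(series_means) / r        # equal series, so a simple average
    # Inter-series variance of the series means.
    delta_sq = sum((m - overall_mean) ** 2 for m in series_means) / r
    mu = math.sqrt(delta_sq / r * (1 - r / total_series))
    margin = t * mu
    return overall_mean - margin, overall_mean + margin


# HYPOTHETICAL worker categories for two brigades of ten workers each.
brigade_1 = [2, 2, 3, 3, 3, 3, 4, 4, 3, 3]
brigade_2 = [3, 3, 4, 4, 4, 4, 5, 5, 4, 4]

lower, upper = serial_mean_limits([brigade_1, brigade_2], total_series=10, t=3)
print(f"{lower:.2f} <= mean category <= {upper:.2f}")
```

With only two selected series the interval is wide, which reflects the fact that the error depends on the number of series rather than on the number of workers.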

The average sampling error shows how much a sample parameter deviates on average from the corresponding population parameter. If we average the errors of all possible samples of a given type and size ($n$) drawn from the same general population, we obtain their generalizing characteristic, the average sampling error ($\mu$).

In the theory of sample observation, formulas for $\mu$ are derived separately for different selection methods (repeated and non-repetitive), types of samples and types of statistical indicators being estimated.

For example, if repeated random sampling is used, $\mu$ is defined as

$$\mu = \sqrt{\frac{\sigma^2}{n}}$$

when estimating the mean value of a characteristic, and

$$\mu = \sqrt{\frac{w(1-w)}{n}}$$

if the characteristic is alternative and the share is estimated.

In the case of non-repetitive random selection, the correction $(1 - n/N)$ is introduced into the formulas:

$$\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)}$$

for the mean value of the characteristic, and

$$\mu = \sqrt{\frac{w(1-w)}{n}\left(1 - \frac{n}{N}\right)}$$

for a share.

Here $\sigma^2$ is the variance of the characteristic, $w$ is the sample share, $n$ is the sample size and $N$ is the size of the general population.

The probability that the sampling error does not exceed one average error is 0.683. In practice, results with a higher probability are preferred, but this increases the magnitude of the sampling error.

The maximum sampling error ($\Delta$) is equal to $t$ average sampling errors (in sampling theory, the coefficient $t$ is called the confidence coefficient):

$$\Delta = t\mu.$$

If the sampling error is doubled (t = 2), we get a much greater probability that it will not exceed a certain limit (in our case, double the average error) - 0.954. If we take t = 3, then the confidence probability will be 0.997 - almost certainty.

The level of marginal sampling error depends on the following factors:

  • degree of variation of units of the general population;
  • sample size;
  • selection scheme used (non-repetitive selection gives a smaller error);
  • confidence level.

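If the sample size is more than 30, the value of $t$ is determined from the normal distribution table; if it is less, from the Student distribution table.

Some values of the confidence coefficient from the normal distribution table:

| Confidence probability $P$ | 0.683 | 0.954 | 0.997 |
|---|---|---|---|
| Confidence coefficient $t$ | 1 | 2 | 3 |

The confidence interval for the mean value of the characteristic and for the share in the general population is established as follows:

$$\tilde{x} - \Delta_{\tilde{x}} \le \bar{x} \le \tilde{x} + \Delta_{\tilde{x}}; \qquad w - \Delta_w \le p \le w + \Delta_w,$$

where $\tilde{x}$ and $w$ are the sample mean and sample share, and $\bar{x}$ and $p$ are the general mean and general share.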

So, determining the boundaries of the general average and share consists of the following steps:

Sampling errors for different types of selection

  1. Purely random and mechanical sampling. The average error of purely random and mechanical sampling is found using the formulas presented in Table 11.3.

Example 11.2. To study the level of capital productivity, a sample survey of 90 enterprises out of 225 was conducted using a random repeated sampling method, which resulted in the data presented in the table.

In the example under consideration, we have a 40% sample (90: 225 = 0.4, or 40%). Let us determine its maximum error and boundaries for the average value of the attribute in the population according to the steps of the algorithm:

  1. Based on the results of the sample survey, we calculate the mean and variance in the sample population:

Table 11.5. Observation results (first two columns) and calculated values (remaining columns)

| Level of capital productivity, rub., $x_i$ | Number of enterprises, $f_i$ | Middle of the interval, $x_i'$ | $x_i' f_i$ | $x_i'^2 f_i$ |
|---|---|---|---|---|
| Up to 1.4 | 13 | 1.3 | 16.9 | 21.97 |
| 1.4-1.6 | 15 | 1.5 | 22.5 | 33.75 |
| 1.6-1.8 | 17 | 1.7 | 28.9 | 49.13 |
| 1.8-2.0 | 15 | 1.9 | 28.5 | 54.15 |
| 2.0-2.2 | 16 | 2.1 | 33.6 | 70.56 |
| 2.2 and higher | 14 | 2.3 | 32.2 | 74.06 |
| Total | 90 | - | 162.6 | 303.62 |

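Sample mean:

$$\tilde{x} = \frac{\sum x_i' f_i}{\sum f_i} = \frac{162.6}{90} \approx 1.81 \text{ rub.}$$

Sample variance of the studied characteristic:

$$\sigma^2 = \frac{\sum x_i'^2 f_i}{\sum f_i} - \tilde{x}^2 = \frac{303.62}{90} - 1.8067^2 \approx 0.1095.$$

For our data, we determine the maximum sampling error, for example, with a probability of 0.954. Using the table of values of the normal distribution function (see the excerpt in Appendix 1), we find the confidence coefficient $t$ corresponding to a probability of 0.954: $t = 2$. Then

$$\Delta = t\mu = t\sqrt{\frac{\sigma^2}{n}} = 2\sqrt{\frac{0.1095}{90}} \approx 0.07 \text{ rub.},$$

so that $1.81 - 0.07 \le \bar{x} \le 1.81 + 0.07$.

Thus, in 954 cases out of 1000, the average level of capital productivity will be no higher than 1.88 rubles and no lower than 1.74 rubles.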

A repeated random sampling scheme was used above. Let us see whether the survey results change if we assume that the selection was non-repetitive. In this case, the average error is calculated using the formula

$$\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)} = \sqrt{\frac{0.1095}{90}\left(1 - \frac{90}{225}\right)} \approx 0.027 \text{ rub.}$$

Then, with a probability of 0.954, the maximum sampling error will be

$$\Delta = t\mu = 2 \cdot 0.027 \approx 0.054 \text{ rub.}$$

The confidence limits for the mean value of the characteristic under non-repetitive random selection are

$$1.807 - 0.054 \le \bar{x} \le 1.807 + 0.054, \quad \text{i.e. approximately } 1.75 \le \bar{x} \le 1.86.$$

Comparing the results of the two selection schemes, we can conclude that non-repetitive random sampling gives more accurate results than repeated selection at the same confidence probability. Moreover, the larger the sample size relative to the population, the more the boundaries of the mean narrow when moving from one selection scheme to the other.
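The comparison of the two schemes can be reproduced from the grouped data of Table 11.5. This is a sketch under the assumptions of the example (interval midpoints stand in for the individual values, $t = 2$ for a probability of 0.954); the variable and function names are hypothetical.

```python
import math

# Interval midpoints and numbers of enterprises from Table 11.5.
midpoints = [1.3, 1.5, 1.7, 1.9, 2.1, 2.3]
frequencies = [13, 15, 17, 15, 16, 14]
N = 225   # size of the general population
t = 2     # confidence coefficient for a probability of 0.954

n = sum(frequencies)
mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n
variance = sum(x * x * f for x, f in zip(midpoints, frequencies)) / n - mean ** 2


def mean_limits(mean, variance, n, t, population_size=None):
    """Confidence limits for the general mean.

    If population_size is given, the non-repetitive correction (1 - n/N) is applied.
    """
    correction = 1.0 if population_size is None else 1 - n / population_size
    delta = t * math.sqrt(variance / n * correction)
    return mean - delta, mean + delta


print("repeated:      ", mean_limits(mean, variance, n, t))
print("non-repetitive:", mean_limits(mean, variance, n, t, population_size=N))
```

The non-repetitive interval is narrower because the factor $1 - n/N = 0.6$ reduces the variance of the sample mean.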

Using the example data, we determine within what boundaries the share of enterprises with a capital productivity level not exceeding 2.0 rubles is located in the general population:

  1. Let's calculate the sample share.

The number of enterprises in the sample with a capital productivity level not exceeding 2.0 rubles is 60 units. Then

m = 60, n = 90, w = m/n = 60: 90 = 0.667;

  2. Calculate the variance of the share in the sample population:

$$\sigma_w^2 = w(1-w) = 0.667 \cdot 0.333 \approx 0.222.$$

  3. The average sampling error under a repeated sampling scheme will be

$$\mu_w = \sqrt{\frac{w(1-w)}{n}} = \sqrt{\frac{0.222}{90}} \approx 0.05.$$

If we assume that a non-repetitive sampling scheme was used, the average sampling error, taking into account the finite population correction, will be

$$\mu_w = \sqrt{\frac{w(1-w)}{n}\left(1 - \frac{n}{N}\right)} = \sqrt{\frac{0.222}{90}\left(1 - \frac{90}{225}\right)} \approx 0.04.$$

  4. Set the confidence probability and determine the maximum sampling error.

With a probability of $P = 0.997$, the normal distribution table gives the confidence coefficient $t = 3$ (see the excerpt in Appendix 1):

$$\Delta_w = t\mu_w = 3 \cdot 0.04 = 0.12.$$

Thus, with a probability of 0.997 it can be stated that in the general population the share of enterprises with a capital productivity level not exceeding 2.0 rubles is no less than 54.7% and no more than 78.7%.

  2. Typical sample. With a typical sample, the general population of objects is divided into k groups, so that

N 1 + N 2 + … + N i + … + N k = N.

The volume of units extracted from each typical group depends on the sampling method adopted; their total number forms the required sample size

n 1 + n 2 + … + n i + … + n k = n.

There are two ways of organizing selection from typical groups: proportional to the size of the typical groups, and proportional to the degree of variation of the characteristic among the observation units in the groups. Let us consider the first of them, as the most frequently used.

Selection proportional to the size of typical groups assumes that the following number of population units is selected from each of them:

$$n_i = n\,\frac{N_i}{N},$$

where n i is the number of units extracted for the sample from the i-th typical group;

n - total sample size;

N i is the number of units in the general population that made up the i-th typical group;

N is the total number of units in the population.

The selection of units within groups occurs in the form of random or mechanical sampling.

Formulas for estimating the average sampling error for the mean and proportion are presented in Table. 11.6.

Here $\bar{\sigma}^2$ is the average of the within-group variances of the typical groups.

Example 11.3. In one of the Moscow universities, a sample survey of students was conducted to determine the average number of visits to the university library per student per semester. For this purpose, a 5% non-repetitive typical sample was used, in which the typical groups correspond to the course number. With selection proportional to the size of the typical groups, the following data were obtained:

Table 11.7.

| Course number | Total students, people, $N_i$ | Examined in the sample, people, $n_i$ | Average number of library visits per student per semester, $\tilde{x}_i$ | Within-group sample variance, $\sigma_i^2$ |
|---|---|---|---|---|
| 1 | 650 | 33 | 11 | 6 |
| 2 | 610 | 31 | 8 | 15 |
| 3 | 580 | 29 | 5 | 18 |
| 4 | 360 | 18 | 6 | 24 |
| 5 | 350 | 17 | 10 | 12 |
| Total | 2550 | 128 | 8 | - |

The number of students to be examined in each course is calculated as follows:

$$n_1 = n\,\frac{N_1}{N} = 128 \cdot \frac{650}{2550} \approx 33 \text{ (persons)};$$

similarly for other groups:

n 2 = 31 (persons);

n 3 = 29 (persons);
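Rounding each $n_i = n N_i / N$ independently does not always return exactly $n$ units in total. One possible way to keep the total fixed is largest-remainder rounding, sketched below; this rounding rule is an assumption introduced here, not a method stated in the text, and the function name is hypothetical. The group sizes are taken from Table 11.7.

```python
def proportional_allocation(group_sizes: list[int], sample_size: int) -> list[int]:
    """Allocate sample_size units to groups in proportion to their sizes.

    Uses largest-remainder rounding (an assumption) so the parts sum to sample_size.
    """
    total = sum(group_sizes)
    # Integer part of n * N_i / N for every group.
    allocation = [sample_size * size // total for size in group_sizes]
    remainders = [sample_size * size % total for size in group_sizes]
    leftover = sample_size - sum(allocation)
    # Give the remaining units to the groups with the largest remainders.
    by_remainder = sorted(range(len(group_sizes)),
                          key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        allocation[i] += 1
    return allocation


course_sizes = [650, 610, 580, 360, 350]   # N_i from Table 11.7
print(proportional_allocation(course_sizes, 128))
```

For these data the allocation coincides with the $n_i$ column of Table 11.7 and sums to 128.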

The distribution of sample means is normal (or approaches normal) for n > 100, regardless of the distribution of the general population. However, for small samples a different distribution law applies, the Student distribution. In this case, the confidence coefficient is found from the Student t-distribution table depending on the confidence probability P and the sample size n. Appendix 1 provides a fragment of the Student t-distribution table, presented as a dependence of the confidence probability on the sample size and the confidence coefficient t.

Example 11.4. Suppose that a sample survey of eight academy students showed that they spent the following number of hours preparing for a test in statistics: 8.5; 8.0; 7.8; 9.0; 7.2; 6.2; 8.4; 6.6.

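Let us estimate the sample mean time and construct a confidence interval for the mean value of the characteristic in the general population, taking the confidence probability equal to 0.95.

The sample mean is $\tilde{x} = 61.7 / 8 \approx 7.7$ hours, and the standard deviation computed with $n - 1 = 7$ degrees of freedom is $s \approx 0.973$ hours, so the average sampling error is $\mu = s/\sqrt{n} \approx 0.344$ hours. For a probability of 0.95 and 7 degrees of freedom the Student coefficient is $t = 2.365$, which gives $\Delta = t\mu \approx 0.81$ hours.

That is, with a probability of 0.95, it can be stated that the average time a student spends preparing for the test lies in the range from 6.9 to 8.5 hours.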
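A minimal sketch of the small-sample calculation for Example 11.4. The Student coefficient 2.365 (two-sided probability 0.95, 7 degrees of freedom) is entered by hand as a table value, since the Python standard library has no t-distribution quantile function; the function and constant names are hypothetical.

```python
import math
import statistics

hours = [8.5, 8.0, 7.8, 9.0, 7.2, 6.2, 8.4, 6.6]

# Table value of the Student coefficient for P = 0.95 and n - 1 = 7 degrees of freedom.
T_095_DF7 = 2.365


def small_sample_limits(values: list[float], t: float) -> tuple[float, float]:
    """Confidence limits for the general mean from a small sample."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)        # standard deviation with n - 1 in the denominator
    delta = t * s / math.sqrt(n)        # maximum sampling error
    return mean - delta, mean + delta


lower, upper = small_sample_limits(hours, T_095_DF7)
print(f"{lower:.1f} <= mean hours <= {upper:.1f}")
```

With the normal coefficient for the same probability (about 1.96) the interval would be noticeably narrower, which is why the Student table is used for samples this small.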

11.2.2. Determining the size of the sample population

Before directly conducting a sample observation, the question of how many units of the population under study must be selected for the survey is always resolved. Formulas for determining the sample size are derived from the formulas for maximum sampling errors in accordance with the following starting points (Table 11.7):

  1. type of intended sample;
  2. selection method (repeated or non-repetitive);
  3. selection of the parameter to be evaluated (average value of the characteristic or proportion).

In addition, it is necessary to determine in advance the value of the confidence probability that suits the consumer of information, and the size of the permissible maximum sampling error.

Note: when using the formulas given in the table, it is recommended to round up the resulting sample size to ensure some margin of accuracy.

Example 11.5. Let us calculate how many of the 507 industrial enterprises should be checked by the tax inspectorate in order to determine, with a probability of 0.997, the share of enterprises with violations in paying taxes. According to data from a previous similar survey, the standard deviation was 0.15; The sampling error is expected to be no higher than 0.05.

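With repeated random sampling, it is necessary to check

$$n = \frac{t^2\sigma^2}{\Delta^2} = \frac{3^2 \cdot 0.15^2}{0.05^2} = 81 \text{ enterprises.}$$

With non-repetitive random selection, it is necessary to check

$$n = \frac{t^2\sigma^2 N}{\Delta^2 N + t^2\sigma^2} = \frac{3^2 \cdot 0.15^2 \cdot 507}{0.05^2 \cdot 507 + 3^2 \cdot 0.15^2} \approx 70 \text{ enterprises.}$$

As we can see, non-repetitive sampling allows a smaller number of objects to be surveyed.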

Example 11.6. It is planned to conduct a survey of wages at industry enterprises using a random, non-repetitive sampling method. What should be the size of the sample population if at the time of the survey the number of employees in the industry was 100,000 people? The maximum sampling error should not exceed 100 rubles. with a probability of 0.954. Based on the results of previous salary surveys in the industry, it is known that the standard deviation is 500 rubles.

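$$n = \frac{t^2\sigma^2 N}{\Delta^2 N + t^2\sigma^2} = \frac{2^2 \cdot 500^2 \cdot 100\,000}{100^2 \cdot 100\,000 + 2^2 \cdot 500^2} \approx 99.9.$$

Therefore, to solve this problem, at least 100 people must be included in the sample.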
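The sample-size calculations of Examples 11.5 and 11.6 can be sketched as follows, assuming purely random selection and the formulas $n = t^2\sigma^2/\Delta^2$ (repeated) and $n = t^2\sigma^2 N/(\Delta^2 N + t^2\sigma^2)$ (non-repetitive). The function names are hypothetical, and the small tolerance in `round_up` is an implementation detail added here to keep floating-point noise from pushing an exact result up by one.

```python
import math


def sample_size_repeated(t: float, sigma: float, delta: float) -> float:
    """Required sample size for repeated random selection."""
    return t ** 2 * sigma ** 2 / delta ** 2


def sample_size_non_repetitive(t: float, sigma: float, delta: float, N: int) -> float:
    """Required sample size for non-repetitive random selection."""
    return t ** 2 * sigma ** 2 * N / (delta ** 2 * N + t ** 2 * sigma ** 2)


def round_up(n: float, eps: float = 1e-9) -> int:
    """Round the sample size up, ignoring tiny floating-point excess."""
    return math.ceil(n - eps)


# Example 11.5: 507 enterprises, t = 3, sigma = 0.15, maximum error 0.05.
print(round_up(sample_size_repeated(3, 0.15, 0.05)))
print(round_up(sample_size_non_repetitive(3, 0.15, 0.05, 507)))

# Example 11.6: 100,000 employees, t = 2, sigma = 500 rub., maximum error 100 rub.
print(round_up(sample_size_non_repetitive(2, 500, 100, 100_000)))
```

Rounding up follows the recommendation above to round the resulting sample size upward to keep some margin of accuracy.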

Statistical population: a set of units characterized by mass character, typicality, qualitative homogeneity and the presence of variation.

A statistical population consists of materially existing objects (employees, enterprises, countries, regions) and is the object of statistical study.

Unit of the population: each specific unit of a statistical population.

The same statistical population can be homogeneous in one characteristic and heterogeneous in another.

Qualitative homogeneity: similarity of all units of the population in some characteristic and dissimilarity in all others.

In a statistical population, the differences between one unit and another are often quantitative. Quantitative changes in the values of a characteristic across the units of a population are called variation.

Variation of a characteristic: a quantitative change in a characteristic (for a quantitative characteristic) in passing from one unit of the population to another.

Characteristic (attribute): a property or feature of units, objects and phenomena that can be observed or measured. Characteristics are divided into quantitative and qualitative.

Attributive (qualitative) characteristics cannot be expressed numerically (e.g. population composition by gender). Quantitative characteristics have a numerical expression (e.g. population composition by age).

Indicator: a generalizing quantitative and qualitative characteristic of some property of units or of the population as a whole under specific conditions of time and place.

System of indicators: a set of indicators that comprehensively reflect the phenomenon being studied.

For example, when wages are studied:
  • Characteristic: wages
  • Statistical population: all employees
  • Unit of the population: each employee
  • Qualitative homogeneity: accrued wages
  • Variation of the characteristic: a series of numbers

Population and sample from it

The starting point is a set of data obtained by measuring one or more characteristics. The actually observed set of objects, statistically represented by a number of observations of a random variable, is the sample, and the hypothetically existing (conjectural) set is the general population. The population may be finite (number of observations N = const) or infinite (N = ∞), whereas a sample from a population is always the result of a limited number of observations. The number of observations forming a sample is called the sample size. If the sample size is large enough (n → ∞), the sample is considered large; otherwise it is called a sample of limited size. A sample is considered small if, when a one-dimensional random variable is measured, the sample size does not exceed 30 (n <= 30), or if, when several (k) characteristics are measured simultaneously in a multidimensional space, the ratio of n to k is less than 10 (n/k < 10). The sample forms a variation series if its members are order statistics, i.e. the sample values of the random variable X are arranged in ascending order (ranked); the values of the characteristic are called variants.

Example. One and the same randomly selected set of objects, the commercial banks of one administrative district of Moscow, can be regarded as a sample from the general population of all commercial banks in that district, as a sample from the general population of all commercial banks in Moscow, as a sample from the commercial banks of the country, and so on.

Basic methods of organizing sampling

The reliability of statistical conclusions and the meaningful interpretation of results depend on the representativeness of the sample, i.e. on how completely and adequately it represents the properties of the general population with respect to which the sample can be considered representative. The statistical properties of a population can be studied in two ways: by complete (continuous) observation or by partial observation. Complete observation examines all units of the population under study, while partial (sample) observation examines only part of it.

There are five main ways to organize sample observation:

1. simple random selection, in which objects are selected at random from the population (for example, using a table of random numbers or a random number generator), with each possible sample having equal probability. Such samples are called purely random;

2. simple selection using a regular procedure, carried out with a mechanical component (for example, date, day of the week, apartment number, letters of the alphabet, etc.); samples obtained in this way are called mechanical;

3. stratified selection, in which a general population of size $N$ is divided into subpopulations or layers (strata) of sizes $N_1, N_2, \ldots, N_k$ so that $N_1 + N_2 + \ldots + N_k = N$. Strata are groups of objects that are homogeneous in terms of statistical characteristics (for example, the population is divided into strata by age group or social class; enterprises by industry). Such samples are called stratified (also layered, typical or zoned);

4. serial selection methods, used to form serial or cluster (nest) samples. They are convenient when a "block" or a series of objects must be surveyed at once (for example, a batch of goods, products of a certain series, or the population of a territorial-administrative division of the country). Series can be selected purely at random or mechanically. A complete inspection is then carried out of the selected batch of goods or of an entire territorial unit (a residential building or block);

5. combined (multistage) selection, which can combine several selection methods at once (for example, stratified and random, or random and mechanical); such a sample is called combined.

Types of selection

By type, individual, group and combined selection are distinguished. In individual selection, individual units of the general population are selected into the sample; in group selection, qualitatively homogeneous groups (series) of units are selected; combined selection combines the first and second types.

By method, repeated and non-repeated selection are distinguished.

Selection is called non-repeated when a unit included in the sample is not returned to the original population and does not take part in further selection; the number of units N in the general population therefore decreases during selection. In repeated selection, a unit is returned to the general population after registration and thus retains an equal chance, along with the other units, of being selected again; the number of units N in the general population remains unchanged (this method is rarely used in socio-economic research). However, for large N (N → ∞) the formulas for non-repeated selection approach those for repeated selection, and the latter are used more often in practice (N = const).

Basic characteristics of the parameters of the general and sample population

The statistical conclusions of a study are based on the distribution of a random variable, and the observed values ($x_1, x_2, \dots, x_n$) are called realizations of the random variable X (n is the sample size). The distribution of a random variable in the general population is theoretical and ideal in nature, and its sample analogue is the empirical distribution. Some theoretical distributions are specified analytically, i.e. their parameters determine the value of the distribution function at each point in the space of possible values of the random variable. For a sample, the distribution function is difficult and sometimes impossible to determine, so the parameters are estimated from empirical data and then substituted into the analytical expression describing the theoretical distribution. The assumption (or hypothesis) about the type of distribution may be statistically correct or erroneous, but in any case the empirical distribution reconstructed from the sample characterizes the true one only approximately. The most important distribution parameters are the expected value $\mu$ and the variance $\sigma^2$.

By their nature, distributions are continuous or discrete. The best known continuous distribution is the normal distribution. The sample analogues of its parameters $\mu$ and $\sigma^2$ are the mean $\bar{x}$ and the empirical variance $s^2$. Among discrete distributions, the alternative (dichotomous) distribution is the one used most often in socio-economic research. The expected value of this distribution expresses the relative number (or share) of units of the population that have the characteristic being studied (it is denoted by the letter p); the share of the population that does not have this characteristic is denoted by the letter q (q = 1 - p). The variance of the alternative distribution also has an empirical analogue.

Depending on the type of distribution and on the method of selecting population units, the characteristics of the distribution parameters are calculated differently. The main ones for theoretical and empirical distributions are given in Table 1.

The sampling fraction $k_n$ is the ratio of the number of units in the sample to the number of units in the general population:

$k_n = n/N$.

The sample proportion $w$ is the ratio of the number of units possessing the characteristic being studied, $n_n$, to the sample size $n$:

$w = n_n/n$.

Example. In a batch of goods containing 1000 units, a 5% sample ($k_n = 0.05$) amounts in absolute terms to 50 units ($n = N \cdot 0.05$); if 2 defective products are found in this sample, the sample defect rate is $w = 2/50 = 0.04$, or 4%.
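The arithmetic of this example can be sketched in a few lines of Python. The function names are illustrative choices, not notation from the text.

```python
def sampling_fraction(n: int, N: int) -> float:
    """Share of the general population that enters the sample (n / N)."""
    return n / N


def sample_proportion(units_with_feature: int, n: int) -> float:
    """Share of sampled units that possess the characteristic studied."""
    return units_with_feature / n


N = 1000                    # batch size from the example
n = round(N * 0.05)         # a 5% sample
w = sample_proportion(2, n) # 2 defective items found in the sample

print(n, sampling_fraction(n, N), w)
```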

Since the sample population is different from the general population, there are sampling errors.

Table 1. Main parameters of the general and sample populations

Sampling errors

In any observation (continuous or sample), errors of two types may occur: registration errors and representativeness errors. Registration errors can be random or systematic. Random errors arise from many different uncontrollable causes, are unintentional and usually balance each other out (for example, changes in instrument readings due to temperature fluctuations in the room).

Systematic errors are biased, since they result from violating the rules for selecting objects for the sample or from a persistent distortion of measurements (for example, deviations that appear after the settings of a measuring device are changed).

Example. To assess the social situation of a city's population, it is planned to survey 25% of families. If every fourth apartment is selected by its number, there is a danger of selecting only apartments of one type (for example, one-room apartments), which produces a systematic error and distorts the results; choosing apartment numbers by lot is preferable, since the error will then be random.

Representativeness errors are inherent only in sample observation. They cannot be avoided, and they arise because the sample population does not completely reproduce the general population. The values of indicators obtained from the sample differ from the values of the same indicators in the general population (or obtained through continuous observation).

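The sampling error is the difference between the value of a parameter in the general population and its sample value. For the mean of a quantitative characteristic it equals $\varepsilon_{\bar{x}} = |\mu - \bar{x}|$, and for the share (alternative characteristic) $\varepsilon_w = |p - w|$, where $\mu$ and $p$ are the general mean and share and $\bar{x}$ and $w$ are their sample values.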

Sampling errors are inherent only in sample observation. The larger these errors, the more the empirical distribution differs from the theoretical one. The parameters of the empirical distribution are random variables; sampling errors are therefore also random variables that take different values for different samples, which is why it is customary to calculate the average error.

The average sampling error is the standard deviation of the sample mean from the mathematical expectation. Provided the principle of random selection is observed, it depends primarily on the sample size and on the degree of variation of the characteristic: the larger $n$ and the smaller the variation of the characteristic (and therefore the value of $\sigma^2$), the smaller the average sampling error. The relationship between the variances of the general and sample populations is expressed by the formula:

$\sigma^2 = s^2 \cdot \frac{n}{n-1}$,

i.e. when $n$ is large enough, we can assume that $\sigma^2 \approx s^2$. The average sampling error shows the possible deviations of a sample parameter from the corresponding parameter of the general population. Table 2 gives expressions for calculating the average sampling error for different methods of organizing observation.
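A minimal sketch of the variance relationship, assuming it is the usual n/(n - 1) correction that links the sample variance computed with divisor n to the estimate of the general variance. The observations are hypothetical and serve only to illustrate the identity.

```python
import statistics

# Hypothetical observations, used only to illustrate the identity.
values = [4.0, 7.0, 9.0, 10.0, 15.0]
n = len(values)

s2 = statistics.pvariance(values)   # sample variance with divisor n
sigma2_hat = s2 * n / (n - 1)       # corrected estimate of the general variance

# statistics.variance uses the divisor n - 1, so the two numbers agree.
print(sigma2_hat, statistics.variance(values))


def correction_factor(n: int) -> float:
    """Factor n / (n - 1); it tends to 1 as the sample grows."""
    return n / (n - 1)


for size in (5, 30, 1000):
    print(size, round(correction_factor(size), 4))
```

For a sample of 1000 units the factor is already about 1.001, which is why the two variances are treated as interchangeable for large samples.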

Table 2. Average error (m) of sample mean and proportion for different types of samples

where $\overline{s^2}$ is the average of the within-group sample variances for a continuous characteristic;

$\overline{w(1 - w)}$ is the average of the within-group variances of the proportion;

$r$ is the number of selected series, $R$ is the total number of series;

$\delta_{\bar{x}}^2 = \frac{\sum_i (\bar{x}_i - \bar{x})^2}{r}$,

where $\bar{x}_i$ is the mean of the $i$-th series;

$\bar{x}$ is the overall mean for the entire sample population for a continuous characteristic;

$\delta_w^2 = \frac{\sum_i (w_i - w)^2}{r}$,

where $w_i$ is the share of the characteristic in the $i$-th series;

$w$ is the total share of the characteristic across the entire sample population.

However, the magnitude of the average error can only be judged with a certain probability P (P ≤ 1). A. M. Lyapunov proved that the distribution of sample means, and therefore of their deviations from the general mean, approximately obeys the normal distribution law for sufficiently large n, provided that the general population has a finite mean and bounded variance.

Mathematically, this statement for the mean is expressed as:

$P\left(|\bar{x} - \mu| \le \Delta_{\bar{x}}\right) = \Phi(t)$, (1)

and for the share, expression (1) takes the form:

$P\left(|w - p| \le \Delta_w\right) = \Phi(t)$, (2)

where $\Delta = t \cdot m$ is the marginal sampling error, a multiple of the average sampling error $m$, and the multiplicity coefficient $t$ is the Student's criterion ("confidence coefficient"), proposed by W. S. Gosset (pseudonym "Student"); its values for different sample sizes are tabulated.

The values ​​of the function Ф(t) for some values ​​of t are equal to:

Expression (3) can therefore be read as follows: with probability P = 0.683 (68.3%) it can be stated that the difference between the sample mean and the general mean will not exceed one average error m (t = 1); with probability P = 0.954 (95.4%), that it will not exceed two average errors (t = 2); and with probability P = 0.997 (99.7%), that it will not exceed three average errors (t = 3). Thus, the probability that this difference exceeds three times the average error defines the error level and amounts to no more than 0.3%.
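The three probabilities quoted here can be checked numerically. The sketch below assumes that the deviation of the sample mean, measured in average errors, follows the standard normal law, so that the probability of staying within t average errors equals erf(t / √2). The function name is illustrative.

```python
import math


def coverage_probability(t: float) -> float:
    """P(|Z| <= t) for a standard normal variable Z."""
    return math.erf(t / math.sqrt(2))


for t in (1, 2, 3):
    p = coverage_probability(t)
    print(f"t = {t}: P = {p:.3f}, error level = {1 - p:.3f}")
```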

Table 3 gives formulas for calculating the marginal sampling error.

Table 3. Marginal error (Δ) of the sample for the mean and proportion (p) for different types of sample observation

Generalization of sample results to the population

The ultimate goal of sample observation is to characterize the general population. With small sample sizes, the empirical estimates of parameters ($\bar{x}$ and $w$) may deviate significantly from their true values ($\mu$ and $p$). It is therefore necessary to establish the boundaries within which the true values ($\mu$ and $p$) lie for given sample values of the parameters ($\bar{x}$ and $w$).

The confidence interval of a parameter θ of the general population is a random range of values of this parameter that contains its true value with a probability close to 1 (the reliability).

The marginal sampling error Δ makes it possible to determine the limiting values of the characteristics of the general population and their confidence intervals:

$\bar{x} - \Delta_{\bar{x}} \le \mu \le \bar{x} + \Delta_{\bar{x}}$, $\quad w - \Delta_w \le p \le w + \Delta_w$.

The lower bound of the confidence interval is obtained by subtracting the marginal error from the sample mean (share), and the upper bound by adding it.

The confidence interval for the mean uses the marginal sampling error and, for a given confidence level, is determined by the formula:

$\bar{x} - t \cdot m \le \mu \le \bar{x} + t \cdot m$.

This means that with a given probability P, which is called the confidence level and is uniquely determined by the value of t, it can be stated that the true value of the mean lies in the range from $\bar{x} - \Delta_{\bar{x}}$ to $\bar{x} + \Delta_{\bar{x}}$, and the true value of the share in the range from $w - \Delta_w$ to $w + \Delta_w$.

When calculating the confidence interval for the three standard confidence levels P = 95%, P = 99% and P = 99.9%, the value of t is selected from the Student's table in the appendix, depending on the number of degrees of freedom. If the sample size is large enough, the values of t corresponding to these probabilities are 1.96, 2.58 and 3.29. Thus, the marginal sampling error makes it possible to determine the limiting values of the characteristics of the population and their confidence intervals.
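For large samples the three values of t can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `large_sample_t` is an illustrative choice.

```python
from statistics import NormalDist


def large_sample_t(confidence: float) -> float:
    """Two-sided normal quantile: the t for which P(|Z| <= t) = confidence."""
    return NormalDist().inv_cdf((1 + confidence) / 2)


for level in (0.95, 0.99, 0.999):
    print(f"P = {level}: t = {large_sample_t(level):.2f}")
```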

Extending the results of sample observation to the general population in socio-economic research has its own specifics, since it requires that all types and groups of the population be fully represented. The basis for such an extension is the calculation of the relative error:

$\Delta_{\%} = \frac{\Delta_{\bar{x}}}{\bar{x}} \cdot 100\%$ for the mean, $\quad \Delta_{\%} = \frac{\Delta_w}{w} \cdot 100\%$ for the share,

where $\Delta_{\%}$ is the relative marginal sampling error.

There are two main methods for extending a sample observation to a population: direct recalculation and coefficient method.

The essence of direct recalculation is to multiply the sample mean $\bar{x}$ by the size of the population.

Example. Suppose the average number of toddlers per young family in the city has been estimated by sampling as $\bar{x} = 1.2$. If there are 1000 young families in the city, the number of places required in municipal nurseries is obtained by multiplying this average by the size of the general population N = 1000, which gives 1200 places.
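Direct recalculation reduces to a single multiplication. The sketch below assumes a sample mean of 1.2 toddlers per young family, which is the value implied by the 1200 places in the example; the function name is illustrative.

```python
def direct_recalculation(sample_mean: float, population_size: int) -> float:
    """Extend a sample mean to a population total: mean times N."""
    return sample_mean * population_size


# Assumed sample mean of 1.2 toddlers per young family, N = 1000 families.
places = direct_recalculation(1.2, 1000)
print(round(places))
```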

The coefficient method is advisable when sample observation is carried out in order to refine the data of continuous observation.

The following formula is used:

where all variables are the population size:

Required sample size

Table 4. Required sample size (n) for different types of sample observation organization

When planning a sample observation with a predetermined permissible sampling error, the required sample size must be estimated correctly. It can be determined from the permissible error and the given probability that guarantees this error level (taking into account the method of organizing the observation). Formulas for the required sample size n are easily obtained directly from the formulas for the marginal sampling error. Thus, from the expression for the marginal error:

$\Delta = t \sqrt{\frac{\sigma^2}{n}}$

the sample size n is determined directly:

$n = \frac{t^2 \sigma^2}{\Delta^2}$.

This formula shows that as the marginal sampling error Δ decreases, the required sample size increases substantially; it is proportional to the variance and to the square of the Student's t value.
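A short sketch of this dependence for repeated random sampling, assuming the sample size is obtained by solving Δ = t·σ/√n for n and rounding up. The input values are hypothetical.

```python
import math


def required_sample_size(t: float, sigma: float, delta: float) -> int:
    """Smallest n with t * sigma / sqrt(n) <= delta (repeated sampling)."""
    return math.ceil((t * sigma) ** 2 / delta ** 2)


# Hypothetical inputs: t = 2, standard deviation 10, permissible error 3.
print(required_sample_size(2, 10, 3))    # 45
# Halving the permissible error roughly quadruples the sample size.
print(required_sample_size(2, 10, 1.5))  # 178
```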

For a specific method of organizing observation, the required sample size is calculated using the formulas given in Table 4.

Practical calculation examples

Example 1. Calculation of the mean value and confidence interval for a continuous quantitative characteristic.

To assess the speed of settlement with creditors, a random sample of 10 payment documents was carried out at the bank. Their values ​​turned out to be equal (in days): 10; 3; 15; 15; 22; 7; 8; 1; 19; 20.

With probability P = 0.954, determine the marginal error Δ of the sample mean and the confidence limits of the mean settlement time.

Solution. The mean is calculated using the formula from Table 1 for the sample population:

$\bar{x} = \frac{\sum x_i}{n} = \frac{120}{10} = 12.0$ days.

The variance is calculated using the formula from Table 1:

$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{478}{9} \approx 53.1$.

The standard deviation is $\sigma = \sqrt{53.1} \approx 7.3$ days.

The average error is calculated using the formula:

$m = \frac{\sigma}{\sqrt{n}} = \frac{7.3}{\sqrt{10}} \approx 2.3$ days,

i.e. the mean is $\bar{x} \pm m = 12.0 \pm 2.3$ days.

The reliability of the mean was $t = \bar{x}/m = 12.0/2.3 \approx 5.2$.

We calculate the marginal error using the formula from Table 3 for repeated sampling, since the population size is unknown, for the confidence level P = 0.954.

Thus, the mean is $\bar{x} \pm \Delta = \bar{x} \pm 2m = 12.0 \pm 4.6$, i.e. its true value lies in the range from 7.4 to 16.6 days.

Using the Student's t-table in the appendix, we conclude that for n - 1 = 10 - 1 = 9 degrees of freedom the obtained value is reliable at a significance level of α ≤ 0.001, i.e. the resulting mean differs significantly from 0.
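The figures of Example 1 can be reproduced with the Python standard library. The sketch assumes the variance is computed with the divisor n - 1 (which is what `statistics.stdev` does) and that t = 2 corresponds to P = 0.954.

```python
import math
import statistics

days = [10, 3, 15, 15, 22, 7, 8, 1, 19, 20]  # data from Example 1
n = len(days)

mean = statistics.mean(days)   # sample mean
sd = statistics.stdev(days)    # standard deviation, divisor n - 1
m = sd / math.sqrt(n)          # average error of the mean
t = 2                          # confidence coefficient for P = 0.954
delta = t * m                  # marginal error

lower, upper = mean - delta, mean + delta
print(f"mean = {mean:.1f}, m = {m:.1f}, delta = {delta:.1f}")
print(f"interval: {lower:.1f} to {upper:.1f} days")
print(f"ratio mean / m = {mean / m:.1f}")
```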

Example 2. Estimation of probability (general share) p.

In a mechanical sample survey of the social status of 1000 families, the share of low-income families was found to be w = 0.3 (30%) (the sample was 2%, i.e. n/N = 0.02). With confidence level P = 0.997, determine the share p of low-income families throughout the region.

Solution. From the tabulated values of the function Ф(t), we find t = 3 for the given confidence level P = 0.997 (see formula 3). The marginal error of the share w is determined by the formula from Table 3 for non-repeated sampling (mechanical sampling is always non-repeated):

Maximum relative sampling error in % will be:

The general share p of low-income families in the region is p = w ± Δw, and the confidence limits of p follow from the double inequality:

w - Δw ≤ p ≤ w + Δw, i.e. the true value of p lies within:

0.3 - 0.014 < p < 0.3 + 0.014, namely from 28.6% to 31.4%.

Thus, with a probability of 0.997 it can be stated that the share of low-income families among all families in the region ranges from 28.6% to 31.4%.

Example 3. Calculation of the mean value and confidence interval for a discrete characteristic specified by an interval series.

Table 5 gives the distribution of production order requests by the time the enterprise takes to fulfil them.

Table 5. Distribution of observations by time of appearance

Solution. The average order completion time is calculated using the formula:

$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$.

The average period will be:

$\bar{x}$ = (3*20 + 9*80 + 24*60 + 48*20 + 72*20)/200 = 23.1 months.

We get the same answer if we use the shares $p_i$ from the penultimate column of Table 5, using the formula:

$\bar{x} = \sum x_i p_i$.

Note that the middle of the interval for the last gradation is found by artificially supplementing it with the width of the interval of the previous gradation equal to 60 - 36 = 24 months.
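The weighted mean of Example 3 can be checked as follows. The interval midpoints and frequencies are read off the calculation given in the text; the variable names are illustrative.

```python
midpoints = [3, 9, 24, 48, 72]        # interval midpoints, months
frequencies = [20, 80, 60, 20, 20]    # number of orders in each interval

total = sum(frequencies)

# Mean weighted by absolute frequencies.
mean_by_counts = sum(x * f for x, f in zip(midpoints, frequencies)) / total

# The same mean weighted by shares p_i = f_i / total.
shares = [f / total for f in frequencies]
mean_by_shares = sum(x * p for x, p in zip(midpoints, shares))

print(total, round(mean_by_counts, 1), round(mean_by_shares, 1))
```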

The variance is calculated using the formula

$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$,

where $x_i$ is the midpoint of the $i$-th interval of the series.

Therefore $\sigma^2 = \frac{20^2 + 14^2 + 1^2 + 25^2 + 49^2}{4} \approx 905.8$, and the standard deviation is $\sigma \approx 30.1$ months.

The average error is calculated using the formula $m = \sigma/\sqrt{n}$, i.e. the mean is $\bar{x} \pm m = 23.1 \pm 13.4$ months.

We calculate the marginal error using the formula from Table 3 for repeated selection, since the population size is unknown, for a confidence level of 0.954:

$\Delta = t \cdot m = 2 \cdot 13.4 = 26.8$ months.

So the mean is:

$\bar{x} \pm \Delta = 23.1 \pm 26.8$ months,

i.e. its true value lies in the range from 0 to 50 months.

Example 4. To determine the speed of settlements with creditors for the N = 500 enterprises of a corporation, a commercial bank needs to conduct a sample study using random non-repeated selection. Determine the required sample size n so that, with probability P = 0.954, the error of the sample mean does not exceed 3 days, given that trial estimates showed a standard deviation s of 10 days.

Solution. To determine the required number of observations n, we use the formula for non-repeated selection from Table 4:

$n = \frac{t^2 s^2 N}{\Delta_{\bar{x}}^2 N + t^2 s^2}$.

In it, the value of t is determined from the confidence level P = 0.954 and equals 2. The standard deviation is s = 10, the population size is N = 500, and the marginal error of the mean is $\Delta_{\bar{x}} = 3$. Substituting these values into the formula, we get:

$n = \frac{2^2 \cdot 10^2 \cdot 500}{3^2 \cdot 500 + 2^2 \cdot 10^2} = \frac{200000}{4900} \approx 40.8$,

i.e. a sample of 41 enterprises is sufficient to estimate the required parameter, the speed of settlements with creditors.
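The result of Example 4 can be reproduced with the sketch below. It assumes the standard sample-size formula for random non-repeated selection, n = t²s²N / (Δ²N + t²s²), rounded up to a whole unit; the function name is illustrative.

```python
import math


def required_n_without_replacement(t: float, s: float, delta: float, N: int) -> int:
    """Sample size for non-repeated random selection, rounded up."""
    numerator = t ** 2 * s ** 2 * N
    denominator = delta ** 2 * N + t ** 2 * s ** 2
    return math.ceil(numerator / denominator)


# Inputs of Example 4: t = 2 (P = 0.954), s = 10 days, delta = 3 days, N = 500.
print(required_n_without_replacement(2, 10, 3, 500))
```

With the same inputs and an unlimited population the repeated-selection formula t²s²/Δ² gives 45 units, so the finite population of 500 enterprises slightly reduces the required sample.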

To characterize the reliability of sample indicators, a distinction is made between the average and maximum sampling errors, which are characteristic only of sample observations. These indicators reflect the difference between sample and corresponding general indicators.

Average sampling error is determined primarily by the sample size and depends on the structure and degree of variation of the trait being studied.

The meaning of the average sampling error is as follows. The calculated values of the sample proportion ($w$) and the sample mean ($\bar{x}$) are by nature random variables. They can take different values depending on which specific population units are included in the sample. For example, if, when determining the average age of an enterprise's employees, more young people are included in one sample and more older workers in another, the sample means and sampling errors will differ. The average sampling error is determined by the formula:

$\mu = \sqrt{\frac{\sigma^2}{n}}$ (27) or $\mu = \frac{\sigma}{\sqrt{n}}$ for repeated sampling. (28)

Where: μ – average sampling error;

σ – standard deviation of the characteristic in the general population;

n – sample size.

The magnitude of the error μ shows how much the average value of the attribute established in the sample differs from the true value of the attribute in the general population.

It follows from the formula that the sampling error is directly proportional to the standard deviation and inversely proportional to the square root of the number of units included in the sample. This means, for example, that the greater the spread of attribute values ​​in the population, that is, the greater the dispersion, the larger the sample size must be if we want to trust the results of a sample survey. And, conversely, with low dispersion, you can limit yourself to a small number of the sample population. The sampling error will be within acceptable limits.

Since with non-repeated sampling the population size N decreases during sampling, an additional factor $\left(1 - \frac{n}{N}\right)$ is included in the formula for the average sampling error, which takes the following form:

$\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)}$. (29)

The average error is smaller for non-repeated sampling, which explains its wider use.
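The effect of the additional factor can be seen in a small sketch. It assumes the usual forms μ = σ/√n for repeated sampling and μ = √(σ²/n · (1 - n/N)) for non-repeated sampling; the numerical inputs are hypothetical.

```python
import math


def average_error_repeated(sigma: float, n: int) -> float:
    """Average sampling error of the mean for repeated sampling."""
    return sigma / math.sqrt(n)


def average_error_non_repeated(sigma: float, n: int, N: int) -> float:
    """Average sampling error of the mean with the (1 - n/N) factor."""
    return math.sqrt(sigma ** 2 / n * (1 - n / N))


# Hypothetical inputs: sigma = 10, sample of 50 units.
sigma, n = 10.0, 50
print(round(average_error_repeated(sigma, n), 3))
for N in (100, 1000, 1_000_000):
    print(N, round(average_error_non_repeated(sigma, n, N), 3))
```

As N grows, the factor tends to 1 and the two errors become practically equal, which is why the repeated-sampling formula is often used for very large populations.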

For practical conclusions, the general population has to be characterized on the basis of sample results. Sample means and shares are extended to the general population taking into account the limit of their possible error and the probability level that guarantees it. Once a specific probability level has been set, the value of the normalized deviation is selected and the marginal sampling error is determined.

The reliability (confidence probability) of the estimate of X based on X* is the probability γ with which the inequality

|X - X*| < δ, (30)

holds, where δ is the marginal sampling error, characterizing the width of the interval in which the value of the population parameter under study lies with probability γ.

The interval (X* - δ; X* + δ) that covers the parameter X under study (that is, the value of the parameter X lies within this interval) with a given reliability γ is called the confidence interval.

Typically, the reliability of the estimate is specified in advance, and a number close to one is taken as γ: 0.95; 0.99 or 0.999.

The marginal error δ is related to the average error μ by the following relation:

δ = t·μ, (31)

where t is the confidence coefficient, which depends on the probability P with which it can be stated that the marginal error δ will not exceed t times the average error μ (its values are also called critical points or quantiles of the Student distribution).

As follows from relation (31), the marginal error is directly proportional to the average sampling error and to the confidence coefficient, which depends on the given level of reliability of the estimate.

From the formula for the average sampling error and the ratio of the marginal and average errors, we obtain:

Taking into account the confidence probability, this formula will take the form:

Systematic and random errors

Modular unit 2 Sampling errors

Since a sample usually covers a very small part of the population, it should be assumed that there will be differences between the estimate and the characteristics of the population that the estimate reflects. These differences are called mapping errors or representativeness errors. Representativeness errors are divided into two types: systematic and random.

Systematic errors are a constant overestimation or underestimation of the estimate compared with the characteristic of the general population. They arise when the principle of equal probability of each unit of the general population being included in the sample is not observed, that is, when the sample is formed predominantly from the "worst" (or "best") representatives of the general population. Observing the principle of equal opportunity for each unit to be included in the sample makes it possible to eliminate this type of error completely.

Random errors are differences between the estimate and the estimated characteristic of the population that vary from sample to sample in sign and magnitude. They are caused by the play of chance in forming a sample that constitutes only a part of the general population. This type of error is organically inherent in the sampling method. It is impossible to exclude such errors completely; the task is to predict their possible magnitude and reduce them to a minimum. The order of actions related to this follows from the consideration of three types of random errors: specific, average and marginal.

2.2.1 Specific error is the error of one particular sample. If the mean of this sample ($\bar{x}$) is an estimate of the general mean ($\bar{x}_0$), and assuming that this general mean is known to us, then the difference $\varepsilon = \bar{x} - \bar{x}_0$ is the specific error of this sample. If we repeat the sampling from this general population many times, each time we obtain a new value of the specific error: $\varepsilon_1, \varepsilon_2, \dots$, and so on. Of these specific errors we can say the following: some will coincide in magnitude and sign, that is, the errors have a distribution; and some will equal 0, meaning that the estimate coincides with the parameter of the general population;

2.2.2 Average error is the root mean square of all specific estimation errors that can occur by chance: $\mu = \sqrt{\frac{\sum \varepsilon_i^2 f_i}{\sum f_i}}$, where $\varepsilon_i$ are the values of the varying specific errors and $f_i$ is the frequency (probability) of occurrence of a particular error. The average sampling error shows how large an error is made, on average, when a population parameter is judged from its estimate. The above formula reveals the content of the average error, but it cannot be used for practical calculations, if only because it presupposes knowledge of the population parameter, which in itself eliminates the need for sampling.



Practical calculations of the average estimation error are based on the premise that the average error is essentially the standard deviation of all possible values of the estimate. This premise makes it possible to obtain algorithms for calculating the average error from the data of a single sample. In particular, the average error of the sample mean can be established by the following reasoning. There is a sample ($x_1, x_2, \dots, x_n$) consisting of $n$ units. For the sample, the sample mean $\bar{x} = \frac{1}{n}\sum x_i$ is defined as an estimate of the general mean. Each value ($x_1, x_2, \dots, x_n$) under the sum sign should be considered an independent random variable, since with infinite repetition of the sample the first, second, etc. units can take any of the values present in the population. Therefore, since the variance of a sum of independent random variables is equal to the sum of the variances, $\sigma_{\bar{x}}^2 = \frac{1}{n^2}\sum \sigma^2 = \frac{\sigma^2}{n}$. It follows that the average error of the sample mean equals $\mu = \frac{\sigma}{\sqrt{n}}$: it is inversely related to the sample size (through its square root) and directly proportional to the standard deviation of the characteristic in the general population. This is logical, since the sample mean is a consistent estimate of the general mean and, as the sample size increases, its value approaches the estimated parameter of the general population. The direct dependence of the average error on the variability of the characteristic is due to the fact that the greater this variability in the general population, the more difficult it is to build an adequate model of the general population from the sample.

In practice, the standard deviation of the characteristic in the population is replaced by its estimate from the sample, and the formula for the average error of the sample mean takes the form $\mu = \frac{s}{\sqrt{n}}$. Taking into account the bias of the sample variance, the sample standard deviation is calculated using the formula $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$. Since the symbol $n$ denotes the sample size, the denominator in the calculation of the standard deviation should contain not the sample size ($n$) but the so-called number of degrees of freedom ($n - 1$). The number of degrees of freedom is the number of units in an aggregate that can vary freely once some characteristic has been determined from the aggregate. In our case, since the sample mean has been determined, $n - 1$ units can vary freely.
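The premise that the average error is the standard deviation of all possible values of the estimate can be illustrated by simulation. The sketch below builds a hypothetical population, repeatedly draws samples with replacement, and compares the observed spread of the sample means with σ/√n. The population, the sample size and the number of repetitions are arbitrary assumptions.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical general population of 10,000 units.
population = [random.uniform(0, 100) for _ in range(10_000)]
sigma = statistics.pstdev(population)

n = 25  # size of each sample
# Means of 4000 repeated samples drawn with replacement.
means = [statistics.fmean(random.choices(population, k=n)) for _ in range(4000)]

empirical = statistics.pstdev(means)   # observed spread of the sample means
theoretical = sigma / math.sqrt(n)     # average error predicted by the formula

print(round(empirical, 2), round(theoretical, 2))
```

The two numbers should agree to within a few percent; the agreement tightens as the number of repeated samples grows.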

Table 2.2 provides formulas for calculating the average errors of various sample estimates. As the table shows, the average error of every estimate is inversely related to the sample size and directly related to variability. The same holds for the average error of the sample share (frequency): under the root is the variance of the alternative characteristic established from the sample ($w(1 - w)$).

The formulas given in Table 2.2 refer to the so-called random, repeated selection of units in the sample. With other selection methods, which will be discussed below, the formulas will be slightly modified.

Table 2.2

Formulas for calculating average errors of sample estimates

2.2.3 Marginal sampling error. Knowledge of the estimate and its average error is in some cases entirely insufficient. For example, when hormones are used in animal feeding, knowing only the average amount of their undecomposed harmful residues and the average error would expose consumers of the product to serious danger. This makes it necessary to determine the maximum (marginal) error. In the sampling method, the marginal error is set not as a specific value but as equal boundaries (intervals) on either side of the estimate.

Determining the limits of the marginal error is based on the distribution of specific errors. For so-called large samples, with more than 30 units ($n > 30$), specific errors follow the normal distribution law; for small samples ($n \le 30$) specific errors follow the Gosset (Student) distribution law. For specific errors of the sample mean, the normal distribution function has the form $f(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$, where $f(t)$ is the probability density of the occurrence of particular values of $t$, with $t = \frac{\bar{x} - \bar{x}_0}{\mu}$, where $\bar{x}$ is the sample mean, $\bar{x}_0$ is the general mean, and $\mu$ is the average error of the sample mean. Since the average error ($\mu$) is a constant, the specific errors, expressed in units of the average error (the so-called normalized deviations), are distributed according to the normal law.

By integrating the normal distribution function, we can establish the probability that the error lies within a certain interval of t and the probability that it falls outside this interval (the opposite event). For example, the probability that the error does not exceed half the average error (in either direction from the general mean) is 0.3829; that it lies within one average error, 0.6827; within two average errors, 0.9545; and so on.

The relationship between the level of probability and the interval of change t (and, ultimately, the interval of change of error) allows us to approach the determination of the interval (or limits) of the maximum error, linking its value with the probability of occurrence. The probability of occurrence is the probability that the error will be in some interval. The probability of occurrence will be “confidence” if the opposite event (the error will be outside the interval) has such a probability of occurrence that can be neglected. Therefore, the confidence level of probability is set, as a rule, at least 0.90 (the probability of the opposite event is 0.10). The more negative consequences the occurrence of errors outside the established interval has, the higher the confidence level of probability should be (0.95; 0.99; 0.999 and so on).

Having chosen the confidence level of probability, one finds the corresponding value of t from the table of the probability integral of the normal distribution and then determines the interval of the marginal error from the expression $\Delta = t \cdot \mu$. The meaning of the value obtained is as follows: at the accepted confidence level of probability, the marginal error of the sample mean will not exceed $\Delta$.

To establish the limits of the maximum error based on large samples for other estimates (variance, standard deviation, proportion, and so on), the approach discussed above is used, taking into account the fact that a different algorithm is used to determine the average error for each estimate.

As for small samples ($n \le 30$), the distribution of estimation errors in this case, as already mentioned, follows Student's t-distribution. The peculiarity of this distribution is that, along with the error, it has as a parameter the sample size, or more precisely the number of degrees of freedom. As the sample size increases, Student's t-distribution approaches the normal distribution, and for $n > 30$ the two practically coincide. Comparing the values of Student's t with the normal t at the same confidence level, we find that Student's t is always larger, and the difference grows as the sample size decreases and as the confidence level increases. Consequently, small samples give wider limits of the marginal error than large samples, and these limits widen as the sample size decreases and the confidence level increases.
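The comparison between Student's t and the normal t can be reproduced without statistical tables. The sketch below is one possible approach using only the standard library: it integrates the Student density numerically (Simpson's rule) and inverts the result by bisection. The helper names, the step count and the search bracket of 0 to 100 are assumptions that suit the moderate confidence levels and degrees of freedom shown.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """CDF for x >= 0: 0.5 plus Simpson's rule integral of the density on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(confidence: float, df: int) -> float:
    """Two-sided critical value found by bisection on [0, 100]."""
    target = (1 + confidence) / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


z = NormalDist().inv_cdf(0.975)  # normal value for P = 0.95
for df in (4, 9, 29):
    print(df, round(t_quantile(0.95, df), 3), round(z, 3))
```

For 9 degrees of freedom and P = 0.95 the sketch returns a value close to the tabulated 2.262, compared with 1.96 for the normal distribution.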