
What is the mean sampling error? Determining the sample size

The concept of sample observation.

Statistical observation can be carried out in two ways: complete observation, which covers all units of the population, and sample (partial) observation.

Sample observation is a research method in which the summary indicators of a population are established from a part of it, selected at random.

In sample observation, a relatively small part of the entire population (5-10%) is examined.

The entire population to be surveyed is called the general population.

The part of the units selected from the general population for the survey is called the sample population, or sample.

Indicators characterizing the general and sample populations:

1) Share of an alternative characteristic.

In the general population, the proportion of units possessing an alternative characteristic is denoted by the letter “P”.

In the sample population, the proportion of units possessing an alternative characteristic is denoted by the letter “w”.

2) Average value of a characteristic.

In the general population, the average value of a characteristic is denoted by $\bar{x}$ (general mean).

In the sample population, the average value of a characteristic is denoted by $\tilde{x}$ (sample mean).

Definition of sampling error.

Sample observation is based on the principle of equal opportunity for units of the general population to fall into the sample. This avoids systematic observation errors. However, due to the fact that the population under study consists of units with varying characteristics, the composition of the sample may differ from the composition of the general population, causing discrepancies between the general and sample characteristics.

Such discrepancies are called representativeness errors or sampling errors.

Determining sampling error is the main problem solved during sample observation.

In mathematical statistics, it is proven that the average sampling error is determined by the formula:

$$\mu = \sqrt{\frac{\sigma_0^2}{n}} \quad (1)$$

where μ is the average sampling error;

$\sigma_0^2$ – variance of the general population;

n – number of units in the sample population.

In practice, the variance of the sample population $\sigma^2$ is used to determine the average sampling error.

The general and sample variances are related by the equality:

$$\sigma_0^2 = \sigma^2 \cdot \frac{n}{n-1} \quad (2)$$

From formula (2) it is clear that the general variance exceeds the sample variance by the factor $\frac{n}{n-1}$. However, with a sufficiently large sample size this ratio is close to unity, so we can write

$$\mu = \sqrt{\frac{\sigma^2}{n}} \quad (3)$$

However, this formula for the average sampling error applies only to repeated sampling (sampling with replacement).

In practice, non-repeated selection (sampling without replacement) is usually used, and the average sampling error is calculated slightly differently, since the number of units remaining in the general population decreases as the sample is drawn:

$$\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)} \quad (4)$$

where n is the size of the sample population;

N – size of the general population;

$\sigma^2$ – sample variance.
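The two variants of the average error of the mean can be compared with a short sketch. The function name and the numbers below are hypothetical and chosen only for illustration; the code simply evaluates the repeated-selection formula and, when the population size is supplied, multiplies by the factor (1 − n/N).

```python
import math

def mean_sampling_error(variance, n, N=None):
    """Average sampling error of the mean.

    variance: sample variance (sigma^2)
    n:        sample size
    N:        size of the general population; if given, the
              non-repeated correction factor (1 - n/N) is applied
    """
    error_sq = variance / n
    if N is not None:
        error_sq *= 1 - n / N
    return math.sqrt(error_sq)

# Hypothetical numbers, not taken from the text.
repeated = mean_sampling_error(variance=4.0, n=100)
non_repeated = mean_sampling_error(variance=4.0, n=100, N=1000)

print(round(repeated, 4))      # 0.2
print(round(non_repeated, 4))  # 0.1897
```

With a 10% sample the correction lowers the error by about 5%, which illustrates why the non-repeated error is always the smaller of the two.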

For the share of an alternative characteristic, the average sampling error with non-repeated selection is determined by the formula:

$$\mu = \sqrt{\frac{w(1-w)}{n}\left(1 - \frac{n}{N}\right)} \quad (5)$$

where

w(1 − w) is the variance of the alternative characteristic in the sample;

w is the share of the alternative characteristic in the sample population.

With repeated selection, the average error of the share of an alternative characteristic is determined using a simplified formula:

$$\mu = \sqrt{\frac{w(1-w)}{n}} \quad (6)$$

If the sample does not exceed 5% of the general population, the average error of the sample share and of the sample mean is determined using the simplified formulas (3) and (6).
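A minimal sketch of why the simplification is acceptable for small sampling fractions: at n/N = 5% the correction factor changes the error of the share by only about 2.5%. The function name and the input values are assumptions made for illustration.

```python
import math

def share_sampling_error(w, n, N=None):
    """Average sampling error of a share w; with N given, applies (1 - n/N)."""
    error_sq = w * (1 - w) / n
    if N is not None:
        error_sq *= 1 - n / N
    return math.sqrt(error_sq)

# Hypothetical sample: share w = 0.2 in n = 400 units drawn from N = 8000 (5%).
simple = share_sampling_error(0.2, 400)
corrected = share_sampling_error(0.2, 400, N=8000)
ratio = corrected / simple  # equals sqrt(1 - 0.05)

print(round(simple, 4), round(corrected, 4), round(ratio, 4))
```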

The average error of the sample mean and of the sample share is needed to establish the possible values of the general mean ($\bar{x}$) and the general share (P) from the sample mean ($\tilde{x}$) and the sample share (w).

The limits within which the general mean lies are determined by the formula:

$$\bar{x} = \tilde{x} \pm \mu_x \quad (7)$$

For the general share, this interval can be written as:

$$P = w \pm \mu_w \quad (8)$$

The general share and general mean obtained in this way differ from the sample share and sample mean by at most μ. However, this cannot be guaranteed with complete certainty, only with a certain degree of probability.

In mathematical statistics, it is proven that the general and sample means differ by no more than μ only with a probability of 0.683. Consequently, in only 683 cases out of 1000 does the general mean lie within $\bar{x} = \tilde{x} \pm \mu_x$; in the remaining cases it falls outside these limits.
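The figure of 0.683 can be checked approximately by simulation. The sketch below is only an illustration under stated assumptions: the general population is a hypothetical set of 10,000 uniformly distributed values, selection is repeated (with replacement), and the seed, sample size and number of trials are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so that the run is repeatable

# Hypothetical general population of 10,000 values.
population = [random.uniform(0, 10) for _ in range(10_000)]
general_mean = statistics.fmean(population)
general_sd = statistics.pstdev(population)

n = 100
mu = general_sd / math.sqrt(n)  # average error for repeated selection

trials = 2_000
hits = 0
for _ in range(trials):
    sample = random.choices(population, k=n)  # selection with replacement
    if abs(statistics.fmean(sample) - general_mean) <= mu:
        hits += 1

coverage = hits / trials  # fraction of samples within one average error
print(round(coverage, 3))
```

The printed fraction should be close to 0.683, with some scatter from run to run if the seed is changed.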

The reliability of the conclusions can be increased by widening the limits of deviation, taking as the measure the average sampling error multiplied by t.

The factor t is called the confidence coefficient. It is chosen according to the confidence level with which the research results must be guaranteed.

The mathematician A.M. Lyapunov calculated various values of t, which are usually given in ready-made tables.

The main advantage of sample observation compared with other methods is that the random sampling error can be calculated.

Sampling errors can be systematic or random.

Systematic errors arise when the basic principle of sampling, randomness, is violated.

Random errors arise because the structure of the sample population always differs from the structure of the general population, however correctly the selection is made. That is, even when units are selected at random, discrepancies remain between the characteristics of the sample and of the general population. Studying and measuring random errors of representativeness is the main task of the sampling method.

As a rule, the error of the mean and the error of the share are calculated. The following notation is used:

$\bar{x}$ – mean calculated for the general population;

$\tilde{x}$ – mean calculated for the sample population;

p – share of the given group in the general population;

w – share of the given group in the sample population.

In this notation, the sampling error is the difference $\tilde{x} - \bar{x}$ for the mean and $w - p$ for the share.

The sample mean and sample share are random variables that can take different values depending on which units of the population fall into the sample. Sampling errors are therefore also random variables and can take different values. For this reason the average of the possible errors, μ, is determined.

Unlike systematic error, random error can be determined in advance, before sampling, according to limit theorems considered in mathematical statistics.

The average error corresponds to a probability of 0.683. For any other probability, one speaks of the marginal error.

The average sampling error for the mean and for the share is defined as follows:

$$\mu_{\bar{x}} = \sqrt{\frac{\sigma^2}{n}}, \qquad \mu_w = \sqrt{\frac{p(1-p)}{n}}$$

In these formulas, the variance $\sigma^2$ and the share p are characteristics of the general population, which are unknown during sample observation. In practice, they are replaced by the corresponding characteristics of the sample population on the basis of the law of large numbers, according to which a sufficiently large sample reproduces the characteristics of the general population accurately.

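Formulas for determining the average error for various selection methods:

Selection method | Repeated: error of the mean | Repeated: error of the share | Non-repeated: error of the mean | Non-repeated: error of the share
Properly random and mechanical | $\sqrt{\frac{\sigma^2}{n}}$ | $\sqrt{\frac{w(1-w)}{n}}$ | $\sqrt{\frac{\sigma^2}{n}\left(1-\frac{n}{N}\right)}$ | $\sqrt{\frac{w(1-w)}{n}\left(1-\frac{n}{N}\right)}$
Typical | $\sqrt{\frac{\overline{\sigma_i^2}}{n}}$ | $\sqrt{\frac{\overline{w(1-w)}}{n}}$ | $\sqrt{\frac{\overline{\sigma_i^2}}{n}\left(1-\frac{n}{N}\right)}$ | $\sqrt{\frac{\overline{w(1-w)}}{n}\left(1-\frac{n}{N}\right)}$
Serial | $\sqrt{\frac{\delta^2}{r}}$ | $\sqrt{\frac{\delta_w^2}{r}}$ | $\sqrt{\frac{\delta^2}{r}\left(1-\frac{r}{R}\right)}$ | $\sqrt{\frac{\delta_w^2}{r}\left(1-\frac{r}{R}\right)}$

μ – average error;

Δ – marginal error;

n – sample size;

N – size of the general population;

$\sigma^2$ – total variance;

w – share of the given category in the total sample size;

$\overline{\sigma_i^2}$ – average of the within-group variances;

$\delta^2$ – intergroup variance (for the share, $\delta_w^2$ is the intergroup variance of the share);

r – number of series in the sample;

R – total number of series.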


The marginal error for all sampling methods is related to the average sampling error as follows:

$$\Delta = t \cdot \mu$$

where t is the confidence coefficient, functionally related to the probability with which the value of the marginal error is guaranteed. Depending on the probability, the confidence coefficient t takes the following values:

t      P
1.0    0.683
1.5    0.866
2.0    0.954
2.5    0.988
3.0    0.997
4.0    0.9999
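The tabulated probabilities can be reproduced, to rounding, from the normal distribution, for which P(|Z| ≤ t) = erf(t/√2). The sketch below assumes that this normal (Laplace) approximation is the one behind the table; the function name is hypothetical.

```python
import math

def confidence_probability(t):
    """P(|Z| <= t) for a standard normal Z, computed as erf(t / sqrt(2))."""
    return math.erf(t / math.sqrt(2))

table = {t: confidence_probability(t) for t in (1.0, 1.5, 2.0, 2.5, 3.0, 4.0)}
for t, p in table.items():
    print(f"{t:.1f}  {p:.4f}")
```

The values agree with the table up to the rounding used there (for example 0.9545 for t = 2 and 0.9973 for t = 3).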

For example, let the confidence probability be 0.683. This means that the general mean differs from the sample mean in absolute value by no more than μ with a probability of 0.683. If $\tilde{x}$ is the sample mean and $\bar{x}$ is the general mean, then $|\tilde{x} - \bar{x}| \le \mu$ with probability 0.683.

If we want the conclusions to hold with a greater probability, we thereby widen the limits of the random error.

Thus, the magnitude of the marginal error depends on the following quantities:

Variability of the characteristic (direct relationship), measured by the variance;

Sample size (inverse relationship);

Confidence probability (direct relationship);

Selection method.

An example of calculating the error of the mean and the error of the proportion.

To determine the average number of children in a family, 100 families were selected from 1000 families using a random non-repetitive sampling method. The results are shown in the table:

Determine:

- with a probability of 0.997, the marginal sampling error and the boundaries within which the average number of children in a family lies;

- with a probability of 0.954, the boundaries within which the proportion of families with two children lies.

1. Determine the marginal error of the mean with a probability of 0.997. To simplify the calculations, we use the method of moments:

P = 0.997, t = 3

μ is the average error of the mean; Δ = t·μ = 0.116 is the marginal error.

$2.12 - 0.116 \le \bar{x} \le 2.12 + 0.116$

$2.004 \le \bar{x} \le 2.236$

Therefore, with a probability of 0.997, the average number of children in a family in the general population, that is, among 1000 families, is in the range 2.004 - 2.236.

Marginal error is the maximum possible discrepancy between the sample and general means (or shares) for a given probability of its occurrence.

1. The marginal sampling error for the mean with repeated selection is calculated using the formula:

$$\Delta_x = t \cdot \mu_x = t\sqrt{\frac{\sigma^2}{n}}$$

where t is the normalized deviation, the “confidence coefficient”, which depends on the probability with which the marginal sampling error is guaranteed;

$\mu_x$ – average sampling error.

2. The marginal sampling error for the share with repeated selection is determined by the formula:

$$\Delta_w = t\sqrt{\frac{w(1-w)}{n}}$$

3. The marginal sampling error for the mean with non-repeated selection:

$$\Delta_x = t\sqrt{\frac{\sigma^2}{n}\left(1-\frac{n}{N}\right)}$$

The relative marginal sampling error is defined as the percentage ratio of the marginal sampling error to the corresponding characteristic of the sample population:

$$\Delta_{\%} = \frac{\Delta_x}{\tilde{x}} \cdot 100\%$$
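As an illustration, the marginal and relative marginal errors can be computed for the earlier example on the number of children in a family, where the sample mean is 2.12 and the marginal error at t = 3 is 0.116. The function names are hypothetical, and the average error is backed out from those two published figures rather than from the original data table, which is not reproduced here.

```python
def marginal_error(t, mu):
    """Marginal error: t times the average sampling error."""
    return t * mu

def relative_marginal_error(delta, sample_value):
    """Marginal error as a percentage of the sample characteristic."""
    return delta / sample_value * 100

t = 3                # confidence coefficient for P = 0.997
sample_mean = 2.12   # from the worked example
mu = 0.116 / t       # average error implied by the stated marginal error

delta = marginal_error(t, mu)
relative = relative_marginal_error(delta, sample_mean)
lower, upper = sample_mean - delta, sample_mean + delta

print(round(delta, 3), round(relative, 2))
print(round(lower, 3), round(upper, 3))
```

The relative marginal error comes out at roughly 5.5% of the sample mean.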

Small sample

Small sample theory was developed by the English statistician Student at the beginning of the 20th century. In 1908, he identified a special distribution that makes it possible to relate t and the confidence probability F(t) even in small samples. For n greater than 100, the tables of Student's distribution give the same results as the tables of the Laplace probability integral; for 30 < n < 100 the differences are negligible. Therefore, in practice, samples of fewer than 30 units are treated as small samples.

The marginal sampling error is equal to t times the average sampling error:

$$\Delta = t \cdot \mu$$

μ – average sampling error, calculated with the correction factor in the case of non-repeated selection;

t – confidence coefficient, found for a given probability level. For example, for P = 0.997 the table of values of the Laplace integral function gives t = 3.

The magnitude of the marginal sampling error can be established only with a certain probability. The probability that the error equals or exceeds three times the average sampling error is extremely small: 0.003 (1 − 0.997). Such unlikely events are considered practically impossible, so the probability that the difference exceeds three times the average error defines the error level and amounts to no more than 0.3%.

Determination of the maximum sampling error for shares

Condition:

From a batch of finished products, 200 quintals were selected by properly random non-repeated selection, of which 8 quintals were spoiled. Can we assume with probability 0.954 that production losses will not exceed 5% if the sample is 1:20 of the size of the batch?

Given:

  • n = 200 quintals – sample size (sample population)
  • m = 8 quintals – amount of spoiled products
  • n:N = 1:20 – selection proportion, where N is the size of the general population
  • P = 0.954 – probability

Determine: ∆ω < 5% (whether it is consistent that production losses will not exceed 5%)

Solution:

1. Determine the sample share, that is, the share of spoiled products in the sample population:

ω = m / n = 8 / 200 = 0.04

2. Determine the size of the general population:

N = n · 20 = 200 · 20 = 4000 (quintals) – quantity of all products.

3. Determine the marginal sampling error for the share of products with the given characteristic, i.e. for the share of spoiled products: Δ = t·μ, where μ is the average error of the share possessing the alternative characteristic, calculated with the correction for non-repeated selection, and t is the confidence coefficient, found for the given probability level P = 0.954 from the table of values of the Laplace integral function: t = 2. Here $\mu = \sqrt{\frac{\omega(1-\omega)}{n}\left(1-\frac{n}{N}\right)} = \sqrt{\frac{0.04 \cdot 0.96}{200} \cdot 0.95} \approx 0.0135$, so ∆ω = 2 · 0.0135 ≈ 0.027.

4. Determine the limits of the confidence interval for the share of the alternative characteristic in the general population, i.e. what share of spoiled products there will be in the total volume. Since the share of spoiled products in the sample is ω = 0.04, then, taking into account the marginal error ∆ω = 0.027, the general share of the alternative characteristic (p) takes the following values:

ω-∆ ω < p < ω+∆ ω

0.04-0.027< p < 0.04+0.027

0.013 < p < 0.067

Conclusion: with probability P = 0.954 it can be stated that the share of spoiled products in the total volume will not go beyond the interval found (not less than 1.3% and not more than 6.7%). However, there remains a possibility that the share of spoiled products exceeds 5%, reaching up to 6.7%, which is not consistent with the statement ∆ω < 5%.
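The arithmetic of this problem can be retraced in a few lines. The sketch uses only the figures given in the condition and assumes the non-repeated formula for the error of the share; the variable names are illustrative.

```python
import math

n, m = 200, 8   # sample size and spoiled amount, quintals
N = n * 20      # general population: 4000 quintals
t = 2           # confidence coefficient for P = 0.954

w = m / n                                       # sample share of spoiled products
mu = math.sqrt(w * (1 - w) / n * (1 - n / N))   # average error, non-repeated selection
delta = t * mu                                  # marginal error
lower, upper = w - delta, w + delta

print(round(w, 2), round(mu, 4), round(delta, 3))  # 0.04 0.0135 0.027
print(round(lower, 3), round(upper, 3))            # 0.013 0.067
print(upper <= 0.05)                               # False: 5% is not guaranteed
```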

*******

Condition:

The store manager knows from experience that 25% of customers entering the store make purchases. Let's assume that 200 customers enter the store.

Define:

  1. share of buyers who made purchases
  2. sample fraction variance
  3. standard deviation of sample fraction
  4. the probability that the sample proportion will be between 0.25 and 0.30

Solution:

We take the sample share (ω) as the general share (p) and determine the upper limit of the confidence interval.
Knowing the critical point (by the condition, the sample share lies in the range 0.25-0.30), we construct a one-sided (right-sided) critical region.
Using the table of values of the Laplace integral function, we find Z.
The same case can be treated as repeated selection, provided that the same customer, having made no purchase the first time, returns and makes a purchase.

If the sample is treated as non-repeated, the average error must be adjusted by a correction factor. Then, with the corrected values of the marginal error for the sample share substituted when determining the critical region, Z and P will change.
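A minimal numerical sketch of the four quantities asked for, assuming repeated selection (no correction factor, since the size of the general population is not given) and the normal approximation for the distribution of the sample share. The helper `normal_cdf` is a hypothetical name introduced here.

```python
import math

p = 0.25   # share of customers who make a purchase
n = 200    # number of customers entering the store

variance = p * (1 - p) / n   # variance of the sample share
sd = math.sqrt(variance)     # standard deviation of the sample share

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z_low = (0.25 - p) / sd
z_high = (0.30 - p) / sd
prob = normal_cdf(z_high) - normal_cdf(z_low)   # P(0.25 <= w <= 0.30)

print(round(variance, 7), round(sd, 4), round(z_high, 2), round(prob, 3))
```

Under these assumptions Z is about 1.63 and the probability is a little under 0.45.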

Determination of the maximum sampling error for the average

According to data from 17 employees of a company employing 260 people, the average monthly salary was 360 USD, with s = 76 USD. What minimum amount must be deposited into the company's account in order to guarantee the payment of wages to all employees with probability 0.98?

Given:

  • n = 17 – sample size (sample population)
  • N = 260 – size of the general population
  • $\tilde{x}$ = 360 – sample mean
  • s = 76 – sample standard deviation
  • P = 0.98 – confidence probability

Determine: the minimum acceptable value of the general mean (lower limit of the confidence interval).

To characterize the reliability of sample indicators, a distinction is made between the average and the marginal sampling error, which are specific to sample observation. These indicators reflect the difference between the sample indicators and the corresponding general indicators.

The average sampling error is determined primarily by the sample size and also depends on the structure and degree of variation of the characteristic being studied.

The meaning of the average sampling error is as follows. The calculated values of the sample share (w) and the sample mean ($\tilde{x}$) are random variables. They can take different values depending on which specific units of the population fall into the sample. For example, if, when determining the average age of an enterprise's employees, one sample includes more young people and another more older workers, then the sample means and sampling errors will differ. The average sampling error is determined by the formula:

$$\mu = \sqrt{\frac{\sigma^2}{n}} \quad (27)$$

or, equivalently, for repeated sampling

$$\mu = \frac{\sigma}{\sqrt{n}} \quad (28)$$

Where: μ – average sampling error;

σ – standard deviation of the characteristic in the general population;

n – sample size.

The magnitude of the error μ shows how much the average value of the attribute established in the sample differs from the true value of the attribute in the general population.

It follows from the formula that the sampling error is directly proportional to the standard deviation and inversely proportional to the square root of the number of units in the sample. This means, for example, that the greater the spread of the values of the characteristic in the population, that is, the greater the variance, the larger the sample must be if we want to trust the results of a sample survey. Conversely, with low variance a small sample is sufficient, and the sampling error will still be within acceptable limits.
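The square-root dependence can be made concrete with a small sketch: quadrupling the sample size halves the error, while doubling the standard deviation doubles it. The value of σ and the sample sizes are hypothetical.

```python
import math

def average_error(sigma, n):
    """Average sampling error for repeated sampling: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 10.0  # hypothetical standard deviation of the characteristic
errors = {n: average_error(sigma, n) for n in (25, 100, 400)}
doubled_spread = average_error(2 * sigma, 100)

for n, e in errors.items():
    print(n, e)            # 25 2.0, 100 1.0, 400 0.5
print(doubled_spread)      # 2.0
```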

Since with non-repeated sampling the size of the general population N decreases during sampling, an additional factor $\left(1 - \frac{n}{N}\right)$ is included in the formula for the average sampling error. The formula takes the following form:

$$\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)} \quad (29)$$

The average error is smaller for non-repetitive sampling, which determines its wider use.

For practical conclusions, the general population must be characterized from the sample results. Sample means and shares are extended to the general population, taking into account the limit of their possible error and the probability level that guarantees it. Once a specific probability level is set, the value of the normalized deviation is selected and the marginal sampling error is determined.

The reliability (confidence probability) of an estimate of X by X* is the probability γ with which the inequality

|X − X*| < δ, (30)

holds, where δ is the marginal sampling error, characterizing the half-width of the interval in which, with probability γ, the value of the population parameter under study lies.

A confidence interval is the interval (X* − δ; X* + δ) that covers the parameter X under study (that is, the value of X lies within this interval) with a given reliability γ.

Typically, the reliability of the estimate is specified in advance, and a number close to one is taken as γ: 0.95; 0.99 or 0.999.

The marginal error δ is related to the average error μ by the following relation:

δ = t · μ, (31)

where t is the confidence coefficient, which depends on the probability P with which it can be stated that the marginal error δ will not exceed t times the average error μ (its values are also called critical points or quantiles of the Student distribution).

As follows from relation (31), the marginal error is directly proportional to the average sampling error and to the confidence coefficient, which depends on the given level of reliability of the estimate.

From the formula for the average sampling error and the relation between the marginal and average errors, we obtain:

$$\delta = t \cdot \frac{\sigma}{\sqrt{n}}$$

Taking into account the confidence probability, this formula will take the form: