Monday, May 08, 2006

Stats 2

2.1 Describing the Shape of a Distribution

Stem-and-leaf display -- a stem-and-leaf display takes a series of real numbers and uses the leading (whole-number) part of each as the base of the graph (the stem) and the decimal part as the leaf, so that the leaves on each stem build up a histogram-like row. The leaves can be sorted or left unsorted, as you choose. A more detailed graph can be made by splitting each stem into halves: for example, put all the values below 5.5 on one line and all the values from 5.5 up on a second line. The display shows the shape of the data, so it looks symmetrical only when the data themselves are symmetric. In some cases, when the values are whole numbers, such as days, a 0 is used as the leaf in place of the decimal value.
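As a rough illustration of the idea, here is a minimal Python sketch that builds a sorted stem-and-leaf display. The function name stem_and_leaf and the sample values are made up for this example, and the sketch assumes non-negative numbers with one decimal place.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group one-decimal numbers by whole part (stem) with sorted decimal digits (leaves).

    Assumption: values are non-negative and have at most one decimal place.
    """
    display = defaultdict(list)
    for v in values:
        tenths = round(v * 10)            # e.g. 5.7 -> 57
        stem, leaf = divmod(tenths, 10)   # 57 -> stem 5, leaf 7
        display[stem].append(leaf)
    # Sort the stems, and sort the leaves within each stem.
    return {stem: sorted(leaves) for stem, leaves in sorted(display.items())}

# Hypothetical measurements, for illustration only.
for stem, leaves in stem_and_leaf([5.1, 5.7, 6.2, 6.2, 7.0, 5.3]).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 5 | 1 3 7
# 6 | 2 2
# 7 | 0
```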

The count of the number of measurements in a class defined by a stem is called the frequency of the class. A stem-and-leaf display works well with smaller samples, so for a larger set of data we group the measurements into the classes of a frequency distribution. To start with, we need to divide the data into classes. The class length is determined by the following formula:

Class length = (largest measurement - smallest measurement) / k

where k is the number of classes: the smallest whole number such that 2 raised to the power of k is larger than the total number of entries. The class length may need to be rounded. We determine the frequency of a class by counting the measurements that fall in it. To determine the relative frequency of a class, take its frequency and divide by the total number of entries. A list of all classes along with each class's relative frequency is called a relative frequency distribution.
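The steps above can be sketched in Python as follows. This is only a sketch: the function names and the sample data are made up for illustration, the class length is left unrounded, and the largest measurement is placed in the last class.

```python
def number_of_classes(n):
    """Smallest whole number k such that 2**k is larger than n."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

def frequency_distribution(data):
    """Return (class length, frequencies, relative frequencies)."""
    n = len(data)
    k = number_of_classes(n)
    low, high = min(data), max(data)
    length = (high - low) / k
    if length == 0:
        raise ValueError("all measurements are equal; no classes to form")
    counts = [0] * k
    for x in data:
        # Index of the class containing x; the maximum goes in the last class.
        i = min(int((x - low) / length), k - 1)
        counts[i] += 1
    return length, counts, [c / n for c in counts]

# Hypothetical measurements, for illustration only.
length, freqs, rel_freqs = frequency_distribution([1, 2, 2, 3, 5, 6, 6, 7, 9])
print(length, freqs)  # 2.0 [3, 1, 3, 2]
```

The frequencies always add up to the number of entries, and the relative frequencies add up to 1.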

We use the frequencies to create a histogram.

Though it is sometimes best to calculate the class break points, it can also be a good idea to choose the breaks based on some quality of the data. For example, a city population might be broken up by age in groups of 10 years. Classes should normally have equal lengths. Sometimes, however, unequal lengths are needed, as with open-ended classes, which have no lower or no upper limit.

When we create these graphs we will often notice a pattern, the most common being a bell-shaped curve. A symmetric, bell-shaped curve is called a normal curve, and data that follow it have a normal distribution. A curve can also have a long tail to the right or to the left, in which case it is not symmetric but skewed.

Stem-and-leaf displays and dot plots help to find outliers, observations that lie far from the rest of the data. It is also possible to have a curve with two peaks (a double bell curve).

2.2 Describing Central Tendency

We also look at the central tendency, the representation of the center or the middle of the data. An important measure here is the population mean, which is the average of the population measurements. It is signified by the Greek letter μ and pronounced "mew". The population mean is a population parameter, a descriptive measure of the population. A point estimate is a one-number estimate of the value of a population parameter. It is usually an educated guess based on a sample statistic. A sample statistic is a descriptive measure of a sample.

To estimate the population mean we use the sample mean, represented by an x with a bar over it and pronounced "x bar". The sample size is designated by the letter n. So:

xBar = (x1 + x2 + x3 + ... + xn) / n

The sample mean will not usually equal the population mean, but it will put you in the ballpark.

Another term to look at is the median. To find the median, arrange the data in order (from low to high) and then find the center of it. If there is an odd number of data items, the median is the middle value. If the number is even, add the two central values and divide by two.

The third term to look at is the mode. The mode is the measurement that occurs most frequently. It is signified by the designation Mo.
It is possible to have two or more modes. If two exist the data are called bimodal, and if more than two exist they are called multimodal. If we represent the data in a histogram, the class with the highest frequency is called the modal class.
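The three measures can be checked with Python's standard statistics module. The data below are made up for illustration. multimode returns every value that ties for most frequent, so it covers the bimodal and multimodal cases as well.

```python
import statistics

# Hypothetical sample, for illustration only.
sample = [2, 3, 3, 5, 7, 10]

x_bar = statistics.mean(sample)        # (2 + 3 + 3 + 5 + 7 + 10) / 6
median = statistics.median(sample)     # even count: average of the two central values
modes = statistics.multimode(sample)   # list of all most-frequent values

print(x_bar, median, modes)  # 5 4.0 [3]

# A bimodal example: two values tie for most frequent.
print(statistics.multimode([1, 1, 2, 4, 4]))  # [1, 4]
```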

If the mean, median and mode are all equal, they sit under the same central point of the bell curve and the distribution is said to be symmetrical. It is also possible for the curve to have a tail to the left or to the right. If the tail is long, the extreme values in it can strongly affect the results, the mean in particular. In that case the three measures no longer agree, so you must choose the one that suits your need; you cannot use them interchangeably.

2.3 Measures of Variation
Range, Variance and Standard Deviation

Besides central tendency, we also need measures of variation. The simplest is the range, the difference between the largest and the smallest measurements. Though easy to calculate, it is not a very useful measurement because it depends only on the two extreme values. It is more useful to measure how far the individual values lie from the mean, which later lets us state what percentage of the values fall within a given distance of it. For this we use the population variance and the population standard deviation.

Population Variance (σ²), pronounced "sigma squared", is the average of the squared deviations of the individual population measurements from the population mean.

Population Standard Deviation (σ), pronounced "sigma", is the positive square root of the population variance.

We square the deviations because -2 is just as far away from the center as +2 is. Squaring removes the sign, so it gives the same result as squaring the absolute values.

When a population gets too large to measure, we estimate these values with the sample variance and the sample standard deviation. In this case we divide by the sample size minus one (n - 1 instead of n) to get a better, though slightly larger, estimate.
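The difference between the two divisors can be seen in a short Python sketch. The function name variance, its sample flag, and the data are made up for illustration; the standard library's statistics.pvariance and statistics.variance compute the same two quantities.

```python
import math

def variance(data, sample=False):
    """Average squared deviation from the mean.

    Divides by n for a population, or by n - 1 when `sample` is True.
    """
    n = len(data)
    mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    return squared_deviations / (n - 1 if sample else n)

# Hypothetical measurements, for illustration only (mean is 5).
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = variance(data)                 # 32 / 8
pop_std = math.sqrt(pop_var)             # positive square root
samp_var = variance(data, sample=True)   # 32 / 7, slightly larger

print(pop_var, pop_std)  # 4.0 2.0
```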

Empirical Rule for a normally distributed population
1. 68.26% of the population measurements are within one standard deviation of the mean
2. 95.44% are within two standard deviations of the mean
3. 99.73% are within three standard deviations of the mean

In general, avoid rounding during calculations until you reach the end, as rounding early can cause errors in the results.

The empirical rule also holds approximately for populations that are not exactly normal, as long as the distribution has a single mound and is not skewed very much to the right or the left.

Also, when the empirical rule does not hold, we can use Chebyshev's theorem, which states:

For any population with a mean and a standard deviation, and for any value of k greater than 1, at least 100(1 - 1/k^2)% of the population measurements will lie in the interval of the mean plus or minus k times the standard deviation.
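To get a feel for how loose the Chebyshev bound 100(1 - 1/k^2)% is, the sketch below compares it with the percentage for a normal population, computed from the error function in Python's math module. The function names are made up for illustration. The normal percentages computed this way can differ in the last digit from figures read off a rounded table.

```python
import math

def chebyshev_min_percent(k):
    """Minimum percentage within k standard deviations of the mean (any population)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 100 * (1 - 1 / k ** 2)

def normal_percent_within(k):
    """Percentage of a normal population within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))

for k in (2, 3):
    print(k, chebyshev_min_percent(k), round(normal_percent_within(k), 2))
```

For k = 2 the theorem guarantees only 75%, while a normal population has about 95% within the same interval.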

While the theorem is useful, unless the standard deviation is quite small the interval it gives is wide and does not provide very helpful information. It mainly helps with curves that are not mound-shaped, or that are skewed too far to the left or the right for the empirical rule to apply.

For any value x in a population we can determine its z-score by subtracting the mean from it and dividing by the standard deviation. The sign indicates whether the value is above (positive) or below (negative) the mean, and the number indicates how many standard deviations it is away from the mean.
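A minimal Python sketch of the z-score calculation follows. The function name and the numbers (a mean of 100 and a standard deviation of 15) are made up for illustration.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev

# Hypothetical population with mean 100 and standard deviation 15.
print(z_score(130, 100, 15))  # 2.0  (two standard deviations above the mean)
print(z_score(85, 100, 15))   # -1.0 (one standard deviation below the mean)
```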

The coefficient of variation is the size of the standard deviation of a population or a sample relative to the population or sample mean.

Coefficient of variation = (standard deviation / mean) x 100

This is good for comparing groups that have different means and standard deviations.
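As a small illustration, the Python sketch below compares two made-up groups. The function name and all the numbers are hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the mean must be nonzero")
    return 100 * std_dev / mean

# Hypothetical groups: B has the larger standard deviation,
# but A varies more relative to its own mean.
cv_a = coefficient_of_variation(std_dev=5, mean=50)     # 10.0
cv_b = coefficient_of_variation(std_dev=10, mean=200)   # 5.0
print(cv_a, cv_b)
```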

2.4 Percentiles, Quartiles, and Box-and-Whiskers Displays

We now look at percentiles, starting with the pth percentile. If we take our measurements in increasing order, the pth percentile is a value such that p% of the measurements fall at or below the value and (100 - p)% fall at or above the value. In general, unless we use very high or very low percentiles, they are resistant to extreme values, which makes them a good choice when the data are highly skewed to the left or right. We often report the quartiles: the first quartile (Q1) is the 25th percentile, the second quartile is the 50th percentile, and the third quartile (Q3) is the 75th percentile. The second quartile is just another name for the median.

We often will describe a set of measurements with a five number summary:

the smallest measurement
the first quartile (Q1)
the median (Md)
the third quartile (Q3)
the largest measurement

The distance between Q1 and Q3 spans the middle 50% of the data and is known as the interquartile range (Q3 - Q1).
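The five-number summary and the interquartile range can be sketched with Python's statistics.quantiles. Textbooks and software use slightly different conventions for locating quartiles, so the "inclusive" method chosen here is an assumption, and the function name and data are made up for illustration.

```python
import statistics

def five_number_summary(data):
    """Return (smallest, Q1, median, Q3, largest).

    Assumption: quartiles follow Python's "inclusive" interpolation method;
    other quartile conventions can give slightly different values.
    """
    q1, md, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, md, q3, max(data)

# Hypothetical measurements, for illustration only.
smallest, q1, md, q3, largest = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
iqr = q3 - q1  # interquartile range: the spread of the middle 50%

print(smallest, q1, md, q3, largest, iqr)  # 1 3.0 5.0 7.0 9 4.0
```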

By knowing these five values we can draw a simple graph: a straight line running from the smallest to the largest measurement, a box from Q1 to Q3, and a line inside the box at the median, so we can see how the values lay out. A more detailed version of this is the box-and-whiskers display, sometimes called a box plot. Using Q1 and Q3 as well as Md we can do the following:

Draw a box from Q1 to Q3 that represents the middle 50% of the data.
Draw a line at the median.
We now define the inner and outer fences. We calculate the difference between Q3 and Q1 (the interquartile range). The inner fences are 1.5 times this distance below Q1 and above Q3. The outer fences are 3 times this distance below Q1 and above Q3.

The inner fences are now used to draw the whiskers: dotted lines running from the box toward the inner fences. The outer fences as well as the inner ones are used to help find outliers. Measurements between the inner and outer fences are mild outliers; measurements outside the outer fences are considered extreme outliers.
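The fence rules can be sketched in Python as follows. The function name, the quartile values and the data are made up for illustration.

```python
def classify_outliers(data, q1, q3):
    """Return (inner fences, outer fences, mild outliers, extreme outliers)."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    mild, extreme = [], []
    for x in data:
        if x < outer[0] or x > outer[1]:
            extreme.append(x)   # beyond an outer fence
        elif x < inner[0] or x > inner[1]:
            mild.append(x)      # between an inner and an outer fence
    return inner, outer, mild, extreme

# Hypothetical quartiles Q1 = 3 and Q3 = 7, so the interquartile range is 4.
inner, outer, mild, extreme = classify_outliers([1, 5, 9, 14, 25, -10], q1=3, q3=7)
print(inner, outer)   # (-3.0, 13.0) (-9, 19)
print(mild, extreme)  # [14] [25, -10]
```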

2.5 Describing Qualitative Data
Bar Charts and Pie Charts

When using categorical values we put things into categories. To display them in a chart we would normally use a pie chart or a bar chart.

Estimating Proportions
If p is the proportion of all population elements that are in a category that interests us, then the sample proportion ^p (read "p hat") is the proportion of sample elements that are in that category. It should be a reasonable estimate of p.

Pareto Charts
Pareto charts help identify quality problems and opportunities for process improvement. We divide problems into the vital few and the trivial many so we can focus on what needs to be focused on. For the most part, it pays to focus on the vital few. We create a bar chart with the defect categories in decreasing order of frequency. Any "other" category comes at the end. Sometimes a line chart is laid over the bars to show the cumulative percentage of the problems.
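The ordering and the cumulative percentages behind a Pareto chart can be sketched in Python. The function name, the defect categories and the counts are all made up for illustration; drawing the bars is left to whatever charting tool is at hand.

```python
def pareto_table(counts, other_label="other"):
    """Rows of (category, count, cumulative %) in Pareto order.

    Categories are sorted by decreasing count, with any "other" category last.
    """
    items = sorted(
        ((name, c) for name, c in counts.items() if name != other_label),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if other_label in counts:
        items.append((other_label, counts[other_label]))
    total = sum(c for _, c in items)
    rows, running = [], 0
    for name, c in items:
        running += c
        rows.append((name, c, 100 * running / total))
    return rows

# Hypothetical defect counts, for illustration only.
defects = {"dents": 30, "other": 8, "scratches": 45,
           "discoloration": 2, "misalignment": 15}
for row in pareto_table(defects):
    print(row)
```

In this made-up example the first two categories already account for 75% of the defects, which is the "vital few" pattern the chart is meant to expose.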

2.6 Using Scatter Plots to Study Relationships between Variables
Occasionally we will use regression analysis to determine a relationship between a dependent variable and one or more independent variables, such as the conditions set in an experiment. In order to do this we first make a scatter plot (also called a scatter diagram). The plotted results can be used to see whether the numbers fit a linear pattern, a curved pattern, or no pattern at all.

2.7 Misleading Graphs and Charts
When statistics are used, they can sometimes be used to mislead. This happens a lot when graphs are made and some things are adjusted to make the results look better for the person making the graph. Common tricks are adjusting the scale being used (especially leaving out part of it), and in bar charts adjusting the width of the bars to make the graph look more dramatic for larger numbers. Whenever possible, study the data being presented to make sure it says what you think it does.

2.8 Weighted Means and Grouped Data

At times we will need to adjust numbers because of differences in the weight of the information. For example, one region could have 20% unemployment and another region 50% unemployment. Unless we know the population of working adults in each region, these statistics cannot be combined meaningfully: simply averaging the two percentages treats the regions as if they were the same size. Instead, we multiply each region's percentage by its number of working adults, add these products, and divide by the total number of working adults across the regions. This weights the numbers and gives a proper result for the mean. Steps of a similar nature would be needed for the median and the mode.
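The weighted mean for the unemployment example can be sketched in Python. The 20% and 50% rates come from the example; the numbers of working adults (1,000 and 250) and the function name are made up for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

rates = [20, 50]            # unemployment rate (%) in each region
working_adults = [1000, 250]  # hypothetical region sizes

print(sum(rates) / len(rates))               # 35.0 (unweighted, misleading)
print(weighted_mean(rates, working_adults))  # 26.0 (weighted by region size)
```

The weighted result sits closer to 20% because the larger region counts for more.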

2.9 The Geometric Mean
The mean that we looked at earlier is the arithmetic mean. There is also a geometric mean. This comes in handy when we have things like returns on money over a period of time. For example, we have $10,000. In one year we increase it to $20,000, but by the end of the next year we are back to $10,000 again. The first year we make a 100% profit, the second year a 50% loss. The arithmetic mean of these returns is (100% + (-50%)) / 2, or 25%. We obviously made no money, so how do we get a 25% mean? The geometric mean is the nth root of the product of (1 + r1) through (1 + rn), minus one. Here that is the square root of (1 + 1.00)(1 - 0.50) = 1, minus one, which gives 0% and matches the fact that we made no money.
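The calculation for the $10,000 example can be sketched in Python. The function name is made up for illustration, and returns are written as decimals (1.0 for a 100% profit, -0.5 for a 50% loss).

```python
import math

def geometric_mean_return(returns):
    """nth root of the product of (1 + r) over all periods, minus one."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1

returns = [1.0, -0.5]  # +100% in year one, -50% in year two

arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(arithmetic)  # 0.25
print(geometric)   # 0.0
```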
