Math question for the high(er) math(s) folks

Started by scottmitchell74, August 14, 2017, 10:16:18 PM


scottmitchell74

I'm not quite sure how to ask this, so please bear with me.

When I'm doing a statistical survey, or I'm trying to find a class average, or establish my Bell Curve, how many points of data do I need to where getting any more wouldn't really matter?

If I ask 10 people, randomness means I'll get wild fluctuations. But if I ask 100 it'll be more accurate. 1,000 people, more accurate still.

At what point does this flatten out to where it doesn't matter if it's 1000, 10,000, a Million? Better yet, is there a chart I can look at? Because I don't know exactly how to ask this question, I'm struggling to find an academic article/chart that can help me.

Thanks smart Maths people!  :NGaugersRule:
Spend as little as possible on what you need so you can spend as much as possible on what you want.

scotsoft

I have asked a friend who is a statistician to see if he can help out. I may not get an answer till tomorrow.

Cheers John.

NeMo

Speaking as a general scientist (one time palaeontologist, with a zoology degree, geology PhD, and now teaching science in high school!) my understanding is that there isn't a simple answer to this. It will also depend on how you are sampling (e.g., how do you avoid bias). There is also the practicality and cost factor: if the sampling method is hard, time-consuming, or expensive, a smaller sample will be preferred, and over-sampling won't improve the result any, so isn't worth doing.

There are statistical tests you can do to establish your confidence that the answer you have is, for example, 95% likely NOT to be down to chance. I know about these mostly from the tests used in ecology, such as chi-squared, Pearson's correlation, and Student's t-test. If you're interested in these, I can send you my teaching slides and the Excel spreadsheets I use to show how they work. But you can find information on them in any college-level biology textbook, and probably in other fields where stats are important, such as psychology.
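
For anyone who wants to play along at home, here is a minimal sketch (Python with SciPy rather than the Excel sheets mentioned above, and with made-up times) of putting a 95% confidence interval around a mean:

```python
# A minimal sketch (not the spreadsheets mentioned above): a 95% confidence
# interval for a mean drill time, using made-up times in seconds.
from scipy import stats

times = [61.2, 58.9, 63.4, 60.1, 59.7, 62.8, 60.5, 61.9, 58.3, 60.0]

n = len(times)
mean = sum(times) / n
sem = stats.sem(times)  # standard error of the mean

# 95% interval from Student's t distribution with n - 1 degrees of freedom
low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f} s, 95% CI = ({low:.1f}, {high:.1f}) s")
```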

As a ballpark though, if you are taking unbiased samples, then 10% of the total population should be ample.

Cheers, NeMo
(Former NGS Journal Editor)

scottmitchell74

Nemo:

Thanks so much for the answer. Let me let you (and everyone else) know more specifically what I'm talking about.

On my fire department we have 183 men. We have physical fitness tests we have to take each year. I'm a member of the station that's tasked with being in charge of the fitness program and administering these tests. We want to establish a database of performance. So, if over 6 years 183 men take this timed test, I'll have 1,098 tests to look at. Once I break these down into age groups, I might have 350 tests from men 20-29, 450 from men 30-39, 200 from men 40-49 and the rest from men 50-59 (as an example, I don't know exactly how it'll break down).

So, with my 20-29 year-old fellas, with 350 tests, how comfortable will I be saying "The average time it takes a 20-29 y/o FF to do our Drill Tower test is X"? Is 350 tests enough to absorb outliers and abnormalities? At what point is it statistically pointless to have more tests? We'll continue adding to the database as years go by, but I'm wondering at what point a strong, reliable pattern will emerge. Thanks so much!
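
(Purely for illustration, here is a rough Python sketch, with invented numbers, of the kind of per-age-band breakdown described above. The field names and times are made up.)

```python
# Illustration only: grouping hypothetical test results into age bands and
# reporting an average per band. All names and numbers are invented.
from statistics import mean, median

records = [
    {"age": 24, "time": 61.2}, {"age": 31, "time": 64.8},
    {"age": 27, "time": 59.4}, {"age": 45, "time": 70.1},
    {"age": 52, "time": 75.6}, {"age": 38, "time": 66.3},
]

def age_band(age):
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"  # e.g. 20-29, 30-39

bands = {}
for r in records:
    bands.setdefault(age_band(r["age"]), []).append(r["time"])

for band, times in sorted(bands.items()):
    print(f"{band}: n={len(times)}, mean={mean(times):.1f}s, median={median(times):.1f}s")
```
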
Spend as little as possible on what you need so you can spend as much as possible on what you want.

Bob G

Scott

A bell curve extends out to plus and minus infinity, so you need a decent definition of when to stop taking samples, or you will literally go on forever.

If you have a bell curve,
+/- 1 standard deviation either side of the mean covers about 68% of the observations.
+/- 2 standard deviations covers about 95% of the observations.
+/- 3 standard deviations covers about 99.7% of the observations.

Usually +/- 2 standard deviations is a sufficient variance to understand the distribution.

But the spread of the data is important. The standard deviation is a statistic that tells you how tightly all the various examples are clustered around the mean in a set of data. When the examples are tightly bunched together and the bell-shaped curve is steep, the standard deviation is small. When the examples are spread apart and the bell curve is relatively flat, that tells you you have a relatively large standard deviation.
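
If it helps to see the 68/95/99.7 figures come out of actual numbers, here is a quick sketch using simulated data (not fire-service results):

```python
# Simulated data only: checking the 68 / 95 / 99.7 coverage of a normal
# distribution against the standard deviations described above.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=65.0, scale=4.0, size=100_000)  # mean 65 s, SD 4 s

mean, sd = data.mean(), data.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(data - mean) <= k * sd)
    print(f"within +/- {k} SD: {within:.1%}")  # roughly 68.3%, 95.4%, 99.7%
```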

We always used 201 observations to ensure a statistically significant dataset for accuracy at the 95% level, i.e. +/- 2 standard deviations, or 1001 observations for accuracy at the 99.7% level, i.e. +/- 3 standard deviations. Why the odd 1? Because if the stats say you need 200 observations, the +1 guarantees you have achieved that level of significance.
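
For comparison only, the standard textbook sample-size calculation for estimating a mean to within a chosen margin of error, which is not necessarily the regulatory rule of thumb behind the 201/1001 figures above, looks like this (the SD guess and margin are invented):

```python
# Textbook margin-of-error calculation for estimating a mean; the SD guess
# and the target margin are invented, and this is not necessarily the same
# rule of thumb as the 201/1001 figures above.
import math

def sample_size(sd_guess, margin, z=1.96):  # z = 1.96 corresponds to 95%
    return math.ceil((z * sd_guess / margin) ** 2)

print(sample_size(sd_guess=4.0, margin=0.5))          # n for +/- 0.5 s at 95%
print(sample_size(sd_guess=4.0, margin=0.5, z=3.0))   # roughly the 99.7% level
```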

If the data exhibit a skew, you might have to analyse them in log space: a distribution with a long positive tail will often look like a normal distribution once you take the natural logs of the original data. Most compliance data, exhaust emissions for example, exhibit such a skew: most results sit at the lower end, with a few stretched out into a long positive tail.
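
A small sketch of the log-space idea, using simulated skewed numbers rather than real emissions data:

```python
# Simulated skewed data: a long positive tail looks much more symmetric
# (closer to normal) after taking natural logs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=3.0, sigma=0.8, size=10_000)

print(f"skewness of raw data: {stats.skew(skewed):.2f}")          # strongly positive
print(f"skewness of log data: {stats.skew(np.log(skewed)):.2f}")  # close to zero
```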

When I was working in my sector, for regulatory compliance requirements, regulators would accept 95% accuracy, but if you wanted to prove something in a courtroom, you needed 99.7%.

Can't for the life of me think of a reference for this though.

HTH

Bob

scottmitchell74

Bob and Nemo:

Thanks! I knew this was the place to come. Geez...such a broad range of talents at the NGF!

:NGaugersRule: :thankyousign:
Spend as little as possible on what you need so you can spend as much as possible on what you want.

njee20

Perhaps I'm misunderstanding, but Bob and NeMo seem to be answering a slightly different question, about the validity of the data? You're not asking 'what proportion of my data should I consider statistically significant?', you're simply asking 'how much data do I need?'

So if you have one 20-29 y/o male you can't use his result to assess every other 20-29 y/o male.

It will depend on a number of factors, including the clustering of the data. If you have 100 tests and they vary from 59.8 seconds to 61.2 seconds, that's a decent data set. If, however, your 100 points vary from 42 seconds to half an hour with no discernible outliers, you need a bigger data set.
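
To put rough numbers on that, here is a sketch with invented times showing how precisely 100 results pin down the mean depending on their spread, using the usual 1.96 x SD / sqrt(n) approximation:

```python
# Invented times: the same sample size (100) pins down the mean very tightly
# when the results are bunched, and very loosely when they are spread out.
import math
from statistics import mean, stdev

def margin_95(times):
    # approximate 95% margin of error on the mean: 1.96 * SD / sqrt(n)
    return 1.96 * stdev(times) / math.sqrt(len(times))

tight  = [59.8 + 0.014 * i for i in range(100)]   # roughly 59.8 to 61.2 s
spread = [42.0 + 17.6 * i for i in range(100)]    # roughly 42 s up to ~30 min

print(f"tight  data: mean {mean(tight):.1f} s, margin +/- {margin_95(tight):.2f} s")
print(f"spread data: mean {mean(spread):.0f} s, margin +/- {margin_95(spread):.0f} s")
```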

Basically... it depends entirely on your specific use case. There is no definitive point where x is a 'complete data set' whilst x-1 isn't enough.

From what you've said though I think you have an adequate sample set, as long as you don't want to break it down to people called Frank born in August 1986.

Jon898

If I were doing this I think I'd test for and exclude the outliers and then report the median value (rather than the mean).

There are all sorts of fancy programmes/apps/spreadsheets to do that for you, but a quick and dirty way is to exclude any results more than 1.5 times the interquartile range above the third quartile or below the first quartile, and take the median of what remains. That way, if a station is unknowingly hiding the next Usain Bolt, it won't mess up the results, and even a small sample size could give you a valid (or at least usable) answer.
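
A rough Python version of that quick-and-dirty rule, with invented times, might look like this:

```python
# Rough version of the 1.5 x IQR rule: drop anything far outside the
# first/third quartile, then report the median. Times are invented, with
# one Usain Bolt style fast outlier at 31 s.
from statistics import median, quantiles

times = [31.0, 58.2, 59.1, 60.4, 61.0, 61.3, 62.5, 63.0, 64.2]

q1, q2, q3 = quantiles(times, n=4)  # quartiles
iqr = q3 - q1
keep = [t for t in times if q1 - 1.5 * iqr <= t <= q3 + 1.5 * iqr]

print("kept  :", keep)
print("median:", median(keep))
```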

Now my head hurts and we're out of scotch  :(

Jon

The Q

As I sit here waiting to test my next piece of electronic equipment, getting the spreadsheets ready, I can agree with BobG, as his description for the most part ties in with our systems of measurement.
Sadly, my next whisky won't occur for about 8 hours...

Bealman

Tough luck guys.... I'm sitting in Country Club Tasmania with one right now.  :beers:

Interesting thread.... I had to do all that stuff a long time ago - like NeMo, geology degree and ended up a physics teacher.

From what I remember of it, the responses seem fine!  :uneasy:
Vision over visibility. Bono, U2.

NeMo

It sounds to me like you aren't so much interested in looking for patterns as in establishing how "fit" the average fireman is.

I think that what you want here, as @Bob G implies, is to use standard deviations, alongside the correct choice of average -- i.e., mean, mode, or median.

Standard deviation describes the spread of data. In other words, if the average time for all firefighters over a certain distance was 10.2 seconds, and most of them (let's say 95%) ran it in between 9.8 and 10.6 seconds, then you would have a narrow spread of data. There's little variation in fitness there, with only 5% of the firemen either slower or faster than this range. This would return a small standard deviation value.

But if the average was still 10.2 seconds and the range was much bigger, let's say from 8.2 to 12.2 seconds, then the standard deviation would be a bigger value. The spread of the data is greater, and any one fireman picked at random might be much slower or much faster than the average suggests.

In terms of sampling, assuming a bell-shaped distribution, you would simply time one fireman after another, with time to run the distance on the x-axis and number of firemen running a given time on the y-axis. Initially you'd have only a few results and the bell shape wouldn't be apparent, but as you add more and more results, eventually the bell will become obvious. That's when your sampling is sufficient. The image below suggests what this might look like, initially with too few data points (only two firemen), then with many more and a bell shape starting to form:

[Image: two sketch histograms, the first with only two data points, the second with many more and a bell shape emerging]

I don't think, a priori, you can guess the perfect sample size, though I dare say there is some maths out there to determine one. But if you sampled 10%, say 18-20 firemen picked completely at random, and had them run the distance, I think that would be a justifiable start. If the pattern still isn't clear, then another 10%, so the sample size is now 20%.
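
As a sketch of watching the bell emerge (simulated times only, and a crude text histogram rather than a proper chart):

```python
# Simulated run times only: a crude text histogram with 10 samples and then
# with 200, to show the bell shape emerging as results accumulate.
import numpy as np

rng = np.random.default_rng(2)

def text_hist(times, bin_width=1.0):
    bins = np.arange(np.floor(times.min()), times.max() + bin_width, bin_width)
    counts, edges = np.histogram(times, bins=bins)
    for count, edge in zip(counts, edges):
        print(f"{edge:5.0f} s | {'#' * int(count)}")

for n in (10, 200):
    print(f"\n--- {n} samples ---")
    text_hist(rng.normal(loc=65.0, scale=4.0, size=n))
```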

Your choice of mean, mode or median is important. In this situation, I think the mode is the one to quote for simplicity: MOST (i.e., the modal value) firemen can run a certain distance in X seconds. This is the average politicians tend to avoid because it's the most informative in some ways: saying MOST people earn an income of X thousand is usually less flattering to them than saying the average (mean) income is Y, which hides the fact that the range might well go from a very low salary to a very high one!
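
To see how the three averages can pull apart, here is a small sketch with invented, salary-style skewed numbers:

```python
# Invented, salary-style skewed numbers (in thousands) showing how mean,
# median and mode can tell quite different stories.
from statistics import mean, median, multimode

incomes = [18, 20, 22, 22, 22, 24, 25, 28, 30, 95, 250]

print("mean  :", round(mean(incomes), 1))  # pulled up by the few high earners
print("median:", median(incomes))          # middle value when ranked
print("mode  :", multimode(incomes))       # most common value(s)
```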

Alternatively, the mean value could be used, but it would have to be quoted alongside the standard deviation. In other words, the average fireman runs the distance in X seconds, with the standard deviation either small or big depending on how spread out 95% of them are either side of the mean.

The median is pretty irrelevant here, I think, because all it would tell you is the time of the middle-ranked fireman once you line them up from fastest to slowest.

Cheers, NeMo
(Former NGS Journal Editor)

Bealman

Quote from: Bealman on August 15, 2017, 08:00:10 AM
From what I remember of it, the responses seem fine!  :uneasy:

Apparently not!  :-[

As I said, it was a while back  :-\
Vision over visibility. Bono, U2.

msr

The OP appears to be wanting to assess the fitness of firemen, which raises the problem of bias in the data set. This is because personnel selection will have been applied in the first instance as part of the recruitment process and the sample will therefore not reflect the overall population; instead it will favour those with a particular physique and inherent fitness. The sample data set will therefore not be "random" and will have "bias", so a truncated distribution is likely. This means that the statistical distribution is unlikely to be Gaussian and the application of standard deviations may be inappropriate. There is a huge literature concerning this. See, for example, Freedman's Statistical Models and Causal Inference, or the classic text by Yule & Kendall, An Introduction to the Theory of Statistics.
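
As a small illustration of the truncation point (simulated data only, with a normality test standing in for a fuller analysis):

```python
# Simulated data only: truncate a normal "population" at a selection cut-off
# and a normality test starts to object, illustrating the bias/truncation
# issue described above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
population = rng.normal(loc=70.0, scale=6.0, size=2_000)  # run times, seconds
recruited = population[population < 70.0]                 # only the faster half

for name, sample in (("full population", population), ("recruited only ", recruited)):
    _, p_value = stats.normaltest(sample)
    print(f"{name}: normality test p = {p_value:.3g}")
```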

NeMo

Quote from: msr on August 15, 2017, 08:18:53 AM
The OP appears to be wanting to assess the fitness of firemen, which raises the problem of bias in the data set. This is because personnel selection will have been applied in the first instance as part of the recruitment process and the sample will therefore not reflect the overall population; instead it will favour those with a particular physique and inherent fitness. The sample data set will therefore not be "random" and will have "bias", so a truncated distribution is likely. This means that the statistical distribution is unlikely to be Gaussian and the application of standard deviations may be inappropriate. There is a huge literature concerning this. See, for example, Freedman's Statistical Models and Causal Inference, or the classic text by Yule & Kendall, An Introduction to the Theory of Statistics.

I don't disagree if you were comparing the firemen against the general population. You might well expect the average fireman to be fitter than the average person. But if you sample, at random, 10 firemen from 100 firemen, you should still get something like a normal (bell-shaped) distribution, because you're now sampling from the population you actually care about, so that initial bias (firemen are fitter than average) no longer comes into it, surely?

Cheers, NeMo
(Former NGS Journal Editor)
