14.6 Case Study: The NHANES data
In Sects. 12.10 and 13.7, the NHANES data were introduced (Centers for Disease Control and Prevention (CDC) 1988–1994; Center for Disease Control and Prevention 1996; Pruim 2015), and graphs and numerical summaries were used to understand the data relevant to answering this RQ:
Among Americans, is the mean direct HDL cholesterol different for those who smoke now, and those who do not smoke now?
The variables can be summarised numerically: the response variable (HDL cholesterol), the explanatory variable (current smoking status), and potential extraneous and confounding variables. Different summaries are needed for quantitative variables (means and standard deviations; medians and IQRs) and qualitative variables (percentages; odds), as shown in Table 14.8.
Quantity | Statistic | Overall | Non-smokers | Smokers |
---|---|---|---|---|
Sample size | | 10000 | 1745 | 1466 |
Direct HDL (mmol/L) | n | 8474 | 1668 | 1388 |
 | Mean | 1.36 | 1.39 | 1.31 |
 | Std. dev. | 0.4 | 0.43 | 0.42 |
Gender | n | 10000 | 1745 | 1466 |
 | % Female | 50.2 | 43.8 | 43.5 |
Age (years) | n | 10000 | 1745 | 1466 |
 | Mean | 36.74 | 54.28 | 42.68 |
 | Std. dev. | 22.4 | 16.64 | 14.79 |
Height (cm) | n | 9647 | 1726 | 1459 |
 | Mean | 161.88 | 170.06 | 170.43 |
 | Std. dev. | 20.19 | 9.75 | 9.27 |
Weight (kg) | n | 9922 | 1727 | 1458 |
 | Mean | 70.98 | 84.5 | 80.54 |
 | Std. dev. | 29.13 | 20.73 | 19.72 |
BMI (kg/m²) | n | 9634 | 1726 | 1458 |
 | Mean | 26.66 | 29.09 | 27.7 |
 | Std. dev. | 7.38 | 6.19 | 6.42 |
Diabetes | n | 9858 | 1743 | 1466 |
 | % Yes | 7.7 | 15.3 | 7.2 |
Urine volume (mL) | n | 9013 | 1723 | 1447 |
 | Median | 94 | 97 | 102 |
 | IQR | 114 | 118.5 | 104 |
A number of interesting questions emerge from Table 14.8:
- How can the mean age of all respondents be 36.7 years, but the mean age for non-smokers and smokers both be much larger than this (54.3 and 42.7 years respectively)?
- Similarly, how can the percentage of females in the whole sample be 50.2%, but the percentage of females be less than this for both non-smokers and smokers (43.8% and 43.5% respectively)?
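One way to start exploring these questions is to notice that the two smoking groups do not add up to the whole sample: \(1745 + 1466 = 3211\) of the 10000 respondents. The sketch below back-calculates what the remaining respondents must look like for the overall figures in Table 14.8 to hold. It is only an illustration: the function name `implied_rest` is hypothetical, the inputs are the rounded values printed in the table (so the results are approximate), and the calculation relies on age and gender having no missing values in any column. It does not say *why* these respondents have no recorded current smoking status.

```python
# Summary values copied from Table 14.8 (rounded as printed there).
n_all, n_non, n_smoke = 10000, 1745, 1466

# Respondents who appear in neither the non-smoker nor the smoker column.
n_rest = n_all - n_non - n_smoke


def implied_rest(overall, non_smokers, smokers):
    """Back-calculate the value for respondents in neither smoking group.

    Valid for statistics that are weighted averages of the group values
    (a mean or a percentage), provided no values are missing.
    """
    total_rest = n_all * overall - n_non * non_smokers - n_smoke * smokers
    return total_rest / n_rest


rest_mean_age = implied_rest(36.74, 54.28, 42.68)   # mean age (years)
rest_pct_female = implied_rest(50.2, 43.8, 43.5)    # percentage female

print(f"Respondents in neither group: {n_rest}")
print(f"Implied mean age in that group: {rest_mean_age:.1f} years")
print(f"Implied percentage female in that group: {rest_pct_female:.1f}%")
```

Under these assumptions, the respondents in neither group number 6789, with an implied mean age of about 30.9 years and about 53.3% female. The overall mean and percentage are weighted averages across all three groups, so they can sit below the values for both smoking groups.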
Table 14.9 summarises the relationship between current smoking status and having a diabetes diagnosis. Again, questions emerge:
- For current non-smokers, the percentage of diabetics is \(15.32\)%.
- For current smokers, the percentage of diabetics is \(7.23\)%.
The percentage of diabetics in the sample is different for non-smokers and smokers. Why? Similarly,
- For current non-smokers, the odds of finding a diabetic are \(0.181\).
- For current smokers, the odds of finding a diabetic are \(0.078\).
The odds of finding a diabetic in the sample are different for non-smokers and smokers. Why?
As noted before (Sect. 14.4), two possible reasons could explain this difference in percentages and odds in the sample:
- Sampling variation: The percentages (and odds) are the same in the population, but differences in the sample occur because of the people who happened to end up in the sample. Sampling variation explains the difference in the sample percentages (and odds).
- A difference in the population: The percentages (and odds) are different for non-smokers and smokers in the population, and the difference in the sample percentages (and odds) simply reflects this difference between non-smokers and smokers in the population.
In the next chapters, tools for deciding which of these explanations is the more likely are discussed.
 | Doesn’t smoke now | Smokes now |
---|---|---|
Not diabetic | 1476 | 1360 |
Diabetic | 267 | 106 |
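The percentages and odds quoted in this section can be reproduced from the counts in Table 14.9. The following is a minimal sketch rather than the method used to prepare the table; the dictionary layout and variable names are illustrative assumptions. The percentage divides the number of diabetics by the group total, while the odds divide the number of diabetics by the number of non-diabetics.

```python
# Counts copied from Table 14.9.
counts = {
    "Doesn't smoke now": {"not_diabetic": 1476, "diabetic": 267},
    "Smokes now": {"not_diabetic": 1360, "diabetic": 106},
}

percentages = {}
odds = {}
for group, c in counts.items():
    group_total = c["diabetic"] + c["not_diabetic"]
    # Percentage: diabetics as a share of everyone in the group.
    percentages[group] = 100 * c["diabetic"] / group_total
    # Odds: diabetics relative to non-diabetics in the group.
    odds[group] = c["diabetic"] / c["not_diabetic"]

for group in counts:
    print(f"{group}: {percentages[group]:.2f}% diabetic, odds {odds[group]:.3f}")
# Doesn't smoke now: 15.32% diabetic, odds 0.181
# Smokes now: 7.23% diabetic, odds 0.078
```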