# ANOVA analysis and normal distribution of data - (Apr/13/2009)

I have a data set on which I want to run an ANOVA. Briefly, I have a putative mating gene and two lines of animals, WT and knock-out. The purpose is to find the effect of the knock-out in males or females. I was told the data need to be normally distributed and have similar variances before ANOVA applies. Unfortunately, after looking at the data, they do not seem to be normally distributed.

I read some published papers dealing with similar data. They did not even check for normality before running the ANOVA.

Now I am confused. Can ANOVA be done without checking whether the data are normally distributed? If the data are not normally distributed, what could go wrong with the conclusions? Besides transformation, which I tried without finding a good one (and I have not seen any paper apply one to this kind of data), are there any other statistics I could use?

Thanks for any help!

For the probability levels from a parametric test to be valid, the data should come from an (approximately) normal distribution. If not, you can use a nonparametric test such as the Mann-Whitney U test (nonparametric version of the two-group unpaired t test), the Wilcoxon signed-rank test (paired t test), or the Kruskal-Wallis test (nonparametric equivalent of a one-way ANOVA).
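As a rough sketch of how the three tests named above map onto code, assuming Python with SciPy is available (the thread does not say which software is in use). All numbers are invented for illustration and are not the poster's data.

```python
from scipy import stats

# Hypothetical measurements, invented purely for illustration.
wt = [12.1, 9.8, 11.4, 10.2, 13.0, 10.9]
ko = [7.9, 8.4, 6.5, 9.1, 7.2, 8.8]

# Two independent groups: Mann-Whitney U (counterpart of the unpaired t test).
u_res = stats.mannwhitneyu(wt, ko, alternative="two-sided")

# Paired measurements on the same animals: Wilcoxon signed-rank test
# (counterpart of the paired t test).
before = [5.1, 6.3, 4.8, 7.0, 5.9, 6.6]
after = [5.9, 6.8, 5.0, 8.1, 6.2, 7.5]
w_res = stats.wilcoxon(before, after)

# Three or more independent groups: Kruskal-Wallis
# (counterpart of a one-way ANOVA).
third = [10.0, 9.5, 8.9, 11.2, 9.9, 10.4]
kw_res = stats.kruskal(wt, ko, third)

print("Mann-Whitney U p =", u_res.pvalue)
print("Wilcoxon signed-rank p =", w_res.pvalue)
print("Kruskal-Wallis p =", kw_res.pvalue)
```

Which of the three applies depends on the design (two groups or more, paired or unpaired), not on the data values themselves.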

Thank you very much for your advice. I still have some confusion.

In my data, there are actually four groups according to the female-male genotypes: WT-WT, WT-knockout, knockout-WT, knockout-knockout. I should check that each group is normally distributed, shouldn't I? What if one group has only a few data points (e.g. 4)? And what if there is an extreme value in the data? Should I discard it without a biological reason?

One more stupid question about ANOVA in publications: are the authors assumed to have checked normality and variances even though they don't say so in the paper?

pcrman on Apr 13 2009, 09:01 PM said:

Sorry for the delayed reply, you may have already found the answers you need.

makiyo on Apr 14 2009, 09:28 PM said:

Look for normality as a larger group first (i.e. express each data point as a deviation from its group mean, then combine all points from all groups). If that doesn't work, then treat each group individually.
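A minimal sketch of the pooled check described above, assuming Python with SciPy. The group values are hypothetical, and the choice of Shapiro-Wilk (for normality) and Levene (for the similar-variance requirement mentioned in the question) is an assumption on my part; the post does not name a specific test.

```python
from statistics import mean
from scipy import stats

# Hypothetical data for the four female-male genotype crosses
# (one deliberately small group, as in the question).
groups = {
    "WT-WT": [14.2, 12.8, 15.1, 13.5, 14.9, 13.0],
    "WT-KO": [11.0, 12.4, 10.1, 11.8, 12.9, 10.7],
    "KO-WT": [9.5, 10.8, 8.9, 10.2, 9.9, 11.1],
    "KO-KO": [7.2, 8.8, 6.9, 8.1],
}

# Express each point as a deviation from its own group mean, then pool.
residuals = [x - mean(values) for values in groups.values() for x in values]

# Shapiro-Wilk on the pooled residuals: a small p-value suggests
# the residuals are not normally distributed.
sw_stat, sw_p = stats.shapiro(residuals)

# Levene's test across the original groups: a small p-value suggests
# the group variances differ.
lev_stat, lev_p = stats.levene(*groups.values())

print("Shapiro-Wilk p =", sw_p)
print("Levene p =", lev_p)
```

Pooling the deviations gives one sample of 22 points here instead of four tiny ones, which is the reason for trying the combined check first.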

With the exception of experimental errors, I don’t think any data should be ‘discarded’ (which is not to say that analysis can’t be done with some points temporarily missing so long as it is acknowledged). I finally got this through to a PI of mine following a nasty incident when we had to repeat an assay for a commercial client and discovered that the data she had (unbeknownst to me) deleted as outliers turned out to be important.

Generally, if someone has gone to the trouble of checking the assumptions in their analysis (and found them to be valid), they will make a note of it in the paper; otherwise it is probably safest to assume they have not.

makiyo on Apr 14 2009, 04:28 AM said:

In my data, there are actually four groups according to the female-male genotypes: WT-WT, WT-knockout, knockout-WT, knockout-knockout. I should check that each group is normally distributed, shouldn't I? What if one group has only a few data points (e.g. 4)? And what if there is an extreme value in the data? Should I discard it without a biological reason?

Limited data means that you cannot assume it is normally distributed. Parametric tests such as ANOVA rely on normal distributions, and a common rule of thumb asks for about 30 samples before they can be trusted when normality is in doubt (a guideline, not a strict minimum). However, as DRT says, you can look for normality in your data as a whole. Papers that run an ANOVA without checking the normality of their data are doing it wrong, and the reviewers should have picked that up. The results of such an analysis are not necessarily erroneous, though: all these tests are approximations of the real situation, so the results may be right, but for the wrong reason.

To me it sounds like you need a Kruskal-Wallis test, possibly followed by a post-hoc test such as Tukey's post hoc if you want to distinguish which two groups are actually significantly different, rather than just saying that one of them is different without knowing which one.

bob1 on Apr 29 2009, 03:05 AM said:

To me it sounds like you need a Kruskal-Wallis test, possibly followed by a post-hoc test such as Tukey's post hoc if you want to distinguish which two groups are actually significantly different, rather than just saying that one of them is different without knowing which one.

You should also look at what type of data you have: nominal, ordinal or interval. ANOVA is suitable for interval variables.

And if you choose a non-parametric test such as the Kruskal-Wallis test, you should also use a non-parametric post-hoc test. Examples are the Nemenyi test (similar to Tukey, very conservative) and the Steel test.
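To make the "non-parametric test, then non-parametric post hoc" sequence concrete, here is a hedged sketch assuming Python with SciPy. Note that this is neither the Nemenyi nor the Steel test: SciPy does not provide those as far as I know, so pairwise Mann-Whitney U tests with a Bonferroni correction are swapped in as a simple stand-in. The data are hypothetical.

```python
from itertools import combinations
from scipy import stats

# Hypothetical data for the four female-male genotype crosses.
groups = {
    "WT-WT": [14.2, 12.8, 15.1, 13.5, 14.9, 13.0],
    "WT-KO": [11.0, 12.4, 10.1, 11.8, 12.9, 10.7],
    "KO-WT": [9.5, 10.8, 8.9, 10.2, 9.9, 11.1],
    "KO-KO": [7.2, 8.8, 6.9, 8.1, 7.7, 6.5],
}

# Omnibus test: does at least one group differ from the others?
h_stat, kw_p = stats.kruskal(*groups.values())
print("Kruskal-Wallis H =", h_stat, "p =", kw_p)

# Post hoc: every pair of groups, Mann-Whitney U, Bonferroni-adjusted.
# (Stand-in for Nemenyi/Steel; multiply each p by the number of pairs.)
pairs = list(combinations(groups, 2))
adjusted = {}
for a, b in pairs:
    p = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided").pvalue
    adjusted[(a, b)] = min(1.0, p * len(pairs))

for (a, b), p_adj in adjusted.items():
    print(a, "vs", b, "adjusted p =", p_adj)
```

The pairwise comparisons are normally only interpreted when the omnibus Kruskal-Wallis result is itself significant.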