All experiments must start with a clear, significant, feasible and ethical research question (RQ). The RQ should include
a hypothesis of what you think the outcome of the experiment will be, although in exploratory studies it might not be necessary.
Usually, the hypothesis that you support (your prediction) is the alternative hypothesis (HA), and the hypothesis that
describes the remaining possible outcomes is the null hypothesis (H0).
Types of Variables
There are different ways of classifying variables. One classification typically used in biostatistics and in many statistical software packages (e.g., SPSS) is the following:
Nominal, Ordinal and Quantitative.
Nominal variables are qualitative variables without order (only categorization is possible). For instance, 'Color' is a nominal
variable that can have several categories (e.g., red, green, blue, etc.). Another nominal variable is 'Gender' (with two or more categories: male, female, others).
Ordinal variables are qualitative variables with some sort of order. For instance, 'Satisfaction' can be an ordinal variable with the following five categories:
'Bad', 'Not so bad', 'Normal', 'Good', 'Very Good'. As you can see, there is some order implied: 'Good' is better than 'Normal', which in turn is better than 'Not so bad'.
Quantitative variables can be represented by any real number. Some statistical software packages such as SPSS distinguish between Interval and Ratio variables; both are
quantitative. Interval variables have a fixed distance between values but not a fixed origin. For example, Temperature in Celsius degrees.
Ratio variables do have a fixed origin. For instance, Time is a Ratio variable because it has an absolute origin (time = 0 seconds).
How do we describe a data set?
Every time we have to report the results of a clinical study, before we describe the outcomes of the inferential statistical tests, we should describe our data set.
The way we do it will depend on the type of variables that we have. Generally speaking, we should always report at least one metric of central tendency and one metric of dispersion.
For instance, if our variable is Quantitative and our distribution is not skewed, we will use the Mean as a metric for central tendency and the Standard Deviation as a metric for dispersion.
However, if our distribution is skewed, the right metrics will be Median for central tendency and Interquartile Range for dispersion.
If our variable is Ordinal, the only central tendency metric that we have to use is the median, and the only dispersion metric we should use is the Interquartile Range.
Finally, if our variable is Nominal, we should describe the dataset by means of the mode (i.e., the most frequent category).
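These choices can be sketched with Python's standard `statistics` module (the data below are made up for illustration):

```python
# Sketch of describing a dataset by variable type; illustrative data only.
import statistics

# Quantitative, roughly symmetric -> Mean and (sample) Standard Deviation
weights = [70.2, 68.5, 72.1, 69.8, 71.0]
mean_w = statistics.mean(weights)
sd_w = statistics.stdev(weights)          # sample SD (with Bessel's correction)

# Ordinal -> Median and Interquartile Range
satisfaction = [1, 2, 2, 3, 3, 3, 4, 5]   # coded 1='Bad' ... 5='Very Good'
median_s = statistics.median(satisfaction)
q = statistics.quantiles(satisfaction, n=4)
iqr_s = q[2] - q[0]                       # IQR = Q3 - Q1

# Nominal -> mode (most frequent category)
colors = ["red", "green", "red", "blue", "red"]
mode_c = statistics.mode(colors)

print(mean_w, sd_w, median_s, iqr_s, mode_c)
```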
What is a skewed distribution? and a normal distribution?
Skewness refers to a distortion or asymmetry that deviates from the symmetrical bell curve, or normal distribution, in a set of data. If the curve is shifted to the left or to
the right, it is said to be skewed. Therefore, if our dataset is normally distributed it is symmetrical too.
There is a rule of thumb: if |skewness statistic| < 2·|standard error of skewness|, our dataset can be considered symmetrical. Usually, both values can be easily obtained using statistical software packages such as SPSS.
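The rule of thumb can be sketched in plain Python (illustrative data; the standard-error formula used is the one SPSS reports for skewness):

```python
# Symmetry check via the |skewness| < 2*SE rule of thumb; made-up data.
import math

data = [2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 2.4, 2.7, 2.2, 2.5]
n = len(data)
mean = sum(data) / n

# Moment-based skewness statistic
m2 = sum((x - mean) ** 2 for x in data) / n
m3 = sum((x - mean) ** 3 for x in data) / n
g1 = m3 / m2 ** 1.5

# Standard error of skewness (formula reported by SPSS)
se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

symmetric = abs(g1) < 2 * se_skew
print(g1, se_skew, symmetric)
```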
Sample Standard Deviation or Population Standard Deviation?
The Population Standard Deviation is a parameter, which is a fixed value calculated from every individual in the population. In Excel, the formula is =STDEV.P.
A Sample Standard Deviation is a statistic. This means that it is calculated from only some of the individuals in a population. In Excel, the formula is =STDEV.S.
We should use the Sample Standard Deviation especially when our sample size is smaller than 75. The two types of Standard Deviation are computed differently: the Sample Standard Deviation includes what is called 'Bessel's Correction', which is the 'n-1' shown in the figure below.
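The difference between the two formulas can be sketched directly in Python (illustrative data):

```python
# Population vs. sample Standard Deviation; made-up data for illustration.
import math

data = [4.0, 7.0, 6.0, 5.0, 8.0]
n = len(data)
mean = sum(data) / n
ss = sum((x - mean) ** 2 for x in data)   # sum of squared deviations

pop_sd = math.sqrt(ss / n)          # =STDEV.P in Excel: divide by n
sample_sd = math.sqrt(ss / (n - 1)) # =STDEV.S: Bessel's correction, divide by n-1

print(pop_sd, sample_sd)            # sample SD is always slightly larger
```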
How can we compute the Coefficient of Variation?
The Coefficient of Variation measures the variability independently of the unit of measurement. This is especially helpful when we want to compare two different magnitudes (e.g., the variability of time vs the variability of weight).
The figure below shows its computation.
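A minimal sketch of the computation (CV = SD / mean, often expressed as a percentage), with made-up data for two different magnitudes:

```python
# Coefficient of Variation: compares variability across different units.
import statistics

times = [10.0, 12.0, 14.0, 11.0, 13.0]    # seconds
weights = [65.0, 70.0, 75.0, 68.0, 72.0]  # kilograms

def cv(data):
    # CV = sample SD divided by the mean, as a percentage
    return 100 * statistics.stdev(data) / statistics.mean(data)

# Times vary more, relative to their mean, than weights do
print(cv(times), cv(weights))
```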
What is the Confidence Interval and how to compute it?
The Confidence Interval informs us about how close the estimate (e.g., a mean value) is likely to be to the true population value. It is a range of values that we expect to contain the true population value with a certain confidence level. Most often the confidence level is set to 95%, but it could also be 99% or 90%.
Let's assume we found in our experiment a 95% Confidence Interval of [23, 31]. If we repeated the experiment 100 times with similar samples, we would expect the intervals computed in about 95 of those repetitions to contain the true population value.
The figure below shows its computation.
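A sketch of the computation with made-up data, using the normal approximation (a t critical value would be more exact for small samples):

```python
# 95% Confidence Interval for the mean: mean +/- z * SE; illustrative data.
import math
import statistics

data = [23.0, 27.0, 31.0, 25.0, 29.0, 26.0, 28.0, 24.0]
n = len(data)
mean = statistics.mean(data)
se = statistics.stdev(data) / math.sqrt(n)   # standard error of the mean

z = statistics.NormalDist().inv_cdf(0.975)   # ~1.96 for a 95% level
ci = (mean - z * se, mean + z * se)
print(ci)
```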
What is the Standard Error and how to compute it?
The Standard Error is a measure of the 'precision' of the sample mean: it informs us about how closely the sample mean fluctuates around the true population mean.
It is not equivalent to the Standard Deviation; they are different concepts.
The figure below shows the Standard Error formula.
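The formula SE = s / sqrt(n) can be sketched in Python (illustrative data); note how the SE shrinks as the sample grows even when the spread stays the same:

```python
# Standard Error of the mean: SE = s / sqrt(n); made-up data.
import math
import statistics

data = [5.2, 4.8, 5.5, 5.0, 4.9, 5.3]
se = statistics.stdev(data) / math.sqrt(len(data))

# Four times as many observations, same spread -> roughly half the SE
big = data * 4
se_big = statistics.stdev(big) / math.sqrt(len(big))
print(se, se_big)
```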
What are the Type I and Type II errors in biostatistics?
A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is actually true in the population; a type II error (false-negative) occurs if the investigator fails to reject a null hypothesis that is actually false in the population.
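A small simulation (illustrative only, using a z-test with known sigma) shows that when the null hypothesis is true, about 5% of tests still come out 'significant' at alpha = 0.05; these are the Type I errors:

```python
# Simulating the Type I error rate: test a true H0 many times at alpha = 0.05.
import random
import statistics

random.seed(0)

ALPHA = 0.05
z_crit = statistics.NormalDist().inv_cdf(1 - ALPHA / 2)  # ~1.96

false_positives = 0
trials = 2000
n = 30
for _ in range(trials):
    # H0 is true: the sample really comes from N(0, 1)
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = statistics.mean(sample) / (1 / n ** 0.5)   # known sigma = 1
    if abs(z) > z_crit:
        false_positives += 1            # rejected a true H0: Type I error

rate = false_positives / trials
print(rate)   # close to 0.05 by construction
```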
Tips to know which statistical test we can use
Follow this recipe:
- STEP 1: Think of your research question. Are you looking for association, relationship and/or group prediction?
- STEP 2: Identify Independent variables (IV), Dependent variables (DV) and Covariates (CV).
- Dependent variable: what you really measure (e.g., weight, height, satisfaction,…).
- Independent variable: how you classify the measurement. This can be in a 'related (or paired)' or 'independent (unpaired)' way. These are typically grouping variables (e.g., time, treatment, …).
- Covariates: variables that we think are affecting (somehow) the outcome of the study.
- STEP 3: Identify data type of each variable (Nominal, Ordinal, Quantitative). Identify number of levels of nominal and ordinal variables.
- STEP 4: Take a look at the magic table.
Examples of Study Designs and Inferential Statistics
Some comments about simple linear correlation
Use Pearson Correlation only for quantitative normally-distributed variables. r is the correlation coefficient and r² is the determination coefficient.
Use Spearman Correlation for ordinal or quantitative variables. ρ is the correlation coefficient and ρ² is the determination coefficient.
Note that the Determination coefficient is the amount of variance of Y that can be explained by the variance of X.
When we analyze the coefficient of determination (r², ρ²) we speak of 'degree of relationship' or just
'correlation'; do not use the term 'agreement' unless we also analyze the regression coefficients. In the figure below we can see that both plots show the same correlation, but the agreement between A and B is
different in the left and right plots.
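As an illustration of the difference between the two coefficients (Spearman's ρ is just Pearson's r computed on the ranks), here is a hand-rolled sketch with made-up data; a monotonic but non-linear relationship gives ρ = 1 while r < 1:

```python
# Pearson's r vs Spearman's rho, computed from scratch; illustrative data.
import statistics

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman(x, y):
    # Spearman's rho = Pearson's r on the ranks (no ties in this example)
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0]   # y = x**2: monotonic but not linear
r = pearson(x, y)
rho = spearman(x, y)
print(r, rho, r ** 2)             # r**2 is the determination coefficient
```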
Now imagine we performed an ANOVA test (or equivalent non-parametric test) with one factor of 3 levels (groups) and obtained p<0.05...
What pairwise comparison is significant? Dataset 1 vs Dataset 2? Dataset 1 vs Dataset 3?
Use Bonferroni with Repeated measures ANOVA and the Friedman test, and use Tukey's with One-way ANOVA and the Kruskal-Wallis test. Bonferroni is the most conservative option.
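The Bonferroni idea itself is simple: divide the significance level by the number of pairwise comparisons. A minimal sketch with made-up p-values (placeholders, not real test output) for the three pairwise comparisons:

```python
# Bonferroni-corrected pairwise comparisons after a significant omnibus test.
# The p-values below are invented placeholders for illustration only.
ALPHA = 0.05
pairs = {
    ("Dataset 1", "Dataset 2"): 0.012,
    ("Dataset 1", "Dataset 3"): 0.030,
    ("Dataset 2", "Dataset 3"): 0.400,
}

k = len(pairs)            # 3 pairwise comparisons among 3 groups
alpha_adj = ALPHA / k     # Bonferroni: 0.05 / 3 ~ 0.0167

significant = {pair: p < alpha_adj for pair, p in pairs.items()}
print(alpha_adj, significant)
```

Note that 0.030 would be 'significant' at the raw alpha = 0.05 but not after the Bonferroni correction, which is exactly why the correction is called conservative.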