1  How-to in R

We will focus on applying these models: how to recognize and choose the right one, how to fit such models and how to interpret the output. We will use R for this.

In the companion text we already stated that developers of R packages for linear models strive for some uniformity in how models are defined. The call generally has this form:

model <- linearModelFunction(data = ...,
                             formula = y ~ 'fixed effects' + ('random effects'),
                             family = ..., 'extra model-specific arguments')

The y on the left-hand side of the formula is obviously the response, the part that needs to be modelled. Usually this is a vector, but it can take other forms. We will see examples later on.

The fixed effects and random effects are explained in the companion text and keep their role in the GLMM framework.
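As a minimal sketch of this template, with invented variable names and simulated data: base R's glm() follows the same data/formula/family pattern but covers fixed effects only, while packages such as lme4 add the (1 | group) random-effect syntax.

```r
# Simulated data: a numeric response with one fixed effect (base R only)
set.seed(42)
d <- data.frame(dose = rep(c(0, 1, 2), each = 10))
d$y <- 5 + 2 * d$dose + rnorm(nrow(d))

# Fixed effects only; glm() follows the general template shown above
m <- glm(formula = y ~ dose, family = gaussian, data = d)
coef(m)  # intercept and slope, close to the true values 5 and 2

# With random effects one would use e.g. lme4 (hypothetical 'plot' grouping):
# lme4::lmer(y ~ dose + (1 | plot), data = d)
```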

The part that is not explained in the companion text is the family = ... argument. It indicates which distribution is assumed for the residuals and which transformation (link) connects the linear predictor to the response.

Linear (mixed) models assume a normal distribution and use no transformation, so family is usually omitted from these commands.
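To see that the omitted family corresponds to the normal case, one can compare lm() with a glm() where the gaussian family and identity link are spelled out; the two fits estimate the same coefficients (a sketch with invented data):

```r
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 3 + 0.5 * d$x + rnorm(20)

fit_lm  <- lm(y ~ x, data = d)
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"), data = d)

# The two fits agree (up to numerical precision)
all.equal(coef(fit_lm), coef(fit_glm))
```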

Generalized (mixed) models distinguish themselves precisely by this distribution and this transformation (or link). Depending on the situation and the data, various distributions and links can be applied.

The fitting process for these non-normal distributions introduces several difficulties: from the choice of the most suitable distribution and link, to suitable algorithms to fit the model, to the interpretation of the outcome of such models. We will gently introduce these problems and their solutions, but we start with descriptions of common situations where the normal distribution is a bad choice for the residuals.

1.1 When do we have to look beyond the normal distribution?

The normal distribution is very common and useful. It stretches out in both directions (-infinity to +infinity), it is symmetric and has most of its mass around its mean. This makes it a suitable likelihood when dealing with observations on a continuous scale that can take any value. The normal distribution will often be an appropriate choice. However, in life sciences it is not uncommon to encounter observations that do not behave in this way.

The most common situation is where the observations cannot (physically) span the entire range of real numbers. It is not uncommon that observations cannot become negative (mass, height, number of offspring, blood pressure, the time it takes to reach a certain stage…). This causes no problem when the actual observations stay far from this physical barrier (the body weight of adult people on various diets will never be close to zero, even in the most extreme case), but often they do not (diameters of lesions on an organ with or without treatment).

Sometimes the observations are constrained between two barriers. Proportions are only possible between 0 and 1. The number of affected animals cannot be negative, but can also not exceed the number of animals you started with.
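For bounded counts like "affected animals out of n", the binomial family is the usual choice; in base R's glm() the response can be given as a two-column matrix of successes and failures. A sketch with invented numbers:

```r
# Invented example: affected animals out of groups of 12, by treatment
d <- data.frame(
  treatment = c("control", "control", "treated", "treated"),
  affected  = c(2, 3, 8, 9),
  total     = c(12, 12, 12, 12)
)

# cbind(successes, failures) respects both barriers: 0 and the group size
m <- glm(cbind(affected, total - affected) ~ treatment,
         family = binomial(link = "logit"), data = d)

fitted(m)  # fitted probabilities stay strictly between 0 and 1
```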

The symmetry of the normal distribution is also often an issue. Asymmetry occurs frequently when studying populations. The wood volume per tree has many small values and far fewer, but sometimes very high, values and will therefore be very skewed. The time an animal remains alive after infection with a virus can be very short in many cases, but a few survivors will live longer, sometimes even through the entire observation period.
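For strictly positive, right-skewed measurements such as wood volume, a Gamma family with a log link is one common option (a sketch with simulated data; other choices, such as log-transforming the response and using a normal model, can also be defensible):

```r
set.seed(7)
# Simulated right-skewed volumes: many small trees, a few large ones
d <- data.frame(site = rep(c("A", "B"), each = 50))
d$volume <- rgamma(100, shape = 2, rate = ifelse(d$site == "A", 2, 1))

m <- glm(volume ~ site, family = Gamma(link = "log"), data = d)
exp(coef(m))  # multiplicative effects on the mean volume
```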

Some data sets contain only integers. The number of stomata per square mm or the number of apples on a tree may not cause problems, as those integers are sufficiently high and can be modelled as continuous. Yet this is not the case for the number of strokes a patient has within 5 years, or the size of a rabbit litter.
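For such small integer counts, a Poisson family with a log link is the standard starting point (a sketch with simulated data; real counts are often overdispersed, in which case quasipoisson or a negative binomial model may fit better):

```r
set.seed(3)
# Simulated litter sizes for two hypothetical breeds
d <- data.frame(breed = rep(c("X", "Y"), each = 40))
d$litter <- rpois(80, lambda = ifelse(d$breed == "X", 4, 6))

m <- glm(litter ~ breed, family = poisson(link = "log"), data = d)
summary(m)$coefficients  # effects on the log scale of the expected count
```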

Even stranger distributions occur, yet are still common in life science studies. Observations made in classes ('very bad', 'bad', 'mediocre' or 'good') are one example. The length of emerging seedlings, where an important proportion of the seeds never emerged at all, gives data that are a mixture of zeros and strictly positive values that may have a mean of 8 cm but contain very few values between 0.1 and 3 cm.

The distribution of the data (or actually of the residuals) determines which GLMM needs to be applied. That choice is conveyed to the modelling software via the family argument. Let's have a look at some of these families and how they work.
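In R, each family is an object that bundles a distribution with a default link; inspecting these objects in base R shows which link each family uses by default:

```r
gaussian()$link  # "identity": the linear (mixed) model case
poisson()$link   # "log": counts
binomial()$link  # "logit": proportions and bounded counts
Gamma()$link     # "inverse" by default; a log link is often chosen instead
```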