2  The family and its role in GLMM

The family argument is usually supplied by calling a function. For instance, for the classic linear model the default family is provided as:

family = gaussian(link = "identity")

that is, the function gaussian with its argument link set to “identity”.

Other families are available; for example, an alternative could be

family = poisson(link = "log")
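Both gaussian() and poisson() are ordinary R functions; calling them returns a family object that bundles the distribution name and the link. A quick inspection:

```r
# family objects can be stored and inspected like any other R object
fam <- gaussian(link = "identity")
fam$family   # "gaussian"
fam$link     # "identity"

fam2 <- poisson(link = "log")
fam2$family  # "poisson"
fam2$link    # "log"
```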

To understand why both a distribution and a link are needed, we have to go back to how linear models work.

For classic linear models, a given observation $y_i$ can be represented by its modelled mean $\mu_i$ plus a residual drawn from a normal distribution: $y_i = \mu_i + \mathrm{normal}(0, \sigma)$, or equivalently $y_i \sim \mathrm{normal}(\mu_i, \sigma)$. The $\mu_i$ is itself modelled by a general constant $\alpha$ and a series of coefficients, some of which may be multipliers ($\beta$) for an attribute $x$ of the sample on which $y_i$ is observed: $\mu_i = \alpha + \beta x_i$. For instance, the volume of a 15-year-old tree ($y_i = 3.62$) is represented by $3.62 = \mu_i + e_i = 3.34 + 0.28$.
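The decomposition of each observation into its modelled mean plus a residual can be checked directly in R. A minimal sketch with made-up tree data (ages and volumes are hypothetical, chosen to resemble the 15-year-old tree in the text):

```r
# hypothetical tree data: age in years, volume in m^3
age    <- c(5, 10, 15, 20)
volume <- c(1.10, 2.25, 3.62, 4.71)

mod <- lm(volume ~ age)   # mu_i = alpha + beta * age_i
mu  <- fitted(mod)        # the modelled means mu_i
e   <- residuals(mod)     # the residuals e_i

# every observation is exactly its modelled mean plus its residual
all.equal(volume, unname(mu + e))  # TRUE
```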

For generalized linear models, this base structure remains but is generalized to:

$y_i \sim \text{the family}(\mu_i, \text{distribution parameters})$

So whatever distribution is used for the likelihood of the residuals, with its particular parameters, the modelling is done on the link-transformed $\mu_i$, but still as a linear combination of coefficients and the attributes of the observed samples.

$\mathrm{link}(\mu_i) = \alpha + \beta x_i$

In the classic case the link function was just the identity function, so no transformation at all. In other words, in the classic model the actual $\mu$’s are modelled.
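The family object itself carries the link as a pair of functions, linkfun and its inverse linkinv, which makes the “transformation of the mean” explicit:

```r
fam <- poisson(link = "log")
fam$linkfun(3)        # log(3): the scale the linear predictor lives on
fam$linkinv(log(3))   # back on the mean scale: 3

# with the identity link, the mean is left untouched
gaussian(link = "identity")$linkfun(3)  # 3
```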

For example, rabbit litter counts, where the mothers are given one of two diets, could be modelled via a Poisson regression with log as link function. This would give:

$\text{littersize}_i \sim \mathrm{poisson}(\mu_i)$ (the Poisson distribution has only the mean as parameter, so there is no other to be defined here) and

$\log(\mu_i) = \alpha + \beta \cdot \text{(is on alternative diet)}$, where ‘is on alternative diet’ is either 0 or 1. Hence the logs of the mean litter sizes are modelled, and $\beta$ would be the average additional log(kits) that the alternative diet adds to the average log(kits) of litters on the reference diet. Or, with a bit of maths background (remember that $\log(a) - \log(b) = \log(a/b)$), you get that $\exp(\beta)$ is the multiplier to go from the mean litter size on the reference diet to the mean litter size on the alternative diet. The model provides a proportional increase rather than an added effect.
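A minimal sketch of such a Poisson regression, with made-up litter sizes (the data and the diet effect are hypothetical):

```r
# hypothetical litter sizes; diet is 0 (reference) or 1 (alternative)
litters <- c(5, 6, 4, 7, 8, 9, 7, 10)
diet    <- c(0, 0, 0, 0, 1, 1, 1, 1)

mod <- glm(litters ~ diet, family = poisson(link = "log"))
coef(mod)                # alpha and beta, both on the log scale

# exp(beta) is the multiplier between the two diets ...
exp(coef(mod)["diet"])
# ... and with a single 0/1 predictor it reproduces the ratio of raw means
mean(litters[diet == 1]) / mean(litters[diet == 0])
```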

This log link function is handy because it prevents the model from straying into the negative domain (there are no negative litter sizes), but at the same time it complicates the interpretation. That complication can be easy to handle (back-transformation by exponentiation in this case), but in more elaborate models interpretation of the model outcome will become an issue.

Another common link function is the logit function. It is an interesting transformation because its inverse keeps predictions within the range between 0 and 1. The logit is defined as:

$\mathrm{logit}(p) = \ln\left(\dfrac{p}{1-p}\right)$

To show what its role is, suppose that the trees we mentioned earlier can get infected by a fungus. The older the tree gets, the higher the chance of infection. We collect 200 leaves from each tree, count the number of infected ones and calculate the proportion of infected leaves.

In the figure below on the left, the original observations are plotted. They are confined between 0 and 1: no fewer than 0 leaves can be infected, and if all leaves are infected the proportion becomes 1 and not higher. With a bit of good will, we could just fit a linear regression through the dots. However, that would mean that a tree of 20 years old has a probability of >1 of being infected. For a 2-year-old tree, that probability would be negative. With a more appropriate generalized linear model we avoid these problems. For the modelling, a shift to the logit scale allows us to model the impact of age with a simple linear model (middle graph). Yet, to interpret the outcome on an understandable scale and to make sure that the proportion does not increase above 1 when trees become 20 or 25 years old, the inverse of the logit is applied, leading to the s-curve on the right.

(Figure: (a) original observations; (b) linear model on the logit scale; (c) inverse-logit s-curve)
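The infection example can be sketched as a binomial GLM with a logit link; in R the response can be given as a two-column matrix of successes and failures. The counts below are simulated for illustration:

```r
# simulated counts of infected leaves out of 200 per tree
age      <- c(2, 5, 8, 11, 14, 17)
infected <- c(4, 20, 60, 120, 170, 190)

mod <- glm(cbind(infected, 200 - infected) ~ age,
           family = binomial(link = "logit"))

# predictions on the response scale stay strictly between 0 and 1,
# even for very young or very old trees
predict(mod, newdata = data.frame(age = c(1, 25)), type = "response")
```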

You could have derived the inverse logit yourself, but for convenience:

$\mathrm{inverse\_logit}(\mu) = \dfrac{\exp(\mu)}{1 + \exp(\mu)}$
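Both functions are easy to write down yourself, and base R also ships them as qlogis() (the logit) and plogis() (the inverse logit):

```r
logit     <- function(p)  log(p / (1 - p))
inv_logit <- function(mu) exp(mu) / (1 + exp(mu))

logit(0.75)             # same value as qlogis(0.75)
inv_logit(logit(0.75))  # round trip: back to 0.75
```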

2.1 Why make it so complicated?

This is all fine, but what is the difference with applying the transformation directly to the data and proceeding with a general linear (mixed) model? Well, in that approach we would hope that the variables become normally distributed. That could be an approximately valid assumption in some cases, but not in many others; e.g. a discrete distribution will not become continuous by transforming it. In a GLMM it is the means, or coefficients, that are transformed by the link function, not the original observations. That makes a difference. The mean of logs of something is not equal to the log of the mean.

log(mean(c(1,4,6)))
[1] 1.299283
mean(c(log(1),log(4),log(6)))
[1] 1.059351

And that shows in a comparison between two very simple models on these data.

dd <- c(1,4,6)
mean(dd)
[1] 3.666667

The direct mean gives 3.67

The respective alternative approaches:

# lm on log transformed data
mod.lm <- lm(log(dd)~1)
coefficients(mod.lm)
(Intercept) 
   1.059351 
#backtransformed:
exp(coefficients(mod.lm))
(Intercept) 
   2.884499 
# glm with log link
mod.glm <- glm(dd~1, family = poisson(link="log"))
coefficients(mod.glm)
(Intercept) 
   1.299283 
# what gives backtransformed:
exp(coefficients(mod.glm))
(Intercept) 
   3.666667 

Hence, the glm conserved the mean on the raw values.

Sometimes it does make sense to transform the data, namely when the transformed scale has a clear meaning and is directly interpretable. For instance, the square root of an area or the cube root of a volume gives a length. The negative log of the molarity of H+ ions is called a pH.
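The pH example is a one-liner in R:

```r
# pH is the negative base-10 log of the H+ molarity
h_molarity <- 1e-7    # pure water, in mol/L
-log10(h_molarity)    # pH of 7
```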

If you haven’t grasped everything in the previous paragraph, no worries. The important part to remember is that we will have to provide a family with its likelihood and a link function, and that the fact that we work on the transformed scale of that link function will require us to back-transform, which can be cumbersome. A problem we will solve later.