<hr />
<h1 id="what-size-predictive-intervals-should-we-use">What size predictive intervals should we use?</h1>
<p>In some of <a href="https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssc.12484">my</a> <a href="https://www.sciencedirect.com/science/article/pii/S1877584520300356">papers</a> I’ve used 80% predictive intervals instead of the standard 95% predictive intervals.
I didn’t say why in the papers and I was thinking through it again the other day so I thought I’d write it down.
The focus here is on prediction intervals (the range of predicted values supported by the model) rather than confidence/credible intervals (the range of parameter values supported by the model).
Some of the arguments might transfer, but probably not all of them.</p>
<p>I think most people know that 95% is an arbitrary number plucked out of thin air by an unsavoury racist one hundred years ago.
So there’s no rule that says we must use 95%, but how then do we decide which value to use?
I’m fairly convinced that not all prediction intervals are created equal.
A 40% prediction interval (assuming it’s not presented alongside other intervals) is an odd metric that represents a high density interval that the true value probably <em>isn’t</em> in (<50% chance).</p>
<p>Given that, here are some things to consider when chosing an interval.
I’ll use the binary decision of 80% versus 95% just to illustrate things.
I will also ignore the fact that, of course, we often want to use the full distribution rather than summarise it as a single interval; there are many cases where communicating a full distribution is not feasible.
If you can use 2 or 3 intervals, or density plots of the full distribution, do that; if you must summarise a distribution as a single interval, perhaps consider these points.</p>
<h2 id="how-do-policy-makers-interpret-intervals">How do policy makers interpret intervals?</h2>
<p>This point is probably the most subjective, but possibly the most important.
Despite our best efforts, I think most people, even trained scientists, think of a 95% prediction interval as “the true value is almost certainly in this interval”.
Obviously, this isn’t true.
1 in 20 95% prediction intervals will not cover the true value.
Furthermore, we are talking about prediction intervals, and we will almost certainly be making hundreds of predictions.
Therefore, we must expect many of our prediction intervals not to cover their respective true values.</p>
<p>I think perhaps this problem doesn’t exist for an 80% interval.
I don’t think people look at an 80% prediction interval and think “the true value is almost certainly in this interval”.
It’s more like a “best guess” interval.
Perhaps this is more useful.
But to reiterate, this is a very subjective point and I have no hard evidence for it.
The counter argument is that perhaps people can’t switch between intervals easily, and that therefore we should somehow settle on a single interval; given its history, 95% would be the clear winner in this case.</p>
<h2 id="wide-and-confident-or-thin-and-unsure">Wide and confident or thin and unsure?</h2>
<p>This point is similar to that above but less subjective.
It is a question of what is really useful in a prediction interval.
Do we want a very wide interval that we’re 95% sure the true value is within, or do we want a smaller interval that we are 80% sure contains the true value?</p>
<p>Prediction intervals can quickly become massive and controlling this size can sometimes usefully guide our decisions.
For example, imagine you are diagnosed with a terminal disease and told that your 95% predictive survival interval is between 1 and 15 years (I’m thinking for myself as a 30 something. I guess the details will change depending on your age).
What can you actually do with this information?
On the lower end you have enough time to sort your affairs, visit some family and take a great last trip.
On the upper end you have enough time to do anything really; start a new career, see your kids grow up, die from a variety of other causes.</p>
<p>Imagine instead you were given an 80% interval of 3 to 6 years.
This “probably true” interval is quite useful.
You have some time; you don’t have to prioritise only 3 or 4 things to do before you die.
But starting a new career may well be a waste of time.
Of course, all the survival times that were in the 95% interval are still possible, but this “probably true” interval focusses on a reasonable range of very likely values.</p>
<p>Relatedly, prediction intervals typically get wider faster as you increase the probability interval (look at the shape of a normal distribution for intuition).
In a normal distribution, a 95% interval is more than 53% bigger than an 80% interval.
Is the extra 15% confidence that the true value is in the interval worth the 50% increase in width?
I’d say often it isn’t.</p>
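<p>As a quick check of that width claim, here is the calculation for a normal distribution (a minimal sketch in R):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Width of central prediction intervals for a normal distribution
width_95 <- 2 * qnorm(0.975)  # about 3.92 standard deviations
width_80 <- 2 * qnorm(0.900)  # about 2.56 standard deviations
width_95 / width_80           # about 1.53, i.e. just over 50% wider
</code></pre></div></div>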
<h2 id="confident-and-wrong-or-unsure-and-right">Confident and wrong or unsure and right?</h2>
<p>There are many reasons why we would expect an 80% prediction interval to have better calibration than a 95% prediction interval.
Model misspecification, and approximations or MCMC methods that break down in the tails, are two examples.
I would prefer a well calibrated 80% interval to a poorly calibrated 95% interval.</p>
<p>With respect to this point, we can consider a few related situations.
Did we just choose one interval from the outset and only check the calibration of that one interval?
If so, I think it is reasonable to consider which interval is likely to be better calibrated.
As long as we don’t claim that we have evidence that the model is well calibrated across all intervals, we have tested one aspect of the model and found it to be adequate.
Perhaps instead, we are testing the calibration of the model with respect to both 80% and 95% prediction intervals.
How is it reasonable to behave if we find the model is well calibrated for 80% intervals and badly calibrated for 95% intervals?
Again, I think it is totally fine to recommend that users of the model use the 80% interval.
This is similar to saying “linear regression works well as long as you don’t extrapolate far outside the range of the covariates”.
We are guiding the user as to when the model does and does not work and again this is totally fine.</p>
<h2 id="confident-but-badly-estimated-coverage-or-unsure-but-well-estimated-coverage">Confident but badly estimated coverage or unsure but well estimated coverage?</h2>
<p>My last point is that estimating the calibration of a model is easier when the interval is smaller.
In the same way as having few cases vs controls makes our effective sample size small, having large prediction intervals gives us fewer data points where the prediction intervals do not cover the true value.
For example, if our dataset contains 200 datapoints, a well calibrated model will have around 10 datapoints where the 95% prediction interval does not cover the true value.
In contrast, the 80% prediction interval would give us 40 failures, a much more reasonable sample size.
So if we are working with modest sample sizes, would we prefer a 95% prediction interval, where our estimates of coverage are very noisy, or an 80% prediction interval with much tighter estimates of coverage?
In the above example with n = 200, our 95% confidence interval of our coverage, if we observed exactly 5% of intervals to not cover their true values, would be 2.4% - 9%.
I would consider 2.4% or 9% to imply fairly poor calibration, so in this case we are really unsure whether our model is well calibrated or not.
In contrast, if we observed exactly 20% of intervals to not cover their true values, our confidence intervals for the coverage of the 80% interval would be 14% - 26%.
So we’re pretty sure the coverage for the 80% prediction interval is ok.</p>
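<p>These coverage estimates are just binomial confidence intervals on the number of misses; here is a small sketch in R using exact binomial intervals (one reasonable choice for this calculation):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Uncertainty in estimated coverage with n = 200 held-out points
binom.test(10, 200)$conf.int  # 95% interval missing 5% of the time: roughly 2.4% to 9%
binom.test(40, 200)$conf.int  # 80% interval missing 20% of the time: roughly 15% to 26%
</code></pre></div></div>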
<h2 id="conclusions">Conclusions</h2>
<p>So in conclusion, perhaps we should think about what intervals we use a bit more.
Different considerations will apply in different situations, and almost certainly there are different considerations for prediction intervals and confidence/credible intervals.
As in the intro, we should avoid summarising full distributions as a single interval when we can, but often we can’t.</p>
<hr />
<h1 id="how-people-trick-themselves-into-thinking-they-can-predict-the-stock-market">How people trick themselves into thinking they can predict the stock market</h1>
<p>I keep accumulating ideas of things to make into YouTube videos or long careful tutorials or whatever else.
And then I never do them.
So the new plan is to just do quick blog posts when I think about something.
If one day I come back around and cover the same material in more detail then good!
Otherwise at least it’s out there.</p>
<p>There is a whole genre of YouTube videos showing how to predict the stock market.
Lots of these videos use relatively complex models, namely LSTM neural networks.
Most of these videos use very simple data, namely the price of the same stock at previous times.
These videos typically conclude (and splash loudly on the thumbnail) that they can predict the stock market with appreciable accuracy.</p>
<p>A few simple arguments, that have been well made elsewhere, indicates that they are probably wrong.
If they can predict the stock market, they will be off making millions, not making YouTube videos.
If they can easily predict the stock market, then so can everyone else.
If everyone can predict the stock market, any predictable signal from the data will become priced in and the whole process will quickly fall apart.</p>
<p>So the question is, why does these people’s analyses say they can predict the stock market, when we can be pretty sure that they, in fact, can’t.
Like many things, the ‘why’ is more interesting than the fact itself.</p>
<p>I think there’s at least four reasons, some of which are quite subtle.
These reasons are interesting both directly for people interested in the stock market, but also interesting for anyone interested in forecasting or predictive modeling more generally.
For the benefit of readers with short attention spans, I’ll start with the most interesting, subtle reason, that I haven’t seen discussed at length before.
I’ll then circle back to the more obvious answers.</p>
<h2 id="data-leakage-from-stock-choice">Data leakage from stock choice</h2>
<p>If I asked you to tell me everything you know about Tesla (especially if you are interested enough in the stock market to be reading this post) you might answer something like “electric cars, Elon Musk, massive stock price growth”.
Notably, most of the videos trying to predict the stock price use one of Tesla, Apple or an index fund like SPY.
With all of these stocks, we have (probably unconsciously) used our knowledge of the present to select them.
These are stocks that have, on average, gone up.</p>
<p>This simple fact means the prices of these stocks suddenly are predictable to an extent.
A model that predicts a small, positive change in the stock price will do better than random.
However, stocks go up, until they stop going up.
If you took a random stock, or many random stocks, and tried to predict the price in the future, you wouldn’t have this small guaranteed predictive ability.
Similarly, if you took the model that predicts a small positive change in the stock price and applied it, long term, to Apple or Tesla, your predictive accuracy would depend entirely on whether these stocks keep going up or not.
Eventually they’ll come down; all companies eventually go bust.</p>
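<p>A toy simulation makes the selection effect concrete (this is simulated data, not real stocks, and the numbers are purely illustrative):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simulate many random-walk stocks, then keep only those that ended up higher,
# as we implicitly do when we pick Tesla or Apple with hindsight.
set.seed(2)
n_stocks <- 1000
n_days <- 500
returns <- matrix(rnorm(n_stocks * n_days, 0, 0.02), nrow = n_days)
went_up <- colSums(returns) > 0

# A model that always predicts "up" is right about 50% of the time overall...
mean(returns > 0)
# ...but slightly more often on the hindsight-selected stocks.
mean(returns[, went_up] > 0)
</code></pre></div></div>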
<p>This point is quite subtle and I haven’t seen people make it before.
But it applies beyond the stock market.
For example, if you predict something about species population size or epidemic size, but only use species or diseases that haven’t gone extinct, your models will perform better than they would in the real world.</p>
<h2 id="predicting-value-not-change">Predicting value, not change</h2>
<p>I think the actual biggest reason that people on youtube trick themselves into thinking they can predict the stock market is that they often try to predict stock price rather than the change in the stock price.
Stock prices change through time, but generally not hugely.
Other ways of saying this are that the stock price today depends a lot on the stock price yesterday, or that the stock price is an autoregressive process.</p>
<p>So for a stock that has changed price a lot over a long period, such as Tesla that is now worth much more than it was 10 years ago, a model that predicts tomorrows price as being the same as todays price, will have very high apparent “predictive ability”.
When the price went from <span>$</span>1 to <span>$</span>1.1, you predict <span>$</span>1.
When the price went from <span>$</span>100 to <span>$</span>110, you predict <span>$</span>100.
The correlation between your predictions (<span>$</span>1 and <span>$</span>100) and the truth (<span>$</span>1.1 and <span>$</span>110) is high.</p>
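<p>To see this numerically, here is a minimal sketch with a simulated random-walk price series (not real data): the “tomorrow equals today” model has near-perfect correlation on the price, but the change is completely unpredictable.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)
ret <- rnorm(1000, 0, 0.02)            # daily log returns of a pure random walk
price <- 100 * exp(cumsum(ret))        # the resulting price series

# Predict tomorrow's price as today's price: very high correlation...
cor(head(price, -1), tail(price, -1))  # very close to 1

# ...but the change, which is what you actually profit from, is unpredictable.
cor(head(ret, -1), tail(ret, -1))      # close to 0
</code></pre></div></div>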
<p>However, the problem with this is that any benefit in predicting the market gained by this property, is exactly cancelled out by the fact that to make money off a stock today, you have to have bought it yesterday.
The value of a stock doesn’t have any bearing on your profit, only the change in the value of the stock.</p>
<h2 id="using-graphs-as-metric">Using graphs as a metric</h2>
<p>A related problem is that many videos fit models, make predictions of the stock price then plot the predictions against the truth in a typical time-series line plot.
The lines on these plots often follow each other and look quite convincing.
However, this approach fails in the same way as the issue of predicting value, not change.
A model that predicts that the stock price tomorrow is the same as the stock price today will look pretty good on these plots.
But it is unfortunately utterly useless in terms of making money on the stock market.</p>
<h2 id="using-data-from-the-future">Using data from the future</h2>
<p>The final, least subtle point, is that it’s easy to accidentally use data from the future.
I think most of the youtube videos are actually quite careful on this point, but I thought I’d include it for completeness.</p>
<p>It is off course obvious that if you use tomorrows stock price as a predictor in your model, you will be able to predict tomorrow’s stock price!
However, there is a risk of accidentally using more subtle information from the future.
If you are using other stocks as predictors, you need to make sure you are using today’s stock price, not tomorrow’s.
Imagine you are predicting the change in price of Pepsi stocks, but using changes in Coca-cola stock prices as a predictor.
If the government announces a sugar tax, both stock prices will fall (I guess).
If you accidentally use tomorrow’s change in the Coca-cola stock price, you are accidentally telling your model that there will be a sugar tax announced tomorrow, but this is information you would not have in a real model.
Relatedly, in the process of calculating compound variables such as moving averages, open-close or high-low, you can accidentally use future information.
A model that buys when the price hits the weekly high is using information from the future as you don’t know what the high is until the end of the week, at which point you’ve missed the high you were hoping to buy at.</p>
<p>Conceptually this is mostly quite simple.
The problem is that it is easy to mess it up in your code if you are not careful.</p>
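<p>Here is a hypothetical sketch of the kind of care needed; the data frame and column names are made up for illustration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

# Made-up daily data, purely for illustration.
stocks <- data.frame(
  date         = as.Date("2022-01-01") + 0:4,
  pepsi_change = c( 0.1, -0.2, 0.0,  0.3, -0.1),
  coke_change  = c( 0.2, -0.1, 0.1, -0.2,  0.0)
)

aligned <- stocks %>%
  arrange(date) %>%
  mutate(
    pepsi_change_tomorrow = lead(pepsi_change),  # the target: tomorrow's change
    coke_change_today     = coke_change          # a predictor you know today
  )
# A leaky model would instead use lead(coke_change) as a predictor,
# i.e. tomorrow's Coca-cola move, which you cannot know today.
</code></pre></div></div>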
<h2 id="final-thoughts">Final thoughts</h2>
<p>So overall, it’s actually essentially impossible to predict the stock market, in a useful way, using stock prices, desktops and tens of minutes of effort.
You are always going to be slower than the high-frequency trading algorithms, and anything simple will have already been done.
It’s still fun to try, but it’s mostly a humbling experience that tests your ability to implement everything carefully.
It’s also a fun exercise in knowing when to give up, when to conclude that the variable of interest is impossible to predict given the available data.
This is a topic I’d like to write more about.</p>
<p>Interestingly, if I remember correectly, predicting the change in volume (the number of shares bought and sold) is possible.
Unfortunately, it’s not easy to know how to make money with that information.
But it’s still perhaps a fun game to play.</p>
<hr />
<h1 id="why-deep-learning-will-probably-never-improve-malaria-mapping">Why deep learning will probably never improve malaria mapping…</h1>
<p>… or SDMs or prognostic models or …</p>
<p>Every couple of months I start thinking about deep learning again.
A few days and a few headaches later I conclude that it is not useful for most of the work I do.
So I thought I’d write down my thought process this time to try and save my poor aching brain from another period of thought.</p>
<p>I’ve tried to be careful with the title wording.</p>
<p>Why - I’ll give my reasons, it’s not just a hunch.</p>
<p>deep learning - Deep neural nets but also other methods. Anything where the learning involves a transformation of a transformation of a transformation of a … of the data. But importantly not including other machine learning methods which I use all the time because they often improve predictive accuracy in my work.</p>
<p>probably - an uncertain forecast</p>
<p>never - even if data size and compute increases 100x and even after 50 postdoc years of effort</p>
<p>improve - predictive accuracy. Other elements of statistical modelling are not the topic of this post.</p>
<p>malaria mapping or SDMs or prognostic models or… - I fully acknowledge the successes of deep learning. But I don’t think they’ll help in my work.</p>
<p>So the first thing to note is that there are two ways that depth in neural networks and other methods is commonly used.
The first is the case of having multiple dense hidden layers (i.e. all nodes in each layer are connected to all nodes in the next layer).
As far as I understand this type of architecture is not that important to the success of deep learning.
I don’t quite understand the benefits of this architecture compared to a shallow but very wide neural network (one hidden layer with a lot of nodes) but they are commonly used so they must be useful.
However, the important thing here is that ultimately, the only thing that this architecture provides is increased flexibility, or nonlinearity, in the model.
However, something like a RandomForest or boosted regression trees can have unlimited nonlinearity.
So this architecture isn’t providing anything particularly unusual.
Furthermore I’ve had a pretty good go at using deep, dense neural networks to map malaria with very little success.
Tree based methods are very efficient with the data provided.
Due to their greedy estimation, they put all their focus on areas of parameter space that determine the output.
Neural networks are much less good at this.
So overall, I’m fairly confident that dense multilayer neural networks will never give much better predictions than tree based methods.
As we get more data, these architectures may do about as well as tree based methods, but not significantly better.</p>
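<p>For concreteness, here is a minimal sketch (assuming the keras R package is installed, with made-up layer sizes) of the two shapes being compared: a deep, dense network versus a shallow-but-wide one. Both only buy you flexibility.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(keras)

# Deep and dense: several fully connected hidden layers.
deep_net <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = 20) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1)

# Shallow but wide: one big hidden layer.
wide_net <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = 20) %>%
  layer_dense(units = 1)
</code></pre></div></div>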
<p>The architectures that have really driven the success of deep learning as we know it are convolutional neural networks.
These are the neural networks used for image and video analysis including image classification, image segmentation, self driving cars etc.
These methods generally exploit the spatial (and/or temporal) structure of the images or videos used to train them.
And specifically they make good predictions by learning good ways to represent the data.
At the top of the network, there will be nodes that learn that a line of low values next to a line of high values is an “edge”.
In the middle layers, there will be nodes that learn that two horizontal edges and two vertical edges makes a rectangle.
And at the bottom of the network will be nodes that learn that a rectangle and some circles is maybe a lorry.
This type of data, and these types of representations of the data just don’t exist in most of the subjects I have worked on.
In malaria mapping, having an “edge” between cold and hot areas tells you very little about the risk of malaria*.
In prognostic modelling, there are often no covariates that have any sort of image structure at all.
You might have covariates like age and pre-existing conditions, and these covariates might have interactions.
But this idea of edges and rectangles just isn’t relevant.</p>
<p>Perhaps another way to think about this is that deep learning methods have mostly excelled in situations where humans can (more or less) easily perform the task but representing the problem in a way that the computer can usefully use is difficult.
A three-year-old can identify a cat, but despite years of computer vision experts hand-crafting features such as “circles” and “triangles”, computers still couldn’t identify a cat in an image.
We have to let the model learn how to represent the data.</p>
<p>Most of the problems I work on have the opposite situation.
We can easily represent the data in a totally acceptable way; one column for age, one column for each pre-existing conditions, done.
But even experts in the particular field often couldn’t effectively use these data to make good, quantitative predictions.
How much malaria will there be in an area with a mean temperature of 28 degrees and 120 days of rain?
Quite a lot, I guess, but that’s about the best I can do.
So in these problems, the task for the machine learning model is much more about separating signal from noise, and also about finding nonlinear relationships and relatively simple interactions, and using these to make accurate predictions.
As far as I can see, deep learning doesn’t provide anything for this task that tree-based methods can’t already do, and it is just less efficient with the data.</p>
<p>This felt like it would be a much longer post as I was, once again, grappling with what a deep neural network is really doing (this time I was trying to think of good ways to fit convolutional RandomForests).
But oh well. I’ll chuck it up on my webpage and see if it generates some discussion.</p>
<p>One thing I’ll note is that I really know very little about the architectures used in language models like GPT-3.
I’ve read a fair bit about long short-term memory networks and they definitely aren’t relevant to most of the areas I have worked on.
Maybe there’s something here that will be useful though.
Similarly I don’t know about the deep neural networks used in reinforcement learning for robotics.
Maybe I’ll come back and edit this in 6 months when I’ve done more reading.</p>
<p>*The fact that some of the areas are hot (high malaria) and some of the areas are cooler (low malaria) may well tell you that in aggregate there will be intermediate malaria risk in the area. But this simpler fact can be handled much more directly with disaggregation regression, which is precisely what I have been working on for five years.</p>Tim CD LucasDisaggregation_regression_workshop2022-02-12T00:00:00-08:002022-02-12T00:00:00-08:00https://timcdlucas.github.io/disaggregation_regression_workshop<hr />
<p>Disaggregation regression workshop<br /><br /></p>
<h1 id="disaggregation-regression-workshop">Disaggregation regression workshop</h1>
<p>The recording of this workshop is here. <a href="https://www.youtube.com/watch?v=frKbnV5PxH4">https://www.youtube.com/watch?v=frKbnV5PxH4</a></p>
<p>I am now running it on both the 31st of March 2022 and the 7th of April 2022.
Both 3-5pm GMT.
There may be some space left so please email if you are interested.</p>
<p>Do you have areal data (county, LSA, ADMIN2 etc.) but want to make predictions at a higher resolution (5km x 5km raster etc)?
If so disaggregation regression might be for you.</p>
<p>I am running a free 2hr workshop on what disaggregation regression is and how to fit models using the R package disaggregation by Anita Nandi, myself and other contributors. If you would like to attend please just send me an email tim.lucas@leicester.ac.uk. I’ll email round a teams link nearer the time.</p>
<p>If you’d like to read more about disaggregation regression before then you could try:</p>
<p><a href="https://scholar.google.com/citations?view_op=view_citation&hl=en&user=WfpSfMAAAAAJ&cstart=20&pagesize=80&citation_for_view=WfpSfMAAAAAJ:R3hNpaxXUhUC">A simulation study of disaggregation regression for spatial disease mapping. R Arambepola, TCD Lucas, AK Nandi, PW Gething, E Cameron. Statistics in Medicine 41 (1), 1-16</a></p>
<p><a href="https://scholar.google.com/citations?view_op=view_citation&hl=en&user=WfpSfMAAAAAJ&cstart=20&pagesize=80&citation_for_view=WfpSfMAAAAAJ:qUcmZB5y_30C">Disaggregation: an R package for Bayesian spatial disaggregation modelling. AK Nandi, TCD Lucas, R Arambepola, P Gething, DJ Weiss. arXiv preprint arXiv:2001.04847</a></p>
<p><a href="https://github.com/aknandi/disaggregation">The disaggregation R package</a></p>
<hr />
<p>Working while on parental leave<br /><br /></p>
<h1 id="working-while-on-parental-leave">Working while on parental leave</h1>
<!--I published a paper that I wrote on my phone during parental leave.
caveat gender. person that gave birth as person that gave birth. partner as their partner.
caveat overworking
outline. should you work, how can you work?
--->
<p>I recently published my first sole author paper <a href="https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1002/ecm.1422">link</a>.
It is a review of methods for interpreting machine learning models (translucent boxes rather than black boxes).
I thought I’d write about the process because most of the paper was written while I was on shared parental leave.
So I thought I’d discuss whether you should work while on parental leave and give some tips on working while on parental leave (many of which are useful for working as a parent more generally).</p>
<p>I think a lot of the terms used in the British law governing leave is gendered and probably doesn’t match up well for trans people.
I’ll try to explicitly say “the person that gave birth” to refer to anyone who is taking leave after giving birth and just “partner” for anyone who is taking parental leave but didn’t give birth.
I’d also like to be very clear right at the beginning that this post isn’t me saying “you should work while on parental leave! Here’s how to be productive!”
Hopefully that will be clear.</p>
<!--
I'm mostly talking about 6-12 months rather than 1-6. person that gave birth in second half of a year and partners in first 6 months. that said I know nothing about people during full year parental leave.
I found it very useful to have something that wasn't baby related.
connection back to normal life.
exacerbated by the fact that I don't know people in the town I live in.
a sense of progress. baby progress is so slow.
different brain process
--->
<p>So, should you work while on parental leave?
The short answer is that if you don’t want to, or don’t have the energy or the time, then don’t.
My parental leave was from when my son was 7 months until 12 months.
I imagine that most people that give birth would really struggle to work during the first 6 months after birth (though I’m also aware that some countries have terrible parental leave allowances).
The physical recovery and terrible sleep patterns makes that fairly impossible.
Also, partners that take short amounts of leave straight after the birth (2 weeks is common here in the UK) should really be doing everything they can to help and not skipping out to read emails.
However, by the time your baby is 6 months, I think quite a few people might actively want to do some work.</p>
<p>There are plenty of good reasons to want to work while on parental leave.
Parental leave is emotionally gruelling and any way you can find to help yourself during that period is to be recommended.
Having something to think about other than babies can be fantastically useful.
It’s easy to go a week virtually without speaking to adults and thinking about nothing but looking after your baby.
I live in a small town outside of Oxford (where I worked at the time) due to lower house prices, but this means I don’t know anyone within a short walk from my house.
While looking after my first son, I met up with people for a chat on a weekly basis, but for my second son I didn’t meet up with anyone.
Furthermore, babycare is a weird combination of incredibly difficult and all-consuming while also being basically boring (depends on your personality I’m sure).
So having a “project” to think about can be a really useful thing.</p>
<p>Secondly, a small project can give a wonderful sense of achievement and progress.
There’s very few milestones with a baby; every few months they do something new.
But day-to-day, week-to-week, the measure of success is basically “did I manage to keep my child alive today”.
That single, unchanging question is not a good way to give yourself that sense of pride and success.
So again, using a doable, achievable project, with small, regular goals and a sense of progress and achievement can be a fantastic boost for your mental health.</p>
<!--
how to work?
90% has to be on your phone. work while child goes from asleep to deep sleep etc. or while carrying them or bug gying them.
choose the right project. on phone so writing. can do little bits of code planning but that doesn't really yield a completable project.
not tooooo much research. I can't do good research on my phone. maybe others can.
immediate up date between phone and computer.
use markdown.
-->
<p>So then some tips.
Firstly, you need to be able to do 90% of the project on your phone.
While on parental leave I very rarely got an hour free to get my laptop out and start working.
I worked while my son slept on me but refused to let me leave the room and on the approximately 300 walks I took through Bure Park nature reserve to get my son to sleep.</p>
<p>Secondly, you need to choose the right project.
The project needs to be largely achievable on a phone (as above); lengthy coding sessions are difficult, field work not gonna happen.
Therefore I chose to write a review.
It was a review that I felt I could write without hours and hours of research (I find that difficult on my phone but others might find it quite doable).
The project should also be chunkable into very small periods of work that give a sense of achievement.
I counted writing a paragraph in a day a great success.
A paragraph in two days was well above what I expected of myself.
No progress for a week wasn’t rare.
But each drafted paragraph felt like an accomplishment.
Each time I edited a single paragraph I counted it as an achievement.</p>
<p>There are some technical things that make working on your phone easier.
You want the files on your phone to sync directly with your computer.
When you do get half an hour to work on your computer, you don’t want to waste that time working out if your files are up to date or take 5 minutes emailing the file to yourself or whatever.
You want it just there.
I use the Dropbox app which has a text editor in it.
Writing in plain text (probably markdown) is also really useful because you can make the text big.
Word is totally useless on a phone.
Something like markdown is also useful because you can leave comments to yourself so you can quickly get back to what you were doing.
I wrote my plan in comments and then filled in the full paragraphs underneath.
You want the plan and the text in the same document because switching documents is annoying on a phone.
Keeping the plan once you’ve started writing is useful so that you know what a paragraph was supposed to say, even if you didn’t do your best drafting on that paragraph.
When editing I also left a comment to tell myself which paragraphs I had edited.
I still use these ideas for doing work on my phone while traveling or in spare minutes here and there.
I often plan out presentations (beamer) or small documents on my phone.
To a much lesser extent I’ve also planned software, writing the function names and what the inputs and outputs for each function should be.</p>
<p>So, to reiterate, I am not trying to say “you should work while on parental leave and here is how”.
However, if you actively wish to work, as part of your own mental health management, maybe some of these ideas might help.
I am 100% sure that for me, writing this paper during leave was beneficial to my mental health.
I can very easily imagine that it wouldn’t be for many other people.</p>
<hr />
<p>A primer on Bayesian mixed-effects models. <br /><br /><img src="/images/bayes_strong-1.png" /></p>
<h1 id="a-primer-on-bayesian-mixed-effects-models">A primer on Bayesian mixed-effects models</h1>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>knitr::opts_chunk$set(cache = TRUE, fig.width = 8, fig.height = 5)
set.seed(191016)
#install.packages("INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
library(dplyr)
library(ggplot2)
library(INLA)
library(malariaAtlas)
</code></pre></div></div>
<h1 id="intro">Intro</h1>
<p>This primer is an introduction to mixed-effects models. I’m presenting
it using Bayesian mixed-effects models because they are
easier to understand. The hope is that from here it will be relatively
easy to understand frequentist mixed-effects models. Or at least, have
the intuition of what the models are doing. I still don’t understand the
nuts and bolts of frequentist mixed-effects models.</p>
<p>The aim of the primer is to explain the real fundamentals of what
mixed-effects models are, why you might use them and <em>how</em> they do what they
do. This last bit (the how) is what is missed from many courses because
the how in frequentist mixed-effects models is complicated. However, in
Bayesian mixed-effects models, the how is very simple, and follows on
entirely smoothly from any other Bayesian analysis. In this case, I
think understanding how they work also makes the what and the why easier
to understand.</p>
<p>As an overview, we will look at some data and define some mathematical
models to answer some questions of interest. Then we will fit those same
models in a least squares framework, a normal Bayesian framework and
finally a mixed-effects framework.</p>
<h1 id="download-the-data">Download the data</h1>
<p>We’re going to get data using the malariaAtlas package. The data will be
prevalence surveys from Asia. To keep things simple we are going to
completely ignore the sample size for each survey. Instead we will
simply do a log(x + 0.1) transform (that will approximately normalise
things) and use that as our response.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d <- getPR(continent = 'Asia', species = 'Pf')
## Confirming availability of PR data for: Asia...
## PR points are available for Asia.
## Attempting to download PR point data for Afghanistan, Indonesia, India, Yemen, Cambodia, Bangladesh, Vietnam, Pakistan, Philippines, Nepal, Thailand, China, Tajikistan, Myanmar, Laos, Malaysia, Sri Lanka, Iraq, Saudi Arabia, Turkey, Timor-Leste, Bhutan ...
## Data downloaded for Asia.
names(d)
## [1] "dhs_id" "site_id"
## [3] "site_name" "latitude"
## [5] "longitude" "rural_urban"
## [7] "country" "country_id"
## [9] "continent_id" "month_start"
## [11] "year_start" "month_end"
## [13] "year_end" "lower_age"
## [15] "upper_age" "examined"
## [17] "positive" "pr"
## [19] "species" "method"
## [21] "rdt_type" "pcr_type"
## [23] "malaria_metrics_available" "location_available"
## [25] "permissions_info" "citation1"
## [27] "citation2" "citation3"
dtime <- d %>%
filter(!is.na(examined), !is.na(year_start)) %>%
mutate(log_pr = log(pr + 0.1)) %>%
select(country, year_start, log_pr, pr)
</code></pre></div></div>
<h1 id="we-will-ask-two-broad-questions">We will ask two broad questions.</h1>
<ul>
  <li>What was the malaria prevalence in Asia and in each country in 2000 -
2004 (ignoring any remaining temporal trends).</li>
<li>How did malaria change through time in Asia and in each country.</li>
</ul>
<p>To keep this clear we will make two seperate datasets.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dmean <- dtime %>% filter(year_start > 1999, year_start < 2005)
</code></pre></div></div>
<p>So that we can plot our predictions nicely we should make some
predictive data.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dmean_pred <- data.frame(country = unique(dmean$country))
dtime_pred <- expand.grid(country = unique(dtime$country), year_start = 1985:2018)
</code></pre></div></div>
<h1 id="lets-summarise-and-plot-the-data">Let’s summarise and plot the data.</h1>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dmean$country %>% table
## .
## Afghanistan Bangladesh Bhutan Cambodia China
## 64 0 0 187 25
## India Indonesia Iraq Laos Malaysia
## 76 124 0 20 0
## Myanmar Nepal Pakistan Philippines Saudi Arabia
## 26 0 0 0 0
## Sri Lanka Tajikistan Thailand Timor-Leste Turkey
## 0 2 72 11 8
## Vietnam Yemen
## 67 26
dmean$year %>% table
## .
## 2000 2001 2002 2003 2004
## 79 111 192 209 117
</code></pre></div></div>
<h2 id="question-one-what-was-the-mean-malaria-prevalence-per-country-in-the-period-2000---2004">Question one: what was the mean malaria prevalence per country in the period 2000 - 2004</h2>
<ul>
<li>Note that some countries like Tajikistan and Turkey have very
little data. How do we estimate their mean?</li>
<li>Also note, the data is very unbalanced. How do we estimate the Asia
total without the estimate being dominated by Indonesia?</li>
</ul>
<!-- -->
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dmean <- dmean %>%
group_by(country) %>%
mutate(n = n())
ggplot(dmean, aes(x = country, y = pr, colour = n < 10)) +
geom_boxplot() +
geom_point() +
ggtitle('Malaria prevalence by country in 2000-2004')
</code></pre></div></div>
<p><img src="/images//data_plot1-1.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ggplot(dmean, aes(x = country, y = log_pr, colour = n < 10)) +
geom_boxplot() +
geom_point() +
ggtitle('Log malaria prevalence by country in 2000-2004')
</code></pre></div></div>
<p><img src="/images//data_plot1-2.png" alt="" /></p>
<h2 id="discuss-mathematical-models-and-estimate-with-least-squares">Discuss mathematical models and estimate with least squares</h2>
<p>We can look at the structure of our mathematical models, and the way we
estimate the parameters, completely separately. So we can think of the
structure of a model and then estimate it with least squares (<code class="language-plaintext highlighter-rouge">lm()</code>) as
a simple way to start getting intuition about what things look like.</p>
<p>Starting with the first question, we can start with a model with one
global intercept.</p>
<p><em>y</em> = <em>β</em><sub>0</sub></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>m1 <- lm(log_pr ~ 1, data = dmean)
coefficients(m1)
## (Intercept)
## -1.85726
ggplot(dmean, aes(x = country, y = log_pr)) +
geom_boxplot() +
geom_point() +
ggtitle('Log malaria prevalence and Asia global mean') +
geom_abline(slope = 0, intercept = m1$coef[1])
</code></pre></div></div>
<p><img src="/images//m1-1.png" alt="" /></p>
<p>As our aim is actually to estimate the mean malaria prevalence for each
country, we need country to go in as a categorical variable.</p>
<p><em>y</em> = <em>β</em><sub>0</sub> + <em>β</em>.<em>country</em></p>
<p>It may be helpful to think about this in the explicit way it is encoded.
We have 13 countries. The ideal model would be 1 global mean and 13
country specific parameters.</p>
<p><em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub>.<em>AFG</em> + <em>β</em><sub>2</sub>.<em>KHM</em> + <em>β</em><sub>3</sub>.<em>CHN</em> + …</p>
<p>(I’m using ISO3 codes here; KHM is Cambodia, or Khmer.) Internally, R
converts the one categorical variable into binary variables. Variable 1 is
“is this row in AFG?”, variable 2 is “is this row in KHM?” etc.</p>
<p>So as these variables have a 1 if the row is in a given country and a
zero otherwise, a prediction for Afghanistan will be zeroes for all the
terms except <em>β</em><sub>0</sub> and <em>β</em><sub>1</sub>.</p>
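<p>You can see this encoding directly (a quick sketch; the reference class discussed just below is dropped automatically):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The design matrix R builds internally: an intercept plus one 0/1 column
# per country (minus the reference class).
head(model.matrix(~ country, data = dmean))
</code></pre></div></div>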
<p>Unfortunately we now have to make a quick detour. This parameterisation
is unidentifiable (the data cannot tell us the answer because there are
multiple answers that fit the data equally well). If we think about
the same model with just two countries, how could the model know whether
the intercept is high or both country-level parameters are high? When we
switch to mixed-effects models we will have a global intercept and 13
country specific parameters. But for now we will have a global intercept
and 12 country level parameters. The first country is taken as the
“reference class” and combined with the global intercept. Mostly, we can
think about the models in the same way however.</p>
<p>So now we can estimate this model with least squares</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>m2 <- lm(log_pr ~ country, data = dmean)
coefficients(m2)
## (Intercept) countryCambodia countryChina
## -1.9885722 0.2247165 -0.3045652
## countryIndia countryIndonesia countryLaos
## 0.1992390 0.1799838 0.5971321
## countryMyanmar countryTajikistan countryThailand
## 0.5630916 -0.1027141 -0.1637883
## countryTimor-Leste countryTurkey countryVietnam
## 0.0803858 -0.3140129 0.1394583
## countryYemen
## -0.0461457
pred2 <- data.frame(dmean_pred, pred = predict(m2, newdata = dmean_pred))
ggplot(dmean, aes(x = country, y = log_pr, colour = n < 10)) +
geom_boxplot() +
geom_point() +
geom_point(data = pred2, aes(country, pred), colour = 'black', size = 4) +
ggtitle('Log malaria prevalence and country specific means')
</code></pre></div></div>
<p><img src="/images//m2-1.png" alt="" /></p>
<h2 id="now-switch-to-bayes-and-remind-ourselves-what-priors-are">Now switch to Bayes and remind ourselves what priors are.</h2>
<p>Bayesian mixed modelling is essentially taking the above model
structures and doing clever things with priors. First we’ll do more
standard things with priors to remind ourselves what they mean.</p>
<p>A prior is how we tell the model what is plausible based on our
knowledge before looking at the data. The intercept in our model is the
average malaria prevalence across Asia (in log space). Is prevalence of
1 (in prevalence space) reasonable? No! So our prior should tell the
model that this is very unlikely.</p>
<p>So first let’s fit our model in a Bayesian framework with INLA. The
priors on fixed effects here are normal distributions with a mean and
precision (1/variance). For our first model we are putting very wide
priors on the parameters which should give us parameter estimates very
similar to the least squares estimate.</p>
<p>In the first model, the global intercept was dominated by countries with
lots of data like Indonesia. These data aren’t independent because we
expect the data within Indonesia to be more similar than the data
between Indonesia and other countries. If we were independently sampling
each person in Asia, China and India would have a lot more data than
Cambodia! When people talk about autocorrelation in the data and
mixed-models this is what they are referring to. While removing this
autocorrelation is good, most of the statistical power will go into
learning country level intercepts, not the global mean.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># easiest way to predict with INLA is to put the prediction data in with NAs in the Y column.
dmean_both <- bind_rows(dmean, dmean_pred)
pred_ii <- which(is.na(dmean_both$log_pr))
# Very vague priors first.
priors <- list(mean.intercept = -2, prec.intercept = 1e-4,
mean = 0, prec = 1e-4)
bm1 <- inla(log_pr ~ country, data = dmean_both,
control.fixed = priors,
control.predictor = list(compute = TRUE))
predb1 <- data.frame(dmean_pred, pred = bm1$summary.fitted.values[pred_ii, 1])
ggplot(dmean, aes(x = country, y = log_pr, colour = n < 10)) +
geom_boxplot() +
geom_point() +
geom_point(data = predb1, aes(country, pred), colour = 'black', size = 4) +
ggtitle('Log malaria prevalence. Bayesian means with vague priors.')
</code></pre></div></div>
<p><img src="/images//bayes-1.png" alt="" /></p>
<p>Now let’s say that we think all countries are fairly similar. To encode
that in the prior we say that the <em>β</em><sub><em>i</em></sub>’s should be small.
INLA works with precision (1/variance) so high precision is a tight
prior around 0.</p>
<p>This is “pooling”. Our estimates for countries with not much data will
be helped by information from the other countries.</p>
<p>Our estimates for the global mean will be dominated by countries with
lots of data. But we won’t have put all our statistical power into
learning the country level parameters.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>priors <- list(mean.intercept = -2, prec.intercept = 1e-4,
mean = 0, prec = 100)
bm2 <- inla(log_pr ~ country, data = dmean_both,
control.fixed = priors,
control.predictor = list(compute = TRUE))
predb2 <- data.frame(dmean_pred, pred = bm2$summary.fitted.values[pred_ii, 1])
ggplot(dmean, aes(x = country, y = log_pr, colour = n < 10)) +
geom_boxplot() +
geom_point() +
geom_point(data = predb1, aes(country, pred), colour = 'black', size = 4, alpha = 0.3) +
geom_point(data = predb2, aes(country, pred), colour = 'black', size = 4) +
ggtitle('Log malaria prevalence. Strong, pooling priors.') +
labs(subtitle = 'Least squares estimates in grey. Estimates pulled towards global mean.')
</code></pre></div></div>
<p><img src="/images//bayes_strong-1.png" alt="" /></p>
<h1 id="the-crux">THE CRUX</h1>
<p>So if we think that all countries are quite similar, we should put a
strong prior on the country level parameters. For countries with little
data this means our estimates are close to the global mean. This is
“pooling”. But it also means the global estimate will be dominated by
countries with lots of data.</p>
<p>If we think that countries are quite dissimilar, we should put a weak
prior on the country level parameters. For countries with little data,
our estimates will be noisy, but maybe that’s better than them being
biased towards the mean. Our global estimate won’t be dominated by any
one country.</p>
<p>The problem then is <em>how similar are countries</em>. Often, we don’t know.
So how do we set our priors sensibly? The answer is mixed-effects
models.</p>
<h1 id="mixed-effects-model">Mixed-effects model</h1>
<p>Our models above looked like this:
<em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub>.<em>AFG</em> + <em>β</em><sub>2</sub>.<em>KHM</em> + <em>β</em><sub>3</sub>.<em>CHN</em> + …<br />
<em>β</em><sub>0</sub> ∼ <em>Norm</em>(−2, 10000)<br />
<em>β</em><sub>i</sub> ∼ <em>Norm</em>(0, 0.001)</p>
<p>We are now saying “we don’t know what number to choose instead of
0.001”. So, along with the rest of the model we will estimate it. We
don’t know how different the different countries are, so we will let the
data tell us.</p>
<p>To do this, we switch the 0.001 for a new variable, <em>σ</em>, and put a prior
on <em>σ</em>.<br />
<em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub>.<em>AFG</em> + <em>β</em><sub>2</sub>.<em>KHM</em> + <em>β</em><sub>3</sub>.<em>CHN</em> + …<br />
<em>β</em><sub>0</sub> ∼ <em>Norm</em>(−2, 10000)<br />
<em>β</em><sub>i</sub> ∼ <em>Norm</em>(0, <em>σ</em>)<br />
<em>σ</em> ∼ some prior distribution<br />
Mixed-effects models are also called hierarchical models for this
reason: the prior on the prior is hierarchical.</p>
<p>So, now if the countries that do have lots of data are very different
from each other, the model will learn that <em>σ</em> must be quite big.
Therefore the countries with little data will not be pulled towards the
mean much. If the countries with lots of data are very similar, then a
country with little data should be pulled towards the mean. If the few
data points lie far from the global mean then probably it’s just by
chance.</p>
<p>Setting hyperpriors can be awkward. Note that <em>σ</em> must be positive so we
need a prior that reflects that.</p>
<p>Recently Penalised complexity priors have been developed and they are
much more intuitive. You choose a “tail value”: What is the largest
value of <em>σ</em> that is reasonable? You then tell the model that the
probability that <em>σ</em> is greater than that value is a small probability
(1% or something).</p>
<p>So for now we’ll say <em>P</em>(<em>σ</em> > 0.1)=1%.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>priors <- list(mean.intercept = -2, prec.intercept = 1e-4)
hyperprior <- list(prec = list(prior="pc.prec", param = c(0.1, 0.01)))
f <- log_pr ~ f(country, model = 'iid', hyper = hyperprior)
mm1 <- inla(f, data = dmean_both,
control.fixed = priors,
control.predictor = list(compute = TRUE))
mm1$summary.hyperpar
## mean sd 0.025quant
## Precision for the Gaussian observations 4.938214 0.2654638 4.433231
## Precision for country 32.062486 11.9888290 14.787373
## 0.5quant 0.975quant mode
## Precision for the Gaussian observations 4.932598 5.476931 4.922921
## Precision for country 30.018967 61.137516 26.332045
1 / mm1$summary.hyperpar$mean[2]
## [1] 0.0311891
predm1 <- data.frame(dmean_pred, pred = mm1$summary.fitted.values[pred_ii, 1])
ggplot(dmean, aes(x = country, y = log_pr, colour = n < 10)) +
geom_boxplot() +
geom_point() +
geom_point(data = predb1, aes(country, pred), colour = 'black', size = 4, alpha = 0.3) +
geom_point(data = predm1, aes(country, pred), colour = 'black', size = 4) +
ggtitle('Log malaria prevalence by country in 2000-2004 (pooling priors)')
</code></pre></div></div>
<p><img src="/images//mixed_model-1.png" alt="" /></p>
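<p>For readers who want the frequentist version mentioned in the intro, the same random-intercepts structure would look like this in lme4 (a sketch only, assuming the lme4 package; it is not fitted in this document):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(lme4)

# A random intercept per country; the between-country sd is estimated
# from the data, playing the role of sigma in the Bayesian model above.
fm1 <- lmer(log_pr ~ 1 + (1 | country), data = dmean)
summary(fm1)
</code></pre></div></div>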
<p>We can look at the prior and posterior for the hyperparameter that
governs the strength of the prior. We can look at this on the internal
precision scale used by INLA. We said that the prior probability that
the precision is <em>less</em> than 3.1 is 1%. On the sd scale we said the
prior probability that the sd is <em>greater</em> than 0.1 is 1%. I’ve plotted
these in red lines. They don’t seem quite right but at least on the
right side.</p>
<p>I’m also not 100% sure that my scaling here is correct. The posterior
(black) looks very wide and high.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># These plots mess up knitr...
#plot(mm1, plot.lincomb = FALSE, plot.random.effects = TRUE,
# plot.fixed.effects = FALSE, plot.predictor = FALSE,
# plot.prior = TRUE)
#abline(v = 10, col = 'red')
# Plot the posterior on precision scale
plot(mm1$marginals.hyperpar$`Precision for country`, type="l", xlim=c(0, 80))
</code></pre></div></div>
<p><img src="/images//model_plots-1.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Plot the prior then add posterior
kappa <- exp(seq(-5, 15, len=10000))
prior.new = inla.pc.dprec(kappa, 0.1, 0.01)
plot(kappa, prior.new, col = 'blue', type = 'l', xlim = c(0, 200), ylim = c(0, 0.002))
lines(mm1$marginals.hyperpar$`Precision for country`)
abline(v = 3.1, col = 'red')
</code></pre></div></div>
<p><img src="/images//model_plots-2.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Plot the posterior on sd scale
sd_scale <- mm1$marginals.hyperpar$`Precision for country`
sd_scale[, 'x'] <- 1/sqrt(sd_scale[, 'x'])
plot(sd_scale, type="l", xlim=c(0, 1))
</code></pre></div></div>
<p><img src="/images//model_plots-3.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Plot prior on sd scale.
plot(1/sqrt(kappa), prior.new, col = 'blue', type = 'l', xlim = c(0, 1), ylim = c(0, 0.002))
lines(sd_scale)
abline(v = 0.1, col = 'red')
</code></pre></div></div>
<p><img src="/images//model_plots-4.png" alt="" /></p>
<h2 id="bit-more-on-priors">Bit more on priors</h2>
<p>Between working out what scale you’re using (variance, sd or precision)
and the intuition being difficult, these priors can be difficult
to think about and to choose your values.</p>
<p>The way I’ve found to go about this is just plotting distributions. What
does a normal with sd of 1 look like? If a country had an estimated
iid effect of 1, is that plausible? Do something simple like the rough
intercept + 1 and transform back into the natural scale.</p>
<p>For example, let’s start by thinking of N(0, 1). It would be quite easy
to get values around -2.5 and 2.5 from this. So with an intercept of
something like -1.5 this gives us values ranging from
exp(−1.5 − 2.5) ≈ 0.02
on the prevalence scale, which is reasonable, and
exp(−1.5 + 2.5)=2.7
at which point we realise that we should be using logit not log, and
that probably we don’t want a country being estimated prevalence above 1
and that N(0, 1) is really very flexible. Our prior of 0.1 being on the
upper end of likely is therefore kind of reasonable.</p>
<p>INLA has these penalised complexity priors and they are quite nice. If
you end up using other Bayesian packages you may well have to use other
priors. Gamma distributions and half normals on SD are common. Same
thing though, plot some distributions and see how reasonable it is.</p>
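<p>As a small sketch of that kind of check (using the rough intercept of -1.5 and the N(0, 1) country effect from above):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simulate country effects from N(0, 1) and look at the implied prevalences
# (the response is log(pr + 0.1), so back-transform with exp(x) - 0.1).
effects <- rnorm(10000, 0, 1)
implied_prev <- exp(-1.5 + effects) - 0.1
hist(implied_prev, breaks = 50)
mean(implied_prev > 1)  # a non-trivial chance of prevalence above 1: very flexible
</code></pre></div></div>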
<p>And see this paper. In particular Figure 4.
<a href="https://arxiv.org/abs/1709.01449">https://arxiv.org/abs/1709.01449</a></p>
<h1 id="question-two-what-were-the-malaria-trends-in-asia-and-in-each-country">Question two: What were the malaria trends in Asia and in each country.</h1>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dtime$country %>% table
## .
## Afghanistan Bangladesh Bhutan Cambodia China
## 224 364 23 211 102
## India Indonesia Iraq Laos Malaysia
## 219 1117 11 76 15
## Myanmar Nepal Pakistan Philippines Saudi Arabia
## 38 0 56 350 2
## Sri Lanka Tajikistan Thailand Timor-Leste Turkey
## 18 8 105 11 8
## Vietnam Yemen
## 150 136
dtime$year %>% table
## .
## 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
## 64 64 49 12 39 33 50 114 60 136 52 206 125 79 41
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 79 111 192 209 117 216 175 650 223 10 35 1 34 60 4
## 2015
## 4
</code></pre></div></div>
<p>Note that countries like Bhutan and Iraq have very little data. Start thinking how
you would estimate a temporal trend in those countries. As above, how do
we estimate a temporal trend without it being dominated by the trend in
Indonesia. The above mixed-effects model was called a random intercepts
model. The “random” component was the iid country effect and we were
estimating many intercepts. Now we will look at a random slopes model.
The regression slopes will become our random component.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ggplot(dtime, aes(x = year_start, y = pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, scales = 'free_y', ncol = 3) +
ggtitle('Malaria prevalence by country through time')
</code></pre></div></div>
<p><img src="/images//data_plot2-1.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ggplot(dtime, aes(x = year_start, y = log_pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, scales = 'free_y', ncol = 3) +
ggtitle('Log malaria prevalence by country through time')
</code></pre></div></div>
<p><img src="/images//data_plot2-2.png" alt="" /></p>
<h3 id="going-back-to-least-squares">Going back to least squares.</h3>
<p>For question two we need to include a “year_start” term. The simplest
model we can usefully do is a global year term and ignore country level
lines.</p>
<p><em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub> <em>year</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>m3 <- lm(log_pr ~ year_start, data = dtime)
pred3 <- data.frame(dtime_pred, pred = predict(m3, newdata = dtime_pred))
ggplot(dtime, aes(x = year_start, y = log_pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, ncol = 3) +
geom_line(data = pred3, aes(y = pred)) +
ggtitle('Log malaria prevalence by country through time: only one slope')
</code></pre></div></div>
<p><img src="/images//time_models-1.png" alt="" /></p>
<p>We could instead estimate a separate intercept for each country but still
only one slope. As above we would want these intercepts to be random
effects but for now they aren’t.
<em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub> <em>year</em> + <em>β</em><sub>2</sub> AFG + <em>β</em><sub>3</sub> KHM + <em>β</em><sub>4</sub> CHN + …</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>m4 <- lm(log_pr ~ year_start + country, data = dtime)
pred4 <- data.frame(dtime_pred, pred = predict(m4, newdata = dtime_pred))
ggplot(dtime, aes(x = year_start, y = log_pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, ncol = 3) +
geom_line(data = pred4 , aes(y = pred)) +
ggtitle('Log malaria prevalence. Seperate intercepts, one slope.')
</code></pre></div></div>
<p><img src="/images//m4-1.png" alt="" /></p>
<p>Or we could estimate one slope and one intercept for each country as
well as a global slope and global intercept. This model makes sense and
lets us answer the questions we are asking.</p>
<p><em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub> <em>year</em> + <em>β</em><sub>2</sub> AFG + <em>β</em><sub>3</sub> KHM + <em>β</em><sub>4</sub> CHN + … + <em>β</em><sub>5</sub> AFG·<em>year</em> + <em>β</em><sub>6</sub> KHM·<em>year</em> + <em>β</em><sub>7</sub> CHN·<em>year</em> + …
Again, the variables AFG etc. are 1 if the datapoint is in Afghanistan and 0
otherwise. So <em>β</em><sub>5</sub> AFG·<em>year</em> will be zero if the
datapoint is not in Afghanistan and will be <em>β</em><sub>5</sub> <em>year</em>
if the datapoint is in Afghanistan.</p>
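<p>If you want to see these dummy and interaction variables explicitly, you can inspect the design matrix R builds from the formula (a quick check using the <code class="language-plaintext highlighter-rouge">dtime</code> data frame from above):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The design matrix holds the intercept, the 0/1 country dummies and the
# country:year_start interaction columns used in the equation above.
X <- model.matrix(log_pr ~ country + year_start:country, data = dtime)
dim(X)
X[1:3, 1:5]
</code></pre></div></div>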
<p>We can fit this model with least squares.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>m5 <- lm(log_pr ~ country + year_start:country , data = dtime)
pred5 <- data.frame(dtime_pred, pred = predict(m5, newdata = dtime_pred))
## Warning in predict.lm(m5, newdata = dtime_pred): prediction from a rank-
## deficient fit may be misleading
ggplot(dtime, aes(x = year_start, y = log_pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, ncol = 3, scale = 'free_y') +
geom_line(data = pred5, aes(y = pred)) +
ggtitle('Log malaria prevalence. Seperate intercepts and slopes.')
</code></pre></div></div>
<p><img src="/images//m5-1.png" alt="" /></p>
<p>But the estimates in countries like Timor-Leste are not very good. I
don’t believe malaria is decreasing in Timor-Leste 10x faster than in
other countries, and I don’t believe Turkey is going through an
epidemic. As above, we have a model structure that is useful, but the
way we are estimating our parameters is not very good. So first off
let’s switch to a Bayesian model.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dtime_both <- bind_rows(dtime, dtime_pred)
pred_ii <- which(is.na(dtime_both$log_pr))
# This is all just messing around getting the priors together.
# Not going to think about this too much, but we think malaria is going down.
names <- paste('country', unique(dtime$country), sep = '')
pmean <- c(rep(list(0), length(names)), -0.5) # 0 is the mean for the country intercepts, -0.5 (the default) is the mean for the year slopes.
names(pmean) <- c(names, 'default')
# 0.1 is the precision for the country intercepts, 0.001 (the default) is the precision for the year slopes.
pprec <- c(rep(list(0.1), length(names)), 0.001)
names(pprec) <- c(names, 'default')
priors <- list(mean.intercept = -2, prec.intercept = 1e-4,
mean = pmean, prec = pprec)
b3 <- inla(log_pr ~ country + year_start:country , data = dtime_both,
control.fixed = priors,
control.predictor = list(compute = TRUE))
predb3 <- data.frame(dtime_pred, pred = b3$summary.fitted.values[pred_ii, 1])
ggplot(dtime, aes(x = year_start, y = log_pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, ncol = 3, scale = 'fixed') +
geom_line(data = pred5, aes(y = pred), alpha = 0.3) +
#geom_line(data = pred3, aes(y = pred), colour = 'blue', alpha = 0.3) +
geom_line(data = predb3, aes(y = pred)) +
ggtitle('Log malaria prevalence. Pooling priors') +
ylim(-3, -0.5)
## Warning: Removed 26 rows containing missing values (geom_point).
</code></pre></div></div>
<p><img src="/images//bayes_slopes-1.png" alt="" /></p>
<p>As above, these priors have pushed both the slopes and intercepts
much closer to the global mean. And again, we don’t really know
how similar the intercepts and slopes are to each other, so next we let the
data tell us.</p>
<h2 id="random-slopes">Random Slopes</h2>
<p>As before we have a model:
<em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub> <em>year</em> + <em>β</em><sub>2</sub> AFG + <em>β</em><sub>3</sub> KHM + <em>β</em><sub>4</sub> CHN + … + <em>β</em><sub>5</sub> AFG·<em>year</em> + <em>β</em><sub>6</sub> KHM·<em>year</em> + <em>β</em><sub>7</sub> CHN·<em>year</em> + …
And we have priors that we don’t know how strong they should be.
<em>β</em><sub>2 − 4</sub> ∼ Norm(0, <em>σ</em><sub>intercept</sub>)
<em>σ</em><sub>intercept</sub> ∼ some prior distribution
<em>β</em><sub>5 − 7</sub> ∼ Norm(0, <em>σ</em><sub>slope</sub>)
<em>σ</em><sub>slope</sub> ∼ some prior distribution</p>
<p>And as above we can use penalised complexity priors.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dtime_both$country2 <- dtime_both$country # INLA needs us to copy this column
# We will put weak priors on the fixed effects. They can do what they want.
priors <- list(mean = list(year_start = -0.5, default = -2),
prec = 1e-5)
hyper.intercept <- list(prec = list(prior="pc.prec", param = c(0.1, 0.01)))
hyper.slope <- list(prec = list(prior="pc.prec", param = c(0.1, 0.01)))
# For the formula we need year_start in for our global term.
# The global intercept is just the intercept and is included by default
f <- log_pr ~ year_start +
f(country, model = 'iid', hyper = hyper.intercept) +
f(country2, year_start, model = 'iid', hyper = hyper.slope)
mm2 <- inla(f, data = dtime_both,
control.fixed = priors,
control.predictor = list(compute = TRUE))
predmm2 <- data.frame(dtime_pred, pred = mm2$summary.fitted.values[pred_ii, 1])
ggplot(dtime, aes(x = year_start, y = log_pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, ncol = 3, scale = 'fixed') +
geom_line(data = pred5, aes(y = pred), alpha = 0.3) +
geom_line(data = predmm2, aes(y = pred)) +
ggtitle('Log malaria prevalence by country through time. Random intercepts and slopes') +
ylim(-3, -0.5)
## Warning: Removed 26 rows containing missing values (geom_point).
</code></pre></div></div>
<p><img src="/images//Mixed_slopes-1.png" alt="" /></p>
<p>Now just some messing around to explore what we have fitted. Here is a
histogram of the fitted country slopes (the global slope plus each country’s random deviation).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hist(mm2$summary.fixed$mean[2] + mm2$summary.random$country2$mean)
</code></pre></div></div>
<p><img src="/images//plots1-1.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mm2$summary.hyperpar
## mean sd
## Precision for the Gaussian observations 5.931078e+00 1.476931e-01
## Precision for country 5.919402e+04 1.222777e+06
## Precision for country2 5.794590e+07 2.128258e+07
## 0.025quant 0.5quant
## Precision for the Gaussian observations 5.645668e+00 5.929447e+00
## Precision for country 1.287891e+02 3.612827e+03
## Precision for country2 2.617741e+07 5.475057e+07
## 0.975quant mode
## Precision for the Gaussian observations 6.226221e+00 5.926438e+00
## Precision for country 3.463451e+05 2.206538e+02
## Precision for country2 1.085916e+08 4.872095e+07
</code></pre></div></div>
<p>The estimated mean for the precision of the random slope component is
about 5e7. Therefore the sd is roughly $1/\sqrt{5e7} \approx 0.0001$. The data have told us that
the declines in each country are pretty similar. Therefore the crazy
slope in Timor-Leste is totally unjustified.</p>
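<p>A quick way to do this conversion for all of the hyperparameters at once (just transforming the posterior means, which is rough but fine for intuition):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Posterior mean precisions transformed to standard deviations.
# (The mean of 1/sqrt(precision) is not exactly 1/sqrt(mean precision),
#  but it is good enough for a sanity check.)
1 / sqrt(mm2$summary.hyperpar$mean)
</code></pre></div></div>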
<h1 id="recap-and-practical-advice">Recap and practical advice</h1>
<p>So, we have fitted a model for prevalence and a model for prevalence
through time. In both cases we have many countries, and therefore many
parameters. We want to put priors on these many parameters but don’t
know how strong to make them. So we use a mixed-effect model to put a
hyperprior on the prior.</p>
<p>These parameters can be intercepts or regression slopes. Everything
works the same way but this can be confusing in the programming syntax.
This is what we refer to as random intercepts and random slopes models.</p>
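<p>Pulling together the syntax used in this post, the two kinds of random effect look like this (shown as comments, since both forms appear in the models fitted above):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># INLA:
#   f(country, model = 'iid')                # random intercepts: one iid effect per country
#   f(country2, year_start, model = 'iid')   # random slopes: one iid coefficient on year per country
# lme4:
#   (1 | country)                            # random intercepts
#   (year_start | country)                   # random intercepts and slopes
</code></pre></div></div>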
<p>So when are these models suitable? Given that the sole thing they do is
change the estimates of these many parameters, we should focus on
whether that makes sense in a particular case.</p>
<ol>
<li>We need to estimate <em>σ</em>, the between group variance. We therefore
need many groups for this estimate to be any good. The number of
countries we have here is on the lower side.</li>
<li>If each group has loads of data, the prior will be ignored. So the
benefit of mixed-effects models is reduced if every group has lots
of data.</li>
<li>These models can be used for different reasons. Perhaps we are
estimating some global fixed effect but want to account
for autocorrelation. Perhaps we are interested in the individual
group estimates, but want to share information between groups.</li>
</ol>
<h1 id="frequentist-mixed-models">Frequentist mixed models.</h1>
<p>I really don’t understand frequentist models. They sort of do the same
thing (estimating the variance of the random effect) but without priors.
I dunno. Standard library is lme4 and you would do the above models like
this.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(lme4)
f1 <- log_pr ~ (1 | country)
mm3 <- lmer(f1, data = dmean)
coefficients(mm3)
## $country
## (Intercept)
## Afghanistan -1.984396
## Cambodia -1.765873
## China -2.250998
## India -1.793234
## Indonesia -1.810567
## Laos -1.455991
## Myanmar -1.473185
## Tajikistan -1.973589
## Thailand -2.142186
## Timor-Leste -1.905138
## Turkey -2.192474
## Vietnam -1.850997
## Yemen -2.020359
##
## attr(,"class")
## [1] "coef.mer"
fixef(mm3)
## (Intercept)
## -1.893768
f2 <- log_pr ~ year_start + (year_start | country)
mm4 <- lmer(f2, data = dtime)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : unable to evaluate scaled gradient
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : Model failed to converge: degenerate Hessian with 1 negative
## eigenvalues
coefficients(mm4)
## $country
## (Intercept) year_start
## Afghanistan 23.86207 -0.01298811
## Bangladesh 23.74080 -0.01288552
## Bhutan 23.86637 -0.01299368
## Cambodia 23.37922 -0.01257206
## China 24.04296 -0.01314883
## India 23.26472 -0.01247322
## Indonesia 23.66311 -0.01281644
## Iraq 24.24638 -0.01332219
## Laos 22.99958 -0.01224425
## Malaysia 23.68770 -0.01283863
## Myanmar 22.96536 -0.01221494
## Pakistan 23.85269 -0.01297997
## Philippines 23.99380 -0.01310051
## Saudi Arabia 23.95024 -0.01306647
## Sri Lanka 23.95157 -0.01306773
## Tajikistan 23.88800 -0.01301237
## Thailand 23.74605 -0.01288867
## Timor-Leste 23.58755 -0.01275329
## Turkey 24.00437 -0.01311285
## Vietnam 23.67971 -0.01283811
## Yemen 23.36557 -0.01257216
##
## attr(,"class")
## [1] "coef.mer"
fixef(mm4)
## (Intercept) year_start
## 23.7018017 -0.0128519
</code></pre></div></div>
<p>R is complaining about not being able to fit the model properly. I don’t
know why.</p>
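<p>One common cause of these lme4 convergence warnings (a guess rather than a diagnosis for this particular model) is predictors on a large numeric scale, like raw calendar years. Centring the year before fitting often helps:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Centre year so the optimiser works on a sensible scale and the intercept
# refers to the mean year rather than year 0. (A hedged suggestion; it may
# not remove the warnings for this particular dataset.)
dtime$year_c <- dtime$year_start - mean(dtime$year_start)
mm4b <- lmer(log_pr ~ year_c + (year_c | country), data = dtime)
summary(mm4b)
</code></pre></div></div>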
<p>Check this paper for more.
<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5970551/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5970551/</a></p>Tim CD LucasA primer on Bayesian mixed-effects models.Stats280 Twitter Stats Course2017-11-15T00:00:00-08:002017-11-15T00:00:00-08:00https://timcdlucas.github.io/stats280-twitter-stats-course<hr />
<p>An attempt at writing a full statistics course for twitter.<br /><br /><img src="/images/stats280.png" /></p>
<iframe class="wakeletEmbed" width="100%" height="760px" src="https://embed.wakelet.com/wakes/42d2cac4-4fb7-4be9-b476-d9df88d11c23/list" style="border: none" allow="autoplay"></iframe>
<!-- Please only call https://embed-assets.wakelet.com/wakelet-embed.js once per page -->
<script src="https://embed-assets.wakelet.com/wakelet-embed.js" charset="UTF-8"></script>Tim CD LucasAn attempt at writing a full statistics course for twitter.Measlesdataviz2015-05-13T00:00:00-07:002015-05-13T00:00:00-07:00https://timcdlucas.github.io/measlesdataviz<hr />
<p>Working through visualising the effects of the measles vaccine. <br /><br /><img src="/images/measlesTimeseries.png" /></p>
<h1 id="visualising-the-effects-of-the-measles-vaccine">Visualising the effects of the measles vaccine</h1>
<p>First there was the Wall Street Journal <a href="http://graphics.wsj.com/infectious-diseases-and-vaccines/">visualisation</a>.</p>
<p>Then <a href="http://www.twitter.com/RobertAllison__">@RobertAllison__</a> redrew the <a href="http://blogs.sas.com/content/sastraining/2015/02/17/how-to-make-infectious-diseases-look-better/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+sasblogs+%28SAS+Blogs%29">plot</a>.</p>
<p>Then <a href="www.twitter.com/biomickwatson">@biomickwatson</a> recreated the <a href="https://biomickwatson.wordpress.com/2015/04/09/recreating-a-famous-visualisation/">plot</a>. Finally, <a href="http://www.twitter.com/benjaminlmoore">@benjaminlmoore</a> recreated the <a href="http://www.r-bloggers.com/recreating-the-vaccination-heatmaps-in-r/">plot</a> in ggplot2.</p>
<p>So I thought I’d have a go as well. I’ve downloaded the <em>incidence</em> data from the Tycho website. <a href="http://www.tycho.pitt.edu/l1advanced.php">http://www.tycho.pitt.edu/l1advanced.php</a> You have to register and stuff. I also deleted the first two rows with titles in.</p>
<p>Before I start, my aims:</p>
<ul>
<li>No funky colour ramps. Let the data speak for itself.</li>
<li>Distinguish between missing data and zeros.</li>
<li>I’m considering reordering the states. Perhaps largest states at the top? Or high measles burden at top.</li>
</ul>
<h3 id="the-code">The code</h3>
<p>The code is available as an Rmarkdown document on <a href="https://github.com/timcdlucas/statsforbios/tree/master/measles">github</a>.</p>
<p>So, read in data. Then shamelessly copy code from @biomickwatson to get to a decent starting point.</p>
<p>To go from weekly data to annual, I am taking the mean across the year (with NAs removed). As pointed out by <a href="http://blogs.sas.com/content/sastraining/2015/02/17/how-to-make-infectious-diseases-look-better/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+sasblogs+%28SAS+Blogs%29">@RobertAllison__</a>, if you sum the data, the NAs introduce a bias. So I am taking the mean, then multiplying back by 52 to get expected cases per 100,000 per year.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">library</span><span class="p">(</span><span class="n">gplots</span><span class="p">)</span>
<span class="n">m</span> <span class="o"><-</span> <span class="n">read</span><span class="p">.</span><span class="n">csv</span><span class="p">(</span><span class="s">'MEASLES_Incidence_1928-2003_20150409110701.csv'</span><span class="p">,</span> <span class="n">stringsAsFactors</span> <span class="o">=</span> <span class="n">FALSE</span><span class="p">)</span>
<span class="c1"># yoink. Cheers @biomickwatson
</span><span class="n">m</span><span class="p">[</span><span class="n">m</span> <span class="o">==</span> <span class="s">"-"</span><span class="p">]</span> <span class="o"><-</span> <span class="n">NA</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="ow">in</span> <span class="mi">2</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">m</span><span class="p">))</span> <span class="p">{</span>
<span class="n">m</span><span class="p">[,</span> <span class="n">i</span><span class="p">]</span> <span class="o"><-</span> <span class="k">as</span><span class="p">.</span><span class="n">numeric</span><span class="p">(</span><span class="n">m</span><span class="p">[,</span> <span class="n">i</span><span class="p">])</span>
<span class="p">}</span>
<span class="n">m</span> <span class="o"><-</span> <span class="n">m</span><span class="p">[</span><span class="n">m</span><span class="err">$</span><span class="n">YEAR</span><span class="o">>=</span><span class="mi">1930</span><span class="p">,]</span>
<span class="n">y</span> <span class="o"><-</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">m</span><span class="p">[,</span><span class="mi">3</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">m</span><span class="p">)],</span> <span class="n">by</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">m</span><span class="p">[,</span><span class="mi">1</span><span class="p">]),</span> <span class="n">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="mi">52</span><span class="o">*</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">rm</span> <span class="o">=</span> <span class="n">TRUE</span><span class="p">))</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="ow">in</span> <span class="mi">1</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">y</span><span class="p">))</span> <span class="p">{</span>
<span class="n">y</span><span class="p">[</span><span class="ow">is</span><span class="p">.</span><span class="n">nan</span><span class="p">(</span><span class="n">y</span><span class="p">[,</span> <span class="n">i</span><span class="p">]),</span> <span class="n">i</span><span class="p">]</span> <span class="o"><-</span> <span class="n">NA</span>
<span class="p">}</span>
<span class="n">y</span> <span class="o"><-</span> <span class="n">y</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">y</span><span class="err">$</span><span class="n">year</span><span class="p">),]</span>
<span class="n">row</span><span class="p">.</span><span class="n">labels</span> <span class="o"><-</span> <span class="n">rep</span><span class="p">(</span><span class="s">""</span><span class="p">,</span> <span class="mi">72</span><span class="p">)</span>
<span class="n">row</span><span class="p">.</span><span class="n">labels</span><span class="p">[</span><span class="n">c</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">11</span><span class="p">,</span><span class="mi">21</span><span class="p">,</span><span class="mi">31</span><span class="p">,</span><span class="mi">41</span><span class="p">,</span><span class="mi">51</span><span class="p">,</span><span class="mi">61</span><span class="p">,</span><span class="mi">71</span><span class="p">)]</span> <span class="o"><-</span> <span class="n">c</span><span class="p">(</span><span class="s">"1930"</span><span class="p">,</span><span class="s">"1940"</span><span class="p">,</span><span class="s">"1950"</span><span class="p">,</span><span class="s">"1960"</span><span class="p">,</span><span class="s">"1970"</span><span class="p">,</span>
<span class="s">"1980"</span><span class="p">,</span><span class="s">"1990"</span><span class="p">,</span><span class="s">"2000"</span><span class="p">)</span>
<span class="n">cols</span> <span class="o"><-</span> <span class="n">colorRampPalette</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="s">"red"</span><span class="p">,</span> <span class="s">"blue"</span><span class="p">))(</span><span class="mi">100</span><span class="p">)</span>
<span class="n">bks</span> <span class="o"><-</span> <span class="n">seq</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="p">(</span><span class="n">y</span><span class="p">[,</span> <span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">na</span><span class="p">.</span><span class="n">rm</span> <span class="o">=</span> <span class="n">TRUE</span><span class="p">),</span> <span class="n">length</span><span class="p">.</span><span class="n">out</span> <span class="o">=</span> <span class="mi">101</span><span class="p">)</span>
<span class="n">par</span><span class="p">(</span><span class="n">cex</span><span class="p">.</span><span class="n">main</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="n">heatmap</span><span class="p">.</span><span class="mi">2</span><span class="p">(</span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">y</span><span class="p">[,</span><span class="mi">2</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">y</span><span class="p">)])),</span> <span class="n">Rowv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span> <span class="n">Colv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span>
<span class="n">dendrogram</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">trace</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">FALSE</span><span class="p">,</span>
<span class="n">labCol</span><span class="o">=</span><span class="n">row</span><span class="p">.</span><span class="n">labels</span><span class="p">,</span> <span class="n">cexCol</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">lhei</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.15</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">lwid</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">margins</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">12</span><span class="p">),</span>
<span class="n">col</span><span class="o">=</span><span class="n">cols</span><span class="p">,</span> <span class="n">breaks</span><span class="o">=</span><span class="n">bks</span><span class="p">,</span> <span class="n">colsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">72</span><span class="p">,</span> <span class="n">srtCol</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">rowsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">57</span><span class="p">,</span> <span class="n">sepcolor</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span>
<span class="n">add</span><span class="p">.</span><span class="n">expr</span><span class="o">=</span><span class="n">lines</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">32</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1000</span><span class="p">),</span><span class="n">lwd</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span>
<span class="n">main</span><span class="o">=</span><span class="s">'Measles cases in US states 1930-2001</span><span class="se">\n</span><span class="s">Vaccine introduced 1961
</span><span class="se">\n</span><span class="s">(data from Project Tycho)'</span><span class="p">)</span></code></pre></figure>
<figure>
<img src="/images/setup-1.png" />
<figcaption> </figcaption>
</figure>
<p>OK. NA’s are white. Other colours are ramped. That’s good. The colour ramp here is funny because I’m using @biomickwatson’s values which match the cases data rather than the incidence data.</p>
<p>I like the labels on the right so I’ll leave that.</p>
<p>Now to get some good colours. I might try and leave NA’s white and have a ramp that doesn’t include white. RColorBrewer asseeeeemble.</p>
<p>Also going to change to 2 letter state names. The csv is just a list copied from the <a href="http://www.50states.com/abbreviations.htm#.VSgJpXXd89Y">web</a>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">library</span><span class="p">(</span><span class="n">RColorBrewer</span><span class="p">)</span>
<span class="n">stNames</span> <span class="o"><-</span> <span class="n">read</span><span class="p">.</span><span class="n">csv</span><span class="p">(</span><span class="s">'stateNames.csv'</span><span class="p">,</span> <span class="n">header</span> <span class="o">=</span> <span class="n">FALSE</span><span class="p">,</span> <span class="n">stringsAsFactors</span> <span class="o">=</span> <span class="n">FALSE</span><span class="p">)</span>
<span class="n">names</span><span class="p">(</span><span class="n">y</span><span class="p">)[</span><span class="mi">2</span><span class="p">:</span><span class="mi">52</span><span class="p">]</span> <span class="o"><-</span> <span class="n">stNames</span><span class="p">[,</span><span class="mi">2</span><span class="p">]</span>
<span class="n">cols</span> <span class="o"><-</span> <span class="n">colorRampPalette</span><span class="p">(</span><span class="n">brewer</span><span class="p">.</span><span class="n">pal</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="s">'Reds'</span><span class="p">))(</span><span class="mi">100</span><span class="p">)</span>
<span class="n">bks</span> <span class="o"><-</span> <span class="n">seq</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="p">(</span><span class="n">y</span><span class="p">[,</span> <span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">na</span><span class="p">.</span><span class="n">rm</span> <span class="o">=</span> <span class="n">TRUE</span><span class="p">),</span> <span class="n">length</span><span class="p">.</span><span class="n">out</span> <span class="o">=</span> <span class="mi">101</span><span class="p">)</span>
<span class="n">par</span><span class="p">(</span><span class="n">cex</span><span class="p">.</span><span class="n">main</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="n">heatmap</span><span class="p">.</span><span class="mi">2</span><span class="p">(</span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">y</span><span class="p">[,</span><span class="mi">2</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">y</span><span class="p">)])),</span> <span class="n">Rowv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span> <span class="n">Colv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span>
<span class="n">dendrogram</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">trace</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">FALSE</span><span class="p">,</span>
<span class="n">labCol</span><span class="o">=</span><span class="n">row</span><span class="p">.</span><span class="n">labels</span><span class="p">,</span> <span class="n">cexCol</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">lhei</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.15</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">lwid</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">margins</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">12</span><span class="p">),</span>
<span class="n">breaks</span><span class="o">=</span><span class="n">bks</span><span class="p">,</span> <span class="n">colsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">72</span><span class="p">,</span> <span class="n">srtCol</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">rowsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">57</span><span class="p">,</span> <span class="n">sepcolor</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span> <span class="n">col</span><span class="o">=</span><span class="n">cols</span><span class="p">,</span>
<span class="n">add</span><span class="p">.</span><span class="n">expr</span><span class="o">=</span><span class="n">lines</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">32</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1000</span><span class="p">),</span><span class="n">lwd</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span>
<span class="n">main</span><span class="o">=</span><span class="s">'Measles cases in US states 1930-2001</span><span class="se">\n</span><span class="s">Vaccine introduced 1961'</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">color</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.8</span><span class="p">))</span></code></pre></figure>
<figure>
<img src="/images/nas-1.png" />
<figcaption> </figcaption>
</figure>
<p>As suggested by <a href="https://twitter.com/BulbousSquidge/status/567318406857515008">@bulboussquidge</a> I’ll try just clipping the few high values to a something+ category.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">hist</span><span class="p">(</span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">y</span><span class="p">[,</span><span class="mi">1</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">y</span><span class="p">)]))</span></code></pre></figure>
<figure>
<img src="/images/clipped-1.png" />
<figcaption> </figcaption>
</figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">y2</span> <span class="o"><-</span> <span class="n">y</span><span class="p">[,</span> <span class="mi">2</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">y</span><span class="p">)]</span>
<span class="nb">sum</span><span class="p">(</span><span class="n">y2</span><span class="p">[</span><span class="err">!</span><span class="ow">is</span><span class="p">.</span><span class="n">na</span><span class="p">(</span><span class="n">y2</span><span class="p">)]</span> <span class="o">></span> <span class="mi">2500</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">y2</span><span class="p">[</span><span class="n">y2</span> <span class="o">></span> <span class="mi">2500</span><span class="p">]</span> <span class="o"><-</span> <span class="mi">2500</span>
<span class="n">cols</span> <span class="o"><-</span> <span class="n">colorRampPalette</span><span class="p">(</span><span class="n">brewer</span><span class="p">.</span><span class="n">pal</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="s">'Reds'</span><span class="p">))(</span><span class="mi">100</span><span class="p">)</span>
<span class="n">bks</span> <span class="o"><-</span> <span class="n">seq</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="p">(</span><span class="n">y2</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">rm</span> <span class="o">=</span> <span class="n">TRUE</span><span class="p">),</span> <span class="n">length</span><span class="p">.</span><span class="n">out</span> <span class="o">=</span> <span class="mi">101</span><span class="p">)</span>
<span class="n">par</span><span class="p">(</span><span class="n">cex</span><span class="p">.</span><span class="n">main</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="n">heatmap</span><span class="p">.</span><span class="mi">2</span><span class="p">(</span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">y2</span><span class="p">)),</span> <span class="n">Rowv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span> <span class="n">Colv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span>
<span class="n">dendrogram</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">trace</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">FALSE</span><span class="p">,</span>
<span class="n">labCol</span><span class="o">=</span><span class="n">row</span><span class="p">.</span><span class="n">labels</span><span class="p">,</span> <span class="n">cexCol</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">lhei</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.15</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">lwid</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">margins</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">12</span><span class="p">),</span>
<span class="n">breaks</span><span class="o">=</span><span class="n">bks</span><span class="p">,</span> <span class="n">colsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">72</span><span class="p">,</span> <span class="n">srtCol</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">rowsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">57</span><span class="p">,</span> <span class="n">sepcolor</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span> <span class="n">col</span><span class="o">=</span><span class="n">cols</span><span class="p">,</span>
<span class="n">add</span><span class="p">.</span><span class="n">expr</span><span class="o">=</span><span class="n">lines</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">32</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1000</span><span class="p">),</span><span class="n">lwd</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span>
<span class="n">main</span><span class="o">=</span><span class="s">'Measles cases in US states 1930-2001</span><span class="se">\n</span><span class="s">Vaccine introduced 1961'</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">color</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.8</span><span class="p">))</span></code></pre></figure>
<figure>
<img src="/images/clipped-2.png" />
<figcaption> </figcaption>
</figure>
<p>Only 3 data points are affected. I’m torn here.</p>
<p>Now I want to try organising the data by measles burden. The areas with lots of measles are the important bit, so I think that makes sense. I think I’ll just do mean (with NAs removed) and order by size. Certainly a useful thing could be to order by number of cases rather than incidence. But I don’t want to go get the other dataset.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">means</span> <span class="o"><-</span> <span class="nb">apply</span><span class="p">(</span><span class="n">y2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">rm</span> <span class="o">=</span> <span class="n">TRUE</span><span class="p">))</span>
<span class="n">y3</span> <span class="o"><-</span> <span class="n">y2</span><span class="p">[,</span> <span class="n">rev</span><span class="p">(</span><span class="n">order</span><span class="p">(</span><span class="n">means</span><span class="p">))]</span>
<span class="n">par</span><span class="p">(</span><span class="n">cex</span><span class="p">.</span><span class="n">main</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="n">heatmap</span><span class="p">.</span><span class="mi">2</span><span class="p">(</span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">y3</span><span class="p">)),</span> <span class="n">Rowv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span> <span class="n">Colv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span>
<span class="n">dendrogram</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">trace</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">FALSE</span><span class="p">,</span>
<span class="n">labCol</span><span class="o">=</span><span class="n">row</span><span class="p">.</span><span class="n">labels</span><span class="p">,</span> <span class="n">cexCol</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">lhei</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.15</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">lwid</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">margins</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">12</span><span class="p">),</span>
<span class="n">breaks</span><span class="o">=</span><span class="n">bks</span><span class="p">,</span> <span class="n">colsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">72</span><span class="p">,</span> <span class="n">srtCol</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">rowsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">57</span><span class="p">,</span> <span class="n">sepcolor</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span> <span class="n">col</span><span class="o">=</span><span class="n">cols</span><span class="p">,</span>
<span class="n">add</span><span class="p">.</span><span class="n">expr</span><span class="o">=</span><span class="n">lines</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">32</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1000</span><span class="p">),</span><span class="n">lwd</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span>
<span class="n">main</span><span class="o">=</span><span class="s">'Measles cases in US states 1930-2001</span><span class="se">\n</span><span class="s">Vaccine introduced 1961'</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">color</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.85</span><span class="p">))</span></code></pre></figure>
<figure>
<img src="/images/ordered-1.png" />
<figcaption> </figcaption>
</figure>
<p>I think the reordering is an improvement. It’s interesting at least.</p>
<p>Finally, I just want to tweak a few things. This turns out to be a complete pain. I had a go at hacking <code class="language-plaintext highlighter-rouge">heatmap.2()</code>. The new function is saved in customHeatmap.R.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">source</span><span class="p">(</span><span class="s">'customHeatmap.R'</span><span class="p">)</span>
<span class="n">customHeatmap</span><span class="p">(</span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">y3</span><span class="p">)),</span> <span class="n">Rowv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span> <span class="n">Colv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span> <span class="n">lmat</span> <span class="o">=</span> <span class="n">rbind</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">3</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">4</span><span class="p">)),</span>
<span class="n">dendrogram</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">trace</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">TRUE</span><span class="p">,</span>
<span class="n">labCol</span><span class="o">=</span><span class="n">row</span><span class="p">.</span><span class="n">labels</span><span class="p">,</span> <span class="n">lhei</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.15</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mf">0.25</span><span class="p">),</span> <span class="n">lwid</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">margins</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">6</span><span class="p">),</span>
<span class="n">breaks</span><span class="o">=</span><span class="n">bks</span><span class="p">,</span> <span class="n">colsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">72</span><span class="p">,</span> <span class="n">rowsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">57</span><span class="p">,</span> <span class="n">sepcolor</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span> <span class="n">col</span><span class="o">=</span><span class="n">cols</span><span class="p">,</span>
<span class="n">add</span><span class="p">.</span><span class="n">expr</span><span class="o">=</span><span class="n">lines</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">32</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1000</span><span class="p">),</span><span class="n">lwd</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span>
<span class="n">main</span><span class="o">=</span><span class="s">'Measles incidence in US states'</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">color</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.8</span><span class="p">),</span>
<span class="n">density</span><span class="p">.</span><span class="n">info</span> <span class="o">=</span> <span class="s">'none'</span><span class="p">,</span> <span class="n">RowLabColors</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.4</span><span class="p">),</span> <span class="n">cexCol</span> <span class="o">=</span> <span class="mf">1.3</span><span class="p">,</span> <span class="n">key</span><span class="p">.</span><span class="n">title</span> <span class="o">=</span> <span class="s">''</span><span class="p">,</span>
<span class="n">cexRow</span> <span class="o">=</span> <span class="mf">0.65</span><span class="p">,</span> <span class="n">ColLabColors</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.4</span><span class="p">),</span> <span class="n">key</span><span class="p">.</span><span class="n">xlab</span> <span class="o">=</span> <span class="s">'Cases per 100,000'</span><span class="p">,</span> <span class="n">titleColor</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.4</span><span class="p">),</span> <span class="n">key</span><span class="p">.</span><span class="n">par</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">col</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.6</span><span class="p">),</span> <span class="n">lwd</span> <span class="o">=</span> <span class="mf">0.1</span> <span class="p">)</span>
<span class="p">)</span></code></pre></figure>
<figure>
<img src="/images/final-1.png" />
<figcaption> </figcaption>
</figure>
<p>At this point, I’m bored of hacking. I’m just going to make the last few changes in inkscape.</p>
<p>Which gives me:</p>
<figure>
<img src="/images/measlesTimeseries.png" />
<figcaption> </figcaption>
</figure>
<p>Still not perfect. But I’m bored now.</p>
Tim CD LucasWorking through visualising the effects of the measles vaccine.Lmvsanova2015-01-18T00:00:00-08:002015-01-18T00:00:00-08:00https://timcdlucas.github.io/lmVSanova<hr />
<p>Power of different linear models.</p>
<p>If you want to test for a change in a response over some variable x, there are a few different ways to do it.</p>
<p>We can collect the data so that it is spread out along the x axis or clumped at either end.
We can analyse the x axis as a continuous variable or as a discrete variable (binning x if it is spread out).
These four options are displayed below, with boxplots implying the data is analysed as discrete x values.
Note that only two sets of data are simulated: one clumped and one spread out.</p>
<figure class="half">
<img src="../../images/examplePlots-1.png" title="plot of chunk examplePlots" alt="plot of chunk examplePlots" style="width: 350px;" />
<img src="../../images/examplePlots-2.png" title="plot of chunk examplePlots" alt="plot of chunk examplePlots" style="width: 350px;" />
</figure>
<figure class="half">
<img src="../../images/examplePlots-3.png" title="plot of chunk examplePlots" alt="plot of chunk examplePlots" style="width: 350px;" />
<img src="../../images/examplePlots-4.png" title="plot of chunk examplePlots" alt="plot of chunk examplePlots" style="width: 350px;" />
</figure>
<p>So to examine the power of these approaches here’s a function that simulates some data (from a linear model with normal error) and then calculates and extracts p-values (sorry) for the four cases shown above.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>calcLM <- function(){
x1 <- runif(30)
y1 <- x1 * 2 + rnorm(30)
x2 <- rep(c(0,1), 15)
y2 <- x2 * 2 + rnorm(30)
coef <- c( summary(lm(y1 ~ x1))$coef[8],
summary(lm(y2 ~ x2))$coef[8],
summary(lm(y1 ~ x1 > 0.5))$coef[8],
summary(lm(y2 ~ as.factor(x2)))$coef[8]
)
}
</code></pre></div></div>
<p>Then let’s run the simulation 1000 times.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>p <- t(replicate(1000, calcLM())) %>% data.frame
colnames(p) <- c('continuousSpread', 'continuousClumped', 'discreteSpread', 'discreteClumped')
pLong <- melt(p, variable.name = 'model', value.name = 'p')
</code></pre></div></div>
<p>And plot the results.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ggplot(pLong, aes(x = model, y = p)) +
scale_y_log10() +
geom_violin() +
ggtitle('p values from different models.')
</code></pre></div></div>
<p><img src="../../images/simPlot-1.png" title="plot of chunk simPlot" alt="plot of chunk simPlot" width="500" /></p>
<p>So, using data from the edges of our range of x values gives us more power (lower p-values).
Also, it’s interesting to note that doing a discrete ANOVA with the x values as a factor is identical to treating this as a continuous linear model.
This will not be true if you have more than two groups though.
Furthermore, if you actually want to use this as a linear model, you will have to do an extra step to scale the coefficients if you do an ANOVA rather than a linear model.</p>
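<p>A quick standalone check of the two-group case (not part of the simulation above):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># With only two x values, coding x as numeric or as a factor gives an
# identical fit: the same coefficient for the group difference and the same p-value.
set.seed(1)
x <- rep(c(0, 1), 15)
y <- 2 * x + rnorm(30)
summary(lm(y ~ x))$coef
summary(lm(y ~ as.factor(x)))$coef
</code></pre></div></div>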
<p>So… that was kinda fun. And another chance to get to know ggplot2 better. Some code is suppressed here but you can see the full knitr document <a href="https://github.com/timcdlucas/statsforbios/blob/master/lmVSanova.Rmd">here</a>.</p>Tim CD LucasPower of different linear models.Zoon2014-12-04T00:00:00-08:002014-12-04T00:00:00-08:00https://timcdlucas.github.io/zoon<hr />
<p>Reproducible science and ZOÖN Internship. <br /><br /><img src="/images/elife02851f003.jpg" /></p>
<p>Reproducibility in science (without getting into <a href="http://cogprints.org/7691/7/ICMLws09.pdf">semantics</a>) is the ability of other scientists to reproduce your results. The first step of that is being able to check what you have done. Did you make a mistake with your algebra? Does running the same experiment give wildly different results? As the use of computational methods in ecology increases, we are in a position where we should be able to quickly and easily reproduce the research in an entire paper. First I rerun your code, and check that the outputs match those in your paper (should be easy). Then I check the code for errors (less easy).</p>
<p>However, even the first step is often hampered. Code is not included in a paper, or is hidden in an unsuitable format in the supplementary material, which is hosted neither carefully nor with longevity in mind. When code is included, the data needed to run an analysis is often not. Other times, a script is included, but is a mess with different bits of analysis and output all jumbled together.</p>
<p>Species distribution modelling (SDM) uses data on where a species lives to predict the whole distribution of the species. In short, a species is likely to exist in areas with environmental conditions similar to those we have seen it in before. So, as long as your data is shared, I should be able to reproduce your results with minimal effort. However, even field-defining papers are completely unreplicable. For example, <a href="http://onlinelibrary.wiley.com/doi/10.1111/j.2006.0906-7590.04596.x/abstract">Elith et al. (2006)</a> benchmarks how good a number of different models are and has been cited some 3,000 times, yet the paper itself is totally unreplicable. It would be great to add more recently developed methods to this benchmark. If a new method can’t outperform the current ones, then it is not very useful. But with the previous benchmark being unreplicable, this is not possible.</p>
<figure>
<img src="/images/elife02851f003.jpg" />
<figcaption> An example use of SDM, mapping the climatic niche of leishmaniases. Pigott et al. 2014. DOI: http://dx.doi.org/10.7554/eLife.02851.007 </figcaption>
</figure>
<h2 id="the-internship">The Internship</h2>
<p>Over the past months I have been working on an internship creating an R package for reproducible SDMs. The package is called ZOÖN and can be found on <a href="https://github.com/zoonproject/zoon">github</a> with more information <a href="https://zoonproject.wordpress.com/">here</a>. The ideas behind ZOÖN have been developed over the last year, with consultation of SDM users at every step (i.e. before I started). It is hoped that this constant discussion will avoid pitfalls of writing software that is then never used. It was decided that while there are great SDM packages out there (<a href="http://cran.r-project.org/web/packages/biomod2/index.html">biomod2</a>, <a href="http://www.cs.princeton.edu/~schapire/maxent/">maxent</a> etc.) there was still a gap for a higher level package, that aids the running, sharing and reproducing of whole SDM analyses, including data collection, data cleaning and outputs. However, as this is a fast moving field, an inflexible package, written and maintained by a small group of developers, would quickly become out of date. So instead the plan is to use web-hosted ‘modules’ that are quick and easy to program (compared to a full R package). ZOÖN will pull these modules from the web and run an analysis. This also means wrappers for other packages can easily be written.</p>
<figure>
<img src="/images/workshop.jpg" />
<figcaption> Presenting the package at a workshop </figcaption>
</figure>
<p>The goal of the internship was to write a working prototype <a href="http://cran.r-project.org/">R</a> package, and I think I have succeeded. The package is on <a href="https://github.com/zoonproject/zoon">github</a> and can be installed in R with <code class="language-plaintext highlighter-rouge">devtools::install_github("zoonproject/zoon")</code>. Although there is some work still to be done, the core package works. Whole SDM workflows can be run with one command (sketched after the next paragraph). The output then contains all the data needed to run the analysis and a record of the call (the text command that was entered) used to run it. As the modules are all online, an analysis can be rerun simply by having access to this output (one R object). In the case of analyses using online data (from GBIF, for example), only the call is needed to rerun an analysis. Furthermore, while still in early development, there is already a very simple way to upload an analysis to <a href="www.figshare.com">Figshare</a>.</p>
<p>To run an SDM with the package you must specify at least five ‘modules’: one each that collects occurrence data, collects environmental data, processes the data, runs a model, and gives some output. Fleshing out the variety of possible analyses will require much more work writing modules (there are plans for a hackathon to get this going). However, as wrapping existing packages is easy, there are already modules for running all the models available in Biomod2, collecting data from GBIF, worldclim and NCEP, creating basic maps, and uploading analyses to Figshare, to name a few. A sketch of such a call is given below.</p>
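<p>As a concrete sketch, this is roughly what installing the package and running one whole workflow looks like. The module names are the package’s example modules as I understand them at the time of writing; treat them, and the comment about rerunning, as assumptions rather than a definitive reference to the current API.</p>
<pre><code class="language-r">
# Install the development version from GitHub and run one whole SDM
# workflow with a single command.
# install.packages("devtools")
devtools::install_github("zoonproject/zoon")
library(zoon)

work1 <- workflow(occurrence = UKAnophelesPlumbeus,  # occurrence data module
                  covariate  = UKAir,                # environmental data module
                  process    = OneHundredBackground, # add background points
                  model      = LogisticRegression,   # fit a simple model
                  output     = PrintMap)             # map the predictions

# The returned object records the data and the original call, so the whole
# analysis can be rerun or shared as a single R object.
str(work1, max.level = 1)
</code></pre>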
<figure>
<img src="/images/EresussandaliatusTwoModels.png" />
<figcaption> Two distributions of Eresus sandaliatus created using ZOÖN </figcaption>
</figure>
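<p>Comparisons like the two maps above are where swapping modules should pay off. The sketch below is hypothetical: every module name is a placeholder standing in for the kinds of wrappers mentioned earlier (GBIF and worldclim data modules, Biomod2 model wrappers, map and Figshare outputs), and the list/Chain syntax is my understanding of how multiple modules are combined, so check the ZOÖN documentation for the real names and interface.</p>
<pre><code class="language-r">
# Hypothetical workflow comparing two models on the same data. Because the
# module names are placeholders, the call is left commented out.
library(zoon)

# work2 <- workflow(
#   occurrence = GBIFOccurrenceModule,       # fetch records from GBIF
#   covariate  = WorldclimCovariateModule,   # download worldclim layers
#   process    = BackgroundPointsModule,     # add background points
#   model      = list(LogisticRegression,    # compare a simple model...
#                     Biomod2WrapperModule), # ...with a Biomod2 one
#   output     = Chain(DrawMapModule,        # map each prediction...
#                      FigshareUploadModule) # ...and upload the analysis
# )
</code></pre>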
<p>So in three months I think I have laid the groundwork for a package that really simplifies sharing analyses, while making it easy for new methods to be incorporated into current analyses.</p>
<h2 id="open-science">Open science</h2>
<p>This project has been conducted in a very open manner which I have really enjoyed. The code can be found on <a href="https://github.com/zoonproject/zoon">github</a> as soon as it is written. And the code is licensed to make it useable by anyone. Most of us <a href="www.twitter.com/gregmci">are</a> <a href="www.twitter.com/_nickgolding_">on</a> <a href="www.twitter.com/timcdlucas">twitter</a> and happy to discuss the research. And as discussed above, regular contact with a <a href="https://zoonproject.wordpress.com/user-panel/">user panel</a> means we are not locked in our dark computer lab, working in isolation.</p>
<h2 id="lessons">Lessons</h2>
<p>Through this internship I have learned an awful lot about the nuts and bolts of R. Writing a package is a really good way to get to know the language better. I can totally recommend <a href="http://r-pkgs.had.co.nz/description.html">R packages</a> and <a href="http://adv-r.had.co.nz/#r-pkgs">Advanced R</a> for more information.</p>
<p>I have also become much more comfortable with handling a large(ish) software project. Git is now second nature, using Github to record issues has become invaluable, and the benefits of unit testing have become clearer.</p>
<p>On a less tangible front, it has been really interesting to see how differently people approach the community side of software development. Without users, your software is worthless. This project relies on its community for more than just a user base: we are hoping for users to contribute code in the form of modules, so community development has been important from the beginning. I really liked working while talking to potential end users (although a workshop six weeks into a project is terrifying). I don’t think this approach is easy, but I definitely think it’s worth putting effort into building a community around your software.</p>
<p>And now it just remains to see how the project develops and whether the software becomes commonly used.</p>