<hr />
<h1 id="what-size-predictive-intervals-should-we-use">What size predictive intervals should we use?</h1>
<p>In some of <a href="https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssc.12484">my</a> <a href="https://www.sciencedirect.com/science/article/pii/S1877584520300356">papers</a> I’ve used 80% predictive intervals instead of the standard 95% predictive intervals.
I didn’t say why in the papers and I was thinking through it again the other day so I thought I’d write it down.
The focus here is on prediction intervals (the range of predicted values supported by the model) rather than confidence/credible intervals (the range of parameter values supported by the model).
Some of the arguments might transfer, but probably not all of them.</p>
<p>I think most people know that 95% is an arbitrary number plucked out of thin air by an unsavoury racist one hundred years ago.
So there’s no rule that says we must use 95%, but how then do we decide which value to use?
I’m fairly convinced that not all prediction intervals are created equal.
A 40% prediction interval (assuming it’s not presented alongside other intervals) is an odd metric that represents a high density interval that the true value probably <em>isn’t</em> in (<50% chance).</p>
<p>Given that, here are some things to consider when chosing an interval.
I’ll use the binary decision of 80% versus 95% just to illustrate things.
I will also ignore the fact that, of course, we often want to use the full distribution rather than summarise it as a single interval; there are many cases where communicating a full distribution is not feasible.
If you can use 2 or 3 intervals, or density plots of the full distribution, do that; if you must summarise a distribution as a single interval, perhaps consider these points.</p>
<h2 id="how-do-policy-makers-interpret-intervals">How do policy makers interpret intervals?</h2>
<p>This point is probably the most subjective, but possibly the most important.
Despite our best efforts, I think most people, even trained scientists, think of a 95% prediction interval as “the true value is almost certainly in this interval”.
Obviously, this isn’t true.
1 in 20 95% prediction intervals will not cover the true value.
Furthermore, we are talking about prediction intervals, and we will almost certainly be making hundreds of predictions.
Therefore, we must expect many of our prediction intervals not to cover their respective true values.</p>
<p>I think perhaps this problem doesn’t exist for an 80% interval.
I don’t think people look at an 80% prediction interval and think “the true value is almost certainly in this interval”.
It’s more like a “best guess” interval.
Perhaps this is more useful.
But to reiterate, this is a very subjective point and I have no hard evidence for it.
The counter argument is that perhaps people can’t switch between intervals easily, and that therefore we should somehow settle on a single interval; given its history, 95% would be the clear winner in this case.</p>
<h2 id="wide-and-confident-or-thin-and-unsure">Wide and confident or thin and unsure?</h2>
<p>This point is similar to that above but less subjective.
It is a question of what is really useful in a prediction interval.
Do we want a very wide interval that we’re 95% sure the true value is within, or do we want a smaller interval that we are 80% sure contains the true value?</p>
<p>Prediction intervals can quickly become massive and controlling this size can sometimes usefully guide our decisions.
For example, imagine you are diagnosed with a terminal disease and told that your 95% predictive survival interval is between 1 and 15 years (I’m thinking for myself as a 30 something. I guess the details will change depending on your age).
What can you actually do with this information?
On the lower end you have enough time to sort your affairs, visit some family and take a great last trip.
On the upper end you have enough time to do anything really; start a new career, see your kids grow up, die from a variety of other causes.</p>
<p>Imagine instead you were given an 80% interval of 3 to 6 years.
This “probably true” interval is quite useful.
You have some time; you don’t have to prioritise only 3 or 4 things to do before you die.
But starting a new career may well be a waste of time.
Of course, all the survival times that were in the 95% interval are still possible, but this “probably true” interval focusses on a reasonable range of very likely values.</p>
<p>Relatedly, prediction intervals typically get wider faster as you increase the probability interval (look at the shape of a normal distribution for intuition).
In a normal distribution, a 95% interval is more than 53% bigger than an 80% interval.
Is the extra 15% confidence that the true value is in the interval worth the 50% increase in width?
I’d say often it isn’t.</p>
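<p>As a quick check of that width claim, here is the calculation for a normal distribution (a minimal sketch in R):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Width of central prediction intervals for a normal distribution
width_95 <- 2 * qnorm(0.975)  # about 3.92 standard deviations
width_80 <- 2 * qnorm(0.900)  # about 2.56 standard deviations
width_95 / width_80           # about 1.53, i.e. just over 50% wider
</code></pre></div></div>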
<h2 id="confident-and-wrong-or-unsure-and-right">Confident and wrong or unsure and right?</h2>
<p>There are many reasons why we would expect an 80% prediction interval to have better calibration than a 95% prediction interval.
Model misspecification, and approximations or MCMC methods that break down in the tails, are two examples.
I would prefer a well calibrated 80% interval to a poorly calibrated 95% interval.</p>
<p>With respect to this point, we can consider a few related situations.
Did we just choose one interval from the outset and only check the calibration of that one interval?
If so, I think it is reasonable to consider which interval is likely to be better calibrated.
As long as we don’t claim that we have evidence that the model is well calibrated across all intervals, we have tested one aspect of the model and found it to be adequate.
Perhaps instead, we are testing the calibration of the model with respect to both 80% and 95% prediction intervals.
How is it reasonable to behave if we find the model is well calibrated for 80% intervals and badly calibrated for 95% intervals?
Again, I think it is totally fine to recommend that users of the model use the 80% interval.
This is similar to saying “linear regression works well as long as you don’t extrapolate far outside the range of the covariates”.
We are guiding the user as to when the model does and does not work and again this is totally fine.</p>
<h2 id="confident-but-badly-estimated-coverage-or-unsure-but-well-estimated-coverage">Confident but badly estimated coverage or unsure but well estimated coverage?</h2>
<p>My last point is that estimating the calibration of a model is easier when the interval is smaller.
In the same way as having few cases vs controls makes our effective sample size small, having large prediction intervals gives us fewer data points where the prediction intervals do not cover the true value.
For example, if our dataset contains 200 datapoints, a well calibrated model will have around 10 datapoints where the 95% prediction interval does not cover the true value.
In contrast, the 80% prediction interval would give us 40 failures, a much more reasonable sample size.
So if we are working with modest sample sizes, would we prefer a 95% prediction interval, where our estimates of coverage are very noisy, or an 80% prediction interval with much tighter estimates of coverage?
In the above example with n = 200, our 95% confidence interval of our coverage, if we observed exactly 5% of intervals to not cover their true values, would be 2.4% - 9%.
I would consider 2.4% or 9% to imply fairly poor calibration, so in this case we are really unsure whether our model is well calibrated or not.
In contrast, if we observed exactly 20% of intervals to not cover their true values, our confidence intervals for the coverage of the 80% interval would be 14% - 26%.
So we’re pretty sure the coverage for the 80% prediction interval is ok.</p>
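<p>These coverage estimates are just binomial confidence intervals on the number of misses; here is a small sketch in R using exact binomial intervals (one reasonable choice for this calculation):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Uncertainty in estimated coverage with n = 200 held-out points
binom.test(10, 200)$conf.int  # 95% interval missing 5% of the time: roughly 2.4% to 9%
binom.test(40, 200)$conf.int  # 80% interval missing 20% of the time: roughly 15% to 26%
</code></pre></div></div>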
<h2 id="conclusions">Conclusions</h2>
<p>So in conclusion, perhaps we should think about what intervals we use a bit more.
Different considerations will apply in different situations, and almost certainly there are different considerations for prediction intervals and confidence/credible intervals.
As in the intro, we should avoid summarising full distributions as a single interval when we can, but often we can’t.</p>
<hr />
<h1 id="how-people-trick-themselves-into-thinking-they-can-predict-the-stock-market">How people trick themselves into thinking they can predict the stock market</h1>
<p>I keep accumulating ideas of things to make into YouTube videos or long careful tutorials or whatever else.
And then I never do them.
So the new plan is to just do quick blog posts when I think about something.
If one day I come back around and cover the same material in more detail then good!
Otherwise at least it’s out there.</p>
<p>There is a whole genre of YouTube videos showing how to predict the stock market.
Lots of these videos use relatively complex models, namely LSTM neural networks.
Most of these videos use very simple data, namely the price of the same stock at previous times.
These videos typically conclude (and splash loudly on the thumbnail) that they can predict the stock market with appreciable accuracy.</p>
<p>A few simple arguments, that have been well made elsewhere, indicates that they are probably wrong.
If they can predict the stock market, they will be off making millions, not making YouTube videos.
If they can easily predict the stock market, then so can everyone else.
If everyone can predict the stock market, any predictable signal from the data will become priced in and the whole process will quickly fall apart.</p>
<p>So the question is, why does these people’s analyses say they can predict the stock market, when we can be pretty sure that they, in fact, can’t.
Like many things, the ‘why’ is more interesting than the fact itself.</p>
<p>I think there’s at least four reasons, some of which are quite subtle.
These reasons are interesting both directly for people interested in the stock market, but also interesting for anyone interested in forecasting or predictive modeling more generally.
For the benefit of readers with short attention spans, I’ll start with the most interesting, subtle reason, that I haven’t seen discussed at length before.
I’ll then circle back to the more obvious answers.</p>
<h2 id="data-leakage-from-stock-choice">Data leakage from stock choice</h2>
<p>If I asked you to tell me everything you know about Tesla (especially if you are interested enough in the stock market to be reading this post) you might answer something like “electric cars, Elon Musk, massive stock price growth”.
Notably, most of the videos trying to predict the stock price use one of Tesla, Apple or an index fund like SPY.
With all of these stocks, we have (probably unconsciously) used our knowledge of the present to select them.
These are stocks that have, on average, gone up.</p>
<p>This simple fact means the prices of these stocks suddenly are predictable to an extent.
A model that predicts a small, positive change in the stock price will do better than random.
However, stocks go up, until they stop going up.
If you took a random stock, or many random stocks, and tried to predict the price in the future, you wouldn’t have this small guaranteed predictive ability.
Similarly, if you took the model that predicts a small positive change in the stock price and applied it, long term, to Apple or Tesla, your predictive accuracy would depend entirely on whether these stocks keep going up or not.
Eventually they’ll come down; all companies eventually go bust.</p>
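<p>A toy simulation makes the selection effect concrete (this is simulated data, not real stocks, and the numbers are purely illustrative):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simulate many random-walk stocks, then keep only those that ended up higher,
# as we implicitly do when we pick Tesla or Apple with hindsight.
set.seed(2)
n_stocks <- 1000
n_days <- 500
returns <- matrix(rnorm(n_stocks * n_days, 0, 0.02), nrow = n_days)
went_up <- colSums(returns) > 0

# A model that always predicts "up" is right about 50% of the time overall...
mean(returns > 0)
# ...but slightly more often on the hindsight-selected stocks.
mean(returns[, went_up] > 0)
</code></pre></div></div>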
<p>This point is quite subtle and I haven’t seen people make it before.
But it applies beyond the stock market.
For example, if you predict something about species population size or epidemic size, but only use species or diseases that haven’t gone extinct, your models will perform better than they would in the real world.</p>
<h2 id="predicting-value-not-change">Predicting value, not change</h2>
<p>I think the actual biggest reason that people on youtube trick themselves into thinking they can predict the stock market is that they often try to predict stock price rather than the change in the stock price.
Stock prices change through time, but generally not hugely.
Other ways of saying this are that the stock price today depends a lot on the stock price yesterday, or that the stock price is an autoregressive process.</p>
<p>So for a stock that has changed price a lot over a long period, such as Tesla that is now worth much more than it was 10 years ago, a model that predicts tomorrows price as being the same as todays price, will have very high apparent “predictive ability”.
When the price went from <span>$</span>1 to <span>$</span>1.1, you predict <span>$</span>1.
When the price went from <span>$</span>100 to <span>$</span>110, you predict <span>$</span>100.
The correlation between your predictions (<span>$</span>1 and <span>$</span>100) and the truth (<span>$</span>1.1 and <span>$</span>110) is high.</p>
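<p>To see this numerically, here is a minimal sketch with a simulated random-walk price series (not real data): the “tomorrow equals today” model has near-perfect correlation on the price, but the change is completely unpredictable.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set.seed(1)
ret <- rnorm(1000, 0, 0.02)            # daily log returns of a pure random walk
price <- 100 * exp(cumsum(ret))        # the resulting price series

# Predict tomorrow's price as today's price: very high correlation...
cor(head(price, -1), tail(price, -1))  # very close to 1

# ...but the change, which is what you actually profit from, is unpredictable.
cor(head(ret, -1), tail(ret, -1))      # close to 0
</code></pre></div></div>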
<p>However, the problem with this is that any benefit in predicting the market gained by this property, is exactly cancelled out by the fact that to make money off a stock today, you have to have bought it yesterday.
The value of a stock doesn’t have any bearing on your profit, only the change in the value of the stock.</p>
<h2 id="using-graphs-as-metric">Using graphs as a metric</h2>
<p>A related problem is that many videos fit models, make predictions of the stock price then plot the predictions against the truth in a typical time-series line plot.
The lines on these plots often follow each other and look quite convincing.
However, this approach fails in the same way as the issue of predicting value, not change.
A model that predicts that the stock price tomorrow is the same as the stock price today will look pretty good on these plots.
But it is unfortunately utterly useless in terms of making money on the stock market.</p>
<h2 id="using-data-from-the-future">Using data from the future</h2>
<p>The final, least subtle point, is that it’s easy to accidentally use data from the future.
I think most of the youtube videos are actually quite careful on this point, but I thought I’d include it for completeness.</p>
<p>It is off course obvious that if you use tomorrows stock price as a predictor in your model, you will be able to predict tomorrow’s stock price!
However, there is a risk of accidentally using more subtle information from the future.
If you are using other stocks as predictors, you need to make sure you are using today’s stock price, not tomorrow’s.
Imagine you are predicting the change in price of Pepsi stocks, but using changes in Coca-cola stock prices as a predictor.
If the government announces a sugar tax, both stock prices will fall (I guess).
If you accidentally use tomorrow’s change in the Coca-cola stock price, you are accidentally telling your model that there will be a sugar tax announced tomorrow, but this is information you would not have in a real model.
Relatedly, in the process of calculating compound variables such as moving averages, open-close or high-low, you can accidentally use future information.
A model that buys when the price hits the weekly high is using information from the future as you don’t know what the high is until the end of the week, at which point you’ve missed the high you were hoping to buy at.</p>
<p>Conceptually this is mostly quite simple.
The problem is that it is easy to mess it up in your code if you are not careful.</p>
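<p>Here is a hypothetical sketch of the kind of care needed; the data frame and column names are made up for illustration:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)

# Made-up daily data, purely for illustration.
stocks <- data.frame(
  date         = as.Date("2022-01-01") + 0:4,
  pepsi_change = c( 0.1, -0.2, 0.0,  0.3, -0.1),
  coke_change  = c( 0.2, -0.1, 0.1, -0.2,  0.0)
)

aligned <- stocks %>%
  arrange(date) %>%
  mutate(
    pepsi_change_tomorrow = lead(pepsi_change),  # the target: tomorrow's change
    coke_change_today     = coke_change          # a predictor you know today
  )
# A leaky model would instead use lead(coke_change) as a predictor,
# i.e. tomorrow's Coca-cola move, which you cannot know today.
</code></pre></div></div>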
<h2 id="final-thoughts">Final thoughts</h2>
<p>So overall, it’s actually essentially impossible to predict the stock market, in a useful way, using stock prices, desktops and tens of minutes of effort.
You are always going to be slower than the high-frequency trading algorithms, and anything simple will have already been done.
It’s still fun to try, but it’s mostly a humbling experience that tests your ability to implement everything carefully.
It’s also a fun exercise in knowing when to give up, when to conclude that the variable of interest is impossible to predict given the available data.
This is a topic I’d like to write more about.</p>
<p>Interestingly, if I remember correectly, predicting the change in volume (the number of shares bought and sold) is possible.
Unfortunately, it’s not easy to know how to make money with that information.
But it’s still perhaps a fun game to play.</p>
<hr />
<h1 id="why-deep-learning-will-probably-never-improve-malaria-mapping">Why deep learning will probably never improve malaria mapping…</h1>
<p>… or SDMs or prognostic models or …</p>
<p>Every couple of months I start thinking about deep learning again.
A few days and a few headaches later I conclude that it is not useful for most of the work I do.
So I thought I’d write down my thought process this time to try and save my poor aching brain from another period of thought.</p>
<p>I’ve tried to be careful with the title wording.</p>
<p>Why - I’ll give my reasons, it’s not just a hunch.</p>
<p>deep learning - Deep neural nets but also other methods. Anything where the learning involves a transformation of a transformation of a transformation of a … of the data. But importantly not including other machine learning methods which I use all the time because they often improve predictive accuracy in my work.</p>
<p>probably - an uncertain forecast</p>
<p>never - even if data size and compute increases 100x and even after 50 postdoc years of effort</p>
<p>improve - predictive accuracy. Other elements of statistical modelling are not the topic of this post.</p>
<p>malaria mapping or SDMs or prognostic models or… - I fully acknowledge the successes of deep learning. But I don’t think they’ll help in my work.</p>
<p>So the first thing to note is that there are two ways that depth in neural networks and other methods is commonly used.
The first is the case of having multiple dense hidden layers (i.e. all nodes in each layer are connected to all nodes in the next layer).
As far as I understand this type of architecture is not that important to the success of deep learning.
I don’t quite understand the benefits of this architecture compared to a shallow but very wide neural network (one hidden layer with a lot of nodes) but they are commonly used so they must be useful.
However, the important thing here is that ultimately, the only thing that this architecture provides is increased flexibility, or nonlinearity, in the model.
However, something like a RandomForest or boosted regression trees can have unlimited nonlinearity.
So this architecture isn’t providing anything particularly unusual.
Furthermore I’ve had a pretty good go at using deep, dense neural networks to map malaria with very little success.
Tree based methods are very efficient with the data provided.
Due to their greedy estimation, they put all their focus on areas of parameter space that determine the output.
Neural networks are much less good at this.
So overall, I’m fairly confident that dense multilayer neural networks will never give much better predictions than tree based methods.
As we get more data, these architectures may do about as well as tree based methods, but not significantly better.</p>
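<p>For concreteness, here is a minimal sketch (assuming the keras R package is installed, with made-up layer sizes) of the two shapes being compared: a deep, dense network versus a shallow-but-wide one. Both only buy you flexibility.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(keras)

# Deep and dense: several fully connected hidden layers.
deep_net <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = 20) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1)

# Shallow but wide: one big hidden layer.
wide_net <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = 20) %>%
  layer_dense(units = 1)
</code></pre></div></div>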
<p>The architectures that have really driven the success of deep learning as we know it are convolutional neural networks.
These are the neural networks used for image and video analysis including image classification, image segmentation, self driving cars etc.
These methods generally exploit the spatial (and/or temporal) structure of the images or videos used to train them.
And specifically they make good predictions by learning good ways to represent the data.
At the top of the network, there will be nodes that learn that a line of low values next to a line of high values is an “edge”.
In the middle layers, there will be nodes that learn that two horizontal edges and two vertical edges makes a rectangle.
And at the bottom of the network will be nodes that learn that a rectangle and some circles is maybe a lorry.
This type of data, and these types of representations of the data just don’t exist in most of the subjects I have worked on.
In malaria mapping, having an “edge” between cold and hot areas tells you very little about the risk of malaria*.
In prognostic modelling, there are often no covariates that have any sort of image structure at all.
You might have covariates like age and pre-existing conditions, and these covariates might have interactions.
But this idea of edges and rectangles just isn’t relevant.</p>
<p>Perhaps another way to think about this is that deep learning methods have mostly excelled in situations where humans can (more or less) easily perform the task but representing the problem in a way that the computer can usefully use is difficult.
A three-year-old can identify a cat, but despite years of computer vision experts hand-crafting features such as “circles” and “triangles”, computers still couldn’t identify a cat in an image.
We have to let the model learn how to represent the data.</p>
<p>Most of the problems I work on have the opposite situation.
We can easily represent the data in a totally acceptable way; one column for age, one column for each pre-existing conditions, done.
But even experts in the particular field often couldn’t effectively use these data to make good, quantitative predictions.
How much malaria will there be in an area with a mean temperature of 28 degrees and 120 days of rain?
Quite a lot, I guess, but that’s about the best I can do.
So in these problems, the task for the machine learning model is much more about separating signal from noise, and also about finding nonlinear relationships and relatively simple interactions, and using these to make accurate predictions.
As far as I can see, deep learning doesn’t provide anything for this task that tree-based methods can’t already do, and it is just less efficient with the data.</p>
<p>This felt like it would be a much longer post as I was, once again, grappling with what a deep neural network is really doing (this time I was trying to think of good ways to fit convolutional RandomForests).
But oh well. I’ll chuck it up on my webpage and see if it generates some discussion.</p>
<p>One thing I’ll note is that I really know very little about the architectures used in language models like GPT-3.
I’ve read a fair bit about long short-term memory networks and they definitely aren’t relevant to most of the areas I have worked on.
Maybe there’s something here that will be useful though.
Similarly I don’t know about the deep neural networks used in reinforcement learning for robotics.
Maybe I’ll come back and edit this in 6 months when I’ve done more reading.</p>
<p>*The fact that some of the areas are hot (high malaria) and some of the areas are cooler (low malaria) may well tell you that in aggregate there will be intermediate malaria risk in the area. But this simpler fact can be handled much more directly with disaggregation regression, which is precisely what I have been working on for five years.</p>Tim CD LucasDisaggregation_regression_workshop2022-02-12T00:00:00-08:002022-02-12T00:00:00-08:00https://timcdlucas.github.io/disaggregation_regression_workshop<hr />
<p>Disaggregation regression workshop<br /><br /></p>
<h1 id="disaggregation-regression-workshop">Disaggregation regression workshop</h1>
<p>The recording of this workshop is here. <a href="https://www.youtube.com/watch?v=frKbnV5PxH4">https://www.youtube.com/watch?v=frKbnV5PxH4</a></p>
<p>I am now running it on both the 31st of March 2022 and the 7th of April 2022.
Both 3-5pm GMT.
There may be some space left so please email if you are interested.</p>
<p>Do you have areal data (county, LSA, ADMIN2 etc.) but want to make predictions at a higher resolution (5km x 5km raster etc)?
If so disaggregation regression might be for you.</p>
<p>I am running a free 2hr workshop on what disaggregation regression is and how to fit models using the R package disaggregation by Anita Nandi, myself and other contributors. If you would like to attend please just send me an email tim.lucas@leicester.ac.uk. I’ll email round a teams link nearer the time.</p>
<p>If you’d like to read more about disaggregation regression before then you could try:</p>
<p><a href="https://scholar.google.com/citations?view_op=view_citation&hl=en&user=WfpSfMAAAAAJ&cstart=20&pagesize=80&citation_for_view=WfpSfMAAAAAJ:R3hNpaxXUhUC">A simulation study of disaggregation regression for spatial disease mapping. R Arambepola, TCD Lucas, AK Nandi, PW Gething, E Cameron. Statistics in Medicine 41 (1), 1-16</a></p>
<p><a href="https://scholar.google.com/citations?view_op=view_citation&hl=en&user=WfpSfMAAAAAJ&cstart=20&pagesize=80&citation_for_view=WfpSfMAAAAAJ:qUcmZB5y_30C">Disaggregation: an R package for Bayesian spatial disaggregation modelling. AK Nandi, TCD Lucas, R Arambepola, P Gething, DJ Weiss. arXiv preprint arXiv:2001.04847</a></p>
<p><a href="https://github.com/aknandi/disaggregation">The disaggregation R package</a></p>
<hr />
<p>Working while on parental leave<br /><br /></p>
<h1 id="working-while-on-parental-leave">Working while on parental leave</h1>
<!--I published a paper that I wrote on my phone during parental leave.
caveat gender. person that gave birth as person that gave birth. partner as their partner.
caveat overworking
outline. should you work, how can you work?
--->
<p>I recently published my first sole author paper <a href="https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1002/ecm.1422">link</a>.
It is a review of methods for interpreting machine learning models (translucent boxes rather than black boxes).
I thought I’d write about the process because most of the paper was written while I was on shared parental leave.
So I thought I’d discuss whether you should work while on parental leave and give some tips on working while on parental leave (many of which are useful for working as a parent more generally).</p>
<p>I think a lot of the terms used in the British law governing leave is gendered and probably doesn’t match up well for trans people.
I’ll try to explicitly say “the person that gave birth” to refer to anyone who is taking leave after giving birth and just “partner” for anyone who is taking parental leave but didn’t give birth.
I’d also like to be very clear right at the beginning that this post isn’t me saying “you should work while on parental leave! Here’s how to be productive!”
Hopefully that will be clear.</p>
<!--
I'm mostly talking about 6-12 months rather than 1-6. person that gave birth in second half of a year and partners in first 6 months. that said I know nothing about people during full year parental leave.
I found it very useful to have something that wasn't baby related.
connection back to normal life.
exacerbated by the fact that I don't know people in the town I live in.
a sense of progress. baby progress is so slow.
different brain process
--->
<p>So, should you work while on parental leave?
The short answer is that if you don’t want to, or don’t have the energy or the time, then don’t.
My parental leave was from when my son was 7 months until 12 months.
I imagine that most people that give birth would really struggle to work during the first 6 months after birth (though I’m also aware that some countries have terrible parental leave allowances).
The physical recovery and terrible sleep patterns makes that fairly impossible.
Also, partners that take short amounts of leave straight after the birth (2 weeks is common here in the UK) should really be doing everything they can to help and not skipping out to read emails.
However, by the time your baby is 6 months, I think quite a few people might actively want to do some work.</p>
<p>There are plenty of good reasons to want to work while on parental leave.
Parental leave is emotionally gruelling and any way you can find to help yourself during that period is to be recommended.
Having something to think about other than babies can be fantastically useful.
It’s easy to go a week virtually without speaking to adults and thinking about nothing but looking after your baby.
I live in a small town outside of Oxford (where I worked at the time) due to lower house prices, but this means I don’t know anyone within a short walk from my house.
While looking after my first son, I met up with people for a chat on a weekly basis, but for my second son I didn’t meet up with anyone.
Furthermore, babycare is a weird combination of incredibly difficult and all-consuming while also being basically boring (depends on your personality I’m sure).
So having a “project” to think about can be a really useful thing.</p>
<p>Secondly, a small project can give a wonderful sense of achievement and progress.
There’s very few milestones with a baby; every few months they do something new.
But day-to-day, week-to-week, the measure of success is basically “did I manage to keep my child alive today”.
That single, unchanging question is not a good way to give yourself that sense of pride and success.
So again, using a doable, achievable project, with small, regular goals and a sense of progress and achievement can be a fantastic boost for your mental health.</p>
<!--
how to work?
90% has to be on your phone. work while child goes from asleep to deep sleep etc. or while carrying them or bug gying them.
choose the right project. on phone so writing. can do little bits of code planning but that doesn't really yield a completable project.
not tooooo much research. I can't do good research on my phone. maybe others can.
immediate up date between phone and computer.
use markdown.
-->
<p>So then some tips.
Firstly, you need to be able to do 90% of the project on your phone.
While on parental leave I very rarely got an hour free to get my laptop out and start working.
I worked while my son slept on me but refused to let me leave the room and on the approximately 300 walks I took through Bure Park nature reserve to get my son to sleep.</p>
<p>Secondly, you need to choose the right project.
The project needs to be largely achievable on a phone (as above); lengthy coding sessions are difficult, field work not gonna happen.
Therefore I chose to write a review.
It was a review that I felt I could write without hours and hours of research (I find that difficult on my phone but others might find it quite doable).
The project should also be chunkable into very small periods of work that give a sense of achievement.
I counted writing a paragraph in a day a great success.
A paragraph in two days was well above what I expected of myself.
No progress for a week wasn’t rare.
But each drafted paragraph felt like an accomplishment.
Each time I edited a single paragraph I counted it as an achievement.</p>
<p>There are some technical things that make working on your phone easier.
You want the files on your phone to sync directly with your computer.
When you do get half an hour to work on your computer, you don’t want to waste that time working out if your files are up to date or take 5 minutes emailing the file to yourself or whatever.
You want it just there.
I use the Dropbox app which has a text editor in it.
Writing in plain text (probably markdown) is also really useful because you can make the text big.
Word is totally useless on a phone.
Something like markdown is also useful because you can leave comments to yourself so you can quickly get back to what you were doing.
I wrote my plan in comments and then filled in the full paragraphs underneath.
You want the plan and the text in the same document because switching documents is annoying on a phone.
Keeping the plan once you’ve started writing is useful so that you know what a paragraph was supposed to say, even if you didn’t do your best drafting on that paragraph.
When editing I also left a comment to tell myself which paragraphs I had edited.
I still use these ideas for doing work on my phone while traveling or in spare minutes here and there.
I often plan out presentations (beamer) or small documents on my phone.
To a much lesser extent I’ve also planned software, writing the function names and what the inputs and outputs for each function should be.</p>
<p>So, to reiterate, I am not trying to say “you should work while on parental leave and here is how”.
However, if you actively wish to work, as part of your own mental health management, maybe some of these ideas might help.
I am 100% sure that for me, writing this paper during leave was beneficial to my mental health.
I can very easily imagine that it wouldn’t be for many other people.</p>
<hr />
<p>A primer on Bayesian mixed-effects models. <br /><br /><img src="/images/bayes_strong-1.png" /></p>
<h1 id="a-primer-on-bayesian-mixed-effects-models">A primer on Bayesian mixed-effects models</h1>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>knitr::opts_chunk$set(cache = TRUE, fig.width = 8, fig.height = 5)
set.seed(191016)
#install.packages("INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"), dep=TRUE)
library(dplyr)
library(ggplot2)
library(INLA)
library(malariaAtlas)
</code></pre></div></div>
<h1 id="intro">Intro</h1>
<p>This primer is an introduction to mixed-effects models. I’m presenting
it using Bayesian mixed-effects models because they are
easier to understand. The hope is that from here it will be relatively
easy to understand frequentist mixed-effects models. Or at least, have
the intuition of what the models are doing. I still don’t understand the
nuts and bolts of frequentist mixed-effects models.</p>
<p>The aim of the primer is to explain the real fundamentals of what
mixed-effects models are, why you might use them and <em>how</em> they do what they
do. This last bit (the how) is what is missed from many courses because
the how in frequentist mixed-effects models is complicated. However, in
Bayesian mixed-effects models, the how is very simple, and follows on
entirely smoothly from any other Bayesian analysis. In this case, I
think understanding how they work also makes the what and the why easier
to understand.</p>
<p>As an overview, we will look at some data and define some mathematical
models to answer some questions of interest. Then we will fit those same
models in a least squares framework, a normal Bayesian framework and
finally a mixed-effects framework.</p>
<h1 id="download-the-data">Download the data</h1>
<p>We’re going to get data using the malariaAtlas package. The data will be
prevalence surveys from Asia. To keep things simple we are going to
completely ignore the sample size for each survey. Instead we will
simply do a log(x + 0.1) transform (that will approximately normalise
things) and use that as our response.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d <- getPR(continent = 'Asia', species = 'Pf')
## Confirming availability of PR data for: Asia...
## PR points are available for Asia.
## Attempting to download PR point data for Afghanistan, Indonesia, India, Yemen, Cambodia, Bangladesh, Vietnam, Pakistan, Philippines, Nepal, Thailand, China, Tajikistan, Myanmar, Laos, Malaysia, Sri Lanka, Iraq, Saudi Arabia, Turkey, Timor-Leste, Bhutan ...
## Data downloaded for Asia.
names(d)
## [1] "dhs_id" "site_id"
## [3] "site_name" "latitude"
## [5] "longitude" "rural_urban"
## [7] "country" "country_id"
## [9] "continent_id" "month_start"
## [11] "year_start" "month_end"
## [13] "year_end" "lower_age"
## [15] "upper_age" "examined"
## [17] "positive" "pr"
## [19] "species" "method"
## [21] "rdt_type" "pcr_type"
## [23] "malaria_metrics_available" "location_available"
## [25] "permissions_info" "citation1"
## [27] "citation2" "citation3"
dtime <- d %>%
filter(!is.na(examined), !is.na(year_start)) %>%
mutate(log_pr = log(pr + 0.1)) %>%
select(country, year_start, log_pr, pr)
</code></pre></div></div>
<h1 id="we-will-ask-two-broad-questions">We will ask two broad questions.</h1>
<ul>
  <li>What was the malaria prevalence in Asia and in each country in 2000 -
2004 (ignoring any remaining temporal trends).</li>
<li>How did malaria change through time in Asia and in each country.</li>
</ul>
<p>To keep this clear we will make two seperate datasets.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dmean <- dtime %>% filter(year_start > 1999, year_start < 2005)
</code></pre></div></div>
<p>So that we can plot our predictions nicely we should make some
predictive data.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dmean_pred <- data.frame(country = unique(dmean$country))
dtime_pred <- expand.grid(country = unique(dtime$country), year_start = 1985:2018)
</code></pre></div></div>
<h1 id="lets-summarise-and-plot-the-data">Let’s summarise and plot the data.</h1>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dmean$country %>% table
## .
## Afghanistan Bangladesh Bhutan Cambodia China
## 64 0 0 187 25
## India Indonesia Iraq Laos Malaysia
## 76 124 0 20 0
## Myanmar Nepal Pakistan Philippines Saudi Arabia
## 26 0 0 0 0
## Sri Lanka Tajikistan Thailand Timor-Leste Turkey
## 0 2 72 11 8
## Vietnam Yemen
## 67 26
dmean$year %>% table
## .
## 2000 2001 2002 2003 2004
## 79 111 192 209 117
</code></pre></div></div>
<h2 id="question-one-what-was-the-mean-malaria-prevalence-per-country-in-the-period-2000---2004">Question one: what was the mean malaria prevalence per country in the period 2000 - 2004</h2>
<ul>
<li>Note that some countries like Tajikistan and Turkey have very
little data. How do we estimate their mean?</li>
<li>Also note, the data is very unbalanced. How do we estimate the Asia
total without the estimate being dominated by Indonesia?</li>
</ul>
<!-- -->
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dmean <- dmean %>%
group_by(country) %>%
mutate(n = n())
ggplot(dmean, aes(x = country, y = pr, colour = n < 10)) +
geom_boxplot() +
geom_point() +
ggtitle('Malaria prevalence by country in 2000-2004')
</code></pre></div></div>
<p><img src="/images//data_plot1-1.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ggplot(dmean, aes(x = country, y = log_pr, colour = n < 10)) +
geom_boxplot() +
geom_point() +
ggtitle('Log malaria prevalence by country in 2000-2004')
</code></pre></div></div>
<p><img src="/images//data_plot1-2.png" alt="" /></p>
<h2 id="discuss-mathematical-models-and-estimate-with-least-squares">Discuss mathematical models and estimate with least squares</h2>
<p>We can look at the structure of our mathematical models, and the way we
estimate the parameters, completely separately. So we can think of the
structure of a model and then estimate it with least squares (<code class="language-plaintext highlighter-rouge">lm()</code>) as
a simple way to start getting intuition about what things look like.</p>
<p>Starting with the first question, we can start with a model with one
global intercept.</p>
<p><em>y</em> = <em>β</em><sub>0</sub></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>m1 <- lm(log_pr ~ 1, data = dmean)
coefficients(m1)
## (Intercept)
## -1.85726
ggplot(dmean, aes(x = country, y = log_pr)) +
geom_boxplot() +
geom_point() +
ggtitle('Log malaria prevalence and Asia global mean') +
geom_abline(slope = 0, intercept = m1$coef[1])
</code></pre></div></div>
<p><img src="/images//m1-1.png" alt="" /></p>
<p>As our aim is actually to estimate the mean malaria prevalence for each
country, we need country to go in as a categorical variable.</p>
<p><em>y</em> = <em>β</em><sub>0</sub> + <em>β</em>.<em>country</em></p>
<p>It may be helpful to think about this in the explicit way it is encoded.
We have 13 countries. The ideal model would be 1 global mean and 13
country specific parameters.</p>
<p><em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub>.<em>AFG</em> + <em>β</em><sub>2</sub>.<em>KHM</em> + <em>β</em><sub>3</sub>.<em>CHN</em> + …</p>
<p>(I’m using ISO3 codes here; KHM is Cambodia, or Khmer.) Internally, R
converts the one categorical variable into binary variables. Variable 1 is
“is this row in AFG?”, variable 2 is “is this row in KHM?” etc.</p>
<p>So as these variables have a 1 if the row is in a given country and a
zero otherwise, a prediction for Afghanistan will be zeroes for all the
terms except <em>β</em><sub>0</sub> and <em>β</em><sub>1</sub>.</p>
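<p>You can see this encoding directly (a quick sketch; the reference class discussed just below is dropped automatically):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The design matrix R builds internally: an intercept plus one 0/1 column
# per country (minus the reference class).
head(model.matrix(~ country, data = dmean))
</code></pre></div></div>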
<p>Unfortunately we now have to make a quick detour. This parameterisation
is unidentifiable (the data cannot tell us the answer because there are
multiple answers that fit the data equally well). If we think about
the same model with just two countries, how could the model know whether
the intercept is high or both country-level parameters are high? When we
switch to mixed-effects models we will have a global intercept and 13
country specific parameters. But for now we will have a global intercept
and 12 country level parameters. The first country is taken as the
“reference class” and combined with the global intercept. Mostly, we can
think about the models in the same way however.</p>
<p>So now we can estimate this model with least squares</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>m2 <- lm(log_pr ~ country, data = dmean)
coefficients(m2)
## (Intercept) countryCambodia countryChina
## -1.9885722 0.2247165 -0.3045652
## countryIndia countryIndonesia countryLaos
## 0.1992390 0.1799838 0.5971321
## countryMyanmar countryTajikistan countryThailand
## 0.5630916 -0.1027141 -0.1637883
## countryTimor-Leste countryTurkey countryVietnam
## 0.0803858 -0.3140129 0.1394583
## countryYemen
## -0.0461457
pred2 <- data.frame(dmean_pred, pred = predict(m2, newdata = dmean_pred))
ggplot(dmean, aes(x = country, y = log_pr, colour = n < 10)) +
geom_boxplot() +
geom_point() +
geom_point(data = pred2, aes(country, pred), colour = 'black', size = 4) +
ggtitle('Log malaria prevalence and country specific means')
</code></pre></div></div>
<p><img src="/images//m2-1.png" alt="" /></p>
<h2 id="now-switch-to-bayes-and-remind-ourselves-what-priors-are">Now switch to Bayes and remind ourselves what priors are.</h2>
<p>Bayesian mixed modelling is essentially taking the above model
structures and doing clever things with priors. First we’ll do more
standard things with priors to remind ourselves what they mean.</p>
<p>A prior is how we tell the model what is plausible based on our
knowledge before looking at the data. The intercept in our model is the
average malaria prevalence across Asia (in log space). Is prevalence of
1 (in prevalence space) reasonable? No! So our prior should tell the
model that this is very unlikely.</p>
<p>So first let’s fit our model in a Bayesian framework with INLA. The
priors on fixed effects here are normal distributions with a mean and
precision (1/variance). For our first model we are putting very wide
priors on the parameters which should give us parameter estimates very
similar to the least squares estimate.</p>
<p>In the first model, the global intercept was dominated by countries with
lots of data like Indonesia. These data aren’t independent because we
expect the data within Indonesia to be more similar than the data
between Indonesia and other countries. If we were independently sampling
each person in Asia, China and India would have a lot more data than
Cambodia! When people talk about autocorrelation in the data and
mixed-models this is what they are referring to. While removing this
autocorrelation is good, most of the statistical power will go into
learning country level intercepts, not the global mean.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># easiest way to predict with INLA is to put the prediction data in with NAs in the Y column.
dmean_both <- bind_rows(dmean, dmean_pred)
pred_ii <- which(is.na(dmean_both$log_pr))
# Very vague priors first.
priors <- list(mean.intercept = -2, prec.intercept = 1e-4,
mean = 0, prec = 1e-4)
bm1 <- inla(log_pr ~ country, data = dmean_both,
control.fixed = priors,
control.predictor = list(compute = TRUE))
predb1 <- data.frame(dmean_pred, pred = bm1$summary.fitted.values[pred_ii, 1])
ggplot(dmean, aes(x = country, y = log_pr, colour = n < 10)) +
geom_boxplot() +
geom_point() +
geom_point(data = predb1, aes(country, pred), colour = 'black', size = 4) +
ggtitle('Log malaria prevalence. Bayesian means with vague priors.')
</code></pre></div></div>
<p><img src="/images//bayes-1.png" alt="" /></p>
<p>Now let’s say that we think all countries are fairly similar. To encode
that in the prior we say that the <em>β</em><sub><em>i</em></sub>’s should be small.
INLA works with precision (1/variance) so high precision is a tight
prior around 0.</p>
<p>This is “pooling”. Our estimates for countries with not much data will
be helped by information from the other countries.</p>
<p>Our estimates for the global mean will be dominated by countries with
lots of data. But we won’t have put all our statistical power into
learning the country level parameters.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>priors <- list(mean.intercept = -2, prec.intercept = 1e-4,
mean = 0, prec = 100)
bm2 <- inla(log_pr ~ country, data = dmean_both,
control.fixed = priors,
control.predictor = list(compute = TRUE))
predb2 <- data.frame(dmean_pred, pred = bm2$summary.fitted.values[pred_ii, 1])
ggplot(dmean, aes(x = country, y = log_pr, colour = n < 10)) +
geom_boxplot() +
geom_point() +
geom_point(data = predb1, aes(country, pred), colour = 'black', size = 4, alpha = 0.3) +
geom_point(data = predb2, aes(country, pred), colour = 'black', size = 4) +
ggtitle('Log malaria prevalence. Strong, pooling priors.') +
labs(subtitle = 'Least squares estimates in grey. Estimates pulled towards global mean.')
</code></pre></div></div>
<p><img src="/images//bayes_strong-1.png" alt="" /></p>
<h1 id="the-crux">THE CRUX</h1>
<p>So if we think that all countries are quite similar, we should put a
strong prior on the country level parameters. For countries with little
data this means our estimates are close to the global mean. This is
“pooling”. But it also means the global estimate will be dominated by
countries with lots of data.</p>
<p>If we think that countries are quite dissimilar, we should put a weak
prior on the country level parameters. For countries with little data,
our estimates will be noisy, but maybe that’s better than them being
biased towards the mean. Our global estimate won’t be dominated by any
one country.</p>
<p>The problem then is <em>how similar are countries</em>. Often, we don’t know.
So how do we set our priors sensibly? The answer is mixed-effects
models.</p>
<h1 id="mixed-effects-model">Mixed-effects model</h1>
<p>Our models above looked like this:
<em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub>.<em>AFG</em> + <em>β</em><sub>2</sub>.<em>KHM</em> + <em>β</em><sub>3</sub>.<em>CHN</em> + …<br />
<em>β</em><sub>0</sub> ∼ <em>Norm</em>(−2, 10000)<br />
<em>β</em><sub>i</sub> ∼ <em>Norm</em>(0, 0.001)</p>
<p>We are now saying “we don’t know what number to choose instead of
0.001”. So, along with the rest of the model we will estimate it. We
don’t know how different the different countries are, so we will let the
data tell us.</p>
<p>To do this, we switch the 0.001 for a new variable, <em>σ</em>, and put a prior
on <em>σ</em>.<br />
<em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub>.<em>AFG</em> + <em>β</em><sub>2</sub>.<em>KHM</em> + <em>β</em><sub>3</sub>.<em>CHN</em> + …<br />
<em>β</em><sub>0</sub> ∼ <em>Norm</em>(−2, 10000)<br />
<em>β</em><sub>i</sub> ∼ <em>Norm</em>(0, <em>σ</em>)<br />
<em>σ</em> ∼ some prior distribution<br />
Mixed-effects models are also called hierarchical models for this
reason: the prior on the prior is hierarchical.</p>
<p>So, now if the countries that do have lots of data are very different
from each other, the model will learn that <em>σ</em> must be quite big.
Therefore the countries with little data will not be pulled towards the
mean much. If the countries with lots of data are very similar, then a
country with little data should be pulled towards the mean. If the few
data points lie far from the global mean then probably it’s just by
chance.</p>
<p>Setting hyperpriors can be awkward. Note that <em>σ</em> must be positive so we
need a prior that reflects that.</p>
<p>Recently Penalised complexity priors have been developed and they are
much more intuitive. You choose a “tail value”: What is the largest
value of <em>σ</em> that is reasonable? You then tell the model that the
probability that <em>σ</em> is greater than that value is a small probability
(1% or something).</p>
<p>So for now we’ll say <em>P</em>(<em>σ</em> > 0.1)=1%.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>priors <- list(mean.intercept = -2, prec.intercept = 1e-4)
hyperprior <- list(prec = list(prior="pc.prec", param = c(0.1, 0.01)))
f <- log_pr ~ f(country, model = 'iid', hyper = hyperprior)
mm1 <- inla(f, data = dmean_both,
control.fixed = priors,
control.predictor = list(compute = TRUE))
mm1$summary.hyperpar
## mean sd 0.025quant
## Precision for the Gaussian observations 4.938214 0.2654638 4.433231
## Precision for country 32.062486 11.9888290 14.787373
## 0.5quant 0.975quant mode
## Precision for the Gaussian observations 4.932598 5.476931 4.922921
## Precision for country 30.018967 61.137516 26.332045
1 / mm1$summary.hyperpar$mean[2]
## [1] 0.0311891
predm1 <- data.frame(dmean_pred, pred = mm1$summary.fitted.values[pred_ii, 1])
ggplot(dmean, aes(x = country, y = log_pr, colour = n < 10)) +
geom_boxplot() +
geom_point() +
geom_point(data = predb1, aes(country, pred), colour = 'black', size = 4, alpha = 0.3) +
geom_point(data = predm1, aes(country, pred), colour = 'black', size = 4) +
ggtitle('Log malaria prevalence by country in 2000-2004 (pooling priors)')
</code></pre></div></div>
<p><img src="/images//mixed_model-1.png" alt="" /></p>
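<p>For readers who want the frequentist version mentioned in the intro, the same random-intercepts structure would look like this in lme4 (a sketch only, assuming the lme4 package; it is not fitted in this document):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(lme4)

# A random intercept per country; the between-country sd is estimated
# from the data, playing the role of sigma in the Bayesian model above.
fm1 <- lmer(log_pr ~ 1 + (1 | country), data = dmean)
summary(fm1)
</code></pre></div></div>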
<p>We can look at the prior and posterior for the hyperparameter that
governs the strength of the prior. We can look at this on the internal
precision scale used by INLA. We said that the prior probability that
the precision is <em>less</em> than 3.1 is 1%. On the sd scale we said the
prior probability that the sd is <em>greater</em> than 0.1 is 1%. I’ve plotted
these in red lines. They don’t seem quite right but at least on the
right side.</p>
<p>I’m also not 100% sure that my scaling here is correct. The posterior
(black) looks very wide and high.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># These plots mess up knitr...
#plot(mm1, plot.lincomb = FALSE, plot.random.effects = TRUE,
# plot.fixed.effects = FALSE, plot.predictor = FALSE,
# plot.prior = TRUE)
#abline(v = 10, col = 'red')
# Plot the posterior on precision scale
plot(mm1$marginals.hyperpar$`Precision for country`, type="l", xlim=c(0, 80))
</code></pre></div></div>
<p><img src="/images//model_plots-1.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Plot the prior then add posterior
kappa <- exp(seq(-5, 15, len=10000))
prior.new = inla.pc.dprec(kappa, 0.1, 0.01)
plot(kappa, prior.new, col = 'blue', type = 'l', xlim = c(0, 200), ylim = c(0, 0.002))
lines(mm1$marginals.hyperpar$`Precision for country`)
abline(v = 3.1, col = 'red')
</code></pre></div></div>
<p><img src="/images//model_plots-2.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Plot the posterior on sd scale
sd_scale <- mm1$marginals.hyperpar$`Precision for country`
sd_scale[, 'x'] <- 1/sqrt(sd_scale[, 'x'])
plot(sd_scale, type="l", xlim=c(0, 1))
</code></pre></div></div>
<p><img src="/images//model_plots-3.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Plot prior on sd scale.
plot(1/sqrt(kappa), prior.new, col = 'blue', type = 'l', xlim = c(0, 1), ylim = c(0, 0.002))
lines(sd_scale)
abline(v = 0.1, col = 'red')
</code></pre></div></div>
<p><img src="/images//model_plots-4.png" alt="" /></p>
<h2 id="bit-more-on-priors">Bit more on priors</h2>
<p>Between working out what scale you’re using (variance, sd or precision)
and the intuition being difficult, these priors can be difficult
to think about and to choose your values.</p>
<p>The way I’ve found to go about this is just plotting distributions. What
does a normal with sd of 1 look like? If a country had an estimated
iid effect of 1, is that plausible? Do something simple like the rough
intercept + 1 and transform back into the natural scale.</p>
<p>For example, let’s start by thinking of N(0, 1). It would be quite easy
to get values around -2.5 and 2.5 from this. So with an intercept of
something like -1.5 this gives us values ranging from
exp(−1.5 − 2.5) ≈ 0.02
on the prevalence scale, which is reasonable, and
exp(−1.5 + 2.5)=2.7
at which point we realise that we should be using logit not log, and
that probably we don’t want a country being estimated prevalence above 1
and that N(0, 1) is really very flexible. Our prior of 0.1 being on the
upper end of likely is therefore kind of reasonable.</p>
<p>INLA has these penalised complexity priors and they are quite nice. If
you end up using other Bayesian packages you may well have to use other
priors. Gamma distributions and half normals on SD are common. Same
thing though, plot some distributions and see how reasonable it is.</p>
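<p>As a small sketch of that kind of check (using the rough intercept of -1.5 and the N(0, 1) country effect from above):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simulate country effects from N(0, 1) and look at the implied prevalences
# (the response is log(pr + 0.1), so back-transform with exp(x) - 0.1).
effects <- rnorm(10000, 0, 1)
implied_prev <- exp(-1.5 + effects) - 0.1
hist(implied_prev, breaks = 50)
mean(implied_prev > 1)  # a non-trivial chance of prevalence above 1: very flexible
</code></pre></div></div>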
<p>And see this paper. In particular Figure 4.
<a href="https://arxiv.org/abs/1709.01449">https://arxiv.org/abs/1709.01449</a></p>
<h1 id="question-two-what-were-the-malaria-trends-in-asia-and-in-each-country">Question two: What were the malaria trends in Asia and in each country.</h1>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dtime$country %>% table
## .
## Afghanistan Bangladesh Bhutan Cambodia China
## 224 364 23 211 102
## India Indonesia Iraq Laos Malaysia
## 219 1117 11 76 15
## Myanmar Nepal Pakistan Philippines Saudi Arabia
## 38 0 56 350 2
## Sri Lanka Tajikistan Thailand Timor-Leste Turkey
## 18 8 105 11 8
## Vietnam Yemen
## 150 136
dtime$year %>% table
## .
## 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
## 64 64 49 12 39 33 50 114 60 136 52 206 125 79 41
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 79 111 192 209 117 216 175 650 223 10 35 1 34 60 4
## 2015
## 4
</code></pre></div></div>
<p>Note that countries like Bhutan and Iraq have very little data. Start thinking how
you would estimate a temporal trend in those countries. As above, how do
we estimate a temporal trend without it being dominated by the trend in
Indonesia. The above mixed-effects model was called a random intercepts
model. The “random” component was the iid country effect and we were
estimating many intercepts. Now we will look at a random slopes model.
The regression slopes will become our random component.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ggplot(dtime, aes(x = year_start, y = pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, scales = 'free_y', ncol = 3) +
ggtitle('Malaria prevalence by country through time')
</code></pre></div></div>
<p><img src="/images//data_plot2-1.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ggplot(dtime, aes(x = year_start, y = log_pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, scales = 'free_y', ncol = 3) +
ggtitle('Log malaria prevalence by country through time')
</code></pre></div></div>
<p><img src="/images//data_plot2-2.png" alt="" /></p>
<h3 id="going-back-to-least-squares">Going back to least squares.</h3>
<p>For question two we need to include a “year_start” term. The simplest
model we can usefully do is a global year term and ignore country level
lines.</p>
<p><em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub> <em>year</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>m3 <- lm(log_pr ~ year_start, data = dtime)
pred3 <- data.frame(dtime_pred, pred = predict(m3, newdata = dtime_pred))
ggplot(dtime, aes(x = year_start, y = log_pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, ncol = 3) +
geom_line(data = pred3, aes(y = pred)) +
ggtitle('Log malaria prevalence by country through time: only one slope')
</code></pre></div></div>
<p><img src="/images//time_models-1.png" alt="" /></p>
<p>We could instead estimate a separate intercept for each country but still
only one slope. As above we would want these intercepts to be random
effects but for now they aren’t.
<em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub> <em>year</em> + <em>β</em><sub>2</sub> AFG + <em>β</em><sub>3</sub> KHM + <em>β</em><sub>4</sub> CHN + …</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>m4 <- lm(log_pr ~ year_start + country, data = dtime)
pred4 <- data.frame(dtime_pred, pred = predict(m4, newdata = dtime_pred))
ggplot(dtime, aes(x = year_start, y = log_pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, ncol = 3) +
geom_line(data = pred4 , aes(y = pred)) +
ggtitle('Log malaria prevalence. Seperate intercepts, one slope.')
</code></pre></div></div>
<p><img src="/images//m4-1.png" alt="" /></p>
<p>Or we could estimate one slope and one intercept for each country as
well as a global slope and global intercept. This model makes sense and
lets us answer the questions we are asking.</p>
<p><em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub> <em>year</em> + <em>β</em><sub>2</sub> AFG + <em>β</em><sub>3</sub> KHM + <em>β</em><sub>4</sub> CHN + … + <em>β</em><sub>5</sub> AFG·<em>year</em> + <em>β</em><sub>6</sub> KHM·<em>year</em> + <em>β</em><sub>7</sub> CHN·<em>year</em> + …
Again, the variables AFG etc. are 1 if the datapoint is in Afghanistan and 0
otherwise. So <em>β</em><sub>5</sub> AFG·<em>year</em> will be zero if the
datapoint is not in Afghanistan and will be <em>β</em><sub>5</sub> <em>year</em>
if the datapoint is in Afghanistan.</p>
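<p>If you want to see these dummy and interaction variables explicitly, you can inspect the design matrix R builds from the formula (a quick check using the <code class="language-plaintext highlighter-rouge">dtime</code> data frame from above):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The design matrix holds the intercept, the 0/1 country dummies and the
# country:year_start interaction columns used in the equation above.
X <- model.matrix(log_pr ~ country + year_start:country, data = dtime)
dim(X)
X[1:3, 1:5]
</code></pre></div></div>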
<p>We can fit this model with least squares.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>m5 <- lm(log_pr ~ country + year_start:country , data = dtime)
pred5 <- data.frame(dtime_pred, pred = predict(m5, newdata = dtime_pred))
## Warning in predict.lm(m5, newdata = dtime_pred): prediction from a rank-
## deficient fit may be misleading
ggplot(dtime, aes(x = year_start, y = log_pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, ncol = 3, scale = 'free_y') +
geom_line(data = pred5, aes(y = pred)) +
ggtitle('Log malaria prevalence. Seperate intercepts and slopes.')
</code></pre></div></div>
<p><img src="/images//m5-1.png" alt="" /></p>
<p>But the estimates in countries like Timor-Leste are not very good. I
don’t believe malaria is decreasing in Timor-Leste 10x faster than in
other countries, and I don’t believe Turkey is going through an
epidemic. As above, we have a model structure that is useful, but the
way we are estimating our parameters is not very good. So first off
let’s switch to a Bayesian model.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dtime_both <- bind_rows(dtime, dtime_pred)
pred_ii <- which(is.na(dtime_both$log_pr))
# This is all just messing around getting the priors together.
# Not going to think about this too much, but we think malaria is going down.
names <- paste('country', unique(dtime$country), sep = '')
pmean <- c(rep(list(0), length(names)), -0.5) # 0 is the mean for the country intercepts, -0.5 (the default) is the mean for the year slopes.
names(pmean) <- c(names, 'default')
# 0.1 is the precision for the country intercepts, 0.001 (the default) is the precision for the year slopes.
pprec <- c(rep(list(0.1), length(names)), 0.001)
names(pprec) <- c(names, 'default')
priors <- list(mean.intercept = -2, prec.intercept = 1e-4,
mean = pmean, prec = pprec)
b3 <- inla(log_pr ~ country + year_start:country , data = dtime_both,
control.fixed = priors,
control.predictor = list(compute = TRUE))
predb3 <- data.frame(dtime_pred, pred = b3$summary.fitted.values[pred_ii, 1])
ggplot(dtime, aes(x = year_start, y = log_pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, ncol = 3, scale = 'fixed') +
geom_line(data = pred5, aes(y = pred), alpha = 0.3) +
#geom_line(data = pred3, aes(y = pred), colour = 'blue', alpha = 0.3) +
geom_line(data = predb3, aes(y = pred)) +
ggtitle('Log malaria prevalence. Pooling priors') +
ylim(-3, -0.5)
## Warning: Removed 26 rows containing missing values (geom_point).
</code></pre></div></div>
<p><img src="/images//bayes_slopes-1.png" alt="" /></p>
<p>As above, these priors have pushed both the slopes and intercepts
much closer to the global mean. And again, we don’t really know
how similar the intercepts and slopes are to each other, so next we let the
data tell us.</p>
<h2 id="random-slopes">Random Slopes</h2>
<p>As before we have a model:
<em>y</em> = <em>β</em><sub>0</sub> + <em>β</em><sub>1</sub> <em>year</em> + <em>β</em><sub>2</sub> AFG + <em>β</em><sub>3</sub> KHM + <em>β</em><sub>4</sub> CHN + … + <em>β</em><sub>5</sub> AFG·<em>year</em> + <em>β</em><sub>6</sub> KHM·<em>year</em> + <em>β</em><sub>7</sub> CHN·<em>year</em> + …
And we have priors that we don’t know how strong they should be.
<em>β</em><sub>2 − 4</sub> ∼ Norm(0, <em>σ</em><sub>intercept</sub>)
<em>σ</em><sub>intercept</sub> ∼ some prior distribution
<em>β</em><sub>5 − 7</sub> ∼ Norm(0, <em>σ</em><sub>slope</sub>)
<em>σ</em><sub>slope</sub> ∼ some prior distribution</p>
<p>And as above we can use penalised complexity priors.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>dtime_both$country2 <- dtime_both$country # INLA needs us to copy this column
# We will put weak priors on the fixed effects. They can do what they want.
priors <- list(mean = list(year_start = -0.5, default = -2),
prec = 1e-5)
hyper.intercept <- list(prec = list(prior="pc.prec", param = c(0.1, 0.01)))
hyper.slope <- list(prec = list(prior="pc.prec", param = c(0.1, 0.01)))
# For the formula we need year_start in for our global term.
# The global intercept is just the intercept and is included by default
f <- log_pr ~ year_start +
f(country, model = 'iid', hyper = hyper.intercept) +
f(country2, year_start, model = 'iid', hyper = hyper.slope)
mm2 <- inla(f, data = dtime_both,
control.fixed = priors,
control.predictor = list(compute = TRUE))
predmm2 <- data.frame(dtime_pred, pred = mm2$summary.fitted.values[pred_ii, 1])
ggplot(dtime, aes(x = year_start, y = log_pr)) +
geom_point(alpha = 0.4) +
facet_wrap(~ country, ncol = 3, scale = 'fixed') +
geom_line(data = pred5, aes(y = pred), alpha = 0.3) +
geom_line(data = predmm2, aes(y = pred)) +
ggtitle('Log malaria prevalence by country through time. Random intercepts and slopes') +
ylim(-3, -0.5)
## Warning: Removed 26 rows containing missing values (geom_point).
</code></pre></div></div>
<p><img src="/images//Mixed_slopes-1.png" alt="" /></p>
<p>Now just some messing around to explore what we have fitted. Here is a
histogram of the fitted country slopes (the global slope plus each country’s random deviation).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hist(mm2$summary.fixed$mean[2] + mm2$summary.random$country2$mean)
</code></pre></div></div>
<p><img src="/images//plots1-1.png" alt="" /></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mm2$summary.hyperpar
## mean sd
## Precision for the Gaussian observations 5.931078e+00 1.476931e-01
## Precision for country 5.919402e+04 1.222777e+06
## Precision for country2 5.794590e+07 2.128258e+07
## 0.025quant 0.5quant
## Precision for the Gaussian observations 5.645668e+00 5.929447e+00
## Precision for country 1.287891e+02 3.612827e+03
## Precision for country2 2.617741e+07 5.475057e+07
## 0.975quant mode
## Precision for the Gaussian observations 6.226221e+00 5.926438e+00
## Precision for country 3.463451e+05 2.206538e+02
## Precision for country2 1.085916e+08 4.872095e+07
</code></pre></div></div>
<p>The estimated mean for the precision of the random slope component is
about 5e7. Therefore the sd is roughly $1/\sqrt{5e7} \approx 0.0001$. The data have told us that
the declines in each country are pretty similar. Therefore the crazy
slope in Timor-Leste is totally unjustified.</p>
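<p>A quick way to do this conversion for all of the hyperparameters at once (just transforming the posterior means, which is rough but fine for intuition):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Posterior mean precisions transformed to standard deviations.
# (The mean of 1/sqrt(precision) is not exactly 1/sqrt(mean precision),
#  but it is good enough for a sanity check.)
1 / sqrt(mm2$summary.hyperpar$mean)
</code></pre></div></div>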
<h1 id="recap-and-practical-advice">Recap and practical advice</h1>
<p>So, we have fitted a model for prevalence and a model for prevalence
through time. In both cases we have many countries, and therefore many
parameters. We want to put priors on these many parameters but don’t
know how strong to make them. So we use a mixed-effect model to put a
hyperprior on the prior.</p>
<p>These parameters can be intercepts or regression slopes. Everything
works the same way but this can be confusing in the programming syntax.
This is what we refer to as random intercepts and random slopes models.</p>
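<p>Pulling together the syntax used in this post, the two kinds of random effect look like this (shown as comments, since both forms appear in the models fitted above):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># INLA:
#   f(country, model = 'iid')                # random intercepts: one iid effect per country
#   f(country2, year_start, model = 'iid')   # random slopes: one iid coefficient on year per country
# lme4:
#   (1 | country)                            # random intercepts
#   (year_start | country)                   # random intercepts and slopes
</code></pre></div></div>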
<p>So when are these models suitable? Given that the sole thing they do is
change the estimates of these many parameters, we should focus on
whether that makes sense in a particular case.</p>
<ol>
<li>We need to estimate <em>σ</em>, the between group variance. We therefore
need many groups for this estimate to be any good. The number of
countries we have here is on the lower side.</li>
<li>If each group has loads of data, the prior will be ignored. So the
benefit of mixed-effects models is reduced if every group has lots
of data.</li>
<li>These models can be used for different reasons. Perhaps we are
estimating some global fixed effect but want to account
for autocorrelation. Perhaps we are interested in the individual
group estimates, but want to share information between groups.</li>
</ol>
<h1 id="frequentist-mixed-models">Frequentist mixed models.</h1>
<p>I really don’t understand frequentist models. They sort of do the same
thing (estimating the variance of the random effect) but without priors.
I dunno. Standard library is lme4 and you would do the above models like
this.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(lme4)
f1 <- log_pr ~ (1 | country)
mm3 <- lmer(f1, data = dmean)
coefficients(mm3)
## $country
## (Intercept)
## Afghanistan -1.984396
## Cambodia -1.765873
## China -2.250998
## India -1.793234
## Indonesia -1.810567
## Laos -1.455991
## Myanmar -1.473185
## Tajikistan -1.973589
## Thailand -2.142186
## Timor-Leste -1.905138
## Turkey -2.192474
## Vietnam -1.850997
## Yemen -2.020359
##
## attr(,"class")
## [1] "coef.mer"
fixef(mm3)
## (Intercept)
## -1.893768
f2 <- log_pr ~ year_start + (year_start | country)
mm4 <- lmer(f2, data = dtime)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : unable to evaluate scaled gradient
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : Model failed to converge: degenerate Hessian with 1 negative
## eigenvalues
coefficients(mm4)
## $country
## (Intercept) year_start
## Afghanistan 23.86207 -0.01298811
## Bangladesh 23.74080 -0.01288552
## Bhutan 23.86637 -0.01299368
## Cambodia 23.37922 -0.01257206
## China 24.04296 -0.01314883
## India 23.26472 -0.01247322
## Indonesia 23.66311 -0.01281644
## Iraq 24.24638 -0.01332219
## Laos 22.99958 -0.01224425
## Malaysia 23.68770 -0.01283863
## Myanmar 22.96536 -0.01221494
## Pakistan 23.85269 -0.01297997
## Philippines 23.99380 -0.01310051
## Saudi Arabia 23.95024 -0.01306647
## Sri Lanka 23.95157 -0.01306773
## Tajikistan 23.88800 -0.01301237
## Thailand 23.74605 -0.01288867
## Timor-Leste 23.58755 -0.01275329
## Turkey 24.00437 -0.01311285
## Vietnam 23.67971 -0.01283811
## Yemen 23.36557 -0.01257216
##
## attr(,"class")
## [1] "coef.mer"
fixef(mm4)
## (Intercept) year_start
## 23.7018017 -0.0128519
</code></pre></div></div>
<p>R is complaining about not being able to fit the model properly. I don’t
know why.</p>
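<p>One common cause of these lme4 convergence warnings (a guess rather than a diagnosis for this particular model) is predictors on a large numeric scale, like raw calendar years. Centring the year before fitting often helps:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Centre year so the optimiser works on a sensible scale and the intercept
# refers to the mean year rather than year 0. (A hedged suggestion; it may
# not remove the warnings for this particular dataset.)
dtime$year_c <- dtime$year_start - mean(dtime$year_start)
mm4b <- lmer(log_pr ~ year_c + (year_c | country), data = dtime)
summary(mm4b)
</code></pre></div></div>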
<p>Check this paper for more.
<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5970551/">https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5970551/</a></p>Tim CD LucasA primer on Bayesian mixed-effects models.Stats280 Twitter Stats Course2017-11-15T00:00:00-08:002017-11-15T00:00:00-08:00https://timcdlucas.github.io/stats280-twitter-stats-course<hr />
<p>An attempt at writing a full statistics course for twitter.<br /><br /><img src="/images/stats280.png" /></p>
<iframe class="wakeletEmbed" width="100%" height="760px" src="https://embed.wakelet.com/wakes/42d2cac4-4fb7-4be9-b476-d9df88d11c23/list" style="border: none" allow="autoplay"></iframe>
<!-- Please only call https://embed-assets.wakelet.com/wakelet-embed.js once per page -->
<script src="https://embed-assets.wakelet.com/wakelet-embed.js" charset="UTF-8"></script>Tim CD LucasAn attempt at writing a full statistics course for twitter.Measlesdataviz2015-05-13T00:00:00-07:002015-05-13T00:00:00-07:00https://timcdlucas.github.io/measlesdataviz<hr />
<p>Working through visualising the effects of the measles vaccine. <br /><br /><img src="/images/measlesTimeseries.png" /></p>
<h1 id="visualising-the-effects-of-the-measles-vaccine">Visualising the effects of the measles vaccine</h1>
<p>First there was the Wall Street Journal <a href="http://graphics.wsj.com/infectious-diseases-and-vaccines/">visualisation</a>.</p>
<p>Then <a href="http://www.twitter.com/RobertAllison__">@RobertAllison__</a> redrew the <a href="http://blogs.sas.com/content/sastraining/2015/02/17/how-to-make-infectious-diseases-look-better/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+sasblogs+%28SAS+Blogs%29">plot</a>.</p>
<p>Then <a href="www.twitter.com/biomickwatson">@biomickwatson</a> recreated the <a href="https://biomickwatson.wordpress.com/2015/04/09/recreating-a-famous-visualisation/">plot</a>. Finally, <a href="http://www.twitter.com/benjaminlmoore">@benjaminlmoore</a> recreated the <a href="http://www.r-bloggers.com/recreating-the-vaccination-heatmaps-in-r/">plot</a> in ggplot2.</p>
<p>So I thought I’d have a go as well. I’ve downloaded the <em>incidence</em> data from the Tycho website. <a href="http://www.tycho.pitt.edu/l1advanced.php">http://www.tycho.pitt.edu/l1advanced.php</a> You have to register and stuff. I also deleted the first two rows with titles in.</p>
<p>Before I start, my aims:</p>
<ul>
<li>No funky colour ramps. Let the data speak for itself.</li>
<li>Distinguish between missing data and zeros.</li>
<li>I’m considering reordering the states. Perhaps largest states at the top? Or high measles burden at top.</li>
</ul>
<h3 id="the-code">The code</h3>
<p>The code is available as an Rmarkdown document on <a href="https://github.com/timcdlucas/statsforbios/tree/master/measles">github</a>.</p>
<p>So, read in data. Then shamelessly copy code from @biomickwatson to get to a decent starting point.</p>
<p>To go from weekly data to annual, I am taking the mean across the year (with NAs removed). As pointed out by <a href="http://blogs.sas.com/content/sastraining/2015/02/17/how-to-make-infectious-diseases-look-better/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+sasblogs+%28SAS+Blogs%29">@RobertAllison__</a>, if you sum the data, the NAs introduce a bias. So I am taking the mean, then multiplying back by 52 to get expected cases per 100,000 per year.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">library</span><span class="p">(</span><span class="n">gplots</span><span class="p">)</span>
<span class="n">m</span> <span class="o"><-</span> <span class="n">read</span><span class="p">.</span><span class="n">csv</span><span class="p">(</span><span class="s">'MEASLES_Incidence_1928-2003_20150409110701.csv'</span><span class="p">,</span> <span class="n">stringsAsFactors</span> <span class="o">=</span> <span class="n">FALSE</span><span class="p">)</span>
<span class="c1"># yoink. Cheers @biomickwatson
</span><span class="n">m</span><span class="p">[</span><span class="n">m</span> <span class="o">==</span> <span class="s">"-"</span><span class="p">]</span> <span class="o"><-</span> <span class="n">NA</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="ow">in</span> <span class="mi">2</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">m</span><span class="p">))</span> <span class="p">{</span>
<span class="n">m</span><span class="p">[,</span> <span class="n">i</span><span class="p">]</span> <span class="o"><-</span> <span class="k">as</span><span class="p">.</span><span class="n">numeric</span><span class="p">(</span><span class="n">m</span><span class="p">[,</span> <span class="n">i</span><span class="p">])</span>
<span class="p">}</span>
<span class="n">m</span> <span class="o"><-</span> <span class="n">m</span><span class="p">[</span><span class="n">m</span><span class="err">$</span><span class="n">YEAR</span><span class="o">>=</span><span class="mi">1930</span><span class="p">,]</span>
<span class="n">y</span> <span class="o"><-</span> <span class="n">aggregate</span><span class="p">(</span><span class="n">m</span><span class="p">[,</span><span class="mi">3</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">m</span><span class="p">)],</span> <span class="n">by</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="n">year</span><span class="o">=</span><span class="n">m</span><span class="p">[,</span><span class="mi">1</span><span class="p">]),</span> <span class="n">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="mi">52</span><span class="o">*</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">rm</span> <span class="o">=</span> <span class="n">TRUE</span><span class="p">))</span>
<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="ow">in</span> <span class="mi">1</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">y</span><span class="p">))</span> <span class="p">{</span>
<span class="n">y</span><span class="p">[</span><span class="ow">is</span><span class="p">.</span><span class="n">nan</span><span class="p">(</span><span class="n">y</span><span class="p">[,</span> <span class="n">i</span><span class="p">]),</span> <span class="n">i</span><span class="p">]</span> <span class="o"><-</span> <span class="n">NA</span>
<span class="p">}</span>
<span class="n">y</span> <span class="o"><-</span> <span class="n">y</span><span class="p">[</span><span class="n">order</span><span class="p">(</span><span class="n">y</span><span class="err">$</span><span class="n">year</span><span class="p">),]</span>
<span class="n">row</span><span class="p">.</span><span class="n">labels</span> <span class="o"><-</span> <span class="n">rep</span><span class="p">(</span><span class="s">""</span><span class="p">,</span> <span class="mi">72</span><span class="p">)</span>
<span class="n">row</span><span class="p">.</span><span class="n">labels</span><span class="p">[</span><span class="n">c</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">11</span><span class="p">,</span><span class="mi">21</span><span class="p">,</span><span class="mi">31</span><span class="p">,</span><span class="mi">41</span><span class="p">,</span><span class="mi">51</span><span class="p">,</span><span class="mi">61</span><span class="p">,</span><span class="mi">71</span><span class="p">)]</span> <span class="o"><-</span> <span class="n">c</span><span class="p">(</span><span class="s">"1930"</span><span class="p">,</span><span class="s">"1940"</span><span class="p">,</span><span class="s">"1950"</span><span class="p">,</span><span class="s">"1960"</span><span class="p">,</span><span class="s">"1970"</span><span class="p">,</span>
<span class="s">"1980"</span><span class="p">,</span><span class="s">"1990"</span><span class="p">,</span><span class="s">"2000"</span><span class="p">)</span>
<span class="n">cols</span> <span class="o"><-</span> <span class="n">colorRampPalette</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="s">"red"</span><span class="p">,</span> <span class="s">"blue"</span><span class="p">))(</span><span class="mi">100</span><span class="p">)</span>
<span class="n">bks</span> <span class="o"><-</span> <span class="n">seq</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="p">(</span><span class="n">y</span><span class="p">[,</span> <span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">na</span><span class="p">.</span><span class="n">rm</span> <span class="o">=</span> <span class="n">TRUE</span><span class="p">),</span> <span class="n">length</span><span class="p">.</span><span class="n">out</span> <span class="o">=</span> <span class="mi">101</span><span class="p">)</span>
<span class="n">par</span><span class="p">(</span><span class="n">cex</span><span class="p">.</span><span class="n">main</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="n">heatmap</span><span class="p">.</span><span class="mi">2</span><span class="p">(</span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">y</span><span class="p">[,</span><span class="mi">2</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">y</span><span class="p">)])),</span> <span class="n">Rowv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span> <span class="n">Colv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span>
<span class="n">dendrogram</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">trace</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">FALSE</span><span class="p">,</span>
<span class="n">labCol</span><span class="o">=</span><span class="n">row</span><span class="p">.</span><span class="n">labels</span><span class="p">,</span> <span class="n">cexCol</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">lhei</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.15</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">lwid</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">margins</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">12</span><span class="p">),</span>
<span class="n">col</span><span class="o">=</span><span class="n">cols</span><span class="p">,</span> <span class="n">breaks</span><span class="o">=</span><span class="n">bks</span><span class="p">,</span> <span class="n">colsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">72</span><span class="p">,</span> <span class="n">srtCol</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">rowsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">57</span><span class="p">,</span> <span class="n">sepcolor</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span>
<span class="n">add</span><span class="p">.</span><span class="n">expr</span><span class="o">=</span><span class="n">lines</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">32</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1000</span><span class="p">),</span><span class="n">lwd</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span>
<span class="n">main</span><span class="o">=</span><span class="s">'Measles cases in US states 1930-2001</span><span class="se">\n</span><span class="s">Vaccine introduced 1961
</span><span class="se">\n</span><span class="s">(data from Project Tycho)'</span><span class="p">)</span></code></pre></figure>
<figure>
<img src="/images/setup-1.png" />
<figcaption> </figcaption>
</figure>
<p>OK. NA’s are white. Other colours are ramped. That’s good. The colour ramp here is funny because I’m using @biomickwatson’s values which match the cases data rather than the incidence data.</p>
<p>I like the labels on the right so I’ll leave that.</p>
<p>Now to get some good colours. I might try and leave NA’s white and have a ramp that doesn’t include white. RColorBrewer asseeeeemble.</p>
<p>Also going to change to 2 letter state names. The csv is just a list copied from the <a href="http://www.50states.com/abbreviations.htm#.VSgJpXXd89Y">web</a>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">library</span><span class="p">(</span><span class="n">RColorBrewer</span><span class="p">)</span>
<span class="n">stNames</span> <span class="o"><-</span> <span class="n">read</span><span class="p">.</span><span class="n">csv</span><span class="p">(</span><span class="s">'stateNames.csv'</span><span class="p">,</span> <span class="n">header</span> <span class="o">=</span> <span class="n">FALSE</span><span class="p">,</span> <span class="n">stringsAsFactors</span> <span class="o">=</span> <span class="n">FALSE</span><span class="p">)</span>
<span class="n">names</span><span class="p">(</span><span class="n">y</span><span class="p">)[</span><span class="mi">2</span><span class="p">:</span><span class="mi">52</span><span class="p">]</span> <span class="o"><-</span> <span class="n">stNames</span><span class="p">[,</span><span class="mi">2</span><span class="p">]</span>
<span class="n">cols</span> <span class="o"><-</span> <span class="n">colorRampPalette</span><span class="p">(</span><span class="n">brewer</span><span class="p">.</span><span class="n">pal</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="s">'Reds'</span><span class="p">))(</span><span class="mi">100</span><span class="p">)</span>
<span class="n">bks</span> <span class="o"><-</span> <span class="n">seq</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="p">(</span><span class="n">y</span><span class="p">[,</span> <span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">na</span><span class="p">.</span><span class="n">rm</span> <span class="o">=</span> <span class="n">TRUE</span><span class="p">),</span> <span class="n">length</span><span class="p">.</span><span class="n">out</span> <span class="o">=</span> <span class="mi">101</span><span class="p">)</span>
<span class="n">par</span><span class="p">(</span><span class="n">cex</span><span class="p">.</span><span class="n">main</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="n">heatmap</span><span class="p">.</span><span class="mi">2</span><span class="p">(</span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">y</span><span class="p">[,</span><span class="mi">2</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">y</span><span class="p">)])),</span> <span class="n">Rowv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span> <span class="n">Colv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span>
<span class="n">dendrogram</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">trace</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">FALSE</span><span class="p">,</span>
<span class="n">labCol</span><span class="o">=</span><span class="n">row</span><span class="p">.</span><span class="n">labels</span><span class="p">,</span> <span class="n">cexCol</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">lhei</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.15</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">lwid</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">margins</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">12</span><span class="p">),</span>
<span class="n">breaks</span><span class="o">=</span><span class="n">bks</span><span class="p">,</span> <span class="n">colsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">72</span><span class="p">,</span> <span class="n">srtCol</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">rowsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">57</span><span class="p">,</span> <span class="n">sepcolor</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span> <span class="n">col</span><span class="o">=</span><span class="n">cols</span><span class="p">,</span>
<span class="n">add</span><span class="p">.</span><span class="n">expr</span><span class="o">=</span><span class="n">lines</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">32</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1000</span><span class="p">),</span><span class="n">lwd</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span>
<span class="n">main</span><span class="o">=</span><span class="s">'Measles cases in US states 1930-2001</span><span class="se">\n</span><span class="s">Vaccine introduced 1961'</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">color</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.8</span><span class="p">))</span></code></pre></figure>
<figure>
<img src="/images/nas-1.png" />
<figcaption> </figcaption>
</figure>
<p>As suggested by <a href="https://twitter.com/BulbousSquidge/status/567318406857515008">@bulboussquidge</a> I’ll try just clipping the few high values to a something+ category.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">hist</span><span class="p">(</span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">y</span><span class="p">[,</span><span class="mi">1</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">y</span><span class="p">)]))</span></code></pre></figure>
<figure>
<img src="/images/clipped-1.png" />
<figcaption> </figcaption>
</figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">y2</span> <span class="o"><-</span> <span class="n">y</span><span class="p">[,</span> <span class="mi">2</span><span class="p">:</span><span class="n">NCOL</span><span class="p">(</span><span class="n">y</span><span class="p">)]</span>
<span class="nb">sum</span><span class="p">(</span><span class="n">y2</span><span class="p">[</span><span class="err">!</span><span class="ow">is</span><span class="p">.</span><span class="n">na</span><span class="p">(</span><span class="n">y2</span><span class="p">)]</span> <span class="o">></span> <span class="mi">2500</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">y2</span><span class="p">[</span><span class="n">y2</span> <span class="o">></span> <span class="mi">2500</span><span class="p">]</span> <span class="o"><-</span> <span class="mi">2500</span>
<span class="n">cols</span> <span class="o"><-</span> <span class="n">colorRampPalette</span><span class="p">(</span><span class="n">brewer</span><span class="p">.</span><span class="n">pal</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="s">'Reds'</span><span class="p">))(</span><span class="mi">100</span><span class="p">)</span>
<span class="n">bks</span> <span class="o"><-</span> <span class="n">seq</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">max</span><span class="p">(</span><span class="n">y2</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">rm</span> <span class="o">=</span> <span class="n">TRUE</span><span class="p">),</span> <span class="n">length</span><span class="p">.</span><span class="n">out</span> <span class="o">=</span> <span class="mi">101</span><span class="p">)</span>
<span class="n">par</span><span class="p">(</span><span class="n">cex</span><span class="p">.</span><span class="n">main</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="n">heatmap</span><span class="p">.</span><span class="mi">2</span><span class="p">(</span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">y2</span><span class="p">)),</span> <span class="n">Rowv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span> <span class="n">Colv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span>
<span class="n">dendrogram</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">trace</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">FALSE</span><span class="p">,</span>
<span class="n">labCol</span><span class="o">=</span><span class="n">row</span><span class="p">.</span><span class="n">labels</span><span class="p">,</span> <span class="n">cexCol</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">lhei</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.15</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">lwid</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">margins</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">12</span><span class="p">),</span>
<span class="n">breaks</span><span class="o">=</span><span class="n">bks</span><span class="p">,</span> <span class="n">colsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">72</span><span class="p">,</span> <span class="n">srtCol</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">rowsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">57</span><span class="p">,</span> <span class="n">sepcolor</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span> <span class="n">col</span><span class="o">=</span><span class="n">cols</span><span class="p">,</span>
<span class="n">add</span><span class="p">.</span><span class="n">expr</span><span class="o">=</span><span class="n">lines</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">32</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1000</span><span class="p">),</span><span class="n">lwd</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span>
<span class="n">main</span><span class="o">=</span><span class="s">'Measles cases in US states 1930-2001</span><span class="se">\n</span><span class="s">Vaccine introduced 1961'</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">color</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.8</span><span class="p">))</span></code></pre></figure>
<figure>
<img src="/images/clipped-2.png" />
<figcaption> </figcaption>
</figure>
<p>Only 3 data points are affected. I’m torn here.</p>
<p>Now I want to try organising the data by measles burden. The areas with lots of measles are the important bit, so I think that makes sense. I think I’ll just do mean (with NAs removed) and order by size. Certainly a useful thing could be to order by number of cases rather than incidence. But I don’t want to go get the other dataset.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">means</span> <span class="o"><-</span> <span class="nb">apply</span><span class="p">(</span><span class="n">y2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">rm</span> <span class="o">=</span> <span class="n">TRUE</span><span class="p">))</span>
<span class="n">y3</span> <span class="o"><-</span> <span class="n">y2</span><span class="p">[,</span> <span class="n">rev</span><span class="p">(</span><span class="n">order</span><span class="p">(</span><span class="n">means</span><span class="p">))]</span>
<span class="n">par</span><span class="p">(</span><span class="n">cex</span><span class="p">.</span><span class="n">main</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
<span class="n">heatmap</span><span class="p">.</span><span class="mi">2</span><span class="p">(</span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">y3</span><span class="p">)),</span> <span class="n">Rowv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span> <span class="n">Colv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span>
<span class="n">dendrogram</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">trace</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">FALSE</span><span class="p">,</span>
<span class="n">labCol</span><span class="o">=</span><span class="n">row</span><span class="p">.</span><span class="n">labels</span><span class="p">,</span> <span class="n">cexCol</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">lhei</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.15</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">lwid</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">margins</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">12</span><span class="p">),</span>
<span class="n">breaks</span><span class="o">=</span><span class="n">bks</span><span class="p">,</span> <span class="n">colsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">72</span><span class="p">,</span> <span class="n">srtCol</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">rowsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">57</span><span class="p">,</span> <span class="n">sepcolor</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span> <span class="n">col</span><span class="o">=</span><span class="n">cols</span><span class="p">,</span>
<span class="n">add</span><span class="p">.</span><span class="n">expr</span><span class="o">=</span><span class="n">lines</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">32</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1000</span><span class="p">),</span><span class="n">lwd</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span>
<span class="n">main</span><span class="o">=</span><span class="s">'Measles cases in US states 1930-2001</span><span class="se">\n</span><span class="s">Vaccine introduced 1961'</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">color</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.85</span><span class="p">))</span></code></pre></figure>
<figure>
<img src="/images/ordered-1.png" />
<figcaption> </figcaption>
</figure>
<p>I think the reordering is an improvement. It’s interesting at least.</p>
<p>Finally, I just want to tweak a few things. This turns out to be a complete pain. I had a go at hacking <code class="language-plaintext highlighter-rouge">heatmap.2()</code>. The new function is saved in customHeatmap.R.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">source</span><span class="p">(</span><span class="s">'customHeatmap.R'</span><span class="p">)</span>
<span class="n">customHeatmap</span><span class="p">(</span><span class="k">as</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">y3</span><span class="p">)),</span> <span class="n">Rowv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span> <span class="n">Colv</span><span class="o">=</span><span class="n">NULL</span><span class="p">,</span> <span class="n">lmat</span> <span class="o">=</span> <span class="n">rbind</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">3</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">4</span><span class="p">)),</span>
<span class="n">dendrogram</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">trace</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="n">TRUE</span><span class="p">,</span>
<span class="n">labCol</span><span class="o">=</span><span class="n">row</span><span class="p">.</span><span class="n">labels</span><span class="p">,</span> <span class="n">lhei</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.15</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mf">0.25</span><span class="p">),</span> <span class="n">lwid</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">margins</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">6</span><span class="p">),</span>
<span class="n">breaks</span><span class="o">=</span><span class="n">bks</span><span class="p">,</span> <span class="n">colsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">72</span><span class="p">,</span> <span class="n">rowsep</span><span class="o">=</span><span class="mi">1</span><span class="p">:</span><span class="mi">57</span><span class="p">,</span> <span class="n">sepcolor</span><span class="o">=</span><span class="s">"white"</span><span class="p">,</span> <span class="n">col</span><span class="o">=</span><span class="n">cols</span><span class="p">,</span>
<span class="n">add</span><span class="p">.</span><span class="n">expr</span><span class="o">=</span><span class="n">lines</span><span class="p">(</span><span class="n">c</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">32</span><span class="p">),</span><span class="n">c</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1000</span><span class="p">),</span><span class="n">lwd</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span>
<span class="n">main</span><span class="o">=</span><span class="s">'Measles incidence in US states'</span><span class="p">,</span> <span class="n">na</span><span class="p">.</span><span class="n">color</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.8</span><span class="p">),</span>
<span class="n">density</span><span class="p">.</span><span class="n">info</span> <span class="o">=</span> <span class="s">'none'</span><span class="p">,</span> <span class="n">RowLabColors</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.4</span><span class="p">),</span> <span class="n">cexCol</span> <span class="o">=</span> <span class="mf">1.3</span><span class="p">,</span> <span class="n">key</span><span class="p">.</span><span class="n">title</span> <span class="o">=</span> <span class="s">''</span><span class="p">,</span>
<span class="n">cexRow</span> <span class="o">=</span> <span class="mf">0.65</span><span class="p">,</span> <span class="n">ColLabColors</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.4</span><span class="p">),</span> <span class="n">key</span><span class="p">.</span><span class="n">xlab</span> <span class="o">=</span> <span class="s">'Cases per 100,000'</span><span class="p">,</span> <span class="n">titleColor</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.4</span><span class="p">),</span> <span class="n">key</span><span class="p">.</span><span class="n">par</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">col</span> <span class="o">=</span> <span class="n">grey</span><span class="p">(</span><span class="mf">0.6</span><span class="p">),</span> <span class="n">lwd</span> <span class="o">=</span> <span class="mf">0.1</span> <span class="p">)</span>
<span class="p">)</span></code></pre></figure>
<figure>
<img src="/images/final-1.png" />
<figcaption> </figcaption>
</figure>
<p>At this point, I’m bored of hacking. I’m just going to make the last few changes in inkscape.</p>
<p>Which gives me:</p>
<figure>
<img src="/images/measlesTimeseries.png" />
<figcaption> </figcaption>
</figure>
<p>Still not perfect. But I’m bored now.</p>
Tim CD LucasWorking through visualising the effects of the measles vaccine.Lmvsanova2015-01-18T00:00:00-08:002015-01-18T00:00:00-08:00https://timcdlucas.github.io/lmVSanova<hr />
<p>Power of different linear models.</p>
<p>If you want to test for a change in a response over some variable x, there are a few different ways to do it.</p>
<p>We can collect the data so that it is spread out along the x axis or clumped at either end.
We can analyse the x axis as a continuous variable or as a discrete variable (binning x if it is spread out).
These four options are displayed below, with boxplots implying the data is analysed as discrete x values.
Note that only two sets of data are simulated: one clumped and one spread out.</p>
<figure class="half">
<img src="../../images/examplePlots-1.png" title="plot of chunk examplePlots" alt="plot of chunk examplePlots" style="width: 350px;" />
<img src="../../images/examplePlots-2.png" title="plot of chunk examplePlots" alt="plot of chunk examplePlots" style="width: 350px;" />
</figure>
<figure class="half">
<img src="../../images/examplePlots-3.png" title="plot of chunk examplePlots" alt="plot of chunk examplePlots" style="width: 350px;" />
<img src="../../images/examplePlots-4.png" title="plot of chunk examplePlots" alt="plot of chunk examplePlots" style="width: 350px;" />
</figure>
<p>So to examine the power of these approaches here’s a function that simulates some data (from a linear model with normal error) and then calculates and extracts p-values (sorry) for the four cases shown above.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>calcLM <- function(){
x1 <- runif(30)
y1 <- x1 * 2 + rnorm(30)
x2 <- rep(c(0,1), 15)
y2 <- x2 * 2 + rnorm(30)
coef <- c( summary(lm(y1 ~ x1))$coef[8],
summary(lm(y2 ~ x2))$coef[8],
summary(lm(y1 ~ x1 > 0.5))$coef[8],
summary(lm(y2 ~ as.factor(x2)))$coef[8]
)
}
</code></pre></div></div>
<p>Then let’s run the simulation 1000 times.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>p <- t(replicate(1000, calcLM())) %>% data.frame
colnames(p) <- c('continuousSpread', 'continuousClumped', 'discreteSpread', 'discreteClumped')
pLong <- melt(p, variable.name = 'model', value.name = 'p')
</code></pre></div></div>
<p>And plot the results.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ggplot(pLong, aes(x = model, y = p)) +
scale_y_log10() +
geom_violin() +
ggtitle('p values from different models.')
</code></pre></div></div>
<p><img src="../../images/simPlot-1.png" title="plot of chunk simPlot" alt="plot of chunk simPlot" width="500" /></p>
<p>So, using data from the edges of our range of x values gives us more power (lower p-values).
Also, it’s interesting to note that doing a discrete ANOVA with the x values as a factor is identical to treating this as a continuous linear model.
This will not be true if you have more than two groups though.
Furthermore, if you actually want to use this as a linear model, you will have to do an extra step to scale the coefficients if you do an ANOVA rather than a linear model.</p>
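<p>A quick standalone check of the two-group case (not part of the simulation above):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># With only two x values, coding x as numeric or as a factor gives an
# identical fit: the same coefficient for the group difference and the same p-value.
set.seed(1)
x <- rep(c(0, 1), 15)
y <- 2 * x + rnorm(30)
summary(lm(y ~ x))$coef
summary(lm(y ~ as.factor(x)))$coef
</code></pre></div></div>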
<p>So… that was kinda fun. And another chance to get to know ggplot2 better. Some code is suppressed here but you can see the full knitr document <a href="https://github.com/timcdlucas/statsforbios/blob/master/lmVSanova.Rmd">here</a>.</p>Tim CD LucasPower of different linear models.Zoon2014-12-04T00:00:00-08:002014-12-04T00:00:00-08:00https://timcdlucas.github.io/zoon<hr />
<p>Reproducible science and ZOÖN Internship. <br /><br /><img src="/images/elife02851f003.jpg" /></p>
<p>Reproducibility in science (without getting into <a href="http://cogprints.org/7691/7/ICMLws09.pdf">semantics</a>) is the ability of other scientists to reproduce your results. The first step of that is being able to check what you have done. Did you make a mistake with your algebra? Does running the same experiment give wildly different results? As the use of computational methods in ecology increases, we are in a position where we should be able to quickly and easily reproduce the research in an entire paper. First I rerun your code, and check that the outputs match those in your paper (should be easy). Then I check the code for errors (less easy).</p>
<p>However, even the first step is often hampered. Code is not included in a paper, or is hidden in an unsuitable format in the supplementary material, which is hosted neither carefully nor with longevity in mind. When code is included, the data needed to run an analysis is often not. Other times, a script is included, but is a mess with different bits of analysis and output all jumbled together.</p>
<p>Species distribution modelling (SDM) uses data on where a species lives to predict the whole distribution of the species. In short, a species is likely to exist in areas with environmental conditions similar to those we have seen it in before. So, as long as your data is shared, I should be able to reproduce your results with minimal effort. However, even field-defining papers are completely unreplicable. For example, <a href="http://onlinelibrary.wiley.com/doi/10.1111/j.2006.0906-7590.04596.x/abstract">Elith et al. (2006)</a> benchmarks how good a number of different models are and has been cited some 3,000 times, yet the paper itself is totally unreplicable. It would be great to add more recently developed methods to this benchmark. If a new method can’t outperform the current ones, then it is not very useful. But with the previous benchmark being unreplicable, this is not possible.</p>
<figure>
<img src="/images/elife02851f003.jpg" />
<figcaption> An example use of SDM, mapping the climatic niche of leishmaniases. Pigott et al. 2014. DOI: http://dx.doi.org/10.7554/eLife.02851.007 </figcaption>
</figure>
<h2 id="the-internship">The Internship</h2>
<p>Over the past months I have been working on an internship creating an R package for reproducible SDMs. The package is called ZOÖN and can be found on <a href="https://github.com/zoonproject/zoon">github</a> with more information <a href="https://zoonproject.wordpress.com/">here</a>. The ideas behind ZOÖN have been developed over the last year, with consultation of SDM users at every step (i.e. before I started). It is hoped that this constant discussion will avoid pitfalls of writing software that is then never used. It was decided that while there are great SDM packages out there (<a href="http://cran.r-project.org/web/packages/biomod2/index.html">biomod2</a>, <a href="http://www.cs.princeton.edu/~schapire/maxent/">maxent</a> etc.) there was still a gap for a higher level package, that aids the running, sharing and reproducing of whole SDM analyses, including data collection, data cleaning and outputs. However, as this is a fast moving field, an inflexible package, written and maintained by a small group of developers, would quickly become out of date. So instead the plan is to use web-hosted ‘modules’ that are quick and easy to program (compared to a full R package). ZOÖN will pull these modules from the web and run an analysis. This also means wrappers for other packages can easily be written.</p>
<figure>
<img src="/images/workshop.jpg" />
<figcaption> Presenting the package at a workshop </figcaption>
</figure>
<p>The goal of the internship was to write a working prototype <a href="http://cran.r-project.org/">R</a> package, and I think I have succeeded. The package is on <a href="https://github.com/zoonproject/zoon">github</a> and can be installed in R with <code class="language-plaintext highlighter-rouge">devtools::install_github("zoonproject/zoon")</code>. Although there is some work still to be done, the core package works. Whole SDM workflows can be run with one command (sketched after the next paragraph). The output then contains all the data needed to run the analysis and a record of the call (the text command that was entered) used to run it. As the modules are all online, an analysis can be rerun simply by having access to this output (one R object). In the case of analyses using online data (from GBIF, for example), only the call is needed to rerun an analysis. Furthermore, while still in early development, there is already a very simple way to upload an analysis to <a href="www.figshare.com">Figshare</a>.</p>
<p>To run an SDM with the package you must specify at least five ‘modules’: one each that collects occurrence data, collects environmental data, processes the data, runs a model, and gives some output. Fleshing out the variety of possible analyses will require much more work writing modules (there are plans for a hackathon to get this going). However, as wrapping existing packages is easy, there are already modules for running all the models available in Biomod2, collecting data from GBIF, worldclim and NCEP, creating basic maps, and uploading analyses to Figshare, to name a few. A sketch of such a call is given below.</p>
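<p>As a concrete sketch, this is roughly what installing the package and running one whole workflow looks like. The module names are the package’s example modules as I understand them at the time of writing; treat them, and the comment about rerunning, as assumptions rather than a definitive reference to the current API.</p>
<pre><code class="language-r">
# Install the development version from GitHub and run one whole SDM
# workflow with a single command.
# install.packages("devtools")
devtools::install_github("zoonproject/zoon")
library(zoon)

work1 <- workflow(occurrence = UKAnophelesPlumbeus,  # occurrence data module
                  covariate  = UKAir,                # environmental data module
                  process    = OneHundredBackground, # add background points
                  model      = LogisticRegression,   # fit a simple model
                  output     = PrintMap)             # map the predictions

# The returned object records the data and the original call, so the whole
# analysis can be rerun or shared as a single R object.
str(work1, max.level = 1)
</code></pre>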
<figure>
<img src="/images/EresussandaliatusTwoModels.png" />
<figcaption> Two distributions of Eresus sandaliatus created using ZOÖN </figcaption>
</figure>
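<p>Comparisons like the two maps above are where swapping modules should pay off. The sketch below is hypothetical: every module name is a placeholder standing in for the kinds of wrappers mentioned earlier (GBIF and worldclim data modules, Biomod2 model wrappers, map and Figshare outputs), and the list/Chain syntax is my understanding of how multiple modules are combined, so check the ZOÖN documentation for the real names and interface.</p>
<pre><code class="language-r">
# Hypothetical workflow comparing two models on the same data. Because the
# module names are placeholders, the call is left commented out.
library(zoon)

# work2 <- workflow(
#   occurrence = GBIFOccurrenceModule,       # fetch records from GBIF
#   covariate  = WorldclimCovariateModule,   # download worldclim layers
#   process    = BackgroundPointsModule,     # add background points
#   model      = list(LogisticRegression,    # compare a simple model...
#                     Biomod2WrapperModule), # ...with a Biomod2 one
#   output     = Chain(DrawMapModule,        # map each prediction...
#                      FigshareUploadModule) # ...and upload the analysis
# )
</code></pre>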
<p>So in three months I think I have laid the groundwork for a package that really simplifies sharing analyses, while making it easy for new methods to be incorporated into current analyses.</p>
<h2 id="open-science">Open science</h2>
<p>This project has been conducted in a very open manner which I have really enjoyed. The code can be found on <a href="https://github.com/zoonproject/zoon">github</a> as soon as it is written. And the code is licensed to make it useable by anyone. Most of us <a href="www.twitter.com/gregmci">are</a> <a href="www.twitter.com/_nickgolding_">on</a> <a href="www.twitter.com/timcdlucas">twitter</a> and happy to discuss the research. And as discussed above, regular contact with a <a href="https://zoonproject.wordpress.com/user-panel/">user panel</a> means we are not locked in our dark computer lab, working in isolation.</p>
<h2 id="lessons">Lessons</h2>
<p>Through this internship I have learned an awful lot about the nuts and bolts of R. Writing a package is a really good way to get to know the language better. I can totally recommend <a href="http://r-pkgs.had.co.nz/description.html">R packages</a> and <a href="http://adv-r.had.co.nz/#r-pkgs">Advanced R</a> for more information.</p>
<p>I have also become much more comfortable with handling a large(ish) software project. Git is now second nature, using Github to record issues has become invaluable, and the benefits of unit testing have become clearer.</p>
<p>On a less tangible front, it has been really interesting to see how differently people approach the community side of software development. Without users, your software is worthless. This project relies on its community for more than just a user base: we are hoping for users to contribute code in the form of modules, so community development has been important from the beginning. I really liked working while talking to potential end users (although a workshop six weeks into a project is terrifying). I don’t think this approach is easy, but I definitely think it’s worth putting effort into building a community around your software.</p>
<p>And now it just remains to see how the project develops and whether the software becomes commonly used.</p>