Unlike the previous homework assignments, this one does not have separate short answer questions that are unrelated to coding. Thus, you will primarily turn in your R code. However, some components do also include short answer questions. These are marked by the word Question in bold. You can submit these answers in a separate pdf OR answer them in the comments of your R script at the appropriate positions.

Bayesian inference with JAGS

For this assignment, we’ll make use of a dataset of lexical decision responses. As you may be aware, lexical decision is a task in which participants see a string of characters such as ‘flutter’ or ‘blutcher’ and have to rapidly push a button indicating whether it is a real word or not. Your dataset is the set of responses (‘word’ or ‘nonword’) made to actual words by 17 participants and is available as bicknell_et_al_2010_lexdec_ling300.csv download here.

In particular, we’re interested in using the binary answers (correct / incorrect) to make inferences about the probability of getting a trial correct in this task, which we’ll call \(\pi\). (Remember that ‘correct’ means responding ‘word’, since this dataset is all actual words.) We’ll use two different Bayesian methods to estimate the posterior distribution on \(\pi\). Note that by itself, of course, this probability isn’t very interesting. However, these models are a simpler piece of an interesting analysis that examining whether \(\pi\) differs across conditions (for example, following related or unrelated prime words).

  1. First steps:

    • As the first R commands in your script, load the packages you’ll need for this assignment (dplyr, ggplot2, rjags)
    • Load the data from the file to a data frame called dat. (Do not use file.choose().)
    • Note that the second variable is factor with two levels ‘word’ and ‘nonword’. For the use of this variable with JAGS, it would be more convenient if it was a numeric vector, where a value of 1 indicates correct and 0 indicates incorrect. Using the dplyr package, add a column to this data frame called correct that has this property.

Part 1: simple model

To start analyzing our beliefs about \(\pi\), we’ll use a very simple model, much like the one we discussed in class, and then perform inference with JAGS.

  1. [Note: before completing this problem, you’ll need to install the Hmisc package if it’s not already installed in R. No need to load it in your script with the library() function, though.] To get a sense of the overall data, plot a 95% bootstrapped confidence interval on the overall probability of responding correct. Since this is a single overall probability, not split by condition, the simplest way to get ggplot to do this is to specify a constant number for the x-axis aesthetic, e.g., aes(1, correct). To get this type of interval out of stat_summary, use mean_cl_boot for the fun.data argument instead of mean_se that you were using in the previous assignment. (Note that bootstrapped confidence intervals such as these – which we haven’t discussed in class – are usually much better behaved for parameters bounded between 0 and 1 than the more common standard-error-based intervals.) Questions: What are the endpoints of that interval? How much uncertainty does this (frequentist) analysis reveal about the probability of getting a trial correct (i.e., about how wide is the interval)?

  2. The first step of creating this simple model is to create a variable called model_string_simple that describes this simple model in JAGS. The model should specify that the variable pi is distributed according to a uniform Beta distribution, and that each element of the variable correct is drawn from a Bernoulli trial with parameter pi.

  3. The next step is to initialize variables named num_samples_to_keep, num_samples_to_toss, and num_chains. For now, keep just 50 samples, fix the number of samples to toss to be the same as the number of samples to keep, and use two Markov chains.

  4. Now, define the initialization values for the two chains. The first should start at a value of pi at 0.01 and the other at 0.99. The first chain should use a random number seed of 13 and the second of 1871. Both chains should use the Mersenne Twister random number generator.

  5. Next, initialize your JAGS model, using the model string, initialization values, and number of chains you’ve defined. You should also link the correct variable in your JAGS model to the correct variable in your data frame.

  6. Run burn-in for num_samples_to_toss iterations.

  7. Now, get your samples from the model, specifying that pi is the variable for which we are interested in obtaining samples. Store your samples in a variable named mcmc_samples_simple.

  8. As is usually the first thing you should do to check mcmc output, run gelman.diag on your samples. Questions: Is the Upper C.I. of \(\hat{R}\) for your variable of interest below 1.05 as desired?

  9. Another useful thing to check is how many effectively independent samples from the posterior the mcmc samples contain about a given variable of interest. Check that from this mcmc output with the effectiveSize function. If a variable is mixing very well, then practically every mcmc iterate will be an independent sample from the posterior, and thus the number of effective samples per iterate will be near 1. Conversely, if a variable is mixing poorly, this ratio could be very close to zero. Question: Is pi mixing well or poorly in this model? Justify your answer.

  10. Finally, we’ll extract an estimate for the posterior mean and estimates for the bounds of the posterior 95% central interval from our samples. To do this, we’ll first need to change our mcmc samples object into a simple vector of samples. To do this, first convert the mcmc samples to a matrix using the as.matrix() function. Then, convert that matrix into a data frame with the as.data.frame() function. Now, extract the pi column from that data frame into a vector called samples_vector_simple. Using this vector:
    • Estimate the mean of the posterior by taking the mean of these samples.
    • Estimate the bounds of the 95% central interval of the posterior by taking the 0.025 and 0.975 quantiles of these samples. You can do this using the quantile() function.
    • Questions: What are the estimated posterior mean and 95% central interval? Exactly how wide is that interval estimated to be? How well do these numbers agree with the mean and bounds that you plotted with bootstrap methods above?

Part 2: hierarchical model

The generative model used in part 1 made the assumption that the correct responses were all independently generated. Because these data were produced by 17 participants, that isn’t a good assumption. For example, one participant’s overall probability of answering correct is likely to be different than another participant’s. In this part, you’ll extend the model from part 1 into one that doesn’t make this incorrect assumption.

  1. To see the extent to which this is the case in the data, first plot bootstrapped confidence intervals on each participant’s probability of answering a trial correctly. This should be as simple as swapping out the constant x-axis from your plot above into using participant for the x-axis. Question: How much do the individual participants’ probabilities of answering correct seem to differ from each other?

  2. In order to provide the model with information that each participant has their own probability of answering correctly \(pi_i\), you’ll create a hierarchical model. The first step of creating this model is to create a variable called model_string_hier that describes this model in JAGS. The model should specify that:
    • pi_overall \(\sim \mbox{Beta}(1,1)\)
    • \(k \sim \mbox{Uniform}(0, 1000)\)
    • for each participant \(i\) from 1 to num_ppts: pi[i] \(\sim \mbox{Beta}(k\pi, k(1-\pi))\),
    • for i from 1 to length(correct): correct[i] ~ dbern(pi[participant[i]])
  3. The next step is to set variables for the number of samples to keep to 100, the number of samples to toss to be equal to the number of samples to keep, and the number of chains to 2. Additionally, create a new variable num_ppts, which is equal to 17.

  4. Create variables that will be used to initialize the two chains to relatively disparate parts of the parameter space. Make sure to initialize all of the variables, and to set the random number generators and seeds to ensure reproducibility. Note that pi should be initialized to a vector of length num_ppts.

  5. Initialize your JAGS model, using the model string, initialization values, and number of chains you’ve defined. You should also link the correct and participant variables in your JAGS model to the those in your data frame, and the num_ppts variable in your JAGS model to the variable in R.

  6. Run burn-in and obtain your samples, ensuring to keep all three unknown variables as part of your samples.

  7. Check whether your model appears to have converged. Questions: Are all \(\hat{R}\) Upper C.I. values already below 1.05? If not, double the number of samples (ensuring that burn-in is also doubled) and repeat until all values are below 1.05. How many samples did it require?

  8. Now that you have some basis for believing that your model has converged, the next thing to check is whether our upper bound on the prior for k (of 1000) influenced our posterior distribution at all. To do this, first convert your samples object into a data frame (as described above). Now, you can use the max() function to find the maximum of the k column of this dataframe to see the largest sample of k. Questions: What is this value? Is it safe to assume that our upper bound of 1000 was not influencing the random walk?

  9. As another diagnostic, use the effectiveSize() function on your mcmc object to determine the approximate number of effectively independent samples your mcmc samples contain for each variable. Questions: Which two variables are mixing the slowest? What are their efficiencies? (i.e., the ratio of effective samples to actual iterates. Remember to sum across all chains when calculating the number of actual iterates.) Question: How does this compare to the efficiency of pi in the simple non-hierarchical model? (You may have already computed this as part of #10.)

  10. Say our goal is to estimate a 95% central probability interval for pi (in part 1) and pi_overall (in part 2), and say that to estimate a good interval, we want to have 100 independent samples in each tail. Questions: How many independent samples of the whole distribution would we need to obtain in order to have 100 in each tail? Given the efficiencies of pi (in part 1) and pi_overall (in part 2), how many actual samples should we obtain about each of them in order to get about that number of independent samples?

  11. Change the number of samples for the simple and hierarchical models now to get as many as determined in problem 21, and confirm that effectiveSize() now shows at least the appropriate number of effective samples for having 100 in each tail. If it does not, increase the number of samples again. Questions: How many samples did it end up taking to get the appropriate number of effectively independent samples in each model? What has happened to the \(\hat{R}\) values in these models?

  12. Now, with these final sets of samples, estimate the means and 95% central intervals of the posterior in each model. Using the bounds, calculate the width of the two posterior intervals. Questions: Which model has a broader posterior distribution? Why would this be the case?