This homework assignment has two parts. The first part tests your understanding of some basic concepts in probability theory and is short answer. The second part requires programming in R; in addition to answering some short-answer questions, you’ll turn in the script you create.

Probability theory

  1. Probability mass. Say we have a probability mass function \(p(x)\) that puts probability on just three outcomes: 1, 3, and 5. We know that \(p(1) = 0.4\), and we know that \(p(3) = p(5)\). What must \(p(3)\) and \(p(5)\) be? Why?

  2. Probability density. Say we have a uniform distribution on the range from 3.0 to 3.1. This distribution, like all uniform distributions, has the same probability density \(p(x)\) at every point \(x\) within its range \([3.0, 3.1]\). What is the probability density \(p(x)\) at each of these points? Why do you know this? Is it a problem that this probability density is greater than 1? Why or why not?

  3. Bernoulli trial distribution. We often say that the Bernoulli trial has a single parameter, the probability of a ‘success’, often denoted \(\pi\). In terms of \(\pi\), what is the probability of a failure? Why?

  4. Mean and variance. Say we have a multinomial trial distribution \(p(x)\) over three outcomes 1, 2, and 4, where \(p(1) = 0.3\), \(p(2) = 0.4\), and \(p(4) = 0.3\). What is the expectation or mean of this distribution? What is the variance? What is the standard deviation? (show how you got these results)
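     In case a reminder of the relevant definitions helps (these are the standard formulas, not the answer itself): for a discrete distribution, the mean is \(E[X] = \sum_x x\,p(x)\), the variance is \(\mathrm{Var}(X) = \sum_x (x - E[X])^2\,p(x)\), and the standard deviation is \(\sqrt{\mathrm{Var}(X)}\).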

Data summarization and plotting in R

For this part of the homework, you’ll be making use of a data file called schilling_ling300.txt, which you can download here. This file was constructed from a freely available dataset of eye movements in reading called the Schilling corpus (Schilling, Rayner, & Chumbley, 1998). This corpus was created when participants read a number of short sentences while their eyes were being tracked. The file I’ve provided you with describes, for a number of words and for each of 9 subjects, three binary eye movement measures that relate to ongoing language processing: whether the word was skipped, whether it was fixated multiple times, and whether a regression was launched from it back to an earlier word.

In general, words that take less time to process are associated with more skipping, and words that take longer to process are associated with more instances of multiple fixations and more regressions. For the purposes of this homework, we’ll be looking at the relationships between these three eye movement measures and one measure of how long it takes to process a word: its length in characters. On average, words that are longer take longer to process, so the expectation is that as length increases, words will be associated with less skipping, more instances of multiple fixations, and more regressions.

  1. Getting to know the dataset without R. For text-encoded datasets such as this one (and the one we used in class), it’s usually a good idea to look at the dataset in a text editor of some sort before loading it into R. This can help you decide which commands and options to use to load the dataset. If you’re comfortable on a *nix-style terminal, using the head command is a great way to look at a dataset. If not, open the file in your favorite text editor, or open it in RStudio with File -> Open File… to look at it. (Make sure not to accidentally save any changes to the file.)
    • You’ll notice that the file has 6 columns, separated by a tab. Each column has a name, given in the first row (header row):
      • subj. subject number, uniquely identifying participants
      • word. word number, in the form [sentence number]_[word number in sentence]
      • wlen. length of word, in characters
      • skip. a 1 if the word was skipped by this subject and 0 otherwise
      • mfix. a 1 if the word was fixated multiple times by this subject and 0 otherwise
      • fp.reg. a 1 if a regression was launched from this word back to a previous word by this subject, and a 0 otherwise. If a word was skipped, the value for fp.reg is NA, since if the reader did not fixate the word, they had no opportunity to launch a regression from it
    • Looking just at the first 10 lines of this file (so the first 9 lines of data, since row 1 is the header), what range of word lengths does it seem that this dataset is investigating? Also, looking at these rows of data, which outcome seems to be least likely: skipping a word, fixating a word multiple times, or launching a regression from a word?
  2. Loading data. Now, create a new R script called hw1.R, which you’ll be modifying in the rest of this homework. As you develop this script, remember to add comments beginning with the # character whenever you think it would be useful if you wanted to read this script again in a month or two. The first thing you’ll add to the script is a line to read the data file schilling_ling300.txt into R. Before you can do this, remember that you’ll need to change the working directory to match the directory that the data file is in. To load the data, you’ll need to choose the appropriate function in the read_delim() family from readr (e.g., read_delim(), read_csv(), etc.) and pass it the appropriate arguments. Look at the help for this family of functions by typing ?read_delim to see the available options. Remember that the most relevant arguments are whether or not the file has a header and what character is used to separate the columns. In case you haven’t seen it before, '\t' is a common representation in computer programming languages for the tab character in a string. Assign the result of this function to a new variable called dat. (A rough sketch of this step appears at the end of this item.)
    • What type of object does dat now store? (Hint: Think about what type of object the read_delim() family of functions returns.)
    • Use the summary() function on the new variable dat to summarize the data in each of its columns. Of the vector types we discussed in class (logical, numeric, character, factor), what type is each of the six columns?
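    • In case it helps to see the overall shape of this step, here is a rough sketch of what the loading line might look like (whether these are exactly the right arguments is for you to verify against the ?read_delim help page):

      # Read the tab-separated file; readr treats the first row as column
      # names by default
      library(readr)
      dat <- read_delim("schilling_ling300.txt", delim = "\t")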
  3. Function creation and data transformation. After loading data into R, one common task is changing the type of some of its columns into ones that represent the underlying data more explicitly. For example, three of the columns of dat (skip, mfix, and fp.reg) are represented as numeric variables that just happen to only take two values (0 and 1), but in actuality represent a logical value that is either true or false. It is often a good idea to convert these variables into formal logical vectors to make this explicit, so this is what you’ll do next. Since this is something that we’ll want to do to all three of these columns, this is a good candidate for creating a function (so that we don’t need to copy and paste code).
    • Define a new function called to_logical that takes a single vector argument containing 0s and 1s (and possibly NAs) and returns the corresponding logical vector of TRUEs and FALSEs (and possibly NAs).
    • At the beginning of the function, check to ensure that the input meets the function’s assumptions, and emit an error if not. As in class, this can be done by combining an if statement with a stop() function call in the following form:

      if (FALSE) {
        stop("Error message")
      }
      • You’ll need to replace the FALSE in the conditional statement above with an appropriate test of whether every element in the vector is 0, 1, or NA. Testing whether each element in the vector is equal to 1 can be done with the == operator, and similarly for 0. Testing whether each element in a vector is NA, however, requires the special function is.na(). Thus, you can create three logical vectors that denote whether each element is 0, 1, or NA. Then, you can put these together with the logical-or operator |, to produce one vector that tells whether each element of the input vector is either 0, 1, or NA. Finally, to collapse this vector into a single logical value that specifies whether all of the elements of the vector are TRUE, you can use the all() function. Note: you’re likely to also want to use the logical-not operator ! to perform the test you want. (One way of putting these pieces together is sketched below.)
      • You’ll also need to replace "Error message" with something more informative about the error.
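      • Putting those pieces together, the test might look roughly like the following sketch (here x stands for whatever you decide to name the function’s input argument):

        # TRUE for each element that is 0, 1, or NA; FALSE otherwise
        ok <- x == 0 | x == 1 | is.na(x)
        if (!all(ok)) {
          stop("input must contain only 0s, 1s, and NAs")
        }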
    • Once you’ve checked that the input meets the function’s assumptions, the next step is to code the body of the function, which transforms 1 to TRUE, 0 to FALSE, and leaves NA as NA. For this step we can assume that those are the only three possible inputs (thanks to the checking we’ve already implemented). There are two ways to achieve this goal. One is to use the ifelse() function (see ?ifelse), which checks whether each element meets a certain condition (say, being 1), and if so, returns one value (say, TRUE), and if not, another value (say, FALSE). (The ifelse() function, like most elementwise functions in R, leaves NA values as NA.) A more direct approach, valid only because of the check we just implemented, is simply to use the == operator, which returns TRUEs and FALSEs (and also leaves NA values as NA).
    • Test your resulting function: call it with the argument c(0, 1, 1, NA, 0, 1) and check that it returns c(FALSE, TRUE, TRUE, NA, FALSE, TRUE).
    • Finally, use this function to transform the three binary columns of dat: using mutate() from the dplyr package, convert the skip, mfix, and fp.reg columns into logical vectors with your new to_logical() function. Make sure to assign the result of the transformation back to the dat variable. (A sketch putting all of these steps together follows.)
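    • To give a sense of how all of these steps fit together, here is one possible sketch (the exact error message and the order of the mutate() arguments are up to you):

      # Convert a vector of 0s, 1s, and NAs into a logical vector
      to_logical <- function(x) {
        # Stop if any element is something other than 0, 1, or NA
        if (!all(x == 0 | x == 1 | is.na(x))) {
          stop("to_logical() expects a vector containing only 0, 1, and NA")
        }
        # 1 becomes TRUE, 0 becomes FALSE, and NA stays NA
        x == 1
      }

      # Quick test: should return FALSE TRUE TRUE NA FALSE TRUE
      to_logical(c(0, 1, 1, NA, 0, 1))

      # Transform the three binary columns and assign the result back to dat
      library(dplyr)
      dat <- mutate(dat,
                    skip = to_logical(skip),
                    mfix = to_logical(mfix),
                    fp.reg = to_logical(fp.reg))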

  4. Data aggregation with dplyr. As mentioned above, for this dataset, the main question of interest is how the eye movement measures of ongoing language processing (skip rate, multiple fixation rate, and regression rate) vary with word difficulty, as assessed by word length. In the next problem, we’ll be visualizing these relationships using ggplot2. However, it is customary (and very reasonable) in behavioral research on multiple subjects to not plot raw data, but to first aggregate the data to produce means for each condition of interest for each subject, and then plot summary statistics of those means. So, in this problem, you’ll aggregate the data to produce a new data frame dat_subj where there is a single mean for each of these three measures for each subject for each possible value of word length.
    • Since we want to summarize the data for each subject for each word length, you’ll want to group the data by subj and wlen using dplyr’s group_by() function. Assign this new grouped data frame to your new variable dat_subj.
    • Chained to that first grouping command, use the summarise() function (note the spelling!) to create one “new” variable skip that is the mean of the old skip variable, e.g., summarise(skip = mean(skip)). Because the data is already grouped, this will calculate the mean for each group, which is exactly what we want. And since this is chained to the group_by() function, it will now be the result of this summarise function that is assigned to the new dat_subj variable.
    • Once you’ve verified that this is working, you can add the other two variables to the summarise() function. Note that because the column fp.reg includes NA values, you’ll need to add a second argument to the mean() function, na.rm = TRUE (see ?mean), to denote that you want to remove the NA values prior to averaging (otherwise, the average will also be NA). (A sketch of the complete chain appears at the end of this item.)
    • At this point, use the head() function in R to look at the first few rows of the new data frame dat_subj. Verify that the columns skip, mfix, and fp.reg are no longer binary, but now are proportions between 0 and 1.
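    • Put together, the aggregation step might look roughly like this sketch (assuming dat has already been transformed as in the previous problem):

      # For each subject and each word length, average each eye movement measure
      library(dplyr)
      dat_subj <- dat %>%
        group_by(subj, wlen) %>%
        summarise(skip = mean(skip),
                  mfix = mean(mfix),
                  fp.reg = mean(fp.reg, na.rm = TRUE))

      # Check that the three measures are now proportions between 0 and 1
      head(dat_subj)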
  5. Plotting. Finally, using ggplot2, we’ll plot our transformed data frame dat_subj that we created in the previous problem.
    • First, using the ggplot function, set the data argument to dat_subj and use the aesthetics function aes() to map wlen to the x axis and skip to the y axis. Add geom_point() to get a scatter-plot of the subject means. The result gives a vague sense of the range of the data, but isn’t yet a very useful visualization.
    • One problem is that the x axis is a bit odd given its meaning, including a tick mark at 2.5, which is an impossible value given that words can only have a whole number of letters. We can fix this by telling R that wlen is not an arbitrary numeric value, but takes discrete levels, i.e., is a factor. To fix this, go back to the earlier part of the script that used mutate() to transform skip, etc., into logical values, and add code there to convert wlen into a factor using the factor() function. Re-run the script and make sure that the x axis now explicitly shows only possible values of word length in this dataset.
    • Another problem with this graph is that we don’t have any way of associating the data points that come from the same subject. We can fix this, as we did in class, by adding the aesthetics color = subj and group = subj. Do this and make sure the result now shows different subjects in different colors.
    • You may have noticed that the color scale that was created is very hard to read. E.g., subjects 1 and 2 are very nearly the same color, and subjects 8 and 9 are nearly the same color too. This is again because R doesn’t know that the subj variable only takes 9 possible values (one for each subject), but is assuming that it can take any value in its range (e.g., 1.42), and so it’s assigning a gradient of colors to this range. To fix this, add code to the same mutate() call from before to change subj into a factor as well. Re-run the script and make sure that the colors chosen are now more reasonable. (A sketch of the plotting code up to this point follows.)
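    • At this point, the plotting code might look roughly like the following sketch (assuming wlen and subj have already been converted to factors in the earlier mutate() step):

      # Scatter-plot of the by-subject means, one color per subject
      library(ggplot2)
      ggplot(dat_subj, aes(x = wlen, y = skip, color = subj, group = subj)) +
        geom_point()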
    • Next, let’s make a second plot that actually computes the mean and standard errors of these by-subject means instead of just plotting every single one of them. This is also very easy with ggplot2. Create a new plot with the same basic aesthetics as before (x = wlen, y = skip), but omitting the bit about subj. Then, replace the geom_point() bit that was added, which just plotted every subject mean as a scatter-plot point, with stat_summary(fun.data = mean_se). Now, instead of directly visualizing the data (what the geom did), this function will calculate a statistic of the y values (their mean and standard error of the mean) for each of the possible values of the x variable (wlen). Each statistic is associated with a default visualization scheme (here a point with a simple line through it to indicate a range), so this code is all that’s needed to see the means and standard errors of the skip rates for each word length in the dataset. Describe the relationship between word length and skip rate that seems apparent.
    • Finally, create two more versions of the previous plot that do the same thing with the other two dependent measures of interest (mfix and fp.reg), and describe the relationship between word length and each of those two variables that is apparent there. (A sketch of these summary plots follows.)
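    • Continuing the previous sketch (so ggplot2 is already loaded), the summary plot for skip might look roughly like this; the mfix and fp.reg versions follow the same pattern:

      # Mean and standard error of the by-subject skip means at each word length
      ggplot(dat_subj, aes(x = wlen, y = skip)) +
        stat_summary(fun.data = mean_se)

      # Repeat with y = mfix and then y = fp.reg for the other two measures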