This homework assignment has two parts. The first part tests your understanding of some basic concepts in probability theory and is short answer. The second part requires programming in R, and requires turning in a script you’ll create, in addition to answering some short answer questions.

*Probability mass.* Say we have a probability mass function \(p(x)\) that puts probability on just three outcomes: 1, 3, and 5. We know that \(p(1) = 0.4\), and we know that \(p(3) = p(5)\). What must \(p(3)\) and \(p(5)\) be? Why?

*Probability density.* Say we have a uniform distribution on the range from 3.0 to 3.1. This distribution, like all uniform distributions, has the same probability density \(p(x)\) at every point \(x\) within its range \([3.0, 3.1]\). What is the probability density \(p(x)\) at each of these points? Why do you know this? Is it a problem that this probability density is greater than 1? Why or why not?

*Bernoulli trial distribution.* We often say that the Bernoulli trial has a single parameter, the probability of a ‘success’, often denoted \(\pi\). In terms of \(\pi\), what is the probability of a failure? Why?

*Mean and variance.* Say we have a multinomial trial distribution \(p(x)\) over three outcomes 1, 2, and 4, where \(p(1) = 0.3\), \(p(2) = 0.4\), and \(p(4) = 0.3\). What is the expectation or mean of this distribution? What is the variance? What is the standard deviation? (Show how you got these results.)
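As a reminder of the definitions involved, here is a minimal sketch in base R. The outcomes, probabilities, and range are made up for illustration and are not the ones in the questions above, and the variable names are hypothetical.

```r
# Hypothetical discrete distribution (not the one from the questions above).
outcomes <- c(0, 1, 10)
probs    <- c(0.5, 0.25, 0.25)

# A valid probability mass function sums to 1.
stopifnot(isTRUE(all.equal(sum(probs), 1)))

# Expectation: probability-weighted sum of the outcomes.
mu <- sum(outcomes * probs)

# Variance: probability-weighted squared deviation from the expectation.
sigma2 <- sum((outcomes - mu)^2 * probs)

# Standard deviation: square root of the variance.
sigma <- sqrt(sigma2)

# Uniform density on a hypothetical range [0, 0.5]: the density is constant
# inside the range, and the area under it over the range is 1.
dens <- dunif(0.25, min = 0, max = 0.5)
area <- integrate(function(x) dunif(x, min = 0, max = 0.5),
                  lower = 0, upper = 0.5)$value
```

Printing `mu`, `sigma2`, `sigma`, `dens`, and `area` at the console lets you check a hand calculation against the definitions.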

For this part of the homework, you’ll be making use of a data file called `schilling_ling300.txt` (download here). This file was constructed from a freely available dataset of eye movements in reading called the Schilling corpus (Schilling, Rayner, & Chumbley, 1998). This corpus was created when participants read a number of short sentences while their eyes were being tracked. The file I’ve provided you with describes, for a number of words and for each of 9 subjects, three binary eye movement measures that relate to ongoing language processing:

- whether the subject *skipped* that word, which means not directly fixating the word before fixating words past it
- whether the subject made *multiple fixations* on the word prior to moving the eyes to words past it
- whether the subject made a *regression* on the word, which means moving the eyes backwards to previous words, prior to moving the eyes past it

In general, words that take less time to process are associated with more *skipping*, and words that take longer to process are associated with more instances of *multiple fixations* and more *regressions*. For the purposes of this homework, we’ll be looking at the relationships between these three eye movement measures and one measure of how long it takes to process a word: its length in characters. On average, words that are longer take longer to process, so the expectation is that as length increases, words will be associated with less skipping, more instances of multiple fixations, and more regressions.

*Getting to know the dataset without R.* For text-encoded datasets such as this one (and the one we used in class), it’s usually a good idea to look at the dataset in a text editor of some sort before loading it into R. This can help you decide which commands and options to use to load the dataset. If you’re comfortable on a *nix-style terminal, using the `head` command is a great way to look at a dataset. If not, just open this file in your favorite text editor, or open it in RStudio with File -> Open File… to look at it. (Make sure not to accidentally save any changes to the file.)

- You’ll notice that the file has 6 columns, separated by tabs. Each column has a name, given in the first row (*header row*):
  - *subj*. subject number, uniquely identifying participants
  - *word*. word number, in the form [sentence number]_[word number in sentence]
  - *wlen*. length of word, in characters
  - *skip*. a 1 if the word was skipped by this subject and 0 otherwise
  - *mfix*. a 1 if the word was fixated multiple times by this subject and 0 otherwise
  - *fp.reg*. a 1 if a regression was launched from this word back to a previous word by this subject, and a 0 otherwise. If a word was skipped, the value for fp.reg is NA, since if the reader did not fixate the word, they had no opportunity to launch a regression from it
- Looking just at the first 10 lines of this file (so the first 9 lines of data, since row 1 is the header), what range of word lengths does it seem that this dataset is investigating? Also, looking at these rows of data, which outcome seems to be least likely: skipping a word, fixating a word multiple times, or launching a regression from a word?

*Loading data.* Now, create a new R script called `hw1.R`, which you’ll be modifying in the rest of this homework. As you develop this script, remember to add comments beginning with the `#` character whenever you think it would be useful if you wanted to read this script again in a month or two. The first thing you’ll add to the script is a line to read the data file `schilling_ling300.txt` into R. Before you can do this, remember that you’ll need to change the working directory to match the directory that the data file is in. To load the data, you’ll need to choose the appropriate function in the `read_delim()` family from `readr` (i.e., `read_delim()`, `read_csv()`, etc.) and pass it the appropriate arguments. Look at the help for this family of functions by typing `?read_delim` to see the available options. Remember that the most relevant arguments are whether or not the file has a header and what character is used to separate the columns. In case you haven’t seen it before, `'\t'` is a common representation in computer programming languages for the tab character in a string. Assign the result of this function to a new variable called `dat`.

- What type of object does `dat` now store? (Hint: Think about what type of object the `read_table()` family of functions returns.)
- Use the `summary()` function on the new variable `dat` to summarize the data in each of its columns. Of the vector types we discussed in class (`logical`, `numeric`, `boolean`, `character`, `factor`), what type is each of the six columns?
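To make the two key arguments concrete, here is a small sketch that reads a made-up file rather than the homework data. It assumes the `readr` package is installed; the file contents, column names, and the variable names `toy_path` and `toy` are hypothetical.

```r
library(readr)

# Write a tiny, made-up tab-separated file to a temporary location.
toy_path <- tempfile(fileext = ".txt")
writeLines(c("id\tscore\tlabel",
             "1\t0.5\ta",
             "2\tNA\tb"),
           toy_path)

# delim names the column separator; col_names = TRUE treats row 1 as a header.
toy <- read_delim(toy_path, delim = "\t", col_names = TRUE)

class(toy)    # shows what kind of object was returned
summary(toy)  # per-column summaries, including counts of NA values
```

The same pattern should carry over to the homework file once the working directory is set, with the file name swapped in for `toy_path`.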

*Function creation and data transformation.* After loading data into R, one common task is changing the type of some of its columns into ones that represent the underlying data more explicitly. For example, three of the columns of `dat` (`skip`, `mfix`, and `fp.reg`) are represented as numeric variables that just happen to only take two values (0 and 1), but in actuality represent a logical value that is either true or false. It is often a good idea to convert these variables into formal `logical` vectors to make this explicit, so this is what you’ll do next. Since this is something that we’ll want to do to all three of these columns, this is a good candidate for creating a function (so that we don’t need to copy and paste code).

- Define a new function called `to_logical` that takes a single vector argument consisting of 0s and 1s (and possibly NAs) and returns the corresponding `logical` vector of `TRUE`s and `FALSE`s (and possibly NAs). At the beginning of the function, check to ensure that the input meets the function’s assumptions, and emit an error if not. As in class, this can be done by combining an `if` statement with a `stop()` function call in the following form: `if (FALSE) { stop("Error message") }`
  - You’ll need to replace the `FALSE` in the conditional statement above with an appropriate test of whether every element in the vector is 0, 1, or `NA`. Testing whether each element in the vector is equal to 1 can be done with the `==` operator, and similarly for 0. Testing whether each element in a vector is `NA`, however, requires the special function `is.na()`. Thus, you can create three logical vectors that denote whether each element is 0, 1, or `NA`. Then, you can put these together with the logical-or operator `|` to produce one vector that tells whether each element of the input vector is either 0, 1, or `NA`. Finally, to collapse this vector into a single value that specifies whether all of the elements of the vector are `TRUE`, you can use the `all()` function. Note: you’re likely to also want to use the logical-not operator `!` to perform the test you want.
  - You’ll also need to replace `"Error message"` with something more informative about the error.

- Once you’ve finished checking whether the input meets the function’s assumptions, the next step is to code the body of the function, which transforms 1 to `TRUE`, 0 to `FALSE`, and leaves `NA` as `NA`. For this step we can assume that those are the only three possible inputs (thanks to the checking we’ve already implemented). There are two ways to achieve this goal. One is to use the `ifelse()` function (see `?ifelse`), which checks whether each element meets a certain condition (say, being 1), and if so, returns one resulting value (say, `TRUE`), and if not, another resulting value (say, `FALSE`). (The `ifelse()` function, like most elementwise functions in R, leaves `NA` values as `NA`.) A more direct approach, only valid because of our test, is simply to use the comparison operator `==`, which returns `TRUE`s and `FALSE`s (and also leaves `NA` values as `NA`).
- To test your resulting function, call it with the argument `c(0, 1, 1, NA, 0, 1)` to see whether it returns `c(FALSE, TRUE, TRUE, NA, FALSE, TRUE)`. Finally, use this function to transform the three binary columns of `dat`. Using `mutate()` in the `dplyr` package, transform the `mfix`, `skip`, and `fp.reg` columns into logical vectors using your new `to_logical()` function. Make sure to assign the results of the transformation back to the `dat` variable.
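The check-then-convert pattern described above can be sketched on a different coding scheme. This example assumes the `dplyr` package is installed and uses a hypothetical helper, `yn_to_logical()`, that handles `"y"`/`"n"` strings rather than 0s and 1s; the data frame `toy` and its columns are made up.

```r
library(dplyr)

# Hypothetical helper: converts "y"/"n" (and NA) to TRUE/FALSE (and NA).
yn_to_logical <- function(x) {
  # Elementwise tests combined with |. The == comparisons return NA for NA
  # elements, but is.na() is TRUE there, so the combined result is TRUE.
  ok <- x == "y" | x == "n" | is.na(x)
  if (!all(ok)) {
    stop("yn_to_logical() expects only \"y\", \"n\", or NA")
  }
  # Valid only because of the check above: every non-"y" element is "n" or NA.
  x == "y"
}

# Made-up data with two columns to convert.
toy <- tibble(id    = 1:4,
              seen  = c("y", "n", NA, "y"),
              heard = c("n", "n", "y", NA))

# Apply the helper to each column and assign the result back.
toy <- toy %>%
  mutate(seen  = yn_to_logical(seen),
         heard = yn_to_logical(heard))
```

Calling `yn_to_logical(c("y", "z"))` should stop with the error message, which is a quick way to confirm that the check is doing its job.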

*Data aggregation with `dplyr`.* As mentioned above, for this dataset, the main question of interest is how the eye movement measures of ongoing language processing (skip rate, multiple fixation rate, and regression rate) vary with word difficulty, as assessed by word length. In the next problem, we’ll be visualizing these relationships using `ggplot2`. However, it is customary (and very reasonable) in behavioral research on multiple subjects to not plot raw data, but to first aggregate the data to produce means for each condition of interest for each subject, and then plot summary statistics of those means. So, in this problem, you’ll aggregate the data to produce a new data frame `dat_subj` where there is a single mean for each of these three measures for each subject for each possible value of word length.

- Since we want to summarize the data for each subject for each word length, you’ll want to group the data by `subj` and `wlen` using `dplyr`’s `group_by()` function. Assign this new grouped data frame to your new variable `dat_subj`.
- Chained to that first grouping command, use the `summarise()` function (note the spelling!) to create one “new” variable `skip` that is the mean of the old `skip` variable, e.g., `summarise(skip = mean(skip))`. Because the data is already grouped, this will calculate the mean for each group, which is exactly what we want. And since this is chained to the `group_by()` function, it will now be the result of this `summarise()` function that is assigned to the new `dat_subj` variable.
- Once you’ve verified that this is working, you can add the other two variables to the `summarise()` function. Note that because the column `fp.reg` includes `NA` values, you’ll need to add a second argument to the `mean()` function, `na.rm = TRUE` (see `?mean`), to denote that you want to remove the `NA` values prior to averaging (otherwise, the average will also be `NA`).
- At this point, use the `head()` function in R to look at the first few rows of the new data frame `dat_subj`. Verify that the columns `skip`, `mfix`, and `fp.reg` are no longer binary, but now are proportions between 0 and 1.
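The grouping and summarizing steps can be tried out on made-up data first. This sketch assumes `dplyr` is installed; the data frame `toy`, its columns (`person`, `cond`, `hit`, `resp`), and the result `toy_means` are all hypothetical.

```r
library(dplyr)

# Made-up trial-level data: two participants, two conditions, two trials each.
toy <- tibble(
  person = c(1, 1, 1, 1, 2, 2, 2, 2),
  cond   = c("a", "a", "b", "b", "a", "a", "b", "b"),
  hit    = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  resp   = c(TRUE, NA, FALSE, TRUE, NA, NA, TRUE, TRUE)
)

# One row per person-by-condition group; the mean of a logical vector is the
# proportion of TRUE values. na.rm = TRUE drops NA values before averaging.
toy_means <- toy %>%
  group_by(person, cond) %>%
  summarise(hit  = mean(hit),
            resp = mean(resp, na.rm = TRUE))

head(toy_means)
```

One detail worth knowing: when every value in a group is `NA` (as for `resp` in person 2, condition "a" here), `mean(..., na.rm = TRUE)` returns `NaN` rather than a proportion, because nothing is left to average.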

*Plotting.* Finally, using `ggplot2`, we’ll plot our transformed data frame `dat_subj` that we created in the previous problem.

- First, using the `ggplot()` function, set the data argument to `dat_subj` and use the aesthetics function `aes()` to use `wlen` for the x axis and `skip` for the y axis. Add `geom_point()` to get a scatter-plot of the subject means. The result gives a vague sense of the range of the data, but isn’t yet a very useful visualization.
- One problem is that the x axis is a bit odd given its meaning, including a tick mark at 2.5, which is an impossible value given that words can only have a whole number of letters. We can fix this by telling R that `wlen` is not an arbitrary numeric value, but takes discrete levels, i.e., is a `factor`. To fix this, go back to the part of the script that used `mutate()` to transform `skip`, etc., into logical values, and add code to mutate `wlen` into a `factor` using the `factor()` function. Re-run the script and make sure that the x axis now explicitly shows only possible values of word length in this dataset.
- Another problem with this graph is that we don’t have any way of associating the data points that come from the same subject. We can fix this, as we did in class, by adding the aesthetics `color = subj` and `group = subj`. Do this and make sure the result now shows different subjects in different colors.
- You may have noticed that the color scale that was created is very hard to read. E.g., subjects 1 and 2 are very nearly the same color and subjects 8 and 9 are nearly the same color too. This is again because R doesn’t know that the `subj` variable only takes 9 possible values (one for each subject), but is assuming that it can take any value in its range (e.g., 1.42), and so it’s assigning a gradient of colors to this range. To fix this, add more code to the part of the script that transformed the variables with `mutate()` to change `subj` into a `factor` as well. Re-run the script and make sure that the colors chosen are now more reasonable.
- Now, let’s make a second plot that actually computes the mean and standard errors of these by-subject means instead of just plotting every single one of them. This is also very easy with `ggplot2`. Create a new plot, with the same basic aesthetics as before (`x = wlen, y = skip`), but omitting the bit about `subj`. Now, replace the `geom_point()` bit that was added, which just plotted every subject mean as a scatter-plot point, with `stat_summary(fun.data = mean_se)`. Now, instead of directly visualizing the data (what the `geom` did), this function will calculate a statistic of the y values (their mean and standard error of the mean) for each of the possible values of the x variable (`wlen`). Each statistic is associated with a default visualization scheme (here a point with a simple line around it to indicate a range), so this code is all that’s needed to see the means and standard errors of the skip rates for each word length in the dataset. Describe the relationship between word length and skip rate that seems apparent.
- Finally, create two more versions of the previous plot that do the same thing with the other two dependent measures of interest (`mfix` and `fp.reg`) and describe the relationship between word length and each of those two variables that is apparent there.
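The two kinds of plot described above can be sketched on made-up data. This assumes `ggplot2` is installed; the data frame `toy_means` and its columns (`person`, `size`, `rate`) are hypothetical stand-ins, with both grouping variables already stored as factors.

```r
library(ggplot2)

# Made-up by-participant means: three participants, three levels of size.
toy_means <- data.frame(
  person = factor(rep(1:3, each = 3)),
  size   = factor(rep(c(2, 3, 4), times = 3)),
  rate   = c(0.9, 0.6, 0.3, 0.8, 0.5, 0.4, 0.7, 0.6, 0.2)
)

# One point per participant per level, tied together by color and group.
p_points <- ggplot(toy_means, aes(x = size, y = rate,
                                  color = person, group = person)) +
  geom_point()

# Mean and standard error of the participant means at each level of x,
# drawn with the default point-and-range display.
p_summary <- ggplot(toy_means, aes(x = size, y = rate)) +
  stat_summary(fun.data = mean_se)
```

Typing `p_points` or `p_summary` at the console (or wrapping either in `print()` inside a script) draws the plot. Because `size` and `person` are factors, the x axis and the color legend should show only the levels present in the data.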
