[Version 0.0.3] Last compiled: July 02, 2025 12:16 (ADT)

1 Introduction

Per the prior page, the following is a basic (and crude) crash course in Data Analytics with R, covering common tasks needed for data and statistical analysis. It is by no means exhaustive and (in all honesty) will have some parts that may be outdated (or soon to be), in part due to the rapidly evolving nature of R and its associated packages. I have worked with R for the better part of 20 years and, as such, some of the things written here reflect me being set in my own ways and not bothering to find the newest tools (if it ain’t broke, I’m not going to fix it). It is assumed that you have a basic understanding of the theory/assumptions of most of these tests; therefore, the primary purpose is to expose you to the specific steps necessary to perform these analyses (and it is expected that you will do your due diligence when it comes to verifying the veracity of your results).

With that said, there are plenty of times where I’ve stumbled across packages that are/were far superior to code I had written myself because, at that point in time, such a package did not exist. Even during the process of writing this document, I have come across packages I was unaware of. Tangent aside, I am writing this simply to encourage you to search for new packages in the various R repositories, given R’s constant evolution.

As a starting point for anyone new to R, the following are some important resources that I encourage you to look at:

  1. RStudio: An IDE that was developed to support R. RStudio includes a console, syntax-highlighting editor that supports direct code execution, and a variety of robust tools for plotting, viewing history, debugging and managing your workspace.
  2. Rmarkdown: R Markdown (a markup language for creating formatted text using a plain-text editor) provides an authoring framework for writing documents that can incorporate R codes that can be saved and executed to generate high quality reports that can be shared with an audience. This document was written in RMarkdown; however, as a consultant, I’ve written reports to clients that, for example, were updated every 6 months with new data and I only had to make minor edits to the report because all the basic stats were automatically regenerated once I recompiled the document.
  3. R for Data Science: Several parts of this document will probably direct you to “R for Data Science” (Hadley Wickham), which is an amazing resource for introductory data science. I use this resource, which is open-source, for many courses I teach and to train my own research students.

This document will be updated continuously. Please look at the Version History on the main page to see updates/changes made.


2 R Overview

The following is a quick introduction to R. I am assuming that you already have it installed; however, you can follow the general installation process specified on the R website, and you can download RStudio from its website. The following is a screenshot of RStudio taken from the Wiki entry. As mentioned above, R is the actual language, while RStudio is a development environment that is substantially prettier and nicer to work with.


Figure 2.1: RStudio: Default look that can be customized in a variety of ways.


By default, the bottom left window of RStudio, which is called the Console, will have a greater than sign:

>

The Console is where you submit the code (i.e. prompt R to do a command) and to get R to execute it (i.e. run the command) you simply hit the [Enter] key. Alternatively, if you are in the Code window (typically in the top left), you can hit either the [Run] button to run the code where the cursor is currently located, or you can hit [Source] and submit everything in the Code window.


2.1 Basic Functions

One nice thing about R is that it is not only a statistical program but also a language, which we’ll get into later. Let’s first get you familiar with R by using it like a calculator. If we let x and y be two variables, then the most common functions are used like so:

Table 2.1: Basic calculator functions in R
Function                Command
Addition                x + y
Subtraction             x - y
Multiplication          x * y
Division                x / y
Exponential (e^x)       exp(x)
Logarithm (base 10)     log10(x)
Natural log             log(x)
Power                   x^y
TIP: If by chance you made a mistake on the command line and want to cancel a process that you were working on, you can either hit the [Esc] key, Ctrl + C, or ⌘ + C, depending on your OS.

So if we want to do some basic calculations in R such as: 2 + 5, 2 * 5, or 2 / 5 we’ll run the following code:

2 + 5
[1] 7

The first line is the code and after we hit [Enter] we get the second line. The first bit isn’t really important and the second half is our result. For those of you really interested, the [1] is a position indicator that helps keep track of the index if we have a long vector; more on that later. Two more examples:

2 * 5
[1] 10
2 / 5
[1] 0.4

should produce 10 and 0.4 as their respective solutions. Please note that R does not care about whitespace, so whether you have one space, no spaces, or multiple spaces between values and operators, R will treat them all the same.
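For completeness, here is a quick look at the remaining entries of Table 2.1 in action (a minimal sketch; the inputs are arbitrary values chosen for illustration):

exp(1)        # e raised to the power 1
[1] 2.718282
log10(100)    # base-10 logarithm
[1] 2
log(100)      # natural logarithm
[1] 4.60517
2^5           # 2 raised to the power 5
[1] 32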


2.2 Vectors and Matrices

Continuing where we left off, what we’ve worked with above are scalar values, i.e. single values. We can also manipulate vectors (a group of scalar values) in a similar fashion. To do this in R we use the function c() to concatenate the scalars together. For example:

c(1,2,3,4,5)
[1] 1 2 3 4 5

and so if we wanted to take the sequence of 1 to 5 and add a constant we’d get:

c(1,2,3,4,5) + 2
[1] 3 4 5 6 7

Similarly, if we wanted to manipulate the vector, such as squaring it, we do:

c(1,2,3,4,5)^2
[1]  1  4  9 16 25

and, not surprisingly, to multiply all elements by 2:

c(1,2,3,4,5)*2
[1]  2  4  6  8 10

Next up after a vector is a matrix, which is a 2D array of values (you can think of a vector as a 1 row by n column matrix). Let’s say we want to create a 3 by 4 (3 rows by 4 columns) matrix; we can do this with the command matrix(). In the following code, the first input is the “data”, the second input (separated by a comma) is the number of rows, and the third is the number of columns:

matrix(1:12,3,4)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

If you want to fill across the rows instead, you can use the option byrow = TRUE like so:

matrix(1:12,3,4,byrow=T)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

Similar to vectors, we can manipulate matrices and add a constant or square them like so:

matrix(1:12,3,4,byrow=T) + 2
     [,1] [,2] [,3] [,4]
[1,]    3    4    5    6
[2,]    7    8    9   10
[3,]   11   12   13   14
matrix(1:12,3,4,byrow=T)^2
     [,1] [,2] [,3] [,4]
[1,]    1    4    9   16
[2,]   25   36   49   64
[3,]   81  100  121  144
TIP: If you hit the up arrow (↑) on your keyboard while it is on the > of the command line it will go through the history of everything you previously ran during your session.

TIP: Sequences are something that you will use often and may find useful. However, typing out, for example, every value from 1 to 100 would be time consuming. To get around this we can tell R to do it using two different functions. The simplest is the colon symbol :, where R will give you each value from the start point to the stop point (separated by the colon) using a unit step of 1. For example,

1:10
 [1]  1  2  3  4  5  6  7  8  9 10

Alternatively, you can use the function seq(), which allows you to specify the size of the steps, which by default is 1. For example,

seq(1:10)
 [1]  1  2  3  4  5  6  7  8  9 10

which is the same as 1:10. However, if you wanted the odd numbers between 1 and 10 you can do

seq(1,10,2)
[1] 1 3 5 7 9
where the third input is the step-size.


2.3 Saving Objects (i.e. all your work)

Be efficient and save objects, and by objects I mean variables, constants, and everything you’ve been working with. A constant theme in computer science is DRY (Don’t Repeat Yourself). We already have a method for that by cycling through the history (saved in the .Rhistory file), but saving objects in R that you created makes things a lot easier and allows us to be more efficient than having to punch in the same commands over and over again.

We can assign anything we want to a variable by using one of two commands: either the equal sign (=) or the less-than symbol (<) combined with a dash (-) to make a left-arrow (<-). For an object name, you can use almost anything you want, with the exception of certain symbols and whitespace (periods and underscores are allowed), and the first character has to be non-numerical, i.e. a letter. Let’s say we want to save all the odd numbers from 1 to 100; we can do this by using either command:

x <- seq(1,100,2)
x = seq(1,100,2)

which will both assign the values 1, 3, …, 99 to the variable (or object) x. Now, if we type x into the command line and hit [Enter], we’ll see our sequence.

 [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
[26] 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99
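Before moving on, here is a quick illustration of the naming rules mentioned above (the object names here are made up for demonstration):

my.results_2 <- c(1,3,5)   # periods, underscores, and digits after the first character are all fine
2results <- c(1,3,5)       # not allowed: names cannot start with a number, so R throws an error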

Later on we’ll talk about how we can manipulate data, but before we do that let’s save the matrix we created a moment ago:

mat <- matrix(1:12,3,4,byrow=T)
TIP: Every function that requires an input has each value separated by a comma and requires a specific order. For example, in matrix the first input has to be the data we are manipulating, the second is the number of rows, and the third is the number of columns. They also have default values for several options, and so for the option byrow in the function matrix() the default is FALSE. To understand the input order, and in general the function better, we can use the help() function or put the ? symbol in front of the function. For example, help(matrix) or ?matrix will bring up R’s manual for matrix.

TIP: Another reason I like using the arrows is that if I am just typing the code into the console (and not a script) but forget to create an object to save the output to, I can use a right-arrow (->) and add the object name afterwards:

seq(1,100,2) -> x
You cannot do this with the = sign since R “reads” from left to right, which allows the right-arrow to “point” the object into the specified name.


2.4 Exiting / Saving your Data

At this point of time since we are learning how to save an object, we should address how to save your data and exit so you can continue on another day. Now, let’s say that you have a slow or cantankerous computer and you don’t want to lose information – especially the night before an assignment is due. The nice thing about RStudio is that it will save anything in the Source code editor, even if you have not officially saved it. However, you should make a habit of saving any code you write for your own records and as a point of reference. Regarding the objects you have been creating/saving, and the whole R environment, to save that work you can use the function save.image() like so:

save.image()

Entering it into the Console and hitting [Enter] will (or at least it should) save all the objects you’ve created. Alternatively, you can just exit R and it will ask you if you want to save your work. To exit/quit R you use the q() function, after which you will be prompted with the following:

q()
Save workspace image? [y/n/c]: 

If you type in y, it will exit R, close the program, and save anything you have done during this session. If you type in n, it will exit R, close the program, and ANYTHING YOU HAVE DONE WILL BE LOST. If you type in c, it will continue with the R session. The latter option lets you continue any work you are doing just in case you changed your mind. As an alternative to continuing, you can just hit the [ESC] key and it “should” do the same thing.

Please note that:

  1. When you save your workspace or workspace image, R saves it into a file called .RData, which will be located in the current directory you are working in.

  2. If you are a Mac user, anything with a dot at the beginning of the name will often be hidden from plain view. So, if you go to the directory (e.g. Desktop) you may not be able to see the file there, even though it is there. This is not the smartest thing in the world, and you can get around it by using the save.image() function and giving the file an actual name (see the short example after this list). However, R loads the file .RData by default, so until you get familiar with R I would stay clear and leave things as defaults.

  3. By default, when you open RStudio, it will go to its default location and automatically load the .RData file without prompting. This is nice, because then you don’t have to do much, but annoying when you want to organize things. There are two ways around this when you run R/RStudio:

    1. Change the default location. If you want to start organizing things, you can go to: [Tools] → [Global Options…] Under General, you should see the first option for the Default working directory. There is nothing wrong with this, but I would stay clear of this option until you get more comfortable with RStudio’s environment.

    2. Change your location through setwd() or the GUI method discussed at the start of Section 3, and then use the command: load('.RData'), which will load the file (if it exists); this is the safest way to go while you get used to R.
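As a short sketch of the “give it an actual name” suggestion from point 2 (the file name here is made up; any name ending in .RData works):

save.image('hotdog_analysis.RData')   # save every object in the workspace under a recognizable name
# ...in a later session, after setting the working directory to the same folder...
load('hotdog_analysis.RData')         # restores all of the saved objects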


2.5 What does this code do again?

Something that you will learn, or maybe already did from previous programming experience, is that when you write code you can often forget what a particular function does or even what you were attempting to do. Even the most experienced programmers forget sometimes, and so commenting is one of your best friends. A comment is literally a note you can leave yourself that the program will not attempt to execute. For R, if we decide to write our code in a document and submit (or source) it to be executed in R (we’ll discuss this later), you can use the number symbol (#), or as you kids call it a hashtag, to tell R to ignore anything after the #. Please note that in R a comment only applies to the current line. For example,

mat <- matrix(1:12,3,4) # matrix function: value / row / column

the code and the comment are all on the same line, so everything after the # is ignored; however, if you did:

mat <- matrix(1:12,3,4) # matrix function:
value / row / column

because the line is broken, R will attempt to submit value / row / column, which to it is a sequence of divisions. However, this is really only an issue if you are submitting an R script or if you copy/paste the code, so it should be obvious there is something wrong if you are typing the commands directly into the R console.


2.6 Random numbers

In this part we’re going to show you how to generate random data, in particular random data from common distributions, but before we do that we need to introduce the set.seed() function. In most computer languages we can generate random numbers (and random data); however, the crux of generating random data is how to get the same results for comparative purposes, and this is where the set.seed() function comes in. In the simplest terms, the set.seed() function allows us to consistently get the “same” randomly generated data (not very random, but it has a purpose). For example, if you write the following code:

for(i in 1:10) print(sample(1:100,1))
[1] 64
[1] 51
[1] 55
[1] 29
[1] 24
[1] 1
[1] 83
[1] 80
[1] 17
[1] 63

You’ll probably get different values than the list above; however, if you repeat the previous code with the following modification, you’ll (hopefully) get the exact same values as me:

set.seed(2018)
for(i in 1:10) print(sample(1:100,1))
[1] 15
[1] 55
[1] 3
[1] 12
[1] 18
[1] 85
[1] 73
[1] 72
[1] 32
[1] 32

This little function allows us to create reproducible code. The only caveat is that, when you are using the set.seed() function, order is very important: if you set the seed, run some random-number-generating code like sample(), and then make a mistake, you’ll need to reset the seed and rerun everything in the same order.
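A small sketch of that point: resetting the seed before the same sequence of calls reproduces the same draws (the seed value 2018 is just reused from above):

set.seed(2018)
a <- sample(1:100, 5)
set.seed(2018)
b <- sample(1:100, 5)
identical(a, b)   # TRUE, because the seed and the order of calls are the same
[1] TRUE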

R Code: The for-loop allows us to run the same code over and over again until a specified stopping point. In R, a statement such as the one above can be written on a single line and executed without issue. However, if you need to break the code into separate lines for whatever reason (including complexity), you need to use an opening and closing brace (i.e. { and }). Test out the following as one continuous line:

for(i in 1:10) if(i %% 2) cat(i,"is an odd number\n") else cat(i,"is an even number\n")
1 is an odd number
2 is an even number
3 is an odd number
4 is an even number
5 is an odd number
6 is an even number
7 is an odd number
8 is an even number
9 is an odd number
10 is an even number

and then test out the following as two lines of code:

for(i in 1:10) if(i %% 2) cat(i,"is an even number\n")
else cat(i,"is an odd number\n")
1 is an even number
3 is an even number
5 is an even number
7 is an even number
9 is an even number
Error: unexpected 'else' in "else"

In the first set of code, everything runs smoothly while in the second, you get the odd numbers but then an error message about an unexpected else. To break lines but maintain the code, we need to use { and } after we start the for code:

for(i in 1:10){
if(i %% 2) cat(i,"is an odd number\n")
else cat(i,"is an even number\n")}
1 is an odd number
2 is an even number
3 is an odd number
4 is an even number
5 is an odd number
6 is an even number
7 is an odd number
8 is an even number
9 is an odd number
10 is an even number
which maintains the loop. Note: %% is the modulus operator, if-else is the standard if-else statement, cat() concatenates text and objects and prints them, the resulting 1 or 0 from the modulus acts as TRUE/FALSE for the Boolean statement, and sample() draws a value between 1 and 100.

Now that we know how to generate random data, let’s generate some from the classic normal distribution. Again, though, for consistency, let’s use the set.seed() function again:

set.seed(357)
norm.dat <- rnorm(100)

This will “randomly” generate 100 values from normally distributed data with \(\mu= 0\) and \(\sigma = 1\) (i.e. the standard deviation, not variance) by default, which we’ll use in the next part.
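If you want something other than the standard normal, the mean and sd arguments can be set explicitly (a minimal sketch; the object name extra and the values 10 and 2 are arbitrary):

extra <- rnorm(5, mean = 10, sd = 2)   # five draws from a Normal with mean 10 and standard deviation 2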


2.7 Descriptive Statistics

As you learned in introductory statistics, there are multiple descriptive statistics of interest – just see the first Table of any scientific article and it will usually give you information about the mean and SD (or median and IQR) for each available variable, or the number of measurements through either a contingency table or frequency table. The following are some of the most common descriptive functions you’ll use:

Table 2.2: Basic statistics functions in R
Function                                   Command
mean                                       mean(x)
median                                     median(x)
standard deviation / variance              sd(x) / var(x)
interquartile range                        IQR(x)
quartiles, mean, and min/max               summary(x)
quartiles (0\(^{\text{th}}\) to 100\(^{\text{th}}\) percentiles)   quantile(x)
min/max                                    min(x) / max(x)
both min and max                           range(x)
the number of elements in an object        length(x)

As you can see, a lot of them are very self-explanatory/intuitive. For the most part, you do not need to do anything to these functions except call them and specify the object you want to apply them to. For example,

mean(norm.dat)
[1] -0.041074
sd(norm.dat)
[1] 0.8862014
var(norm.dat)
[1] 0.7853529
IQR(norm.dat)
[1] 1.047742
quantile(norm.dat)
         0%         25%         50%         75%        100% 
-2.30310514 -0.54234602 -0.09202726  0.50539606  2.70309602 
range(norm.dat)
[1] -2.303105  2.703096

This is why the set.seed() function is important: if you did things slightly differently (i.e. out of order) or did not set the seed to 357, you would get different numbers.
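The remaining entries of Table 2.2 work the same way; for example, the following values line up with the range() output above (a quick check on the same norm.dat object):

length(norm.dat)
[1] 100
min(norm.dat)
[1] -2.303105
max(norm.dat)
[1] 2.703096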


2.8 Basic figures and plots (Part I)

The 3 most common figures used in introductory statistics are the boxplot, histogram, and scatterplot. The commands for these should not be surprising: boxplot, hist, and plot. We will review the basic implementation of these plots and let you figure out the small nuances of customizing these figures (e.g. titles, axes, labels, colours, etc.). Just like with the descriptive functions above, all you need to do is call the function and specify the object you want to apply it to:

boxplot(norm.dat)

Figure 2.2: Simple Boxplot

As can be seen in the figure above, the whiskers of the boxplot reach just above 2 and just past -2. This should not be surprising, why? If we recall how a boxplot works, the whiskers extend either to the fences at the lower quartile - 1.5·IQR and the upper quartile + 1.5·IQR, or to the min and max, whichever is less extreme. We determined the IQR to be 1.0477, so 1.5·IQR = 1.5716, and therefore Q1 - 1.5·IQR = -2.1140 and Q3 + 1.5·IQR = 2.0770, which are less extreme than the minimum (-2.3031) and maximum (2.7031) values. As such, the two whiskers stop near the fences, and the two points beyond them, the min and max, are flagged as potential outliers.

Based on the boxplot, the data look relatively normal with a slight skew to the right - notice that the median, which is the thick black line in the boxplot, isn’t exactly halfway between the bottom and top of the box, which represent the 25th and 75th percentiles. If the data were symmetric, as is the case for normally distributed data, the distance between the 25th and 50th percentiles should be the same as the distance between the 50th and 75th percentiles. With that said, we can check this with a histogram:

hist(norm.dat)

Figure 2.3: Simple Histogram

Part of the reason why it doesn’t look perfect, despite being generated from the normal distribution, is the sample size. If we redo this using a sample size of 10,000 and setting the seed to 0, then we get a histogram that looks closer to what we would expect.

set.seed(0)
hist(rnorm(10000))

Figure 2.4: Simple Histogram (larger n)

To get a scatterplot we need to use the plot() function, which needs two inputs (one for each axis). Although we do not have a second input (i.e. set of data) we can just plot it based on order of generation, i.e. from 1 to 100, as the data for the other axis.

plot(1:100,norm.dat)

Figure 2.5: Random scatterplot

Not surprisingly, there is no discernible pattern, nor should there be given that the data were randomly generated. At this point, I would suggest checking the R manual for these 3 functions to learn about their specific options and how to modify the graphs to your liking, e.g. setting a title or changing the axis labels.
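As a hedged illustration of that kind of customization (the title, labels, and colour below are arbitrary choices, not anything required):

plot(1:100, norm.dat,
     main = 'Random normal draws',      # main title
     xlab = 'Order of generation',      # x-axis label
     ylab = 'Value',                    # y-axis label
     pch = 19, col = 'steelblue')       # filled points in a different colour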

3 Reading and Manipulating Data

Now that we’ve created some data, it’s time to read in data. The following Google Drive folder has multiple datasets, including hotdogs.txt and hotdogs.csv; these contain the exact same data but in different formats. For simplicity, download them to the Desktop of the computer you are using.

There are two methods in R for setting the working directory: the first uses the command setwd() and the second is a much easier point-and-click method using the GUI. For PC users, your code should look something like:

setwd('C:/Users/username/Desktop')

Please note that you’ll replace the text “username” with your actual username. For Mac users, the code would look something like:

setwd('/Users/username/Desktop')

WARNING: One issue you may come across is a difference in formatting and character encoding. This often happens if you copy/paste code from elsewhere and there is a hyphen (-) or single/double quotes (’ vs. “) in it; it is possible you get:

Error: unexpected input in

which is due to a formatting issue. You may have noticed that in rich text the hyphens and quotes are sometimes angled or directional, which does not exist in R, but the copy/paste may carry that formatting over. As a result, R does not know how to handle it and so it gives an error.

To set the working directory through the GUI, you can use the following steps in RStudio:

[Session] → [Set Working Directory ] → [Choose Directory…]

and select the Desktop; by doing so, you tell R where the file is. If you do not set the directory, or do it incorrectly, you will get an error message about “cannot open the connection” and “no such file or directory” when you attempt to import a file into R. If you look at the Console window in RStudio, you should see the current location next to the word Console. Alternatively, you can run getwd(), which will output the working directory.

getwd()
[1] "/Users/dlee/Desktop"


3.1 Reading/Importing Data

Now that R knows where to look, let’s introduce the read functions. There are multiple ways to read data in R, which are dependent on the type and format of the data, with the most basic command for reading in data being the read functions – there are several options tailored to different formats of data. Please note that you should always know what the dataset looks like in terms of formatting. In the case with hotdogs.txt the format is space/tab-delimited whereas hotdogs.csv is comma delimited.

To read in hotdogs.txt we can use the following command:

dat1 = read.table('hotdogs.txt', header = TRUE)

The first thing you should notice is the suffix at the end of the read function, .table; we’ll get to that in a moment. This command tells R that there is a header and that the separator between columns of data is either a tab or white space (the default value). The header option works under the assumption that IF the option for header is TRUE, then the first line indicates the names of all the columns. We’ll discuss this in more detail in a moment. For hotdogs.csv we can use either of the two codes below:

dat2 = read.table('hotdogs.csv', header = TRUE, sep=',')

or alternatively we can use the following code:

dat2 = read.csv('hotdogs.csv', header = TRUE)

The first thing you should notice is the different suffix at the end of the two read functions. In the second command, the suffix is .csv, which tells R that the separator between columns/the format is a comma, i.e. CSV (comma-separated values). In the first command, we need to tell R that the separator is going to be a comma by invoking the option sep=','. With the second code, R already knows the dataset uses commas to separate each variable, so we don’t need the separator option.

Please note that:

  1. If you use read.table() and do NOT specify a separator, it will default to white space, and so when you read in hotdogs.csv it will look weird.

  2. If you do NOT specify a header and there is one, you will gain a line of information in your data that may not make sense when you analyze the data, i.e. have a column of data where the first element is a character string while the rest are numerical.

  3. If you do specify a header and there is NONE, you will lose a line of information because R will consider the first row the header. For example, if you have 100 lines of numeric data and you tell R that there is a header when it reads it in, it will consider row 1 as the header and you’ll end up with 99 lines of numbers.
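If you are unsure whether the header (or separator) was handled correctly, one quick check beyond head() is str(), which prints each column’s name and type (a minimal sketch; a header accidentally read in as data typically shows up as character columns where you expected numbers):

str(dat2)   # compact summary: number of observations, and the name and type of each column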

TIP: Curious if you read in your data correctly? Here’s a tip: use the function head(), which gives you a preview of the first 6 observations in the dataset; with the option n = you can specify any number you want. For example, if you wanted the first 10 observations of the dataset dat2,

head(dat2, n = 10)
   Type Sodium Calories
1  Beef    495      186
2  Beef    477      181
3  Beef    425      176
4  Beef    322      149
5  Beef    482      184
6  Beef    417      190
7  Beef    370      158
8  Beef    322      139
9  Beef    479      175
10 Beef    375      148
TIP: Let’s say you mistyped the name of the variable that you saved data to. For example, let’s say you saved the last dataset to dst2 instead of dat2. If you type in dat2 and hit [Enter] you will get the error Error: object ‘dat2’ not found. What do you do to figure out what the variable actually was, besides looking through all the commands you ran? You can use the ls() function to list everything in the workspace.
WARNING: Please note that R is case-sensitive, and so there is a difference between dat2 and DAT2. If you used the latter you would get a similar error to the one above.

TIP: Want to know what the bottom looks like? The function tail() gives you a preview of the last 6 observations in the dataset and, similar to head(), if you use the option n = you can specify any number you want. For example, if you wanted the last 10 observations of the dataset dat2,

tail(dat2, n = 10)
      Type Sodium Calories
45 Poultry    457       99
46 Poultry    428      107
47 Poultry    313      113
48 Poultry    426       75
49 Poultry    413      112
50 Poultry    358       76
51 Poultry    481      110
52 Poultry    488       70
53 Poultry    422       66
54 Poultry    445       54
This function is quite useful when reading in Excel files and (for example) someone calculated some row or column means and the last row is not “actual” data.


3.2 Data Manipulation (Part I)

Let’s start by going back to the hotdogs data that we have saved as dat1 and dat2, recalling there is no difference between the two except that they were created through different functions based on the file format. The dataset is a sample of hotdogs tested for sodium and calorie count; there are 20 beef, 17 mystery meat, and 17 poultry hotdogs. Again, we can use the function head() to get a preview of the dataset. We can see the dataset has 3 columns: type of hotdog (Beef, Meat, or Poultry), the amount of sodium, and the calorie count.

Now that we have the dataset up and running, this section will discuss how to manipulate it, which can be done in a number of ways depending on the structure and type of data. The four most common ways (at least for me) to isolate specific pieces of information in R are:

  • attach() and detach() function
  • with() function
  • manipulating the data structure of a data frame by using the $ symbol
  • matrix properties of a data frame by using coordinates

The attach() function works in conjunction with the headers associated with a dataset, in this case the names Type, Sodium, and Calories. But before we attach our dataset, let’s first run ls() to see the names of all the objects in the workspace.

ls()
[1] "dat1"     "dat2"     "i"        "mat"      "norm.dat" "x"       

We should see a couple of different objects but none should have the names Type, Sodium, and Calories, and if we entered either of those terms into the console and hit [Enter] we should get an error about the object not being found. To attach the dataset:

attach(dat2)

Again, if we ls() we would see the exact same thing as before but this time if we entered any of the 3 headers into the console there is no error:

Calories
 [1] 186 181 176 149 184 190 158 139 175 148 152 111 141 153 190 157 131 149 135
[20] 132 173 191 182 190 172 147 146 139 175 136 179 153 107 195 135 140 138  91
[39] 132 102  88  94 102  87  99 107 113  75 112  76 110  70  66  54

What just happened? When you attach a dataset, you create temporary or pseudo-variables that correspond to the header names of the dataset. Similarly, if you entered either of the other two header names you would get the data associated with that column in the dataset. Do you notice the text [1], [20], and [39] above? (In your actual Console you’ll probably get something different but similar, as this text will differ depending on the width of your Console window.) These numbers indicate a position in a long vector, and so the number immediately to its right corresponds to that position in the vector. For example, the number 186 is in the 1st position of the vector and, similarly, the number 190 is in the 15th position. Going back to our earlier example with the calculation of 2 + 5, we saw 7 with [1], indicating that the value 7 was in the first position of the vector. Given that the vector was only 1 element long, the point is moot, but you get the idea/purpose behind this.

TIP: As we’ve seen with the attach() function, we can create temporary/pseudo-variables that attach the data from a data frame to the headers associated with it. What happens if we don’t know all the names of the headers? Using the names() function on any data frame will automatically output all the names associated with it:

names(dat2)
[1] "Type"     "Sodium"   "Calories"
which is what we expected. So, this is a nice option when you don’t know or forgot the names of all the headers in a dataset, especially when it’s a particularly large dataset.

In R, we can combine several options together. Let’s say that we only want a preview of the information in the column Calories instead of all the information, then in R we can combine the pseudo-variable with the head() function like so:

head(Calories,n=10)
 [1] 186 181 176 149 184 190 158 139 175 148

Once we are done, we can detach the dataset by using the function detach(), putting the name of the original dataset between the two soft brackets:

detach(dat2)

although calling detach() with nothing between the brackets works just the same here. When we use this command, we “remove” the temporary/pseudo-variables, and if we type Calories into the Console it will give us an error message about the object not being found.

Another way to do it is by using the with() function, which follows a similar process as the attach() function, except you have to specify the dataset of interest. For example, the following codes produce the same results as we just saw above:

with(dat2, Calories)
 [1] 186 181 176 149 184 190 158 139 175 148 152 111 141 153 190 157 131 149 135
[20] 132 173 191 182 190 172 147 146 139 175 136 179 153 107 195 135 140 138  91
[39] 132 102  88  94 102  87  99 107 113  75 112  76 110  70  66  54
with(dat2, head(Calories))
[1] 186 181 176 149 184 190
head(with(dat2, Calories)) # This does the same thing as above, just changing the order
[1] 186 181 176 149 184 190

The third method is to manipulate the structure of the data frame through the $ symbol. To use the $ symbol we type the name of the dataset, followed by the $ symbol, and then the name of the header we are interested in. For example:

dat2$Calories
 [1] 186 181 176 149 184 190 158 139 175 148 152 111 141 153 190 157 131 149 135
[20] 132 173 191 182 190 172 147 146 139 175 136 179 153 107 195 135 140 138  91
[39] 132 102  88  94 102  87  99 107 113  75 112  76 110  70  66  54

The fourth and final option plays with matrix properties. To do this imagine the 3x4 matrix we made earlier. With element extraction, we use coordinates and hard brackets to extract specific data. In my opinion, this is the most flexible and useful bit of code when manipulating data in R. To do this we’ll first look at a vector we’ll create:

x <- seq(1,100,2)
x
 [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
[26] 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99

Let’s say we just want the last number only (which is the 50th element), we do:

x[50]
[1] 99

TIP (Lengths and dimensions): How do we know how long a vector is, i.e. how many elements are in it? Let’s say we have a vector called Calories and a matrix called mat, then to determine the number of elements in Calories or the dimensions of mat (or any data.frame, including a dataset) by using the command:

length(Calories) # No. of elements in the vector
dim(mat)  # nxm dimensions
nrow(mat) # No. of rows
ncol(mat) # No. of columns

We call the object and get it to look for a particular position within the vector by using the left [ and right ] hard bracket. Now we’ll look at the matrix mat we saved previously.

mat
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12

As we can see, we have a matrix with 3 rows and 4 columns. Let’s say that we want the value in the 2nd row and 3rd column, which is the value 7. We get it by:

mat[2,3]
[1] 7

Similar to looking for the 50th element in x we need to input a position into mat; however, since it is 2D, we need a position for row first, then followed by a comma, and then a position for the column. If, though, we were interested in the entire third row, we could use the following:

mat[3,]
[1]  9 10 11 12

where by leaving the column input blank, it tells R I want all of the columns attached to the specified row. Now that we know about how to use element extraction let’s go back to manipulating our dataset. Again, through the matrix properties of a data frame we can get any piece(s) of information we want. Recall: Calories is in the third column in dat2, thus we can use the following code to extract the appropriate data:

dat2[,3]
 [1] 186 181 176 149 184 190 158 139 175 148 152 111 141 153 190 157 131 149 135
[20] 132 173 191 182 190 172 147 146 139 175 136 179 153 107 195 135 140 138  91
[39] 132 102  88  94 102  87  99 107 113  75 112  76 110  70  66  54

Similar to how we manipulated the matrix mat, we need coordinates for the row and column, but in this case, we want all the rows and only the third column. By now leaving the row position empty it tells R we want all the row information for the specified column.

NOTE: If we left the row and column positions both empty it would be the same as if we just typed in dat2 and hit [Enter] because if both row and column positions are empty it is asking for all the rows and all the columns, i.e. the entire dataset.


3.3 Quick (Contingency) Tables

In any publication, tables and graphs are some of the most important elements because they convey so much information without the reader actually having to read the paper; I’ll admit I often don’t read the entire article and instead jump to the tables/graphs, their summaries, and the conclusion. Let’s first summarize information into a table, i.e. a count table, that will tell us how many hotdogs of each type we measured. To do this in R we use the table() function:

with(dat2,table(Type))
Type
   Beef    Meat Poultry 
     20      17      17 

You don’t have to use the exact same method to isolate the Type of hotdog, you can try the attach() function or any of the other 2 options, if you wish.

One of the most common tables you will ever see is the contingency table, which is an \(n\) x \(m\) (or \(r\) x \(c\)) table that counts the number of occurrences. If you have some experience with probabilities, it is like the intersection between events (A ∩ B). Let’s say that a hotdog is considered healthy when the calorie count is less than or equal to 150 per hotdog. We could go and count the numbers by hand, or we can do the following:

cont <- with(dat2,table(Healthy = Calories <= 150, Type))
cont
       Type
Healthy Beef Meat Poultry
  FALSE   11    9       0
  TRUE     9    8      17

The way it works is that the table() command can evaluate the Boolean expression; in this case, the expression that a hotdog is healthy if its calorie count is ≤ 150. As can be seen, there are 0 unhealthy poultry hotdogs (Healthy = FALSE) and 17 healthy poultry hotdogs (Healthy = TRUE), and so forth. An alternative method would have been to create a new variable within the dataset dat2.
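As a rough sketch of that alternative (working on a throwaway copy tmp so dat2 is left untouched; LowCal is just a made-up column name):

tmp <- dat2
tmp$LowCal <- tmp$Calories <= 150       # TRUE/FALSE column stored directly in the data frame
with(tmp, table(LowCal, Type))          # same counts as the cont table above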

Additionally, we can add more “layers”/dimensions to the table by simply adding in another variable. In this case, since we have another variable, Sodium, we could add a low-salt dimension like so:

with(dat2,table(Healthy = Calories <= 150, LowSalt = Sodium < 400, Type))
, , Type = Beef

       LowSalt
Healthy FALSE TRUE
  FALSE     9    2
  TRUE      1    8

, , Type = Meat

       LowSalt
Healthy FALSE TRUE
  FALSE     6    3
  TRUE      2    6

, , Type = Poultry

       LowSalt
Healthy FALSE TRUE
  FALSE     0    0
  TRUE     10    7

and we could clean this up a little by relabelling the TRUE and FALSE to (for example) reflect whether the hotdogs are Healthy or Low in Salt by using ifelse statements; however, we’ll save that for one of the next sections about Boolean statements.


3.4 Basic figures and plots (Part II)

Continuing with some other basic figures and plots, an appropriate one to introduce here is the bar chart, which we implement through the barplot() function:

barplot(with(dat2,table(Type)))

Figure 3.1: Basic barplot

We can combine a contingency table with the barplot() function, which will not only specify the healthy hotdogs, but also the unhealthy hotdogs:

barplot(cont)

Figure 3.2: Contingency barplot

Now we know that all the poultry hotdogs are healthy (i.e. the light grey portion), but let’s make it clearer by putting a legend into the figure.

barplot(cont, legend.text=c('Unhealthy','Healthy'))

Figure 3.3: Contingency barplot with legend

Note that the option legend.text has “Unhealthy” and then “Healthy”; this corresponds to the order listed in the contingency table, which has Healthy = FALSE, i.e. unhealthy, first and then Healthy = TRUE. However, although we have a description of the proportion of healthy vs. unhealthy hotdogs, since we have continuous data it may also be useful to view the actual data, as opposed to categorizing it based on some definition. The standard boxplot can be produced by the following code:

with(dat2,boxplot(Calories))

Figure 3.4: Boxplot of calories

and histogram by:

with(dat2,hist(Calories))

Figure 3.5: Histogram of calories

We can now combine the graphs side-by-side. To do this we’ll use the par() function that essentially allows you to partition the graphics window and by using the option mfrow or mfcol you can tell R how many columns or rows it should split the screen into. For this example, we’ll use the mfrow option to create a single row of graphs split among 2 columns:

par(mfrow=c(1,2))
with(dat2, boxplot(Calories))
with(dat2, hist(Calories))

Figure 3.6: Combining figures

It should be noted that, right now, the display is split into 1 row by 2 columns, so each new plot will fill the next of those partitions. To restore the display to its original state, you need to either turn off the graphical device with dev.off() or reset it to a single partition with par(mfrow=c(1,1)).


3.5 Sorting Data

Now that we’ve learned how to look at specific pieces of information, e.g. columns within a data frame, another useful thing is sorting the dataset. In R we can use the codes sort() or order(). The function sort() will give you the values in order from smallest to largest (by default). So, let’s say we want to sort the values of Calories.

with(dat2,sort(Calories))
 [1]  54  66  70  75  76  87  88  91  94  99 102 102 107 107 110 111 112 113 131
[20] 132 132 135 135 136 138 139 139 140 141 146 147 148 149 149 152 153 153 157
[39] 158 172 173 175 175 176 179 181 182 184 186 190 190 190 191 195

The nice thing about this is that, unlike some other programs out there, running this function will not permanently modify the dataset. However, this gives you the actual values, which makes it easier to determine the median (139.5, the average of the 27th and 28th elements), but if you want to properly organize things, then the function order() is a bit more useful, especially if you want to order by more than one variable. For example, to order the variables by Type and then by Calories, i.e. first order all the hotdogs by their type and then within each type order by calorie count, we combine order() with element extraction. First:

with(dat2,order(Type))
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54

This is the order of hotdogs based on Type and conveniently, the hotdogs are already in order by type as the first 20 are beef, the next 17 are meat, and the last 17 are poultry. Doing the same thing to Calories we get:

with(dat2,order(Calories))
 [1] 54 53 52 48 50 44 41 38 42 45 40 43 33 46 51 12 49 47 17 20 39 19 35 30 37
[26]  8 28 36 13 27 26 10  4 18 11 14 32 16  7 25 21  9 29  3 31  2 23  5  1  6
[51] 15 24 22 34

TIP: Not sure if the directory you’re in has the files you want to read in? Use the list.files() function. In a similar fashion of how ls() tells you what objects are in the workspace of R, list.files() tells you what objects are in the work directory that you can import/read in:

list.files() # No input is necessary for this function

Now, importantly, although the first value from order() happens to be the exact same number as the first value from sort() (both are 54), the latter gives you the actual value whereas the former gives you the position within, in this case, the vector/column. Therefore, the hotdog with the lowest calorie count is in the 54th position and the hotdog with the highest calorie count is in the 34th position. Combining the two we get:

with(dat2,order(Type,Calories))
 [1] 12 17 20 19  8 13 10  4 18 11 14 16  7  9  3  2  5  1  6 15 33 35 30 37 28
[26] 36 27 26 32 25 21 29 31 23 24 22 34 54 53 52 48 50 44 41 38 42 45 40 43 46
[51] 51 49 47 39

which tells us that the hotdogs in positions 12, 17, 20, …, 6, 15 (i.e. the first 20 entries, which are all beef hotdogs) are the beef hotdogs ordered from smallest to largest; the hotdogs in positions 33, 35, 30, …, 22, 34 are the meat hotdogs ordered from smallest to largest; and the hotdogs in positions 54, 53, 52, …, 47, 39 are the poultry hotdogs ordered from smallest to largest. Now to combine them with element extraction:

with(dat2,dat2[order(Type,Calories),])
   Type Sodium Calories
12 Beef    400      111
17 Beef    317      131
20 Beef    353      132
19 Beef    298      135
...
   Type Sodium Calories
6  Beef    417      190
15 Beef    445      190
33 Meat    294      107
35 Meat    405      135
...
      Type Sodium Calories
22    Meat    406      191
34    Meat    311      195
54 Poultry    445       54
53 Poultry    422       66
...
      Type Sodium Calories
51 Poultry    481      110
49 Poultry    413      112
47 Poultry    313      113
39 Poultry    375      132

What this did was it took the positions outputted from the function order() and then rearranged the dataset dat2 using positions as a guide in the coordinate spots for the rows. If this is not clear, let’s quickly do a simpler example. Let’s type in the following:

temp <- c(1,10,100,1000,10000)
temp
[1]     1    10   100  1000 10000
temp[c(5,4,3,2,1)]
[1] 10000  1000   100    10     1

What I just did is created a variable called temp and saved some numbers into it. There are 5 numbers in all. Now as we discussed, by using the hard brackets I can look at particular pieces of information based on their position. So if I wanted to look at the value in the third position I would have ran:

temp[3]
[1] 100

In the code above I told R I wanted to look at the values in position 5, 4, 3, 2, and then 1. In the example with the function order() I did the exact same thing where I told R that I wanted to look at dat2 based on the particular positions, and by leaving the columns value blank (recall it is a 2D variable and so requires row and column input) it tells R we want all the column information. Note that there are several ways I could have invoked the original code, with the only caveat being that with dat2 I need to make sure I have the appropriate inputs for the two positions. Given that the code:

with(dat2,order(Type,Calories))
 [1] 12 17 20 19  8 13 10  4 18 11 14 16  7  9  3  2  5  1  6 15 33 35 30 37 28
[26] 36 27 26 32 25 21 29 31 23 24 22 34 54 53 52 48 50 44 41 38 42 45 40 43 46
[51] 51 49 47 39

gives us the positions within the data frame, it makes sense this piece of information be placed in the row coordinates, especially since we know that there are 54 positions (and values) within the rows, and only 3 within the columns; had we swapped the positions or forgotten the comma, we would have gotten an error about undefined columns selected.


3.6 Boolean statements and Subsets

Finally, as I alluded to previously when talking about the equal sign (for saving a variable in R), we’re going to talk about Boolean statements, i.e. true or false conditions. Let’s say, for example, you only wanted the data from the beef hotdogs. We already know that they are the first 20 values in the dataset, but what if they weren’t so conveniently sorted for us? We could use the order() function like above and create a new object to contain this information, but let’s not do that and instead combine element extraction and coordinates with a Boolean condition. If we type the following into R we get:

with(dat2,Type=='Beef')
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE

and if we combine that Boolean statement in a similar manner as we did with the hotdog example above we get:

with(dat2,dat2[Type=='Beef',])
  Type Sodium Calories
1 Beef    495      186
2 Beef    477      181
3 Beef    425      176
4 Beef    322      149
...
   Type Sodium Calories
17 Beef    317      131
18 Beef    319      149
19 Beef    298      135
20 Beef    353      132

To confirm we are right, we could use either the dim() or nrow() function to give us the dimensions or the number of rows, respectively, of this subset. An alternative to this method is to use the subset() function, which creates a subset of a dataset (or matrix or vector) based on a condition, e.g. the type of hotdog being beef, and which produces the exact same results as above:

subset(dat2,Type=='Beef')
  Type Sodium Calories
1 Beef    495      186
2 Beef    477      181
3 Beef    425      176
4 Beef    322      149
...
   Type Sodium Calories
17 Beef    317      131
18 Beef    319      149
19 Beef    298      135
20 Beef    353      132
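Following up on the dim()/nrow() suggestion above, a quick check that the subset has the 20 beef hotdogs (and the original 3 columns) we expect:

nrow(subset(dat2, Type=='Beef'))
[1] 20
dim(subset(dat2, Type=='Beef'))
[1] 20  3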

TIP: One of the issues with the Boolean statements above is that they do NOT handle missing data very well. For example, if we run the following code, where we randomly select 3 values to become missing:

temp <- with(dat2,Type)
set.seed(119)
temp[sample(1:length(temp),3)] <- NA
temp
 [1] "Beef"    "Beef"    "Beef"    "Beef"    "Beef"    "Beef"    NA       
 [8] NA        "Beef"    "Beef"    "Beef"    "Beef"    "Beef"    "Beef"   
[15] "Beef"    "Beef"    "Beef"    "Beef"    "Beef"    "Beef"    "Meat"   
[22] "Meat"    "Meat"    "Meat"    "Meat"    "Meat"    "Meat"    "Meat"   
[29] "Meat"    "Meat"    "Meat"    "Meat"    "Meat"    "Meat"    "Meat"   
[36] "Meat"    "Meat"    "Poultry" "Poultry" NA        "Poultry" "Poultry"
[43] "Poultry" "Poultry" "Poultry" "Poultry" "Poultry" "Poultry" "Poultry"
[50] "Poultry" "Poultry" "Poultry" "Poultry" "Poultry"
length(temp[temp=='Beef'])
[1] 21

We know that there are 20, 17, and 17 hotdogs for beef, meat, and poultry, respectively. However, when we run the last code for the length of temp (when isolating for Beef) it is longer than it should be since R does not know how to handle the NAs. In these cases, using the which() function is more appropriate to use with Boolean statements, as it will ignore those missing values:

length(temp[which(temp=='Beef')])
[1] 18
which gives us the correct count of the remaining (non-missing) Beef hotdogs; as you can see above, the 7th and 8th values are the Beef entries that went missing when we randomly replaced values in the initial code.

Boolean statements can be used in many other ways, such as getting only the hotdogs that have more than 150 calories, as we did when building the contingency table earlier. If we recall, we defined a healthy hotdog as one that had at most 150 calories:

       Type
Healthy Beef Meat Poultry
  FALSE   11    9       0
  TRUE     9    8      17

There we created a temporary variable that identified the healthy vs. unhealthy hotdogs, but we can use a Boolean statement, along with a standalone variant of the if-else statement called ifelse, to create a more permanent variable:

dat2$Healthy <- with(dat2, ifelse(Calories <= 150,'Good','Bad'))
with(dat2,table(Healthy, Type))
       Type
Healthy Beef Meat Poultry
   Bad    11    9       0
   Good    9    8      17

TIP: We’ve talked about how to isolate particular pieces of data; what happens if you just want to exclude a piece of data, e.g. an outlier? How can you get rid of it? You could again use Boolean statements, but one of the easiest methods is to employ a minus (-) symbol with the coordinates. First let’s run the following:

set.seed(119)
temp <- sample(1:100,10)
temp
 [1] 40 71 72 45 85  3 84 59 37  2

To get rid of any element, you just need its specific coordinate(s), so if you want to get rid of the 1st value, or maybe the 2nd through 5th values, then we can do:

temp[-1]
[1] 71 72 45 85  3 84 59 37  2
temp[-c(2:5)]
[1] 40  3 84 59 37  2

We can also apply this to 2D (or n-D) objects. For example, to get rid of the Sodium column (previewing just the first few rows with head()) we could do:

head(dat2[,-2])
  Type Calories Healthy
1 Beef      186     Bad
2 Beef      181     Bad
3 Beef      176     Bad
4 Beef      149    Good
5 Beef      184     Bad
6 Beef      190     Bad

TIP: An interesting fact about Boolean statements is that a TRUE value is (often) converted to a 1 and a FALSE is converted to a 0. So, in the case where we used the statement:

with(dat2,Type=='Beef')

that produces 20 TRUEs and 34 FALSEs, if we took the sum of the results we would have gotten 20. That is, if we did:

sum(dat2$Type=='Beef')
[1] 20
it would add up the numerical values of the vector to get 20, which corresponds to the nrow() of the Beef subset we checked earlier.
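Along the same lines, taking the mean of a Boolean vector gives the proportion of TRUEs, which is occasionally handy (20 of the 54 hotdogs are beef):

mean(dat2$Type=='Beef')
[1] 0.3703704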

TIP: Everything at this point has involved meeting a single condition, so let’s now add in multiple conditions with AND (&) and OR (|). If you do google this, older code may use && and || on whole vectors; however, since R 4.3.0 those operators throw an error when given conditions longer than length one (they are meant for single TRUE/FALSE values), so stick with & and | here. So, in the case where we used the statement:

with(dat2,table(Type=='Beef'))

FALSE  TRUE 
   34    20 

we can see the same number above. If we ask for Meat OR Poultry,

with(dat2,table(Type=='Meat' | Type=='Poultry'))

FALSE  TRUE 
   20    34 

which is the complement of the Beef tally.

TIP: I mentioned the idea of the complement above ((Meat | Poultry)' = Beef), and another logical operator related to this is the ! operator, which represents “not”:

with(dat2,table(Type!='Beef'))

FALSE  TRUE 
   20    34 

If we ask for Meat AND Poultry, we see that the count is 0, since the Types are mutually exclusive and you cannot have a hotdog that is both Meat AND Poultry.

with(dat2,table(Type=='Meat' & Type=='Poultry'))

FALSE 
   54 

TIP: Since we are talking about multiple conditions, if you are given a set of logical values, two related functions are the following (a quick example follows the list):

  • all, which will return a single TRUE if all of the values are true

  • any, which will return a single TRUE if at least one of them is true
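For example, using the Calories column from dat2 (the cut-offs 190 and 150 are chosen from values we have already seen above):

with(dat2, any(Calories > 190))   # at least one hotdog tops 190 calories (e.g. the 191 and 195 above)
[1] TRUE
with(dat2, all(Calories <= 150))  # not all hotdogs are at or below 150 calories
[1] FALSE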


3.7 Applying functions to objects

The following isn’t really about data manipulation, per se; however, it can allow you to bypass the need for some of the aforementioned functions. The two main functions that I have found to be very useful are apply(), which can be used on any 2-dimensional data structure (matrices, arrays, and data frames), and tapply(), which works on a vector split by a grouping variable. For now, let’s define a matrix A:

A <- matrix(c(4,1,3,7,-1,0,2,2,6,5,-2,1),3,4,T)
A
     [,1] [,2] [,3] [,4]
[1,]    4    1    3    7
[2,]   -1    0    2    2
[3,]    6    5   -2    1
apply(A,1,sum)
[1] 15  3 10
apply(A,2,sum)
[1]  9  6  3 10

As the name implies, apply() applies some function to A based on the value 1 or 2, which represents the rows (MARGIN=1) or columns (MARGIN=2), and so the above gives us our row and column sums, respectively. This is a very powerful function because we can apply any function we want across the columns and rows. Obviously, in this case, a small 3x4 matrix isn’t that hard to handle, but working with larger datasets highlights the benefits of this function:

apply(dat2[,2:3],2,sum) # Col sums for Sodium and Calories
  Sodium Calories 
   21521     7413 
apply(dat2[,2:3],2,mean) # Means for Sodium and Calories
  Sodium Calories 
398.5370 137.2778 
apply(dat2[,c(1,4)],2,table) # tabulation for Type and Healthy
$Type

   Beef    Meat Poultry 
     20      17      17 

$Healthy

 Bad Good 
  20   34 

The variant tapply() does a very similar job, except instead of using the margins as an indicator of where to apply the function, tapply() asks for a group indicator. As such, if you wanted to determine the mean Sodium or Calories, but broken down by Type, then:

with(dat2,tapply(Sodium,Type,mean))
    Beef     Meat  Poultry 
392.6500 392.0588 411.9412 
with(dat2,tapply(Calories,Type,mean))
     Beef      Meat   Poultry 
156.85000 158.70588  92.82353 

Based on what we can see here, especially for Calories, there is probably a difference in the calorie count for hotdogs by Type, with poultry hotdogs having substantially fewer calories each.
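
As a small aside (a sketch of my own, not part of the original examples), tapply() can also take a list of grouping variables to cross-tabulate the summaries:

with(dat2, tapply(Calories, list(Type, Healthy), mean))   # mean Calories for every Type-by-Healthy combination (NA if a combination is empty)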


3.8 Basic figures and plots (Part III)

Continuing with the boxplots (and an issue proposed on an assignment), if you run the code:

par(mfrow=c(1,3))
boxplot(with(dat2,Calories[Type=='Beef']))
boxplot(with(dat2,Calories[Type=='Meat']))
boxplot(with(dat2,Calories[Type=='Poultry']))
Boxplots (1x3) with different scales

Figure 3.7: Boxplots (1x3) with different scales

We come across a potentially misleading issue: upon a quick cursory look, it is easy to conclude that there is no meaningful difference between the boxplots based on Type. The issue here is that we did not specify the same y-axes for the plots and R does not realize you are trying to compare them. To deal with this, we use the ~ symbol to tell R that Type will be a group indicator to separate the boxplots by, which presents a more accurate representation of the difference in the number of calories based on the type of hotdog.

with(dat2,boxplot(Calories ~ Type))
Boxplots (1x3) with the same scale

Figure 3.8: Boxplots (1x3) with the same scale

As indicated by the title, these are (very) basic figures and plots, and later on you’ll be introduced to the Tidyverse packages. I would strongly recommend you explore those packages more when they are introduced, particularly through R for Data Science (by Wickham & Grolemund), which provides a great overview of options for data manipulation and more advanced data visualization. For now, let’s cover the basics of loading a package and installing one.


4 Installing and loading packages

By default, R has a lot of packages already installed. In these cases, you’ll just need to use library() to load the package. As an example, the package MASS has some useful functions and datasets and is already installed with R but not loaded (MASS stands for “Modern Applied Statistics with S”; R is the open-source version of S). One of its datasets is called accdeaths, which is a record of accidental deaths in the US between 1973-1978, but if we run the following:

accdeaths
Error: object 'accdeaths' not found

We get an error, as the dataset cannot be found. However, if we load the MASS library, the dataset will become available too:

library(MASS)
accdeaths
       Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
1973  9007  8106  8928  9137 10017 10826 11317 10744  9713  9938  9161  8927
1974  7750  6981  8038  8422  8714  9512 10120  9823  8743  9129  8710  8680
1975  8162  7306  8124  7870  9387  9556 10093  9620  8285  8466  8160  8034
1976  7717  7461  7767  7925  8623  8945 10078  9179  8037  8488  7874  8647
1977  7792  6957  7726  8106  8890  9299 10625  9302  8314  8850  8265  8796
1978  7836  6892  7791  8192  9115  9434 10484  9827  9110  9070  8633  9240

In the case of tidyverse (and many of its dependencies), it is not installed, so if we attempt to load the library, we get the following error:

library(tidyverse)
Error in library(tidyverse) : there is no package called ‘tidyverse’

In this case, we’ll have to install the package, which we can do via the GUI through:

[Tools] → [Install Packages…]

and searching for tidyverse, or by using install.packages() like so:

install.packages('tidyverse')
trying URL 'https://cran.rstudio.com/bin/macosx/big-sur-arm64/contrib/4.2/tidyverse_1.3.2.tgz'
Content type 'application/x-gzip' length 425892 bytes (415 KB)
==================================================
downloaded 415 KB


The downloaded binary packages are in
    /var/folders/lw/t6jjkmk91dn2gfl97zr90z9r0000gn/T//Rtmpq48rvg/downloaded_packages

Nothing happened: If you get an error or pop-up about a secure CRAN mirror, it just means that R doesn’t know which of its mirrors (think of a mirror as a backup copy of a website) to go to in order to download the package. In the case of a pop-up, select any of the mirrors - back “in the day”, how far away you were from a mirror site affected your download speed; nowadays, this isn’t an issue. If no pop-up occurs and you just get an error, we can run the following code, which tells R to go to the US server:

install.packages('tidyverse', repos='http://cran.us.r-project.org')

If successful, we’ll see a message without any errors or warnings stating that the package(s) was downloaded to some temporary location, and now all you’ll need to do is load the library like before:

library('tidyverse')
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::group_rows() masks kableExtra::group_rows()
✖ dplyr::lag()        masks stats::lag()
✖ dplyr::select()     masks MASS::select()

Do not worry about the Conflicts, as this just means certain functions from one package cannot be called by name alone because they are superseded by the most recently loaded package (e.g. the packages dplyr and stats both have functions named filter(), and the dplyr version takes precedence as the most recently loaded package). In case you were wondering, a package is a collection of R functions, code, and data, while the location where the packages are stored is called the library.
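
If you ever need one of the masked versions, you can always be explicit about which package a function comes from by using the :: operator; a quick illustration:

stats::filter(1:10, rep(1/3, 3))     # explicitly calls the stats version (a moving-average filter)
dplyr::filter(dat2, Type == 'Beef')  # explicitly calls the dplyr version on the hotdog data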


Before we end this section, let’s add in an extra package for reading in Excel files. There are several out there, and one of the most common ones was xlsx; however, that package relies on Java being installed and, when the file is large, runs into problems regarding the amount of available memory. As such, the package readxl, which is independent of Java, is a better option. Please note that the following code assumes that the format is xlsx; older Excel file formats include xls, which requires the read_xls function. If you are ever working with multiple formats, you can use the read_excel function, which will examine which format is used and select the most appropriate reader. The file hsb.xlsx in the aforementioned Google Drive will be used in the next section, so let’s read it in and do a quick head of the file.

library(readxl)
data <- read_xlsx('hsb.xlsx',sheet=1)
head(data)
# A tibble: 6 × 10
     id sex    race      ses    schtyp prog        read write  math science
  <dbl> <chr>  <chr>     <chr>  <chr>  <chr>      <dbl> <dbl> <dbl>   <dbl>
1    70 Male   Caucasian Low    Public General       57    52    41      47
2   121 Female Caucasian Medium Public Vocational    68    59    53      63
3    86 Male   Caucasian High   Public General       44    33    54      58
4   141 Male   Caucasian High   Public Vocational    63    44    47      53
5   172 Male   Caucasian Medium Public Academic      47    52    57      53
6   113 Male   Caucasian Medium Public Academic      44    52    51      63

Notice that when using the readxl package, the output above looks different and you see the word A tibble: 6 x 10. A tibble, or tbl_df, is a modern re-imagining of the data.frame that makes them easier to use with large datasets containing complex objects. I won’t get into specifics, but if you use the class function:

class(data)
[1] "tbl_df"     "tbl"        "data.frame"

You can see that one of the classes (i.e. the class tells us the type of object, be it numeric, character, factor, etc.) is the expected data.frame, which means the previous methods for manipulating data (e.g. using $ to isolate a column by name) still work:

table(data$sex)

Female   Male 
   109     91 

As mentioned above, though, a tibble (and therefore readxl) will limit how much information is printed out and gives details on the size of the data.frame. Since I used the head function, the print out is limited (by default) to only 6 rows, hence the output: A tibble: 6 x 10, where the 10 is the number of columns. If you enter the name of the tibble object (i.e. data), you get:

data
# A tibble: 200 × 10
      id sex    race      ses    schtyp prog        read write  math science
   <dbl> <chr>  <chr>     <chr>  <chr>  <chr>      <dbl> <dbl> <dbl>   <dbl>
 1    70 Male   Caucasian Low    Public General       57    52    41      47
 2   121 Female Caucasian Medium Public Vocational    68    59    53      63
 3    86 Male   Caucasian High   Public General       44    33    54      58
 4   141 Male   Caucasian High   Public Vocational    63    44    47      53
 5   172 Male   Caucasian Medium Public Academic      47    52    57      53
 6   113 Male   Caucasian Medium Public Academic      44    52    51      63
 7    50 Male   Black     Medium Public General       50    59    42      53
 8    11 Male   Hispanic  Medium Public Academic      34    46    45      39
 9    84 Male   Caucasian Medium Public General       63    57    54      58
10    48 Male   Black     Medium Public Academic      57    55    52      50
# ℹ 190 more rows

Notice that only 10 rows are printed out, that the actual size is 200 x 10, and you get the message that there are 190 more rows that are not printed out. Although only one option was included in the read_xlsx function (sheet, which requires either the number or the name of the sheet), there are several others, including skip and n_max: the former specifies how many rows should be skipped before importing/reading in the data, while the latter indicates the maximum number of rows to read in.
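
As a hedged sketch of those two options (the exact values read in will depend on the file’s layout):

small <- read_xlsx('hsb.xlsx', sheet=1, skip=2, n_max=50)  # skip the first 2 rows of the sheet (the next row becomes the column names), then read in at most 50 rows
dim(small)                                                 # check the resulting dimensions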

WARNING: Although a tibble does print out much nicer, as we’ll see later on, it can make things a little more annoying when it comes to manipulating data. Again, I mention I am set in my old ways, so there may be ways to get around some of the issues that make me dislike tibble but I’m unlikely to search for them/change my ways. As a result of this, I’m going to create a second object with the same data but remove the tibble class and keep the information as a data.frame.

data2 <- data.frame(data)

5 Basics of Programming

I won’t get into why programming is important and will jump into the basics of how to program in R to do things more efficiently, including creating your own programs/functions and scaling up your work, using examples from my own experience or teaching.

5.1 Extracting data from strings

Analyzing genotype data often requires extracting data from pre-defined formats. For example, some of my datasets that look at genetic variants (known as single nucleotide polymorphisms or SNPs) use very specific formats where the information for each SNP is separated by a double forward slash (//):

transcript accession // SNP-gene relationship // distance (value 0 if within the gene) // 
           UniGene Cluster ID // gene name or symbol // NCBI Gene ID // GenBank description.

This can be made all the more difficult by the fact that, because a SNP could be physically located between two genes (i.e. in an intergenic region), both genes would be listed in a single entry and separated by a triple forward slash (///).

In either case, the function strsplit is useful and will allow you to split any text/character based on a certain character/string of text of interest. For example, using the following input (x) and splitting on a whitespace:

strsplit(x="fox in socks",split=" ")
[[1]]
[1] "fox"   "in"    "socks"

The output is three entries (i.e. each word); however, it is outputted as the aforementioned list and not an array or vector. Because of this, if you want (for example) the 2nd entry, the positional code [2] doesn’t quite work in the same way, and if you tried it, you’d get the following (unhelpful) output:

[[1]]
NULL

In short, because x (right now) is a single input, you’ll need to tell R to look at the first entry of the list and then the 2nd element within it (i.e. [[1]][2], which we’ll come back to shortly). The reason for this structure is that you can input a vector of strings and apply the same process:

strsplit(x=c("fox in socks","socks on knock","knox in box"),split=" ")
[[1]]
[1] "fox"   "in"    "socks"

[[2]]
[1] "socks" "on"    "knock"

[[3]]
[1] "knox" "in"   "box" 

Additionally, lists can have different classes. For example:

set.seed(1)
a <- list(1:8,letters[1:5],rnorm(10))
a
[[1]]
[1] 1 2 3 4 5 6 7 8

[[2]]
[1] "a" "b" "c" "d" "e"

[[3]]
 [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078 -0.8204684
 [7]  0.4874291  0.7383247  0.5757814 -0.3053884

As can be seen here, there are 3 vectors in the list a with differing classes (i.e. numeric, character, numeric) and differing lengths; that is, we can have a mix of classes and lengths. If we wanted the 5th element of the 1st vector, we would do:

a[[1]][5]
[1] 5

The process is very similar to what you’ve seen before but requires a double hard bracket [[1]] to tell R to look for vector 1 and then within it we want the element in the 5th position: [5]. More details about lists can be found here. Another function you may find useful is unlist, which takes a list and forces them into a single vector:

unlist(a)
 [1] "1"                  "2"                  "3"                 
 [4] "4"                  "5"                  "6"                 
 [7] "7"                  "8"                  "a"                 
[10] "b"                  "c"                  "d"                 
[13] "e"                  "-0.626453810742332" "0.183643324222082" 
[16] "-0.835628612410047" "1.59528080213779"   "0.32950777181536"  
[19] "-0.820468384118015" "0.487429052428485"  "0.738324705129217" 
[22] "0.575781351653492"  "-0.305388387156356"

Importantly, since unlist combines all the data into a single vector, the data will be forced into the same class (if they differ), so in this case we get a character vector of length 23. This can be problematic, which is why the ability of a list to hold mixed classes can be useful.

TIP: As you can see from the index on the left side, it is easy to see that the length of the vector is 23; however, the functions length, dim, nrow, and ncol are useful for figuring out the dimensions of various vectors, matrices, data frames, etc. (a quick sketch follows).
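
A quick sketch using the objects defined so far:

length(unlist(a))   # 23: the number of elements in the unlisted (combined) vector
length(a)           # 3: the number of components in the list itself
dim(A)              # 3 4: the rows and columns of the matrix A defined earlier
nrow(A); ncol(A)    # the same information, one dimension at a time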

TIP: Sticking with the idea of pre-defined formats, some other important functions I’ve used include:

nchar: Determines the length of a character string

nchar("fox in socks")
[1] 12


regexpr: Finds where a specific pattern is located

regexpr(pattern="-",text="AX-123456")
[1] 3
attr(,"match.length")
[1] 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

In the example above, the character - is in the third position within the string

gsub: Substitutes/replaces a pattern

gsub(pattern="-",replacement=".",x="AX-123456")
[1] "AX.123456"

In the example above, the character - is replaced with a period (.).

TIP: A nice thing about R is that when you have a function with multiple inputs, R only needs enough characters to differentiate between the options, so the gsub example can be shortened to:

gsub(pat="-",rep=".",x="AX-123456")

or if you know the input order, then the following works as well:

gsub("-",".","AX-123456")
[1] "AX.123456"

There are significantly more versions and variants of these functions. For those who want to dive deeper into the Tidyverse, you can find a lot of similar functions in the stringr package.


5.2 Building Basic Functions

The strsplit function is probably not necessary for your work, but for me, the genotype data I work with has over 800,000 SNPs, so being able to determine which genes are associated with each SNP (which requires extracting them from the 5th entry per the example below) was a necessary task.

transcript accession // SNP-gene relationship // distance (value 0 if within the gene) // 
           UniGene Cluster ID // gene name or symbol // NCBI Gene ID // GenBank description.
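
As a minimal sketch of that extraction, using a made-up annotation string in the format above (the accession, IDs, and gene name below are all hypothetical):

ann <- "NM_000000 // upstream // 4562 // Hs.00000 // GENE1 // 12345 // hypothetical description"   # hypothetical entry
unlist(strsplit(ann, split=" // "))[5]   # the 5th entry is the gene name/symbol ("GENE1" here)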

Another important part of basic programming is the ability to scale up your work by writing your own functions - this was really important for me since I had to repeat the process described above 800,000 times. As we’ll discuss later on, computation time can be decreased heavily when we write functions, even for simple tasks. Let’s start with a quick introduction to a function:

hello <- function(name){
  msg <- paste("hello ",name,", this is your first function",sep="")
  print(msg)
}
hello("derrick") 
[1] "hello derrick, this is your first function"

This function receives an input called name, creates a character string that inserts the name into the message above, and then prints the message. Similar to the comments above, since there is only a single input to the function, I don’t need to specify name = "derrick".

TIP: An alternative to the paste function is paste0, which by default concatenates character strings together without a separator. Play around with this before we move on to the next example.
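
A quick illustration of the difference:

paste("AX", 123456, sep="-")   # "AX-123456": paste needs the separator specified
paste0("AX-", 123456)          # "AX-123456": paste0 simply concatenates with no separator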

As with most reports, having the mean and SD of some variable of interest is useful. Although we listed common statistics functions previously, we can “combine” them into a single function like so:

summary.stat <- function(data,sig=3){
  out <- c(mean(data),sd(data))
  out <- round(out,sig)
  names(out) <- c('mean','sd')
  return(out)
}
set.seed(1)
summary.stat(rnorm(100))
 mean    sd 
0.109 0.898 

In this example, we’ve created a more useful function that gives us the summary statistics of some data, in particular the mean and SD, and we’ve specified a default value for the number of decimal places the output is rounded to. Again, play around with this and update the function so that it outputs the median and IQR as well.

Please note that:

  1. The name of the input(s) in any function, and any objects created within the function, only exist within that function. For example, in my code above, it is entirely possible there is nothing called data in my R environment; the “data” (i.e. rnorm(100)) is not saved as an object, and the object out only exists within the function, so both disappear once the function finishes executing.

  2. Depending on how the function is written, the output can be saved. In the examples above, the output can be saved as (for example) a vector or data.frame so that it is retained for other purposes (see the sketch below).
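
A small sketch of point 2, using the summary.stat function from above (the exact values will differ each run since no seed is set here):

res <- summary.stat(rnorm(100))   # the returned (named) vector is assigned to res and therefore retained
res['mean']                       # individual elements can then be extracted by name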


5.3 Loops and scaling-up your work

Going back to the previous hsb file read in:

# A tibble: 6 × 10
     id sex    race      ses    schtyp prog        read write  math science
  <dbl> <chr>  <chr>     <chr>  <chr>  <chr>      <dbl> <dbl> <dbl>   <dbl>
1    70 Male   Caucasian Low    Public General       57    52    41      47
2   121 Female Caucasian Medium Public Vocational    68    59    53      63
3    86 Male   Caucasian High   Public General       44    33    54      58
4   141 Male   Caucasian High   Public Vocational    63    44    47      53
5   172 Male   Caucasian Medium Public Academic      47    52    57      53
6   113 Male   Caucasian Medium Public Academic      44    52    51      63

we can see that there are several variables of interest that we may want to tabulate quick summary statistics for. There is nothing wrong with writing the code and adjusting it accordingly for each variable but that is quite repetitive, so for the purposes of covering the basics in programming, we’ll (re)introduce a loop and basic parallel processing function as two options for doing this.

Loop functions are fairly straightforward and work on the basis of an index i, which we saw previously under the Random numbers section, but in R the index does not have to be a number, which we’ll show momentarily. In the following code, I’m going to use the same summary.stat function we just created to generate a vector that outputs the summary stats for the read, write, math, and science scores in the hsb dataset.

WARNING: Previously, I highlighted my preference against tibble. If you use the original data object, you may find that the functions/steps provided below fail, but if you use data2, which is the data.frame version, everything works fine. If you use data instead, you’ll probably get the following error message:

Error in is.data.frame(x) : 
  'list' object cannot be coerced to type 'double'
In addition: Warning message:
In mean.default(data) : argument is not numeric or logical: returning NA

This is because the original data has multiple classes:

[1] "tbl_df"     "tbl"        "data.frame"

So, to get around this, you can either use the data2 mentioned above or (per the issue about coercing a list into a double) unlist the input - I’ll highlight where in the code you can make the amendment.

COMMENT: As with any computer programming, there are an infinite number of ways to do these processes, some more efficient than others, but doing it in a way that is easy and readable to you should be prioritized until you become more comfortable with the language.

5.3.1 Option 1: Sequential loop

In short, a loop, be it through for or while, involves the code repeating itself over and over again until the index reaches a terminating point or the condition for looping ends. In either case, the process is done sequentially and no new iteration starts until the prior one is complete. This method is very different from parallel processing, which we’ll discuss as option 2; although the former is “slower”, both have their own respective advantages and disadvantages.


5.3.1.1 Option 1A: Numerical index

colind <- 7:10                                        # numerical index for each variable
output <- NULL                                        # empty object to be appended
for(i in colind){
  output <- c(output,summary.stat(data2[,i]))         # appends the stats to output during each loop
}
output <- data.frame(matrix(output,ncol=2,byrow=T))   # convert to data.frame to make it pretty
row.names(output) <- names(data2)[colind]             # naming the rows
colnames(output)  <- c('mean','sd')                   # naming the columns

COMMENT: To deal with the issue mentioned above, you can (for example) unlist the input to bypass the issue.

colind <- 7:10                                        
output <- NULL                                        
for(i in colind){
  output <- c(output,summary.stat(unlist(data[,i])))  # modified to unlist the original data
}
output <- data.frame(matrix(output,ncol=2,byrow=T))   
row.names(output) <- names(data2)[colind]             
colnames(output)  <- c('mean','sd')                   


5.3.1.2 Option 1B: Character index

vars   <- names(data2)[7:10]                          # names of the specific variables
output <- NULL                                        # empty object to be appended
for(i in vars){                                       # i is now a name within data2
  output <- c(output,summary.stat(data2[,i]))         # appends to output during each loop
}
output <- data.frame(matrix(output,ncol=2,byrow=T))   # convert to data.frame to make it pretty
row.names(output) <- vars                             # vars
colnames(output)  <- c('mean','sd')                   # naming the columns

Unlike option 1A, which uses the coordinates (e.g. data2[,7] is equivalent to data2$read), by changing the index i to a character, we can use a process similar to using coordinates (e.g. data2[,'read'] is equivalent to data2$read), which is useful when you know the variable names but not where they are within the data.frame.


5.3.2 Option 2: Parallel processing

In short, parallel processing involves breaking down a task into smaller parts (i.e. applying the function summary.stat to a set of variables) and executing these parts simultaneously to enhance performance, which speeds up processing time and is especially beneficial for large datasets and complex calculations.

5.3.2.1 Option 2A: sapply function

The function sapply is a user-friendly version of lapply but returns a vector instead of a list. Although lapply is more flexible (i.e. can have multiple classes), sapply is more appropriate for the class of data we are using and the fact that we are going to combine them into a data.frame. Before we get into the exact code, let’s do a quick example:

sapply(1:4,function(x) x^2)
[1]  1  4  9 16

In our example above, the input is the values 1 through 4 and the function is the square of the inputted value. Notice that I’ve simplified the function code a bit more and did not include a return of an object nor used the { and } brackets. There are some shortcuts we can implement when coding and building functions in R but I’ll leave it to you to explore the specifics. To my knowledge, there is no native function for squaring a value outside of the operator ^ that accepts the values as an input, unlike:

sapply(1:4,exp)
[1]  2.718282  7.389056 20.085537 54.598150

where I applied the exponential function to the input vector 1:4. Since sapply requires a function (rather than an operator), I had to “create” the function to square the input. Applying the same process to our example above, I’m going to create a temporary function that (A) calls on the summary.stat function and (B) utilizes the object data2.

Recall: summary.stat assumes that the input is some vector of data, hence I need to tell R that the vector is within data2, where the input (in both sapply and summary.stat) utilize the specified index.

vars   <- names(data2)[7:10]                          # names of the specific variables
fn <- function(x) summary.stat(data2[,x])             # function calling summary.stat and data2
output <- sapply(vars,fn)                             # "applying" fn to vars
output <- data.frame(matrix(output,ncol=2,byrow=T))   # convert to data.frame to make it pretty
row.names(output) <- vars                             # naming the rows using info in vars
colnames(output)  <- c('mean','sd')                   # naming the columns


5.3.2.2 Option 2B: apply function

Given that we want to apply a function to a specific set of columns, we can use the apply function again that we previously introduced. As a reminder, we can apply a function across the rows (MARGIN=1) or down columns (MARGIN=2); however, R will apply this for all rows (or columns) and so we’ll need to isolate those columns first.

output <- apply(data2[,vars],2,summary.stat)          # "applying" a function to the columns (MARGIN=2)
output <- data.frame(t(output))                       # convert to data.frame to make it pretty 

This option is a lot cleaner; however, it requires knowing how the output is formatted (a matrix) and that we can use the t function to transpose the matrix before saving it as a data.frame.

          mean     sd
read    52.230 10.253
write   52.775  9.479
math    52.645  9.368
science 51.850  9.901


TIP: Although it is important to be able to build/create custom functions, the package rstatix does a great job at similar tasks and works nicely with tidyverse functions:

library(rstatix)
get_summary_stats(data, read, write, math, science, type='common')
# A tibble: 4 × 10
  variable     n   min   max median   iqr  mean    sd    se    ci
  <fct>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 read       200    28    76     50  16    52.2 10.3  0.725  1.43
2 write      200    31    67     54  14.2  52.8  9.48 0.67   1.32
3 math       200    33    75     52  14    52.6  9.37 0.662  1.31
4 science    200    26    74     53  14    51.8  9.90 0.7    1.38

I’ll leave it to you to explore the package more.


5.3.3 Benchmark the processing time

As mentioned, parallel processing allows us to simultaneously apply a function instead of going through each variable sequentially. To figure out the computing time, we can use the function proc.time, which can be used to determine how much “real” and CPU time (in seconds) a process takes:

proc.time()
   user  system elapsed 
  1.318   0.238   2.789 

The third output elapsed will be important for us to use to calculate the process time.

COMMENT: In order to accurately calculate the process time, we need to:

  1. Use proc.time twice, before and after the tasks

  2. Submit the code to R at the same time. We can do this by highlighting all of the code and submitting it. To do this, you’ll need to click on the [Run Selected Line(s)] as can be seen in the top right of the image below.

RStudio: Run Selected Line(s).

Figure 5.1: RStudio: Run Selected Line(s).

Below is a quick example:

g <- rnorm(100000)
h <- rep(NA, 100000)

start <- proc.time()  # Start the clock

for (i in 1:100000){  # Loop through the vector, adding one
    h[i] <- g[i] + 1
}

stop <- proc.time()   # Stop the clock
stop - start          # Computing time
   user  system elapsed 
  0.004   0.000   0.004 

In this case, the computing time was 0.004 seconds, which isn’t very long, but this is clearly not a computationally intensive task.

colind <- 7:10                                        # numerical index for each variable
output <- NULL                                        # empty object to be appended
# Start highlighting code here
start1 <- proc.time() 
for(i in colind){
  output <- c(output,summary.stat(data2[,i]))         # appends the stats to output during each loop
}
stop1  <- proc.time()
# Stop highlighting code here
stop1 - start1                                        # Computing time
   user  system elapsed 
  0.002   0.000   0.001 
output <- data.frame(matrix(output,ncol=2,byrow=T))   # convert to data.frame to make it pretty
row.names(output) <- names(data2)[colind]             # naming the rows
colnames(output)  <- c('mean','sd')                   # naming the columns

# Start highlighting code here
start2 <- proc.time()
output <- apply(data2[,7:10],2,summary.stat)          # "applying" a function to the columns (MARGIN=2)
output <- data.frame(t(output))                       # convert to data.frame to make it pretty 
stop2  <- proc.time()
# Stop highlighting code here
stop2 - start2                                        # Computing time
   user  system elapsed 
  0.001   0.000   0.000 

Adding the proc.time calls to the code above, we can see that the elapsed time for option 1 (sequential) is about 0.001 seconds, while for option 2 (parallel) the elapsed time is so small it registers as 0.000 seconds. Although that difference does not seem like much here, as we scale up the number of (for example) variables we are applying the function to, or the amount of work done per variable, the gap between the two approaches becomes increasingly important.

COMMENT: Processing time will vary for everyone because each computer’s specs are different, and the tasks currently being run by other programs on the same computer will also impact the resources at your disposal.

MODIFICATION: Although there is nothing wrong with proc.time, one thing I’ve come to take “issue with” is that, although I can determine the computing time, it would be good to know the actual time the process finished since we rarely remember when we started the code. Below is a modification to the proc.time function I created:

  comp.time <- function(){
    cat(paste0("Time: ",format(Sys.time(),'%H:%M:%S')),"\n")
    return(proc.time())
  }

This comp.time function does the exact same thing as the original (and everything we did above is still applicable), but as a wrapper it will also print the actual clock time whenever it is called (i.e. when the timer is started or stopped). That is, if you replace proc.time with comp.time, you will see the actual clock time when the code is initiated and when it is completed. Thus, if we run the code right now, you’ll see the actual clock time when I compiled this code:

start3 <- comp.time()
Time: 12:16:21 
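
To complete the pattern, a small sketch that reuses summary.stat and data2 from earlier (the printed clock times and elapsed values will depend on when and where you run it):

tmp   <- apply(data2[,7:10], 2, summary.stat)   # some task worth timing
stop3 <- comp.time()                            # prints the clock time when the task finishes
stop3 - start3                                  # elapsed computing time, exactly as with proc.time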


6 Data Wrangling

Data wrangling is just the process of transforming raw data into a usable format for analysis, which includes cleaning and structuring/formatting to make it suitable for our analytic needs. Previously, under Data Manipulation (Part I), we talked about:

  • attach and detach, the former of which means that (for example) a data.frame is searched by R when evaluating a variable, so that objects can be accessed by simply giving their name.

  • $ is an operator that can be used to extract (for example) variables within a data.frame

  • [ and ] to use coordinates within data frames, vectors, matrices, arrays, etc., and

  • subset, which returns a subset of a vector, matrix, or data frames that meet certain conditions.

We’ll continue now with some more advanced data manipulation and visualization steps that will dive a bit more into the tidyverse.


6.1 Data Manipulation (Part II)

The following are other functions I’ve come across that are useful when manipulating data (or datasets).

6.1.1 Matching

The function match returns a vector of the positions of (first) matches of the first input in the second.

set.seed(1)
x <- sample(1:5,5,replace=T)
x
[1] 1 4 1 2 5
y <- sample(1:10,10,replace=T)
y
 [1]  7  2  3  1  5  5 10  6 10  7

As we can see above, x contains the value 1 twice, while y contains (in sorted order) the values 1, 2, 3, 5, 5, 6, 7, 7, 10, 10. When using the match function, we’ll see that

  • 1, 2, and 5 will match and have an outputted position within y

  • 4 doesn’t exist in y and therefore will not produce a position in y, and

  • 5 will only be matched on the first occurrence within y despite being there twice.

match(x,y)
[1]  4 NA  4  2  5

Admittedly, match is not a function I use regularly, but a related function is %in%, a more intuitive operator that utilizes match under the hood. It is a logical operator and returns TRUE or FALSE:

x %in% y
[1]  TRUE FALSE  TRUE  TRUE  TRUE

TIP: Other operators that are used in a similar fashion fall under the Set Operation functions:

  • union: merge all elements
union(x,y)
[1]  1  4  2  5  7  3 10  6
  • intersect: only show the common elements
intersect(x,y)
[1] 1 2 5
  • setdiff: what values within x are not in y
setdiff(x,y)
[1] 4
  • is.element: what values within x are in y (identical to %in%)
is.element(x,y)
[1]  TRUE FALSE  TRUE  TRUE  TRUE

This matching functionality is useful when you are, for example, comparing two datasets based on some key/ID and want to confirm that the values in x exist in y. For example, I recently merged two datasets together (x: questionnaire data, y: genotype data) and wanted to confirm which members did not have genotype data, which we’ll mimic using the following simulated data:

set.seed(1)
x <- data.frame(ID=sample(1:50,10,replace=F),x=rnorm(10))
y <- data.frame(ID=sample(1:50,40,replace=F),y=rnorm(40))

which produces the following (sorted) IDs

 [1]  1  4 14 18 21 23 33 34 39 43
 [1]  1  2  4  6  7  8  9 10 11 12 13 14 15 18 19 20 21 22 23 24 25 26 28 29 30
[26] 32 33 34 35 36 37 38 41 42 43 44 46 47 48 49

The rest of the data is not important but will be used (sort of) in the next sections. Now, to determine which of the IDs in x are in y (or vice versa), we could have used the %in% function like so:

x$ID[x$ID %in% y$ID]
[1]  1  4 14 18 21 23 33 34 43

Thus, we can see that among the 10 IDs in x, 9 of them are in y (while among the 40 in y, 9 are in x).

Comment: The situation above may seem weird since most would assume that all 10 IDs in x should be in y; in theory they should be, but in practice they aren’t always. Case in point, in the study I’ve referenced, 4 participants were ineligible but their IDs were erroneously included when DNA samples were pulled by a technician. As such, I had 4 participants (incorrectly) genotyped that could not be matched to any questionnaire data (which only included eligible participants) - that was a $500 mistake.

TIP: The function setdiff is quite useful in the situation above since I could have used it to tell me which IDs were in the questionnaire data but not in the genotype data (since not everyone gave biosamples), and which IDs were in the genotype data but not in the questionnaire data (which would have told me I had ineligible IDs about to be genotyped). Let’s use the previous dataset to showcase this:

  • setdiff: what values within x are not in y

setdiff(x$ID,y$ID) # ID 39 in x is not in y
[1] 39
setdiff(y$ID,x$ID) # Since y involves 40 IDs, we're expecting several IDs to not be in x.
 [1] 15 20 35  6 10 42 38 48 28 44 46 25 36 24 32  2 13 22 49 47 19  8 26 12 41
[26]  7 11 37 30 29  9


6.1.2 Recoding variables

Be it converting a continuous variable into a categorical one, collapsing a categorical (or continuous) variable into a binary variable, or simply relabeling (or combining) levels, the recode function is quite useful. Unfortunately, there are several variants of this function, including ones from dplyr, which is part of the tidyverse; car, which is a common package; and a fairly new package (circa 2019) called admisc - I’m going to show the latter, as I like the flexibility of this version.

COMMENT: Although the recode function is the main focus, the ifelse function that was previously introduced is another great option here (a quick sketch follows).
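
As a minimal sketch of the ifelse approach, using the hotdog data and an arbitrary (hypothetical) cut-off of 140 calories:

lowcal <- ifelse(dat2$Calories < 140, 'Low', 'High')   # binary recode: 'Low' if under 140 calories, 'High' otherwise
table(lowcal)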

We’ll quickly run through 5 examples using the recode function, but first let’s install the admisc package:

install.packages('admisc')
library('admisc')


6.1.2.1 Example 1: Collapsing categorical variables together

set.seed(1)
x <- sample(1:3, 12, replace=T)
y <- recode(x, "1:2 = 'x \u2264 2'; else = 'x > 2'")
table(x,y)
   y
x   x > 2 x ≤ 2
  1     0     4
  2     0     3
  3     5     0

Using the table function above, we can see that the new variable y contains a recategorized version of x based on whether the value is 3 or whether it is 1 or 2. Although not important, the code \u2264 tells R, when printing outputs like the one above, to convert the unicode into the “less than or equal to” symbol (≤).


6.1.2.2 Example 2: Cutting a continuous variable into a categorical variable

set.seed(1)
levs <- c('low','normal','high')
x <- sample(20:70, 25, replace = TRUE)
y <- recode(x, cut = "35, 55", values = levs)
y <- factor(y,levels = levs)
table(y)
y
   low normal   high 
     7      8     10 

Note that I’ve added an extra step that converts the recoded y, based on the two listed thresholds, into a factor to maintain my (arbitrary) order. If you do not specify values, R will (by default) label the categories 1 through n, where n is the number of levels (3 in this example).


6.1.2.3 Example 3: Cutting a sequential, categorical variable into a categorical variable

set.seed(1)
x <- factor(sample(letters[1:12], 20, replace = TRUE),
          levels = letters[1:12])
y <- recode(x, "a:d = 1; e:h = 2; else = 3")
table(y,x)
   x
y   a b c d e f g h i j k l
  1 2 2 1 1 0 0 0 0 0 0 0 0
  2 0 0 0 0 4 1 3 0 0 0 0 0
  3 0 0 0 0 0 0 0 0 2 2 2 0

In this example, we can take advantage of the sequential nature of letters and categorize them using the sequence operator :, which can be used for both numerical ranges (as seen previously) and sequential characters (as above). Importantly, notice that we have 3 groups numbered 1 through 3 and that each range is separated by a semi-colon ;.


6.1.2.4 Example 4: Grouping parts of a sequential, categorical variable into a categorical variable

# Example 4
z <- recode(x, "a, c:d = 'Grp 1'; e, g:h = 'Grp 2'; else = 'Grp 3'", labels = "Grp 1, Grp 2, Grp 3")
table(z,x)
       x
z       a b c d e f g h i j k l
  Grp 1 2 0 1 1 0 0 0 0 0 0 0 0
  Grp 2 0 0 0 0 4 0 3 0 0 0 0 0
  Grp 3 0 2 0 0 0 1 0 0 2 2 2 0

In this example, there are two parts to take notice of,

  • The option labels tells R that the new object z will be a factor with the levels: Grp 1, Grp 2, Grp 3

  • Example 4 is very similar to Example 3; however, b and f are not in their “expected” labels: they were “skipped” over by breaking the sequence, so they fall under the else condition, making them part of Grp 3.


6.1.2.5 Example 5: Partial recode

set.seed(1)
x <- sample(20:70, 25, replace = TRUE)
y <- recode(x, "20:64 = 1; else = copy")
table(y,x)
    x
y    20 23 26 28 29 33 34 37 40 42 44 52 53 56 58 60 61 62 65 70
  1   1  1  1  1  1  1  1  1  3  1  1  1  1  3  1  1  1  1  0  0
  65  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2  0
  70  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1

In this example, there is one part to take notice of in the latter part of the code. Per the previous use of a semi-colon (;) to separate multiple groups, the else condition tells R what to do with any value not meeting the prior condition (i.e. between 20 and 64, inclusive). Rather than assigning those values to a new group label (like in Example 4), the copy option tells R to keep the original value/label. That is, anything between 20 and 64, inclusive, is relabeled as 1, while everything else maintains its original labeling.


6.1.3 Merging and joining datasets

As many of you have experienced, data linkages are a pain, both in terms of waiting for the bureaucracy and making sure all the data merges correctly. There are several options out there; however, the ones I find myself using the most are the function merge, which is a native part of R, and variants of join, which is part of the dplyr library. It is worth noting that there are 4 variants: inner_join, which only keeps observations from x that have a matching ID in y, and three “outer” join functions: full_join, left_join, and right_join.

Using the previously established simulated data:

set.seed(1)
x <- data.frame(ID=sample(1:50,10,replace=F),x=rnorm(10))
y <- data.frame(ID=sample(1:50,40,replace=F),y=rnorm(40))

We’ve already established that 9 of the observations in x are in y, so if we merged the two datasets keeping everything, we should expect 1 “missing” observation under the randomly generated data within x. Likewise, since there are 31 IDs in y that are not in x, we would expect “missing” observations there.


6.1.3.1 Option 1: merge function

With the merge function, there are several options; however, the main ones are by, which you typically set to the ID that you want to match on, and all, which tells R whether extra rows are added to the output when there are mismatches; the default is FALSE, so that only rows with data from both x and y are included in the output.

set.seed(1)
x <- data.frame(ID=sample(1:50,10,replace=F),x=rnorm(10))
y <- data.frame(ID=sample(1:50,40,replace=F),y=rnorm(40))
z <- merge(x,y,by="ID")
  ID           x           y
1  1 -0.30538839  0.01915639
2  4  0.73832471  1.25408311
3 14 -2.21469989 -0.11582532
4 18  1.12493092 -1.42509839
5 21 -0.01619026  0.14377148
6 23  0.38984324  0.61824329
7 33 -0.04493361 -1.56378205
8 34  1.51178117 -0.64901008
9 43 -0.62124058 -1.26361438

We know that one of the x IDs is missing from y, but if we wanted to include all of the IDs (even if there is missing data) we’d have to make a small amendment. First, let’s look at the help output for merge:

merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
      incomparables = NULL, ...)

First, notice the two all variants, all.x and all.y, which are logical arguments. In the scenario above, where I want to keep the unmatched x ID, we can do:

z <- merge(x,y,by="ID",all.x=TRUE)
   ID           x           y
1   1 -0.30538839  0.01915639
2   4  0.73832471  1.25408311
3  14 -2.21469989 -0.11582532
4  18  1.12493092 -1.42509839
5  21 -0.01619026  0.14377148
6  23  0.38984324  0.61824329
7  33 -0.04493361 -1.56378205
8  34  1.51178117 -0.64901008
9  39  0.57578135          NA
10 43 -0.62124058 -1.26361438

Thus, we have all 10 IDs within x and missing data (per the NA) in the y column. Conversely, if we wanted to completely merge the two datasets, then

z <- merge(x,y,by="ID",all=TRUE)

I’m not going to bother with the output, nor the output for the all.y option. The second thing to notice is the two by variants, by.x and by.y, which are useful if you have differing names for the key that the matching occurs by. For example:

set.seed(1)
x <- data.frame(ID=sample(1:50,10,replace=F),x=rnorm(10))
y <- data.frame(id=sample(1:50,40,replace=F),y=rnorm(40))
z <- merge(x,y,by="ID")

you’ll get the following error:

Error in fix.by(by.y, y) : 'by' must specify a uniquely valid column

To correct for the differing key names, we would do:

z <- merge(x,y,by.x="ID",by.y="id")

Although this is a viable option, I would suggest renaming the key of one of the datasets.

TIP: If you have a situation where the key has different names depending on the dataset and want to rename it, then you’ll need to replace the “name” using the names function like so:

names(y)[1] <- 'ID'

The only problem with this option is that you may not know the location of the ID, and the dataset may be so large that listing out all of the names and figuring out the location of the key is quite annoying. In this case, we can do something like:

names(y)[which(names(y)=='id')] <- 'ID'


6.1.3.2 Option 2: join function

The focus here will be left_join and full_join, and I’ll leave it to you to explore the specifics of inner_join and right_join. Starting with the basics, we’ll recreate our prior scenario where one of the IDs in x does not exist in y but ignore the y IDs that have missing x data.

set.seed(1)
x <- data.frame(ID=sample(1:50,10,replace=F),x=rnorm(10))
y <- data.frame(ID=sample(1:50,40,replace=F),y=rnorm(40))
z <- merge(x,y,by="ID",all.x=TRUE)
   ID           x           y
1   1 -0.30538839  0.01915639
2   4  0.73832471  1.25408311
3  14 -2.21469989 -0.11582532
4  18  1.12493092 -1.42509839
5  21 -0.01619026  0.14377148
6  23  0.38984324  0.61824329
7  33 -0.04493361 -1.56378205
8  34  1.51178117 -0.64901008
9  39  0.57578135          NA
10 43 -0.62124058 -1.26361438

Note: Per the help output for the join functions, the order of the rows and columns of x is preserved as much as possible; thus, you can see that the outputs above and below contain the same information but differ in row order.

The analogous option for the situation above is left_join:

z <- left_join(x,y,by="ID")
   ID           x           y
1   4  0.73832471  1.25408311
2  39  0.57578135          NA
3   1 -0.30538839  0.01915639
4  34  1.51178117 -0.64901008
5  23  0.38984324  0.61824329
6  43 -0.62124058 -1.26361438
7  14 -2.21469989 -0.11582532
8  18  1.12493092 -1.42509839
9  33 -0.04493361 -1.56378205
10 21 -0.01619026  0.14377148

and for the situation where we want to simply merge both datasets and keep all IDs (and therefore coerce NAs for the missing data), we use full_join:

z <- full_join(x,y,by="ID")
   ID           x           y
1   4  0.73832471  1.25408311
2  39  0.57578135          NA
3   1 -0.30538839  0.01915639
4  34  1.51178117 -0.64901008
5  23  0.38984324  0.61824329
6  43 -0.62124058 -1.26361438
7  14 -2.21469989 -0.11582532
8  18  1.12493092 -1.42509839
9  33 -0.04493361 -1.56378205
10 21 -0.01619026  0.14377148
11 15          NA -1.06559058
12 20          NA  1.15653700
...

Note: The full merge contains 41 rows (the 10 IDs from x plus the 31 y-only IDs), but I truncated the output after 12 observations to limit the output.

6.1.4 Merging multiple datasets

As a motivating example, consider my work that genotyped several hundred participants for over 800,000 SNPs. The genotyping technology I used was limited to 96 samples per array, and the data format for each array produced a file where each row = SNP (i.e. 800,000 rows) and each column = participant (i.e. each plate produced an 800,000 x 97 data.frame where column 1 = ID and columns 2 - 97 = participants). Since I had several hundred participants, multiple arrays were needed, all of which had the exact same format as described above.

The join variants (and really, everything in dplyr and the tidyverse) were built with efficiency in mind and take advantage of what is known as an infix operator, introduced by the package magrittr. The infix operator %>% passes the left-hand side of the operator to the first argument of the function on the right-hand side. Note: We’ve seen the % notation before with the %in% function.

For example, if we consider the previously created z, then the two following codes will produce the same results:

head(z)       # usual notation
  ID          x           y
1  4  0.7383247  1.25408311
2 39  0.5757814          NA
3  1 -0.3053884  0.01915639
4 34  1.5117812 -0.64901008
5 23  0.3898432  0.61824329
6 43 -0.6212406 -1.26361438
z %>% head()  # infix operator
  ID          x           y
1  4  0.7383247  1.25408311
2 39  0.5757814          NA
3  1 -0.3053884  0.01915639
4 34  1.5117812 -0.64901008
5 23  0.3898432  0.61824329
6 43 -0.6212406 -1.26361438

That is, z %>% head() is equivalent to head(z). The %>% operator can be called multiple times to “chain” functions together, which is great when you want to merge multiple datasets. Consider the following simplified data for the motivating example described above:

set.seed(1)
x1 <- data.frame(ID=sort(sample(1:50,10,replace=F)),x1=1:10)  # Plate 1
x2 <- data.frame(ID=x1$ID,x2=2*x1$ID)                         # Plate 2
x3 <- data.frame(ID=x1$ID,x3=3*x1$ID)                         # Plate 3
x4 <- data.frame(ID=x1$ID,x4=4*x1$ID)                         # Plate 4
z <- x1 %>% full_join(x2) %>% full_join(x3) %>% full_join(x4)

In this case, all 4 plates share the same IDs but have “data” values that differ by plate:

   ID x1 x2  x3  x4
1   1  1  2   3   4
2   4  2  8  12  16
3  14  3 28  42  56
4  18  4 36  54  72
5  21  5 42  63  84
6  23  6 46  69  92
7  33  7 66  99 132
8  34  8 68 102 136
9  39  9 78 117 156
10 43 10 86 129 172

It should come as no surprise we can use the merge function in a similar fashion, but we’ll leave it to you to figure out the details.
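
As a hint, one possible sketch (not necessarily the cleanest) is to let base R’s Reduce chain the pairwise merges for you:

z2 <- Reduce(function(a, b) merge(a, b, by="ID"), list(x1, x2, x3, x4))   # merge the plates pairwise, left to right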


6.1.5 Transform: Long > Wide format

Continuing with the idea of working with multiple datasets, let’s consider another motivating example involving longitudinal data. GuysWork is a program that is aimed at helping young males explore alternatives to pervasive and unhealthy norms around masculinity and is currently being implemented in several schools in Nova Scotia. The ongoing quasi-experimental design involves a questionnaire that is administered at baseline and several times as the students progress from Grade 6 through 8.

In this example, we have a similar scenario as above where we’ll want to merge the dataset together, but the format we get may be in the long or wide format, which are both useful depending on the type of analysis you are doing. It should come as no surprise that we’ll use the tidyverse package again.

To do this, we’ll use data from a paper that can be found in the aforementioned Google Drive. The study was conducted in 16 boys and 11 girls that, at ages 8, 10, 12, and 14, had their distance (mm) from the center of the pituitary gland to the pterygomaxillary fissure (a vertical fissure in the skull) measured. The goals of the study were to describe the distance in boys and girls as simple functions of age, and then to compare the functions for boys and girls.

In this case, the data formatting involves merging multiple follow-up datasets and contains the following variables:

  • id: Patient ID

  • sex: Sex of the patient (Girl or Boy)

  • distance: Measured pituitary-pterygomaxillary distance

  • age: Age of the patient when the measurement was taken (8, 10, 12, or 14)

and a quick head of the data shows:

head(dental)
  id  sex distance age
1  1 Girl     21.0   8
2  1 Girl     20.0  10
3  1 Girl     21.5  12
4  1 Girl     23.0  14
5  2 Girl     21.0   8
6  2 Girl     21.5  10

The problem with the “long” format is that it has multiple rows/observations linked to the same id, so if we (for example) wanted to do summary statistics on the static variables (i.e. everything except age and distance), this format is not ideal - obviously this isn’t too bad here, but most of the data you work with probably has more demographic variables.

To deal with this, we can use the function pivot_wider from the tidyverse to move from the long to the wide format (and pivot_longer to move from wide to long). Continuing to follow the more common syntax for the tidyverse, we’ll add in more of the infix operator %>% to make things easier, but first, let’s look at the help output for pivot_wider to see all the inputs:

pivot_wider(
  data,
  ...,
  id_cols = NULL,
  id_expand = FALSE,
  names_from = name,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_vary = "fastest",
  names_expand = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL,
  values_fn = NULL,
  unused_fn = NULL
)

Unfortunately, this is an example where the learning curve for R (particularly the tidyverse) can get a bit steep. In our example, id and sex are the static variables, while distance and age are the time-varying variables: the former is the actual outcome of interest while the latter establishes a temporal order (i.e. the time component of our longitudinal data). To reshape the data, we need to give R two specific inputs plus an optional third.

dental_wide <- dental %>% pivot_wider(values_from=distance,
                                      names_from = age,
                                      names_prefix="y")

The two main options:

  • values_from: The variable (i.e. distance) that contains the values for the “new” outcome we want to use and reshape so that they become the new columns

  • names_from: The variable (i.e. age) that is the “name” for the outcome, which really is just the variable that allows us to establish the temporal order.

It is important to note that these two options are paired together and pivot_wider requires inputs for both of them. The third input is names_prefix, which isn’t strictly necessary; however, since we draw our names from age, which is numerical, and R does not like a variable name to be purely numerical, I’ve added it both to highlight the typical notation of y being an outcome and to keep R happy with respect to the naming convention. As we can see below, although dental is a regular data.frame, dental_wide will now be a tibble, so calling dental_wide prints out much nicer.

# A tibble: 27 × 6
      id sex      y8   y10   y12   y14
   <int> <chr> <dbl> <dbl> <dbl> <dbl>
 1     1 Girl   21    20    21.5  23  
 2     2 Girl   21    21.5  24    25.5
 3     3 Girl   20.5  24    24.5  26  
 4     4 Girl   23.5  24.5  25    26.5
 5     5 Girl   21.5  23    22.5  23.5
 6     6 Girl   20    21    21    22.5
 7     7 Girl   21.5  22.5  23    25  
 8     8 Girl   23    23    23.5  24  
 9     9 Girl   20    21    22    21.5
10    10 Girl   16.5  19    19    19.5
# ℹ 17 more rows

As mentioned, this format makes it easier to do some basic summary stats and (for example) if we only had pre-post data (i.e. measurements at two time points), we could quickly take the difference and have the format ready for a paired analysis (a quick sketch follows).
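
A small sketch of that idea, treating y8 and y14 as the “pre” and “post” measurements:

dental_wide %>% mutate(change = y14 - y8) %>% head()   # difference between the last and first measurements for each child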

6.1.6 Transform: Wide > Long format

To my knowledge, most software prefers the long format, so let’s assume that the data you were working with was in the wide format; then (not surprisingly) we’ll use the pivot_longer function. Again, we’ll look at the help for this function:

pivot_longer(
  data,
  cols,
  ...,
  cols_vary = "fastest",
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL
)

The main inputs we’ll need are:

  • cols: Which columns within the dataset that you want to reshape to the longer format

  • names_to: Specifies a new column (or columns) to create from the information stored in cols.

  • values_to: Specifies a new column (or columns) to create from the values stored in cols.

Thus, we could do the following:

dental_long <- dental_wide %>% pivot_longer(cols = 3:6,
                                            names_to = "age",
                                            values_to = "distance")
# A tibble: 108 × 4
      id sex   age   distance
   <int> <chr> <chr>    <dbl>
 1     1 Girl  y8        21  
 2     1 Girl  y10       20  
 3     1 Girl  y12       21.5
 4     1 Girl  y14       23  
 5     2 Girl  y8        21  
 6     2 Girl  y10       21.5
 7     2 Girl  y12       24  
 8     2 Girl  y14       25.5
 9     3 Girl  y8        20.5
10     3 Girl  y10       24  
# ℹ 98 more rows

There is nothing wrong with leaving the code like this, but clearly there are some ways we can make it better. Two points of contention (for me, anyway) are how we specified the columns via cols above and the ugliness of age.

We had to know the positions of the columns we want to reshape; however, this may not always be the case, and it’s more likely we know the names of the variables of interest. There are a few ways we can find their location, but a great option (once you get past the annoying syntax) is starts_with, which helps select variables that match a specified pattern. In our example, since we want to reshape all the y-based columns, then:

dental_wide %>% select(starts_with('y'))
# A tibble: 27 × 4
      y8   y10   y12   y14
   <dbl> <dbl> <dbl> <dbl>
 1  21    20    21.5  23  
 2  21    21.5  24    25.5
 3  20.5  24    24.5  26  
 4  23.5  24.5  25    26.5
 5  21.5  23    22.5  23.5
 6  20    21    21    22.5
 7  21.5  22.5  23    25  
 8  23    23    23.5  24  
 9  20    21    22    21.5
10  16.5  19    19    19.5
# ℹ 17 more rows

Before we update our pivot_longer code, let’s address the last issue, the presentation of age. In this case, there are a couple of ways to deal with it, including using regular expressions to remove the y-prefix in age. We previously used gsub to replace a pattern, so in this case we could just replace the “y” with nothing like so:

dental_long$age <- gsub("y","",dental_long$age)

The only problem we have now is that R believes that age is a character and so, even though we got rid of the y-prefix, it still maintains that class. We can get around this by re-saving age as numeric, i.e. using the function as.numeric; however, instead of going through these two steps, let's introduce the function parse_number. We know that age should be a number, and parse_number, which is another function within the tidyverse (from readr), will take any character vector and parse (i.e. extract) the first number it finds, dropping any non-numeric characters before the first number and all characters after it. This function is very useful but should be used with caution.

dental_long$age <- parse_number(dental_long$age)
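
To see why caution is warranted, here is a small sketch of how parse_number behaves on a few made-up strings (only the first number is kept; anything after it is silently dropped):

parse_number("y8")            # 8
parse_number("$1,234.50")     # 1234.5 (the default grouping mark "," is handled)
parse_number("week 8 of 12")  # 8 - the "12" is discarded without a warning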

Similar to using parse_number to simplify our process, let's take advantage of the infix (pipe) operator %>% and chain together pivot_longer, parse_number, and a final tidyverse function to replace the need to re-save over age like I did above. The function mutate can create, modify, and delete columns (i.e. age) - I'll leave it to you to explore the specifics of mutate, but the following code tells R that age should be modified with the output from parse_number.

dental_long <- dental_wide %>% pivot_longer(cols = starts_with('y'),
                                            names_to = "age",
                                            values_to = "distance") %>% 
                                mutate(age = parse_number(age))
# A tibble: 108 × 4
      id sex     age distance
   <int> <chr> <dbl>    <dbl>
 1     1 Girl      8     21  
 2     1 Girl     10     20  
 3     1 Girl     12     21.5
 4     1 Girl     14     23  
 5     2 Girl      8     21  
 6     2 Girl     10     21.5
 7     2 Girl     12     24  
 8     2 Girl     14     25.5
 9     3 Girl      8     20.5
10     3 Girl     10     24  
# ℹ 98 more rows

Thus, we've now chained together multiple functions into much cleaner (albeit more terse) code where we

  1. Call upon pivot_longer to source the dataset dental_wide,

  2. Isolate the columns that start with “y”,

  3. Create a new column called “age” that contains the column names from step #2,

  4. Create a new column called “distance” that contains the values from step #2,

  5. Mutate the new dataset by modifying age so that the values from #3 are only the numbers.

NOTE: Notice that when executing the modified code, I slightly changed how starts_with was written. This is because the earlier code used select, which tells R that dental_wide is the source of the data; however, the cols option within pivot_longer already performs that selection, so the extra select isn't needed.


6.2 Data Visualization

This really is an extension of Figures and Plots (Part III), but more advanced. There is a lot you can do with data visualization and, although the basics were already covered, there are better options out there. From my experience in academia and industry, the package ggplot2 has become the standard when it comes to creating visually pleasing figures/graphics in R. Moreover, the creator of ggplot2, Hadley Wickham, is also the author of the aforementioned open-source text R for Data Science.

R for Data Science (2e): Data Visualization.

Figure 6.1: R for Data Science (2e): Data Visualization.

As such, there is no point in writing any tutorials for ggplot2 and instead I’ll leave it to you to reference the Data visualization section of his online text, which has plenty of examples to work through.
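
That said, just to give a flavour of the grammar of graphics before you head there, a minimal sketch (using the dental_long data from the previous section, purely as a hypothetical example) could be:

library(ggplot2)   # loaded automatically with the tidyverse
ggplot(dental_long, aes(x = age, y = distance, colour = sex)) +
  geom_point() +                          # raw measurements
  geom_smooth(method = "lm", se = FALSE)  # a simple linear trend per sex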


7 Statistical Analysis

The assumption right now is that you are aware of and comfortable with all the underlying assumptions necessary to draw any (valid) conclusion from the following tests, particularly when it comes to the appropriateness of the methods for the type/scale of data being used. Generally speaking, my colleagues and I joke about/criticize flowcharts that describe which statistical test a person should use, but the reality is, particularly given how forgiving/robust some methods are, how loosely we "require" certain distributional assumptions, and the "magic" behind the central limit theorem, these flowcharts are more useful than not. The following is a nice one to reference by Antoine Soetewey, who created a blog called Stats and R that has some nice tidbits.


Statistical Test Flowchart

Figure 7.1: Statistical Test Flowchart


7.1 Basic Bivariate Tests

Previously, we described some of the basic descriptive statistics that are regularly reported, so now let’s talk about the basic and intermediate bivariate tests we might run. Along the way, we’ll reference back to several of the earlier sections and describe/give examples where I have “gone under the hood” to extract important components/calculations made by the following functions.


7.1.1 T-test: One-sample

Although the two-sample t-tests (independent and dependent) are more interesting, we still need to introduce the basic one-sample t-test. In this case, we'll perform the test on simulated data, but before we do that, let's quickly look at the help output:

t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)

t.test(formula, data, subset, na.action = na.pass, ...)

As we can see, by default t.test invokes a one-sample test and will shift to a two-sample test if we provide a second input y or a formula (per the second method that we will discuss below). Importantly, we can change the alternative and our hypothesized \(\mu_0\) via mu; the rest of the options will be ignored for now aside from our level of significance, which is set through conf.level (the complement of \(\alpha\)) and defaults to 0.95, i.e. \(\alpha = 0.05\).


7.1.1.1 Example 1: One-sample t-test (\(\textrm{H}_a {:} \mu \neq 0\))

We can do a basic one-sample test using the default values of a two-sided test, \(\textrm{H}_0 {:} \mu = 0\), and \(\alpha = 0.05\) by:

set.seed(1)
x <- rnorm(30,1,1)
t.test(x)

    One Sample t-test

data:  x
t = 6.4157, df = 29, p-value = 5.126e-07
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.7373858 1.4275306
sample estimates:
mean of x 
 1.082458 

We should not be surprised by these results since the t-test is testing \(\textrm{H}_0 {:} \mu = 0\) vs. \(\textrm{H}_a {:} \mu \neq 0\), but the data were generated from a normal distribution \(X \sim N(\mu = 1, \sigma = 1)\).


7.1.1.2 Example 2: One-sample t-test (\(\textrm{H}_a {:} \mu \neq 1\))

If we want to test the "appropriate" mean, then we change the t-test to:

t.test(x,mu=1)

    One Sample t-test

data:  x
t = 0.48873, df = 29, p-value = 0.6287
alternative hypothesis: true mean is not equal to 1
95 percent confidence interval:
 0.7373858 1.4275306
sample estimates:
mean of x 
 1.082458 

We get results that make a bit more sense.


7.1.1.3 Example 3: One-sample t-test (\(\textrm{H}_a {:} \mu > 1\))

Just to get used to the syntax, let's perform a one-sided, upper-tailed test of \(\textrm{H}_a {:} \mu > 1\):

t.test(x,mu=1,alt='great')

    One Sample t-test

data:  x
t = 0.48873, df = 29, p-value = 0.3144
alternative hypothesis: true mean is greater than 1
95 percent confidence interval:
 0.7957804       Inf
sample estimates:
mean of x 
 1.082458 


7.1.2 T-test: Two-sample Independent

Returning to the hsb dataset that we imported when introducing the tidyverse package (under Installing and loading packages), let's start testing to see if math scores differ by sex (Hint: They don't\(^*\)). There are a few different ways we can do a t-test in R, as well as adjust the assumptions (e.g. equal variance) and parameters surrounding it. Let's start with the basics and assume:

  1. Unequal variance (\(\sigma_1^2 \neq \sigma_2^2\))

  2. Two-tailed test

  3. A significance level of \(\alpha = 0.05\)

Then to invoke this test, we can either set it up where x and y become the continuous values for group 1 and 2, respectively, or input a formula (i.e. y ~ x). In this case, the latter is the better option, as it is much cleaner looking, but we’ll show both codes below:

t.test(math ~ sex, data) # Using the formula option

    Welch Two Sample t-test

data:  math by sex
t = -0.41097, df = 187.58, p-value = 0.6816
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -3.193325  2.092206
sample estimates:
mean in group Female   mean in group Male 
            52.39450             52.94505 

From the output, we can see the test-statistic (i.e. t = -0.411), which isn't particularly strong, which is not surprising given the \(p\)-value (i.e. p = 0.6816). By default, R will do a two-sided test, but we can change that (and any other parameter) fairly easily. Recall from the one-sample t-test above that we can change the alternative hypothesis, our hypothesized value \(\mu_0\) via mu, whether this is an independent (the default) or dependent test via paired, and our assumption about homogeneity via var.equal (only relevant for the independent test).

Let's redo the test using the x and y input option, but we'll change the assumption of homogeneity to true so that we don't get Welch's results and instead use the pooled variance and the simpler calculation of the degrees of freedom:

t.test(data$math[data$sex=='Male'],data$math[data$sex=='Female'], var.equal=T) # Using input 1/2 option

    Two Sample t-test

data:  data$math[data$sex == "Male"] and data$math[data$sex == "Female"]
t = 0.413, df = 198, p-value = 0.6801
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.078294  3.179413
sample estimates:
mean of x mean of y 
 52.94505  52.39450 

Not surprisingly, although we do see some minor changes, the overall significance doesn't really change by adjusting our assumption of homogeneity. The only notable change is that our test-statistic (i.e. t = 0.413) is positive, but that's because the order was Males vs. Females whereas in the code above it was Females vs. Males, hence the change in sign but negligible change to the magnitude of the test-statistic. Based on how tedious the coding is for this second option, you can see why I prefer to use the formula option where possible - I'll leave it to you to explore the impacts of changing any of the other parameters.


\(^*\)Plug for Mixed-Methods: This is completely unrelated to the documentation, but as someone who was trained as a statistician (whereby the focus was previously "on the numbers" and less so on the "why") and then became an epidemiologist, mixed methods was a bit of an eye-opener for me.

For the uninitiated, mixed methods is a research approach that combines both qualitative and quantitative research methods to leverage the strengths of each approach to provide a richer and more robust understanding of the research problem. In an Introduction to Health Research Methods course I used to teach, I would mention the subject above and highlight how the literature, including these two meta-analyses \([\)1,2\(]\), showed no differences in math scores by gender, which segued to work by Grossman et al. who wanted to understand why fewer girls study STEM subjects.

In their mixed-method study, they focused on urban youth's perceptions of gender and racial/ethnic barriers to STEM success, and their meaning-making and coping regarding these experiences. One of the themes resulting from the study was the impact of microaggressions, and I'll leave you with an excerpt from one of the interviewees,

“You could have the brains, the techniques, but I think they’re going to look at you a different way if you’re a female… I think that they’re possibly just trying to like push them down ’cause they think that men are like smarter than women… I always think about the person who found DNA, and I researched that, and it actually says that a woman found it first, but a guy stole it from her.”
- African American Female Youth

This study highlights the range of perceptions of discrimination and stereotypes among high school students based on their gender/sex or ethnicity, and how the belief that STEM engagement is (or is not) compatible with their gender and racial/ethnic identities may hinder or support their STEM pursuits.

Lived Experiences: I bring up all of this, in part, due to lived experiences. When I was in Grade 6, there was a girl named Wendy in my math class, and I remember very little of her with one exception: our teacher telling her, while I sat next to her, that she, "did math pretty good for a girl". I didn't think much of it at the time and it is possible neither did she, but you can probably imagine that someone like her could have pondered, "am I not supposed to be good at math?" This "compliment" was (seemingly) meant to be innocuous, but can clearly cause greater harm than intended. As an educator (and a father of 3 girls), I encourage you to be mindful of how you interact and mentor those junior to you; your actions can (and will) have an everlasting effect.

Distributions: Previously, we worked with the code rnorm to generate random data from the normal distribution. Given we are now working with a t-test, it would be good to talk about the t-distribution and associated R codes, particularly as we just referenced a \(p\)-value above. Some important codes related to the t-distribution are:

  • pt: Distribution function for the t-distribution where the (main) inputs are q = t-value (i.e. quantile) of interest, df = degrees of freedom, and lower.tail = TRUE (default) to determine \(P(t \le)\) or, if FALSE, \(P(t >)\).

  • qt: Quantile function for the t-distribution where the (main) inputs are p = probability, df = degrees of freedom, and lower.tail = TRUE (default) to determine the (critical) right/positive or, if FALSE, the left/negative t-value.

Please note there is an equivalent rt to the normal’s rnorm to generate random data from the t-distribution, but generally it’s not used - I’ll leave it to you to explore if you want.
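
As a quick sketch of how these tie back to the output above, the two-sided p-value from Example 2 (t = 0.48873, df = 29) can be recovered manually:

2 * pt(0.48873, df = 29, lower.tail = FALSE)  # ~0.6287, the p-value from Example 2
qt(0.975, df = 29)                            # ~2.045, the two-sided critical t-value at alpha = 0.05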


7.1.3 T-test: Paired

Since the hsb dataset doesn't have any paired data, we'll just simulate some data again, this time looking at the weight of mice pre/post treatment. Moreover, we'll show how to perform the paired t-test using two options.

# Weight of the mice Post-Tx
Post <- c(200.1, 190.9, 192.7, 213, 241.4, 196.9, 172.2, 185.5, 205.2, 193.7)
# Weight of the mice Pre-treatment
Pre <- c(392.9, 393.2, 345.1, 393, 434, 427.9, 422, 383.9, 392.3, 352.2)

In the first option, we’ll do the test where we input a x and y option since the Pre and Post data simulated make this the easiest option. The process itself is almost identical to the two-sample test above except now we have to set the option: paired = TRUE.

Important: Order matters here, particularly in the context of the problem. By default, R assumes that the x and y inputs will result in a hypothesis test of \(\textrm{H}_a {:} \mu_x - \mu_y \neq 0\), so we need to be careful of the order of the input.

Therefore, to test the alternative hypothesis \(\textrm{H}_a {:} \mu_{post} - \mu_{pre} \neq 0\), we do:

t.test(Post,Pre,paired=TRUE)

    Paired t-test

data:  Post and Pre
t = -20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -215.5581 -173.4219
sample estimates:
mean difference 
        -194.49 

Alternatively, since we know that a dependent, paired test is essentially a one-sample t-test on the difference between values, we could have also done:

d <- Post-Pre
t.test(d)

    One Sample t-test

data:  d
t = -20.883, df = 9, p-value = 6.2e-09
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -215.5581 -173.4219
sample estimates:
mean of x 
  -194.49 

which produces the exact same results as above. Similar to before, we can change the options to (for example) \(\textrm{H}_a {:} \mu < -50\) to do a one-sided lower test with some threshold, which makes sense for our particular example:

t.test(Post, Pre, paired=TRUE, alt='less', mu=-50)  # Option 1
t.test(d, alt='less', mu=-50)                       # Option 2

    Paired t-test

data:  Post and Pre
t = -15.514, df = 9, p-value = 4.207e-08
alternative hypothesis: true mean difference is less than -50
95 percent confidence interval:
      -Inf -177.4177
sample estimates:
mean difference 
        -194.49 

Notice that the test-statistic isn't as "large" in magnitude (nor is the \(p\)-value as small) since the strength of the evidence (i.e. the distance between the average difference and \(\mu = -50\)) isn't as "far away".


7.1.4 ANOVA: One-Way

Continuing to use the hsb dataset, let’s now look at comparing science scores based on ses. This relationship is well-established in the literature, particularly as students from lower SES backgrounds tend to have lower academic achievement and slower rates of academic progress compared to students from higher SES backgrounds; however, it is clearly a complex relationship and SES is often a surrogate/proxy for other variables.

To perform an ANOVA, there "really" is only one proper function for it: aov; however, there is another function within R called anova, which makes this all the more confusing/annoying. In short, the former actually fits an ANOVA model, whereas the latter computes an ANOVA table for one (or more) fitted models. If you want to get more into the specifics, you can take a look at the help outputs for both functions or take a look at the following Stack Overflow Thread. It is worth noting that these two functions are part of base R and do not include ANOVA functions from other packages, including anova_test from the rstatix package I previously introduced.

If you recall your first introduction to ANOVA, your professor should have connected ANOVA with linear regression; in particular, regression and ANOVA, although different statistical methods, share many similar features (and assumptions). Moreover, they are mathematically identical when the independent variable(s) are categorical, i.e. ANOVA is a special case of regression. There is nothing that an ANOVA can tell you that regression cannot derive itself, as the outputs for both methods show.

The close relationship may be more apparent when you consider the (older) R code for ANOVA: aov(lm(y~x)). Nowadays, you no longer need to specify the lm portion of the code; however, if you venture off on your own, you may see examples that still use lm. Regardless of whether you use it or not, there are essentially only two (important) inputs: formula and data. There are a lot more options; however, I'll leave it to you to dive more into that. The formula is the same as the one above and the one we used earlier when we plotted the boxplot for all 3 types of hotdogs.

As such, to get the standard ANOVA table we’re familiar with, we first do:

model <- aov(science ~ ses, data) # or aov(lm(science ~ ses, data))

followed by either of the following two options:

summary(model)    # Option 1: summary, which was introduced in section 2.7 Descriptive Stats
anova(model)      # Option 2: anova(aov), very meta

which gives us:

Analysis of Variance Table

Response: science
           Df  Sum Sq Mean Sq F value    Pr(>F)    
ses         2  1561.6  780.79  8.5711 0.0002696 ***
Residuals 197 17945.9   91.10                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

None of this is surprising, particularly when we look at the boxplots; however, before we go into this, if we do a quick summary of the science scores by ses, we get:

with(data,tapply(science,ses,summary))
$High
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  26.00   50.00   58.00   55.45   62.50   69.00 

$Low
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   29.0    39.5    47.0    47.7    54.5    69.0 

$Medium
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34.00   44.00   53.00   51.71   58.00   74.00 

This tells us very similar results to what the boxplot will show in a moment, but it reminds me that I didn't "prepare" the data. If you look at the output above, you can see that the results are ordered based on High, Low, and Medium, which is clearly not ordered by rank but by name. Let's quickly order ses and repeat the above, along with the boxplot.

with(data,tapply(science,factor(ses,c('Low','Medium','High')),summary))
$Low
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   29.0    39.5    47.0    47.7    54.5    69.0 

$Medium
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34.00   44.00   53.00   51.71   58.00   74.00 

$High
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  26.00   50.00   58.00   55.45   62.50   69.00 

TIP: I'm not sure if this is a tip or a general rule of thumb, but I never overwrite an object in R, be it a data.frame or a variable; rather, I create a new one in case I make a mistake and have to re-run the entire script again.

boxplot(science ~ factor(ses,c('Low','Medium','High')), data,xlab="SES")
Boxplot: Comparing Science scores by SES

Figure 7.2: Boxplot: Comparing Science scores by SES

As we can see, there is clearly an increasing trend that matches the relationship described by the literature above. Now, since I already mentioned the rstatix package, the following code produces the same results using the anova_test function and is written using typical tidyverse syntax, although that is not necessary.

library(rstatix)
data %>% anova_test(science ~ ses)
ANOVA Table (type II tests)

  Effect DFn DFd     F       p p<.05  ges
1    ses   2 197 8.571 0.00027     * 0.08

TIP: The output from the anova_test function from rstatix is quite terse; however, it has all the important pieces and some advantages include:

  • Providing an effect size (i.e. ges, which stands for generalized eta squared) that can be changed (i.e. the other option is pes, which stands for partial eta squared)

  • It does not require the extra step of calling lm, and

  • It is very flexible, as the same function can be used to perform the standard independent-measures ANOVA, repeated measures ANOVA, mixed ANOVA, and ANCOVA.

WARNING: (Sort of) Another advantage of the anova_test function, or rather a disadvantage of the aov function, is that the latter requires a balanced design. I am not going to dive into this mess, but when data are unbalanced (i.e. group sample sizes differ), Type I ANOVA can be biased, and so Type II/III ANOVAs are often preferred. In reality, the aov function is decently robust as long as the imbalances are not extreme, and the problems are more applicable to higher-order designs (i.e. bigger than a one-way ANOVA). Unfortunately, the term "extreme" is very vague and subjective; however, in the example above we have an unbalanced design (i.e. Low = 47, Medium = 95, High = 58), and yet the results (i.e. aov = Type I, anova_test = Type II) are the same. In the next section, we'll highlight the difference the unbalanced design can have on the two functions, particularly when we consider interactions.


Assumptions: I’m not going to dive into the specifics of the underlying assumptions of ANOVA but rather list out some functions that will be useful; however, for more information, both Stats and R and the R-Project have additional details and examples involving several of the functions listed below.

  • hist: Histogram

  • qqnorm and resid: Produces a normal QQ plot based on inputted values, such as residuals, that can be calculated from any ANOVA model. An alternative to qqnorm, which is a little prettier and more akin to ggplot2 is the function ggqqplot from the ggpubr package.

  • shapiro.test: Shapiro-Wilk Normality Test (shapiro_test is the equivalent from rstatix)

  • leveneTest from car package: Levene’s test for Homogeneity

  • check_sphericity: Mauchly’s Test of Sphericity
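
As a rough sketch of how a few of the functions above fit together (assuming the one-way model object fit earlier with aov, and that the car package is installed):

hist(resid(model))                              # rough look at the residuals
qqnorm(resid(model)); qqline(resid(model))      # normal QQ plot of the residuals
shapiro.test(resid(model))                      # Shapiro-Wilk test on the residuals
library(car)
leveneTest(science ~ factor(ses), data = data)  # Levene's test for homogeneity across SES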

Distributions: Similar to above for the t-test, some important codes related to the F-distribution are:

  • pf: Distribution function for the F-distribution where the (main) inputs are q = F-value of interest, df1 and df2 = degrees of freedom, and lower.tail = TRUE to determine \(P(F \le)\) or, if FALSE, \(P(F >)\).

  • qf: Quantile function for the F-distribution where the (main) inputs are p = probability, df1 and df2 = degrees of freedom, and lower.tail = TRUE to determine the (critical) left or, if FALSE, the right-sided F-value.
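
For example, the p-value in the ANOVA table above (F = 8.5711 on 2 and 197 degrees of freedom) can be recovered with:

pf(8.5711, df1 = 2, df2 = 197, lower.tail = FALSE)  # ~0.00027, matching Pr(>F) above
qf(0.95, df1 = 2, df2 = 197)                        # critical F-value at alpha = 0.05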


7.1.5 ANOVA: Post-hoc comparison

To finish off, let's quickly figure out where the differences lie and then talk about testing assumptions. Per your own training, you've probably come across a variety of post-hoc tests, but I'm only going to focus on Tukey's and leave it to you to explore the other variations. The function TukeyHSD is the base version, works with the aov model, and is called like:

TukeyHSD(model)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = science ~ ses, data = data)

$ses
                 diff          lwr        upr     p adj
Low-High    -7.746148 -12.16980500 -3.3224914 0.0001547
Medium-High -3.743013  -7.49896111  0.0129357 0.0510149
Medium-Low   4.003135  -0.01646732  8.0227383 0.0512092

There is also an rstatix equivalent called tukey_hsd; however, it actually calls on the original function. Recall: Although the type of ANOVA will differ between aov and anova_test (i.e. Type I vs. II), Tukey's HSD is about finding where the differences (if they exist) lie. To invoke the rstatix version in typical tidyverse syntax, we can do:

data %>% tukey_hsd(science ~ ses)
# A tibble: 3 × 9
  term  group1 group2 null.value estimate conf.low conf.high    p.adj
* <chr> <chr>  <chr>       <dbl>    <dbl>    <dbl>     <dbl>    <dbl>
1 ses   High   Low             0    -7.75 -12.2      -3.32   0.000155
2 ses   High   Medium          0    -3.74  -7.50      0.0129 0.051   
3 ses   Low    Medium          0     4.00  -0.0165    8.02   0.0512  
# ℹ 1 more variable: p.adj.signif <chr>

TIP: Notice that the tibble message indicates that there is one more variable. This is where the tidyverse does annoy me, since we probably want to see the entire table. To deal with this issue, you can tell R specifically how you want to print out the tibble. To find more of the nuances, you can look at the help for the print method for a tibble by doing help(print.tbl) but, for now, I'll just show you this code, which specifies printing out all the columns - your guess as to why the option is Inf and not "all" is as good as mine:
print(data %>% tukey_hsd(science ~ ses),width=Inf)
# A tibble: 3 × 9
  term  group1 group2 null.value estimate conf.low conf.high    p.adj
* <chr> <chr>  <chr>       <dbl>    <dbl>    <dbl>     <dbl>    <dbl>
1 ses   High   Low             0    -7.75 -12.2      -3.32   0.000155
2 ses   High   Medium          0    -3.74  -7.50      0.0129 0.051   
3 ses   Low    Medium          0     4.00  -0.0165    8.02   0.0512  
  p.adj.signif
* <chr>       
1 ***         
2 ns          
3 ns          

Distributions: Some important codes related to the Studentized range distribution used in Tukey's are:

  • ptukey: Distribution function for the Studentized range distribution where the (main) inputs are q = Q-value of interest, nmeans = number of \(k\) treatments (and therefore means), and df = degrees of freedom, and lower.tail = TRUE to determine \(P(Q \le)\) or, if FALSE, \(P(Q >)\).

  • qtukey: Quantile function for the Studentized range distribution where the (main) inputs are p = probability, nmeans = number of \(k\) treatments, and df = degrees of freedom, and lower.tail = TRUE to determine the (critical) left or, if FALSE, the right-sided Q-value.
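
As a rough sketch (assuming the MSE of 91.10 from the ANOVA table and the group sizes mentioned earlier, Low = 47 and High = 58), the Low-High adjusted p-value can be approximated via the Tukey-Kramer standardization:

qtukey(0.95, nmeans = 3, df = 197)                     # critical Q-value at alpha = 0.05
q <- 7.746148 / sqrt((91.10 / 2) * (1 / 47 + 1 / 58))  # standardize the Low-High difference
ptukey(q, nmeans = 3, df = 197, lower.tail = FALSE)    # ~0.00015, close to the p adj reported above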

If you are interested in comparing a control group with several treatment groups, then you can find the corresponding Dunnett's test in the package multcomp, invoked through the function glht, or via DunnettTest from the DescTools package. Personally, I like the latter because there are a lot of extra steps to do Dunnett's test via multcomp. If you really want to go down that road, see the post from Stats and R. As for the latter, it can be invoked in a similar fashion as tukey_hsd by simply specifying the formula:

library("DescTools")
DunnettTest(science ~ ses,data)

  Dunnett's test for comparing several treatments with a control :  
    95% family-wise confidence level

$High
                 diff     lwr.ci     upr.ci   pval    
Low-High    -7.746148 -11.913671 -3.5786257 0.0001 ***
Medium-High -3.743013  -7.281488 -0.2045376 0.0363 *  

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: Right now the output isn't really "interpretable" since there is no proper reference group and, since I don't have ses set up as a factor, the reference group defaults to High.


7.1.6 ANOVA: N-Way

For anything bigger than a one-way ANOVA, we follow the same procedures to examine the main and/or interaction effects with minor modifications to the syntax. For starters, in general, the formula to add in any main effects looks like:

\[y \sim x_1 + x_2 + \dots + x_n\] and if we want to look at interaction effects, there are two coding options: asterisk (*) or colon (:); however, the latter is for an interaction-effect only. That is:

Option 1: x1*x2 is interpreted as: \(y \sim x_1 + x_2 + x_1 \cdot x_2\)

Option 2: x1:x2 is interpreted as: \(y \sim x_1 \cdot x_2\)


7.1.6.1 Example 1: Two-way with interaction

Using the same hsb dataset, let’s examine the relationship between ses and race on science scores using a two-way ANOVA with interaction.

Note: Previously, in the one-way ANOVA, I adjusted the reference group for the boxplot so that ses was a factor with levels: Low, Medium, High; however, the "default" (since ses is not a factor and doesn't have a proper "order") is to order based on the name, indicating that the "order" in the results below is based on: High, Low, Medium. Obviously, the base/reference group may be relevant, particularly when it comes to interpreting the results, and so you should adjust your categorical variables accordingly.

Given we've established the data are not balanced, we'll opt for the anova_test function, which we can invoke using either of the following options:

data %>% anova_test(science ~ ses*race)               # Option 1
data %>% anova_test(science ~ ses + race + ses:race)  # Option 2
ANOVA Table (type II tests)

    Effect DFn DFd      F        p p<.05   ges
1      ses   2 188  4.000 2.00e-02     * 0.041
2     race   3 188 10.530 1.96e-06     * 0.144
3 ses:race   6 188  0.829 5.49e-01       0.026

WARNING: As previously mentioned, an advantage of anova_test is that we don't require a balanced design. Moreover, type = 2 is the default for the function, which will yield identical results to type = 1 when data are balanced, but type = 2 will additionally yield various assumption tests where appropriate. Additionally, for those familiar with SPSS/STATA, type = 3 is also available, which produces the following:

data %>% anova_test(science ~ ses*race, type=3)
ANOVA Table (type III tests)

    Effect DFn DFd     F        p p<.05   ges
1      ses   2 188 0.661 5.17e-01       0.007
2     race   3 188 9.051 1.26e-05     * 0.126
3 ses:race   6 188 0.829 5.49e-01       0.026

I will not get into the debate of which type should be used, as I am biased and am not a fan of the interpretation for Type III; that is, under Type III, if a significant interaction is present, the main effects should not be analysed. If you want to read more, feel free to look at the following references to formulate your own opinion \([\)3,4\(]\).


7.1.6.2 Example 2: Two-way with interaction and confounder

Let’s repeat Example 1 but add in prog as a confounder. Although there was no significant interaction in Example 1, I’m going to continue with an interaction model. As such, one way to do it is:

data %>% anova_test(science ~ ses*race + prog)
ANOVA Table (type II tests)

    Effect DFn DFd     F        p p<.05   ges
1      ses   2 186 2.759 6.60e-02       0.029
2     race   3 186 9.989 3.89e-06     * 0.139
3     prog   2 186 6.011 3.00e-03     * 0.061
4 ses:race   6 186 0.926 4.78e-01       0.029

As mentioned above, for each covariate/confounder (where an interaction term isn’t of interest), we simply “add” the variable to our formula above. In this case, there is clearly a significant association between science and prog, along with the expected (main) results for ses and race and their interaction (or lack thereof).


7.1.6.3 Example 3: Two-way ANCOVA with interaction and confounder

The nice thing about R is that we call ANOVA and ANCOVA the same way, so although our confounder in Example 2 was categorical, even if it were continuous we'd still run similar code to get an ANCOVA. For example, if we wanted to do an ANCOVA using read as a continuous independent variable, we would do:

data %>% anova_test(science ~ ses*race+read)
ANOVA Table (type II tests)

    Effect DFn DFd      F        p p<.05   ges
1      ses   2 187  0.493 6.12e-01       0.005
2     race   3 187  6.564 3.05e-04     * 0.095
3     read   1 187 89.624 1.28e-17     * 0.324
4 ses:race   6 187  0.959 4.54e-01       0.030


7.1.7 Chi-square test

Returning to the hotdog example, we could probably have seen from the contingency table (under Quick (Contingency) Tables) that the poultry hotdogs were the healthiest of the set from a calorie perspective. We could have also looked at the averages:

with(dat2,tapply(Calories, Type, mean))
     Beef      Meat   Poultry 
156.85000 158.70588  92.82353 

or simply looked at it visually:


Boxplot: Comparing calories by hotdog type

Figure 7.3: Boxplot: Comparing calories by hotdog type

But, despite how intuitive this may all feel, we need to test this formally. The chi-square (\(\chi^2\)) test (and the closely related Fisher's exact test) is used to test independence or association. Essentially, we can use it to determine if there is a relationship between calorie count (dichotomized as Healthy in the contingency table cont) and Type of hotdog. We can do this by:

chisq.test(cont)

    Pearson's Chi-squared test

data:  cont
X-squared = 14.611, df = 2, p-value = 0.0006717

It is worth noting there is also the chisq_test function from rstatix. Similarly, Fisher's exact test can be run on the same table:

fisher.test(cont)

    Fisher's Exact Test for Count Data

data:  cont
p-value = 0.0001979
alternative hypothesis: two.sided

TIP: My work in genetic epidemiology involves GWAS, and being able to (for example) extract "important" information from literally hundreds of thousands of these types of tests is quite useful. A key function I've found incredibly useful is str, as it gives a breakdown of the structure of an object. In this case, if we do:

str(chisq.test(cont))

We can see a list of 9 different elements, which re-highlights the usefulness of lists as described in the Basics of Programming and Data Wrangling, but for now, let's focus on the third entry: p.value.

List of 9
 $ statistic: Named num 14.6
  ..- attr(*, "names")= chr "X-squared"
 $ parameter: Named int 2
  ..- attr(*, "names")= chr "df"
 $ p.value  : num 0.000672
 $ method   : chr "Pearson's Chi-squared test"
 $ data.name: chr "cont"
 $ observed : 'table' int [1:2, 1:3] 11 9 9 8 0 17
  ..- attr(*, "dimnames")=List of 2
  .. ..$ Healthy: chr [1:2] "FALSE" "TRUE"
  .. ..$ Type   : chr [1:3] "Beef" "Meat" "Poultry"
 $ expected : num [1:2, 1:3] 7.41 12.59 6.3 10.7 6.3 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ Healthy: chr [1:2] "FALSE" "TRUE"
  .. ..$ Type   : chr [1:3] "Beef" "Meat" "Poultry"
 $ residuals: 'table' num [1:2, 1:3] 1.32 -1.012 1.077 -0.826 -2.509 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ Healthy: chr [1:2] "FALSE" "TRUE"
  .. ..$ Type   : chr [1:3] "Beef" "Meat" "Poultry"
 $ stdres   : 'table' num [1:2, 1:3] 2.1 -2.1 1.64 -1.64 -3.82 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ Healthy: chr [1:2] "FALSE" "TRUE"
  .. ..$ Type   : chr [1:3] "Beef" "Meat" "Poultry"
 - attr(*, "class")= chr "htest"

By using the str function, I can easily extract the \(p\)-value if I were (for example) running a loop and testing multiple genetic variants for their association with cancer by doing the following code:

chisq.test(cont)$p.value    # Calling the object/variable name
chisq.test(cont)[[3]]       # Knowing it's in the third position of the list
[1] 0.0006717336
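
As a minimal (hypothetical) sketch of that looping idea, assuming you had a list of contingency tables (here I just reuse cont twice as a stand-in; in practice each table would differ):

tabs <- list(variant1 = cont, variant2 = cont)       # hypothetical stand-ins
sapply(tabs, function(tab) chisq.test(tab)$p.value)  # one p-value per table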

TIP: If we recall, the only "real" assumption for a \(\chi^2\)-test of association is that the expected count in every cell be at least 5 (i.e. \(E_{ij} \ge 5\)), which we can confirm with the following:

chisq.test(cont)$expected
       Type
Healthy      Beef      Meat   Poultry
  FALSE  7.407407  6.296296  6.296296
  TRUE  12.592593 10.703704 10.703704

If this assumption fails (and hence we'd use Fisher's exact test instead), you'd probably get the following (faked) warning message:

Warning message:
In chisq.test(cont) :
  Chi-squared approximation may be incorrect

The only reason I bring this up is that, multiple times as a reviewer, I've spotted issues with reported tables and, after checking the numbers, found that an incorrect \(p\)-value was reported because of the assumption issue above. Recall that the \(\chi^2\) is a continuous distribution while the underlying structure of our table is discrete. With large enough cell sizes, this tends not to matter very much, but with small cell sizes, it can matter quite a bit, hence the warning from R and my comments about Fisher's exact test.

Distributions: Some important codes related to the \(\chi^2\)-distribution are:

  • pchisq: Distribution function for the \(\chi^2\)-distribution where the (main) inputs are q = \(\chi^2\)-value of interest, df = degrees of freedom, and lower.tail = TRUE to determine \(P(\chi^2 \le)\) or, if FALSE, \(P(\chi^2 >)\).

  • qchisq: Quantile function for the \(\chi^2\)-distribution where the (main) inputs are p = probability, df = degrees of freedom, and lower.tail = TRUE to determine the (critical) left or, if FALSE, the right-sided \(\chi^2\)-value.
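
For example, the p-value reported by chisq.test above can be recovered from its test-statistic and degrees of freedom:

pchisq(14.611, df = 2, lower.tail = FALSE)  # ~0.00067, matching the chi-square test above
qchisq(0.95, df = 2)                        # critical value (~5.99) at alpha = 0.05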


7.1.8 McNemar test

Although the following example continues with our theme of working with contingency tables, the current datasets aren't appropriate for the McNemar test, as there are no repeated/paired data within them, so let's create some.

cont2 <- matrix(c(25, 21, 6, 10),2,2)
label <- c('non-Smoker','Smoker')
dimnames(cont2) <- list(Pre=label,Post=label)

Similar to the \(\chi^2\)-test of association above, we can use the default mcnemar.test or mcnemar_test (from rstatix) and call them the exact same way:

mcnemar.test(cont2)   # Base option
mcnemar_test(cont2)   # rstatix version

    McNemar's Chi-squared test with continuity correction

data:  cont2
McNemar's chi-squared = 7.2593, df = 1, p-value = 0.007054

In this case, I used the base option mcnemar.test just to highlight the output message regarding the continuity correction. Similar to the \(\chi^2\)-test above, issues when using a continuous distribution to approximate a discrete random variable rear their head when the sample size \(n\) is small, hence the need for a continuity correction; however, when \(n\) is large, the differences become negligible.
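
Just to connect the correction to the arithmetic, a quick sketch reproducing the statistic by hand from the discordant cells of cont2 (the object names n12 and n21 are mine):

n12 <- cont2[1, 2]; n21 <- cont2[2, 1]  # the discordant cells (6 and 21)
(abs(n12 - n21) - 1)^2 / (n12 + n21)    # = 196/27 ~ 7.2593, the corrected statistic above
pchisq((abs(n12 - n21) - 1)^2 / (n12 + n21), df = 1, lower.tail = FALSE)  # ~0.00705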


7.1.9 Wilcoxon signed-rank test

Shifting to nonparametric statistics, we’ll now repeat the same parametric tests above using their nonparametric counterparts starting with the classic Wilcoxon signed-rank test for one-sample or paired testing.

Before we dive into our examples, let’s quickly look at the help output for wilcox.test:

wilcox.test(x, y = NULL,
            alternative = c("two.sided", "less", "greater"),
            mu = 0, paired = FALSE, exact = NULL, correct = TRUE,
            conf.int = FALSE, conf.level = 0.95,
            tol.root = 1e-4, digits.rank = Inf, ...)
wilcox.test(formula, data, subset, na.action = na.pass, ...)

As can be seen, the inputs are nearly identical to the t.test with the (main) exception of correct = TRUE, which relates to the same continuity issues highlighted previously. Now that we see the code, the following reproduces the nonparametric equivalents to the three one-sample examples and the paired t-test using the mice weight-loss example.


7.1.9.1 Example 1: One-sample Wilcox-test (\(\textrm{H}_a {:} \mu \neq 0\))

set.seed(1)
x <- rnorm(30,1,1)
wilcox.test(x)                    
# Comment: t-test, p-value = 5.126e-07

    Wilcoxon signed rank exact test

data:  x
V = 433, p-value = 5.145e-06
alternative hypothesis: true location is not equal to 0


7.1.9.2 Example 2: One-sample Wilcox-test (\(\textrm{H}_a {:} \mu \neq 1\))

wilcox.test(x,mu=1)               
# Comment: t-test, p-value = 0.6287

    Wilcoxon signed rank exact test

data:  x
V = 276, p-value = 0.3818
alternative hypothesis: true location is not equal to 1


7.1.9.3 Example 3: One-sample Wilcox-test (\(\textrm{H}_a {:} \mu > 1\))

wilcox.test(x,mu=1,alt='great')   
# Comment: t-test, p-value = 0.3144

    Wilcoxon signed rank exact test

data:  x
V = 276, p-value = 0.1909
alternative hypothesis: true location is greater than 1


7.1.9.4 Example 4: Paired mice weight-loss

wilcox.test(Post,Pre,paired=TRUE)   # Option 1
wilcox.test(d)                      # Option 2
# Comment: t-test, p-value = 6.2e-09

    Wilcoxon signed rank exact test

data:  Post and Pre
V = 0, p-value = 0.001953
alternative hypothesis: true location shift is not equal to 0

Not surprisingly, since the main assumption is that the distribution is symmetric and the data generated (at least for the first 3 examples) come from a normal distribution, (A) the assumption is met and (B) we get similar results, albeit the parametric versions are (generally speaking) more powerful.

Distributions: Some important codes related to the Wilcox test are:

  • psignrank: Distribution function for the distribution of the Wilcoxon Signed Rank Statistic where the (main) inputs are q = V-value of interest, n = number of observations, and lower.tail = TRUE to determine \(P(V \le)\) or, if FALSE, \(P(V >)\).

  • qsignrank: Quantile function for the distribution of the Wilcoxon Signed Rank Statistic where the (main) inputs are p = probability, n = number of observations, and lower.tail = TRUE to determine the (critical) left or, if FALSE, the right-sided V-value.
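
For example, the exact p-value from the paired mice example (V = 0 with n = 10 pairs) can be recovered with:

2 * psignrank(0, n = 10)  # = 2/2^10 ~ 0.001953, matching Example 4 above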


7.1.10 Mann-Whitney U test

The Mann-Whitney U (MWU) test is also known as the Wilcoxon rank-sum test, and this is relevant because R calls the MWU test using the function wilcox.test, i.e. the same function used above. There are historical components related to this that make it very annoying, but in short, the W-statistic from the Wilcoxon rank-sum test is equivalent to the U-statistic from the MWU test. The following are two references that discuss the "difference" \([\)5,6\(]\).

Returning to the hsb dataset used in the two-sample independent t-test, we’ll similarly invoke this test using both input options:

wilcox.test(math ~ sex,data)
# Comment: t-test, p-value = 0.6816 (unequal var) or 0.6801 (equal var)

    Wilcoxon rank sum test with continuity correction

data:  math by sex
W = 4825, p-value = 0.7422
alternative hypothesis: true location shift is not equal to 0

and the x and y input option:

wilcox.test(data$math[data$sex=='Male'],data$math[data$sex=='Female'],paired=FALSE)
# Comment: t-test, p-value = 0.6816 (unequal var) or 0.6801 (equal var)

    Wilcoxon rank sum test with continuity correction

data:  data$math[data$sex == "Male"] and data$math[data$sex == "Female"]
W = 5094, p-value = 0.7422
alternative hypothesis: true location shift is not equal to 0

Note that paired=FALSE is the default for wilcox.test since we have to distinguish whether x and y belong to the same individuals or are independent; however, paired isn't necessary for a formula, hence it does not appear in that version of the code.

Distributions: Some important codes related to the MWU test are:

  • pwilcox: Distribution function for the distribution of the Wilcoxon Rank Sum/MWU Statistic where the (main) inputs are q = W-value of interest, m, n = number of observations in the first and second sample, respectively, and lower.tail = TRUE to determine \(P(W \le)\) or, if FALSE, \(P(W >)\).

  • qwilcox: Quantile function for the distribution of the Wilcoxon Rank Sum/MWU Statistic where the (main) inputs are p = probability, m,n = number of observations in the first and second sample, respectively, and lower.tail = TRUE to determine the (critical) left or, if FALSE, the right-sided W-value.

WARNING: The continuity correction rears its head here (and technically in the Wilcoxon signed-rank test as well) when it comes to some of the code above. Consider the following data comparing a treatment (Tx) and control group that have been rank-transformed. If you go through the manual procedure for calculating the test-statistic, based on this \(m = 8\), \(n=7\) MWU test, the test-statistic is either \(U = 38\) or its complement \(U' = 18\):

Tx <- c(1,3,4,5,8,9,11,13)
Ct <- c(2,6,7,10,12,14,15)
wilcox.test(Tx,Ct)

    Wilcoxon rank sum exact test

data:  Tx and Ct
W = 18, p-value = 0.281
alternative hypothesis: true location shift is not equal to 0

However, depending on the reference used, to calculate a \(p\)-value (for a two-sided hypothesis test), we can do:

\[p\textrm{-value} = 2\cdot P(W \geq 38) \equiv 2\cdot P(W \leq 18)\] To do this in R, we can use the pwilcox code above and:

2*pwilcox(18,8,7)                   # Option 1
2*pwilcox(38,8,7,lower=FALSE)       # Option 2
[1] 0.2809635
[1] 0.231857

The difference is due to the continuity issue we’ve raised before. You can find the correction for the different probabilities of interest elsewhere; however, in this case it’s: \(P(W > w - 0.5)\), thus:

2*pwilcox(18,8,7)                   # Option 1: Same as before
2*pwilcox(38-0.5,8,7,lower=FALSE)   # Option 2: Corrected by -0.5
[1] 0.2809635
[1] 0.2809635


7.1.11 Kruskal-Wallis test

We'll now repeat the prior parametric ANOVA example using its nonparametric equivalent, the Kruskal-Wallis (KW) test. Not surprisingly, the KW test, invoked by kruskal.test (or the rstatix equivalent: kruskal_test), uses the same formula input as aov and anova:

kruskal.test(science ~ ses, data)
# Comment: ANOVA, p-value = 0.0002696

    Kruskal-Wallis rank sum test

data:  science by ses
Kruskal-Wallis chi-squared = 16.782, df = 2, p-value = 0.0002269

None of this is surprising given the prior results using the parametric ANOVA; however, as a reminder, the null hypotheses do differ, as the KW test does not test means (or their locations) directly but rather the more general question of whether the sampled data come from different distributions. With that said, if the shapes of the distributions are similar, then KW does allow testing for location and (if symmetric) we can test the mean(s).

Distributions: The probabilities from the exact \(H\)-distribution can be calculated (see Meyer and Seaman); however, like most programs (e.g. STATA), R approximates the \(H\)-distribution by a \(\chi^2\)-distribution with \((k-1)\) degrees of freedom (where \(k\) = number of groups). To my knowledge, there is no \(F\)-distribution approximation available in the base packages of R; however, PMCMRplus is a fairly new package that does offer the option to specify a distribution, which is useful given the next comments.

WARNING: Although assumptions for the KW test are minimal, it is worth reminding you that sample sizes can be an issue, particularly when it comes to determining the critical value or \(p\)-values. Per Meyer and Seaman, we should be critical of any approximations when the sample sizes are small, as there can be differences when compared to the exact probabilities of the \(H\)-distribution. To my knowledge, there is debate about the actual minimum sample size for KW; however, using Table 1 provided by Meyer and Seaman, we can see that for an \(H\)-stat = 8 (for \(k = 3\) and \(n_i = 5 \textrm{ } \forall \textrm{ } i \in 1, 2, 3\)), the exact \(p\)-value is 0.009459 while the \(\chi^2\)-approximation yields a \(p\)-value almost double at 0.01831453. As such, caution is urged when using any approximation method.


7.1.12 KW: Post-hoc comparisons

In this section, we'll introduce two post-hoc KW methods: the first is Nemenyi's test, which is analogous to Tukey's HSD, and the second is Dunn's test, which requires extra steps for post-hoc \(p\)-value adjustment that we'll briefly touch on momentarily.

To get Nemenyi's test, there (as always) are multiple versions out there and, previously, we introduced DunnettTest from the DescTools package, which also has the function NemenyiTest; however, there are some "problems" with this function. Per the requirements, although the KW test can be corrected for ties, Nemenyi's test cannot handle ties. Keeping that in mind, let's continue ahead, blissfully ignorant.

NemenyiTest(science~ses,data)

 Nemenyi's test of multiple comparisons for independent samples (tukey)  

            mean.rank.diff    pval    
Low-High         -45.82630 0.00016 ***
Medium-High      -25.87613 0.01997 *  
Medium-Low        19.95017 0.12949    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see, there is a difference between Low-High and Medium-High, but Medium-Low is really no different. There is nothing obviously wrong, but as mentioned above, ties are an issue, and if we do a table of the science scores:

table(data$science)

26 29 31 33 34 35 36 39 40 42 44 45 46 47 48 49 50 51 53 54 55 56 57 58 59 61 
 1  1  2  2  5  1  4 13  2 12 11  1  1 11  2  2 21  2 15  3 18  2  1 19  1 14 
63 64 65 66 67 69 72 74 
12  1  1  9  1  6  2  1 

Tidyverse: For those wanting to create a frequency table via the tidyverse, you can use the count function, and I'll make the results a little fancier by chaining in the mutate function to add in the proportions - there is no need for it, but it exposes you to more of the tidyverse:

data %>% count(science) %>% mutate(prop = prop.table(n))
# A tibble: 34 × 3
   science     n  prop
     <dbl> <int> <dbl>
 1      26     1 0.005
 2      29     1 0.005
 3      31     2 0.01 
 4      33     2 0.01 
 5      34     5 0.025
 6      35     1 0.005
 7      36     4 0.02 
 8      39    13 0.065
 9      40     2 0.01 
10      42    12 0.06 
# ℹ 24 more rows

Thus, we clearly have ties but are "none the wiser". The point of this is that you need to be aware of the assumptions and/or requirements, which may not always be obvious, to ensure the validity of your work. A "better" option is a fairly new package called PMCMRplus, which is a recent update to the original PMCMR package.

Repeating the process above, including using kruskalTest from PMCMRplus, we get the following output for our KW test:

library(PMCMRplus)
kruskalTest(science ~ factor(ses), data)

    Kruskal-Wallis test

data:  science by factor(ses)
chi-squared = 16.782, df = 2, p-value = 0.0002269

Warning message:
In kruskalTest.default(c(47, 63, 58, 53, 53, 63, 53, 39, 58, 50,  :
  Ties are present. Quantiles were corrected for ties.

Although the test-statistic and p-value are the same (default: dist = "Chisquare"), the warning below the results highlights the ties that we may not have been aware of. Note: I had to include the factor version of ses - for whatever reason, the PMCMRplus package wants the group variable in this format. Let's now change the distribution options:

kruskalTest(science ~ factor(ses), data, dist="KruskalWallis")

    Kruskal-Wallis test

data:  science by factor(ses)
H = 16.782, k = 3.000000, U = 0.049044, N = 200.000000,
p-value = 0.0001752

Warning message:
In kruskalTest.default(c(47, 63, 58, 53, 53, 63, 53, 39, 58, 50,  :
  Ties are present. Quantiles were corrected for ties.

In this case, we still get the same test-statistic (again) but the distribution change results in a different \(p\)-value. Note: The change in distribution results in the use of a "simplified" beta-approximation per Wallace, which differs from the previously mentioned exact probabilities generated by Meyer and Seaman.

Taking a look at our Nemenyi’s test results:

kwAllPairsNemenyiTest(science ~ factor(ses), data)

    Pairwise comparisons using Tukey-Kramer-Nemenyi all-pairs test with Tukey-Dist approximation

data: science by factor(ses)

       High    Low    
Low    0.00016 -      
Medium 0.01997 0.12949

P value adjustment method: single-step
alternative hypothesis: two.sided
Warning message:
In kwAllPairsNemenyiTest.default(c(47, 63, 58, 53, 53, 63, 53, 39,  :
  Ties are present, p-values are not corrected.

we can see the exact same results as NemenyiTest; however, this time we (again) get the appropriate warning, bringing into question (or at least urging caution with) our results. I'm not going to (nor have the time to) dive into the issue of ties; I simply want you to be aware of it. However, if you read more into the help file for kwAllPairsNemenyiTest, it does mention options for correcting for this by setting the dist option.

Alternatively, we can look at Dunn’s test if you’re not feeling Nemenyi’s test. Again, there are multiple versions, including dunn_test (from rstatix), DunnTest (from DescTools), and kwAllPairsDunnTest (from PMCMRplus) - you can probably guess which one I’m going to use for our example.

kwAllPairsDunnTest(science ~ factor(ses), data)

    Pairwise comparisons using Dunn's all-pairs test

data: science by factor(ses)

       High    Low    
Low    0.00016 -      
Medium 0.01432 0.05270

P value adjustment method: holm
alternative hypothesis: two.sided
Warning message:
In kwAllPairsDunnTest.default(c(47, 63, 58, 53, 53, 63, 53, 39,  :
  Ties are present. z-quantiles were corrected for ties.

Importantly, notice that although ties are present (and a warning issued), the z-quantiles, and thus the subsequent \(p\)-values, are corrected for the ties. The mention of z-quantiles should not be surprising since Dunn's test is more akin to our parametric methods but uses rank means instead of means. Methodological aspects aside, the important part of Dunn's test is the fact that it requires post-hoc p-value adjustment (unlike our Tukey and Nemenyi results, which used the studentized range distribution). The \(p\)-value adjustments, which are set by the p.adjust.method option, are based on the function p.adjust, which gives options to use multiple, well-established correction methods, including Holm, Bonferroni, and Benjamini-Hochberg.
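
As a minimal sketch of p.adjust itself (the raw p-values below are hypothetical, purely for illustration):

raw_p <- c(0.0001, 0.0048, 0.0262)      # hypothetical unadjusted p-values
p.adjust(raw_p, method = "holm")        # the method used by kwAllPairsDunnTest above
p.adjust(raw_p, method = "bonferroni")  # a more conservative alternative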


7.1.13 Friedman test

The Friedman test is an extension of our nonparametric ANOVA; however, it can be used in two capacities: repeated measures or blocking. The focus here is blocking to deal with potential confounders, as repeated measures requires a significantly larger amount of background information, particularly to address the covariance structure required to deal with the correlation of repeated measures.

For our example, since we need a complete block design, our current datasets are not appropriate; as such, let's first create a dataset that examines the annual power consumption for five brands of dehumidifiers (i.e. the factor) at different humidity levels (i.e. the block).

cr.data <- data.frame(kWh=c(685,792,838,875,
                            722,806,893,953,
                            733,802,880,941,
                            811,888,952,1005,
                            828,920,978,1023),
                      Brand=factor(rep(1:5,each=4)),
                      Humidity=factor(rep(1:4,times=5)))

Let’s run a parametric ANOVA on it first with only Brand:

cr.data %>% anova_test(kWh ~ Brand)
ANOVA Table (type II tests)

  Effect DFn DFd     F     p p<.05   ges
1  Brand   4  15 1.693 0.204       0.311

which shows us that Brand has no impact/relationship with power consumption; however, since humidity has obvious impacts on power consumption, let’s block it (which we do by just adding it to the formula).

cr.data %>% anova_test(kWh ~ Brand + Humidity)
ANOVA Table (type II tests)

    Effect DFn DFd       F        p p<.05   ges
1    Brand   4  12  95.567 5.42e-09     * 0.970
2 Humidity   3  12 278.199 2.36e-11     * 0.986

So, clearly there was a large impact/confounding due to humidity.

To perform the same type of analysis using the Friedman test, we can use the base R function friedman.test, but first let's start with the help output:

friedman.test(y, groups, blocks, ...)

friedman.test(formula, data, subset, na.action, ...)

Unfortunately, if you look at the first, non-formula option, there is no input for data - no clue why, but that means we have to tweak our code a little bit from the typical presentation:

with(cr.data,friedman.test(kWh, Brand, Humidity))

    Friedman rank sum test

data:  kWh, Brand and Humidity
Friedman chi-squared = 15.4, df = 4, p-value = 0.00394

Similar to the blocked ANOVA, we can see a significant relationship for Brand after blocking on Humidity. Note: If you ran a KW test on just Brand, you’d get a similar result as the initial one-way ANOVA and would erroneously conclude there is no association between Brand and power consumption because we failed to account for Humidity (\(p\)-value = 0.183).
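
If you want to verify that claim yourself, a minimal sketch (using base R’s kruskal.test, which ignores the blocking variable entirely) would be:

kruskal.test(kWh ~ Brand, data = cr.data)   # no blocking on Humidity; per the text, p is roughly 0.183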

With that said, we can still use the formula option but need to tweak the code just a little bit to specify the block variable. As mentioned above, the Friedman test can be used to analyze repeated measures; however, the unit we measure repeatedly is often not a factor we are interested in, which is exactly what a block (or nuisance) variable is. To tell R that we want a grouping factor, we invoke this using a vertical bar (|), which is the same vertical bar used for the logical OR and will show up in other packages (e.g. lme4) for other purposes:

friedman.test(kWh ~ Brand | Humidity, cr.data)

    Friedman rank sum test

data:  kWh and Brand and Humidity
Friedman chi-squared = 15.4, df = 4, p-value = 0.00394

Not surprisingly, there are also rstatix and PMCMRplus versions; however, similar to above, I’m going to mention the PMCMRplus version, which is called friedmanTest, so that I can also mention the corresponding post-hoc Conover test (there is also a Nemenyi version for the Friedman test but Conover is generally more sensitive).

I’ll leave it to you to explore friedmanTest, which (again) has extra features over the base version, including the ability to change the distribution approximation method; however, (to my knowledge) it does not appear you can use the previous grouping factor (|) with the formula option - no clue why. To perform our Conover test, we’ll use the frdAllPairsConoverTest function.

with(cr.data, frdAllPairsConoverTest(kWh, Brand, Humidity))

    Pairwise comparisons using Conover's all-pairs test for a two-way balanced complete block design

data: y, groups and blocks

  1       2       3       4      
2 2.6e-11 -       -       -      
3 5.7e-06 0.26589 -       -      
4 < 2e-16 5.7e-06 2.6e-11 -      
5 < 2e-16 3.3e-14 3.7e-14 0.00061

P value adjustment method: single-step

Similar to before, the p.adjust.method option will allow you to select the specific post-hoc adjustment method of choice.


Nonparametric ANOVA: N-way: Unlike the ANOVA above, dealing with more than a single confounder is substantially harder with our nonparametric counterparts. With that said, it is not impossible, as the rank transformations used in KW and the parallels with parametric methods via Dunn’s Test have resulted in new methods to deal with multiple confounders using nonparametric tests. One such method is the Aligned Rank Transform (ART) ANOVA, which was only proposed in 2011, and the package ARTool is one of the first R packages for this new methodology.
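
To give a flavour of what that looks like, here is a minimal sketch (assuming the ARTool package is installed; treating Humidity as a grouping term in the dehumidifier data is my choice for illustration, not something from the original example):

library(ARTool)
m.art <- art(kWh ~ Brand + (1|Humidity), data = cr.data)  # align-and-rank the response
anova(m.art)                                              # ANOVA on the aligned ranks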


7.1.14 Cochran-Mantel-Haenszel Test

Extending the chi-square test of association to the conversation of dealing with confounders, the Cochran–Mantel–Haenszel (CMH) test allows us to deal with confounding by performing a stratified analysis and summarizing the results as a weighted/adjusted estimate.

Similar to before, we don’t have a good dataset, so we’ll steal one from R, which natively has several datasets we can play with. Specifically, we’ll use the Titanic dataset, which is a pre-populated table that includes the Class (4-levels: 1st, 2nd, 3rd, Crew), Sex (2-levels: Male, Female), Age (2-levels: Child, Adult), and Survival (2-levels: No, Yes) of the passengers of the 1912 maritime disaster.

data(Titanic)
print(Titanic)
, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20

Although we can use this table as-is, the interpretation is a little funny, and Survival is the perfect outcome measure. If we want to look at the relationship between Survival and Sex, then we need to reformat this table and retabulate certain margins within these contingency tables to get our classic 2x2 table, which we can do with the function margin.table. To understand the margins we want (i.e. which dimensions we want to retabulate), let’s first look at the structure of the table:

str(Titanic)
 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
 - attr(*, "dimnames")=List of 4
  ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
  ..$ Sex     : chr [1:2] "Male" "Female"
  ..$ Age     : chr [1:2] "Child" "Adult"
  ..$ Survived: chr [1:2] "No" "Yes"

Notice that Survival is the 4\(^{\text{th}}\) entry in the list, so if we wanted to look at the contingency table for Survival and Sex, then we’d want margins 2 and 4:

margin.table(Titanic, margin=c(2,4))
        Survived
Sex        No  Yes
  Male   1364  367
  Female  126  344

It’s not hard for us to see that the likelihood of survival was drastically lower for males in this situation but let’s re-examine this table using the prop.table function I previously used (with the mutate and count functions from the tidyverse package) for fun in KW: Post-hoc comparisons:

margin.table(Titanic, c(2,4)) %>% prop.table(margin=1)  # margin = 1 applies it across the row
        Survived
Sex             No       Yes
  Male   0.7879838 0.2120162
  Female 0.2680851 0.7319149

This is a bit easier to interpret since the proportions (or rates) are more comparable and, talking like an epidemiologist, we can say that women were about 3.5 times more likely to survive than men.
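
If you want to see where that 3.5 comes from, a quick sketch (just dividing the two row proportions from the table above):

surv.prop <- prop.table(margin.table(Titanic, c(2,4)), margin=1)   # row proportions, as above
surv.prop["Female","Yes"] / surv.prop["Male","Yes"]                # 0.732/0.212 = ~3.45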

For the purposes of making things comparable when introducing the CMH function, I’ll also mention the corresponding OR = 10.15, which is our crude/unadjusted metric that we’ll use to compare against the stratified versions - this will make sense in a moment once we see the R outputs. To start the process, let’s first retabulate the Titanic dataset so that we look at the Survival x Sex association stratified by Age (item 3 in the list above):

strat.table <- margin.table(Titanic, c(2,4,3))
print(strat.table)
, , Age = Child

        Survived
Sex        No  Yes
  Male     35   29
  Female   17   28

, , Age = Adult

        Survived
Sex        No  Yes
  Male   1329  338
  Female  109  316

From here, we can see that the stratified ORs are 2 and 11.4, respectively, which indicates not only a different effect size for the Sex–Survival association across the Age strata but also suggests effect modification/an interaction effect by Age rather than confounding.

Extra: Continuing to use the OR as our metric for risk, in the introductory epidemiology class I teach, I highlight the fact that the OR does not follow a normal distribution but \(ln(\textrm{OR})\) does. Thus, if we want to (for example) calculate a 95% CI, we can’t follow the “typical” process; however, since we know \(ln(\textrm{OR}) \sim N(\mu,\sigma)\), we can still use the “typical” process by first determining \(SE(ln(\textrm{OR}))\), which is:

\[SE(ln(\textrm{OR})) = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\]

Thus, we can build a 95% CI for \(ln(\textrm{OR})\) and then convert it to the 95% CI for \(\textrm{OR}\) by applying the exponential function like so:

\[\textrm{95% CI for OR} = e^{ln(\textrm{OR}) \pm 1.96\cdot SE(ln(\textrm{OR}))} \]

To make things easier, let’s quickly build a function that takes a table and generates the corresponding OR and CI:

OR <- function(x){
  a <- x[1,1]
  b <- x[1,2]
  c <- x[2,1]
  d <- x[2,2]
  
  OR <- (a*d)/(b*c)
  est <- log(OR)
  SE  <- sqrt(1/a + 1/b + 1/c + 1/d)
  ci  <- c(OR,exp(c(est-1.96*SE,est+1.96*SE)))
  names(ci) <- c("OR","LL95","UL95")  
  return(ci)
}

Applying this to our original 2x2 table for our crude/unadjusted OR we get the following:

OR(margin.table(Titanic, margin=c(2,4)))
       OR      LL95      UL95 
10.146966  8.026762 12.827205 

We can clearly see that there is a significant association (as \(\textrm{H}_0 {:} \textrm{ OR} = 1\) is not contained within the CI). Applying it to our two-stratified contingency tables:

apply(strat.table,3,OR) # Margin = 3 applies to the 3rd dimension
      Age
           Child    Adult
  OR   1.9878296 11.39906
  LL95 0.9129861  8.89262
  UL95 4.3280688 14.61194

Importantly, we can see that in the adult stratum, sex was significantly associated with survival, while in the child stratum, sex was not (based on the 95% CI containing \(\textrm{H}_0 {:} \textrm{ OR} = 1\)); ergo, there is a difference in the effect of Sex on Survival depending on which Age stratum you are in. In the context of the dataset, that should make sense since the adage was “women and children first” (known as the Birkenhead drill) when abandoning (for example) a ship.

Now that we have stratified 2x2 tables, let’s apply the CMH test using the base R function, mantelhaen.test, which we can apply directly to our stratified tables.

Note: Per the help, an alternative to inputting x as a 3-dimensional contingency table is to input a factor object as our x, along with 2 other factor objects as a y and z input to create the same 3D table. There isn’t anything wrong with that, but the option described below is a little easier/cleaner.
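
If you are curious what that factor-based input looks like, a minimal sketch (expanding the table into one row per passenger first; the object name ind is just mine for illustration):

ind <- as.data.frame(Titanic)                      # one row per cell, with a Freq column
ind <- ind[rep(seq_len(nrow(ind)), ind$Freq), ]    # expand to one row per passenger
mantelhaen.test(ind$Sex, ind$Survived, ind$Age)    # x, y, and strata z as factors

which should reproduce the same result as applying the function to strat.table below.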

Applying our function to the previously created 3D contingency table strat.table:

mantelhaen.test(strat.table)

    Mantel-Haenszel chi-squared test with continuity correction

data:  strat.table
Mantel-Haenszel X-squared = 440.57, df = 1, p-value < 2.2e-16
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
  7.487267 11.992767
sample estimates:
common odds ratio 
          9.47592 

As we can see from the output, there is (still) a significant association between Sex and Survival once we control for Age. Overall, the effects do appear to differ a little, as the adjusted OR is now 9.48 when the crude was originally 10.15; however, the magnitude of “confounding” would be considered negligible by some (as the change is less than 10%), but that isn’t my concern here.

WARNING: This example is a clear warning against simply calculating the CMH estimates blindly without considering the full picture/doing your due diligence. If you just applied the CMH test to estimate the adjusted OR (9.48) and compared it to the crude OR (10.15), you could erroneously conclude that Age had no confounding effect and is not important. However, as highlighted by our stratified analysis, Age was not a confounder but an effect modifier, which means the adjusted OR is hiding a real relationship that is worth mentioning/studying.
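
If you want a more formal check than eyeballing the stratified ORs, one option (a sketch on my part, assuming the DescTools package is installed - this is not part of the original workflow) is a test of homogeneity of the odds ratios across strata, such as the Breslow-Day test:

library(DescTools)
BreslowDayTest(strat.table)   # H0: the Sex-Survival OR is the same in every Age stratum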


Epidemiology packages:
Given that the CMH test is quite useful for stratified analyses and is a common calculation in epidemiology, we’ll take this time to list some commonly used epidemiology-related packages and save ourselves from having to custom build the OR function I created above.

There are a variety of packages out there and an amazing list has been compiled by R under their CRAN Task View: Epidemiology page; however, the ones I’ve used before (or have heard of) include:

  • dataquieR: Data quality framework and tools to systematically check health data for issues regarding data integrity, completeness, consistency or accuracy.

  • epibasix: Contains elementary tools for analyzing common epidemiological problems, ranging from sample size estimation, through 2x2 contingency table analysis and basic measures of agreement (kappa, sensitivity/specificity). Appropriate print and summary statements are also written to facilitate interpretation wherever possible. The target audience includes advanced undergraduate and graduate students in epidemiology or biostatistics courses, and clinical researchers.

  • epiDisplay: Package for data exploration and result presentation.

  • epiR: Tools for the analysis of epidemiological and surveillance data. Contains functions for directly and indirectly adjusting measures of disease frequency, quantifying measures of association on the basis of single or multiple strata of count data presented in a contingency table, computation of confidence intervals around incidence risk and incidence rate estimates and sample size calculations for cross-sectional, case-control and cohort studies. Surveillance tools include functions to calculate an appropriate sample size for 1- and 2-stage representative freedom surveys, functions to estimate surveillance system sensitivity and functions to support scenario tree modeling analyses.

  • episensr: Basic sensitivity analysis of the observed relative risks adjusting for unmeasured confounding and misclassification of the exposure/outcome, or both. It follows the bias analysis methods and examples from the book by Lash, Fox, and Fink. “Applying Quantitative Bias Analysis to Epidemiologic Data” (Springer, 2009).

  • samplesizeCMH: Calculates the power and sample size for Cochran-Mantel-Haenszel tests. There are also several helper functions for working with probability, odds, relative risk, and odds ratio values.

With that said, although authors and those that maintain R do their best to ensure accuracy of packages, it is up to you to ensure that the package/functions mentioned above are appropriate for your needs, and (just as important) your data meets the requirements and assumptions to be valid for the methods proposed.


7.2 Advanced Methods

For the most part, everything covered under Basic Bivariate Tests consists of fairly standard methods that most analysts will use. The following sections are either advanced methods or, at the very least, less common in everyday practice.


7.2.1 Statistical process chart

Statistical process control (SPC) is the use of statistical methods to monitor and control some sort of process and typically falls under the umbrella term of a control chart (think quality control charts); an SPC chart, then, is a visual tool used to monitor and analyze some process over time.

My experience in this area is (admittedly) limited and I have only strayed into this area a handful of times. With that disclaimer out of the way, the SPC options I’ve come across are from external packages and include qcc and qicharts (and its replacement qicharts2); however, there is a fairly new one called spc but I haven’t explored it. I’m going to focus on the qcc package because it was written by Luca Scrucca, an amazing statistician and a contributor to the cmprsk package for estimating Competing Risk Models, which will be discussed later when Survival Analysis is introduced - he also wrote a great guide for clinicians on how to estimate competing risk models using R, as well as the guide Qcc: An R package for quality control charting and statistical process control, which we’ll mimic a bit below.

Let’s first start by calling the package and generating some data that’s health related. In this case, I’m going to generate random birthweight data for babies and the day they were “born”. To do this, we will:

  1. Generate 24 groups of differing sample sizes. To do this, we’ll use the function rpois, which is similar to the prior rnorm whereby we can generate random data from (in this case) a Poisson distribution.

  2. Create a set of days but replicate it based on the randomly generated sample size from step 1.

  3. Generate the birthweight for these babies, which we’ll assume is normally distributed: \(X \sim N(3400,400)\)

To ensure our graphs are comparable, we’ll again set.seed:

library(qcc)
set.seed(1)

ni <- rpois(24, 12) # create 24 groups where the avg ni = 12.
N  <- sum(ni)       # total sample size

DOB <- seq(as.Date('2025-1-1'), length.out = 24)  # first 24 days of 2025
DOB <- rep(DOB, ni)                               # reps DOB by ni
BW  <- round(rnorm(N, 3400, 400))                 # Birthweight

bw.data <- data.frame(BW,DOB)
head(bw.data)
    BW        DOB
1 3943 2025-01-01
2 3359 2025-01-01
3 3555 2025-01-01
4 3378 2025-01-01
5 2849 2025-01-01
6 3234 2025-01-01

Let’s first visualize the data - I’m going to use the basic plots I introduced, but feel free to use ggplot2 for more options - and I’m going to add in the mean value for each DOB since we’ll see multiple (i.e. \(n_i\)) birthweights for each date. To do this, I’m going to use the previously introduced tapply (see here for more details) so that I can calculate the mean birthweight for each date, and introduce a new function called lines - in short, this function is often used in conjunction with plot to add extra lines to an initial plot. Given the end goal is an SPC chart, we can also add in a central line, which is just the average of all the values. To do this, we could use lines again but, as we’ll see below, it needs an x and a y; with the function abline, however, we can add in a horizontal line at a specific value. Thus, the following produces a nice visualization of our data:

plot(BW ~ DOB, bw.data, cex=0.7)
with(bw.data,lines(unique(DOB),tapply(BW,DOB,mean),col="blue"))
abline(h=mean(bw.data$BW),col='red')

Note:

  • The option cex within plot allows me to customize the size of the dots. To learn more, you can search the help output for par, which sets the graphical parameters of all plots.

  • The function lines, like plot, needs an x and a y; however, if you calculate the average BW for each DOB, then you have a single summary value (i.e. \(\bar{x}\) of BW per date) while DOB is repeated in the data, hence the use of the function unique.

WARNING: I am assuming (rightfully) that order isn’t an issue but I don’t know if that is always true. If you really want to make sure the ordering of the DOBs won’t screw things up, keep in mind that the output of tapply actually carries the DOB as the name of each average weight. Thus, a “safe” option could be:

plot(BW ~ DOB, bw.data, cex=0.7)
bw.mean <- with(bw.data,tapply(BW,DOB,mean))
lines(as.Date(names(bw.mean)),bw.mean,col='blue')
abline(h=mean(bw.data$BW),col='red')

The plot is obviously a little busy since we have repeated measures; however, we effectively have our first SPC chart in the form of a process mean chart (i.e. xchart from the qc function in STATA). Obviously, there is a better way, so let’s use our actual qcc functions. If we look at the help output for qcc, you’ll see there are a lot of options:

qcc(data, type, sizes, center, std.dev, limits,
    data.name, labels, newdata, newsizes, newdata.name,
    newlabels, nsigmas = 3, confidence.level,
    rules = shewhart.rules, plot = TRUE, ...)

with the most important being data, which requires a data.frame, matrix, or vector where each row corresponds to a level of the grouping variable, which here is DOB. This is not necessarily problematic, but annoying because it is essentially forcing us to go from the long format to the wide format. To reformat our data so that all of the values of BW are grouped with DOB on a single row, we could use the tidyverse functions (see here for more details); however, the function qcc.groups was purpose-built for this, thus:

bw.data.refrmt <- with(bw.data, qcc.groups(BW, DOB))
head(bw.data.refrmt)[,1:13]
           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
2025-01-01 3943 3359 3555 3378 2849 3234 3242 3376 3840    NA    NA    NA    NA
2025-01-02 3705 3334 3299 3679 3623 3124 3117 3546 3707  3355  3752  3559  3155
2025-01-03 4192 3253 2982 3628 3346 4361 3384 3676 3411  3103  3476  2678  3986
2025-01-04 3116 3644 3026 2899 3517 3223 3400 3430 3164  3173  3346  3871  2791
2025-01-05 3638 3533 3825 3278 3548 3507 3183 3883 3864  3680  4035  3623  2889
2025-01-06 3152 3417 3036 3463 3138 4107 3687 3764 3554  4073  3146  3215  3973

I limited the width of the head to 13 columns to save space; however, as the group sizes (\(n_i\)) across the DOBs range from 5 to 17, we shouldn’t be surprised by the NA for columns 10 through 17 in the first row (i.e. DOB = January 1, 2025). Comparing this to the head(bw.data) print out above, we can see that the first row (which has 9 measures) corresponds to the first 9 rows of bw.data (i.e. the birth weight measures among those born on January 1, 2025).

Now that we have the formatting done, let’s run through the process mean chart (default chart for qcc), as we did above - I’ll leave it to you to explore or reference the aforementioned guide by Luca Scrucca for the other options:

qcc(bw.data.refrmt, type='xbar')   # Default figure

Note: You will see an additional output that is reminiscent of the output when we’ve used the str function. I have no clue why this happens as a default; however, if you are annoyed with it (as I am), you can run the following code to bypass the output:

qplot1 <- qcc(bw.data.refrmt, type='xbar')   # Default figure
plot(qplot1)

Before we proceed, the data looks “fine” and nothing is out of the ordinary, which is a bit boring, so let’s do a quick tweak to the data where we make the birthweights on the last 2 days a bit bigger than originally generated:

bw.data2 <- bw.data
swap  <- bw.data2$DOB %in% c('2025-01-23','2025-01-24')
bw.data2$BW[swap] <- bw.data2$BW[swap] + 500
bw.data2.refrmt <- with(bw.data2, qcc.groups(BW, DOB))
qcc(bw.data2.refrmt, type='xbar')

As we can see, the last two days are now being flagged in our SPC chart and, per the stats underneath the plot, Number beyond limits = 2.
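
If you want to pull those flagged groups out programmatically rather than reading them off the chart, a minimal sketch (assuming, per my read of the help, that the qcc object stores them in its violations component):

qplot2 <- qcc(bw.data2.refrmt, type='xbar', plot=FALSE)   # suppress the figure
qplot2$violations                                         # groups beyond the control limits (and violating runs)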


Extra: Quickly, we can also do a similar plot using the qicharts (and newer qicharts2) package. This is where the qicharts package has an advantage over qcc, since we do not have to reformat the data and can simply specify the grouping variable. In this case, we can do:

library(qicharts)
qic(y = BW,
    x = DOB,         # Grouping variable
    chart = 'xbar',  # Default: Run chart (NOT a control)
    data = bw.data)

For our “modified” data, we can see the same flagged DOBs; however, they aren’t as obvious as the SPC from qcc:

library(qicharts)
qic(y = BW,
    x = DOB,         # Grouping variable
    chart = 'xbar',  # Default: Run chart (NOT a control)
    data = bw.data2)


7.2.2 Randomization schedules

Randomization schedules are lists or tables that dictate how participants are assigned to different treatment groups, with the assignment done in a random way to reduce potential biases, including investigator bias. Given it is essentially generating a random list, we could build our own functions using sample and other random number generators native to R, referencing articles like Randomization in clinical studies by Lim and In; however, similar to the epidemiology packages mentioned above, there are a variety of packages out there for generating and managing randomization schedules in the CRAN Task View: Clinical Trial Design, Monitoring, and Analysis, including

  • Minirand

  • blockrand

  • bfha

  • bnha

  • randomizeR

Alternatively, there is an amazing R Shiny (a framework for building interactive web applications using R) app called Random table for blind clinical trials created by A. Baluja that is available on GitHub for both usage and modification (see here for more details). If you are looking for a simple randomization (i.e. no blocks, restrictions, etc) then the R Shiny app does a great job; however, if you are looking for more complex scheduling, then the randomizeR package is one of the more comprehensive options, as outlined by their article.

For now, I’m going to focus on the common block-randomization using blockrand. There is a really well written example by Peter Higgins in his Reproducible Medical Research with R, so I’m going to mimic his write up with a simpler example.

You want to randomize Ulcerative Colitis (UC) patients to either a control (A) or treatment (B) arm. Sex differences exist among older patients, and changes in hormone levels during menstruation or pregnancy can sometimes worsen symptoms, so you want to stratify by sex. For simplicity, a standard block design will be implemented (as opposed to a permuted variant) using a block size of 4. The possible assignments within a block could be:

  1. AABB

  2. ABAB

  3. ABBA

  4. BAAB

  5. BABA

  6. BBAA

Allowing for dropouts and imbalanced enrollment, to ensure sufficient power you estimate the need for 24 assignments in each of the 2 strata, for a total of 48 participants. The following code is for the female stratum, and so we’ll need to repeat this for the male stratum by adjusting the code accordingly (see the sketch after the output below).

# Comment: The option block.sizes is a little weird, as it's a multiplier and so in our example, 
# since we want to have each block involving 4 participants, then since there are 2 treatment 
# levels, block.sizes = 2.

library(blockrand)
blk.id.f <- blockrand(n = 24,
                      num.levels = 2,          # A: Ctl vs. B: Tx
                      levels = c("A", "B"),    # arm names
                      stratum = "Female",      # stratum name
                      block.sizes = 2)         # see comment above
print(blk.id.f)
   id stratum block.id block.size treatment
1   1  Female        1          4         B
2   2  Female        1          4         A
3   3  Female        1          4         A
4   4  Female        1          4         B
5   5  Female        2          4         B
6   6  Female        2          4         B
7   7  Female        2          4         A
8   8  Female        2          4         A
9   9  Female        3          4         B
10 10  Female        3          4         A
11 11  Female        3          4         B
12 12  Female        3          4         A
13 13  Female        4          4         A
14 14  Female        4          4         A
15 15  Female        4          4         B
16 16  Female        4          4         B
17 17  Female        5          4         A
18 18  Female        5          4         B
19 19  Female        5          4         A
20 20  Female        5          4         B
21 21  Female        6          4         A
22 22  Female        6          4         A
23 23  Female        6          4         B
24 24  Female        6          4         B

Note: Per the comment above and the help output for blockrand, the final block sizes will actually be the product of num.levels and block.sizes (e.g. if there are 2 levels and the block sizes used are 1, 2, 3, and 4 (i.e. a permuted block design), then the actual block sizes will be randomly chosen from the set 2, 4, 6, or 8).

Notice that I previously listed out the 6 possible assignments within a block, and if we look at our output above, block.id = 1 corresponds to assignment option #4, block.id = 2 = #6, block.id = 3 = #5, block.id = 4 = #1, block.id = 5 = #2, and block.id = 6 = #1. As this is based on random assignment, if you attempt to reproduce this example, your treatments listed for each block (and their order) will differ.
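
For completeness, a sketch of the male stratum and of stacking the two strata into a single schedule (the object names blk.id.m and blk.id are mine):

blk.id.m <- blockrand(n = 24,
                      num.levels = 2,          # A: Ctl vs. B: Tx
                      levels = c("A", "B"),    # arm names
                      stratum = "Male",        # stratum name
                      block.sizes = 2)         # same multiplier as before

blk.id <- rbind(blk.id.f, blk.id.m)            # full 48-row randomization schedule

Note that the id column restarts at 1 within each stratum, so you may want to create a single running identifier before handing the schedule off.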

I’ll leave it to you to explore the example by Peter Higgins mentioned above; however, if you don’t, I would recommend looking at the part of his post that explains how to use R to print out the randomized block assignments using the plotblockrand function, which creates a PDF designed so that the top text (the assignment) can be folded over to protect against anyone trying to peek through (for example) the envelope to guess the assignment.


7.2.3 Generalized linear models

Regression models show how (or whether) changes in independent variables (i.e. \(x_i\)’s) are associated with changes in the dependent variable, with most people associating regression with linear models. We already gave a preview of linear models in ANOVA: One-Way, when we used the aov function (which is itself a wrapper around lm); up to now, however, the lm function has not needed to be called directly. Regardless, per the help output, the important inputs are formula and data:

lm(formula, data, subset, weights, na.action,
   method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
   singular.ok = TRUE, contrasts = NULL, offset, ...)

and similar to what we saw in ANOVA: N-Way, we can easily extend the simple linear model to any multivariable linear model by adding each independent variable \(x_i\).

Generalized linear models (GLMs) are simply an extension of this, and the associated function, glm, by default sets the family (i.e. the error distribution and its link function) to the Normal distribution (i.e. family = gaussian):

glm(formula, family = gaussian, data, weights, subset,
    na.action, start = NULL, etastart, mustart, offset,
    control = list(...), model = TRUE, method = "glm.fit",
    x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)

Let’s quickly go over the basics of linear models using the hsb dataset, and for starters we’ll do a quick plot of science vs. read:

with(data,plot(read,science))

The positive, roughly linear relationship is obvious; however, how much an increase in read scores impacts science scores (i.e. the effect size) is not so obvious. To figure out the slope (i.e. \(\beta_1\)) of our simple model:

\[\textrm{science} = \beta_0 + \beta_1 \cdot \textrm{read}\]

We’ll use the lm function, along with the summary function:

lm.mod <- lm(science ~ read, data)
summary(lm.mod)

Call:
lm(formula = science ~ read, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.7993  -5.3079   0.2474   4.4644  26.3752 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 20.06696    2.83600   7.076 2.51e-11 ***
read         0.60852    0.05329  11.420  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.707 on 198 degrees of freedom
Multiple R-squared:  0.3971,    Adjusted R-squared:  0.3941 
F-statistic: 130.4 on 1 and 198 DF,  p-value: < 2.2e-16

Thus, we can update our simple model with our coefficients:

\[\textrm{science} = \textrm{20.067} + \textrm{0.609} \cdot \textrm{read}\]
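
In other words, getting a predicted science score for a given read score is just a plug-and-chug exercise, which we can confirm with predict (a quick sketch; the read score of 60 is arbitrary):

predict(lm.mod, newdata = data.frame(read = 60))   # 20.067 + 0.609*60 = ~56.6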

Now that we know the model, let’s update our prior figure to include the line using the previously introduced abline:

with(data,plot(read,science))
abline(lm.mod)

Extra: We established above that there is a positive association, also often referred to as a positive correlation. If we want to, we can quantify the correlation using the functions cor or cor.test, the latter of which gives us the same estimate of the (Pearson) correlation coefficient, but also formally tests it like so:

with(data,cor.test(read, science))

    Pearson's product-moment correlation

data:  read and science
t = 11.42, df = 198, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5384970 0.7070798
sample estimates:
      cor 
0.6301579 

Not surprisingly, we can see that there is a significant association (based on the \(p\)-value), and the corresponding test statistics here and in the summary of our linear model are both \(t\)-value = 11.42 (as they should be).

Notice:

  1. The Multiple R-square (aka: \(R^2\) = 0.397) is the square of the correlation coefficient we just calculated (\(r\) = 0.63) under a simple linear model.

  2. The F-statistic from the ANOVA table for our model (\(F\) = 130.412) is nothing more than the square of the \(t\)-value = 11.42 associated with our correlation coefficient and/or \(\beta_1\)-coefficient under a simple linear model (both facts are quick to verify, as sketched below).
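
A quick sketch of both checks (just arithmetic on values we already have):

cor(data$read, data$science)^2   # 0.6301579^2 = ~0.397, the Multiple R-squared
11.41981^2                       # ~130.4, the F-statistic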

Extra: To extract the relevant data, we can use the structure of the summary output

summary(lm.mod)$coefficients
              Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 20.0669641  2.8360034  7.07579 2.511438e-11
read         0.6085207  0.0532864 11.41981 1.565782e-23

Similar to how you can manipulate and extract certain pieces of information from a data.frame, we can use the $ symbol and the name of the column that contains that piece of information. If we want to go further and (for example) get the test-statistic for our \(\beta\)-coefficients, we can manipulate a data.frame by using the hard brackets [ and ] to isolate specific data based on their coordinates. Thus, if I were interested in the test-statistic for our estimate of \(\beta_1\), I would look for position [2,3]:

summary(lm.mod)$coefficients[2,3]
[1] 11.41981

Note that the first “column” and “row” you see in the print out just contain the names (i.e. the dimnames) and so they don’t count when indexing. To extract other pieces of information, you’ll need to know the names of the other “variables”, which you can get by using the names function:

names(summary(lm.mod))
 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 

Now that we’ve gone over the lm function, let’s reproduce this using the glm function:

glm.mod <- glm(science ~ read, family='gaussian',data)
summary(glm.mod)

Call:
glm(formula = science ~ read, family = "gaussian", data = data)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 20.06696    2.83600   7.076 2.51e-11 ***
read         0.60852    0.05329  11.420  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 59.39946)

    Null deviance: 19507  on 199  degrees of freedom
Residual deviance: 11761  on 198  degrees of freedom
AIC: 1388.4

Number of Fisher Scoring iterations: 2


I’m going to reuse some of the code above to make this a little prettier and isolate the coefficients:

tab <- data.frame(t(summary(lm.mod)$coefficients), t(summary(glm.mod)$coefficients))
tab <- round(tab[1:3,],4)
names(tab) <- paste(rep(c("LM","GLM"),each=2),c("beta0","beta1"),sep=".")
print(tab)
           LM.beta0 LM.beta1 GLM.beta0 GLM.beta1
Estimate    20.0670   0.6085   20.0670    0.6085
Std. Error   2.8360   0.0533    2.8360    0.0533
t value      7.0758  11.4198    7.0758   11.4198

Note: Breaking down the code:

  • Since the coefficient table from summary behaves like a data.frame, by using the t function I transposed it so that the important characteristics (i.e. coefficient, SE, and test-stat) became the rows and \(\beta_0\) and \(\beta_1\) became the columns.

  • I used rep to replicate the labels LM and GLM twice using the option each, and

  • Using the paste function, I can concatenate everything together to create the header of the new data.frame printed out.


I’ve excluded the \(p\)-value since it’s something we can calculate on our own but, more importantly, this is just to show that the coefficients for both glm and lm are the same when we use family = 'gaussian' for the glm function. With that said, let’s move on to more interesting examples and introduce logistic and Poisson regression models, which are much more common, particularly in health settings/analytics.


7.2.3.1 GLM: Logistic Regression

To be added.


7.2.3.2 GLM: Poisson Regression

To be added.