Lesson 2.3: Descriptive statistics and summarizing by group

We know that bull trout prefer cold, clean water. Temperatures above 14 degrees C and dissolved oxygen levels below 6.2 mg/L tend to be bad for bull trout, though these are only general approximations: how healthy stream water is for bull trout depends on many factors.

We’ll start our analysis of water quality by using the statistical summary commands that will be familiar from Module 1. First, we’ll calculate the mean (or average) water temperature.

# Load packages 
if(!require("tidyverse")) install.packages("tidyverse")
library(tidyverse)

# if running in google colab, uncomment and use the following line:
# streams <- read.csv("https://raw.githubusercontent.com/rachtorr/IndigenousEnvDataSci.github.io/refs/heads/main/MOD2/streams_data.csv")

# read in stream temp and DO combined data 
streams <- read.csv("streams_data.csv")
# calculate the mean water temperature
mean(streams$temperature_C)
<NA>

Dealing with NAs

Ah! Our code didn’t work. R gave us an <NA> instead of a number. Unexpected output, and even error messages, are a normal part of programming. Although they might seem surprising or frustrating, you can think of them as a challenge to figure out.

Here, R is telling you that the dataset has missing data, which R indicates with NA, meaning “not available”. As you may have already noticed when looking at the “streams” dataset, there are, indeed, some NA values.

It is very common to have missing data in datasets. Sometimes the scientists who entered the data typed in NA themselves (as in our case), but R will also put an NA wherever a numeric value was left blank when the data was collected.

R outputs <NA> because the command mean() needs an actual number in every cell it is averaging. An NA stands for an unknown value, so R is telling you that the mean of a set of numbers that includes an unknown value is also unknown.
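To see both sources of NA in miniature, here is a small sketch. The three-row CSV text below is made up for illustration (it is not the streams file): one temperature is typed as NA and one is left blank.

```r
# A tiny, made-up CSV (not the real streams file): one temperature is
# written as NA and one is left blank.
csv_text <- "year,site,temperature_C
2007,StreamA,13.2
2008,StreamB,NA
2009,StreamB,"

toy <- read.csv(text = csv_text)

# Both the typed NA and the blank cell are read in as NA
is.na(toy$temperature_C)

# Because some of the values are unknown, the mean is unknown too
mean(toy$temperature_C)
```

The first command marks the second and third temperatures as missing, and the second returns NA, just as it did for the full streams data.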

Next, we’ll go over an easy way to fix this problem in our “streams” dataset by telling R to ignore the NAs.

The command na.omit() drops any row that has an NA in it. The code below doesn’t change the streams dataset; it just prints streams without the rows that contain NAs.

na.omit(streams)
A data.frame: 27 × 4
      year   site     temperature_C  dissolved_oxygen
      <int>  <chr>    <dbl>          <dbl>
1     2007   StreamA  13.20          7.18
3     2008   StreamA  12.50          7.27
5     2009   StreamA  13.90          7.46
7     2010   StreamA  11.36          7.00
9     2011   StreamA  12.97          6.53
11    2012   StreamA  13.10          7.50
12    2012   StreamB  11.10          9.50
13    2013   StreamA  14.47          6.16
14    2013   StreamB  11.42          8.78
15    2014   StreamA  13.37          7.05
16    2014   StreamB  11.37          8.35
17    2015   StreamA  12.48          7.34
18    2015   StreamB  12.18          8.44
19    2016   StreamA  12.63          6.15
20    2016   StreamB  12.03          8.23
21    2017   StreamA  13.16          6.89
22    2017   StreamB  11.96          8.28
23    2018   StreamA  12.94          6.03
24    2018   StreamB  12.56          7.83
25    2019   StreamA  14.10          7.26
26    2019   StreamB  13.00          7.01
27    2020   StreamA  12.07          7.53
28    2020   StreamB  12.87          7.30
29    2021   StreamA  13.43          7.02
30    2021   StreamB  13.25          7.42
31    2022   StreamA  13.28          6.32
32    2022   StreamB  13.78          6.63
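A quick way to confirm that na.omit() leaves the original data untouched is to compare row counts. The sketch below retypes the first six rows of streams by hand so that it runs on its own; the names first_six and complete_rows are ours, chosen for illustration.

```r
# First six rows of streams, typed in by hand so this runs on its own
first_six <- data.frame(
  year = c(2007, 2007, 2008, 2008, 2009, 2009),
  site = c("StreamA", "StreamB", "StreamA", "StreamB", "StreamA", "StreamB"),
  temperature_C = c(13.20, 10.20, 12.50, NA, 13.90, NA),
  dissolved_oxygen = c(7.18, NA, 7.27, 8.07, 7.46, 7.26)
)

# Save the NA-free version under a new name
complete_rows <- na.omit(first_six)

nrow(first_six)      # still 6: the original is unchanged
nrow(complete_rows)  # 3: only rows with no NA in any column remain
```

Note that a row is dropped if any of its columns is NA, which is why the row numbers in the printed output skip some values.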

Great! Now let’s try to run our command to get the mean for temperature again. We don’t want to remove these rows from the streams data permanently, so we’ll nest the na.omit() command within the mean() command, allowing us to calculate the mean of all the temperatures for both streams.

mean(na.omit(streams$temperature_C))
12.6327586206897

Our code worked! Now you’ve learned how to handle NAs in a dataset in R. Let’s carry on with our summary statistics.
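Many of R's summary functions, including mean(), median(), min() and max(), also accept an na.rm argument that skips NAs without a separate na.omit() call. Here is a minimal sketch using a hand-typed vector of the first six temperature readings (the name temps is ours, for illustration):

```r
# First six temperature readings from streams, typed in by hand
temps <- c(13.20, 10.20, 12.50, NA, 13.90, NA)

# Two ways to ignore the NAs; both give the same answer
mean(na.omit(temps))
mean(temps, na.rm = TRUE)
```

The na.rm form appears inside summarize() later in this lesson.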

💻 Your turn! How would you figure out what the mean amount of dissolved oxygen is in the water across both streams?

# write your code here to find mean DO 
streams
A data.frame: 32 × 4
year   site     temperature_C  dissolved_oxygen
<int>  <chr>    <dbl>          <dbl>
2007   StreamA  13.20          7.18
2007   StreamB  10.20          NA
2008   StreamA  12.50          7.27
2008   StreamB  NA             8.07
2009   StreamA  13.90          7.46
2009   StreamB  NA             7.26
2010   StreamA  11.36          7.00
2010   StreamB  11.67          NA
2011   StreamA  12.97          6.53
2011   StreamB  NA             NA
2012   StreamA  13.10          7.50
2012   StreamB  11.10          9.50
2013   StreamA  14.47          6.16
2013   StreamB  11.42          8.78
2014   StreamA  13.37          7.05
2014   StreamB  11.37          8.35
2015   StreamA  12.48          7.34
2015   StreamB  12.18          8.44
2016   StreamA  12.63          6.15
2016   StreamB  12.03          8.23
2017   StreamA  13.16          6.89
2017   StreamB  11.96          8.28
2018   StreamA  12.94          6.03
2018   StreamB  12.56          7.83
2019   StreamA  14.10          7.26
2019   StreamB  13.00          7.01
2020   StreamA  12.07          7.53
2020   StreamB  12.87          7.30
2021   StreamA  13.43          7.02
2021   StreamB  13.25          7.42
2022   StreamA  13.28          6.32
2022   StreamB  13.78          6.63

Now let’s use another familiar command to show the median temperature measured across both streams and all years.

median(na.omit(streams$temperature_C))
12.87

🧠✍️Class Question:

  • What is the median amount of dissolved oxygen across all samples?

# write your code here to find median DO 

Summarizing by group

We previously used summary() to get the descriptive statistics of a column or data frame, but this groups all the rows together. Since our goal is to compare the water quality of Stream A and Stream B, we want to get descriptive statistics for them separately. We can do this using group_by() and summarize(). We explain the code and the commands we will use in the code chunk below.

You can read more about the summarize() function in the documentation for the dplyr package.

streams %>%
    # group_by() takes the column that holds our categorical group, site
    group_by(site) %>%
    # summarize() applies any function we choose to a column; here we compute
    # the same statistics that summary() reports, for temperature_C
    summarize(min_temp = min(temperature_C, na.rm = TRUE),
              first_q_temp = quantile(temperature_C, 0.25, na.rm = TRUE),
              median_temp = median(temperature_C, na.rm = TRUE),
              mean_temp = mean(temperature_C, na.rm = TRUE),
              third_q_temp = quantile(temperature_C, 0.75, na.rm = TRUE),
              max_temp = max(temperature_C, na.rm = TRUE))
                 
A tibble: 2 × 7
site     min_temp  first_q_temp  median_temp  mean_temp  third_q_temp  max_temp
<chr>    <dbl>     <dbl>         <dbl>        <dbl>      <dbl>         <dbl>
StreamA  11.36     12.5975       13.13        13.06000   13.385        14.47
StreamB  10.20     11.4200       12.03        12.10692   12.870        13.78
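Because NAs are dropped before each statistic is computed, the two streams' summaries may rest on different numbers of readings. One way to check this is to count the rows and the non-missing values per site. The sketch below does so on a hand-typed copy of the first six rows of streams; the names first_six, counts, n_rows and n_temp are ours, for illustration.

```r
library(dplyr)

# First six site and temperature values from streams, typed in by hand
first_six <- data.frame(
    site = c("StreamA", "StreamB", "StreamA", "StreamB", "StreamA", "StreamB"),
    temperature_C = c(13.20, 10.20, 12.50, NA, 13.90, NA)
)

counts <- first_six %>%
    group_by(site) %>%
    summarize(n_rows = n(),                         # rows per site
              n_temp = sum(!is.na(temperature_C)))  # non-missing temperatures

counts
```

In this small sample both sites have three rows, but StreamB has only one usable temperature reading, so its statistics would be much less reliable than StreamA's.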

💻 Your turn! You can add columns to show descriptive statistics of dissolved oxygen, or redo the above code with dissolved_oxygen instead of temperature_C inside the descriptive statistic functions.

# get the descriptive statistics of dissolved oxygen 

Preliminary analysis of water quality data from Stream A and Stream B

In the next step of this module, we will explore the water quality data in detail and use that exploration to help decide which stream is more promising for bull trout reintroduction. As a reminder, temperatures above 14 degrees C and dissolved oxygen levels below 6.2 mg/L tend to be bad for bull trout.

For now, let’s use the information in the summary tables above to get a preliminary sense of which stream might have better water quality for bull trout.

🧠✍️Class Questions

  • From what we’ve learned about the streams so far, which one do you think would be better to re-introduce bull trout into?

  • Which of the summary statistics do you think is most useful?

  • Is there any other information you’d like to know in order to make the best recommendation to the tribe?

Recap Lesson 2.3

In this section, we have reviewed:

  • how to remove NAs

  • how to run summary statistics by group, using group_by() and summarize()