Lesson 2.3: Descriptive statistics and summarizing by group

We know that bull trout prefer cold, clean water. Temperatures above 14 degrees C and dissolved oxygen levels below 6.2 mg/L tend to be bad for bull trout, though these are only general approximations, and how suitable a stream is for bull trout depends on many factors.

We’ll start our analysis of water quality by using the statistical summary commands that will be familiar from Module 1. First, we’ll calculate the mean (or average) water temperature.

# Load packages 
if(!require("tidyverse")) install.packages("tidyverse")
library(tidyverse)

# if running in google colab, uncomment and use the following line:
# streams <- read.csv("https://raw.githubusercontent.com/rachtorr/IndigenousEnvDataSci.github.io/refs/heads/main/MOD2/streams_data.csv")

# read in stream temp and DO combined data 
streams <- read.csv("streams_data.csv")

mean(streams$temperature_C)

<NA>

Dealing with NAs

Ah! Our code didn’t work. R gave us an <NA> instead of a number. Unexpected output and error messages are a normal part of programming. They might seem surprising or frustrating, but you can treat them as a puzzle to solve.

Here, R is telling you that the dataset has missing data. R marks a missing value with NA, which means “not available”. You may have already noticed some NA values when looking at the “streams” dataset.

Missing data is very common in datasets. Sometimes the people who entered the data typed NA themselves (as in our case), but when R reads in a numeric column it will also fill in NA for any cell that was left blank when the data was collected.

The command mean() returns <NA> because it expects an actual number in every cell it averages. If even one value is unknown, R cannot know the true average, so it reports NA rather than guessing.
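To see how a single missing value affects a calculation, here is a minimal sketch. The vector temps is made up for illustration and is not taken from streams_data.csv.

```r
# Toy vector with one missing value (hypothetical numbers)
temps <- c(12.5, NA, 13.1)

# A known number plus an unknown number is still unknown
12.5 + NA

# sum() and mean() pass that uncertainty along, so both return NA
sum(temps)
mean(temps)

# is.na() flags which entries are missing (TRUE marks an NA)
is.na(temps)

# Counting the TRUEs tells us how many values are missing
n_missing <- sum(is.na(temps))
n_missing
```

Running sum(is.na(...)) on a real column is a quick way to check how much data is missing before you decide how to handle it.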

In the next section, we’ll go over some easy ways to fix this problem in our “streams” data set by telling R to ignore the NAs.

The command na.omit() drops any row that has an NA in it. The code below doesn’t change the streams dataset; it just prints a copy of streams without the rows that have NAs.

na.omit(streams)

A data.frame: 27 × 4
	year	site	temperature_C	dissolved_oxygen
	<int>	<chr>	<dbl>	<dbl>
1	2007	StreamA	13.20	7.18
3	2008	StreamA	12.50	7.27
5	2009	StreamA	13.90	7.46
7	2010	StreamA	11.36	7.00
9	2011	StreamA	12.97	6.53
11	2012	StreamA	13.10	7.50
12	2012	StreamB	11.10	9.50
13	2013	StreamA	14.47	6.16
14	2013	StreamB	11.42	8.78
15	2014	StreamA	13.37	7.05
16	2014	StreamB	11.37	8.35
17	2015	StreamA	12.48	7.34
18	2015	StreamB	12.18	8.44
19	2016	StreamA	12.63	6.15
20	2016	StreamB	12.03	8.23
21	2017	StreamA	13.16	6.89
22	2017	StreamB	11.96	8.28
23	2018	StreamA	12.94	6.03
24	2018	StreamB	12.56	7.83
25	2019	StreamA	14.10	7.26
26	2019	StreamB	13.00	7.01
27	2020	StreamA	12.07	7.53
28	2020	StreamB	12.87	7.30
29	2021	StreamA	13.43	7.02
30	2021	StreamB	13.25	7.42
31	2022	StreamA	13.28	6.32
32	2022	StreamB	13.78	6.63
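The point that na.omit() leaves the original data alone can be checked directly. The sketch below uses a small made-up data frame called toy (an assumption for illustration, with the same column names as streams) so it runs without the data file.

```r
# Toy data frame (hypothetical values, not the real streams data)
toy <- data.frame(
  site = c("StreamA", "StreamB", "StreamA", "StreamB"),
  temperature_C = c(13.2, 10.2, 12.5, NA),
  dissolved_oxygen = c(7.18, NA, 7.27, 8.07)
)

# na.omit() returns a new data frame; toy itself is untouched
complete_rows <- na.omit(toy)

# The original still has all 4 rows
nrow(toy)

# Only the 2 rows with no NA in any column remain in the copy
nrow(complete_rows)
```

If you did want to keep the cleaned version, you would assign it to a name, as done with complete_rows here.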

Great! Now let’s try to run our command to get the mean for temperature again. We don’t want to remove these rows from the streams data permanently, so we’ll nest the na.omit() command within the mean() command. Because na.omit() is applied here to the temperature column only, it drops just the missing temperatures, and we get the mean of all the recorded temperatures for both streams.

mean(na.omit(streams$temperature_C))

12.6327586206897

Our code worked! Now you’ve learned how to handle NAs in a dataset in R. Let’s carry on with our summary statistics.
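There is more than one way to skip NAs, and the choice can change the answer. The sketch below compares three options on a made-up data frame called toy (hypothetical values, not the real streams data). The na.rm argument of mean() is the same one used later in this lesson inside summarize().

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  temperature_C = c(13.2, 10.2, 12.5, NA),
  dissolved_oxygen = c(7.18, NA, 7.27, 8.07)
)

# Option 1: drop NAs from the one column, then average (3 values used)
mean_column <- mean(na.omit(toy$temperature_C))

# Option 2: ask mean() to skip NAs itself (same 3 values, same result)
mean_narm <- mean(toy$temperature_C, na.rm = TRUE)

# Option 3: na.omit() on the whole data frame also drops the row where
# only dissolved_oxygen is missing, so just 2 temperatures are averaged
mean_complete <- mean(na.omit(toy)$temperature_C)

mean_column
mean_narm
mean_complete
```

Options 1 and 2 agree, while option 3 throws away a valid temperature reading. For a single-column statistic, working on the column itself usually keeps more of the data.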

💻 Your turn! How would you figure out what the mean amount of dissolved oxygen is in the water across both streams?

# write your code here to find mean DO 
streams

A data.frame: 32 × 4
year	site	temperature_C	dissolved_oxygen
<int>	<chr>	<dbl>	<dbl>
2007	StreamA	13.20	7.18
2007	StreamB	10.20	NA
2008	StreamA	12.50	7.27
2008	StreamB	NA	8.07
2009	StreamA	13.90	7.46
2009	StreamB	NA	7.26
2010	StreamA	11.36	7.00
2010	StreamB	11.67	NA
2011	StreamA	12.97	6.53
2011	StreamB	NA	NA
2012	StreamA	13.10	7.50
2012	StreamB	11.10	9.50
2013	StreamA	14.47	6.16
2013	StreamB	11.42	8.78
2014	StreamA	13.37	7.05
2014	StreamB	11.37	8.35
2015	StreamA	12.48	7.34
2015	StreamB	12.18	8.44
2016	StreamA	12.63	6.15
2016	StreamB	12.03	8.23
2017	StreamA	13.16	6.89
2017	StreamB	11.96	8.28
2018	StreamA	12.94	6.03
2018	StreamB	12.56	7.83
2019	StreamA	14.10	7.26
2019	StreamB	13.00	7.01
2020	StreamA	12.07	7.53
2020	StreamB	12.87	7.30
2021	StreamA	13.43	7.02
2021	StreamB	13.25	7.42
2022	StreamA	13.28	6.32
2022	StreamB	13.78	6.63

Now let’s use another familiar command, median(), to find the median temperature measured across both streams and all years.

median(na.omit(streams$temperature_C))

12.87

🧠✍️Class Question:

What is the median amount of dissolved oxygen across all samples?

# write your code here to find median DO

Summarizing by group

We previously used summary() to get the descriptive statistics of a column or data frame, but this groups all the rows together. Since our goal is to compare the water quality of Stream A and Stream B, we want to get descriptive statistics for them separately. We can do this using group_by() and summarize(). We explain the code and the commands we will use in the code chunk below.

streams %>%
    # group the rows by our categorical column, site
    group_by(site) %>%
    # summarize() applies each function to a column within each group;
    # together these match the statistics that summary() reports, here for temperature_C
    summarize(min_temp = min(temperature_C, na.rm = TRUE),
              first_q_temp = quantile(temperature_C, 0.25, na.rm = TRUE),
              median_temp = median(temperature_C, na.rm = TRUE),
              mean_temp = mean(temperature_C, na.rm = TRUE),
              third_q_temp = quantile(temperature_C, 0.75, na.rm = TRUE),
              max_temp = max(temperature_C, na.rm = TRUE))


A tibble: 2 × 7
site	min_temp	first_q_temp	median_temp	mean_temp	third_q_temp	max_temp
<chr>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
StreamA	11.36	12.5975	13.13	13.06000	13.385	14.47
StreamB	10.20	11.4200	12.03	12.10692	12.870	13.78
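When groups have different amounts of missing data, it can help to report how many values went into each statistic. The sketch below is one way to do that, assuming the dplyr package (part of the tidyverse) is installed. The data frame toy and the column names n_rows and n_temp are made up for illustration.

```r
library(dplyr)  # provides %>%, group_by(), summarize(), and n()

# Toy data frame (hypothetical values, not the real streams data)
toy <- data.frame(
  site = c("StreamA", "StreamA", "StreamA", "StreamB", "StreamB", "StreamB"),
  temperature_C = c(13.2, 12.5, 13.9, 10.2, NA, 11.1)
)

toy_summary <- toy %>%
  group_by(site) %>%
  summarize(
    # number of rows in each group, including rows with NA
    n_rows = n(),
    # number of non-missing temperature readings in each group
    n_temp = sum(!is.na(temperature_C)),
    # mean of the non-missing readings
    mean_temp = mean(temperature_C, na.rm = TRUE)
  )

toy_summary
```

A mean based on fewer readings is less certain, so a count column like n_temp gives useful context when comparing groups.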

💻 Your turn! You can add columns to show descriptive statistics of dissolved oxygen, or redo the above code with dissolved_oxygen instead of temperature_C inside the descriptive statistic functions.

# get the descriptive statistics of dissolved oxygen

Preliminary analysis of water quality data from Stream A and Stream B

In the next step of this module, we will explore the water quality data in detail, and use that exploration to help decide which stream is more promising for bull trout reintroduction. As a reminder, temperatures above 14 degrees C and dissolved oxygen levels below 6.2 mg/L tend to be bad for bull trout.

For now, let’s use the information in the summary tables above to get a preliminary sense of which stream might have better water quality for bull trout.

🧠✍️Class Questions

From what we’ve learned about the streams so far, which one do you think would be better to re-introduce bull trout into?
Which of the summary statistics do you think is most useful?
Is there any other information you’d like to know in order to make the best recommendation to the tribe?

Recap Lesson 2.3

In this section, we have reviewed:

how to remove NAs
how to run summary statistics by group, using group_by() and summarize()