Lesson 1.2: Loading and exploring data


These data are the length and mercury content of fish caught on the K’Avi reservation in 1998, soon after the polluting factory was shut down.

  • Each row is data from one fish.

  • The column “length” is the length of the fish in cm.

  • The column “mercury” is the concentration of mercury in the fish, in units of micrograms (ug) of mercury per gram (g) of fish.

# if running in google colab, uncomment and use the following line:
# fish_1998 <- read.csv("https://raw.githubusercontent.com/rachtorr/IndigenousEnvDataSci.github.io/refs/heads/main/MOD1/fish_1998.csv")

# read in CSV file 
fish_1998 <- read.csv("fish_1998.csv", header = TRUE, sep = ',')

Let’s break down the different parts of what just happened:

  1. We told R to read the data (a .csv file) using the command read.csv()

  2. Inside the parentheses, the file name in quotes (“fish_1998.csv”) identifies the data file in our working directory that we want to load into R

  3. header = TRUE tells R that the first row of data is the column names. The alternative is header = FALSE

  4. sep is short for “separator”, and it tells R that the columns in the csv are separated by commas (‘,’)

  5. fish_1998 is the name we are giving that file in R, and is how we will refer to this data going forward.

  6. <- is the assignment operator. It works like = in a math equation: it tells R that the name fish_1998 now refers to the data read from “fish_1998.csv”

  7. This command turns a file on your computer (fish_1998.csv) into a dataframe (fish_1998) that you can examine in R. A ‘dataframe’ is how R stores data that has rows and columns. When you run the line of code above, R looks for the .csv file called “fish_1998.csv” and imports it into this session in a format (a dataframe) that you can use for coding. If you are working in RStudio, the new dataframe (fish_1998) should appear in the “Environment” window in the upper right.
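To see what header and sep actually do, here is a minimal sketch that does not need the real data file. It writes a tiny CSV to a temporary file and reads it back twice. The names tmp_file, with_header, and no_header are made up for this illustration, and the two data rows are copied from the first two fish in fish_1998.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# tmp_file is a hypothetical name used only for this illustration.
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("length,mercury",
             "45.21643,0.4315457",
             "44.34876,0.4969477"),
           tmp_file)

# header = TRUE: the first line of the file becomes the column names
with_header <- read.csv(tmp_file, header = TRUE, sep = ",")
names(with_header)
nrow(with_header)

# header = FALSE: the first line is treated as a row of data,
# and R assigns its own column names (V1, V2)
no_header <- read.csv(tmp_file, header = FALSE, sep = ",")
names(no_header)
nrow(no_header)
```

With header = FALSE the data frame gains one extra row, because the column names are read as if they were a fish. For read.csv(), header = TRUE and sep = "," are already the defaults, so writing them out mainly makes the code easier to read.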

Previewing the data

Note: If you are working in RStudio, running View(fish_1998) will open the data as a spreadsheet in a new tab in this window (the “source editor” window).

  • Look through these data to see if they make sense (any weird numbers? Any missing data?).

  • Then close that tab and come back to this window.
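Looking for odd or missing values by eye works for a small table, but the same checks can also be done with code. The sketch below is one possible way to do it, not the only one. It uses a small stand-in data frame, fish_preview (a hypothetical name), built from the first six fish in the data so that it runs on its own; with the real data you would use fish_1998 instead.

```r
# Stand-in data frame: the first six fish from fish_1998
# (fish_preview is a hypothetical name used only for this example)
fish_preview <- data.frame(
  length  = c(45.21643, 44.34876, 46.86563, 41.67329, 51.34514, 56.66428),
  mercury = c(0.4315457, 0.4969477, 0.4491865, 0.4137213, 0.7491135, 0.6807240)
)

# Count the missing values (NA) in each column
colSums(is.na(fish_preview))

# Minimum, maximum, mean, and quartiles of each column;
# an impossible value such as a negative length would stand out here
summary(fish_preview)

# Direct checks for values that cannot be real measurements
any(fish_preview$length <= 0)
any(fish_preview$mercury < 0)
```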

If we don’t want to look at all of the data and instead just want to see what is going on in the first couple rows we can use the following code.

The command head() shows the column names and the first 6 rows of the data. We can also use str(), which shows the structure of the data, including column names, data types, and dimensions.

head(fish_1998)

str(fish_1998)
A data.frame: 6 × 3
      X    length    mercury
  <int>     <dbl>      <dbl>
1     1  45.21643  0.4315457
2     2  44.34876  0.4969477
3     3  46.86563  0.4491865
4     4  41.67329  0.4137213
5     5  51.34514  0.7491135
6     6  56.66428  0.6807240
'data.frame':	300 obs. of  3 variables:
 $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ length : num  45.2 44.3 46.9 41.7 51.3 ...
 $ mercury: num  0.432 0.497 0.449 0.414 0.749 ...
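head() does not have to show exactly 6 rows. As a small sketch, the n argument sets how many rows to show, and tail() does the same from the bottom of the data. The example uses a stand-in data frame, fish_preview (a hypothetical name), built from the six rows printed above so that it runs without the CSV file.

```r
# Stand-in data frame: the six rows printed by head(fish_1998)
# (fish_preview, first_three, and last_two are hypothetical names)
fish_preview <- data.frame(
  length  = c(45.21643, 44.34876, 46.86563, 41.67329, 51.34514, 56.66428),
  mercury = c(0.4315457, 0.4969477, 0.4491865, 0.4137213, 0.7491135, 0.6807240)
)

# Show only the first 3 rows
first_three <- head(fish_preview, n = 3)
first_three

# Show the last 2 rows
last_two <- tail(fish_preview, n = 2)
last_two
```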

🧠✍️ Class Questions

  • What are the columns in the R dataframe called fish_1998?

  • What sorts of questions could the tribal fishery managers answer with this data?

Exploring dimensions of data

To answer these questions, it would be helpful to know how many fish have mercury measurements. We can find that out very quickly with the following code.

The command dim() shows the dimensions of the dataset: the number of rows, followed by the number of columns.

dim(fish_1998)
[1] 300   3
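dim() returns both numbers at once. If you only need one of them, nrow() and ncol() give the rows and columns separately. The sketch below shows this on a stand-in data frame, fish_preview (a hypothetical name), built from the first six fish so that it runs on its own.

```r
# Stand-in data frame: the first six fish, with the same three columns as fish_1998
# (fish_preview and fish_dims are hypothetical names used only for this example)
fish_preview <- data.frame(
  X       = 1:6,
  length  = c(45.21643, 44.34876, 46.86563, 41.67329, 51.34514, 56.66428),
  mercury = c(0.4315457, 0.4969477, 0.4491865, 0.4137213, 0.7491135, 0.6807240)
)

# dim() gives rows first, then columns
fish_dims <- dim(fish_preview)
fish_dims[1]   # number of rows
fish_dims[2]   # number of columns

# nrow() and ncol() return the same two numbers one at a time
nrow(fish_preview)
ncol(fish_preview)
```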

🧠✍️ Class Questions

  • In how many fish did the managers measure mercury in 1998?

Lesson 1.2 Recap

In Lesson 1.2 we learned how to:

  • Load a csv file as a data frame using read.csv()

  • Preview the data frame using head() and str()

  • Identify the number of rows and columns using dim()