Great Lakes Fish

Author

Skylar Pietz

fishing <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-08/fishing.csv')

Rows: 65706 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): lake, species, comments, region
dbl (3): year, grand_total, values

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

stocked <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-06-08/stocked.csv')

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 56232 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (17): LAKE, STATE_PROV, SITE, ST_SITE, STAT_DIST, LS_MGMT, SPECIES, STRA...
dbl (14): SID, YEAR, MONTH, DAY, LATITUDE, LONGITUDE, GRID, NO_STOCKED, YEAR...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#install.packages("tidymodels")
library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2      ✔ recipes      1.0.4 
✔ dials        1.1.0      ✔ rsample      1.1.1 
✔ dplyr        1.0.10     ✔ tibble       3.1.8 
✔ ggplot2      3.4.0      ✔ tidyr        1.2.1 
✔ infer        1.0.4      ✔ tune         1.0.1 
✔ modeldata    1.1.0      ✔ workflows    1.1.2 
✔ parsnip      1.0.3      ✔ workflowsets 1.0.0 
✔ purrr        1.0.1      ✔ yardstick    1.1.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──

✔ readr   2.1.3     ✔ forcats 0.5.2
✔ stringr 1.5.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ readr::spec()       masks yardstick::spec()

library(kableExtra)


Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows

my_data_splits <- initial_split(fishing, prop= 0.5)
exploratory_data <- training(my_data_splits)
test_data <- testing(my_data_splits)
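
The split above is drawn at random, so the rows that land in the exploratory data (and every count computed from them) can change each time the document is rendered. A minimal sketch of one way to make the split repeatable, assuming a fixed seed is acceptable; the seed value 123 and the toy data frame are hypothetical stand-ins, not part of the original analysis:

```r
library(rsample)

# Hypothetical toy data standing in for the `fishing` data frame
toy <- data.frame(id = 1:100, lake = rep(c("Erie", "Huron"), each = 50))

# Fixing the seed (123 is an arbitrary choice) makes the random split repeatable
set.seed(123)
split_a <- initial_split(toy, prop = 0.5)

# Resetting the same seed reproduces the same split
set.seed(123)
split_b <- initial_split(toy, prop = 0.5)

# Check that both splits put the same rows in the exploratory (training) half
same_rows <- identical(training(split_a)$id, training(split_b)$id)
same_rows
```

In the notebook itself, this would amount to calling set.seed() once, immediately before initial_split(fishing, prop = 0.5).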

# See the first ten rows of the data we've read into our notebook
exploratory_data %>%
  head(10) %>%
  kable() %>%
  kable_styling(c("striped", "hover"))

year	lake	species	grand_total	comments	region	values
1931	Michigan	Burbot	NA	NA	Indiana (IN)	15
1988	Huron	Chubs	1653	NA	Total Canada (ONT)	1583
1972	Huron	Northern Pike	40	NA	U.S. Huron Proper (HP)	NA
1908	Huron	Lake Trout	5431	NA	U.S. Total (MI)	1383
1942	Michigan	Cisco	NA	NA	Green Bay (WI)	NA
2008	Erie	Rock Bass	1	NA	U.S. Total	0
1992	Erie	Quillback	323	NA	New York (NY)	0
2014	Huron	Round Whitefish	30	NA	U.S. Huron Proper (HP)	20
1994	Michigan	Freshwater Drum	NA	NA	Green Bay (WI)	1
1919	Michigan	Cisco and Chub	NA	NA	U.S. Total	10704

install.packages("skimr")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
(as 'lib' is unspecified)

library(skimr)
exploratory_data %>%
  skim()

Data summary
Name	Piped data
Number of rows	32853
Number of columns	7
_______________________
Column type frequency:
character	4
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
lake	0	1.00	4	11	6
species	0	1.00	4	29	51
comments	30072	0.08	3	607	161
region	0	1.00	9	22	24

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	0	1.00	1954.42	38.60	1867	1922	1958	1988	2015	▂▆▆▇▇
grand_total	15880	0.52	1413.88	3581.93	0	10	106	943	48821	▇▁▁▁▁
values	11035	0.66	503.42	1819.77	-31	0	16	208	45548	▇▁▁▁▁

exploratory_data %>%
  count(species)

# A tibble: 51 × 2
   species          n
   <chr>        <int>
 1 Alewife        265
 2 Amercian Eel   131
 3 American Eel    28
 4 Blue Pike      366
 5 Bowfin         186
 6 Buffalo        233
 7 Bullhead        21
 8 Bullheads     1196
 9 Burbot        1212
10 Carp          1611
# … with 41 more rows
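
Some species appear in this table under more than one label, for example "Amercian Eel" and "American Eel", or "Bullhead" and "Bullheads". A minimal sketch of how such labels could be merged before counting, using a hypothetical toy tibble; treating "Bullhead" and "Bullheads" as the same species is an assumption that would need checking against the data source:

```r
library(dplyr)

# Hypothetical toy data with the duplicated labels seen in the count table
toy <- tibble(species = c("Amercian Eel", "American Eel", "American Eel",
                          "Bullhead", "Bullheads", "Burbot"))

# Map each variant label to a single spelling
# (merging "Bullhead" into "Bullheads" is an assumption)
cleaned <- toy %>%
  mutate(species = recode(species,
                          "Amercian Eel" = "American Eel",
                          "Bullhead" = "Bullheads"))

# Count again after cleaning; the variants now share one row each
species_counts <- cleaned %>%
  count(species)
species_counts
```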

exploratory_data %>%
  count(lake)

# A tibble: 6 × 2
  lake            n
  <chr>       <int>
1 Erie         7345
2 Huron        8536
3 Michigan     9555
4 Ontario      2276
5 Saint Clair  1207
6 Superior     3934

exploratory_data %>%
  ggplot() +
  geom_bar(mapping = aes(x = species), color = "black", fill ="blue") + 
  labs(title = "Counts of Fish Species",
       x = "Species" , y = "Count") +
  coord_flip() +
  facet_wrap(~lake, scales = "free")

exploratory_data %>%
  ggplot() +
  geom_bar(mapping = aes(x = species), color = "black", fill ="blue") + 
  labs(title = "Counts of Fish Species",
       x = "Species" , y = "Count") +
  coord_flip()

exploratory_data %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = species , y = values), color = "black", fill ="blue") + 
  labs(title = "Values of Fish Species",
       x = "Species" , y = "Value") +
  coord_flip() +
  scale_y_log10()

Warning in self$trans$transform(x): NaNs produced

Warning: Transformation introduced infinite values in continuous y-axis

Warning: Removed 16659 rows containing non-finite values (`stat_boxplot()`).
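
These warnings come from the log scale: missing values cannot be plotted, the logarithm of zero is infinite, and the logarithm of a negative number is undefined (the summary above shows a minimum of -31 for values). A minimal sketch of one way to drop those rows explicitly before plotting, using a hypothetical toy tibble; whether zero and negative values should be excluded from the analysis is an assumption, not something the data source states:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with a zero, a missing value, and a negative value
toy <- tibble(species = c("Alewife", "Alewife", "Burbot", "Burbot", "Carp"),
              values  = c(120, 0, NA, 35, -31))

# Keep only rows that a log10 axis can display
plot_data <- toy %>%
  filter(!is.na(values), values > 0)

# Same plot structure as above, built on the filtered data
p <- ggplot(plot_data) +
  geom_boxplot(mapping = aes(x = species, y = values)) +
  coord_flip() +
  scale_y_log10()
```

Filtering first makes the number of excluded rows a deliberate choice that can be reported, rather than a side effect of the axis transformation.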

Hypothesis

I hypothesize that because Lake Whitefish is the most abundant species, it will also have the greatest value (production amount).

Abstract

The Great Lakes Fish data set comes from the Great Lakes Fishery Commission and includes 7 variables and 65,706 observations. I set out to answer the following questions:

1. Which species was observed most often across the lakes and regions? (overall grand total)
2. In which region were the most fish found overall?
3. Which lake held the greatest number of the most abundant species from question 1?
4. Which species yields the greatest production value?

Introduction

The data set I am working with describes fish in the Great Lakes. It includes qualitative variables, such as the fish species, the lake where the species was found, and the region where the lake is located, as well as quantitative variables: the year the species was recorded, the grand total, and values.

I split the data into exploratory and test data using the code above. My exploratory data includes 7 variables and 32,853 observations. The variables are the year of measurement, lake name, species of fish, grand total observed, free-text comments, region of the US/Canada, and value (production amount).

Exploratory Analysis

I first created a table that counts each species in my exploratory data. One potential source of error in my analysis is that some species were grouped and counted together, and some species were counted under more than one name (for example, "Amercian Eel" and "American Eel", or "Bullhead" and "Bullheads"). The species count table has 51 rows, but there are actually fewer than 51 species because some of them are repeated. To the best of my ability, I estimated which species was counted most often throughout the lakes and regions and determined that, based on my exploratory data, the Lake Whitefish was observed the most.
I then created another table that counts the observations in each lake. The table shows that 6 different lakes were observed, with Lake Michigan having the greatest count (9,590). Because the exploratory split is drawn at random, the exact counts change slightly each time the document is rendered.

Lake	n (count)
Erie	7263
Huron	8621
Michigan	9590
Ontario	2317
Saint Clair	1164
Superior	3898

To answer the third question, I used ggplot() with geom_bar() to create a bar graph of the species counts in each of the six lakes. This lets me visualize how often each type of fish appears in each body of water. Within Lake Erie, Lake Huron, and Lake Ontario, the Lake Whitefish had the highest count of any species. Interestingly, Lake Michigan had the most Lake Whitefish of any lake, at around 640, yet within Lake Michigan the Lake Whitefish ranked second, behind the Lake Trout at about 650.
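
The per-lake comparison read off the bar graph can also be tabulated directly. A minimal sketch, assuming the goal is the most frequently recorded species within each lake; the toy tibble is hypothetical and its counts are not results from the fishing data:

```r
library(dplyr)

# Hypothetical toy records: one row per observation
toy <- tibble(
  lake    = c("Michigan", "Michigan", "Michigan", "Michigan", "Michigan",
              "Erie", "Erie", "Erie"),
  species = c("Lake Trout", "Lake Trout", "Lake Trout",
              "Lake Whitefish", "Lake Whitefish",
              "Lake Whitefish", "Lake Whitefish", "Carp")
)

# Count records per lake and species, then keep the top species in each lake
top_by_lake <- toy %>%
  count(lake, species) %>%
  group_by(lake) %>%
  slice_max(n, n = 1) %>%
  ungroup()
top_by_lake
```
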
The last question was answered by using R code to create a box plot of fish species and their values. The values variable represents the amount of fish, in thousands of pounds, that was collected and sold to other fisheries and commercial fishery businesses throughout the regions. I used ggplot() with geom_boxplot() to plot each species against its value (production amount), rounded to the nearest thousand pounds, across the whole Great Lakes region. Based on the box plot, the Alewife species yielded the highest production amount.

Hypothesis Analysis

In my hypothesis I stated that because the Lake Whitefish species is most abundant, it would also have the greatest value (production amount). My hypothesis was incorrect. Although the bar graph shows that the Lake Whitefish had the highest count across all the lakes, the box plot shows that the Alewife yields the highest production value across all the lakes and regions. Therefore, I must reject my hypothesis.
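
The comparison read off the box plot could also be checked numerically. A minimal sketch, assuming the median of values per species is a reasonable measure of "highest production" (a total or a mean would be alternative choices); the toy tibble is hypothetical and its numbers are not results from the fishing data:

```r
library(dplyr)

# Hypothetical toy values in thousand pounds, including one missing entry
toy <- tibble(
  species = c("Alewife", "Alewife", "Alewife",
              "Lake Whitefish", "Lake Whitefish",
              "Lake Whitefish", "Lake Whitefish"),
  values  = c(100, 300, NA, 50, 60, 70, 80)
)

# Median production value per species, ignoring missing values,
# sorted from highest to lowest
ranked <- toy %>%
  group_by(species) %>%
  summarise(median_value = median(values, na.rm = TRUE)) %>%
  arrange(desc(median_value))
ranked
```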