Volcano_Eruptions

Author

Skylar Pietz

Volcano Eruptions

Abstract

The Volcano Eruptions data comes from the Smithsonian Institution and records reported volcanic eruptions, with start years in the data ranging from -11345 to 2020. It includes 15 variables measured on 11,178 recorded eruptions. This data set allows me to ask interesting questions like:

  • Were there more confirmed or unconfirmed eruptions reported in total?

  • Which volcano recorded the highest number of confirmed eruptions?

  • In what area did the most reported eruptions occur?

Introduction

The data I am analyzing is a data set on Volcanic Eruptions. The variables include quantitative features (like latitude and longitude) of the eruptions, as well as qualitative features like volcano number (the volcano’s unique ID), volcano name, eruption category (type of eruption), area of activity, etc.

Since the data set is so large, I began by using code to split the file in half into “exploratory data” and “test data”. The exploratory data now contains 15 variables measured on 5,589 eruptions.

More Background on Volcanoes

A volcano is a rupture in the crust of a planetary-mass object, such as Earth, that allows hot lava, volcanic ash, and gases to escape from a magma chamber below the surface.

Earth’s volcanoes occur because its crust is broken into 17 major, rigid tectonic plates that float on a hotter, softer layer in its mantle. Therefore, on Earth, volcanoes are generally found where tectonic plates are diverging or converging, and most are found underwater.

Erupting volcanoes can pose many hazards, not only in the immediate vicinity of the eruption. One such hazard is that volcanic ash can be a threat to aircraft, in particular those with jet engines where ash particles can be melted by the high operating temperature; the melted particles then adhere to the turbine blades and alter their shape, disrupting the operation of the turbine. Large eruptions can affect temperature as ash and droplets of sulfuric acid obscure the sun and cool the Earth’s lower atmosphere (or troposphere); however, they also absorb heat radiated from the Earth, thereby warming the upper atmosphere (or stratosphere). Historically, volcanic winters have caused catastrophic famines.

In a separate study reported in Nature, researchers detected 238 eruptions from the past 2,500 years. About half were in the mid- to high latitudes of the northern hemisphere, while 81 were in the tropics. (Because of the rotation of the Earth, material from tropical volcanoes ends up in both Greenland and Antarctica, while material from northern volcanoes tends to stay in the north.)

Reading Data into the Notebook & Installing Packages

#Load the tidyverse
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(kableExtra)

Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows
#install.packages("tidymodels")
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2
✔ modeldata    1.1.0     ✔ workflowsets 1.0.0
✔ parsnip      1.0.3     ✔ yardstick    1.1.0
✔ recipes      1.0.4     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard()        masks purrr::discard()
✖ dplyr::filter()          masks stats::filter()
✖ recipes::fixed()         masks stringr::fixed()
✖ kableExtra::group_rows() masks dplyr::group_rows()
✖ dplyr::lag()             masks stats::lag()
✖ yardstick::spec()        masks readr::spec()
✖ recipes::step()          masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.
#install.packages("skimr")
library(skimr)

volcano <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv')
Rows: 958 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): volcano_name, primary_volcano_type, last_eruption_year, country, r...
dbl  (8): volcano_number, latitude, longitude, elevation, population_within_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
eruptions <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/eruptions.csv')
Rows: 11178 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): volcano_name, eruption_category, area_of_activity, evidence_method...
dbl (11): volcano_number, eruption_number, vei, start_year, start_month, sta...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(skimr)
eruptions %>%
  skim()
Data summary
Name Piped data
Number of rows 11178
Number of columns 15
_______________________
Column type frequency:
character 4
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
volcano_name 0 1.00 3 37 0 921 0
eruption_category 0 1.00 18 20 0 3 0
area_of_activity 6484 0.42 3 60 0 2592 0
evidence_method_dating 1280 0.89 5 25 0 20 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
volcano_number 0 1.00 300284.37 52321.19 210010.00 263310.00 290050.00 343030.00 600000.00 ▇▇▁▁▁
eruption_number 0 1.00 15666.91 3297.61 10001.00 12817.25 15650.50 18463.75 22355.00 ▇▇▇▇▅
vei 2906 0.74 1.95 1.16 0.00 1.00 2.00 2.00 7.00 ▅▇▃▁▁
start_year 1 1.00 622.85 2482.17 -11345.00 680.00 1847.00 1950.00 2020.00 ▁▁▁▁▇
start_month 193 0.98 3.45 4.07 0.00 0.00 1.00 7.00 12.00 ▇▁▂▁▂
start_day 196 0.98 7.02 9.65 0.00 0.00 0.00 15.00 31.00 ▇▁▂▁▁
end_year 6846 0.39 1917.33 157.65 -475.00 1895.00 1957.00 1992.00 2020.00 ▁▁▁▁▇
end_month 6849 0.39 6.22 3.69 0.00 3.00 6.00 9.00 12.00 ▇▅▇▆▇
end_day 6852 0.39 13.32 9.83 0.00 4.00 15.00 21.00 31.00 ▇▃▆▃▅
latitude 0 1.00 16.87 30.76 -77.53 -6.10 17.60 40.82 85.61 ▁▃▇▆▃
longitude 0 1.00 31.57 115.25 -179.97 -77.66 55.71 139.39 179.58 ▂▃▂▁▇
eruptions %>%
  head() %>%
  kable() %>%
  kable_styling(c("hover", "striped"))
volcano_number volcano_name eruption_number eruption_category area_of_activity vei start_year start_month start_day evidence_method_dating end_year end_month end_day latitude longitude
266030 Soputan 22354 Confirmed Eruption NA NA 2020 3 23 Historical Observations 2020 4 2 1.112 124.737
343100 San Miguel 22355 Confirmed Eruption NA NA 2020 2 22 Historical Observations 2020 2 22 13.434 -88.269
233020 Fournaise, Piton de la 22343 Confirmed Eruption NA NA 2020 2 10 Historical Observations 2020 4 6 -21.244 55.708
345020 Rincon de la Vieja 22346 Confirmed Eruption NA NA 2020 1 31 Historical Observations 2020 4 17 10.830 -85.324
353010 Fernandina 22347 Confirmed Eruption NA NA 2020 1 12 Historical Observations 2020 1 12 -0.370 -91.550
273070 Taal 22344 Confirmed Eruption NA NA 2020 1 12 Historical Observations 2020 1 22 14.002 120.993

Splitting the Data

Here I split the data in half into exploratory data and test data. The exploratory data now contains 15 variables measured on 5,589 eruptions.

my_data_splits <- initial_split(eruptions, prop = 0.5)

exploratory_data <- training(my_data_splits)
test_data <- testing(my_data_splits)

The initial_split function used above creates a single random binary split of the data into an exploratory set and a test set. This gives me a smaller set of eruptions to work with. I can also change “eruptions” to “volcano” to split the volcano-level data and view its variables instead.
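Because initial_split draws rows at random, the exploratory set (and every count computed from it) changes each time the notebook is rendered unless a seed is fixed first. The sketch below illustrates this on a small made-up data frame; `toy_eruptions` and the seed value 2020 are assumptions chosen only for illustration, not part of the original analysis.

```r
library(rsample)

# Toy stand-in for `eruptions` (hypothetical values, for illustration only)
toy_eruptions <- data.frame(
  eruption_number = 1:100,
  eruption_category = rep(c("Confirmed Eruption", "Uncertain Eruption"),
                          times = c(80, 20))
)

# Fixing the seed before the split makes the random draw repeatable
set.seed(2020)  # arbitrary value; any fixed integer works
split_a <- initial_split(toy_eruptions, prop = 0.5)

set.seed(2020)
split_b <- initial_split(toy_eruptions, prop = 0.5)

exploratory_a <- training(split_a)
test_a <- testing(split_a)

# TRUE: the same seed selects the same rows
identical(exploratory_a, training(split_b))
```

With the real data, the same pattern would be a `set.seed()` call placed directly before `initial_split(eruptions, prop = 0.5)`.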

exploratory_data %>%
  head(10) %>%
  kable() %>%
  kable_styling(c("striped", "hover"))
volcano_number volcano_name eruption_number eruption_category area_of_activity vei start_year start_month start_day evidence_method_dating end_year end_month end_day latitude longitude
283030 Fujisan 17406 Confirmed Eruption NA NA -1850 0 0 Radiocarbon (uncorrected) NA NA NA 35.361 138.728
300023 Kurile Lake 18903 Confirmed Eruption NA 7 -6440 0 0 Radiocarbon (corrected) NA NA NA 51.450 157.120
211060 Etna 13779 Confirmed Eruption North flank (3110-2025 m) 1 1918 11 30 Historical Observations 1918 12 1 37.748 14.999
211042 Lipari 13395 Confirmed Eruption Gabellotto-Fiumebianco NA -5820 0 0 Radiocarbon (corrected) NA NA NA 38.490 14.933
283280 Hakkodasan 18133 Confirmed Eruption O-dake 1 450 0 0 Radiocarbon (corrected) NA NA NA 40.659 140.877
311130 Kasatochi 19741 Confirmed Eruption NA 0 1760 0 0 Historical Observations NA NA NA 52.177 -175.508
311320 Akutan 19838 Confirmed Eruption NA 0 1908 2 22 Historical Observations NA NA NA 54.134 -165.986
313030 Redoubt 20330 Confirmed Eruption NA NA -1560 0 0 Varve Count NA NA NA 60.485 -152.742
372030 Katla 12668 Confirmed Eruption NA 4 1550 0 0 Tephrochronology NA NA NA 63.633 -19.083
251070 Ritter Island 14976 Confirmed Eruption NA 1 2006 10 17 Historical Observations 2006 10 17 -5.519 148.115
#install.packages("skimr")
library(skimr)
exploratory_data %>%
  skim()
Data summary
Name Piped data
Number of rows 5589
Number of columns 15
_______________________
Column type frequency:
character 4
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
volcano_name 0 1.00 3 37 0 771 0
eruption_category 0 1.00 18 20 0 3 0
area_of_activity 3238 0.42 3 60 0 1459 0
evidence_method_dating 658 0.88 5 25 0 20 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
volcano_number 0 1.00 300497.79 52399.19 210010.00 263310.00 290270.00 343100.00 600000.00 ▇▇▁▁▁
eruption_number 0 1.00 15643.63 3302.15 10002.00 12779.00 15602.00 18481.00 22352.00 ▇▇▇▇▅
vei 1470 0.74 1.96 1.17 0.00 1.00 2.00 2.00 7.00 ▅▇▃▁▁
start_year 1 1.00 595.74 2524.30 -11345.00 600.00 1847.00 1951.00 2020.00 ▁▁▁▁▇
start_month 102 0.98 3.46 4.06 0.00 0.00 1.00 7.00 12.00 ▇▁▂▁▂
start_day 104 0.98 7.17 9.71 0.00 0.00 0.00 15.00 31.00 ▇▁▂▁▁
end_year 3435 0.39 1914.96 169.88 -475.00 1895.00 1958.00 1991.00 2020.00 ▁▁▁▁▇
end_month 3437 0.39 6.17 3.65 0.00 3.00 6.00 9.00 12.00 ▇▅▇▆▇
end_day 3439 0.38 13.43 9.96 0.00 3.00 15.00 22.00 31.00 ▇▃▆▃▅
latitude 0 1.00 16.80 31.04 -77.53 -6.10 16.72 40.83 85.61 ▁▃▇▆▃
longitude 0 1.00 31.67 115.24 -179.97 -77.99 55.71 139.39 179.58 ▂▃▂▁▇

Hypothesis

  • I hypothesize that although volcanoes exist all over the globe, the most active volcanoes with confirmed eruptions are located in nearby areas/regions on the map.

Exploratory Analysis

After splitting the data, I printed the exploratory data as a table organized by its reported variables. The table appears as follows. It has 5,589 rows and 15 columns. The rows represent the 5,589 eruptions and the columns represent the 15 variables recorded for each eruption.

exploratory_data
# A tibble: 5,589 × 15
   volca…¹ volca…² erupt…³ erupt…⁴ area_…⁵   vei start…⁶ start…⁷ start…⁸ evide…⁹
     <dbl> <chr>     <dbl> <chr>   <chr>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  
 1  283030 Fujisan   17406 Confir… <NA>       NA   -1850       0       0 Radioc…
 2  300023 Kurile…   18903 Confir… <NA>        7   -6440       0       0 Radioc…
 3  211060 Etna      13779 Confir… North …     1    1918      11      30 Histor…
 4  211042 Lipari    13395 Confir… Gabell…    NA   -5820       0       0 Radioc…
 5  283280 Hakkod…   18133 Confir… O-dake      1     450       0       0 Radioc…
 6  311130 Kasato…   19741 Confir… <NA>        0    1760       0       0 Histor…
 7  311320 Akutan    19838 Confir… <NA>        0    1908       2      22 Histor…
 8  313030 Redoubt   20330 Confir… <NA>       NA   -1560       0       0 Varve …
 9  372030 Katla     12668 Confir… <NA>        4    1550       0       0 Tephro…
10  251070 Ritter…   14976 Confir… <NA>        1    2006      10      17 Histor…
# … with 5,579 more rows, 5 more variables: end_year <dbl>, end_month <dbl>,
#   end_day <dbl>, latitude <dbl>, longitude <dbl>, and abbreviated variable
#   names ¹​volcano_number, ²​volcano_name, ³​eruption_number, ⁴​eruption_category,
#   ⁵​area_of_activity, ⁶​start_year, ⁷​start_month, ⁸​start_day,
#   ⁹​evidence_method_dating

Eruptions are categorized into three types: Confirmed Eruption, Discredited Eruption, and Uncertain Eruption. I used the count function to find how many eruptions fall into each category. This is shown here:

exploratory_data %>%
  count(eruption_category)
# A tibble: 3 × 2
  eruption_category        n
  <chr>                <int>
1 Confirmed Eruption    4945
2 Discredited Eruption    89
3 Uncertain Eruption     555

As one can see, the exploratory data contains 4,945 confirmed eruptions, 89 discredited eruptions, and 555 uncertain eruptions. This helps me answer my first question of whether there were more confirmed or unconfirmed eruptions: confirmed eruptions (4,945) far outnumber the discredited and uncertain eruptions combined (644).
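To put the category counts on a common scale, they can be converted to shares of all eruptions in the exploratory data. This is a minimal sketch that re-enters the three counts from the count() output above by hand; the names `category_counts` and `category_shares` are introduced here for illustration.

```r
library(dplyr)

# Counts copied from the count(eruption_category) output above
category_counts <- tibble(
  eruption_category = c("Confirmed Eruption",
                        "Discredited Eruption",
                        "Uncertain Eruption"),
  n = c(4945, 89, 555)
)

# Share of each category among all eruptions in the exploratory data
category_shares <- category_counts %>%
  mutate(share = n / sum(n))

category_shares
```

On the full pipeline, the same `mutate(share = n / sum(n))` step could be appended directly after `count(eruption_category)`.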

I then created another table, using the group_by function to organize it by volcano name and the count function to count each type of eruption for that volcano. Lastly, the arrange function sorts the rows from the highest count to the lowest. The table generated is shown below:

exploratory_data %>%
  group_by(volcano_name) %>%
  count(eruption_category) %>%
  arrange(-n)
# A tibble: 1,046 × 3
# Groups:   volcano_name [771]
   volcano_name           eruption_category      n
   <chr>                  <chr>              <int>
 1 Etna                   Confirmed Eruption   106
 2 Fournaise, Piton de la Confirmed Eruption    82
 3 Villarrica             Confirmed Eruption    76
 4 Asosan                 Confirmed Eruption    74
 5 Asamayama              Confirmed Eruption    59
 6 Klyuchevskoy           Confirmed Eruption    57
 7 Merapi                 Confirmed Eruption    57
 8 Katla                  Confirmed Eruption    56
 9 Mauna Loa              Confirmed Eruption    53
10 Sheveluch              Confirmed Eruption    49
# … with 1,036 more rows

Based on the table above, I can determine which volcanoes have the most confirmed eruptions. “Etna” has the most, with 106 confirmed eruptions in the exploratory data. “Fournaise, Piton de la” is second with 82. This answers my second question of which volcano recorded the highest number of confirmed eruptions: Etna, with 106 confirmed eruptions in the exploratory data.
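Since the question concerns confirmed eruptions only, one alternative is to filter to that category first and then count by volcano, which leaves one row per volcano. The sketch below uses a small made-up tibble (`toy`, with hypothetical volcano names) rather than the real data.

```r
library(dplyr)

# Toy data with hypothetical volcano names, for illustration only
toy <- tibble(
  volcano_name = c("Volcano A", "Volcano A", "Volcano A", "Volcano A",
                   "Volcano B", "Volcano B", "Volcano C"),
  eruption_category = c("Confirmed Eruption", "Confirmed Eruption",
                        "Confirmed Eruption", "Uncertain Eruption",
                        "Confirmed Eruption", "Confirmed Eruption",
                        "Confirmed Eruption")
)

# Keep confirmed eruptions, then count per volcano, largest count first
confirmed_counts <- toy %>%
  filter(eruption_category == "Confirmed Eruption") %>%
  count(volcano_name, sort = TRUE)

confirmed_counts
```

Applied to `exploratory_data`, the same two steps would give a ranking by confirmed eruptions without the uncertain and discredited rows mixed in.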

I used the same functions to create a similar yet different table that shows me the area with the highest volcanic activity. The table appears below:

exploratory_data %>%
  group_by(area_of_activity) %>%
  count(eruption_category) %>%
  arrange(-n)
# A tibble: 1,511 × 3
# Groups:   area_of_activity [1,460]
   area_of_activity eruption_category        n
   <chr>            <chr>                <int>
 1 <NA>             Confirmed Eruption    2734
 2 <NA>             Uncertain Eruption     416
 3 <NA>             Discredited Eruption    88
 4 Naka-dake        Confirmed Eruption      70
 5 Ngauruhoe        Confirmed Eruption      29
 6 Bromo            Confirmed Eruption      25
 7 Ohachi           Confirmed Eruption      23
 8 Anak Krakatau    Confirmed Eruption      22
 9 Mihara-yama      Confirmed Eruption      20
10 Central Crater   Confirmed Eruption      16
# … with 1,501 more rows

Looking at this table, the largest groups have no recorded area of activity (NA): 2,734 confirmed eruptions, along with 416 uncertain eruptions and 88 discredited eruptions. The most frequent named area of activity is Naka-dake, with 70 confirmed eruptions. I did some further research and found out that Mt. Aso Nakadake in Aso-Kuju National Park, central Kyushu, is Japan’s most active volcano. The huge caldera stretches 24 kilometers from north to south, surrounded by 5 peaks, known collectively as Aso Gogaku. The Mt. Nakadake crater is still active. Therefore, the answer to my last question (in what area did the most reported eruptions occur?) is Naka-dake, among the eruptions that have a recorded area.
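Because so many rows have a missing area_of_activity, it can help to drop those rows before counting so that only named areas are ranked. This is a minimal sketch on made-up data; `toy_areas` and its area names are hypothetical.

```r
library(dplyr)

# Toy data with hypothetical area names; NA marks a missing area
toy_areas <- tibble(
  area_of_activity = c(NA, NA, NA, "Crater A", "Crater A", "Crater B"),
  eruption_category = rep("Confirmed Eruption", 6)
)

# Remove rows with no recorded area, then count the named areas
named_area_counts <- toy_areas %>%
  filter(!is.na(area_of_activity)) %>%
  count(area_of_activity, sort = TRUE)

named_area_counts
```

The same `filter(!is.na(area_of_activity))` step could be placed before the grouping in the code above so that the NA rows no longer sit at the top of the table.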

Hypothesis Analysis

In my hypothesis I stated that although volcanoes exist all over the globe, the most active volcanoes with confirmed eruptions are located in nearby areas on the map. From my last code block, I was able to determine which named area had the most activity based on confirmed eruptions: Naka-dake. I was hoping to use ggplot to create a graph or map that would better visualize where the most active volcanoes are located on a global level, but after much trial and error with the code I was unable to do so. However, I can see that several of the named areas with the most activity appear to be in Japan and the wider eastern Asia region.
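One possible way to draw such a map is a simple scatter plot of longitude against latitude. The sketch below is an assumption about how this could be done, not the analysis actually run here; it uses only the six rows shown in the head() output earlier, and the names `sample_eruptions` and `eruption_map` are introduced for illustration.

```r
library(ggplot2)

# Six eruptions copied from the head(eruptions) output earlier in the notebook
sample_eruptions <- data.frame(
  volcano_name = c("Soputan", "San Miguel", "Fournaise, Piton de la",
                   "Rincon de la Vieja", "Fernandina", "Taal"),
  latitude = c(1.112, 13.434, -21.244, 10.830, -0.370, 14.002),
  longitude = c(124.737, -88.269, 55.708, -85.324, -91.550, 120.993)
)

# One point per eruption; coord_quickmap keeps a roughly map-like aspect ratio
eruption_map <- ggplot(sample_eruptions,
                       aes(x = longitude, y = latitude)) +
  geom_point(colour = "firebrick", alpha = 0.6) +
  coord_quickmap(xlim = c(-180, 180), ylim = c(-90, 90)) +
  labs(x = "Longitude", y = "Latitude", title = "Eruption locations")

eruption_map
```

Swapping `sample_eruptions` for `exploratory_data` (optionally filtered to confirmed eruptions) would plot every eruption in the exploratory set. A country outline could be added with ggplot2's `borders("world")`, which requires the maps package to be installed.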

Conclusions

In conclusion, I was able to use and interpret the volcano eruptions data set for several aspects of the analysis. I ran code to count and rank the recorded eruptions of numerous volcanoes by category, volcano, and area of activity. Based on this data set, environmental scientists and researchers can estimate which volcanoes are the most active and where they are located. They could also use the data to possibly predict which volcano is due to erupt next. By using the data to perform tests and then making predictions about future volcanic activity, we could take action to reduce potential harm and destruction to human society and all other organisms living on Earth.