The Volcano Eruptions data comes from the Smithsonian Institution and covers volcanic eruptions reported over the past 2,500 years. It includes 15 variables measured on 11,178 eruptions. This data set allows me to ask interesting questions like:
Were there more confirmed or unconfirmed eruptions reported in total?
Which volcano recorded the highest number of confirmed eruptions?
In what area did the most reported eruptions occur?
Introduction
The data I am analyzing is a data set on Volcanic Eruptions. The variables include quantitative features (like latitude and longitude) of the eruptions, as well as qualitative features like volcano number (the volcano’s unique ID), volcano name, eruption category (type of eruption), area of activity, etc.
Since the data set is so large, I began by using code to split the file in half into “exploratory data” and “test data”. The exploratory data now contains 15 variables measured on 5,589 eruptions.
More Background on Volcanoes
A volcano is a rupture in the crust of a planetary-mass object, such as Earth, that allows hot lava, volcanic ash, and gases to escape from a magma chamber below the surface.
Earth’s volcanoes occur because its crust is broken into 17 major, rigid tectonic plates that float on a hotter, softer layer in its mantle. Therefore, on Earth, volcanoes are generally found where tectonic plates are diverging or converging, and most are found underwater.
Erupting volcanoes can pose many hazards, not only in the immediate vicinity of the eruption. One such hazard is that volcanic ash can be a threat to aircraft, in particular those with jet engines where ash particles can be melted by the high operating temperature; the melted particles then adhere to the turbine blades and alter their shape, disrupting the operation of the turbine. Large eruptions can affect temperature as ash and droplets of sulfuric acid obscure the sun and cool the Earth’s lower atmosphere (or troposphere); however, they also absorb heat radiated from the Earth, thereby warming the upper atmosphere (or stratosphere). Historically, volcanic winters have caused catastrophic famines.
The researchers detected 238 eruptions from the past 2,500 years, they report today in Nature. About half were in the mid- to high-latitudes in the northern hemisphere, while 81 were in the tropics. (Because of the rotation of the Earth, material from tropical volcanoes ends up in both Greenland and Antarctica, while material from northern volcanoes tends to stay in the north.)
Reading Data into the Notebook & Installing Packages
Rows: 958 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): volcano_name, primary_volcano_type, last_eruption_year, country, r...
dbl (8): volcano_number, latitude, longitude, elevation, population_within_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 11178 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): volcano_name, eruption_category, area_of_activity, evidence_method...
dbl (11): volcano_number, eruption_number, vei, start_year, start_month, sta...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here is where I split the data file in half into exploratory data and test data. The exploratory data now contains 15 variables measured on 5,589 eruptions.
The initial_split function, used above, created a single binary split of the data into an exploratory set and a test set. This gives me a smaller, more manageable set of eruptions to work with. I can also change “eruptions” to “volcano” in the code to explore the volcano data set and its variables instead.
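Because initial_split assigns rows at random, the counts in the exploratory half can change from one run to the next unless a seed is fixed. The sketch below shows one way to make the split repeatable. It assumes the rsample package and uses a small made-up tibble (`eruptions_toy`) in place of the real eruptions data; the seed value is arbitrary. Note that `prop = 0.5` is needed for an even split, since the default proportion is 3/4.

```r
library(tibble)
library(rsample)

# Hypothetical stand-in for the eruptions data (10 made-up rows).
eruptions_toy <- tibble(
  eruption_number = 1:10,
  eruption_category = rep(
    c("Confirmed Eruption", "Uncertain Eruption"),
    times = 5
  )
)

# Fix the random seed so the same rows land in each half on every run.
set.seed(2023)
eruption_split <- initial_split(eruptions_toy, prop = 0.5)

exploratory_data <- training(eruption_split)  # half used for exploration
test_data <- testing(eruption_split)          # half held back
```

With the seed set, re-running the notebook should reproduce the same exploratory rows, so the tables and the numbers quoted in the text stay in agreement.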
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
(as 'lib' is unspecified)
library(skimr)
exploratory_data %>% skim()
Data summary
Name                     Piped data
Number of rows           5589
Number of columns        15
Column type frequency:
  character              4
  numeric                11
Group variables          None

Variable type: character
skim_variable           n_missing  complete_rate  min  max  empty  n_unique  whitespace
volcano_name                    0           1.00    3   37      0       771           0
eruption_category               0           1.00   18   20      0         3           0
area_of_activity             3238           0.42    3   60      0      1459           0
evidence_method_dating        658           0.88    5   25      0        20           0

Variable type: numeric
skim_variable    n_missing  complete_rate       mean        sd         p0        p25        p50        p75       p100  hist
volcano_number           0           1.00  300497.79  52399.19  210010.00  263310.00  290270.00  343100.00  600000.00  ▇▇▁▁▁
eruption_number          0           1.00   15643.63   3302.15   10002.00   12779.00   15602.00   18481.00   22352.00  ▇▇▇▇▅
vei                   1470           0.74       1.96      1.17       0.00       1.00       2.00       2.00       7.00  ▅▇▃▁▁
start_year               1           1.00     595.74   2524.30  -11345.00     600.00    1847.00    1951.00    2020.00  ▁▁▁▁▇
start_month            102           0.98       3.46      4.06       0.00       0.00       1.00       7.00      12.00  ▇▁▂▁▂
start_day              104           0.98       7.17      9.71       0.00       0.00       0.00      15.00      31.00  ▇▁▂▁▁
end_year              3435           0.39    1914.96    169.88    -475.00    1895.00    1958.00    1991.00    2020.00  ▁▁▁▁▇
end_month             3437           0.39       6.17      3.65       0.00       3.00       6.00       9.00      12.00  ▇▅▇▆▇
end_day               3439           0.38      13.43      9.96       0.00       3.00      15.00      22.00      31.00  ▇▃▆▃▅
latitude                 0           1.00      16.80     31.04     -77.53      -6.10      16.72      40.83      85.61  ▁▃▇▆▃
longitude                0           1.00      31.67    115.24    -179.97     -77.99      55.71     139.39     179.58  ▂▃▂▁▇
Hypothesis
I hypothesize that although volcanoes exist all over the globe, the most active volcanoes with confirmed eruptions are located in nearby areas/regions on the map.
Exploratory Analysis
After splitting the data, I used code and piping to create a table that organizes the exploratory data by its reported variables. The table generated appears as follows. It shows 5,589 rows by 15 columns. The rows represent the 5,589 eruptions and the columns represent the 15 variables on which each eruption was measured or observed.
Eruptions are categorized into three different types. These include Confirmed Eruption, Discredited Eruption, and Uncertain Eruption. I used the count function to figure out how many of each eruption category were observed in the volcanoes. This is shown here:
As one can see, there were 4,978 confirmed eruptions, 76 discredited eruptions, and 535 uncertain eruptions. This helps me answer my first question of whether there were more confirmed or unconfirmed eruptions. The data shows that confirmed eruptions (4,978) far outnumber the discredited and uncertain eruptions combined (611).
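The category tally can be sketched with dplyr's count function. The example below does not use the raw data: it rebuilds a category column from the three counts reported above (an assumption made only so the block runs on its own), and the names `reported_counts` and `exploratory_toy` are illustrative.

```r
library(dplyr)
library(tibble)

# Rebuild a category column from the counts quoted in the text.
# This is a reconstruction for illustration, not the raw eruption records.
reported_counts <- c(
  "Confirmed Eruption" = 4978,
  "Discredited Eruption" = 76,
  "Uncertain Eruption" = 535
)
exploratory_toy <- tibble(
  eruption_category = rep(names(reported_counts), times = reported_counts)
)

# One row per category, plus each category's share of all eruptions.
category_counts <- exploratory_toy %>%
  count(eruption_category) %>%
  mutate(share = n / sum(n))

# Discredited and uncertain eruptions combined.
unconfirmed_total <- category_counts %>%
  filter(eruption_category != "Confirmed Eruption") %>%
  summarise(total = sum(n)) %>%
  pull(total)
```

On these numbers, confirmed eruptions make up roughly 89% of the 5,589 rows, and the two unconfirmed categories together account for 611.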
I then created another table but used the group_by function to organize it by volcano name and the count function to count each type of eruption for that volcano. Lastly, the arrange function changes the way the data is presented by organizing it in order from the volcano with the highest number of eruptions first to the one with the least. The table generated is shown below:
# A tibble: 1,046 × 3
# Groups:   volcano_name [771]
   volcano_name            eruption_category       n
   <chr>                   <chr>               <int>
 1 Etna                    Confirmed Eruption    106
 2 Fournaise, Piton de la  Confirmed Eruption     82
 3 Villarrica              Confirmed Eruption     76
 4 Asosan                  Confirmed Eruption     74
 5 Asamayama               Confirmed Eruption     59
 6 Klyuchevskoy            Confirmed Eruption     57
 7 Merapi                  Confirmed Eruption     57
 8 Katla                   Confirmed Eruption     56
 9 Mauna Loa               Confirmed Eruption     53
10 Sheveluch               Confirmed Eruption     49
# … with 1,036 more rows
Based on the table above, I can determine which volcanoes have the most confirmed eruptions. Etna ranks first with 106 confirmed eruptions, and “Fournaise, Piton de la” is second with 82. This answers my second question: the volcano with the highest number of confirmed eruptions in the exploratory data is Etna, with 106.
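The group_by, count, and arrange steps described above can be sketched as follows. The tibble `eruptions_toy` is a made-up stand-in (real volcano names, invented rows), so the counts it produces are not the ones in the table. The second pipeline is an optional variant, not the original code: it filters to confirmed eruptions first so that each volcano appears on exactly one row.

```r
library(dplyr)
library(tibble)

# Hypothetical mini data set: real volcano names, made-up rows.
eruptions_toy <- tibble(
  volcano_name = c("Etna", "Etna", "Etna", "Etna",
                   "Merapi", "Merapi", "Katla"),
  eruption_category = c(
    "Confirmed Eruption", "Confirmed Eruption", "Confirmed Eruption",
    "Uncertain Eruption",
    "Confirmed Eruption", "Confirmed Eruption",
    "Confirmed Eruption"
  )
)

# Pattern described in the text: count categories within each volcano,
# then order all rows from the largest count to the smallest.
by_volcano_and_category <- eruptions_toy %>%
  group_by(volcano_name) %>%
  count(eruption_category) %>%
  arrange(desc(n))

# Variant: keep confirmed eruptions only, one row per volcano.
confirmed_by_volcano <- eruptions_toy %>%
  filter(eruption_category == "Confirmed Eruption") %>%
  count(volcano_name, sort = TRUE)
```

The first row of `confirmed_by_volcano` is then the volcano with the most confirmed eruptions, which is the quantity the second question asks about.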
I used the same functions to create a similar yet different table that shows me the area with the highest volcanic activity. The table appears below:
Looking at this table, the largest group is the one where the area of activity is not available (NA), meaning no area was recorded for those eruptions. This group contains 2,724 confirmed eruptions, along with 411 uncertain eruptions and 76 discredited eruptions. The top named area of activity is Naka-dake, with 73 confirmed eruptions. I did some further research and found that Mt. Aso Nakadake in Aso-Kuju National Park, central Kyushu, is Japan’s most active volcano. The huge caldera stretches 24 kilometers from north to south and is surrounded by 5 peaks, known collectively as Aso Gogaku. The Mt. Nakadake crater is still active. Therefore, the answer to my last question (in what area did the most reported eruptions occur?) is Naka-dake, the named area with the most eruptions.
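Since missing areas dominate the tally, it can help to count the areas twice: once with NA kept, to see how much of the data has no recorded area, and once with NA rows dropped, to rank only the named areas. The following is a minimal sketch on a made-up tibble; `eruptions_toy` and the area "Hypothetical Crater" are illustrative and not part of the data set.

```r
library(dplyr)
library(tibble)

# Hypothetical mini data set with missing areas, for illustration only.
eruptions_toy <- tibble(
  area_of_activity = c("Naka-dake", "Naka-dake", NA, NA, NA,
                       "Hypothetical Crater"),
  eruption_category = rep("Confirmed Eruption", times = 6)
)

# Count with NA kept: the missing-area group shows up as its own row.
area_counts <- eruptions_toy %>%
  count(area_of_activity, eruption_category, sort = TRUE)

# Count with NA dropped: only named areas are ranked.
named_area_counts <- eruptions_toy %>%
  filter(!is.na(area_of_activity)) %>%
  count(area_of_activity, eruption_category, sort = TRUE)
```

In this toy example the NA group leads the first table, while Naka-dake leads the second, mirroring the pattern described above.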
Hypothesis Analysis
In my hypothesis I stated that although volcanoes exist all over the globe, the most active volcanoes with confirmed eruptions are located in nearby areas on the map. From my last code block, I was able to determine which named area had the most activity based on confirmed eruptions: Naka-dake. I was hoping to use ggplot to create a graph or map that would better visualize where the most active volcanoes are located on a global level, but after much trial and error with the code I was not able to get this working. However, I can see that the names of the areas with the most activity appear to be in the eastern Japan and eastern Asia region.
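One possible way to draw such a map is to count confirmed eruptions per volcano and plot each volcano at its longitude and latitude, with point size showing the count. The sketch below assumes ggplot2 and dplyr and uses a made-up tibble: the volcano names, coordinates, and counts are all hypothetical, though the column names match those in the eruptions data. It draws points only, without country outlines.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical volcanoes with made-up coordinates, for illustration only.
eruptions_toy <- tibble(
  volcano_name = c("Volcano A", "Volcano A", "Volcano A",
                   "Volcano B", "Volcano B", "Volcano C"),
  eruption_category = c(
    "Confirmed Eruption", "Confirmed Eruption", "Confirmed Eruption",
    "Confirmed Eruption", "Uncertain Eruption", "Confirmed Eruption"
  ),
  latitude = c(33, 33, 33, -8, -8, 64),
  longitude = c(131, 131, 131, 110, 110, -19)
)

# Confirmed eruptions per volcano, keeping each volcano's coordinates.
volcano_activity <- eruptions_toy %>%
  filter(eruption_category == "Confirmed Eruption") %>%
  count(volcano_name, latitude, longitude, name = "confirmed_eruptions")

# Larger points mark volcanoes with more confirmed eruptions.
volcano_map <- ggplot(volcano_activity, aes(x = longitude, y = latitude)) +
  geom_point(aes(size = confirmed_eruptions), alpha = 0.6) +
  coord_quickmap() +
  labs(
    x = "Longitude",
    y = "Latitude",
    size = "Confirmed eruptions"
  )
```

Printing `volcano_map` displays the plot. If the hypothesis holds, the largest points on the real data should cluster in a few neighboring regions rather than spreading evenly across the globe.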
Conclusions
In conclusion, I was able to use and interpret the volcano eruptions data set for several parts of the analysis. I ran a series of code blocks to analyze the volcanic activity of numerous volcanoes from the past 2,500 years. Based on this data set, environmental scientists and researchers can estimate which volcanoes are the most active and where they are located. They could also use the data to help predict which volcano might erupt next. By using the data to run tests and functions and then making predictions about future volcanic activity, we could take action to reduce potential harm and destruction to human society and all other organisms living on Earth.