For the first portion of the project, I will be analyzing two large datasets. The first being a very in depth and detailed description of sommelier wines from different geographic locations, and their respective ratings with other informative metrics (price,point ranking, sommelier taster, winery, and origin). The second dataset includes happiness scores from every country around the world based on certain metrics such as happiness score, economy, health, and many more. As someone who is of legal age and enjoys to occasionally drink wine (also watched an amazing documentary called sommelier) I would like to see if higher wine scores in different regions pertains to happiness.
library(tidyverse)
# read your datasets in here, e.g., with read_csv()
wine <- read_csv("wine.csv")
happiness <- read_csv("happiness.csv")
glimpse(wine)
## Rows: 129,971
## Columns: 14
## $ X1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ country <chr> "Italy", "Portugal", "US", "US", "US", "Spain",…
## $ description <chr> "Aromas include tropical fruit, broom, brimston…
## $ designation <chr> "Vulkà Bianco", "Avidagos", NA, "Reserve Late H…
## $ points <dbl> 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87,…
## $ price <dbl> NA, 15, 14, 13, 65, 15, 16, 24, 12, 27, 19, 30,…
## $ province <chr> "Sicily & Sardinia", "Douro", "Oregon", "Michig…
## $ region_1 <chr> "Etna", NA, "Willamette Valley", "Lake Michigan…
## $ region_2 <chr> NA, NA, "Willamette Valley", NA, "Willamette Va…
## $ taster_name <chr> "Kerin O’Keefe", "Roger Voss", "Paul Gregutt", …
## $ taster_twitter_handle <chr> "@kerinokeefe", "@vossroger", "@paulgwine ", NA…
## $ title <chr> "Nicosia 2013 Vulkà Bianco (Etna)", "Quinta do…
## $ variety <chr> "White Blend", "Portuguese Red", "Pinot Gris", …
## $ winery <chr> "Nicosia", "Quinta dos Avidagos", "Rainstorm", …
glimpse(happiness)
## Rows: 158
## Columns: 12
## $ Country <chr> "Switzerland", "Iceland", "Denmark", …
## $ Region <chr> "Western Europe", "Western Europe", "…
## $ `Happiness Rank` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
## $ `Happiness Score` <dbl> 7.587, 7.561, 7.527, 7.522, 7.427, 7.…
## $ `Standard Error` <dbl> 0.03411, 0.04884, 0.03328, 0.03880, 0…
## $ `Economy (GDP per Capita)` <dbl> 1.39651, 1.30232, 1.32548, 1.45900, 1…
## $ Family <dbl> 1.34951, 1.40223, 1.36058, 1.33095, 1…
## $ `Health (Life Expectancy)` <dbl> 0.94143, 0.94784, 0.87464, 0.88521, 0…
## $ Freedom <dbl> 0.66557, 0.62877, 0.64938, 0.66973, 0…
## $ `Trust (Government Corruption)` <dbl> 0.41978, 0.14145, 0.48357, 0.36503, 0…
## $ Generosity <dbl> 0.29678, 0.43630, 0.34139, 0.34699, 0…
## $ `Dystopia Residual` <dbl> 2.51738, 2.70201, 2.49204, 2.46531, 2…
If your datasets are tidy already, demonstrate that you can reshape data with pivot wider/longer here (e.g., untidy and then retidy). Alternatively, it may be easier to wait until the wrangling section so you can reshape your summary statistics. Note here if you are going to do this.
# your tidying code (if applicable; can also wait until
# wrangling section) due to my data being rather tidy I will
# perform a methodology of untidying and tidying
# I will make the region untidy by creating one column and
# separating regions by a column Thus, I will wait until the
# wrangling section to reshape my data
wine <- wine %>% pivot_longer(8:9, names_to = "region", names_prefix = "region_",
values_to = "region type")
# Separate the first and last names of sommeliers for easier
# readability
wine <- wine %>% separate(col = taster_name, into = c("First Name",
"Last Name"), sep = " ")
# your joining code display the dimensions for the wine and
# happiness dataset
wine %>% dim()
## [1] 259942 15
happiness %>% dim()
## [1] 158 12
# join the two datasets with inner_join
joinedData <- wine %>% inner_join(happiness, by = c(country = "Country"))
joinedData %>% dim()
## [1] 150660 26
joinedData %>% summarize_all(n_distinct)
## # A tibble: 1 x 26
## X1 country description designation points price province `First Name`
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1 75330 41 69385 24194 21 388 397 15
## # … with 18 more variables: `Last Name` <int>, taster_twitter_handle <int>,
## # title <int>, variety <int>, winery <int>, region <int>, `region
## # type` <int>, Region <int>, `Happiness Rank` <int>, `Happiness Score` <int>,
## # `Standard Error` <int>, `Economy (GDP per Capita)` <int>, Family <int>,
## # `Health (Life Expectancy)` <int>, Freedom <int>, `Trust (Government
## # Corruption)` <int>, Generosity <int>, `Dystopia Residual` <int>
# your joining code find the rows in wine which didn't have a
# match in happiness by country
wine %>% anti_join(happiness, by = c(country = "Country"))
## # A tibble: 109,282 x 15
## X1 country description designation points price province `First Name`
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 2 US Tart and s… <NA> 87 14 Oregon Paul
## 2 2 US Tart and s… <NA> 87 14 Oregon Paul
## 3 3 US Pineapple … Reserve La… 87 13 Michigan Alexander
## 4 3 US Pineapple … Reserve La… 87 13 Michigan Alexander
## 5 4 US Much like … Vintner's … 87 65 Oregon Paul
## 6 4 US Much like … Vintner's … 87 65 Oregon Paul
## 7 10 US Soft, supp… Mountain C… 87 19 Califor… Virginie
## 8 10 US Soft, supp… Mountain C… 87 19 Califor… Virginie
## 9 12 US Slightly r… <NA> 87 34 Califor… Virginie
## 10 12 US Slightly r… <NA> 87 34 Califor… Virginie
## # … with 109,272 more rows, and 7 more variables: `Last Name` <chr>,
## # taster_twitter_handle <chr>, title <chr>, variety <chr>, winery <chr>,
## # region <chr>, `region type` <chr>
Simply put, I am inner joining the two datasets on a country basis which will remove any values which don’t have a match in countries from either dataset. This will allow me to accurately compare the countries which were represented in both datasets, and not have any countries describe their associated happiness metrics. After joining, one can see that that the dimensions for the dataset are 150660 rows by 26 columns. The dimensions for wine and happiness are seen above. Wine has apprxoimately 109,282 rows which didn’t appear in the happiness dataset
library(tidytext)
# Use pivot wider to tidy data back into region1 and region2
# for easier readibility, and more in depth analysis
joinedData <- joinedData %>% pivot_wider(names_from = region,
values_from = `region type`, names_prefix = "region_")
# your wrangling code Aggregate the countries with perfect
# wine rating scores from the joined dataset
joinedData %>% group_by(title, country) %>% filter(points ==
100) %>% select(title, points, country) %>% arrange(desc(points))
## # A tibble: 15 x 3
## # Groups: title, country [15]
## title points country
## <chr> <dbl> <chr>
## 1 Chambers Rosewood Vineyards NV Rare Muscat (Rutherglen) 100 Austral…
## 2 Avignonesi 1995 Occhio di Pernice (Vin Santo di Montepulcia… 100 Italy
## 3 Krug 2002 Brut (Champagne) 100 France
## 4 Tenuta dell'Ornellaia 2007 Masseto Merlot (Toscana) 100 Italy
## 5 Casa Ferreirinha 2008 Barca-Velha Red (Douro) 100 Portugal
## 6 Biondi Santi 2010 Riserva (Brunello di Montalcino) 100 Italy
## 7 Château Léoville Barton 2010 Saint-Julien 100 France
## 8 Louis Roederer 2008 Cristal Vintage Brut (Champagne) 100 France
## 9 Salon 2006 Le Mesnil Blanc de Blancs Brut Chardonnay (Champa… 100 France
## 10 Château Lafite Rothschild 2010 Pauillac 100 France
## 11 Casanova di Neri 2007 Cerretalto (Brunello di Montalcino) 100 Italy
## 12 Château Cheval Blanc 2010 Saint-Émilion 100 France
## 13 Château Léoville Las Cases 2010 Saint-Julien 100 France
## 14 Quinta do Noval 2011 Nacional Vintage (Port) 100 Portugal
## 15 Château Haut-Brion 2014 Pessac-Léognan 100 France
# Aggregate the number of wine varieties
joinedData %>% group_by(variety) %>% summarize(n())
## # A tibble: 638 x 2
## variety `n()`
## <chr> <int>
## 1 Abouriou 2
## 2 Agiorgitiko 63
## 3 Aglianico 344
## 4 Aidani 1
## 5 Airen 3
## 6 Albana 23
## 7 Albanello 1
## 8 Albariño 346
## 9 Albarossa 2
## 10 Aleatico 5
## # … with 628 more rows
# Find the maximum price of wine in descending order
joinedData %>% group_by(title) %>% select(title, price) %>% arrange(desc(price))
## # A tibble: 75,330 x 2
## # Groups: title [68,489]
## title price
## <chr> <dbl>
## 1 Château les Ormes Sorbet 2013 Médoc 3300
## 2 Château Pétrus 2014 Pomerol 2500
## 3 Domaine du Comte Liger-Belair 2010 La Romanée 2500
## 4 Château Pétrus 2011 Pomerol 2000
## 5 Domaine du Comte Liger-Belair 2005 La Romanée 2000
## 6 Château Margaux 2009 Margaux 1900
## 7 Château Lafite Rothschild 2010 Pauillac 1500
## 8 Château Cheval Blanc 2010 Saint-Émilion 1500
## 9 Château Mouton Rothschild 2009 Pauillac 1300
## 10 Château Haut-Brion 2009 Pessac-Léognan 1200
## # … with 75,320 more rows
# your wrangling code find the average value for points of
# wine
joinedData %>% summarize(meanPoints = mean(points))
## # A tibble: 1 x 1
## meanPoints
## <dbl>
## 1 88.4
# calculate the standard deviation and standard error for the
# price of wine on a country basis
joinedData %>% group_by(country) %>% summarize(mean = mean(price,
na.rm = T), n = n(), sd = sd(price, na.rm = T), se = sd/sqrt(n))
## # A tibble: 41 x 5
## country mean n sd se
## <chr> <dbl> <int> <dbl> <dbl>
## 1 Argentina 24.5 3800 23.4 0.380
## 2 Armenia 14.5 2 0.707 0.5
## 3 Australia 35.4 2329 49.0 1.02
## 4 Austria 30.8 3345 27.2 0.471
## 5 Bosnia and Herzegovina 12.5 2 0.707 0.5
## 6 Brazil 23.8 52 11.1 1.53
## 7 Bulgaria 14.6 141 9.51 0.801
## 8 Canada 35.7 257 19.7 1.23
## 9 Chile 20.8 4472 21.9 0.328
## 10 China 18 1 NA NA
## # … with 31 more rows
# Create a new column to determine the tier of wine based on
# points scale
joinedData <- joinedData %>% mutate(wineTier = ifelse(points >
95, "high", ifelse(points <= 95 & 50 <= points, "med", "low")))
# Find the highest priced variety of wines for the dataset
joinedData %>% group_by(variety) %>% summarize(meanPrice = mean(price,
na.rm = T)) %>% arrange(desc(meanPrice)) %>% slice(1:10)
## # A tibble: 10 x 2
## variety meanPrice
## <chr> <dbl>
## 1 Ramisco 495
## 2 Terrantez 236
## 3 Francisa 160
## 4 Cabernet-Shiraz 150
## 5 Rosenmuskateller 150
## 6 Malbec-Cabernet 113.
## 7 Tinta Negra Mole 112
## 8 Muscadelle 111.
## 9 Bual 100
## 10 Mazuelo 78.2
# your wrangling code Use sentiment analysis to properly
# format the description key words
wine_words <- joinedData %>% mutate(description = str_remove_all(description,
"&.*;,0123456789")) %>% unnest_tokens(word, description)
mystops <- stop_words %>% filter(lexicon == "snowball")
wine_words_clean <- anti_join(wine_words, mystops)
glimpse(wine_words_clean)
## Rows: 1,920,810
## Columns: 27
## $ X1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ country <chr> "Italy", "Italy", "Italy", "Italy", "…
## $ designation <chr> "Vulkà Bianco", "Vulkà Bianco", "Vulk…
## $ points <dbl> 87, 87, 87, 87, 87, 87, 87, 87, 87, 8…
## $ price <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ province <chr> "Sicily & Sardinia", "Sicily & Sardin…
## $ `First Name` <chr> "Kerin", "Kerin", "Kerin", "Kerin", "…
## $ `Last Name` <chr> "O’Keefe", "O’Keefe", "O’Keefe", "O’K…
## $ taster_twitter_handle <chr> "@kerinokeefe", "@kerinokeefe", "@ker…
## $ title <chr> "Nicosia 2013 Vulkà Bianco (Etna)", …
## $ variety <chr> "White Blend", "White Blend", "White …
## $ winery <chr> "Nicosia", "Nicosia", "Nicosia", "Nic…
## $ Region <chr> "Western Europe", "Western Europe", "…
## $ `Happiness Rank` <dbl> 50, 50, 50, 50, 50, 50, 50, 50, 50, 5…
## $ `Happiness Score` <dbl> 5.948, 5.948, 5.948, 5.948, 5.948, 5.…
## $ `Standard Error` <dbl> 0.03914, 0.03914, 0.03914, 0.03914, 0…
## $ `Economy (GDP per Capita)` <dbl> 1.25114, 1.25114, 1.25114, 1.25114, 1…
## $ Family <dbl> 1.19777, 1.19777, 1.19777, 1.19777, 1…
## $ `Health (Life Expectancy)` <dbl> 0.95446, 0.95446, 0.95446, 0.95446, 0…
## $ Freedom <dbl> 0.26236, 0.26236, 0.26236, 0.26236, 0…
## $ `Trust (Government Corruption)` <dbl> 0.02901, 0.02901, 0.02901, 0.02901, 0…
## $ Generosity <dbl> 0.22823, 0.22823, 0.22823, 0.22823, 0…
## $ `Dystopia Residual` <dbl> 2.02518, 2.02518, 2.02518, 2.02518, 2…
## $ region_1 <chr> "Etna", "Etna", "Etna", "Etna", "Etna…
## $ region_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ wineTier <chr> "med", "med", "med", "med", "med", "m…
## $ word <chr> "aromas", "include", "tropical", "fru…
Your discussion of wrangling section here. Feel encouraged to break up into more than once code chunk and discuss each in turn.
# your plot 1 Create a bar graph based on the countries with
# the number of highest wine ratings on a per country basis
joinedData %>% group_by(country) %>% filter(wineTier == "high") %>%
summarize(countryCount = n()) %>% ggplot(aes(country, countryCount,
fill = country)) + geom_bar(stat = "identity") + xlab("Country") +
ylab("Count") + ggtitle("Countries based on Highest Quality Wine Ratings") +
scale_y_continuous(breaks = seq(0, 300, 25)) + theme_bw()
# Create a bar graph based on the sentiment of sommelier
# reviews for wine descriptions
wine_words_clean %>% group_by(word) %>% summarize(n = n()) %>%
arrange(desc(n)) %>% slice(1:10) %>% ggplot(aes(word, n)) +
geom_bar(stat = "identity") + scale_colour_brewer(palette = "Accent") +
xlab("Most Used Sommelier Terms") + ylab("Usage in Descriptions") +
ggtitle("The Ten Most Used Terms by Sommeliers and their Frequencies") +
scale_y_continuous(breaks = seq(10000, 50000, 5000)) + theme_minimal()
# Find the average prices for different varieties of wine
joinedData %>% group_by(variety) %>% summarize(meanPrice = mean(price,
na.rm = T)) %>% arrange(desc(meanPrice)) %>% slice(1:8) %>%
ggplot(aes(x = variety, y = meanPrice, fill = variety)) +
geom_bar(stat = "summary", fun = mean) + geom_errorbar(stat = "summary",
fun.data = mean_se) + xlab("Wine Varieties") + ylab("Average Price") +
ggtitle("Average Wine Prices Based on Variety")
This plot is a bar graph representing the number of perfect score wines on a country by country basis. From this plot we can see that France represents a large portion of these perfect scores for wine with Italy coming in second and Portugal in third. The second bar graphs uses our sentiment analysis dataset to show us the most used words by sommeliers in their descriptions of wine. Obviously, wine is the number one used word with flavors, fruit,aromas, and palate trailing behind. This gives us an idea into how many of these wines are described by those across the world when looking at a wine.
# your plot 2 Most common words used by sommeliers joinedData
# %>% ggplot(aes(points,price,color=country))+geom_point()+
# xlab('Wine Score (points)') +ylab('Price of Wine')
# +ggtitle('Wine Rating versus Price on a Country Basis')
joinedData %>% group_by(`Happiness Rank`, country) %>% filter(`Happiness Rank` <
25) %>% select(country, `Happiness Rank`, points, price) %>%
ggplot(aes(points, price, color = country)) + xlab("Wine Ratings") +
ylab("Price of Wine") + ggtitle("Wine Ratings versus Price on a Country Basis") +
geom_point(alpha = 0.9) + theme_classic()
This plot allows us to see on a country by country basis if there’s any relation between the wine rating and its price. After completing a geom_smooth test, I noticed the points were not accurately producing linear relationships, so as a result we are unable to determine if this is such. From this plot, however, we can see that Australia has a great spread of wines in both the x and y axis along with Austria. Additionally, Austria as a major outlier with the most expensive wine.
# your plot 3
joinedData %>% ggplot(aes(points, `Happiness Score`)) + geom_density2d_filled(alpha = 0.6) +
xlab("Wine Ratings") + ylab("Happiness of Countries") + ggtitle("Density Visualisations of Wine Ratings versus Happiness Scores") +
theme_minimal()
# use width and height to create vertical and horizontal
# noise
Plot three shows a density distribution for Wine Ratings versus Happiness levels of different countries. From this plot, we can see that the most common set of values are occuring in the regions of yellow. For example a wine rating of 86 with a happiness level of 6.6 is seen many times throughout the dataset. We are hoping to see if there’s a certain trend which follows the higher the wine rating the greater the happiness and vice versa, and from this we can see a slight depiction of this trend from the higher density points. Further analysis is necessary to complete such.
Overall, the wine dataset proved to be a very valuable and extensive resource to compute descriptive statistics and further my knowledge in this subject matter. I find it very interesting the world of wine, and although we can’t confirm it has a direct impact on happiness in a country, it’s certainly a minor factor in creased GDP, econoomy, which plays pivotal roles in human happiness within a country.