Data Wrangling, Exploration, Visualization

Vincent Lebovitz vjl369

Introduction

For the first portion of the project, I will be analyzing two large datasets. The first being a very in depth and detailed description of sommelier wines from different geographic locations, and their respective ratings with other informative metrics (price,point ranking, sommelier taster, winery, and origin). The second dataset includes happiness scores from every country around the world based on certain metrics such as happiness score, economy, health, and many more. As someone who is of legal age and enjoys to occasionally drink wine (also watched an amazing documentary called sommelier) I would like to see if higher wine scores in different regions pertains to happiness.

library(tidyverse)
# read your datasets in here, e.g., with read_csv()
wine <- read_csv("wine.csv")
happiness <- read_csv("happiness.csv")
glimpse(wine)
## Rows: 129,971
## Columns: 14
## $ X1                    <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…
## $ country               <chr> "Italy", "Portugal", "US", "US", "US", "Spain",…
## $ description           <chr> "Aromas include tropical fruit, broom, brimston…
## $ designation           <chr> "Vulkà Bianco", "Avidagos", NA, "Reserve Late H…
## $ points                <dbl> 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87,…
## $ price                 <dbl> NA, 15, 14, 13, 65, 15, 16, 24, 12, 27, 19, 30,…
## $ province              <chr> "Sicily & Sardinia", "Douro", "Oregon", "Michig…
## $ region_1              <chr> "Etna", NA, "Willamette Valley", "Lake Michigan…
## $ region_2              <chr> NA, NA, "Willamette Valley", NA, "Willamette Va…
## $ taster_name           <chr> "Kerin O’Keefe", "Roger Voss", "Paul Gregutt", …
## $ taster_twitter_handle <chr> "@kerinokeefe", "@vossroger", "@paulgwine ", NA…
## $ title                 <chr> "Nicosia 2013 Vulkà Bianco  (Etna)", "Quinta do…
## $ variety               <chr> "White Blend", "Portuguese Red", "Pinot Gris", …
## $ winery                <chr> "Nicosia", "Quinta dos Avidagos", "Rainstorm", …
glimpse(happiness)
## Rows: 158
## Columns: 12
## $ Country                         <chr> "Switzerland", "Iceland", "Denmark", …
## $ Region                          <chr> "Western Europe", "Western Europe", "…
## $ `Happiness Rank`                <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12…
## $ `Happiness Score`               <dbl> 7.587, 7.561, 7.527, 7.522, 7.427, 7.…
## $ `Standard Error`                <dbl> 0.03411, 0.04884, 0.03328, 0.03880, 0…
## $ `Economy (GDP per Capita)`      <dbl> 1.39651, 1.30232, 1.32548, 1.45900, 1…
## $ Family                          <dbl> 1.34951, 1.40223, 1.36058, 1.33095, 1…
## $ `Health (Life Expectancy)`      <dbl> 0.94143, 0.94784, 0.87464, 0.88521, 0…
## $ Freedom                         <dbl> 0.66557, 0.62877, 0.64938, 0.66973, 0…
## $ `Trust (Government Corruption)` <dbl> 0.41978, 0.14145, 0.48357, 0.36503, 0…
## $ Generosity                      <dbl> 0.29678, 0.43630, 0.34139, 0.34699, 0…
## $ `Dystopia Residual`             <dbl> 2.51738, 2.70201, 2.49204, 2.46531, 2…

Tidying: Reshaping

If your datasets are tidy already, demonstrate that you can reshape data with pivot wider/longer here (e.g., untidy and then retidy). Alternatively, it may be easier to wait until the wrangling section so you can reshape your summary statistics. Note here if you are going to do this.

# your tidying code (if applicable; can also wait until
# wrangling section) due to my data being rather tidy I will
# perform a methodology of untidying and tidying

# I will make the region untidy by creating one column and
# separating regions by a column Thus, I will wait until the
# wrangling section to reshape my data
wine <- wine %>% pivot_longer(8:9, names_to = "region", names_prefix = "region_", 
    values_to = "region type")
# Separate the first and last names of sommeliers for easier
# readability
wine <- wine %>% separate(col = taster_name, into = c("First Name", 
    "Last Name"), sep = " ")

Joining/Merging

# your joining code display the dimensions for the wine and
# happiness dataset
wine %>% dim()
## [1] 259942     15
happiness %>% dim()
## [1] 158  12
# join the two datasets with inner_join
joinedData <- wine %>% inner_join(happiness, by = c(country = "Country"))
joinedData %>% dim()
## [1] 150660     26
joinedData %>% summarize_all(n_distinct)
## # A tibble: 1 x 26
##      X1 country description designation points price province `First Name`
##   <int>   <int>       <int>       <int>  <int> <int>    <int>        <int>
## 1 75330      41       69385       24194     21   388      397           15
## # … with 18 more variables: `Last Name` <int>, taster_twitter_handle <int>,
## #   title <int>, variety <int>, winery <int>, region <int>, `region
## #   type` <int>, Region <int>, `Happiness Rank` <int>, `Happiness Score` <int>,
## #   `Standard Error` <int>, `Economy (GDP per Capita)` <int>, Family <int>,
## #   `Health (Life Expectancy)` <int>, Freedom <int>, `Trust (Government
## #   Corruption)` <int>, Generosity <int>, `Dystopia Residual` <int>
# your joining code find the rows in wine which didn't have a
# match in happiness by country
wine %>% anti_join(happiness, by = c(country = "Country"))
## # A tibble: 109,282 x 15
##       X1 country description designation points price province `First Name`
##    <dbl> <chr>   <chr>       <chr>        <dbl> <dbl> <chr>    <chr>       
##  1     2 US      Tart and s… <NA>            87    14 Oregon   Paul        
##  2     2 US      Tart and s… <NA>            87    14 Oregon   Paul        
##  3     3 US      Pineapple … Reserve La…     87    13 Michigan Alexander   
##  4     3 US      Pineapple … Reserve La…     87    13 Michigan Alexander   
##  5     4 US      Much like … Vintner's …     87    65 Oregon   Paul        
##  6     4 US      Much like … Vintner's …     87    65 Oregon   Paul        
##  7    10 US      Soft, supp… Mountain C…     87    19 Califor… Virginie    
##  8    10 US      Soft, supp… Mountain C…     87    19 Califor… Virginie    
##  9    12 US      Slightly r… <NA>            87    34 Califor… Virginie    
## 10    12 US      Slightly r… <NA>            87    34 Califor… Virginie    
## # … with 109,272 more rows, and 7 more variables: `Last Name` <chr>,
## #   taster_twitter_handle <chr>, title <chr>, variety <chr>, winery <chr>,
## #   region <chr>, `region type` <chr>

Simply put, I am inner joining the two datasets on a country basis which will remove any values which don’t have a match in countries from either dataset. This will allow me to accurately compare the countries which were represented in both datasets, and not have any countries describe their associated happiness metrics. After joining, one can see that that the dimensions for the dataset are 150660 rows by 26 columns. The dimensions for wine and happiness are seen above. Wine has apprxoimately 109,282 rows which didn’t appear in the happiness dataset

Wrangling

library(tidytext)
# Use pivot wider to tidy data back into region1 and region2
# for easier readibility, and more in depth analysis
joinedData <- joinedData %>% pivot_wider(names_from = region, 
    values_from = `region type`, names_prefix = "region_")

# your wrangling code Aggregate the countries with perfect
# wine rating scores from the joined dataset
joinedData %>% group_by(title, country) %>% filter(points == 
    100) %>% select(title, points, country) %>% arrange(desc(points))
## # A tibble: 15 x 3
## # Groups:   title, country [15]
##    title                                                         points country 
##    <chr>                                                          <dbl> <chr>   
##  1 Chambers Rosewood Vineyards NV Rare Muscat (Rutherglen)          100 Austral…
##  2 Avignonesi 1995 Occhio di Pernice  (Vin Santo di Montepulcia…    100 Italy   
##  3 Krug 2002 Brut  (Champagne)                                      100 France  
##  4 Tenuta dell'Ornellaia 2007 Masseto Merlot (Toscana)              100 Italy   
##  5 Casa Ferreirinha 2008 Barca-Velha Red (Douro)                    100 Portugal
##  6 Biondi Santi 2010 Riserva  (Brunello di Montalcino)              100 Italy   
##  7 Château Léoville Barton 2010  Saint-Julien                       100 France  
##  8 Louis Roederer 2008 Cristal Vintage Brut  (Champagne)            100 France  
##  9 Salon 2006 Le Mesnil Blanc de Blancs Brut Chardonnay (Champa…    100 France  
## 10 Château Lafite Rothschild 2010  Pauillac                         100 France  
## 11 Casanova di Neri 2007 Cerretalto  (Brunello di Montalcino)       100 Italy   
## 12 Château Cheval Blanc 2010  Saint-Émilion                         100 France  
## 13 Château Léoville Las Cases 2010  Saint-Julien                    100 France  
## 14 Quinta do Noval 2011 Nacional Vintage  (Port)                    100 Portugal
## 15 Château Haut-Brion 2014  Pessac-Léognan                          100 France
# Aggregate the number of wine varieties
joinedData %>% group_by(variety) %>% summarize(n())
## # A tibble: 638 x 2
##    variety     `n()`
##    <chr>       <int>
##  1 Abouriou        2
##  2 Agiorgitiko    63
##  3 Aglianico     344
##  4 Aidani          1
##  5 Airen           3
##  6 Albana         23
##  7 Albanello       1
##  8 Albariño      346
##  9 Albarossa       2
## 10 Aleatico        5
## # … with 628 more rows
# Find the maximum price of wine in descending order
joinedData %>% group_by(title) %>% select(title, price) %>% arrange(desc(price))
## # A tibble: 75,330 x 2
## # Groups:   title [68,489]
##    title                                          price
##    <chr>                                          <dbl>
##  1 Château les Ormes Sorbet 2013  Médoc            3300
##  2 Château Pétrus 2014  Pomerol                    2500
##  3 Domaine du Comte Liger-Belair 2010  La Romanée  2500
##  4 Château Pétrus 2011  Pomerol                    2000
##  5 Domaine du Comte Liger-Belair 2005  La Romanée  2000
##  6 Château Margaux 2009  Margaux                   1900
##  7 Château Lafite Rothschild 2010  Pauillac        1500
##  8 Château Cheval Blanc 2010  Saint-Émilion        1500
##  9 Château Mouton Rothschild 2009  Pauillac        1300
## 10 Château Haut-Brion 2009  Pessac-Léognan         1200
## # … with 75,320 more rows
# your wrangling code find the average value for points of
# wine
joinedData %>% summarize(meanPoints = mean(points))
## # A tibble: 1 x 1
##   meanPoints
##        <dbl>
## 1       88.4
# calculate the standard deviation and standard error for the
# price of wine on a country basis
joinedData %>% group_by(country) %>% summarize(mean = mean(price, 
    na.rm = T), n = n(), sd = sd(price, na.rm = T), se = sd/sqrt(n))
## # A tibble: 41 x 5
##    country                 mean     n     sd     se
##    <chr>                  <dbl> <int>  <dbl>  <dbl>
##  1 Argentina               24.5  3800 23.4    0.380
##  2 Armenia                 14.5     2  0.707  0.5  
##  3 Australia               35.4  2329 49.0    1.02 
##  4 Austria                 30.8  3345 27.2    0.471
##  5 Bosnia and Herzegovina  12.5     2  0.707  0.5  
##  6 Brazil                  23.8    52 11.1    1.53 
##  7 Bulgaria                14.6   141  9.51   0.801
##  8 Canada                  35.7   257 19.7    1.23 
##  9 Chile                   20.8  4472 21.9    0.328
## 10 China                   18       1 NA     NA    
## # … with 31 more rows
# Create a new column to determine the tier of wine based on
# points scale
joinedData <- joinedData %>% mutate(wineTier = ifelse(points > 
    95, "high", ifelse(points <= 95 & 50 <= points, "med", "low")))
# Find the highest priced variety of wines for the dataset
joinedData %>% group_by(variety) %>% summarize(meanPrice = mean(price, 
    na.rm = T)) %>% arrange(desc(meanPrice)) %>% slice(1:10)
## # A tibble: 10 x 2
##    variety          meanPrice
##    <chr>                <dbl>
##  1 Ramisco              495  
##  2 Terrantez            236  
##  3 Francisa             160  
##  4 Cabernet-Shiraz      150  
##  5 Rosenmuskateller     150  
##  6 Malbec-Cabernet      113. 
##  7 Tinta Negra Mole     112  
##  8 Muscadelle           111. 
##  9 Bual                 100  
## 10 Mazuelo               78.2
# your wrangling code Use sentiment analysis to properly
# format the description key words
wine_words <- joinedData %>% mutate(description = str_remove_all(description, 
    "&.*;,0123456789")) %>% unnest_tokens(word, description)
mystops <- stop_words %>% filter(lexicon == "snowball")
wine_words_clean <- anti_join(wine_words, mystops)
glimpse(wine_words_clean)
## Rows: 1,920,810
## Columns: 27
## $ X1                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ country                         <chr> "Italy", "Italy", "Italy", "Italy", "…
## $ designation                     <chr> "Vulkà Bianco", "Vulkà Bianco", "Vulk…
## $ points                          <dbl> 87, 87, 87, 87, 87, 87, 87, 87, 87, 8…
## $ price                           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ province                        <chr> "Sicily & Sardinia", "Sicily & Sardin…
## $ `First Name`                    <chr> "Kerin", "Kerin", "Kerin", "Kerin", "…
## $ `Last Name`                     <chr> "O’Keefe", "O’Keefe", "O’Keefe", "O’K…
## $ taster_twitter_handle           <chr> "@kerinokeefe", "@kerinokeefe", "@ker…
## $ title                           <chr> "Nicosia 2013 Vulkà Bianco  (Etna)", …
## $ variety                         <chr> "White Blend", "White Blend", "White …
## $ winery                          <chr> "Nicosia", "Nicosia", "Nicosia", "Nic…
## $ Region                          <chr> "Western Europe", "Western Europe", "…
## $ `Happiness Rank`                <dbl> 50, 50, 50, 50, 50, 50, 50, 50, 50, 5…
## $ `Happiness Score`               <dbl> 5.948, 5.948, 5.948, 5.948, 5.948, 5.…
## $ `Standard Error`                <dbl> 0.03914, 0.03914, 0.03914, 0.03914, 0…
## $ `Economy (GDP per Capita)`      <dbl> 1.25114, 1.25114, 1.25114, 1.25114, 1…
## $ Family                          <dbl> 1.19777, 1.19777, 1.19777, 1.19777, 1…
## $ `Health (Life Expectancy)`      <dbl> 0.95446, 0.95446, 0.95446, 0.95446, 0…
## $ Freedom                         <dbl> 0.26236, 0.26236, 0.26236, 0.26236, 0…
## $ `Trust (Government Corruption)` <dbl> 0.02901, 0.02901, 0.02901, 0.02901, 0…
## $ Generosity                      <dbl> 0.22823, 0.22823, 0.22823, 0.22823, 0…
## $ `Dystopia Residual`             <dbl> 2.02518, 2.02518, 2.02518, 2.02518, 2…
## $ region_1                        <chr> "Etna", "Etna", "Etna", "Etna", "Etna…
## $ region_2                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ wineTier                        <chr> "med", "med", "med", "med", "med", "m…
## $ word                            <chr> "aromas", "include", "tropical", "fru…

Your discussion of wrangling section here. Feel encouraged to break up into more than once code chunk and discuss each in turn.

Visualizing

# your plot 1 Create a bar graph based on the countries with
# the number of highest wine ratings on a per country basis
joinedData %>% group_by(country) %>% filter(wineTier == "high") %>% 
    summarize(countryCount = n()) %>% ggplot(aes(country, countryCount, 
    fill = country)) + geom_bar(stat = "identity") + xlab("Country") + 
    ylab("Count") + ggtitle("Countries based on Highest Quality Wine Ratings") + 
    scale_y_continuous(breaks = seq(0, 300, 25)) + theme_bw()

# Create a bar graph based on the sentiment of sommelier
# reviews for wine descriptions
wine_words_clean %>% group_by(word) %>% summarize(n = n()) %>% 
    arrange(desc(n)) %>% slice(1:10) %>% ggplot(aes(word, n)) + 
    geom_bar(stat = "identity") + scale_colour_brewer(palette = "Accent") + 
    xlab("Most Used Sommelier Terms") + ylab("Usage in Descriptions") + 
    ggtitle("The Ten Most Used Terms by Sommeliers and their Frequencies") + 
    scale_y_continuous(breaks = seq(10000, 50000, 5000)) + theme_minimal()

# Find the average prices for different varieties of wine
joinedData %>% group_by(variety) %>% summarize(meanPrice = mean(price, 
    na.rm = T)) %>% arrange(desc(meanPrice)) %>% slice(1:8) %>% 
    ggplot(aes(x = variety, y = meanPrice, fill = variety)) + 
    geom_bar(stat = "summary", fun = mean) + geom_errorbar(stat = "summary", 
    fun.data = mean_se) + xlab("Wine Varieties") + ylab("Average Price") + 
    ggtitle("Average Wine Prices Based on Variety")

This plot is a bar graph representing the number of perfect score wines on a country by country basis. From this plot we can see that France represents a large portion of these perfect scores for wine with Italy coming in second and Portugal in third. The second bar graphs uses our sentiment analysis dataset to show us the most used words by sommeliers in their descriptions of wine. Obviously, wine is the number one used word with flavors, fruit,aromas, and palate trailing behind. This gives us an idea into how many of these wines are described by those across the world when looking at a wine.

# your plot 2 Most common words used by sommeliers joinedData
# %>% ggplot(aes(points,price,color=country))+geom_point()+
# xlab('Wine Score (points)') +ylab('Price of Wine')
# +ggtitle('Wine Rating versus Price on a Country Basis')

joinedData %>% group_by(`Happiness Rank`, country) %>% filter(`Happiness Rank` < 
    25) %>% select(country, `Happiness Rank`, points, price) %>% 
    ggplot(aes(points, price, color = country)) + xlab("Wine Ratings") + 
    ylab("Price of Wine") + ggtitle("Wine Ratings versus Price on a Country Basis") + 
    geom_point(alpha = 0.9) + theme_classic()

This plot allows us to see on a country by country basis if there’s any relation between the wine rating and its price. After completing a geom_smooth test, I noticed the points were not accurately producing linear relationships, so as a result we are unable to determine if this is such. From this plot, however, we can see that Australia has a great spread of wines in both the x and y axis along with Austria. Additionally, Austria as a major outlier with the most expensive wine.

# your plot 3
joinedData %>% ggplot(aes(points, `Happiness Score`)) + geom_density2d_filled(alpha = 0.6) + 
    xlab("Wine Ratings") + ylab("Happiness of Countries") + ggtitle("Density Visualisations of Wine Ratings versus Happiness Scores") + 
    theme_minimal()

# use width and height to create vertical and horizontal
# noise

Plot three shows a density distribution for Wine Ratings versus Happiness levels of different countries. From this plot, we can see that the most common set of values are occuring in the regions of yellow. For example a wine rating of 86 with a happiness level of 6.6 is seen many times throughout the dataset. We are hoping to see if there’s a certain trend which follows the higher the wine rating the greater the happiness and vice versa, and from this we can see a slight depiction of this trend from the higher density points. Further analysis is necessary to complete such.

Concluding Remarks

Overall, the wine dataset proved to be a very valuable and extensive resource to compute descriptive statistics and further my knowledge in this subject matter. I find it very interesting the world of wine, and although we can’t confirm it has a direct impact on happiness in a country, it’s certainly a minor factor in creased GDP, econoomy, which plays pivotal roles in human happiness within a country.