class: center, middle, inverse, title-slide

.title[
# Visualizations I
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
]

---

### Visualization

* We'll see how to create beautiful visualizations using ggplot2.

```r
library(tidyverse)
library(palmerpenguins)
library(ggthemes) # color palettes for ggplot
penguins
```

```
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
```

---

### Goal: be able to create visualizations like this

<img src="penguins_visualization.png" width="75%" />

---

### Creating a ggplot

* Start with the function `ggplot()`
* Add **layers** to this plot
* Then define the **aesthetics** of the plot

```r
ggplot(data = penguins) # tells ggplot to get info from penguins tibble
```

---

### Creating a ggplot

* Start with the function `ggplot()`
* Add **layers** to this plot
* Then define the **aesthetics** of the plot
* No data displayed yet, but axes are clear

```r
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g))
```

---

### Creating a ggplot

.pull-left[

* Start with the function `ggplot()`
* Add **layers** to this plot
* Then define the **aesthetics** of the plot
* Data is displayed using a **geom**: a geometrical object used to represent data
* `geom_bar()`: bar chart; `geom_line()`: lines; `geom_boxplot()`: boxplot; `geom_point()`: scatterplot, etc.
]

.pull-right[

```r
ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm,
         y = body_mass_g)) +
  geom_point()
```

```
Warning: Removed 2 rows containing missing values (`geom_point()`).
```

]

---

### Adding aesthetics and layers

.pull-left[

* We can have aesthetics change as a function of categorical variables inside the tibble
* e.g. each penguin has a **species**; we can easily use a different color for each species
* When a categorical variable is mapped to an aesthetic, ggplot assigns a unique value of the aesthetic (here: a unique color) to each unique level of the variable (here: species), then adds a legend explaining this

]

.pull-right[

```r
ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm,
         y = body_mass_g,
         color = species)) +
  geom_point()
```

```
Warning: Removed 2 rows containing missing values (`geom_point()`).
```

]

---

### Adding aesthetics and layers

.pull-left[

* Let's now add a new layer, `geom_smooth(method = "lm")`, which visualizes the line of best fit based on a `l`inear `m`odel
* We now have lines, but we have lines for each species rather than one global line.
* When aesthetic mappings are defined in the initial `ggplot()` call, they apply at the *global* level: every subsequent layer inherits them
* So when we say `color = species`, all layers group the penguins by species
* When we want aesthetic mappings at the local level, we can use the `mapping` argument inside the specific geom we want them for

]

.pull-right[

```r
ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm,
         y = body_mass_g,
         color = species)) +
  geom_point() +
  geom_smooth(method = "lm")
```

```
`geom_smooth()` using formula = 'y ~ x'
```

]

---

### Adding aesthetics and layers

.pull-left[

* Let's now add a new layer, `geom_smooth(method = "lm")`, which visualizes the line of best fit based on a `l`inear `m`odel
* We now have lines, but we have lines for each species rather than one global line.
* When aesthetic mappings are defined in the initial `ggplot()` call, they apply at the *global* level: every subsequent layer inherits them
* So when we say `color = species`, all layers group the penguins by species
* When we want aesthetic mappings at the local level, we can use the `mapping` argument inside the specific geom we want them for

]

.pull-right[

```r
ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm,
         y = body_mass_g)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")
```

```
`geom_smooth()` using formula = 'y ~ x'
```

]

---

### Adding aesthetics and layers

.pull-left[

* One thing that remains: we want different shapes for different species
* We can specify this in a local aesthetic mapping of points using `shape=`
* The legend will be updated to show this too!

```r
ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm,
         y = body_mass_g)) +
  geom_point(mapping = aes(
    color = species,
    shape = species)) +
  geom_smooth(method = "lm")
```

]

.pull-right[

```
`geom_smooth()` using formula = 'y ~ x'
```

]

---

### Axis labels

.pull-left[

* Now we just need to add a title and axis labels

```r
ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm,
         y = body_mass_g)) +
  geom_point(mapping = aes(
    color = species,
    shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Species"
  ) +
  scale_color_colorblind()
```

]

.pull-right[

```
`geom_smooth()` using formula = 'y ~ x'
```

]

---

### ggplot2 calls

* The first two arguments of `ggplot()` are `data` and `mapping`, so we will often see the argument names omitted:

```r
ggplot(penguins, aes(
  x = flipper_length_mm,
  y = body_mass_g)) +
  geom_point()
```

* We can do this with piping as well:

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```

---

### Visualizing distributions

.pull-left[

* **Categorical** variables take only one of a finite set of values
* Bar charts are useful for visualizing categorical variables

```r
ggplot(penguins, aes(x = species)) +
  geom_bar()
```

]

.pull-right[

* **Numeric** values we are familiar with
* Histograms are useful for these; set the bin width with the argument `binwidth = `

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
```

]

---

### Visualizing distributions

* You will likely need to spend time tuning the `binwidth` parameter

.pull-left[

* Too large a `binwidth` lumps the data into only a few wide bars

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 2000)
```

]

.pull-right[

* Too small a `binwidth` produces many narrow, noisy bars

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 20)
```

]

---

### Density plots

* A smoothed-out version of a histogram, which is meant to approximate a probability density function (if you haven't heard of this term, don't worry)

.pull-left[

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_density()
```

]

.pull-right[

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
```

]

---

### Visualizing distributions

* Let's check the difference between setting `color = ` vs `fill = ` with `geom_bar`: `color` sets the outline of the bars, `fill` sets their interior

.pull-left[

```r
ggplot(penguins, aes(x = species)) +
  geom_bar(color = "red")
```

]

.pull-right[

```r
ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")
```

]

---

### Box plots

* Box plots allow for visualizing the spread of a distribution
* They make it easy to see the 25th percentile, median, 75th percentile, and outliers (points more than 1.5 × IQR below the 25th or above the 75th percentile)

<img src="boxplot.png" width="75%" />

---

### Box plots

.pull-left[

* Let's see the distribution of body mass by species using `geom_boxplot()`:

```r
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()
```

]

.pull-right[

* Compare to `geom_density()`:

```r
ggplot(penguins, aes(x = body_mass_g, color = species)) +
  geom_density(linewidth = 0.75)
```

]

(End of January 26 slides)

---

### Box plots

* We can map `species` to both the `color` and `fill` aesthetics and use `alpha` to add transparency
* `alpha` is a number between 0 and 1; 0 = completely transparent, 1 = fully opaque

.pull-left[

```r
ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
  geom_density(alpha = 0.3)
```

]

.pull-right[

```r
ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
  geom_density(alpha = 0.7)
```

]

---

### Multiple categorical variables

* Stacked bar plots can help visualize relationships between 2 categorical variables

.pull-left[

* Frequencies of each species on each island:

```r
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()
```

]

.pull-right[

* It isn't easy to tell the relative frequency of each species within an island from raw counts
* `position = "fill"` in the geom scales every bar to the same height, which allows for comparing proportions across islands

```r
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")
```

]

---

### Multiple numerical variables

* We already saw how to use scatter plots to visualize two numeric variables

.pull-left[

* Body mass against flipper length:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```

]

.pull-right[

* We saw how using the color aesthetic in the geom can help incorporate group information for two numeric variables
* We can map different variables to color and shape

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island))
```

]

---

### Multiple numerical variables

* With too many aesthetic mappings (shape, color, size, etc.), plots become cluttered and difficult to read
* It is useful to split the plot into **facets**, using `facet_wrap()`
* `facet_wrap()` takes a *formula* argument - we will see more later, but an example:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island)
```

---

### Saving plots

* Once you've made a plot, you can save it using `ggsave()`
* You can either save whatever plot you made last:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
ggsave(filename = "penguin-plot.png") # saves the most recently displayed plot
```

* Or you can store the plot object and save that

```r
p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
ggsave(filename = "penguin-plot.png", plot = p)
```

---

### Aesthetic mappings

* Let's consider the `mpg` data frame, which is bundled with ggplot2.

```r
mpg
```

```
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows
```

* `displ`: numerical, car's engine size, in liters
* `hwy`: numerical, car's highway fuel efficiency, in mpg
* `class`: string / categorical, kind of car

---

### Aesthetic mappings

* Let's look at the relationship between `displ` and `hwy` for different classes of cars
* We'll use a scatterplot with numerical values mapped to `x` and `y`, and the categorical variable mapped to aesthetics like `shape` and `color`:

.pull-left[

```r
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
```

]

.pull-right[

```r
ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point()
```

]
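
---

### Aesthetic mappings

* A note on the `shape = class` plot: by default ggplot2 supplies only six shapes, while `class` has seven levels, so the last class (`suv`) is left unplotted and a warning is printed
* A minimal sketch of one possible workaround, which is not part of the original slides: supply the shapes yourself with `scale_shape_manual()`. The name `class_shapes` and the specific shape codes below are arbitrary choices for illustration.

```r
library(ggplot2)

# One shape code per level of `class` (seven levels in total).
# These particular codes are an illustrative assumption, not a recommendation.
class_shapes <- c(16, 17, 15, 3, 7, 8, 4)

p <- ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point() +
  scale_shape_manual(values = class_shapes) # assigned to levels in order

p
```

* Seven shapes are hard to tell apart, so mapping `class` to `color` (as in the left-hand plot) is often the more readable choice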