class: center, middle, inverse, title-slide .title[ # Visualizations I ] .subtitle[ ##
STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
]

---
### Visualization

* We'll see how to create beautiful visualizations using ggplot2.

```r
library(tidyverse)
library(palmerpenguins)
library(ggthemes) # color palettes for ggplot
penguins
```

```
# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>
```

---
### Goal: be able to create visualizations like this

<img src="penguins_visualization.png" width="75%" />

---
### Creating a ggplot

* Start with function `ggplot()`
* Add **layers** to this plot
* Then need to define the **aesthetics** of the plot

```r
ggplot(data = penguins) # tells ggplot to get info from penguins tibble
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

---
### Creating a ggplot

* Start with function `ggplot()`
* Add **layers** to this plot
* Then need to define the **aesthetics** of the plot
* No data displayed yet, but axes are clear

```r
ggplot(data = penguins,
       mapping = aes(x = flipper_length_mm, y = body_mass_g))
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---
### Creating a ggplot

.pull-left[
* Start with function `ggplot()`
* Add **layers** to this plot
* Then need to define the **aesthetics** of the plot
* Data displayed using **geom**: geometrical object used to represent data
* `geom_bar()`: bar chart; `geom_line()`: lines; `geom_boxplot()`: boxplot; `geom_point()`: scatterplot, etc.
]
.pull-right[
```r
ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm,
         y = body_mass_g)) +
  geom_point()
```

```
Warning: Removed 2 rows containing missing values (`geom_point()`).
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-5-1.png)<!-- -->
]

---
### Adding aesthetics and layers

.pull-left[
* We can have aesthetics change as a function of categorical variables inside the tibble
* e.g. each penguin has a **species**; we can easily use a different color for each species
* When a categorical variable is mapped to an aesthetic, ggplot assigns a unique value of the aesthetic (here: a unique color) to each unique level of the variable (here: species), then adds a legend explaining this
]
.pull-right[
```r
ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm,
         y = body_mass_g,
         color = species)) +
  geom_point()
```

```
Warning: Removed 2 rows containing missing values (`geom_point()`).
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
]

---
### Adding aesthetics and layers

.pull-left[
* Let's now add a new layer, `geom_smooth(method = "lm")`, which visualizes the line of best fit based on a `l`inear `m`odel
* We now have lines, but we have lines for each species rather than one global line.
* When aesthetic mappings are defined in `ggplot()`, they apply at the *global* level: every layer added afterwards inherits them
* So when we say `color = species`, all layers group the penguins by species, and `geom_smooth()` fits one line per species
* When we want aesthetic mappings at the local level, we can use the `mapping` argument inside the specific geom we want them for
]
.pull-right[
```r
ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm,
         y = body_mass_g,
         color = species)) +
  geom_point() +
  geom_smooth(method = "lm")
```

```
`geom_smooth()` using formula = 'y ~ x'
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]

---
### Adding aesthetics and layers

.pull-left[
* Let's now add a new layer, `geom_smooth(method = "lm")`, which visualizes the line of best fit based on a `l`inear `m`odel
* We now have lines, but we have lines for each species rather than one global line.
* When aesthetic mappings are defined in `ggplot()`, they apply at the *global* level: every layer added afterwards inherits them
* So when we say `color = species`, all layers group the penguins by species, and `geom_smooth()` fits one line per species
* When we want aesthetic mappings at the local level, we can use the `mapping` argument inside the specific geom we want them for
]
.pull-right[
```r
ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm,
         y = body_mass_g)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")
```

```
`geom_smooth()` using formula = 'y ~ x'
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-8-1.png)<!-- -->
]

---
### Adding aesthetics and layers

.pull-left[
* One thing that remains: we want different shapes for different species
* We can specify this in a local aesthetic mapping of points using `shape=`
* The legend will be updated to show this too!

```r
ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm,
         y = body_mass_g)) +
  geom_point(mapping = aes(
    color = species,
    shape = species)) +
  geom_smooth(method = "lm")
```
]
.pull-right[
```
`geom_smooth()` using formula = 'y ~ x'
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-10-1.png)<!-- -->
]

---
### Axis labels

.pull-left[
* Now just need to add title and axis labels

```r
ggplot(data = penguins,
       mapping = aes(
         x = flipper_length_mm,
         y = body_mass_g)) +
  geom_point(mapping = aes(
    color = species,
    shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    color = "Species",
    shape = "Species"
  ) +
  scale_color_colorblind()
```
]
.pull-right[
```
`geom_smooth()` using formula = 'y ~ x'
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-12-1.png)<!-- -->
]

---
### ggplot2 calls

* The first two arguments of ggplot are always `data = ` and `mapping = `, so we will often see things like

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```

* We can do this with piping as well:

```r
penguins %>%
  ggplot(aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```

---
### Visualizing distributions

.pull-left[
* **Categorical** variables take only one of a finite set of values
* Bar charts are useful for visualizing categorical variables

```r
ggplot(penguins, aes(x = species)) +
  geom_bar()
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-15-1.png)<!-- -->
]
.pull-right[
* **Numeric** values we are familiar with
* Histograms are useful for these - use argument `binwidth = `

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-16-1.png)<!-- -->
]

---
### Visualizing distributions

* You will likely need to spend time tuning the binwidth parameter

.pull-left[
* 
`binwidth = 2000` is too wide: all observations fall into just a few bars, hiding the shape of the distribution

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 2000)
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-17-1.png)<!-- -->
]
.pull-right[
* `binwidth = 20` is too narrow: there are so many bars that the overall shape is hard to see

```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 20)
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-18-1.png)<!-- -->
]

---
### Density plots

* A smoothed-out version of a histogram that approximates a probability density function (if you haven't heard of this term, don't worry)

.pull-left[
```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_density()
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-19-1.png)<!-- -->
]
.pull-right[
```r
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-20-1.png)<!-- -->
]

---
### Visualizing distributions

* Let's check the difference between setting `color = ` vs `fill = ` with `geom_bar`:

.pull-left[
```r
ggplot(penguins, aes(x = species)) +
  geom_bar(color = "red")
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-21-1.png)<!-- -->
]
.pull-right[
```r
ggplot(penguins, aes(x = species)) +
  geom_bar(fill = "red")
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-22-1.png)<!-- -->
]

---
### Box plots

* Box plots allow for visualizing the spread of a distribution
* Makes it easy to see the 25th percentile, median, 75th percentile, and outliers (more than 1.5 × IQR from the 25th or 75th percentile)

<img src="boxplot.png" width="75%" />

---
### Box plots

.pull-left[
* Let's see the distribution of body mass by species using `geom_boxplot()`:

```r
ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-24-1.png)<!-- -->
]
.pull-right[
* Compare to `geom_density()`:

```r
ggplot(penguins, aes(x = body_mass_g, color = species)) +
  geom_density(linewidth = 0.75)
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-25-1.png)<!-- -->
]

(End of January 26 slides)

---
### Box plots

* We can map `species` to both `color` and `fill` aesthetics and use `alpha` to add transparency
* `alpha` is a number between 0 and 1; 0 = completely transparent, 1 = fully opaque

.pull-left[
```r
ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
  geom_density(alpha = 0.3)
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-26-1.png)<!-- -->
]
.pull-right[
```r
ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
  geom_density(alpha = 0.7)
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-27-1.png)<!-- -->
]

---
### Multiple categorical variables

* Stacked bar plots can help visualize relationships between 2 categorical variables

.pull-left[
* Frequencies of each species on each island:

```r
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-28-1.png)<!-- -->
]
.pull-right[
* With raw counts, it isn't easy to compare the relative frequency of each species across islands
* `position = "fill"` in the geom scales every bar to the same height, which makes the species proportions comparable across islands

```r
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill")
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-29-1.png)<!-- -->
]

---
### Multiple numerical variables

* Already saw how to use scatter plots to visualize two numeric variables

.pull-left[
* Body mass against flipper length:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-30-1.png)<!-- -->
]
.pull-right[
* We saw how using the color aesthetic in the geom can help incorporate group information for two numeric variables
* We can use separate variables for color and
shape

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = island))
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-31-1.png)<!-- -->
]

---
### Multiple numerical variables

* With too many aesthetic mappings (shape, color, size, etc.), plots become cluttered and difficult to read
* **Facets** help here: `facet_wrap()` splits the plot into subplots, one for each subset of the data
* `facet_wrap()` takes a *formula* argument - we will see more later, but an example:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species, shape = species)) +
  facet_wrap(~island)
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-32-1.png)<!-- -->

---
### Saving plots

* Once you've made a plot, you can save it using `ggsave()`
* You can either save whatever plot you made last:

```r
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
ggsave(filename = "penguin-plot.png")
```

* Or you can store the plot in an object and save that:

```r
p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()
ggsave(filename = "penguin-plot.png", plot = p)
```

---
### Aesthetic mappings

* Let's consider the `mpg` dataframe - bundled with ggplot2.

```r
mpg
```

```
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows
```

* `displ`: numerical, car's engine size, in liters
* `hwy`: numerical, car's highway fuel efficiency, in miles per gallon (mpg)
* `class`: string / categorical, kind of car

---
### Aesthetic mappings

* Let's look at the relationship between `displ` and `hwy` for different classes of cars
* We'll use a scatterplot with numerical values mapped to `x` and `y`, categorical to aesthetics like `shape` and `color`:

.pull-left[
```r
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-36-1.png)<!-- -->
]
.pull-right[
```r
ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point()
```

![](lec09-visualization-1_files/figure-html/unnamed-chunk-37-1.png)<!-- -->
]
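
---
### Aesthetic mappings

* A minimal sketch (not part of the original slides) of one pitfall when mapping `class` to `shape`: by default, ggplot2's shape palette handles at most 6 discrete values, while `class` has 7 levels, so ggplot2 warns and leaves the points of one class undrawn
* The names `n_classes` and `p_shape` are made up for illustration, and `scale_shape_manual()` is only one possible workaround

```r
library(ggplot2)

# Count the distinct car classes in mpg (hypothetical helper variable)
n_classes <- length(unique(mpg$class))
n_classes

# Mapping class to shape: drawing this plot triggers a warning,
# because more than 6 discrete shapes are requested
p_shape <- ggplot(mpg, aes(x = displ, y = hwy, shape = class)) +
  geom_point()
p_shape

# Possible workaround (an assumption, not the slides' method):
# supply one shape code per class by hand
p_shape + scale_shape_manual(values = 0:6)
```

* When a variable has many levels, `color` or facets are usually easier to read than `shape`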