class: center, middle, inverse, title-slide .title[ # Visualizations II ] .subtitle[ ##
STA35B: Statistical Data Science 2 ] .author[ ### Spencer Frei ] --- ### Aesthetic mappings * Let's consider the `mpg` dataframe - bundled with ggplot2. ```r mpg #> # A tibble: 234 × 11 #> manufacturer model displ year cyl trans drv cty hwy fl #> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> #> 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p #> 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p #> 3 audi a4 2 2008 4 manual(m6) f 20 31 p #> 4 audi a4 2 2008 4 auto(av) f 21 30 p #> 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p #> 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p #> # ℹ 228 more rows #> # ℹ 1 more variable: class <chr> ``` * `displ`: numerical, car's engine size, in liters * `hwy`: numerical, car's fuel efficiency in mpg * `class`: string / categorical, kind of car --- ### Aesthetic mappings * Let's look at the relationship between `displ` and `hwy` for different classes of ars * We'll use scatterplot with numerical valu mapped to `x` and `y`, categorical to `shape` and `color`: .pull-left[ ```r ggplot(mpg, aes(x = displ, y = hwy, color = class)) + geom_point() ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-2-1.png" width="504" /> ] .pull-right[ ```r ggplot(mpg, aes(x = displ, y = hwy, shape = class)) + geom_point() #> Warning: The shape palette can deal with a maximum of 6 discrete values because more #> than 6 becomes difficult to discriminate #> ℹ you have requested 7 values. Consider specifying shapes manually if you #> need that many have them. #> Warning: Removed 62 rows containing missing values (`geom_point()`). ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-3-1.png" width="504" /> ] --- ### Aesthetic mappings * Can also use `size` and `alpha` for categorical mappings like class. * However, not recommended, since `size` and `alpha` are **ordered** aesthetics, and no natural ordering to these categorical variables ```r # Left ggplot(mpg, aes(x = displ, y = hwy, size = class)) + geom_point() #> Warning: Using size for a discrete variable is not advised. # Right ggplot(mpg, aes(x = displ, y = hwy, alpha = class)) + geom_point() #> Warning: Using alpha for a discrete variable is not advised. ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-4-1.png" alt="Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the size aesthetic, resulting in different sizes for each class. In the plot on the right class is mapped the alpha aesthetic, resulting in different alpha (transparency) levels for each class. Each plot comes with a legend that shows the mapping between size or alpha level and levels of the class variable." width="288" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-4-2.png" alt="Two scatterplots next to each other, both visualizing highway fuel efficiency versus engine size of cars and showing a negative association. In the plot on the left class is mapped to the size aesthetic, resulting in different sizes for each class. In the plot on the right class is mapped the alpha aesthetic, resulting in different alpha (transparency) levels for each class. Each plot comes with a legend that shows the mapping between size or alpha level and levels of the class variable." width="288" /> --- ### Aesthetics .pull-left[ * You can also set visual properties of geoms by manually supplying argments of the geom function *outide * of `aes()`, e.g. making all points in a plot blue: ```r ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(color = "blue") ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-5-1.png" width="432" /> ] .pull-left[ - The name of a color as a character string, e.g., `color = "blue"` - The size of a point in mm, e.g., `size = 1` - The shape of a point as a number, e.g. <div class="figure" style="text-align: center"> <img src="lec10-visualization-2_files/figure-html/fig-shapes-1.png" alt="Mapping between shapes and the numbers that represent them: 0 - square, 1 - circle, 2 - triangle point up, 3 - plus, 4 - cross, 5 - diamond, 6 - triangle point down, 7 - square cross, 8 - star, 9 - diamond plus, 10 - circle plus, 11 - triangles up and down, 12 - square plus, 13 - circle cross, 14 - square and triangle down, 15 - filled square, 16 - filled circle, 17 - filled triangle point-up, 18 - filled diamond, 19 - solid circle, 20 - bullet (smaller circle), 21 - filled circle blue, 22 - filled square blue, 23 - filled diamond blue, 24 - filled triangle point-up blue, 25 - filled triangle point down blue." width="432" /> <p class="caption">R has 25 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the `color` and `fill` aesthetics. The hollow shapes (0--14) have a border determined by `color`; the solid shapes (15--20) are filled with `color`; the filled shapes (21--24) have a border of `color` and are filled with `fill`. </p> </div> ] --- ### Geometric objects How are these two plots similar? <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-7-1.png" alt="There are two plots. The plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed." width="216" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-7-2.png" alt="There are two plots. The plot on the left is a scatterplot of highway fuel efficiency versus engine size of cars and the plot on the right shows a smooth curve that follows the trajectory of the relationship between these variables. A confidence interval around the smooth curve is also displayed." width="216" /> * Both plots contain the same x variable, the same y variable, and both describe the same data. * Each plot uses a different geometric object, geom, to represent the data. ```r # Left ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() # Right ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth() #> `geom_smooth()` using method = 'loess' and formula = 'y ~ x' ``` --- * Every geom object takes a `mapping` argument * Either dfeined locally in the geom arg, or inhereted from the `ggplot()` layer via `aes()` * Obvious restrictions - can set shape of points but not lines; can set linetypes for lines but not points ```r # Left ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) + geom_smooth() # Center ggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) + geom_smooth() # Right ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth(aes(linetype = drv)) ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-9-1.png" width="288" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-9-2.png" width="288" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-9-3.png" width="288" /> --- * Can utilize multiple geom's together to make relationship clearer ```r # Left ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point() + geom_smooth(aes(linetype = drv)) # Right ggplot(mpg, aes(x = displ, y = hwy, linetype = drv)) + geom_smooth() ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-10-1.png" width="360" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-10-2.png" width="360" /> --- ### Displaying multiple rows of data * Many geom's (like `geom_smooth`) use a single geometric object to display multiple rows of data * You can set `group` aesthetic to a categorical variable to draw multiple objects, so ggplot will then automatically group data to make multiple plots * If you map any aesthetic to a discrete variable, it will automatically group them .pull-left[ ```r # Top Left ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth() + geom_point(alpha=0.2) # Top Right ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth(aes(group = drv)) # Bottom ggplot(mpg, aes(x = displ, y = hwy)) + geom_smooth(aes(color = drv), show.legend = FALSE) ``` ] .pull-right[ <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-12-1.png" alt="Three plots, each with highway fuel efficiency on the y-axis and engine size of cars, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, with a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." width="216" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-12-2.png" alt="Three plots, each with highway fuel efficiency on the y-axis and engine size of cars, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, with a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." width="216" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-12-3.png" alt="Three plots, each with highway fuel efficiency on the y-axis and engine size of cars, where data are represented by a smooth curve. The first plot only has these two variables, the center plot has three separate smooth curves for each level of drive train, and the right plot not only has the same three separate smooth curves for each level of drive train but these curves are plotted in different colors, with a legend explaining which color maps to which level. Confidence intervals around the smooth curves are also displayed." width="216" /> ] --- ### Different mappings for different layers * By placing mappings inside geom functions, you can use local mappings for different layers * e.g., let's color dots according to class of car but then have a uniform line color for a `geom_smooth`: ```r ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = class)) + geom_smooth() #> `geom_smooth()` using method = 'loess' and formula = 'y ~ x' ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-13-1.png" width="432" /> --- ### Different mappings for different layers * You can also combine filtering and use multiple geom's to emphasize different things in the visualization .pull-left[ ```r ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + geom_point( data = mpg |> filter(class == "2seater"), shape = "circle open", size = 3, color = "red") ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-14-1.png" width="432" /> ] .pull-right[ ```r ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + geom_point( data = mpg |> filter(class == "2seater"), shape = "circle open", size = 3, color = "red") + geom_point( data = mpg |> filter(class == "2seater"), color = "red") ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-15-1.png" width="288" /> ] --- ### Example * Let's look through how we can create these 4 figures: <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-16-1.png" width="234" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-16-2.png" width="234" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-16-3.png" width="234" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-16-4.png" width="234" /> .pull-left[ ```r # Left ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + geom_smooth() # Center Left ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + geom_smooth(se = FALSE) ``` ] .pull-right[ ```r # Center Right ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + geom_point() + geom_smooth(se = FALSE) # Right ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(aes(color = drv)) + geom_smooth(se = FALSE) ``` ] --- ### Facets .pull-left[ * Facetting with `facet_wrap()` splits plots into subplots based on categorical variables: ```r ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + facet_wrap(~cyl) ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-19-1.png" width="432" /> ] .pull-right[ * To facet with the combo of 2 variables, use `facet_grid()`. * The first argument of `facet_grid()` is also a formula, but now it's a double sided formula: `rows ~ cols`. ```r ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + facet_grid(drv ~ cyl) ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-20-1.png" alt="Scatterplot of highway fuel efficiency versus engine size of cars, faceted by number of cylinders across rows and by type of drive train across columns. This results in a 4x3 grid of 12 facets. Some of these facets have no observations: 5 cylinders and 4 wheel drive, 4 or 5 cylinders and front wheel drive." width="432" /> ] --- ### Facets * Facets share same scale for x/y axes by default * Useful for comparing across facets, but can be limiting * `scales` argument allows for freeing up the x or y variables ```r ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + facet_grid(drv ~ cyl) ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + facet_grid(drv ~ cyl, scales = "free_y") ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-21-1.png" width="432" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-21-2.png" width="432" /> --- ### Facets * Which of the following makes it easier to compare engine size (`displ`) across cras with different drive trains? ```r ggplot(mpg, aes(x = displ)) + geom_histogram() + facet_grid(drv ~ .) ggplot(mpg, aes(x = displ)) + geom_histogram() + facet_grid(. ~ drv) ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-22-1.png" width="432" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-22-2.png" width="432" /> --- ### Facets * What are advantages and disadvantages to using facets instead of color? ```r ggplot(mpg) + geom_point(aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2) ggplot(mpg) + geom_point(aes(x = displ, y = hwy, color = class)) ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-23-1.png" width="432" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-23-2.png" width="432" /> --- ### Statistical transformations .pull-left[ * Consider the `diamonds` dataset, in tidyverse: ```r head(diamonds) #> # A tibble: 6 × 10 #> carat cut color clarity depth table price x y z #> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> #> 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 #> 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 #> 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 #> 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 #> 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 #> 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 ``` * Let's create a bar chart across cuts: ```r ggplot(diamonds, aes(x = cut)) + geom_bar() ``` ] .pull-right[ <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-26-1.png" width="432" /> * `count` is not a variable in diamonds, so how is it creating this? ] --- ### Statistical transformations - Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin. - Smoothers fit a model to your data and then plot predictions from the model. - Boxplots compute the five-number summary of the distribution and then display that summary as a specially formatted box. The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation. <img src="visualization-stat-bar.png" width="75%" /> --- ### Position adjustments .pull-left[ * You can color bar charts using `color` or `fill` aesthetics: ```r # Left ggplot(mpg, aes(x = drv, color = drv)) + geom_bar() # Right ggplot(mpg, aes(x = drv, fill = drv)) + geom_bar() ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-28-1.png" alt="Two bar charts of drive types of cars. In the first plot, the bars have colored borders. In the second plot, they're filled with colors. Heights of the bars correspond to the number of cars in each cut category." width="180" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-28-2.png" alt="Two bar charts of drive types of cars. In the first plot, the bars have colored borders. In the second plot, they're filled with colors. Heights of the bars correspond to the number of cars in each cut category." width="180" /> ] .pull-right[ * You can also fill aesthetics with other variables ```r ggplot(mpg, aes(x = drv, fill = class)) + geom_bar() ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-29-1.png" alt="Segmented bar chart of drive types of cars, where each bar is filled with colors for the classes of cars. Heights of the bars correspond to the number of cars in each drive category, and heights of the colored segments are proportional to the number of cars with a given class level within a given drive type level." width="432" /> * Each rectangle represents a combo of `drv` and `class` ] --- ### Position adjustments * The way bar charts are stacked is determined using `position` argument (position adjustment) * For non-stacked charts, you can use three options: `position = 'identity'`, `position = 'dodge'`, `position = 'fill'` * `position = 'identity'` is not super useful for bar charts: it places each bar exactly where it falls in context of the graph (i.e. at `drv`) ```r # Left ggplot(mpg, aes(x = drv, fill = class)) + geom_bar(alpha = 1/5, position = "identity") # Right ggplot(mpg, aes(x = drv, color = class)) + geom_bar(fill = NA, position = "identity") ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-30-1.png" width="252" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-30-2.png" width="252" /> * `position = 'identity'` is useful for geom_point, since we want all dots to appear where data is --- ### Position adjustments * `position = "fill"` works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups. * `position = "dodge"` places overlapping objects directly beside one another. This makes it easier to compare individual values. ```r # Left ggplot(mpg, aes(x = drv, fill = class)) + geom_bar(position = "fill") # Right ggplot(mpg, aes(x = drv, fill = class)) + geom_bar(position = "dodge") ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-31-1.png" width="324" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-31-2.png" width="324" /> --- ### Exploratory data analysis Cycle through the following: * Generate questions about your data. * Search for answers by visualizing, transforming, and modelling your data. * Use what you learn to refine your questions and/or generate new questions Requires creativity and critical thinking. Two questions which capture many different types of questions: * What type of variation occurs **within** each variable? (Mean, standard deviation, skewness, ...) * What type of covariation occurs **between** variables? (How does height vary with weight, ... ) --- ### Variation * **Variation** is tendency for values of a variable to change from measurement to measurement * Can be due to measurement error (e.g., measuring height with different rulers) or due to within-group variation (different people have different heights) * Let's explore the distribution of weights (`carat`) of the ~50k diamonds from `diamonds` dataset. * `carat` is numerical, can use histogram: ```r ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.5) ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-32-1.png" width="432" /> --- ### Typical values .pull-left[ * In bar charts and histograms, tall bars = common values; no bars = values not seen * Questions to ask yourself: - Which values are the most common? Why? - Which values are rare? Why? Does that match your expectations? - Can you see any unusual patterns? What might explain them? * Let's look at distribution of weights of smaller diamonds. ```r smaller <- diamonds |> filter(carat < 3) ggplot(smaller, aes(x = carat)) + geom_histogram(binwidth = 0.01) ``` ] .pull-right[ <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-34-1.png" width="432" /> * Why are there more diamonds at whole carats than fractional? * Why are more diamonds slightly to the right of each peak than slightly to left? ] --- ### Unusual values * If you encounter unusual values in dataset, and want to ignore them for rest of analysis, two options: - Drop entire row with the strange values, e.g. ```r diamonds_filtered <- diamonds %>% filter(between( y, 3, 20)) ``` - Replacing unusual values with missing values ```r diamonds2 <- diamonds |> mutate(y = if_else(y < 3 | y > 20, NA, y)) ``` * Latter is more recommended behavior. --- ### Covariation * **Covariation** is tendency for values of >= 2 variables to vary together in a related way #### Categorical and numerical variable * We can use `geom_freqpoly()` (similar to `geom_hist`) to show "frequency polygons", e.g. for how price of a diamond varies with quality ```r ggplot(diamonds, aes(x = price)) + geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75) ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-37-1.png" width="432" /> * Not super informative, since the height mainly reflects the count. --- ### Covariation * More useful to understand the *density* of the variable (count / total number) * To do this, we can use `after_stat(density)`, which does this normalization * Default behavior is `after_stat(count)` ```r ggplot(diamonds, aes(x = price)) + geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75) ggplot(diamonds, aes(x = price, y = after_stat(count))) + geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75) ggplot(diamonds, aes(x = price, y = after_stat(density))) + geom_freqpoly(aes(color = cut), binwidth = 500, linewidth = 0.75) ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-38-1.png" width="288" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-38-2.png" width="288" /><img src="lec10-visualization-2_files/figure-html/unnamed-chunk-38-3.png" width="288" /> * Doesn't appear that diamond quality has significant effect? --- ### Covariation * Let's further inspect with a box plot .pull-left[ ```r ggplot(diamonds, aes(x = cut, y = price)) + geom_boxplot() ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-39-1.png" width="288" /> ] .pull-right[ * Can now easily compare medians and 25th/75th percentiles * Seems that better quality diamonds are typically cheaper! We'll investigate further why. ] --- ### Factor orders .pull-left[ * `cut` in the `diamonds` dataset is an **(ordered) factor**: categorical varibale which has ordering (fair < good < very good, etc) * We'll see a lot more about factors in the next few lectures * We can re-define the ordering of factors using `fct_reorder()` ```r ggplot(mpg, aes(x = fct_reorder(class, hwy, median), y = hwy)) + geom_boxplot() ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-40-1.png" width="324" /> ] .pull-right[ * If we have long variable names, sometimes better by making the y axis have labels ```r ggplot(mpg, aes(x = hwy, y = fct_reorder(class, hwy, median))) + geom_boxplot() ``` <img src="lec10-visualization-2_files/figure-html/unnamed-chunk-41-1.png" width="324" /> ]