class: center, middle, inverse, title-slide

.title[
# Visualizations III
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
]

---

### Further inspecting diamond prices

* Recall that diamond prices appeared to be inversely correlated with quality

```r
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-1-1.png" width="288" />

* What might be responsible for this? Recall variables: x, y, z refer to length/width/depth of diamond in mm

```r
colnames(diamonds)
#>  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"
#>  [8] "x"       "y"       "z"
```

* Let's look at **size** vs price.

---

### Size vs. price in diamonds

* We can try doing 3 side-by-side plots by just doing 3 ggplots:

```r
ggplot(diamonds, aes(x = x, y = price)) +
  geom_point()
ggplot(diamonds, aes(x = y, y = price)) +
  geom_point()
ggplot(diamonds, aes(x = z, y = price)) +
  geom_point()
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-3-1.png" width="252" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-3-2.png" width="252" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-3-3.png" width="252" />

* Some things we notice:
    * Clearly some outliers - "zero" length / width / depth diamonds apparently?
    * But still some positive correlation between x, y, z and price
    * Hard to see with so many dots
* Let's use `alpha=` to add some transparency, make it easier on the eyes

---

### Size vs. price in diamonds

* We can try doing 3 side-by-side plots by just doing 3 ggplots:

```r
ggplot(diamonds, aes(x = x, y = price)) +
  geom_point(alpha = 0.02)
ggplot(diamonds, aes(x = y, y = price)) +
  geom_point(alpha = 0.02)
ggplot(diamonds, aes(x = z, y = price)) +
  geom_point(alpha = 0.02)
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-4-1.png" width="252" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-4-2.png" width="252" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-4-3.png" width="252" />

* Some things we notice:
    * Clearly some outliers - "zero" length / width / depth diamonds apparently?
    * But still some positive correlation between x, y, z and price
    * Hard to see with so many dots
* Let's now clean up the tibble so we are only looking at the middle 95% of values for x, y, z

---

### Size vs. price in diamonds

* Let's clean up the tibble so we are only looking at the middle 95% of values for x, y, z

```r
diamonds_middle <- diamonds %>%
  filter(
    between(x, quantile(diamonds$x, 0.025), quantile(diamonds$x, 0.975)),
    between(y, quantile(diamonds$y, 0.025), quantile(diamonds$y, 0.975)),
    between(z, quantile(diamonds$z, 0.025), quantile(diamonds$z, 0.975))
  )
ggplot(diamonds_middle, aes(x = x, y = price)) +
  geom_point(alpha = 0.02)
ggplot(diamonds_middle, aes(x = y, y = price)) +
  geom_point(alpha = 0.02)
ggplot(diamonds_middle, aes(x = z, y = price)) +
  geom_point(alpha = 0.02)
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-5-1.png" width="288" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-5-2.png" width="288" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-5-3.png" width="288" />

---

### Size vs. price in diamonds

* Appears x, y, z (length/width/depth) are highly correlated with price
* Recall that the volume of a rectangular box is length * width * depth - so let's look at the relationship between estimated volume (x\*y\*z) and price

```r
diamonds_middle %>%
  mutate(est_volume = x*y*z) %>%
  ggplot(aes(x = est_volume, y = price)) +
  geom_point(alpha = 0.01)
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-6-1.png" width="360" />

* Does appear to be a strong correlation.
* Now let's think again about why we previously found that higher quality diamonds were cheaper.

---

### Size vs. price in diamonds

* Hypothesis: higher quality diamonds are smaller, lower quality diamonds are larger.
* To investigate, we could use a boxplot: we want to visualize variation in size for each category of quality

.pull-left[

```r
diamonds_middle %>%
  mutate(est_volume = x*y*z) %>%
  ggplot(aes(x = cut, y = est_volume)) +
  geom_boxplot()
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-7-1.png" width="432" />
]

---

### Size vs price in diamonds

* Let's try to visualize the different cuts on the scatterplot:

```r
diamonds_middle %>%
  mutate(est_volume = x*y*z) %>%
  ggplot(aes(x = est_volume, y = price, color = cut)) +
  geom_point(alpha = 0.5)
diamonds_middle %>%
  mutate(est_volume = x*y*z) %>%
  ggplot(aes(x = est_volume, y = price, color = cut)) +
  geom_smooth()
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-8-1.png" width="360" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-8-2.png" width="360" />

---

### Size vs. price in diamonds

* We were looking at the estimated volume, but could also look at the *weight*: carat.
* Let's see if weight varies by cut.
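
A numeric summary can complement the boxplots. The following is a minimal sketch, assuming the `tidyverse` is installed; the names `size_by_cut`, `median_carat` and `median_price` are illustrative choices for this example and do not come from the lecture.

```r
library(tidyverse)

# Median weight (carat) and median price for each cut category,
# plus the number of diamonds in each category.
size_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarize(
    median_carat = median(carat),
    median_price = median(price),
    n = n()
  )

size_by_cut
```

If the hypothesis is right, the lower quality cuts should tend to show the larger median carat.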
.pull-left[

```r
diamonds %>%
  ggplot(aes(x = cut, y = carat)) +
  geom_boxplot()
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-9-1.png" width="288" />
]

.pull-right[
* Again appears to be a lot of outliers; let's look at middle 95%.

```r
diamonds_middle <- diamonds %>%
  filter(between(carat, quantile(carat, 0.025), quantile(carat, 0.975)))
diamonds_middle %>%
  ggplot(aes(x = cut, y = carat)) +
  geom_boxplot()
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-10-1.png" width="288" />
]

---

### Size vs price in diamonds

.pull-left[
* And now let's visualize the relationship between carat size and price by cut:

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-11-1.png" width="324" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-11-2.png" width="324" />
]

.pull-right[
* Compare this to what happens if we don't ignore outliers!

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-12-1.png" width="324" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-12-2.png" width="324" />
]

---

### Communicating ideas

* Now we'll look at how to best display data once we've figured out some meaningful relationships between things
* Packages we'll be using: `tidyverse`, `scales`, `ggrepel`, `patchwork`

.pull-left[
* Good labels are crucial to good figures:

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(
    x = "Engine displacement (L)",
    y = "Highway fuel economy (mpg)",
    color = "Car type",
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )
```
]

.pull-right[
<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-15-1.png" width="396" />
]

---

### Annotations

.pull-left[
* Often helpful to label individual observations or groups of observations
* Tool to use: `geom_text()`, similar to `geom_point()` but has additional aesthetic `label`.
* Sources of labels:
    - Tibble itself has labels
    - Custom, user-added labels.

```r
label_info <- mpg |>
  group_by(drv) |>
  arrange(desc(displ)) |>
  slice_head(n = 1) |>
  mutate(
    drive_type = case_when(
      drv == "f" ~ "front-wheel drive",
      drv == "r" ~ "rear-wheel drive",
      drv == "4" ~ "4-wheel drive"
    )
  ) |>
  select(displ, hwy, drv, drive_type)
```
]

.pull-right[

```r
label_info
#> # A tibble: 3 × 4
#> # Groups:   drv [3]
#>   displ   hwy drv   drive_type
#>   <dbl> <int> <chr> <chr>
#> 1   6.5    17 4     4-wheel drive
#> 2   5.3    25 f     front-wheel drive
#> 3   7      24 r     rear-wheel drive
```

* Can use this to directly label groups and replace the legend
]

---

### Annotations

.pull-left[

```r
label_info
#> # A tibble: 3 × 4
#> # Groups:   drv [3]
#>   displ   hwy drv   drive_type
#>   <dbl> <int> <chr> <chr>
#> 1   6.5    17 4     4-wheel drive
#> 2   5.3    25 f     front-wheel drive
#> 3   7      24 r     rear-wheel drive
```

* Want to use `label_info` as the data.
* x and y aesthetics the same as in the original `mpg` tibble
* But then want to set `hjust` and `vjust`
]

.pull-right[

```r
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  geom_text(
    data = label_info,
    aes(x = displ, y = hwy, label = drive_type),
    fontface = "bold", size = 5, hjust = "right", vjust = "bottom"
  ) +
  theme(legend.position = "none")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-19-1.png" width="324" />
]

---

### Annotations

.pull-left[
<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-20-1.png" width="324" />

* Two things to address: (1) What are `hjust` and `vjust` doing, why is text appearing where it does? (2)
* Let's look at `label_info`

```r
label_info
#>   displ   hwy drv   drive_type
#>   <dbl> <int> <chr> <chr>
#> 1   6.5    17 4     4-wheel drive
#> 2   5.3    25 f     front-wheel drive
#> 3   7      24 r     rear-wheel drive
```
]

.pull-right[
* We can use `ggrepel::geom_label_repel()` to make text repel other text labels.

```r
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  geom_label_repel(
    data = label_info,
    aes(x = displ, y = hwy, label = drive_type),
    fontface = "bold", size = 5, nudge_y = 2
  ) +
  theme(legend.position = "none")
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-22-1.png" width="288" />
]

---

### Annotations

.pull-left[
* Same idea can be used to highlight points on a plot, e.g. outliers

```r
potential_outliers <- mpg |>
  filter(hwy > 40 | (hwy > 20 & displ > 5))

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_text_repel(data = potential_outliers, aes(label = model)) +
  geom_point(data = potential_outliers, color = "red") +
  geom_point(
    data = potential_outliers,
    color = "red", size = 3, shape = "circle open"
  )
```
]

.pull-right[
<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-24-1.png" width="432" />
]

---

### Annotations

* Other annotations which you might encounter / could be useful:
    * `geom_hline()`, `geom_vline()`: create horizontal / vertical lines
    * `geom_rect()`: rectangle around points of interest, requires specifying `xmin`, `xmax`, `ymin`, `ymax`
    * `geom_segment()`: draw a line segment (or an arrow, via the `arrow` argument) from one point to another
* Function which allows for doing any of these: `annotate()`
* Let's say we want to add some text to our plot, but split lines every 30 characters using `str_wrap()`

```r
trend_text <- "Larger engine sizes tend to have lower fuel economy." |>
  str_wrap(width = 30)
trend_text
#> [1] "Larger engine sizes tend to\nhave lower fuel economy."
```

---

### Annotations

.pull-left[
* Let's add a text label in red using `annotate(geom = 'label')`
* ... and then add a line segment giving the general trend

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  annotate(
    geom = "label", x = 3.5, y = 38,
    label = trend_text,
    hjust = "left", color = "red"
  ) +
  annotate(
    geom = "segment",
    x = 3, y = 35, xend = 5, yend = 25, color = "red",
    arrow = arrow(type = "closed")
  )
```
]

.pull-right[
<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-27-1.png" width="432" />
]

---

### Scales

* By default, ggplot2 automatically adds scales for you, e.g. this code

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class))
```

uses the default scales for this plot, namely continuous scales for x and y and a discrete color scale:

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_color_discrete()
```

---

### Axis ticks

* Axes and legends are called **guides** in ggplot2. Axes are used for x, y aesthetics, legends for everything else
* Ticks on axes and keys on legends are affected by the arguments `breaks` and `labels`.
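
As a rough sketch of how the two arguments fit together (assuming `ggplot2` is loaded via `tidyverse`): `breaks` gives the positions of the ticks, and `labels` can be a function that turns each break into display text. The names `y_breaks` and `mpg_label` are hypothetical, introduced only for this example.

```r
library(tidyverse)

# Positions where ticks should appear on the y axis.
y_breaks <- seq(15, 40, by = 5)

# Hypothetical helper: turn a numeric break into text such as "15 mpg".
mpg_label <- function(b) paste0(b, " mpg")

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_y_continuous(breaks = y_breaks, labels = mpg_label)

p
```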
.pull-left[

```r
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_y_continuous(breaks = seq(15, 40, by = 5))
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-30-1.png" width="288" />
]

.pull-right[

```r
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = NULL) +
  scale_color_discrete(labels = c("4" = "4-wheel", "f" = "front", "r" = "rear"))
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-31-1.png" width="288" />
]

---

### Axis ticks

* Specifying `labels` and `breaks` gives a lot of flexibility on how to plot

```r
# Left
ggplot(diamonds, aes(x = price, y = cut)) +
  geom_boxplot(alpha = 0.05) +
  scale_x_continuous(labels = label_dollar())

# Right
ggplot(diamonds, aes(x = price, y = cut)) +
  geom_boxplot(alpha = 0.05) +
  scale_x_continuous(
    labels = label_dollar(scale = 1/1000, suffix = "K"),
    breaks = seq(1000, 19000, by = 6000)
  )
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-32-1.png" width="324" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-32-2.png" width="324" />

---

### Legends

.pull-left[
* To control the location of the legend, use `theme()` (themes control the non-data parts of a plot)

```r
base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class))

base + theme(legend.position = "right") # the default
base + theme(legend.position = "left")
base +
  theme(legend.position = "top") +
  guides(color = guide_legend(nrow = 3))
base +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(nrow = 3))
```
]

.pull-right[
<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-34-1.png" width="216" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-34-2.png" width="216" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-34-3.png" width="216" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-34-4.png" width="216" />
]

---

### Replacing scales

.pull-left[
* It's often useful to consider transformations of data, e.g. log transformations

```r
# Left
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.02)

# Right
ggplot(diamonds, aes(x = log10(carat), y = log10(price))) +
  geom_point(alpha = 0.02)
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-35-1.png" width="216" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-35-2.png" width="216" />
]

.pull-right[
* A useful geom for large-sample-size datasets: `geom_bin2d()`

```r
# Left
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d()

# Right
ggplot(diamonds, aes(x = log10(carat), y = log10(price))) +
  geom_bin2d()
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-36-1.png" width="216" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-36-2.png" width="216" />
]

---

### Replacing scales

* Not ideal that previous plots had `log10(carat)` on the x axis, hard to interpret
* Can use `scale_x_log10()` and `scale_y_log10()` to plot regular values, but with ticks that are spread using a log scale

```r
# Left
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d()

# Right
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d() +
  scale_x_log10() +
  scale_y_log10()
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-37-1.png" width="324" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-37-2.png" width="324" />

---

### Replacing scales

.pull-left[
* Color scales are also often replaced, especially using `scale_color_brewer()`

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv))

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  scale_color_brewer(palette = "Set1")
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-38-1.png" alt="Two scatterplots of highway mileage versus engine size where points are colored by drive type. The plot on the left uses the default ggplot2 color palette and the plot on the right uses a different color palette." width="216" /><img src="lec11-visualization-3_files/figure-html/unnamed-chunk-38-2.png" alt="Two scatterplots of highway mileage versus engine size where points are colored by drive type. The plot on the left uses the default ggplot2 color palette and the plot on the right uses a different color palette." width="216" />
]

.pull-right[
* Many different options for color brewer

<div class="figure">
<img src="lec11-visualization-3_files/figure-html/fig-brewer-1.png" alt="All colorBrewer scales. One group goes from light to dark colors. Another group is a set of non ordinal colors. And the last group has diverging scales (from dark to light to dark again). Within each set there are a number of palettes." width="432" />
<p class="caption">All colorBrewer scales.</p>
</div>
]

---

### Manual color scaling

.pull-left[
* `scale_color_manual()` allows you to give specific colors for groups in the tibble, e.g.

```r
presidential
#> # A tibble: 12 × 4
#>   name       start      end        party
#>   <chr>      <date>     <date>     <chr>
#> 1 Eisenhower 1953-01-20 1961-01-20 Republican
#> 2 Kennedy    1961-01-20 1963-11-22 Democratic
#> 3 Johnson    1963-11-22 1969-01-20 Democratic
#> 4 Nixon      1969-01-20 1974-08-09 Republican
#> 5 Ford       1974-08-09 1977-01-20 Republican
#> 6 Carter     1977-01-20 1981-01-20 Democratic
#> # ℹ 6 more rows
```

* Takes argument `values` which should be a named vector, with entries of the form `group = color`.
* `color` can be either English names ("blue", "red") or hexadecimal color codes
]

.pull-right[

```r
presidential |>
  mutate(id = 33 + row_number()) |>
  ggplot(aes(x = start, y = id, color = party)) +
  geom_point() +
  geom_segment(aes(xend = end, yend = id)) +
  scale_color_manual(
    values = c(Republican = "#E81B23", Democratic = "#00AEF3")
  )
```

<img src="lec11-visualization-3_files/figure-html/unnamed-chunk-41-1.png" alt="Line plot of id number of presidents versus the year they started their presidency. Start year is marked with a point and a segment that starts there and ends at the end of the presidency. Democratic presidents are represented in blue and Republicans in red." width="432" />
]
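
To illustrate that color names and hexadecimal codes can be mixed freely in `values`, here is a small sketch, assuming `tidyverse` is loaded. The palette `drv_colors` and its particular colors are arbitrary choices for this example, not from the lecture.

```r
library(tidyverse)

# "red" and "#FF0000" describe the same color in R:
# both convert to the same red/green/blue components.
all(col2rgb("red") == col2rgb("#FF0000"))

# Named vector: names are the groups in `drv`, values are the colors.
# Color names and hex codes are mixed on purpose.
drv_colors <- c("4" = "black", "f" = "#E69F00", "r" = "skyblue")

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_manual(values = drv_colors)

p
```

Because the vector is named, the order of its entries does not matter: each color is matched to its group by name.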