Tidyverse and Data Visualisation

Author

Jason Hilton

Warning: package 'here' was built under R version 4.4.2

here() starts at Z:/mydocuments/teaching/RESM3002/resm3002_2023-24/RESM3002_Workshops

Warning: package 'tidyverse' was built under R version 4.4.2

Warning: package 'ggplot2' was built under R version 4.4.2

Warning: package 'tibble' was built under R version 4.4.2

Warning: package 'tidyr' was built under R version 4.4.2

Warning: package 'readr' was built under R version 4.4.2

Warning: package 'purrr' was built under R version 4.4.2

Warning: package 'dplyr' was built under R version 4.4.2

Warning: package 'stringr' was built under R version 4.4.2

Warning: package 'forcats' was built under R version 4.4.2

Warning: package 'lubridate' was built under R version 4.4.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Introduction

In this workshop we will carry out some visualisation using ggplot2.

I recommend typing each part of the example code below into an R script and running it, rather than copying and pasting. This help you internalise the process a little better and understand what is happening at each step.

Feel free to experiment and change bits of the code. This is the best way to learn.

Getting started

Begin by opening Rstudio and create a project for this workshop (or use one you have already created for this course).

See this brief discussion of how to go about this.

We will need to install some additional packages for this exercise. We will start by installing the tidyverse set of packages. Although there are many tidyverse packages, including tidyr, dplyr, and ggplot2, which we could install individually, it is easier to use the ‘tidyverse’ package

install.packages("tidyverse")
library(tidyverse)

The documentation for ggplot2 is available online here if you need to access it.

Remember that you can always access the help for any R function if you need to by typing ? and the name of the function into the console.

?geom_point

Reading and Tidying

On the course blackboard page, you will find the three .csv files. Download these files in a folder within your project named data [^1]. [1:] All of this data is repurposed from the gapminder package.

We can load one of these in using the read_csv function:

gdp_df <- read_csv("data/gdp_pc.csv")

gdp_df

Rows: 142 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): country, continent
dbl (12): 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 142 × 14
   country     continent `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987`
   <chr>       <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghanistan Asia        779.   821.   853.   836.   740.   786.   978.   852.
 2 Albania     Europe     1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.
 3 Algeria     Africa     2449.  3014.  2551.  3247.  4183.  4910.  5745.  5681.
 4 Angola      Africa     3521.  3828.  4269.  5523.  5473.  3009.  2757.  2430.
 5 Argentina   Americas   5911.  6857.  7133.  8053.  9443. 10079.  8998.  9140.
 6 Australia   Oceania   10040. 10950. 12217. 14526. 16789. 18334. 19477. 21889.
 7 Austria     Europe     6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688.
 8 Bahrain     Asia       9867. 11636. 12753. 14805. 18269. 19340. 19211. 18524.
 9 Bangladesh  Asia        684.   662.   686.   721.   630.   660.   677.   752.
10 Belgium     Europe     8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526.
# ℹ 132 more rows
# ℹ 4 more variables: `1992` <dbl>, `1997` <dbl>, `2002` <dbl>, `2007` <dbl>

We can see that this data is not in ‘tidy’ format.

The columns ‘1952’,‘1957’, …, etc give each provide different sets of observations of the same variable (GDP per capita).

Therefore, we would like to convert this plot to tidy format using the pivot_longer function from the tidyr package.

gdp_tidy_df <- gdp_df %>% 
  # select all columns except country and continent
  pivot_longer(-c(country, continent),
               names_to="year",
               values_to="gdp_per_cap")
gdp_tidy_df

# A tibble: 1,704 × 4
   country     continent year  gdp_per_cap
   <chr>       <chr>     <chr>       <dbl>
 1 Afghanistan Asia      1952         779.
 2 Afghanistan Asia      1957         821.
 3 Afghanistan Asia      1962         853.
 4 Afghanistan Asia      1967         836.
 5 Afghanistan Asia      1972         740.
 6 Afghanistan Asia      1977         786.
 7 Afghanistan Asia      1982         978.
 8 Afghanistan Asia      1987         852.
 9 Afghanistan Asia      1992         649.
10 Afghanistan Asia      1997         635.
# ℹ 1,694 more rows

Filter, Group by, Summarise

We can filter, compute summaries, and create new columns in the data by using functions from the dplyr package.

We can, for instance, calculate the mean gdp per capita.

summarise(gdp_tidy_df, mean_gdp_per_cap = mean(gdp_per_cap))

# A tibble: 1 × 1
  mean_gdp_per_cap
             <dbl>
1            7215.

We could also do this using the pipe operator:

gdp_tidy_df %>% summarise(mean_gdp_per_cap=mean(gdp_per_cap))

# A tibble: 1 × 1
  mean_gdp_per_cap
             <dbl>
1            7215.

We might like to find out the number of observations in each group.

We can subset the data to find a specific year.

gdp_tidy_df %>% 
  # choose only those rows where the condition year==max(year) holds
  # (i.e. the latest year)
  filter(year==max(year)) %>%
  # arrange in order of gdp per capita, 
  # in descending order (without the `-`, we would have smallest first) 
  arrange(-gdp_per_cap)

# A tibble: 142 × 4
   country          continent year  gdp_per_cap
   <chr>            <chr>     <chr>       <dbl>
 1 Norway           Europe    2007       49357.
 2 Kuwait           Asia      2007       47307.
 3 Singapore        Asia      2007       47143.
 4 United States    Americas  2007       42952.
 5 Ireland          Europe    2007       40676.
 6 Hong Kong, China Asia      2007       39725.
 7 Switzerland      Europe    2007       37506.
 8 Netherlands      Europe    2007       36798.
 9 Canada           Americas  2007       36319.
10 Iceland          Europe    2007       36181.
# ℹ 132 more rows

We will calculate mean gdp by continent[^1], so will use the group_by function to create groups based on the continent column. Any subsequent calculations we carry out on that data frame will be conducted separately for each group, as though we were working on separate dataframes.

gdp_tidy_df %>% 
  group_by(continent)

# A tibble: 1,704 × 4
# Groups:   continent [5]
   country     continent year  gdp_per_cap
   <chr>       <chr>     <chr>       <dbl>
 1 Afghanistan Asia      1952         779.
 2 Afghanistan Asia      1957         821.
 3 Afghanistan Asia      1962         853.
 4 Afghanistan Asia      1967         836.
 5 Afghanistan Asia      1972         740.
 6 Afghanistan Asia      1977         786.
 7 Afghanistan Asia      1982         978.
 8 Afghanistan Asia      1987         852.
 9 Afghanistan Asia      1992         649.
10 Afghanistan Asia      1997         635.
# ℹ 1,694 more rows

We can see the groups are indicated in the information printed at the top of the dataframe.

gdp_tidy_df %>% 
  group_by(continent) %>% 
  summarise(mean_gdp_per_cap=mean(gdp_per_cap))

# A tibble: 5 × 2
  continent mean_gdp_per_cap
  <chr>                <dbl>
1 Africa               2194.
2 Americas             7136.
3 Asia                 7902.
4 Europe              14469.
5 Oceania             18622.

We can also choose a different variable to group on. Try doing this by year, for example.

Joining

In fact, these values are not very meaningful. We should really be accounting for differences in population size between the countries we are averaging over.

Therefore, we need to load in data on population. As above, try to read in the data on population and covert to tidy format, and store in dataframe called pop_tidy_df.

Solution

pop_tidy_df <- read_csv("data/pop.csv") %>% 
  pivot_longer(-c(country, continent),
               names_to="year",
               values_to="population")
pop_tidy_df

Rows: 142 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): country, continent
dbl (12): 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 1,704 × 4
   country     continent year  population
   <chr>       <chr>     <chr>      <dbl>
 1 Afghanistan Asia      1952     8425333
 2 Afghanistan Asia      1957     9240934
 3 Afghanistan Asia      1962    10267083
 4 Afghanistan Asia      1967    11537966
 5 Afghanistan Asia      1972    13079460
 6 Afghanistan Asia      1977    14880372
 7 Afghanistan Asia      1982    12881816
 8 Afghanistan Asia      1987    13867957
 9 Afghanistan Asia      1992    16317921
10 Afghanistan Asia      1997    22227415
# ℹ 1,694 more rows

We would like to combine this with the data on gdp.

We can do this using the join functions from dplyr.

country_df <- gdp_tidy_df %>% 
  left_join(pop_tidy_df)

Joining with `by = join_by(country, continent, year)`

country_df

# A tibble: 1,704 × 5
   country     continent year  gdp_per_cap population
   <chr>       <chr>     <chr>       <dbl>      <dbl>
 1 Afghanistan Asia      1952         779.    8425333
 2 Afghanistan Asia      1957         821.    9240934
 3 Afghanistan Asia      1962         853.   10267083
 4 Afghanistan Asia      1967         836.   11537966
 5 Afghanistan Asia      1972         740.   13079460
 6 Afghanistan Asia      1977         786.   14880372
 7 Afghanistan Asia      1982         978.   12881816
 8 Afghanistan Asia      1987         852.   13867957
 9 Afghanistan Asia      1992         649.   16317921
10 Afghanistan Asia      1997         635.   22227415
# ℹ 1,694 more rows

This function looks for columns common to both datasets, and looks for rows in each dataset with the same values in these common columns by which to join the dataset together.

Look at the simple example below to understand the result of a join operation. Notice that the correct customer age and sex is matched the relevant row of the transaction dataset. In the case of John, there is no customer data, so an NA value is inserted. Note also that it doesn’t matter whether the datasets are in the same order or have the same number of rows.

transaction_df <- 
  tibble(Transaction=c(1,2,3,4,5),
         Customer=c("John", "Jill", "Omar", "Liying", "Omar"),
         Amount =c(24,54,65.4,5,0.3))
transaction_df

# A tibble: 5 × 3
  Transaction Customer Amount
        <dbl> <chr>     <dbl>
1           1 John       24  
2           2 Jill       54  
3           3 Omar       65.4
4           4 Liying      5  
5           5 Omar        0.3

customer_df <- 
  tibble(Customer =c("Liying", "Jill", "Omar"),
         Age =c(34,54,65),
         Sex = c("f", "f", "m"))
customer_df

# A tibble: 3 × 3
  Customer   Age Sex  
  <chr>    <dbl> <chr>
1 Liying      34 f    
2 Jill        54 f    
3 Omar        65 m

transaction_df %>% 
  left_join(customer_df)

Joining with `by = join_by(Customer)`

# A tibble: 5 × 5
  Transaction Customer Amount   Age Sex  
        <dbl> <chr>     <dbl> <dbl> <chr>
1           1 John       24      NA <NA> 
2           2 Jill       54      54 f    
3           3 Omar       65.4    65 m    
4           4 Liying      5      34 f    
5           5 Omar        0.3    65 m

Mutating

Back to the gdp data, lets calculate average gdp per capita by continent for the latest year in a more correct manner.

country_df %>% 
  filter(year==max(year)) %>%
  # creating a new column
  mutate(gdp = gdp_per_cap * population) %>%
  group_by(continent) %>%
  summarise(total_gdp=sum(gdp),
            total_pop=sum(population)) %>% 
  mutate(mean_gdp = total_gdp/total_pop)

# A tibble: 5 × 4
  continent total_gdp  total_pop mean_gdp
  <chr>         <dbl>      <dbl>    <dbl>
1 Africa      2.38e12  929539692    2561.
2 Americas    1.94e13  898871184   21603.
3 Asia        2.07e13 3811953827    5432.
4 Europe      1.48e13  586098529   25244.
5 Oceania     8.07e11   24549947   32885.

Visualisation

Next, read in the third .csv dataset provided, and join this to the other two, so that we have gdp, population and life expectancy data together in one data frame called country_df.

Solution

life_exp_df <- read_csv("data/life_exp.csv") %>% 
  pivot_longer(-c(country, continent),
               names_to="year",
               values_to="life_expectancy")

country_df <- country_df %>% 
  left_join(life_exp_df)
country_df

Rows: 142 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): country, continent
dbl (12): 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining with `by = join_by(country, continent, year)`

# A tibble: 1,704 × 6
   country     continent year  gdp_per_cap population life_expectancy
   <chr>       <chr>     <chr>       <dbl>      <dbl>           <dbl>
 1 Afghanistan Asia      1952         779.    8425333            28.8
 2 Afghanistan Asia      1957         821.    9240934            30.3
 3 Afghanistan Asia      1962         853.   10267083            32.0
 4 Afghanistan Asia      1967         836.   11537966            34.0
 5 Afghanistan Asia      1972         740.   13079460            36.1
 6 Afghanistan Asia      1977         786.   14880372            38.4
 7 Afghanistan Asia      1982         978.   12881816            39.9
 8 Afghanistan Asia      1987         852.   13867957            40.8
 9 Afghanistan Asia      1992         649.   16317921            41.7
10 Afghanistan Asia      1997         635.   22227415            41.8
# ℹ 1,694 more rows

Let’s start by building up a plot. First we initialise a plot and choose the dataset we wish to use:

ggplot(data= country_df)

This just creates a blank plot. This is because we have not specified any mappings from data dimensions to aesthetics, nor have we stated the geometric object we want to represent the data on the screen.

Let’s decide that we will map some of the variables in our dataset to position aesthetics. We will map GDP to the x aesthetic, and life expectancy to the y aesthetic. By specifying these at the plot initialisation stage, we have determine that we want this to apply to all future layers.

ggplot(data=country_df, mapping=aes(x=gdp_per_cap, y=life_expectancy))

We can now see that be specifying the mappings, ggplot knows enough about what we intend to draw the axes.

We have still to determine what the physical representations of our data will look like. We do this by creating a new layer with a specific geometry. We will use a point as the representation of each observation in our dataset.

ggplot(data=country_df, mapping=aes(x=gdp_per_cap, y=life_expectancy)) +
  geom_point()

You can also assign a plot to R variable.

p <- ggplot(data=country_df, mapping=aes(x=gdp_per_cap, y=life_expectancy))

This allows you to add layers to it later.

p + geom_point()

The relationship between gdp_per_cap and life_expectancy looks highly non-linear. Perhaps a transformation would help address this.

Tasks

Question 1

Change the scale of the x axis to a log scale (refer back to the lecture notes or documentation is necessary).

Solution

Several possible solutions:

ggplot(data=country_df, 
       mapping=aes(x=gdp_per_cap, y=life_expectancy)) + 
  geom_point() + 
  scale_x_log10( labels=round)

We can change the scale using the transformation argument of the scale, in this case using .

ggplot(data=country_df, 
       mapping=aes(x=gdp_per_cap, y=life_expectancy)) + 
  geom_point() + 
  scale_x_continuous(trans="log", labels=round)

We can transform the data itself.

ggplot(data=country_df, 
       mapping=aes(x=log(gdp_per_cap), y=life_expectancy)) + 
  geom_point()

Note the differences in the labels along the x-axis. - The first and second solution keeps the data the same, but uses log-scaling to translate this to distance along the x axis. - The third solution transforms the original data, and then translates this transformed data to distance in the x-direction using a linear scaling.

Question 2

Colour each continent separately using the colour aesthetic

Solution

ggplot(data=country_df, 
       mapping=aes(x=log(gdp_per_cap), y=life_expectancy, colour=continent)) + 
  geom_point()

Question 3

Patterns aren’t always easy to distinguish with lots of observations. Try faceting the plot on the year variable.

Solution

ggplot(data=country_df, 
       mapping=aes(x=log(gdp_per_cap), y=life_expectancy,
                   colour=continent)) + 
  geom_point() + 
  facet_wrap(~year)

Question 4.

Plot histograms of the continuous variables. Experiment with examining the distributions of these variables separately by year.

Note that ggplot2 has a geom_histogram geometry that allows histograms to be created easily, mapping the continuous variable of interest to the x aesthetic. Technically speaking, this is actually a combination of a transformation of the variable into binned counts, and the use of a bar geometry, but the histogram geometry is a useful shorthand for this.

Solution

ggplot(country_df, aes(x=life_expectancy)) + 
  geom_histogram() + facet_wrap(~year)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(country_df, aes(x=gdp_per_cap)) + 
  geom_histogram() + facet_wrap(~year) + 
  scale_x_log10()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Extra

We might also want to build a stacked histogram:

ggplot(country_df, aes(x=life_expectancy, 
                       group=continent, fill=continent)) + 
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note the difference between fill and colour in the above!

Position This works by using the position argument to the geometry, which we haven’t yet discussed. - The point of this attribute to geom is to avoid overplotting - Most of the time position will be set to identity - Looking at the position argument to geom_histogram (try ?geom_histogram), we notice that we it is set to stack by default. - This means bars in different groups with the same x value are translated to lie on top of one another.

We can overide this, adding some transparency:

ggplot(country_df, aes(x=life_expectancy, 
                       group=continent, fill=continent)) + 
  geom_histogram(position="identity", alpha=0.5) +
  theme_bw()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This is not a very easy plot to interpret, however, and I wouldn’t recommend it!

ggplot(country_df %>% filter(year %in% seq(1957,2007, 10)), 
       aes(x=life_expectancy, group=continent, fill=continent)) + 
  geom_histogram() + facet_wrap(~year)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can compare more directly between continents using facet_grid, allowings us to split out facets by two cateogries, one in rows, and one in columns. But the individual plots are now quite small.

ggplot(country_df %>% filter(year %in% seq(1967,2007, 20)), 
       aes(x=life_expectancy, group=continent, fill=continent)) + 
  geom_histogram() + facet_grid(continent~year)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Question 5

Label your plots with titles and more appropriate axis names. Try a different theme, and adjust the base theme

Solution

p <- ggplot(country_df %>% filter(year %in% seq(1967,2008, 20)), 
       aes(x=gdp_per_cap, y=life_expectancy, colour=continent)) + 
  geom_point() + 
  scale_x_log10(labels= function(x) round(x/1000)) +
  facet_wrap(~year) +
  theme_bw(base_size = 20) +
  labs(title="GDP per Capita vs Life Expectancy by Year",
       x="GDP per Capita (000s)")

print(p)

Question 6

Save one of your plots out to a png using ggsave.

Use the help function to get familiar with how to use ggsave. The examples section of the bottom of the help page can be particularly useful.

Solution

  ggsave("pop.png", p)

Saving 7 x 5 in image