Let's Jitter

By Carl Goodwin in R

September 12, 2017

This first little project uses the tidyverse collection of packages to import, explore and visualise some sales data. The UK Government’s Digital Marketplace provides a rich and varied source of public data under the Open Government Licence.

The marketplace was set up with an intent to break down barriers that impede Small and Medium Enterprises (SMEs) from bidding for Public Sector contracts. So, let’s see how that’s going.

library(tidyverse)
library(clock)
library(janitor)
library(scales, exclude = "date_format")
library(wesanderson)
library(glue)
library(kableExtra)
theme_set(theme_bw())

(cols <- wes_palette("Royal1"))

The tidyverse framework sits at the heart of all my data science work, as evidenced in my favourite things. So I’ll begin with two of my most-used tidyverse packages (readr and dplyr) to import and tidy the cloud services (G-Cloud) sales data.

Wild data are often scruffy affairs. Cleaning and tidying is a necessary first step. In these data, stray non-numeric characters appear in an otherwise numeric spend column. And the date column mixes two formats: serial day numbers and day/month/year text.

url <- str_c(
  "https://www.gov.uk/government/",
  "uploads/system/uploads/attachment_data/",
  "file/639799/g-cloud-sales-figures-july2017.csv"
)

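gcloud_df <-
  read_csv(url) |>
  clean_names() |>
  mutate(
    # Keep only digits, the decimal point and the minus sign, then convert
    evidenced_spend = str_remove_all(evidenced_spend, "[^0-9.-]") |>
      parse_number(),
    # Serial day numbers (Excel-style origin); text dates become NA here
    date = as.Date(as.numeric(return_month), origin = "1899-12-30"),
    # Fall back to day/month/two-digit-year text where the date is still NA
    date = if_else(
      is.na(date),
      date_parse(return_month, format = "%d/%m/%y"),
      date
    ),
    sme_status = if_else(sme_status == "SME", "SME", "Non-SME"),
    # SME spend is the evidenced spend for SMEs and zero otherwise
    sme_spend = if_else(sme_status == "SME", evidenced_spend, 0)
  )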
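
To make the two cleaning steps concrete, here is a minimal sketch on two made-up rows. The tibble `toy_df` and its values are hypothetical, not taken from the G-Cloud file, and I use base `as.Date()` with `coalesce()` in place of `date_parse()` and `if_else()` purely to keep the example light on dependencies. Keeping the decimal point in the character class is an assumption about the data: a pattern of `[^0-9-]` would turn 1234.50 into 123450.

```r
library(dplyr)
library(stringr)

# Hypothetical rows imitating the two problems described above:
# a serial-number date alongside a text date, and symbols in the spend
toy_df <- tibble(
  return_month = c("42917", "01/07/17"),
  evidenced_spend = c("£1,234.50", "-200")
)

toy_clean <- toy_df |>
  mutate(
    # Strip everything except digits, the decimal point and the minus sign
    evidenced_spend = str_remove_all(evidenced_spend, "[^0-9.-]") |>
      as.numeric(),
    # Serial day numbers count from 1899-12-30; text dates become NA
    serial_date = as.Date(
      suppressWarnings(as.numeric(return_month)),
      origin = "1899-12-30"
    ),
    # Day/month/two-digit-year text
    text_date = as.Date(return_month, format = "%d/%m/%y"),
    # Take the serial interpretation first, then the text one
    date = coalesce(serial_date, text_date)
  )

toy_clean
```

Both rows should resolve to the same month, which is the point: two encodings, one date.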

Now we can summarise and visualise how the SME share has changed over time using the ggplot2 package.

share_df <- gcloud_df |> 
  group_by(date) |> 
  summarise(
    evidenced_spend = sum(evidenced_spend, na.rm = TRUE),
    sme_spend = sum(sme_spend, na.rm = TRUE),
    pct = sme_spend / evidenced_spend
  )

last_date <- gcloud_df |> 
  arrange(desc(date)) |> 
  slice(1) |> 
  pull(date) |> 
  date_format(format = "%B %d, %Y")

share_df |> 
  ggplot(aes(date, pct)) +
  geom_point(colour = cols[4]) +
  geom_smooth(colour = cols[2], fill = cols[3]) +
  scale_y_continuous(labels = label_percent()) +
  scale_x_date(date_breaks = "years", date_labels = "%Y") +
  labs(
    x = NULL, y = NULL,
    title = glue("SME Share of G-Cloud to {last_date}"), 
    subtitle = "Dots = % Monthly Sales via SMEs",
    caption = "Source: GOV.UK G-Cloud Sales"
  )
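
The monthly share plotted above is a ratio of sums: total SME spend in the month divided by total spend in the month. A minimal sketch on made-up figures (the tibble `toy_sales` and its numbers are hypothetical) shows the arithmetic:

```r
library(dplyr)

# Hypothetical monthly sales: one SME row and one non-SME row per month
toy_sales <- tibble(
  date = as.Date(c("2017-06-01", "2017-06-01", "2017-07-01", "2017-07-01")),
  sme_status = c("SME", "Non-SME", "SME", "Non-SME"),
  evidenced_spend = c(40, 60, 30, 90)
)

toy_share <- toy_sales |>
  # SME spend is the row's spend for SMEs and zero otherwise
  mutate(sme_spend = if_else(sme_status == "SME", evidenced_spend, 0)) |>
  group_by(date) |>
  summarise(
    evidenced_spend = sum(evidenced_spend),
    sme_spend = sum(sme_spend),
    # Uses the two monthly totals computed just above
    pct = sme_spend / evidenced_spend
  )

toy_share
```

Here June comes out at 40 / 100 and July at 30 / 120, so big months carry their full weight rather than being averaged row by row.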

Sales grew steadily to a cumulative £2.4bn by July 2017. And as the volume of sales grew, an increasingly clear picture of sustained growth in the SME share emerged. However, in the last few months, SMEs lost a little ground.

Dig a little deeper, and one can also see variation by sub-sector. And that’s after setting aside those buyers with cumulative G-Cloud spend below £100k, where large enterprise suppliers are less inclined to compete.

sector_df <- gcloud_df |> 
  mutate(sector = if_else(
    sector %in% c("Central Government", "Local Government", "Police", "Health"),
    sector,
    "Other Sector"
  )) |> 
  group_by(customer_name, sector) |> 
  summarise(
    evidenced_spend = sum(evidenced_spend, na.rm = TRUE),
    sme_spend = sum(sme_spend, na.rm = TRUE),
    pct = sme_spend / evidenced_spend
  ) |> 
  filter(evidenced_spend >= 100000) |>  
  group_by(sector) |> 
  mutate(median_pct = median(pct)) |> 
  ungroup() |> 
  mutate(sector = fct_reorder(sector, median_pct))

n_df <- sector_df |> group_by(sector) |> summarise(n = n())

sector_df |> 
  ggplot(aes(sector, pct)) +
  geom_boxplot(outlier.shape = NA, fill = cols[3]) +
  geom_jitter(width = 0.2, alpha = 0.5, colour = cols[2]) +
  geom_label(aes(y = .75, label = glue("n = {n}")),
    data = n_df,
    fill = cols[1], colour = "white"
  ) +
  scale_y_continuous(labels = label_percent()) +
  labs(
    x = NULL, y = NULL,
    title = glue("SME Share of G-Cloud to {last_date}"),
    subtitle = "% Sales via SMEs for Buyers with Cumulative Sales >= £100k",
    caption = "Source: GOV.UK G-Cloud Sales"
  )
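
A brief aside on the jittering itself, as a minimal sketch on made-up data. The data frame `toy_pct`, the seed, and the choice of `height = 0` are my assumptions for illustration, not settings from the plot above. Jitter adds a small random horizontal offset so that buyers with similar shares do not sit on top of one another; fixing the seed makes the layout reproducible, and a zero height leaves the percentages themselves untouched.

```r
library(ggplot2)

# Hypothetical shares with deliberate ties that would over-plot
toy_pct <- data.frame(
  sector = rep(c("Health", "Police"), each = 5),
  pct = c(0.5, 0.5, 0.5, 0.6, 0.6, 0.2, 0.2, 0.3, 0.3, 0.3)
)

p <- ggplot(toy_pct, aes(sector, pct)) +
  # outlier.shape = NA hides box plot outliers, since the points show them
  geom_boxplot(outlier.shape = NA) +
  # Spread points sideways only, with a fixed seed for a repeatable layout
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))

# Inspect the positions the second layer will actually draw
jittered <- layer_data(p, 2)
jittered[, c("x", "y")]
```

The `x` values land within 0.2 of each category position while `y` stays exactly as supplied.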

The box plot, overlaid with jittered points to avoid over-plotting, shows:

  • Central government, with its big-spending departments, and police favouring large suppliers. This may reflect, among other things, their ability to scale.
  • Local government and health, in contrast, favouring SMEs. And this despite their looser tether to central government strategy.

So, irrespective of whether service integration is taken in-house or handled by a service integrator, large enterprise suppliers have much to offer:

  • The ability to deliver at scale;
  • A breadth and depth of capabilities exploitable during discovery to better articulate the “art of the possible”;
  • The reassurance that extensive capability is always on hand.

SMEs offer flexibility, fresh thinking and broader competition, often deploying their resources and building their mission around a narrower focus. They tend to do one thing, or a few things, exceptionally well.

These data are explored further in Six months later and Can Ravens Forecast.

R Toolbox

The table below summarises the packages and functions used in this post. Recording them here lets me build a separate toolbox visualisation of package and function usage across all posts.

Package Function
base as.Date[1]; as.numeric[1]; c[1]; conflicts[1]; cumsum[1]; function[1]; is.na[1]; search[1]; sum[5]
clock date_format[1]; date_parse[1]
dplyr filter[6]; arrange[3]; desc[3]; group_by[5]; if_else[7]; mutate[8]; n[1]; pull[1]; slice[1]; summarise[4]; ungroup[1]
forcats fct_reorder[1]
ggplot2 aes[3]; geom_boxplot[1]; geom_jitter[1]; geom_label[1]; geom_point[1]; geom_smooth[1]; ggplot[2]; labs[2]; scale_x_date[1]; scale_y_continuous[2]; theme_bw[1]; theme_set[1]
glue glue[3]
janitor clean_names[1]
kableExtra kbl[1]
purrr map[1]; map2_dfr[1]; possibly[1]; set_names[1]
readr parse_number[1]; read_csv[1]; read_lines[1]
scales label_percent[2]
stats median[1]
stringr str_c[6]; str_count[1]; str_detect[2]; str_remove[2]; str_remove_all[2]; str_starts[1]
tibble as_tibble[1]; tibble[2]; enframe[1]
tidyr unnest[1]
wesanderson wes_palette[1]
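
For the curious, one way a tally like this could be produced is sketched below. This is an assumption on my part about a possible approach, not necessarily how the table above was built: it parses a hypothetical snippet held in `code` with base R and counts the function-call tokens. Attributing each function to its package is left out.

```r
# Hypothetical snippet standing in for the code of a post
code <- "
x <- sum(c(1, 2, 3))
y <- sum(x, 4)
"

# Parse with source references kept so that token data is available
parsed <- parse(text = code, keep.source = TRUE)
tokens <- utils::getParseData(parsed)

# Keep only the tokens that are function calls, then count them
calls <- tokens$text[tokens$token == "SYMBOL_FUNCTION_CALL"]
usage <- sort(table(calls), decreasing = TRUE)

usage
```

In this snippet `sum` is called twice and `c` once, matching the `function[count]` notation used in the table.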
Posted: September 12, 2017
Updated: April 19, 2022
Length: 4 minute read, 810 words
Categories: R