Skip to contents

This tutorial introduces the SGL language as well as usage of the rsgl package.

Setup

For use with the examples in this tutorial, we will create an in-memory DuckDB database and load it with two tables, cars and trees.

library(rsgl)
#> 
#> Attaching package: 'rsgl'
#> The following objects are masked from 'package:datasets':
#> 
#>     cars, trees
library(duckdb)
#> Loading required package: DBI

con <- dbConnect(duckdb())
dbWriteTable(con, "cars", cars)
dbWriteTable(con, "trees", trees)

Let’s query each to view a sample of data:

dbGetQuery(con, "
  select *
  from cars
  limit 5
")
#>   car_id horsepower miles_per_gallon origin year
#> 1      1        130               18    USA 1970
#> 2      2        165               15    USA 1970
#> 3      3        150               18    USA 1970
#> 4      4        150               16    USA 1970
#> 5      5        140               17    USA 1970
dbGetQuery(con, "
  select *
  from trees
  limit 5
")
#>   tree_id  age circumference
#> 1       1  118            30
#> 2       1  484            58
#> 3       1  664            87
#> 4       1 1004           115
#> 5       1 1231           120

dbGetPlot

The primary interface to rsgl is the dbGetPlot function, which takes a DBI database connection and a SGL statement and returns the corresponding plot. Although the examples in this tutorial use a connection to a DuckDB database, dbGetPlot will accept any DBI connection.

The SGL Language

The From Clause

The from keyword precedes a data source specification, which is often the name of a table in the database. Here, we specify the cars table as the data source.

dbGetPlot(con, "
  visualize
    horsepower as x,
    miles_per_gallon as y
  from cars
  using points
")

This is similar in usage to the from keyword in SQL, except that only a single data source is allowed (i.e., a comma-separated list of table names is not valid). If data from multiple sources or pre-processing of data is necessary, then a SQL subquery can be provided:

dbGetPlot(con, "
  visualize
    horsepower as x,
    miles_per_gallon as y
  from (
    select *
    from cars
    where origin = 'Japan'
  )
  using points
")

The Using Clause

The using keyword precedes the name of the geometric object(s) that will represent the data. Following ggplot2 terminology, these geometric objects are referred to as geoms. Our previous examples demonstrated representing data with point geoms.

The Visualize Clause

The visualize keyword precedes the aesthetic-to-column mapping, which maps perceivable traits of the geoms to data source columns. Our prior examples mapped the x and y positions of the point geoms to data source columns. However, aesthetics may be non-positional, as shown below. The visualize keyword most closely resembles the select keyword within SQL.

dbGetPlot(con, "
  visualize
    horsepower as x,
    miles_per_gallon as y,
    origin as color
  from cars
  using points
")

In addition to mapping aesthetics to columns, aesthetics can be mapped to expressions that include transformations and aggregations.

Column-Level Transformations and Aggregations

SGL supports column-level transformations and aggregations, as shown below where a binning transformation is combined with a count aggregation to produce a histogram on miles_per_gallon.

dbGetPlot(con, "
  visualize
    bin(miles_per_gallon) as x,
    count(*) as y
  from cars
  group by
    bin(miles_per_gallon)
  using bars
")

Here we see that, similarly to SQL, SGL has a group by clause where aggregation groupings are specified.

Although SQL itself supports column-level transformation, grouping, and aggregation, it is desirable for SGL to provide additional support for these operations. Statistical graphics often require operations such as binning that are not natively supported by SQL. Additionally, SGL’s column-level transformations and aggregations are performed after scaling, which cannot easily be replicated using SQL. Below is an example of this feature, where binning and counting are applied after log scaling, resulting in a log-scaled histogram.

dbGetPlot(con, "
  visualize
    bin(miles_per_gallon) as x,
    count(*) as y
  from cars
  group by
    bin(miles_per_gallon)
  using bars
  scale by
    log(x)
")

The Collect By Clause

In SGL, a geom is classified as individual if it represents each record (after transformation and aggregation) by a distinct geometric object. Alternatively, a geom is classified as collective if it represents multiple records by one geometric object. For example, points and lines are individual and collective geoms, respectively, as shown below where the same data is represented using each.

dbGetPlot(con, "
  visualize
    year as x,
    avg(miles_per_gallon) as y
  from cars
  group by
    year
  using points
")

dbGetPlot(con, "
  visualize
    year as x,
    avg(miles_per_gallon) as y
  from cars
  group by
    year
  using line
")

For collective geoms, the collection of records to represent by each object is determined automatically using reasonable defaults. This behavior can be overridden by providing explicit collections in the collect by clause. The collect by clause is similar to the group by clause, except that rather than defining groups to aggregate by, the collect by clause defines collections of records to be represented by one object. Below we see an example where the default collection is not ideal, followed by a preferrable explicit collection.

dbGetPlot(con, "
  visualize
    age as x,
    circumference as y
  from trees
  using line
")

dbGetPlot(con, "
  visualize
    age as x,
    circumference as y
  from trees
  collect by
    tree_id
  using lines
")

Geom Qualifiers

Geom qualifiers modify how geoms positionally represent data, and are specified as keywords that precede geom names within the using clause. Geom qualifiers can largely be classified into two groups, statistical qualifiers and collision qualifiers. Statistical qualifiers modify the positional representation via statistical transformation, such as linear regression:

dbGetPlot(con, "
  visualize
    age as x,
    circumference as y
  from trees
  using regression line
")

Collision qualifiers specify positional adjustments that are relevant to overlapping objects. Below we see an example of using the jittered qualifier to add a small amount of random variation so that overlapping points are discernible.

Without the jittered qualifier:

dbGetPlot(con, "
  visualize
    origin as x,
    miles_per_gallon as y
  from cars
  using points
")

With the jittered qualifier:

dbGetPlot(con, "
  visualize
    origin as x,
    miles_per_gallon as y
  from cars
  using jittered points
")

The Layer Operator

The graphics in previous sections contain a single layer of geometric objects. However, it is common to use multiple layers of objects to represent data in a single graphic. In SGL, multiple layers can be combined using the layer operator. For example, we can layer a regression line on a scatterplot:

dbGetPlot(con, "
  visualize
    horsepower as x,
    miles_per_gallon as y
  from cars
  using points

  layer

  visualize
    horsepower as x,
    miles_per_gallon as y
  from cars
  using regression line
")

Layers often share a data source and aesthetic mapping. To reduce verbosity in these cases, the layer operator can be applied directly to geom expressions:

dbGetPlot(con, "
  visualize
    horsepower as x,
    miles_per_gallon as y
  from cars
  using (
    points
    layer
    regression line
  )
")

Layers may have different data sources and aesthetic mappings. However, a graphic has a single scale for each aesthetic, regardless of the number of layers. As a result, a given aesthetic must be mapped to consistent type across all layers where it is present, e.g. an aesthetic cannot be mapped to a numerical type in one layer and a categorical type in another.

The Scale By Clause

Each mapped aesthetic has a scale that determines how data values are mapped to the corresponding visual property. Scales are determined implicitly by default, but can be explicitly specified within the scale by clause. Below we specify a log scale for the x and y aesthetics, overriding the default linear scaling for numerical mappings:

dbGetPlot(con, "
  visualize
    horsepower as x,
    miles_per_gallon as y
  from cars
  using (
    points
    layer
    regression line
  )
  scale by
    log(x), log(y)
")

Scaling is performed prior to column-level transformations and aggregations. Additionally, scaling is performed prior to any positional modifications specified by geom qualifiers, e.g., the regression calculation above is performed after log-scaling the x and y aesthetics.

Scaling functions such as log are applied to aesthetic names rather than column names as these are modifications to aesthetic scales rather than actual data values. This distinction is demonstrated below, where a log function is instead applied to actual data values in a SQL subquery:

dbGetPlot(con, "
  visualize
    log_hp as x,
    log_mpg as y
  from (
    select
      log(horsepower) as log_hp,
      log(miles_per_gallon) as log_mpg
    from cars
  )
  using (
    points
    layer
    regression line
  )
")

Coordinate Systems

The graphics in prior examples use Cartesian coordinates, but alternative coordinate systems are valid. In SGL, the coordinate system is inferred from the positional aesthetics in the visualize clause. For example, x and y aesthetics imply Cartesian coordinates, whereas theta and r imply polar coordinates.

The examples below display the same information in Cartesian and polar coordinates. In the grammar of graphics (which SGL is based on), pie charts are stacked bar charts in a polar coordinate system. In SGL, the bar geom is stacked by default.

dbGetPlot(con, "
  visualize
    count(*) as y,
    origin as color
  from cars
  group by
    origin
  using bars
")

dbGetPlot(con, "
  visualize
    count(*) as theta,
    origin as color
  from cars
  group by
    origin
  using bars
")

Since a SGL statement may have multiple layers, it may also have multiple visualize clauses, each with positional aesthetic mappings. Since a graphic has a single coordinate system, the positional aesthetics referenced across layers must be consistent, e.g. one layer cannot reference x and y aesthetics while another references theta and r.

The Facet By Clause

Faceting generates small multiples where each panel represents a different partition of the source data. Partitioning is determined by the unique values for expressions specified in the facet by clause. Faceting by a single expression generates horizontal panels by default, but this can be modified with an orientation keyword:

dbGetPlot(con, "
  visualize
    horsepower as x,
    miles_per_gallon as y
  from cars
  using points
  facet by
    origin
")

dbGetPlot(con, "
  visualize
    horsepower as x,
    miles_per_gallon as y
  from cars
  using points
  facet by
    origin vertically
")

Two facet expressions may be specified, in which case one expression is represented horizontally while the other is represented vertically:

dbGetPlot(con, "
  visualize
    horsepower as x,
    miles_per_gallon as y
  from (
    select
      *,
      case
        when year < 1977
        then '< 1977'
        else '>= 1977'
      end as 'era'
    from cars
  )
  using points
  facet by
    era,
    origin
")

In SGL, facets are a graphic-level property, meaning that each SGL statement has at most one facet by clause, and that each layer is faceted accordingly.

The Title Clause

Titles for aesthetic scales are automatically determined from aesthetic mappings. However, this can be overridden by providing explicit titles in the title clause:

dbGetPlot(con, "
  visualize
    horsepower as x,
    miles_per_gallon as y
  from cars
  using points
  title
    x as 'Horsepower',
    y as 'Miles Per Gallon'
")

Next steps

  • Reference — details on the specific geoms, aesthetics, qualifiers, transformations, aggregations, and scales available in rsgl.
  • Example gallery — a collection of plots generated with rsgl.
  • SGL Paper — covers the language in greater depth, including SGL’s underlying grammar of graphics.