2 Coding Principles

2.1 Simplify

2.1.1 Data Types

Use the simplest data type reasonable. A simpler data type is less likely contain unintended values. As we have seen, a string variable called gender can simultaneously contain the values “m”, “f”, “F”, “Female”, “MALE”, “0”, “1”, “2”, “Latino”, ““, and NA. On the other hand, a boolean variable gender_male can be only FALSE, TRUE, and NA.1

SQLite does not have a dedicated datatype, so you must resort to storing it as 0, 1 and NULL values. Because a caller can’t assume that an ostensible boolean SQLite variable contains only those three values, the variable should be checked.

Once you have cleaned a variable in your initial ETL files (like an Ellis), establish boundaries so you do not have to spend time in the downstream files verifying that no bad values have been introduced. As a small bonus, simpler data types are typically faster, consume less memory, and translate more cleanly across platforms.

Within R, numeric-ish variables can be represented by the following four data types. Use the simplest type that adequately captures the information. logical is the simplest and numeric is the most flexible.

  1. logical/boolean/bit,
  2. integer,
  3. bit64::integer64, and
  4. numeric/double-precision floats.

Categorical variables have a similar spectrum. After logical types, factors are more restrictive and less flexible than characters.2

  1. logical/boolean/bit,
  2. factor, and
  3. character.

2.1.2 Categorical Levels

When a boolean variable would be too restrictive and a factor or character is required, choose the simplest representation. Where possible:

  1. Use only lower case (e.g., ‘male’ instead of ‘Male’ for the gender variable).
  2. avoid repeating the variable in the level (e.g., ‘control’ instead of ‘control condition’ for the condition variable).

2.1.3 Recoding

Almost every project recodes variables. Choose the simplest function possible. The functions at the top are easier to read and harder to mess up than the functions below it

  1. Leverage existing booleans: Suppose you have the logical variable gender_male (which can be only TRUE, FALSE, or NA). Writing gender_male == TRUE or gender_male == FALSE will evaluate to a boolean –that’s unnecessary because gender_male is already a boolean.

    1. Testing for TRUE: use the variable by itself (i.e., gender_male instead of gender_male == TRUE).

    2. Testing for FALSE: use !. Write !gender_male instead of gender_male == FALSE or gender_male != TRUE.

  2. dplyr::coalesce(): The function evaluates a single variable and replaces NA with values from another variable.

    A coalesce like

    visit_completed <- dplyr::coalesce(visit_completed, FALSE)

    is much easier to read and not mess up than

    visit_completed <- dplyr::if_else(!is.na(visit_completed), visit_completed, FALSE)
  3. dplyr::na_if() transforms a nonmissing value into an NA.

    Recoding missing values like

    birth_apgar <- dplyr::na_if(birth_apgar, 99)

    is easier to read and not mess up than

    birth_apgar <- dplyr::if_else(birth_apgar == 99, NA_real_, birth_apgar)
  4. <= (or a similar comparison operator): Compare two quantities to output a boolean variable. The parentheses are unnecessary, but can help readability. If either value is NA, then the result is NA.

    Notice that we prefer to order the variables like a number line. When the result is TRUE, the smaller value is to the left of the larger value.

    dob_in_the_future   <- (Sys.Date() < dob)
    dod_follows_dob     <- (dob <= dod)
    premature           <- (gestation_weeks < 37)
    big_boy             <- (threshold_in_kg <= birth_weight_in_kg)
  5. dplyr::if_else(): The function evaluates a single boolean variable or expression. The output branches to only three possibilities: the input is (a) true, (b) false, or (c) (optionally) NA. Notice that unlike the <= operator, dplyr::if_else() lets you specify a value if the input expression evaluates to NA.

    date_start  <- as.Date("2017-01-01")
    
    # If a missing month element needs to be handled explicitly.
    stage       <- dplyr::if_else(date_start <= month, "post", "pre", missing = "missing-month")
    
    # Otherwise a simple boolean output is sufficient.
    stage_post  <- (date_start <= month)

    If it is important that the reader understand that an input expression of NA will produce an NA, consider using dplyr::if_else(). Even though these two lines are equivalent, a casual reader may not consider that stage_post could be NA.

    stage_post  <- (date_start <= month)
    stage_post  <- dplyr::if_else(date_start <= month, TRUE, FALSE, missing = NA)
  6. dplyr::between(): The function evaluates a numeric x against a left and a right boundary to return a boolean value. The output is TRUE if x is inside the boundaries or equal to either boundary (i.e., the boundaries are inclusive). The output is FALSE if x is outside either boundary.

    too_cold      <- 60
    too_hot       <- 88
    goldilocks_1  <- dplyr::between(temperature, too_cold, too_hot)
    
    # This is equivalent to the previous line.
    goldilocks_2  <- (too_cold <= temperature & temperature <= too_hot)

    If you need an exclusive boundary, abandon dplyr::between() and specify it exactly.

    # Left boundary is exclusive
    goldilocks_3  <- (too_cold < temperature & temperature <= too_hot)
    
    # Both boundaries are exclusive
    goldilocks_4  <- (too_cold < temperature & temperature <  too_hot)

    If your code starts to nest dplyr::between() calls inside dplyr::if_else(), consider base::cut().

  7. base::cut(): The function transforms a single numeric variable into a factor. Its range is cut into different segments/categories on the one-dimensional number line. The output branches to single discrete value (either a factor-level or an integer). Modify the right parameter to FALSE if you’d like the left/lower bound to be inclusive (which tends to be more natural for me).

    mtcars |>
      tibble::as_tibble() |>
      dplyr::select(
        disp,
      ) |>
      dplyr::mutate(
        # Example of a simple inequality operator (see two bullets above)
        muscle_car            = (300 <= disp),
    
        # Divide `disp` into three levels.
        size_default_labels   = cut(disp, breaks = c(-Inf, 200, 300, Inf), right = F),
    
        # Divide `disp` into three levels with custom labels.
        size_cut3             = cut(
          disp,
          breaks = c(-Inf,   200,      300,   Inf),
          labels = c(  "small", "medium", "big"),
          right = FALSE  # Is the right boundary INclusive ('FALSE' is an EXclusive boundary)
        ),
    
        # Divide `disp` into five levels with custom labels.
        size_cut5             = cut(
          disp,
          breaks = c(-Inf,         100,            150,            200,      300,   Inf),
          labels = c(  "small small", "medium small", "biggie small", "medium", "big"),
          right = FALSE
        ),
      )
  8. dplyr::recode(): The function accepts an integer or character variable. The output branches to a single discrete value. This example maps integers to strings.

    # https://www.census.gov/quickfacts/fact/note/US/RHI625219
    race_id        <- c(1L, 2L, 1L, 4L, 3L, 4L, 2L, NA_integer_)
    race_id_spouse <- c(1L, 1L, 2L, 3L, 3L, 4L, 5L, NA_integer_)
    race <-
      dplyr::recode(
        race_id,
        "1"      = "White",
        "2"      = "Black or African American",
        "3"      = "American Indian and Alaska Native",
        "4"      = "Asian",
        "5"      = "Native Hawaiian or Other Pacific Islander",
        .missing = "Unknown"
      )

    If multiple variables have the same mapping, define the mapping once in a named vector, and pass it for multiple calls to dplyr::recode(). Notice that the two variables race and race_spouse use the same mapping.3

    mapping_race <- c(
      "1" = "White",
      "2" = "Black or African American",
      "3" = "American Indian and Alaska Native",
      "4" = "Asian",
      "5" = "Native Hawaiian or Other Pacific Islander"
    )
    race <-
      dplyr::recode(
        race_id,
        !!!mapping_race,
        .missing = "Unknown"
      )
    race_spouse <-
      dplyr::recode(
        race_id_spouse,
        !!!mapping_race,
        .missing = "Unknown"
      )

    Tips for dplyr::recode():

    • A reusable dedicated mapping vector is very useful for surveys with 10+ Likert items with consistent levels like “disagree”, “neutral”, “agree”.
    • Use dplyr::recode_factor() to map integers to factor levels.
    • forcats::fct_recode() is similar. We prefer the .missing parameter of dplyr::recode() that translates an NA into an explicit value.
    • When using the REDCap API, these functions help convert radio buttons to a character or factor variable.
  9. lookup table: It is feasible to recode 6 levels of race directly in R, but it’s less feasible to recode 200 provider names. Specify the mapping in a csv, then use readr to convert the csv to a data.frame, and finally left join it.

  10. dplyr::case_when(): The function is the most complicated because it can evaluate multiple input variables. Also, multiple cases can be true, but only the first output is returned. This ‘water fall’ execution helps in complicated scenarios, but is overkill for most.

2.2 Defensive Style

2.2.1 Qualify functions

Try to prepend each function with its package. Write dplyr::filter() instead of filter(). When two packages contain public functions with the same name, the package that was most recently called with library() takes precedent. When multiple R files are executed, the packages’ precedents may not be predictable. Specifying the package eliminates the ambiguity, while also making the code easier to follow. For this reason, we recommend that almost all R files contain a ‘load-packages’ chunk.

See the Google Style Guide for more about qualifying functions.

Some exceptions exist, including:

2.2.2 Date Arithmetic

Don’t use the minus operator (i.e., -) to subtract dates. Instead use as.integer(difftime(stop, start, units="days")). It’s longer but protects from the scenario that start or stop are changed upstream from a date to a datetime. In that case, stop - start returns the number of seconds between the two points, not the number of days.

2.2.3 Excluding Bad Cases

Some variables are critical to the record, and if it’s missing, you don’t want or trust any of its other values. For instance, a hospital visit record rarely useful with a null patient ID. In these cases, prevent the record from passing through the ellis.

In this example, we’ll presume we cannot trust a patient record if it lacks a clean date of birth (dob).

  1. Define the permissible range, in either the ellis’s declare-globals chunk, or in the config-file. (We’ll use the config file for this example.) We’ll exclude anyone born before 2000, or after tomorrow. Even though it’s illogical for someone in a retrospective record to be born tomorrow, consider bending a little for small errors.

    range_dob   : !expr c(as.Date("2000-01-01"), Sys.Date() + lubridate::days(1))
  2. In the tweak-data chunk, use OuhscMunge::trim_date() to set the cell to NA if it falls outside an acceptable range. After dplyr::mutate(), call tidyr::drop_na() to exclude the entire record, regardless if (a) it was already NA, or (b) was “trimmed” to NA.

    ds <-
      ds |>
      dplyr::mutate(
        dob = OuhscMunge::trim_date(dob, config$range_dob)
      ) |>
      tidyr::drop_na(dob)
  3. Even though it’s overkill after trimming, (eventually) verify the variable for three reasons: (a) there’s a chance that the code above isn’t working as expected, (b) later code might have introduced bad values, and (c) it clearly documents to a reader that dob was included in this range at this stage of the pipeline.

    checkmate::assert_date(ds$dob, any.missing=F, lower=config$range_dob[1], upper=config$range_dob[2])

2.2.4 Throw errors for bad cells

The checkmate::assert_*() functions will throw an error and stop R’s execution when encountering a vector that violates the constraints you specified. The previous snippet will alert you if

  • ds$dob is not a date,
  • ds$dob has at least one NA value, or
  • ds$dob has value earlier than config$range_dob[1] or later than config$range_dob[2].

The package has a family of functions that accommodate many types of vectors. Some common conditions to verify are:

  • the vector’s values are unique, which arises when you’re about to upload a primary key to a database (e.g., a patient ID to the patient table),

    checkmate::assert_integer(ds$pt_id, unique = TRUE)
  • a vector’s string should follow a strict pattern (e.g., the patient ID is a “A” or “B”, followed by 4 digits)

    checkmate::assert_character(ds$pt_id, pattern = "^[AB]\\d{4}$")
  • the database doesn’t accept names longer than 50 characters

    checkmate::assert_character(ds$name_first, min.chars = 50)
    # or
    checkmate::assert_character(ds$name_first, pattern = "^.{0,50}$")

The pattern argument is ultimately passed to base::grepl(), which leverage regular expressions.

2.2.5 Throw errors for bad conditions

Sometimes a dataset smells fishy even though no single cell violates a constraint. Send up a flare if it’s kinda bad, yet stop the execution if it really stinks.

This is especially important for recurring scripts that process new datasets that are never inspected by a human, such as a daily forecast. Even though today’s incoming dataset is fine, you shouldn’t trust next month’s. At worst, the lonely test never catches a violation (and you wasted 5 minutes). At best, it catches a problem that would have proceeded undetected and compromised your downstream analyses.

The following snippet asserts it’s acceptable that 2% of patients are missing an age, but it should never get worse than 5%. Therefore it throws an error when the missingness exceeds 5% and it throws an warning if it exceeds 2%.

# Simulate a vector of ages.
ds <- tibble::tibble(
  age = sample(c(NA, 1:19), size = 100, replace = TRUE)
)

# Define thresholds for errors & warnings.
threshold_error     <- .05
threshold_warning   <- .02

# Calculate proportion of missing cells.
missing_proportion  <- mean(is.na(ds$age))

# Accompany the error/warning with an informative message.
if (threshold_error < missing_proportion) {
  stop(
    "The proportion of missing `age` values is ", missing_proportion,
    ", but it shouldn't exceed ", threshold_error, "."
  )
} else if (threshold_warning < missing_proportion) {
  warning(
    "The proportion of missing `age` values is ", missing_proportion,
    ", but ideally it stays below ", threshold_warning, "."
  )
}