C Snippets
C.1 Reading External Data
C.1.1 Reading from Excel
Background: Avoid Excel for the reasons previously discussed. But if there isn't another good option, be protective. readxl::read_excel() allows you to specify column types, but not column order. The names of col_types are ignored by readxl::read_excel(). To defend against roaming columns (e.g., the files changed over time), testit::assert() that the order is what you expect.
See the readxl vignette, Cell and Column Types, for more info.
Last Modified: 2019-12-12 by Will
# ---- declare-globals ---------------------------------------------------------
config <- config::get()
# cat(sprintf(' `%s` = "text",\n', colnames(ds)), sep="") # 'text' by default --then change where appropriate.
col_types <- c(
`Med Rec Num` = "text",
`Admit Date` = "date",
`Tot Cash Pymt` = "numeric"
)
# ---- load-data ---------------------------------------------------------------
ds <- readxl::read_excel(
path = config$path_admission_charge,
col_types = col_types
# sheet = "dont-use-sheets-if-possible"
)
testit::assert(
"The order of column names must match the expected list.",
names(col_types) == colnames(ds)
)
# Alternatively, this provides more detailed error messages than `testit::assert()`
# testthat::expect_equal(
# colnames(ds),
# names(col_types),
# label = "worksheet's column name (x)",
# expected.label = "col_types' name (y)"
# )
C.1.2 Removing Trailing Comma from Header
Background: Occasionally a Meditech extract will have an extra comma at the end of the 1st line. For each subsequent line, readr::read_csv() appropriately throws a new warning that it is missing a column. This warning flood can mask real problems.
Explanation: This snippet (a) reads the csv as plain text, (b) removes the final comma, and (c) passes the plain text to readr::read_csv() to convert it into a data.frame.
Instruction: Modify Dx50 Name to the name of the final (real) column.
Real Example: truong-pharmacist-transition-1 (Accessible to only CDW members.)
Last Modified: 2019-12-12 by Will
# The next two lines remove the trailing comma at the end of the 1st line.
raw_text <- readr::read_file(path_in)
raw_text <- sub("^(.+Dx50 Name),", "\\1", raw_text)
ds <- readr::read_csv(raw_text, col_types=col_types)
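To confirm the substitution worked before parsing, a quick check in the same spirit as the assert in C.1.1 (assuming Dx50 Name should be the final column):
first_line <- strsplit(raw_text, "\r?\n")[[1]][1]  # isolate the header row
testit::assert(
  "The header must not end with a trailing comma.",
  !grepl(",$", first_line)
)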
C.1.3 Reading Large Files with vroom
Background: When incoming data files are too large to comfortably read with readr, we use vroom. The two packages are developed by the same group and might be combined in the future.
Explanation: This snippet defines the col_types list with names to mimic our approach of using readr. There are some small differences from our readr approach:
1. col_types is a list instead of a readr::cols_only object.
2. The call to vroom::vroom() passes col_names = names(col_types) explicitly.
3. If the data file contains columns we don't need, we define them in col_types anyway; vroom needs to know the file structure if it's missing a header row. (A col_skip() alternative is sketched after the code below.)
Real Example: akande-medically-complex-1 (Accessible to only CDW members.) These files did not have a header of variable names; the first line of the file is the first data row.
Last Modified: 2020-08-21 by Will
# ---- declare-globals ---------------------------------------------------------
config <- config::get()
col_types <- list(
sak = vroom::col_integer(), # "system-assigned key"
aid_category_id = vroom::col_character(),
age = vroom::col_integer(),
service_date_first = vroom::col_date("%m/%d/%Y"),
service_date_last = vroom::col_date("%m/%d/%Y"),
claim_type = vroom::col_character(),
provider_id = vroom::col_character(),
provider_lat = vroom::col_double(),
provider_long = vroom::col_double(),
provider_zip = vroom::col_character(),
cpt = vroom::col_integer(),
revenue_code = vroom::col_integer(),
icd_code = vroom::col_character(),
icd_sequence = vroom::col_integer(),
vocabulary_coarse_id = vroom::col_integer()
)
# ---- load-data ---------------------------------------------------------------
ds <- vroom::vroom(
file = config$path_ohca_patient,
delim = "\t",
col_names = names(col_types),
col_types = col_types
)
rm(col_types)
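If a column is never used downstream, an alternative to declaring its real type is vroom::col_skip(): vroom still learns the file structure from the entry, but drops the column from the output. A minimal sketch, with a hypothetical unneeded filler column:
col_types <- list(
  sak    = vroom::col_integer(),
  filler = vroom::col_skip(),    # position is parsed; column is discarded
  age    = vroom::col_integer()
)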
C.2 Row Operations
We frequently have to find the mean or sum across columns (within a row).
Finding the mean across many columns
Here are several approaches for finding the mean across columns, without naming each column. Some remarks:
- m1 & m2 are sanity checks for this example. m1 would be clumsy if you have 10+ items. m2 is discouraged because it's brittle: a change in the column order could alter the calculation. We prefer to use grep() to specify a sequence of items.
- Especially for large datasets, I'd lean towards m3 if the items are reasonably complete, and m4 if some participants are missing enough items that their summary score is fishy. In the approaches below, m4 and m6 return the mean only if the participant completed 2 or more items.
- dplyr::rowwise() is convenient, but slow for large datasets.
- If you need a more complex function that's too clumsy to include directly in a mutate() statement, see how the calculation for m6 is delegated to the external function, f6.
- The technique behind nonmissing is pretty cool, because you can apply an arbitrary function to each cell before the cells are summed/averaged.
  - This is in contrast to f6(), which applies to an entire (row-wise) data.frame.
# Isolate the columns to average. Remember the `grep()` approach w/ `colnames()`
columns_to_average <- c("hp", "drat", "wt")
f6 <- function(x) {
# browser()
s <- sum(x, na.rm = TRUE)
n <- sum(!is.na(x))
dplyr::if_else(
2L <= n,
s / n,
NA_real_
)
}
mtcars |>
dplyr::mutate(
m1 = (hp + drat + wt) / 3,
m2 =
rowMeans(
dplyr::across(hp:wt), # All columns between hp & wt.
na.rm = TRUE
),
m3 =
rowMeans(
dplyr::across(!!columns_to_average),
na.rm = TRUE
),
s4 = # Finding the sum (used by m4)
rowSums(
dplyr::across(!!columns_to_average),
na.rm = TRUE
),
nonmissing =
rowSums(
dplyr::across(
!!columns_to_average,
.fns = \(x) { !is.na(x) }
)
),
m4 =
dplyr::if_else(
2 <= nonmissing,
s4 / nonmissing,
NA_real_
)
) |>
dplyr::rowwise() |> # Required for `m5`
dplyr::mutate(
m5 = mean(dplyr::c_across(dplyr::all_of(columns_to_average))),
) |>
dplyr::ungroup() |> # Clean up after rowwise()
dplyr::rowwise() |> # Required for `m6`
dplyr::mutate(
m6 = f6(dplyr::across(!!columns_to_average))
) |>
dplyr::ungroup() |> # Clean up after rowwise()
dplyr::select(
hp,
drat,
wt,
m1,
m2,
m3,
s4,
nonmissing,
m4,
m5,
m6,
)
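As the first comment in the block above hints, grep() scales better than typing each column name when the items share a naming pattern. A minimal sketch, assuming hypothetical item names like item_01 through item_20:
columns_to_average <- grep("^item_\\d{2}$", colnames(ds), value = TRUE)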
C.3 Grooming
C.3.1 Correct for misinterpreted two-digit year
Background: Sometimes the Meditech dates are specified like 1/6/54 instead of 1/6/1954. readr::read_csv() has to choose whether the year is supposed to be '1954' or '2054'. A human can use context to guess that a birth date is in the past (so it guesses 1954), but readr can't (so it guesses 2054). To avoid this and other problems, request dates in an ISO-8601 format.
Explanation: Correct for this in a dplyr::mutate() clause; compare the date value against today. If the date is today or earlier, use it; if the date is in the future, subtract 100 years.
Instruction: For future dates such as loan payments, the direction will flip, as sketched after the code below.
Last Modified: 2019-12-12 by Will
ds |>
dplyr::mutate(
dob = dplyr::if_else(dob <= Sys.Date(), dob, dob - lubridate::years(100))
)
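A sketch of the flipped direction, assuming a hypothetical payment_due column whose dates should never be in the past:
ds |>
  dplyr::mutate(
    payment_due = dplyr::if_else(Sys.Date() <= payment_due, payment_due, payment_due + lubridate::years(100))
  )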
C.5 Correspondence with Collaborators
C.5.1 Excel files
Receiving and storing Excel files should almost always be avoided for the reasons explained in this letter.
We receive extracts as Excel files frequently, and have the following request ready to email the person sending us Excel files. Adapt the bold values like “109.19” to your situation. If you are familiar with their tools, suggest an alternative for saving the file as a csv. Once presented with these Excel gotchas, almost everyone has an ‘aha’ moment and recognizes the problem. Unfortunately, not everyone has flexible software and can adapt easily.
[Start of the letter]
Sorry to be tedious, but could you please resend the extract as a csv file? Please call me if you have questions.
Excel is being too helpful with some of the values, and is essentially corrupting them. For example, a value like 109.19 is interpreted as a number, not a character code (e.g., see cell L14). Because of the limitations of finite precision, this becomes 109.18999999999999773. We can't round it, because there are other values in this column that cannot be cast to numbers, such as V55.0. Furthermore, the "E"s in some codes are incorrectly interpreted as the exponent operator (e.g., "4E5" is converted to 400,000).
Finally, values like 001.0 are being converted to a number and any leading or trailing zeros are dropped (so cells like “1” are not distinguishable from “001.0”).
Unfortunately the problems exist in the Excel file itself. When we import the columns as text, the values are already in their corrupted state.
Please compress/zip the csv if the file is too large to email. We've found that an Excel file is typically 5-10 times larger than a compressed csv.
As much as Excel interferes with our medical variables, we're lucky. It has messed with other branches of science much worse. The genomics field used it for far too long before realizing its mistakes.
What happened? By default, Excel and other popular spreadsheet applications convert some gene symbols to dates and numbers. For example, instead of writing out “Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase,” researchers have dubbed the gene MARCH1. Excel converts this into a date—03/01/2016, say—because that’s probably what the majority of spreadsheet users mean when they type it into a cell. Similarly, gene identifiers like “2310009E13” are converted to exponential numbers (2.31E+19). In both cases, the conversions strip out valuable information about the genes in question.
[End of the letter]
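If a collaborator wants to see the corruption first-hand, both numeric examples from the letter can be reproduced in R, since the same floating-point representation underlies Excel's cells:
print(109.19, digits = 20)  # 109.18999999999999773
as.numeric("4E5")           # 4e+05, i.e., 400000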