Create explicit factor level for missing values. — match

Missing values are converted to a factor level. This explicit assignment can reduce the chances that missing values are inadvertently ignored. It also allows the presence of a missing to become a predictor in models.

match_statistics(d_parent, d_child, join_columns)

Arguments

d_parent: A data.frame of the parent table.
d_child: A data.frame of the child table.
join_columns: The character vector of the column names used to join to parent and child tables.

Value

A numeric array of the following elements:

parent_in_child The count of parent records found in the child table.
parent_not_in_child The count of parent records not found in the child table.
parent_na_any The count of parent records with a NA in at least one of the join columns.
deadbeat_proportion The proportion of parent records not found in the child table.
child_in_parent The count of child records found in the parent table.
child_not_in_parent The count of child records not found in the parent table.
child_na_any The proportion of child records not found in the parent table.
orphan_proportion The count of child records with a NA in at least one of the join columns.

Details

If a nonexistent column is passed to join_columns, an error will be thrown naming the violating column name.

More information about the 'parent' and 'child' terminology and concepts can be found in the Hierarchical Database Model Wikipedia entry, among many other sources.

Note

The join_columns parameter is passed directly to dplyr::semi_join() and dplyr::anti_join().

Author

Will Beasley

Examples

ds_parent <- data.frame(
  parent_id         = 1L:10L,
  letter            = rep(letters[1:5], each=2),
  index             = rep(1:2, times=5),
  dv                = runif(10),
  stringsAsFactors  = FALSE
)
ds_child <- data.frame(
  child_id          = 101:140,
  parent_id         = c(4, 5, rep(6L:14L, each=4), 15, 16),
  letter            = rep(letters[3:12], each=4),
  index             = rep(1:2, each=2, length.out=40),
  dv                = runif(40),
  stringsAsFactors  = FALSE
)

#Match on one column:
match_statistics(ds_parent, ds_child, join_columns="parent_id")
#>     parent_in_child parent_not_in_child       parent_na_any deadbeat_proportion 
#>                7.00                3.00                0.00                0.30 
#>     child_in_parent child_not_in_parent        child_na_any   orphan_proportion 
#>               22.00               18.00                0.00                0.45 

#Match on two columns:
match_statistics(ds_parent, ds_child, join_columns=c("letter", "index"))
#>     parent_in_child parent_not_in_child       parent_na_any deadbeat_proportion 
#>                 6.0                 4.0                 0.0                 0.4 
#>     child_in_parent child_not_in_parent        child_na_any   orphan_proportion 
#>                12.0                28.0                 0.0                 0.7 

## Produce better format for humans to read
match_statistics_display(ds_parent, ds_child, join_columns="parent_id")
#> [1] "\n\nMatch stats for `ds_parent` vs `ds_child` on column(s): \"parent_id\".\n| parent in child     |        7 |\n| parent not in child |        3 |\n| parent na any       |        0 |\n| child in parent     |       22 |\n| child not in parent |       18 |\n| child na any        |        0 |\n| deadbeat proportion | 30.0000% |\n| orphan proportion   | 45.0000% |\n"
match_statistics_display(ds_parent, ds_child, join_columns=c("letter", "index"))
#> [1] "\n\nMatch stats for `ds_parent` vs `ds_child` on column(s): c(\"letter\", \"index\").\n| parent in child     |        6 |\n| parent not in child |        4 |\n| parent na any       |        0 |\n| child in parent     |       12 |\n| child not in parent |       28 |\n| child na any        |        0 |\n| deadbeat proportion | 40.0000% |\n| orphan proportion   | 70.0000% |\n"