Missing values are converted to a factor level. This explicit assignment can reduce the chances that missing values are inadvertently ignored. It also allows the presence of a missing to become a predictor in models.
match_statistics(d_parent, d_child, join_columns)
A numeric
array of the following elements:
parent_in_child
The count of parent records found in the child table.
parent_not_in_child
The count of parent records not found in the child table.
parent_na_any
The count of parent records with a NA
in at least one of the join columns.
deadbeat_proportion
The proportion of parent records not found in the child table.
child_in_parent
The count of child records found in the parent table.
child_not_in_parent
The count of child records not found in the parent table.
child_na_any
The proportion of child records not found in the parent table.
orphan_proportion
The count of child records with a NA
in at least one of the join columns.
If a nonexistent column is passed to join_columns
, an error will be thrown naming the violating column name.
More information about the 'parent' and 'child' terminology and concepts can be found in the Hierarchical Database Model Wikipedia entry, among many other sources.
The join_columns
parameter is passed directly to dplyr::semi_join()
and dplyr::anti_join()
.
ds_parent <- data.frame(
parent_id = 1L:10L,
letter = rep(letters[1:5], each=2),
index = rep(1:2, times=5),
dv = runif(10),
stringsAsFactors = FALSE
)
ds_child <- data.frame(
child_id = 101:140,
parent_id = c(4, 5, rep(6L:14L, each=4), 15, 16),
letter = rep(letters[3:12], each=4),
index = rep(1:2, each=2, length.out=40),
dv = runif(40),
stringsAsFactors = FALSE
)
#Match on one column:
match_statistics(ds_parent, ds_child, join_columns="parent_id")
#> parent_in_child parent_not_in_child parent_na_any deadbeat_proportion
#> 7.00 3.00 0.00 0.30
#> child_in_parent child_not_in_parent child_na_any orphan_proportion
#> 22.00 18.00 0.00 0.45
#Match on two columns:
match_statistics(ds_parent, ds_child, join_columns=c("letter", "index"))
#> parent_in_child parent_not_in_child parent_na_any deadbeat_proportion
#> 6.0 4.0 0.0 0.4
#> child_in_parent child_not_in_parent child_na_any orphan_proportion
#> 12.0 28.0 0.0 0.7
## Produce better format for humans to read
match_statistics_display(ds_parent, ds_child, join_columns="parent_id")
#> [1] "\n\nMatch stats for `ds_parent` vs `ds_child` on column(s): \"parent_id\".\n| parent in child | 7 |\n| parent not in child | 3 |\n| parent na any | 0 |\n| child in parent | 22 |\n| child not in parent | 18 |\n| child na any | 0 |\n| deadbeat proportion | 30.0000% |\n| orphan proportion | 45.0000% |\n"
match_statistics_display(ds_parent, ds_child, join_columns=c("letter", "index"))
#> [1] "\n\nMatch stats for `ds_parent` vs `ds_child` on column(s): c(\"letter\", \"index\").\n| parent in child | 6 |\n| parent not in child | 4 |\n| parent na any | 0 |\n| child in parent | 12 |\n| child not in parent | 28 |\n| child na any | 0 |\n| deadbeat proportion | 40.0000% |\n| orphan proportion | 70.0000% |\n"