6 Prototypical Repository
The following file repository structure has supported a wide spectrum of our projects, ranging from (a) a small, short-term retrospective project with one dataset, one manipulation file, and one analysis report to (b) a large, multi-year project fed by dozens of input files to support multiple statisticians and a sophisticated enrollment process.
Looking beyond any single project, we strongly encourage your team to adopt a common file organization. Pursuing commonality provides multiple benefits:
An evolved and thought-out structure makes it easier to follow good practices and avoid common traps.
Code files are more portable between projects. More code can be reused if both environments refer to the same files and directories, like config.yml, data-public/raw/, and data-public/derived/.
People are more portable between projects. When a person is already familiar with the structure, they start contributing more quickly because they already know to look for statistical reports in analysis/ and to debug problematic file ingestions in manipulation/ files.
If a specific project doesn’t use a directory or file, we recommend retaining the stub. Like the empty chunks discussed in the Prototypical R File chapter, a stub communicates to your collaborator, “this project currently doesn’t use the feature, but when/if it does, this will be the location”. The collaborator can stop their search immediately, instead of digging through unlikely places to rule out that the feature lives elsewhere.
The template that has worked well for us is publicly available at https://github.com/wibeasley/RAnalysisSkeleton. The important files and directories are described below. Please use this as a starting point, and not as a dogmatic prison. Make adjustments when it fits your specific project or your overall team.
6.1 Root
The following files live in the repository’s root directory, meaning they are not in any subfolder/subdirectory.
6.1.1 config.yml
The configuration file is simply a plain-text yaml file read by the config package. It is well-suited when a value has to be coordinated across multiple files.
Also see the discussion of how we use the config file for excluding bad data values and of how the config file relates to yaml, json, and xml.
```yaml
default:
  # To be processed by Ellis lanes
  path_subject_1_raw:   "data-public/raw/subject-1.csv"
  path_mlm_1_raw:       "data-public/raw/mlm-1.csv"

  # Central Database (produced by Ellis lanes).
  path_database:        "data-public/derived/db.sqlite3"

  # Analysis-ready datasets (produced by scribes & consumed by analyses).
  path_mlm_1_derived:   "data-public/derived/mlm-1.rds"

  # Metadata
  path_annotation:      "data-public/metadata/cqi-annotation.csv"

  # Logging errors and messages from automated execution.
  path_log_flow: !expr strftime(Sys.time(), "data-unshared/log/flow-%Y-%m-%d--%H-%M-%S.log")

  # time_zone_local: "America/Chicago" # Force local time, in case remotely run.

  # ---- Validation Ranges & Patterns ----
  range_record_id:      !expr c(1L, 999999L)
  range_dob:            !expr c(as.Date("2010-01-01"), Sys.Date() + lubridate::days(1))
  range_datetime_entry: !expr c(as.POSIXct("2019-01-01", tz = "America/Chicago"), Sys.time())
  max_age:              25
  pattern_mrn:          "^E\\d{9}$"   # An 'E', followed by 9 digits.
```
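Every file in the repo can then retrieve these values with config::get(). Below is a minimal sketch; the ingestion line is only illustrative.

```r
# A minimal sketch of consuming config.yml; run from the repo root.
config <- config::get()   # reads `config.yml` in the working directory by default

# Paths stay coordinated because each is defined in exactly one place.
ds_subject <- readr::read_csv(config$path_subject_1_raw)   # illustrative ingestion

# `!expr` entries are evaluated when the file is read, so they arrive as R objects.
config$range_record_id                    # c(1L, 999999L)
grepl(config$pattern_mrn, "E123456789")   # TRUE -- matches the mrn pattern
```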
6.1.2 flow.R
The workflow of the repo is determined by flow.R. It calls the files (typically R, Python, and SQL) in a specific order, while sending the log messages to a file.
See automation mediators for more details.
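As a minimal sketch (the manipulation file names are hypothetical, and teams wire up logging differently), a small flow.R might look like:

```r
# A minimal sketch of flow.R; the manipulation file names are hypothetical.
config <- config::get()

# Route console output and messages to the log file named in config.yml.
dir.create(dirname(config$path_log_flow), recursive = TRUE, showWarnings = FALSE)
log_con <- file(config$path_log_flow, open = "wt")
sink(log_con, type = "output")
sink(log_con, type = "message")

run_file <- function (path) {
  message("Running `", path, "` at ", Sys.time(), ".")
  source(path, local = new.env())   # a fresh environment isolates each step
}

# Execute the pipeline in a fixed order: ellis lanes first, then reports.
run_file("manipulation/subject-1-ellis.R")
run_file("manipulation/mlm-1-ellis.R")
rmarkdown::render("analysis/report-te-1/report-te-1.Rmd")

# Restore normal console behavior.
sink(type = "message")
sink(type = "output")
close(log_con)
```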
6.1.3 README.md
The readme is automatically displayed when the GitHub repository is opened in a browser. Include all the static information that can quickly orient a collaborator. Common elements include:
- Project Name (see our style guide for naming recommendations)
- Principal Investigator (ultimately accountable for the research) and Project Coordinator (easy contact if questions arise)
- IRB Tracking Number (or whatever oversight committee reviewed and approved the project). This will help communicate more accurately within your larger university or company.
- Abstract or some project description that is already written (for example, part of the IRB submission).
- Documentation locations and resources, such as those described in the documentation/ section below
- Data locations and resources, such as
- database and database server
- REDCap project id and url
- networked file share
- The PI’s expectations and goals for your analysis team
- Likely deadlines, such as grant and conference submission dates
Each directory can have its own readme file, but (for typical analysis projects) we discourage you from putting too much in each individual readme. We’ve found it becomes cumbersome to keep all the scattered files updated and consistent; it’s also more work for the reader to traverse the directory structure reading everything. Our approach is to concentrate most of the information in the repo’s root readme; most of the remaining readmes are static and unchanged across projects (e.g., a generic description of data-public/metadata/).
6.1.4 *.Rproj
The Rproj file stores project-wide settings used by the RStudio IDE, such as how trailing whitespace is handled. The file’s major benefit is that it sets the R session’s working directory, which facilitates good discipline about referencing all the repo’s files from a constant location. Although the plain-text file can be edited directly, we recommend using RStudio’s dialog box. There is good documentation about Rproj settings. If you are unsure, copy this file to the repo’s root directory and rename it to match the repo exactly.
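For orientation, the contents of a typical Rproj file look something like this; the exact fields vary with the RStudio version and your dialog-box choices:

```
Version: 1.0

RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
Encoding: UTF-8

StripTrailingWhitespace: Yes
```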
6.3 analysis/
In a sense, all the other directories exist only to support the contents of analysis/. All the exploratory, descriptive, and inferential statistics are produced by the Rmd files. Each subdirectory is the name of the report (e.g., analysis/report-te-1), and within that directory are four files:
- the R file that contains the meat of the analysis (e.g., analysis/report-te-1/report-te-1.R).
- the Rmd file that serves as the “presentation layer” and calls the R file (e.g., analysis/report-te-1/report-te-1.Rmd).
- the markdown file produced directly by the Rmd (e.g., analysis/report-te-1/report-te-1.md). Some people consider this an intermediate file, because it exists mostly so knitr/rmarkdown/pandoc can produce the eventual html file.
- the html file derived from the markdown file (e.g., analysis/report-te-1/report-te-1.html).

The markdown and html files can be safely discarded because they will be reproduced the next time the Rmd is rendered. All the tables and graphs in the html file are self-contained, meaning the single file is portable and can be emailed without concern for the directory it is read from. Collaborators rarely care about the manipulation files or the analysis code; they almost always look exclusively at the output html.
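One common way to pair the two files is knitr::read_chunk(): the R file marks its sections with chunk labels, and the Rmd executes them by name. This is only a sketch of the pattern, and the chunk label is hypothetical.

```r
# In report-te-1.R, mark a section with a knitr chunk label (hypothetical name):
# ---- load-data ---------------------------------------------------------
ds <- readr::read_rds(config::get()$path_mlm_1_derived)

# In an early chunk of report-te-1.Rmd, point knitr at the R file:
knitr::read_chunk("analysis/report-te-1/report-te-1.R")
# A later, empty chunk declared as {r load-data} then executes the matching
# labeled section of the R file when the Rmd is rendered.
```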
6.4 data-public/
This directory should contain information that is not sensitive and is not proprietary. It SHOULD NOT hold PHI (Protected Health Information), or other information like participant names, social security numbers, or passwords. Files with PHI should not be stored in a GitHub repository, even a private GitHub repository.
Please see data-unshared/ for options for storing sensitive information.
The data-public/ directory typically works best if organized with subdirectories. We commonly use the following subdirectories, which correspond with the Data at Rest chapter.
6.4.1 data-public/raw/
…for the input to the pipelines. These datasets usually represent all the hard work of the data collection.
6.4.2 data-public/metadata/
…for the definitions of the datasets in raw/. For example, gender.csv might translate the values 1 and 2 to male and female. Sometimes a dataset feels natural in either the raw or the metadata subdirectory; if the file would remain unchanged when a subsequent sample is collected, lean towards metadata.
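For instance, a manipulation file might apply that mapping with a join; the column names here are hypothetical.

```r
# A sketch of translating coded values with a metadata lookup.
ds_gender <- readr::read_csv("data-public/metadata/gender.csv")  # columns: gender_id, gender
ds        <- readr::read_csv("data-public/raw/subject-1.csv")

# Attach the human-readable label (e.g., 1 -> "male", 2 -> "female").
ds <- dplyr::left_join(ds, ds_gender, by = "gender_id")
```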
6.4.3 data-public/derived/
…for the output of the pipelines. Its contents should be completely reproducible when starting with data-public/raw/ and the repo’s code. In other words, it can be deleted and recreated with ease. This directory might contain a small database file, such as SQLite.
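If a SQLite file is used, an ellis lane might populate it like this; the table name is hypothetical, and `ds_mlm` stands in for a data frame produced upstream.

```r
# A sketch of writing a derived table to the SQLite file named in config.yml.
config <- config::get()
con    <- DBI::dbConnect(RSQLite::SQLite(), config$path_database)
DBI::dbWriteTable(con, "mlm_1", ds_mlm, overwrite = TRUE)   # `ds_mlm` produced upstream
DBI::dbDisconnect(con)
```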
6.4.4 data-public/logs/
…for logs that are useful to collaborators or necessary to demonstrate something in the future, beyond the reports contained in the analysis/ directory.
6.4.5 data-public/original/
…for nothing (hopefully); ideally it is never used. It is similar to data-public/raw/; the difference is that data-public/raw/ is called by the pipeline code, while data-public/original/ is not.
A file in data-public/original/ typically comes from the investigator in a malformed state and requires some manual intervention; it is then copied to data-public/raw/. Common offenders are (a) a csv or Excel file with bad or missing column headers, (b) a strange file format that is not readable by an R package, and (c) a corrupted file that requires a rehabilitation utility.
6.4.6 Characteristics
The characteristics of data-public/ vary based on the subject matter. For instance, medical research projects typically use only the metadata subdirectory of a repo, because the incoming information contains PHI, and therefore a database is the preferred location. On the other hand, microbiology and physics research typically do not have data protected by law, and it is desirable for the repo to contain everything so the project is not unnecessarily spread out.
We feel a private GitHub repo offers adequate protection if being scooped is the biggest risk.
6.6 documentation/
Good documentation is scarce and documentation files consume little space, so liberally copy into this directory everything you receive. The most helpful documents include:
- Approval letters from the IRB or other oversight board. This is especially important if you are also the gatekeeper of a database, and must justify releasing sensitive information.
- Data dictionaries for the incoming datasets your team is ingesting.
- Data dictionaries for the derived datasets your team is producing.
If the documentation is public and stable, like the CDC’s site for vaccination codes, include the url in the repo’s readme. If you feel the information or the location may change, copy the url and also the full document so it is easier to reconstruct your logic when returning to the project in a few years.
6.7 Optional
Everything mentioned until now should exist in the repo, even if the file or directory is empty. Some projects benefit from the following additional capabilities.
6.7.1 DESCRIPTION
The plain-text DESCRIPTION file lives in the repo’s root directory; see the example in R Analysis Skeleton. The file allows your repo to become an R package, which provides the following benefits even if it will never be deployed to CRAN.
- specify the packages (and their versions) required by your code. This includes packages that aren’t available on CRAN, like OuhscBbmc/OuhscMunge.
- better unify and test the common code called from multiple files.
- better document functions and datasets within the repo.
The last two bullets are essentially an upgrade from merely sticking code in a file and sourcing it.
A package offers many capabilities beyond those listed above, but a typical data science repo will not scratch the surface. The larger topic is covered in Hadley Wickham’s R Packages.
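A minimal sketch of such a DESCRIPTION file (the package name, version constraints, and license are hypothetical):

```
Package: skeleton
Title: Analysis Repository for the Skeleton Project
Version: 0.1.0
Description: Manipulation and analysis code for the project; never submitted to CRAN.
License: MIT + file LICENSE
Imports:
    config,
    dplyr (>= 1.0.0),
    OuhscMunge
Remotes:
    OuhscBbmc/OuhscMunge
```

With this file in place, a collaborator can install every dependency in one step with remotes::install_deps(dependencies = TRUE) from the repo root.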
6.7.2 utility/
Include files that may be run occasionally but are not required to reproduce the analyses. Examples include:
- code for submitting the entire repo pipeline to a supercomputer,
- code for simulating artificial demonstration data, or
- code for running diagnostic checks using something like the goodpractice or urlchecker packages (sketched below).
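For example, a hypothetical utility/check-repo.R might contain nothing more than:

```r
# Occasional diagnostics; not part of the reproducible pipeline.
goodpractice::gp()        # static checks, assuming the repo is structured as a package
urlchecker::url_check()   # verify the urls sprinkled through the repo
```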
6.7.3 stitched-output/
Stitching is a light-weight capability of knitr/rmarkdown. If you stitch the repo’s files (to serve as a type of logging), consider directing all output to this directory. The basic call is:
```r
knitr::stitch_rmd(
  script = "manipulation/car-ellis.R",
  output = "stitched-output/manipulation/car-ellis.md"
)
```
We don’t use this approach for medical research, because sensitive information is usually contained in the output, and sensitive patient information should not be stored in the repo. (That’s the last time I’ll talk about sensitive information, at least in this chapter.)