Validating an environment involves two concerns:
- Confidently recreating the same environment
- Trusting what is in the environment
The first concern, reproducing environments, is covered at length by the different strategies for environment management. The validated strategy is particularly useful for creating sets of approved packages, though other strategies can be used depending on the context.
The second concern forces us to answer the question: "Can we trust our environment?" To trust an environment, we must have confidence that the packages do what they claim to do. Unfortunately, with thousands of R packages on CRAN (running nrow(available.packages(repos = "https://cran.rstudio.com")) returns the current count) and more added each day, it is impossible to provide a single list of trusted packages. Every organization, or industry, will need to apply its own judgment in determining whether or not to approve a package. This page presents a set of metrics to help organizations make these determinations.
Quick Links#
Not what you were expecting? Before continuing, here are some quick links to other resources specific to validation in the clinical pharma space:
- Base R Validation Document for FDA
- RStudio Professional Product Validation
- R in Pharma Validation Hub
- Validation Guidance for the tidyverse, tidymodels, r-lib, and gt packages
- Validation Guidance for shiny and rmarkdown
- Install Verification for RStudio Connect, RStudio Workbench, and RStudio Package Manager
Package Characteristics#
The following heuristics can help you judge whether or not a package is stable and useful. As a general rule of thumb, you can use these characteristics as a checklist when evaluating a package. Like any heuristic, there are exceptions - not all stable and useful packages will have everything.
CRAN Releases#
The first question to ask when evaluating a package is: "Is the package on CRAN?". Before CRAN accepts a package, CRAN runs a thorough set of checks to confirm the package works with other packages on CRAN. Passing these checks suggests the package is reasonably stable, and also indicates the package author is serious and motivated. Not every package on CRAN is perfect, but a CRAN release indicates a minimal level of effort and stability. More information on CRAN tests can be reviewed here.
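As a rough sketch of how this first check could be scripted, the snippet below looks package names up in a repository index. The helper name is_in_repo is illustrative and not part of any package; it assumes the index has the shape returned by available.packages(), a character matrix whose row names are package names. A small hand-built index with placeholder names stands in for CRAN so the example runs offline.

```r
# Hypothetical helper: TRUE for each package name present in a repository index.
# `db` is assumed to look like the result of available.packages(): a character
# matrix whose row names are the package names.
is_in_repo <- function(packages, db) {
  packages %in% rownames(db)
}

# Hand-built stand-in index so the example runs without network access.
# The package names and versions are placeholders, not real CRAN data.
example_db <- matrix(
  c("pkgA", "1.0.0",
    "pkgB", "0.2.1"),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("pkgA", "pkgB"), c("Package", "Version"))
)

on_repo <- is_in_repo(c("pkgA", "pkgC"), example_db)
print(on_repo)

# Against the real index (requires internet access):
# cran_db <- available.packages(repos = "https://cran.rstudio.com")
# is_in_repo("ggplot2", cran_db)
# nrow(cran_db)  # number of packages currently available
```

Passing the index as an argument keeps the lookup testable offline and lets the same helper be pointed at an internal repository.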
Note
Many packages include a badge to quickly indicate their current CRAN status. ggplot2, for example, displays a CRAN status badge.
Tests#
A critical indicator that a package is ready for production use is whether the package has tests. Normally, package authors include tests in a tests directory alongside their package code. Tests help authors check their code for accuracy and guard against accidentally breaking existing behavior.
Many packages will go a step further and report test coverage. This metric indicates how much of the package code is currently tested. Often package authors will automatically run tests using a continuous integration service and report test status and code coverage through public badges.
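One way to check for tests, sketched here under the assumption that you have an unpacked copy of the package source, is to list the R files under its tests directory. The helper name list_test_files and the throwaway package skeleton are illustrative only.

```r
# Hypothetical helper: list R test files under the tests/ directory of an
# unpacked package source tree. Returns character(0) when there are none.
list_test_files <- function(pkg_dir) {
  tests_dir <- file.path(pkg_dir, "tests")
  if (!dir.exists(tests_dir)) {
    return(character(0))
  }
  list.files(tests_dir, pattern = "\\.[Rr]$", recursive = TRUE)
}

# Build a throwaway package skeleton in a temp directory so the example is
# self-contained. The file names mimic a common testthat layout.
pkg_dir <- file.path(tempdir(), "examplepkg")
dir.create(file.path(pkg_dir, "tests", "testthat"),
           recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(pkg_dir, "tests", "testthat.R")))
invisible(file.create(file.path(pkg_dir, "tests", "testthat", "test-basic.R")))

test_files <- list_test_files(pkg_dir)
print(test_files)
```

A file count says nothing about how much code the tests exercise. For that, a coverage tool such as the covr package (covr::package_coverage()) is the usual next step.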
Documentation#
A critical indicator of a package's health and usefulness is the level of documentation. R packages provide documentation in a number of formats:
- Package READMEs
- Package Vignettes
- Function References and Help Files
- Websites
- Books
- Journal Papers
- Presentations
- Cheatsheets
Downloads#
The number of times a package is downloaded can help you determine how frequently a package is used. Often packages with many downloads are more stable than packages with fewer downloads. However, take care when using this metric - occasionally a package with fewer downloads may be a newer alternative to a package that has many downloads but is nearing end of life.
RStudio provides download logs for the popular CRAN mirror, https://cran.rstudio.com. The easiest way to access these logs is through the cranlogs R package and API, or by visiting this shiny app.
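As a minimal sketch of working with these logs, the helper below totals daily counts per package. It assumes input shaped like the data frame returned by cranlogs::cran_downloads() when package names are supplied (columns date, count, and package). The helper name total_downloads and the sample rows are placeholders, not real download figures.

```r
# Sum daily download counts per package. `downloads` is assumed to have the
# columns returned by cranlogs::cran_downloads(): date, count, package.
total_downloads <- function(downloads) {
  aggregate(count ~ package, data = downloads, FUN = sum)
}

# Placeholder rows (not real download figures) so the example runs offline.
example_downloads <- data.frame(
  date = as.Date(c("2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02")),
  count = c(10, 15, 3, 4),
  package = c("pkgA", "pkgA", "pkgB", "pkgB"),
  stringsAsFactors = FALSE
)

totals <- total_downloads(example_downloads)
print(totals)

# With real data (requires the cranlogs package and internet access):
# downloads <- cranlogs::cran_downloads(
#   packages = c("ggplot2", "rlang"), when = "last-month"
# )
# total_downloads(downloads)
```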
Dependencies#
When you consider bringing a package into your environment, it is important to evaluate the package's dependencies. Evaluating the risk of package dependencies is a complex process. A great place to start is reviewing this talk and the related itdepends tool. A few quick tips:
- Package dependencies are declared in the package's DESCRIPTION file and come in a few flavors: Depends, Imports, LinkingTo, and Suggests.
- Package dependencies describe what a package relies on. For example, ggplot2 imports rlang, which means ggplot2 requires rlang in order to work. Reverse dependencies indicate the opposite, so ggplot2 is a reverse dependency of rlang.
- You should understand how package inter-dependencies impact reproducibility.
- In addition to depending on other R packages, a package can have system requirements. For example, the rJava package requires a Java installation. You can view system dependencies for a package in the SystemRequirements field of its DESCRIPTION file, though a more complete listing is available here or in RStudio Package Manager.
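The dependency fields described in the tips above can be read programmatically. The following sketch parses a made-up DESCRIPTION file with base R's read.dcf() and extracts bare package names from each field. The package examplepkg, its field values, and the helper parse_dependency_field are all hypothetical.

```r
# A hypothetical DESCRIPTION file, inlined so the example runs offline.
description_text <- c(
  "Package: examplepkg",
  "Version: 0.1.0",
  "License: MIT + file LICENSE",
  "Depends: R (>= 3.5.0)",
  "Imports: rlang (>= 1.0.0), cli",
  "Suggests: testthat",
  "LinkingTo: Rcpp",
  "SystemRequirements: Java (>= 8)"
)

con <- textConnection(description_text)
desc <- read.dcf(con)  # one-row character matrix, one column per field
close(con)

# Hypothetical helper: split a comma-separated dependency field into bare
# package names, dropping version constraints and the entry for R itself.
parse_dependency_field <- function(desc, field) {
  if (!field %in% colnames(desc)) {
    return(character(0))
  }
  entries <- strsplit(desc[1, field], ",", fixed = TRUE)[[1]]
  pkg_names <- trimws(sub("\\(.*\\)", "", entries))
  setdiff(pkg_names[nzchar(pkg_names)], "R")
}

dependency_fields <- c("Depends", "Imports", "LinkingTo", "Suggests")
deps <- lapply(
  setNames(dependency_fields, dependency_fields),
  parse_dependency_field,
  desc = desc
)
print(deps)

# System requirements are free text, so they are reported as-is.
system_requirements <- unname(desc[1, "SystemRequirements"])
print(system_requirements)
```

For packages in a repository, tools::package_dependencies() can resolve dependencies (and, with reverse = TRUE, reverse dependencies) from the repository index without downloading each package.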
Authors#
R packages list the package's author(s) in the DESCRIPTION file. It can be useful to see the number of authors and their affiliation. For a package on GitHub, it is possible to view the contribution activity. Some packages will include contribution guidelines.
For packages developed in a public forum, such as GitHub, it can be useful to review the package's open issues and pull requests. Are the package authors responsive to questions and feedback? Are issues addressed in a timely manner?
News, Releases, and Life Cycle#
Another indicator of a package's stability is the package's release history. For packages on GitHub, this release history is often visible directly. You can also look for the package's NEWS file.
Unfortunately, just looking at the number of releases or the date of the last release does not paint the whole picture. Some packages will have lots of recent releases because they are rapidly changing. Other packages might not have had a release for quite some time - is this because the package has been abandoned? Or is it because the package is really stable? Considering the package's life cycle stage can help answer these questions.
License Restrictions#
Finally, when picking a package, you should consider whether your organization has any licensing restrictions. Licenses for R packages are declared in the License field of their DESCRIPTION file, and many R packages include an additional license file. Organizations with strict licensing requirements might consider an internal repository to track and audit license usage.
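A simple license audit could look like the sketch below, which flags packages whose declared license is not on an allow list. The allow list, the inventory rows, and the helper name flag_unapproved are placeholders; substitute the licenses your organization has actually approved.

```r
# Hypothetical allow list; replace with the licenses your organization approves.
approved_licenses <- c("MIT + file LICENSE", "GPL-2", "GPL-3", "GPL (>= 2)")

# Return the rows of `packages` whose License is not on the allow list.
flag_unapproved <- function(packages, approved) {
  packages[!packages$License %in% approved, , drop = FALSE]
}

# Placeholder inventory so the example runs without any packages installed.
inventory <- data.frame(
  Package = c("pkgA", "pkgB", "pkgC"),
  License = c("MIT + file LICENSE", "GPL-3", "file LICENSE"),
  stringsAsFactors = FALSE
)

needs_review <- flag_unapproved(inventory, approved_licenses)
print(needs_review)

# A real inventory could be built from the local library, for example:
# lib <- installed.packages()
# inventory <- data.frame(
#   Package = unname(lib[, "Package"]),
#   License = unname(lib[, "License"]),
#   stringsAsFactors = FALSE
# )
```

Exact string matching is deliberately strict: license fields vary in spelling, so anything that does not match the allow list verbatim is surfaced for a human to review rather than silently accepted.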
Related Work and Advice#
A group of pharmaceutical companies has formed a working group aimed at tackling the question of package validation. Take a look at their preliminary work.
The rOpenSci project maintains a collection of packages that have undergone significant peer review. It also sponsors a tool for identifying useful package metrics.
Julia Silge has written an excellent series of blog posts expanding on the topic of package selection.
Finally, CRAN itself maintains a series of Task Views, and many websites provide options for searching CRAN, such as METACRAN.
Organizing Selected Packages#
If you work in an organization, you may want an easy way to harness tribal knowledge about packages that meet your team's requirements - or packages that have proven useful time and time again. An easy way to share useful sets of packages is through an internal repository which can be created using RStudio Package Manager. Internal repositories also provide an easy way to track package downloads, making it possible to see what packages are actually used by your team!
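Once an internal repository exists, R sessions can be pointed at it through the repos option. This is a minimal sketch: the URL below is a made-up placeholder, and where the setting lives (for example a site-wide or per-user R profile) depends on how your environment is managed.

```r
# Hypothetical internal repository URL; replace with your organization's own.
internal_repo <- "https://packages.example.com/approved/latest"

# Use only the internal repository, so install.packages() and
# available.packages() resolve against the approved set of packages.
options(repos = c(INTERNAL = internal_repo))

print(getOption("repos"))
```

Listing only the internal repository is a deliberate choice here. If a public mirror were also listed, packages outside the approved set could still be installed from it.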