TLDR: This post describes a bash script that can be run in an R project directory. When run, the script automatically installs any packages called within that project. The bash script is available in my GitHub Gists.

Sometimes, when you update to a new version of R, download a project from GitHub, or copy a new project from a colleague, the code calls functions or packages that aren’t currently installed. This can be a real pain in the neck the first time you try to run the scripts. If the project has been set up well it’s often a matter of looking into a single file (perhaps a setup.R file), but in other cases, finding all the packages requires iteratively running the various scripts and installing things on a package-by-package basis.

I’ve encountered this often enough that I decided to write a bash script for Linux (and macOS) to (1) search through all R and Rmd files in a directory, (2) return the full list of packages, and then (3) install the packages one by one.

Finding R package calls

Let’s define the scope of the first problem. If we’re going to write something that will detect R packages in a file we need to know: How do people call packages in R? This is a pretty good question in general, and probably not that well defined. We can make an educated guess and choose several fairly common patterns:

library(ggplot2)
library(ggplot2, verbose = TRUE)
library(ggplot2, stringr)
library("ggplot2")
library("ggplot2", "yarrr")
require(stringr)
#' @import fields
#' @importFrom neotoma compile_taxa
neotoma::get_dataset()
dataset %>% dplyr::filter(value > 10) %>% DT::datatable()

These methods of loading a package make up a valid (and common) subset of commands for loading libraries within R. So, if we want to capture a full set of packages we need to be able to figure out how to express the commands above using my favorite tool: regular expressions.

Caveat: There are a number of options for the library() command, and people might call library() any number of ways. While the list above is not an exhaustive list, this reflects the way that I commonly call packages.

Defining our tools

Figure 1. A model program flow for the intended script. A project is downloaded or copied and placed into a folder. The bash script is executed; it checks the directory for package declarations, compares those against packages in the local library folders, and then downloads missing packages. Cloud Icon Created by Yo! Baba from the Noun Project.

This workflow is designed to work as a bash script within a Linux terminal. I am using Ubuntu 18.04, but it should work in a macOS terminal as well. I wrote it to be used as part of a workflow where I could do something like this:

git clone somegitrepowithRcode
bash installLib.sh -i

And then run RStudio or an editor of my choice (I’ve been using Atom more lately) without having to worry about hitting messages about missing packages.

If I wanted to go further with the command line I could edit my .bashrc file to add an alias. For now we’ll work on building the regular expressions and then putting them into a bash file.

Using sed

I use the program sed to perform my regular expression matching. I use sed rather than grep because sed is specifically designed to edit streams of text (the name is a contraction of Stream EDitor). The script will be processing lines of code and returning text to an array, directly interacting with a stream of text. sed also gives us some more tools to work with. For this project I will be using one particular flag with sed:

sed -n pattern source

The -n flag tells sed to suppress its default behaviour of printing every input line. Without the -n flag sed will print every line of the source to the screen and then print any matches a second time. When we build the bash script we will want to do something a bit special: we don’t want to return the whole match, we want to generate a regular expression query that results in a substitution, so that our match to library(ggplot2) returns ggplot2 only. That way we will get a list of packages, and not the full declarations.
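To see what -n changes, here is a minimal sketch using a simple print command (/^library/p) rather than the substitution built later in this post. The two-line input is made up for illustration.

```shell
# Without -n: sed prints every input line automatically, and the p command
# prints the matching line a second time.
without_n=$(printf 'library(ggplot2)\nx <- 1\n' | sed '/^library/p')

# With -n: automatic printing is switched off, so only the matching line appears.
with_n=$(printf 'library(ggplot2)\nx <- 1\n' | sed -n '/^library/p')

printf 'without -n:\n%s\n\nwith -n:\n%s\n' "$without_n" "$with_n"
```

Without -n the library line shows up twice alongside the unrelated line of code; with -n only the match is returned, which is the behaviour the script relies on.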

The Pattern

The general form of a sed substitution is s/match/substitution/flags: the command starts with s, followed by the pattern to match, the replacement text, and any flags. It would be convenient to assume that each package is called only once per line of code, but the neotoma::get_dataset() style makes it clear that people can nest functions or string multiple calls together on a single line using %>% pipes. Because of this we need to search globally for the pattern. To do this we add the trailing “global” flag, g, so we write s/match/substitution/g in most of our sed commands below.
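As a small illustration of the global flag, the sketch below replaces :: with an arrow, first without and then with g. The replacement text is arbitrary and only there to make the difference visible.

```shell
line='dataset %>% dplyr::filter(value > 10) %>% DT::datatable()'

# Without g, only the first :: on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/::/ -> /')

# With g, every :: on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/::/ -> /g')

printf '%s\n%s\n' "$first_only" "$every_match"
```

In the first result DT::datatable() is left untouched, which is exactly the kind of second call on a line that the script would otherwise miss.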

Capturing a user’s R packages

Note: I am using regex101 for many of my code examples. It’s a very useful tool, and all complete regular expressions are linked to a page showing how they work in the context of the examples I provided above. I have another post on regular expressions in R that may also be of interest.

Capturing library calls

The first few cases above, where library() is used, should be relatively straightforward. Regex doesn’t just match complete strings, it allows you to use capture groups, specified elements within the full regex match. So for example the regex ^library\((.+)\) will capture (1) any occurrence of library() (2) at the beginning of a line (indicated by ^), (3) with literal brackets (escaped using \( or \)) that (4) enclose some text (.) that is (5) of length one or more (+). By putting the string .+ in brackets we tell the regular expression engine that this match is a special part of the regular expression, a capture group. Since these brackets are to tell the regex engine something special they are not escaped.

In most regex engines, the capture groups can be returned using the notation either $1 or \1. So we could match library(ggplot2) with ^library\((.+)\) and return ggplot2 with \1. You can try this out in the terminal using:

echo 'library(ggplot2)' | sed -n 's/^library[(]\(.*\)[)]/\1/p'
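The pattern in the prose and the pattern in the sed command look different because sed defaults to "basic" regular expression syntax, where the roles of escaped and unescaped brackets are reversed. The sketch below runs both forms side by side; the -E flag (extended syntax) is my addition for comparison and is not used in the script itself.

```shell
# Basic syntax (sed's default): \( \) mark the capture group, and [(] and [)]
# match the literal brackets.
basic=$(echo 'library(ggplot2)' | sed -n 's/^library[(]\(.*\)[)]/\1/p')

# Extended syntax (-E): plain ( ) mark the capture group and \( \) are literal,
# which matches the ^library\((.+)\) form described above.
extended=$(echo 'library(ggplot2)' | sed -n -E 's/^library\((.+)\)/\1/p')

echo "$basic"
echo "$extended"
```

Both commands return the package name alone, so the choice between them is mostly a matter of which escaping style you find easier to read.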

Process the sed output

The capture string still captures a variety of library calls, whether quoted (library("ggplot2")), a list of packages (library(ggplot2, cars)), or quoted lists. To manage these various outputs we need to use bash pipes and a function called tr, to clean up any extraneous characters and turn the packages into an array that can be used in bash. Try this:

echo 'library("ggplot2", neotoma, "dplyr", verbose = FALSE)' | \
  sed -n 's/^library[(]\(.*\)[)]/\1/p' | \
  tr "," "\n" | \
  tr -d "[\"\\']" | \
  sed "s/verbose\s*=\s*\(\(TRUE\)\|\(FALSE\)\)/ /g"

The sed matching (with the -n option) is piped (|) out. With the first tr we translate all occurrences of , to a newline (\n). The second tr deletes (-d) occurrences of single or double quotes. We have to escape the double quote, otherwise bash would think it was the end of the quoted text. tr treats its argument as a set of characters rather than a regular expression, so listing both quote characters is enough to remove either type; the square brackets are not acting as a grouping here, and tr simply deletes any literal brackets as well. The last sed command is used to remove the option verbose = FALSE or verbose = TRUE which may or may not be present in the library command. You can see this line within the context of the final bash script in my GitHub Gist.

The code block above gives a list of packages, separated by newlines (\n). In a bash script we assign the list to a variable; we can see that things are working by writing the bash file and then executing it from the command line:

#!/bin/bash

library=$(cat R/*.R | \
  sed -n 's/^library[(]\(.*\)[)]/\1/p' | \
  tr "," "\n" | \
  tr -d "[\"\\']" | \
  sed "s/verbose\s*=\s*\(\(TRUE\)\|\(FALSE\)\)/ /g")

echo $library

We can add a second line, replacing library with require, giving us the first set of match requirements.
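As an alternative to duplicating the block, the two patterns can be folded into one. This is a sketch under some assumptions: extract_loaded_pkgs is a hypothetical helper name, and it uses sed -E with an alternation, which the original script does not.

```shell
# extract_loaded_pkgs (hypothetical helper) reads R code on standard input and
# prints one package name per line for library() and require() calls.
extract_loaded_pkgs() {
  sed -n -E 's/^(library|require)[(](.*)[)]/\2/p' |
    tr ',' '\n' |
    tr -d "\"' " |
    sed -E '/^verbose=(TRUE|FALSE)$/d'
}

printf '%s\n' \
  'library("ggplot2", neotoma, "dplyr", verbose = FALSE)' \
  'require(stringr)' \
  'x <- 1' | extract_loaded_pkgs
```

Here the spaces are deleted along with the quotes, so the verbose option can be dropped as a whole line rather than blanked out. Other library() arguments would still slip through.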

Matching package imports in roxygen2

In roxygen2, and when people are building packages, it’s possible to call packages using the statement @import or @importFrom. Here we would either declare a single package, or a package and then subsequent functions from that package. For valid roxygen2 markup this needs to be preceded by #', so we can look for something like this:

^\s*#\+'\s\+@import\(From\)\?\s\+\([[:alnum:]]\+\)

We begin the line (^), possibly with space (\s* allows zero or more), followed by the roxygen2 comment marker #' (allowing for one or more comment characters: #\+') with at least one space (\s\+) and then @import, which could be followed by From (with escaped parentheses followed by an escaped question mark to indicate an optional match: @import\(From\)\?). Because sed uses basic regular expressions by default, the grouping parentheses and the + and ? quantifiers are written in their escaped forms. The capture group here is defined only as [[:alnum:]], a regex class of all alphanumeric values. This is different from the earlier pattern where we captured (.*), because in the library call we expected to potentially obtain comma-separated lists, and we needed to account for the possibility of quoted package names. A looser form of this pattern becomes the third match in the bash script.
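A runnable sketch of the roxygen2 match follows. It makes a few assumptions of its own: extract_roxygen_pkgs is a hypothetical helper name, it uses sed -E with POSIX character classes instead of \s, and it adds a dot to the capture class so that names such as data.table are kept whole.

```shell
# extract_roxygen_pkgs (hypothetical helper) reads R code on standard input and
# prints the package named in each #' @import or #' @importFrom line.
extract_roxygen_pkgs() {
  sed -n -E "s/^[[:space:]]*#+'[[:space:]]+@import(From)?[[:space:]]+([[:alnum:].]+).*/\2/p"
}

printf '%s\n' \
  "#' @import fields" \
  "  #' @importFrom neotoma compile_taxa" \
  "# @import ignored" | extract_roxygen_pkgs
```

The third input line is skipped because it lacks the apostrophe that marks a roxygen2 comment.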

Matching from pipes (%>%)

The regular expression that captures calls within pipes also catches any call where a function is called with its package explicitly, using package::function(). The call requires the use of perl rather than sed, since perl allows the use of lazy (non-greedy) matches, where sed does not.

perl -pe 's/(.*?)([[:alnum:]]+)(::)(.*?)|./ \2/g' | \
 sed '/^\s*$/d'

The regular expression (.*?)([[:alnum:]]+)(::)(.*?)|. has two alternatives. The first matches any run of alphanumeric text that is followed by ::, indicating that it is the package supplying the function; the package name is the second capture group. The leading (.*?) is a lazy match, which tries to match as few characters as possible, so it consumes only the text in front of the next package name. The second alternative, the lone ., matches any other single character, so everything that is not part of a package::function() call is replaced by a space.

We have to use perl in this case since sed does not recognize non-greedy matches, but the substitution syntax is the same. The perl -e flag executes the command in the quotes, and the -p flag loops over each line of input and prints it after the substitution has been applied. This winds up producing a lot of empty space, which is unfortunate, but the sed command that follows deletes any line that contains only whitespace (^\s*$). This completes the set of regex calls we need for the bash script.

Cleaning an array of packages

Each of these regular expression/perl/sed sequences will return a set of package names. In the bash file these are aggregated into a single long array by chaining them. In some cases the returns from these calls may be separated by only a single space. Passing the library array into tr and replacing spaces with newlines (\n) gives a list with one library per line, which we can pass to sort -u to return the unique set of packages.
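The de-duplication step can be tried on its own. In this sketch the contents of library are made up, and the sed '/^$/d' step is my addition to drop the blank lines that repeated spaces leave behind.

```shell
# A made-up aggregate with repeated names and uneven spacing.
library=' ggplot2 dplyr
stringr ggplot2  neotoma dplyr'

# One name per line, blank lines removed, then sorted with duplicates dropped.
unique_pkgs=$(printf '%s\n' "$library" | tr ' ' '\n' | sed '/^$/d' | sort -u)

printf '%s\n' "$unique_pkgs"
```

Each package now appears once, in sorted order, regardless of how many files called it.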

So the bash file:

#!/bin/bash

library=$(cat $(find . -type f \( -name \*.R -o -name \*.Rmd \)) | \
  sed -n 's/^library[(]\(.*\)[)]/\1/p' | \
  tr "," "\n" | \
  tr -d "[\"\\']" | \
  sed "s/verbose\s*=\s*\(\(TRUE\)\|\(FALSE\)\)/ /g")

library+=$(cat $(find . -type f \( -name \*.R -o -name \*.Rmd \)) | \
  sed -n 's/^require[(]\(.*\)[)]/ \1/p' | \
  tr "," "\n" | \
  tr -d "[\"\\']" | \
  sed "s/verbose\s*=\s*\(\(TRUE\)\|\(FALSE\)\)/ /g")

library+=$(cat $(find . -type f \( -name \*.R -o -name \*.Rmd \)) | \
  sed -n 's/^.*\@import\(From\)\?\s\([a-zA-Z]*\).*/ \2/p')
library+=$(cat $(find . -type f \( -name \*.R -o -name \*.Rmd \)) | \
  perl -pe 's/(.*?)([[:alnum:]]+)(::)(.*?)|./ \2/g' | sed '/^\s*$/d')

installs=$(tr ' ' '\n' <<< "${library[@]}" | sort -u | tr '\n' ' ')

echo $installs

This returns Bchron dplyr fields ggplot2 gridExtra maps mgcv neotoma plyr purrr purrrlyr raster readr reshape2 rgdal rmarkdown svglite viridis when run on a local clone of a project I am currently working on. If the user chooses not to install the packages, the results from the bash script may look like this:

 The package ggforce hasn't been installed.
 The package ggmap hasn't been installed.
 The package giphyR hasn't been installed.
 The package gstat hasn't been installed.
 The package hdf5 hasn't been installed.
 The package highlight hasn't been installed.

Using the script without installing packages may be a good first step, since it will indicate how many packages are required, and also whether the bash script is working with the current set of R scripts. For this reason, we use an installation flag.

Installing packages

In the bash script I allow the flag -i using a set of commands at the top of the bash file:


rinstall=0

while getopts "i" OPTION
do
	case $OPTION in
		i)
			echo Running installLib with the option to install packages.
			rinstall=1
			;;
	esac
done

This uses the bash getopts command, and echoes to the screen when a user chooses to install the packages. If they’ve chosen to install the packages then we need to check each package name against the set of currently installed packages.

Is the package installed?

First we need the current path for R libraries, which we obtain from the command .libPaths(). Assuming R can be called globally, we can execute rpath=$(Rscript -e "cat(.libPaths())") in the bash script. This sets a bash variable to a space-separated list of library paths. For each path element we test whether the package is already installed by looking for its directory using test -d "$paths/$onePkg". If the package is not present in any path then install remains 0, otherwise it is changed to 1.
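The directory check can be sketched without R installed. Everything here is an assumption made for illustration: is_installed is a hypothetical helper, and a temporary directory stands in for the library paths that .libPaths() would report.

```shell
# is_installed (hypothetical helper) takes a package name followed by one or
# more library directories, and succeeds if any of them contains a folder with
# that name.
is_installed() {
  local pkg=$1
  shift
  local path
  for path in "$@"; do
    test -d "$path/$pkg" && return 0
  done
  return 1
}

# In the real script the paths come from R:
#   rpath=$(Rscript -e "cat(.libPaths())")
# Here a temporary directory stands in for an R library.
fake_lib=$(mktemp -d)
trap 'rm -rf "$fake_lib"' EXIT
mkdir "$fake_lib/ggplot2"

for onePkg in ggplot2 neotoma; do
  if is_installed "$onePkg" "$fake_lib"; then
    printf 'The package %s is installed.\n' "$onePkg"
  else
    printf "The package %s hasn't been installed.\n" "$onePkg"
  fi
done
```

Passing the paths as separate arguments means each one is checked in turn, which mirrors the per-path loop described above.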

From there we use numeric comparisons (-eq) to test whether to install the package or not. If the package isn’t installed and the flag has not been set, then simply print to the screen:

test $install -eq 0 && \
  printf "  The package %s hasn\'t been installed.\n" $onePkg

Otherwise, if the flag -i has been used, then install the package from the main CRAN repository:

test $install -eq 0 && \
  test $rinstall -eq 1 && \
  printf "  * Will now install the package.\n" && \
  Rscript -e "install.packages (\"$onePkg\", repos=\"http://cran.r-project.org/\")"

Wrapping it up

So, at this point, we can git clone, and copy our bash script (wherever it is) into the cloned directory:

git clone git@github.com:SimonGoring/RegularExpressionR.git
cp installLib.sh ./RegularExpressionR/installLib.sh
cd ./RegularExpressionR
bash installLib.sh -i

and we will have all of our packages installed. For me, this is a huge time saver. If you have suggestions, comments, or want to use the script, check it out on my GitHub Gist. Feel free to comment or edit anything you need.

Caveats

The whole script eventually runs through all R files and checks them all, pulling all the packages and then running the install.packages() command through Rscript. As mentioned before, this will not catch every possible way of loading a package. In most cases a failure means either that the script tries to install an invalid package (e.g., TRUE or =) or that it fails to detect a package call. In addition, the script will not install packages that are installed using devtools::install_github(). It will, however, install devtools, so if install_github() is called explicitly within the scripts, the package should be installed when those scripts are run.
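One way to reduce the invalid-package failures would be to filter the candidate list before installing. The sketch below is not part of the gist: keep_valid_names is a hypothetical helper that relies on R's naming rule for packages (ASCII letters, numbers and dots only, at least two characters, starting with a letter and not ending in a dot).

```shell
# keep_valid_names (hypothetical helper) reads one candidate per line and keeps
# only strings that could be R package names, then drops the logical constants
# that leak in from options such as verbose = TRUE.
keep_valid_names() {
  grep -E '^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$' | grep -v -x -E 'TRUE|FALSE'
}

printf '%s\n' ggplot2 TRUE '=' data.table FALSE | keep_valid_names
```

This only checks that a name is well formed; it cannot tell whether a package of that name actually exists on CRAN.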