EarthCube Funded Projects

1 Using Neo4j to Explore NSF EarthCube Awards

EarthCube is a set of funded projects supported by NSF’s Geosciences directorate and the Division of Advanced Cyberinfrastructure. It is defined as “a community effort to promote interdisciplinary geoscience by enabling technology, organization, and culture that facilitates connectivity through standards and protocols to existing and emerging resources”, so I thought I would take the opportunity to leverage a few emerging technologies to look at connectivity across the organization.

I’ve been involved with EarthCube since an early workshop at the University of Wisconsin focused on community standards for geochronology, user needs and capacity building. Since then I have been co-PI on an award to help integrate the Neotoma Paleoecological Database and Paleobiology Database through a common API (currently in development as the EarthLife Consortium). I currently sit on EarthCube’s Leadership Committee, so, as much as this post is driven by my own interest, it’s also got an element of propaganda to it.

2 Obtaining the Data

2.1 The Search Tool

The National Science Foundation has a very handy Awards Search tool. The NSF Awards Search API is fairly well documented, and provides an opportunity for developing programmatic tools around awards searches. I’ve developed this document as a one-off, and tried to use a fairly straightforward approach, keeping the code focused on the RNeo4j package rather than on an API wrapper.
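Although I take the point-and-click route below, a programmatic query might look something like this rough sketch. It isn’t used anywhere else in this post; the endpoint and the keyword/printFields parameters come from my reading of the NSF API documentation, so treat them as assumptions rather than a worked example.

# A minimal sketch of querying the NSF awards API directly (assumed endpoint
# and parameter names; not used anywhere else in this post):
library(httr)
library(jsonlite)

resp <- GET("https://api.nsf.gov/services/v1/awards.json",
            query = list(keyword     = "EarthCube",
                         printFields = "id,title,awardeeName"))

awards <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$response$award
head(awards)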

To get the data I used the NSF Awards Search web interface and simply searched for any award (expired or active) that contained the term EarthCube.

Once this is done you get a set of results for projects that mention "EarthCube" somewhere in their text, or that are funded through the EarthCube program at the NSF level (as opposed to through EarthCube governance). The returned list can be exported to one of several formats.

For this project I’ve exported the results to csv, and the file is available from the GitHub repository for this project as a raw CSV file.

3 A Brief Introduction to Neo4j and Graph Databases

Neo4j is an implementation of a particular kind of database called a graph database. A graph database models data through relationships: while relational databases such as MySQL or SQL Server model relationships through joins, graph databases let you both assign relationships and attach further context to those relationships. I downloaded the neo4j Community Edition and had it set up and running on my computer fairly quickly.

In the context of NSF Grants we can think of objects (or nodes), such as people, institutions or awards, and we can think of their relationships to one another. For example, the Person “Simon Goring” has a relationship with the Institution “University of Wisconsin-Madison”, which is that I “Work For” the University, and, given sufficient data, we could add attributes to this relationship, such as “Start Date” or “End Date”. Similarly, I have a relationship with Award 1541002, in that I am a “Co-PI”, and, although we won’t in this simple example, we could add attributes here as well, for example the salary support I requested (which is non-trivial to obtain). In defining these nodes, and the relationships between them, we can begin to develop a data model for these awards.

Since we’re building in R, I’m using the RNeo4j package. The author of the package, data scientist Nicole White, has some nice tutorials for using RNeo4j. Having the package was incredibly helpful, but I still needed to dive into some other neo4j resources, including the Cypher refcard.
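As a toy illustration of that node-and-relationship idea, the snippet below sketches how a single Person, an Organization, and a “Works_For” relationship with a property might be created through RNeo4j. The labels, property names, and the date are made up for the example, and it assumes a local neo4j instance like the one used later in this post.

# A toy sketch of the data model; labels, properties and the date are illustrative:
library(RNeo4j)

toy_graph <- startGraph("http://localhost:7474/db/data",
                        username = "neo4j", password = "NeoPW")

person <- createNode(toy_graph, "Person", name = "Simon Goring")
inst   <- createNode(toy_graph, "Organization",
                     Organization = "University of Wisconsin-Madison")

# Relationships can carry their own properties, for example a start date:
createRel(person, "Works_For", inst, StartDate = "2013-08-01")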

This document takes some of those basics, adds a (semi) scientific dataset, and (hopefully) provides you with another example for building a graph database with neo4j.

4 Defining the Model

Once the data is loaded we need to define a model, and the data content for each element. This project uses a fairly simple data model, focusing on people, places and awards.

While the individual nodes are fairly straightforward, we can pull out specific nodes and examine how the information in the awards csv file could fit into them. This is also the first time we get to look at the data in some detail. Although the data contains personal information, it is in the public domain, so there is no real expectation of privacy; even so, I’ll focus on the data that pertains directly to me, to keep things respectful.

library(dplyr, verbose = FALSE, quietly = TRUE)
library(purrr, verbose = FALSE, quietly = TRUE)

funded_project <- read.csv('Awards.csv', stringsAsFactors = FALSE) %>% mutate_all(as.character)

# Find the row that includes my data:
my_row <- grep("Simon Goring", funded_project$Co.PIName.s.)

4.1 Cleaning the Data

The raw data has 25 columns. The NSF report does not contain a significant amount of data, although it does include an email address for the lead PI at each institution. It’s possible to pull more information by exporting the XML files for individual awards, but that’s more work than this post needs, so we’ll ignore it. To begin cleaning the data we first need to understand it, so let’s look at the available columns:

colnames(funded_project)
##  [1] "AwardNumber"             "Title"                  
##  [3] "NSFOrganization"         "Program.s."             
##  [5] "StartDate"               "LastAmendmentDate"      
##  [7] "PrincipalInvestigator"   "State"                  
##  [9] "Organization"            "AwardInstrument"        
## [11] "ProgramManager"          "EndDate"                
## [13] "AwardedAmountToDate"     "Co.PIName.s."           
## [15] "PIEmailAddress"          "OrganizationStreet"     
## [17] "OrganizationCity"        "OrganizationState"      
## [19] "OrganizationZip"         "OrganizationPhone"      
## [21] "NSFDirectorate"          "ProgramElementCode.s."  
## [23] "ProgramReferenceCode.s." "ARRAAmount"             
## [25] "Abstract"

Of these, ProgramManager, PrincipalInvestigator, and Co.PIName.s. are the named fields (the periods are R’s annoying habit of swapping them in for special characters in column names). I’m going to ignore the email address. The first thing I noticed is that Co.PIName.s. may contain comma-separated names, so we need to extract unique names using strsplit:

funded_project$Co.PIName.s.[my_row]
## [1] "Simon Goring, Shanan Peters"

To generate a list of all unique individuals we can write a pretty short script. It defines each individual along with a count of the total number of grants they are part of, in case you want node size to reflect that in the visualization.

people <- c(funded_project$PrincipalInvestigator, 
            funded_project$ProgramManager,
            unlist(strsplit(funded_project$Co.PIName.s., ', '))) %>% 
  data.frame(name = .) %>% 
  group_by(name) %>% 
  summarise(count = n())

knitr::kable(people[grep("Simon", people$name),])
|name         | count|
|:------------|-----:|
|Simon Goring |     1|

We can do the same thing for Organizations, although this is simpler since NSF associates each award number with a single organization:

# Organizations:
orgs <- funded_project %>% 
  dplyr::select(Organization, OrganizationState, OrganizationCity) %>% 
  group_by(Organization, OrganizationState, OrganizationCity) %>% 
  summarise(count = n())

knitr::kable(head(orgs[order(orgs$count, decreasing = TRUE),]))
|Organization                               |OrganizationState |OrganizationCity | count|
|:------------------------------------------|:-----------------|:----------------|-----:|
|University of Colorado at Boulder          |CO                |Boulder          |    18|
|University Corporation For Atmospheric Res |CO                |Boulder          |    14|
|Columbia University                        |NY                |NEW YORK         |    11|
|University of California-San Diego         |CA                |La Jolla         |    11|
|University of Southern California          |CA                |Los Angeles      |    10|
|Woods Hole Oceanographic Institution       |MA                |WOODS HOLE       |    10|

The last nodes to be generated are the awards, which are linked to sub-awards. We’ll pull this information together, since it’s all tied to individual rows of the csv file. I’m going to ignore the NSF Organization and Program for now, but that information is there, so we could do a much deeper dive into the data if we wanted to:

grants <- funded_project %>% 
  select(AwardNumber,
         Title,
         StartDate,
         EndDate,
         AwardedAmountToDate,
         Abstract)

grants$URL <- paste0("http://nsf.gov/awardsearch/showAward?AWD_ID=", 
                                       funded_project$AwardNumber)

In general the award data is pretty straightforward. The one issue I found was an artifact of the way proposals are awarded. In practice, you write a single grant, but submit a separate proposal for each institution. This means there is one parent award and then a set of sub-awards. We would expect all sub-awards to share the same title, but that isn’t quite the case. Some preliminary string matching showed me that there’s often an edit distance of \(d_{e} < 10\) (using the default “Optimal String Alignment” distance) among sub-award titles. I wanted to standardize titles before I got too far, so I used the stringdist R package.
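Before standardizing, here is a quick illustration of what the default “osa” (optimal string alignment) distance counts; the two titles are made up, and differ only by a trailing period.

# One insertion (the trailing period) gives an edit distance of 1:
library(stringdist)

stringdist("Collaborative Research: EarthCube Building Blocks",
           "Collaborative Research: EarthCube Building Blocks.")
## [1] 1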

library(stringdist)

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

for (i in 1:nrow(grants)) {
  
  # Compare each title against every other title (case-insensitive):
  matches <- stringdist(tolower(grants$Title[i]), tolower(grants$Title))
  
  # If more than one title falls within an edit distance of 5, overwrite every
  # near-match with the first title at the most common (modal) distance:
  if (sum(matches < 5) > 1) {
    grants$Title[matches < 5] <- grants$Title[which(matches == getmode(matches[matches < 5]))[1]]
    funded_project$Title[matches < 5] <- grants$Title[which(matches == getmode(matches[matches < 5]))[1]]
  }
}

Now all our near-duplicate titles are standardized, and we’re left with 158 unique awards and 271 sub-awards, with 436 individuals represented across 96 organizations.
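If you want to double-check those numbers yourself, they can be pulled from the objects we’ve already built; a quick sketch (no output shown):

# Rough sanity checks on the counts reported above:
length(unique(grants$Title))   # unique (standardized) award titles
nrow(grants)                   # sub-awards: one row per csv record
nrow(people)                   # unique individuals
nrow(orgs)                     # unique organizations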

5 Building the Graph

Once we’ve cleaned the data (minimally) and understand the funded_project table structure, we can move to the process of building the database. Since each element here is pulled from the table, we can use the table’s structure to build the graph. This first requires us to initialize the database, and then begin to populate it. Before I ran this code I started an instance of neo4j. My instance is accessed through my localhost port 7474:

library(RNeo4j)

# These are the default settings, in general I like to have my username and password
# in a separate file that I then call using `scan`:

ec_graph <- startGraph("http://localhost:7474/db/data", username = "neo4j", password = "NeoPW")

clear(ec_graph, input = FALSE)

In general there’s no need to clear() the graph, except that I’ve rendered this Rmd file a few times and want to be sure that all of the old information is gone before re-populating it.

So now we have the elements and the relationship types, and we need to start populating the graph. Given an empty graph we can loop through each award (each grant element) and create nodes. Since we don’t want to create duplicate nodes or relationships, we always check whether a particular node already exists, and then either create it or return the node that’s already there.

I didn’t find a good tool for this in RNeo4j, although you can use the CREATE UNIQUE Cypher command in the native neo4j browser (and presumably through the package’s cypher() function). You could also clear out duplicate nodes afterwards, but my laptop is pretty old at this point, so I wanted to keep the overhead down. To do that I wrote two helper functions: checkNode(), which checks whether a node is already in the graph before adding it, and checkRel(), which does the same thing for relationships.
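As an aside, here is a sketch of the MERGE-based alternative (MERGE is the successor to CREATE UNIQUE) run through cypher(); it creates the node only when no match exists. I didn’t end up using this, and the older {param} parameter style shown here is an assumption on my part:

# A sketch of MERGE through cypher(); creates the node only if it doesn't exist:
cypher(ec_graph,
       "MERGE (p:Person {name: {name}}) RETURN p.name AS name",
       name = "Simon Goring")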

I welcome additions to this document. Feel free to fork & make pull requests to this repository: http://github.com/EC_Engagement

checkNode <- function(x, graph, type) {
  node_gr <- getNodes(graph, 
               paste0("MATCH (t:",type, ") WHERE t.",names(x)[1]," IN ['", gsub("'", '"', x[1]), 
               "'] RETURN t"))
  if (is.null(node_gr)) {
    node_gr <- createNode(graph = graph, .label = type, x)
  }
  
  return(node_gr)
  
}

checkRel <- function(x, y, type, graph) {
  rel_check <- getRels(graph, 
               paste0("MATCH (x:",getLabel(x), ")-[r]-(y:",getLabel(y), ") WHERE x.",
                      names(x)[1]," IN ['", gsub("'", '"', x[1]), 
                      "'] AND y.",
                      names(y)[1]," IN ['", gsub("'", '"', y[1]), 
                      "'] RETURN r"))
  if (is.null(rel_check)) {
    rel_check <- createRel(.fromNode = x, .relType = type, .toNode = y)
  }
  
  return(rel_check)
}

Once those functions are loaded we add in all the nodes using purrr’s really nice by_row() function to go through each row of the people, orgs, and grants data.frames.

people_nodes <- people %>% by_row(checkNode, graph = ec_graph, type = "Person")
organization <- orgs   %>% by_row(checkNode, graph = ec_graph, type = "Organization")

subaward_gr    <- grants %>% 
  select(AwardNumber, AwardedAmountToDate, URL) %>% 
  by_row(checkNode, graph = ec_graph, "Sub_Award")
  
award_gr    <- grants %>% 
  select(Title, Abstract, StartDate, EndDate) %>% 
  by_row(checkNode, graph = ec_graph, "Award")

In general these steps each take a fairly significant amount of time. It’s faster when I render() with rmarkdown, but slow when I run it natively in RStudio. I suspect some of this is the result of first checking whether each node is a duplicate, and some of it is my laptop. Given that we are creating the graph from scratch, it’s probably fine to assume each node is new and skip the matching. Having said that, it’s nice to be sure that everything is working cleanly from the start.
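If you do decide to trust the de-duplicated data.frames, a sketch of the unchecked version (mirroring the by_row() pattern and the createNode() call used inside checkNode() above) might look like this:

# Unchecked node creation: assumes every row really is a new node.
people_nodes_fast <- people %>%
  by_row(function(x) createNode(ec_graph, "Person", x))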

So, once the nodes are built, we go row-wise and build the relationships:

for (i in 1:nrow(grants)) {
  
  # Find out who was funded on the grant:
  people_gr <- getNodes(ec_graph, 
                 paste0("MATCH (p:Person) WHERE p.name IN ['", 
                 paste(gsub("'", ".", unique(c(funded_project$PrincipalInvestigator[i],
                        strsplit(funded_project$Co.PIName.s.[i], split = ', ')[[1]]))), 
                       collapse = "','"), 
                 "'] RETURN p"))
  
  # Grant funds go to institutions:
  org_gr <-  getSingleNode(ec_graph, 
                      paste0("MATCH (o:Organization) WHERE o.Organization IN ['", 
                             as.character(unique(funded_project$Organization[i])), 
                             "'] RETURN o"))
  
  award_gr <- getSingleNode(ec_graph, 
                      paste0('MATCH (a:Award) WHERE a.Title IN ["', 
                             as.character(funded_project$Title[i]), 
                             '"] RETURN a'))
  
  subaw_gr <- getSingleNode(ec_graph, 
                            paste0("MATCH (s:Sub_Award) WHERE s.AwardNumber = '", 
                                   funded_project$AwardNumber[i], 
                                   "' RETURN s"))
  
  # Funded individuals are based at institutions:
  ## Not sure if I should cut this, it looks like people are associated with multiple
  ## institutions through grants.
  lapply(people_gr, 
         function(x) { checkRel(x = x, y = org_gr, type = "Based_At", graph = ec_graph) })
  
  checkRel(x = subaw_gr, y = org_gr,   type = "Awarded_to",  graph = ec_graph)
  
  checkRel(x = subaw_gr, y = award_gr, type = "Subaward_of", graph = ec_graph)
  
  # Funded individuals apply for the grants:
  lapply(people_gr, 
         function(x) { checkRel(x = x, y = subaw_gr, type = "Applied_for", graph = ec_graph) })
  
  # A program officer manages the awards:
  
  po_gr <- getSingleNode(ec_graph, 
                         paste0("MATCH (p:Person) WHERE p.name IN ['",
                                unique(funded_project$ProgramManager[i]),
                                "'] RETURN p"))

  checkRel(x = po_gr, y = award_gr, type = "Manages", graph = ec_graph)

}

This is a pretty ugly way of doing it. There’s probably an easier way, but I ran into problems matching the elements of the lists returned by getNodes() with the data in the funded_project table and the unique node lists. Ultimately I went back and just decided to write one big loop. In some ways (hopefully!) this makes things a bit more transparent if you’re just learning how to do this.
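One alternative I didn’t pursue would be to push more of the relationship-building into Cypher itself. The sketch below does it for a single row, using MERGE and the older {param} parameter style; I haven’t tested how the list parameter gets serialized, so take it as a rough idea rather than working code:

# Sketch: create "Applied_for" relationships for the people on one sub-award.
copis <- unique(c(funded_project$PrincipalInvestigator[1],
                  strsplit(funded_project$Co.PIName.s.[1], ', ')[[1]]))

cypher(ec_graph,
       "MATCH (p:Person), (s:Sub_Award {AwardNumber: {awd}})
        WHERE p.name IN {names}
        MERGE (p)-[:Applied_for]->(s)
        RETURN count(p) AS linked",
       awd   = funded_project$AwardNumber[1],
       names = as.list(copis))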

6 Checking that things worked

We can take a look at a local network now that we’ve built the full graph. Let’s use the Cypher query language developed by neo4j to look at one local network:

rels <- getRels(graph = ec_graph, 
                  query = paste0('MATCH ptp = (p:Person {name:"Simon Goring"})-[*0..2]-(pp:Person)',
                                 'RETURN relationships(ptp)'))

nodes <- getNodes(graph = ec_graph, 
                  query = paste0('MATCH ptp = (p:Person {name:"Simon Goring"})-[*0..2]-(pp:Person)',
                                 'RETURN nodes(ptp)'))

node_edge <- function(node, edge) {
  
  # For each path returned by the query, pull out the node ID, its label, a
  # display name (which property holds the name depends on the label), and a
  # count that will be used for node size in the plot:
  nodes <- lapply(1:length(node), function(x){
    data.frame(id   = sapply(node[[x]], function(y) y$metadata$id),
               type = sapply(node[[x]], function(y) y$metadata$labels[[1]]), 
               name = sapply(node[[x]], 
                             function(y) {
                               if (y$metadata$labels %in% "Person")       return(y$data$name)
                               if (y$metadata$labels %in% "Organization") return(y$data$Organization)
                               if (y$metadata$labels %in% "Sub_Award")    return(y$data$AwardNumber)
                               if (y$metadata$labels %in% "Award")        return(y$data$Title)}),
               count = sapply(node[[x]], function(y) ifelse(y$metadata$labels %in% c("Sub_Award", "Award"), 
                                                            1, y$data$count)))})
  
  nodes <- nodes %>% bind_rows() %>% unique()
  
  # Relationship endpoints come back as node URLs; the numeric ID is at the end:
  edges <- lapply(1:length(edge), function(x) {
    if (length(edge[[x]]) == 0) return(NULL)
    data.frame(source = sapply(edge[[x]], function(y) stringr::str_extract(y$start, "[0-9]*$")),
               target = sapply(edge[[x]], function(y) stringr::str_extract(y$end, "[0-9]*$")))
  })
  
  edges <- edges %>% bind_rows() %>% unique()
  
  # networkD3 expects zero-indexed node IDs, so renumber the nodes and remap the edges:
  nodes$newID <- seq(0, nrow(nodes) - 1)
  
  edges$newSource <- nodes$newID[match(edges$source, nodes$id)]
  edges$newTarget <- nodes$newID[match(edges$target, nodes$id)]
  
  edges <- edges[order(edges$newTarget),]
  
  return(list(nodes = nodes, edges = edges))
}

summary_graph <- node_edge(nodes, rels)

summary_graph$nodes$type[which(summary_graph$nodes$name == "Simon Goring")] <- "Me"
networkD3::forceNetwork(Links  = summary_graph$edges,
                        Nodes  = summary_graph$nodes,
                        Source = "newSource",
                        Target = "newTarget",
                        NodeID = "name",
                        Nodesize = "count",
                        Group  = "type",
                        linkWidth = 1,
                        opacity = 1,
                        zoom = TRUE,
                        legend = TRUE)

So, the process of getting the data into networkD3 isn’t super straightforward; I had to write a little function to break apart the results of the cypher queries. But regardless, you can now see my local network, and other cypher queries could be used to generate other patterns. For example, we could find out how close organizations are to one another. I think this is another place where helper functions would be really useful: either methods for RNeo4j objects that would let them plug right into the plotting functions in networkD3 and igraph, or a helper function to convert between data classes.
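As a rough example of the kind of helper I have in mind, here is a sketch that feeds the same node and edge tables into igraph instead (assuming the summary_graph object built above; the attribute names are just illustrative):

# A sketch of converting the node/edge tables into an igraph object:
library(igraph)

g <- graph_from_data_frame(
  d = data.frame(from = summary_graph$edges$newSource,
                 to   = summary_graph$edges$newTarget),
  vertices = data.frame(id    = summary_graph$nodes$newID,
                        label = summary_graph$nodes$name,
                        type  = summary_graph$nodes$type),
  directed = FALSE)

plot(g, vertex.label = NA, vertex.size = 3)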

7 Institutional Proximity

To understand how tightly institutions are connected through EarthCube grants we have to construct a cypher query (below) that checks distances across the network:

MATCH (o:Organization)
WITH collect(ID(o)) AS orgs
MATCH (o1:Organization), (o2:Organization)
MATCH path=shortestPath((o1)-[*..9]-(o2))
WHERE ID(o1) IN orgs 
  AND ID(o2) IN orgs 
  AND o1 <> o2
  AND NONE(rel in rels(path) WHERE type(rel)='Manages')
RETURN o1.Organization AS start, o2.Organization AS end, length(path) AS steps

We pair Organizations, ensuring that the start (o1) and end (o2) organizations are not the same (AND o1 <> o2). I added an extra constraint that the path may not pass through an NSF program officer; this is what the NONE(rel in rels(path) WHERE type(rel)='Manages') clause does. A program officer is a Person, so we can’t eliminate them on that basis (and it makes sense that they are a Person, since program officers come from academia and so could theoretically also apply for grants), but we can make sure that the paths we select never traverse a “Manages” relationship.

We search for paths up to 9 steps in length ([*..9]), obtaining the shortest path in each case (we could use allShortestPaths if we wanted multiple options, but here we just care about the end points and their length).

path_dist <- cypher(graph = ec_graph, query = "MATCH (o:Organization)
  WITH collect(ID(o)) AS orgs
  MATCH (o1:Organization), (o2:Organization)
  MATCH path=shortestPath((o1)-[*..9]-(o2))
  WHERE ID(o1) IN orgs AND ID(o2) IN orgs AND o1 <> o2
  AND NONE(rel in rels(path) WHERE type(rel)='Manages')
  RETURN o1.Organization AS start, o2.Organization AS end, length(path) AS steps")

avg_dist <- path_dist %>% 
  group_by(start) %>% 
  summarise(dist = mean(steps), n = n()) %>% 
  arrange(dist) %>% filter(n > 10)

knitr::kable(head(avg_dist, n = 10))
|start                                             |     dist|  n|
|:-------------------------------------------------|--------:|--:|
|Columbia University                               | 5.141026| 78|
|University of Colorado at Boulder                 | 5.166667| 78|
|Arizona Geological Survey                         | 5.384615| 78|
|Stroud Water Research Center                      | 5.604938| 81|
|Utah State University                             | 5.609756| 82|
|Incorporated Research Institutions for Seismology | 5.708861| 79|
|University Corporation For Atmospheric Res        | 5.797468| 79|
|University of Southern California                 | 5.846154| 78|
|University of Texas at Austin                     | 5.947368| 76|
|University of Wisconsin-Madison                   | 6.027397| 73|

I’m just showing the ten institutions with the shortest average paths to other institutions, and I’m happy to report that UW-Madison is one of them! There were a couple of institutions with very low mean distances that were only connected to one other institution; that’s why I added the filter(n > 10) constraint when generating avg_dist above.

Given a bit of knowledge about EarthCube and the funded projects, it makes sense that many of these universities and institutions are at the top of the list. The average distance between institutions is about 7 steps, with 16 institutions failing to find close neighbors within 9 steps. In general this is because the organization only ever participated in single-institution proposals with EarthCube, and the PIs were not involved in other EarthCube proposals. An additional wrinkle is that the proposals examined here contained the term “EarthCube” anywhere in the award record, including the abstract, so it’s possible that unrelated projects that simply referred to standards coming out of EarthCube were counted. Regardless, that’s more a statement about the impact of the EarthCube program than anything else.
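Both of those numbers are easy to check against the objects we already have; something like the following (using the orgs and path_dist tables from above):

# Average shortest-path length across all organization pairs:
mean(path_dist$steps)

# Organizations that never found a path to another organization within 9 steps:
setdiff(orgs$Organization, unique(path_dist$start))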

Mean path length is tightly coupled to the number of institutions an institution is connected to. This makes sense intuitively, but the relationship has an interesting sharp downward curve:

# UW-Madison is highlighted in red (col = 2); all other institutions are blue:
plot(dist ~ n, data = avg_dist, pch = 19,
     col = ifelse(avg_dist$start == "University of Wisconsin-Madison", 2, 4))

8 One Last Pretty Graph

rels <- getRels(graph = ec_graph, 
                  query = paste0('MATCH ptp = (p:Person {name:"Simon Goring"})-[*0..5]-(pp:Person)',
                                 'RETURN relationships(ptp)'))

nodes <- getNodes(graph = ec_graph, 
                  query = paste0('MATCH ptp = (p:Person {name:"Simon Goring"})-[*0..5]-(pp:Person)',
                                 'RETURN nodes(ptp)'))

summary_graph <- node_edge(nodes, rels)

summary_graph$nodes$type[which(summary_graph$nodes$name == "Simon Goring")] <- "Me"
networkD3::forceNetwork(Links  = summary_graph$edges,
                        Nodes  = summary_graph$nodes,
                        Source = "newSource",
                        Target = "newTarget",
                        NodeID = "name",
                        Nodesize = "count",
                        Group  = "type",
                        linkWidth = 1,
                        opacity = 1,
                        zoom = TRUE,
                        legend = TRUE)

Here you can see the path that links me to other EarthCube researchers, up to 5 steps away. Lots of it runs through our Integrative Activities Grant, which links UW-Madison to five other institutions, but also through the University of Wisconsin as a hub for connections. An interesting feature of our IA grant is that it links to many institutions that are associated with only one EarthCube grant.

9 Conclusions

This is my first real deep-dive into using neo4j, but it was pretty fun. I didn’t show you the user interface, but the browser is nice and clean, and relatively easy to use. I crashed Chrome a few times while testing out some of my cypher queries, but re-connecting to the database was super simple.

I think that the package is really great, and made it pretty easy to get my feet wet. But as I started doing more complicated things, I feel like either I need to get more comfortable with the RNeo4j tools, or the package could use some more helper functions. Better integration with packages like networkD3 or igraph would be fantastic. That said, take a look at my hacky code and let me know where I can improve.

As I said above, you’re welcome to fork, make pull requests or otherwise tinker with the code in my EC_Engagement GitHub repository. Enjoy!