vignettes/pkgnet-intro.Rmd
pkgnet is an R package designed for the analysis of R packages. It builds graph representations of a package’s various types of dependencies, which can inform activities such as deciding which functions to unit test next and finding candidate functions for consolidation.

Below is a brief tour of pkgnet and its features.
pkgnet represents aspects of R packages as graphs. The two default reporters, which we will discuss in this vignette, model their respective aspects as directed graphs: a package’s dependencies on other packages, and the interdependencies of functions within a package.
Before we look at the output of pkgnet, here are a few core concepts to keep in mind.
Units of the analysis are represented as nodes, and their dependency relationships are represented as edges (a.k.a. arcs or arrows). In pkgnet, the nodes could be functions in the package you are examining, or other packages that the package depends on. Edges point in the direction of dependency: the tail node depends on the head node.[1]
In the example dependency graph above:
Following the direction of the edges allows you to figure out the dependencies of a node—the other nodes that it depends on. On the flip side, tracing the edges backwards allows you to figure out the reverse dependencies (i.e., dependents) of a node—the other nodes that depend on it.
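As a rough illustration of following edges forwards and backwards, here is a minimal base R sketch. The edge list is copied from the baseballstats function network printed later in this vignette. The reachable() helper is hypothetical and is not part of pkgnet.

```r
# Edge list copied from the baseballstats function network printed later in
# this vignette: each row reads "from depends on to".
edges <- data.frame(
    from = c("slugging_avg", "batting_avg", "OPS", "OPS"),
    to   = c("at_bats", "at_bats", "slugging_avg", "batting_avg"),
    stringsAsFactors = FALSE
)

# Hypothetical helper (not part of pkgnet): collect every node reachable from
# `start` by repeatedly following edges from `from_col` to `to_col`.
reachable <- function(edges, start, from_col, to_col) {
    visited <- character(0)
    frontier <- start
    while (length(frontier) > 0) {
        next_nodes <- unique(edges[[to_col]][edges[[from_col]] %in% frontier])
        frontier <- setdiff(next_nodes, c(visited, start))
        visited <- c(visited, frontier)
    }
    visited
}

# Dependencies: follow edges forwards (tail -> head)
deps_of_OPS <- reachable(edges, "OPS", from_col = "from", to_col = "to")

# Reverse dependencies: trace edges backwards (head -> tail)
revdeps_of_at_bats <- reachable(edges, "at_bats", from_col = "to", to_col = "from")

print(deps_of_OPS)
print(revdeps_of_at_bats)
```

Here OPS has three dependencies (direct or indirect), and at_bats has three reverse dependencies.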
pkgnet can analyze any R package locally installed. (Run installed.packages() to see the full list of packages installed on your system.) For this example, let’s say we are analyzing a custom built package, baseballstats.

To analyze baseballstats, run the following two lines of code:
library(pkgnet)
report1 <- CreatePackageReport(pkg_name = "baseballstats")
THAT’S IT! You have generated a lot of valuable information with that one call for an installed package.
However, if the full source repository for the package is available on your system, you can supplement this report with other information such as code coverage from covr. To do so, specify the path to the repository in CreatePackageReport.
library(pkgnet)
report2 <- CreatePackageReport(
    pkg_name = "baseballstats"
    , pkg_path = <path to the repo>
)
CreatePackageReport() creates an HTML report with the pertinent information, and it also returns an object with the report information and more. The location of the HTML report is specified in the messages in the terminal, but it should render automatically in your browser.
The following reporter sections display in the HTML report, and their content is also attached as public bindings in the PackageReport object returned from CreatePackageReport().
SummaryReporter: This section displays general information about the package. The returned object contains basic information like the package name and path.

DependencyReporter: This section displays information regarding the packages upon which the current package under analysis depends. This includes both base and third-party R packages. The returned object contains graph visualizations, graph measures and data tables among other methods.

FunctionReporter: This section displays information regarding the functions within the current package under analysis and their interdependence network. The returned object contains graph visualizations, graph measures and data tables among other methods.

InheritanceReporter: While not generated by default (as not all packages are object oriented), this reporter is very useful when investigating the parent-child structure of R6, S4 or Reference Class (a.k.a. “R5”) objects. The inheritance graph is displayed in the report along with other information. The returned object contains graph visualizations, graph measures and data tables among other methods.
Aside from the Package Summary section and its returned object, each reporter is based around a graph structure. Let’s look at the FunctionReporter from baseballstats in more detail.
Here’s how the Function Network Visualization looks for baseballstats. Note that its appearance differs depending on whether pkg_path is specified in CreatePackageReport():
Default
All functions and their dependencies are visible. For example, we can see that both the batting_avg and slugging_avg functions depend upon the at_bats function.

We also see that nothing depends on the on_base_pct function. This might be valuable information to an R package developer.
With Coverage Information
Same as the default visualization except we can see coverage information as well (Pink = 0%, Green = 100%).
It appears the function with the most dependencies, at_bats, is well covered. However, no other functions are covered by unit tests.
Metrics for the nodes (either packages, functions, or classes depending on the reporter) are contained in a table:
colSubset <- c('node','type','betweenness','outDegree','inDegree','numRecursiveDeps')
report2$FunctionReporter$nodes[,..colSubset]
#> Key: <node>
#>            node     type betweenness outDegree inDegree numRecursiveDeps
#>          <char>   <char>       <num>     <int>    <int>            <int>
#> 1:          OPS function         0.0         2        0                3
#> 2:      at_bats function         0.0         0        2                0
#> 3:  batting_avg function         0.5         1        1                1
#> 4:  on_base_pct function         0.0         0        0                0
#> 5: slugging_avg function         0.5         1        1                1
Note, a few of these metrics provided by default are from the field of Network Theory. You can leverage the Network Graph Model Object described below to derive many more.
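To make the degree and betweenness columns concrete, the sketch below rebuilds the five-node baseballstats graph directly in igraph and computes the same three measures with plain igraph calls. The edge list is taken from the igraph printout in this vignette. This is only an illustration of what the measures mean, not a description of how pkgnet computes them internally.

```r
# Rebuild the baseballstats function graph from the edges printed in this
# vignette. on_base_pct has no edges, so it is supplied as a vertex only.
edges <- data.frame(
    from = c("slugging_avg", "batting_avg", "OPS", "OPS"),
    to   = c("at_bats", "at_bats", "slugging_avg", "batting_avg")
)
vertices <- data.frame(
    name = c("OPS", "at_bats", "batting_avg", "on_base_pct", "slugging_avg")
)
g <- igraph::graph_from_data_frame(edges, directed = TRUE, vertices = vertices)

node_metrics <- data.frame(
    node        = igraph::V(g)$name,
    # Number of functions this node depends on directly
    outDegree   = igraph::degree(g, mode = "out"),
    # Number of functions that depend on this node directly
    inDegree    = igraph::degree(g, mode = "in"),
    # Share of shortest dependency paths that pass through this node
    betweenness = igraph::betweenness(g, directed = TRUE),
    row.names   = NULL
)
print(node_metrics)
```

The resulting values can be compared against the corresponding columns of the nodes table above.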
Network-level measures are contained in a network_measures list.
report2$FunctionReporter$network_measures
#> $packageTestCoverage.mean
#> [1] 0.0952381
#>
#> $packageTestCoverage.betweenessWeightedMean
#> [1] 0
#>
#> $graphOutDegree
#> [1] 0.3
#>
#> $graphInDegree
#> [1] 0.3
#>
#> $graphBetweenness
#> [1] 0.03125
The network model object is stored in the reporter’s pkg_graph field. The underlying igraph object is directly accessible via pkg_graph$igraph.
report2$FunctionReporter$pkg_graph$node_measures(c('hubScore', 'authorityScore'))
#> Key: <node>
#>            node hubScore authorityScore
#>          <char>    <num>          <num>
#> 1:          OPS        1            0.0
#> 2:      at_bats        0            1.0
#> 3:  batting_avg        1            0.5
#> 4:  on_base_pct        0            0.0
#> 5: slugging_avg        1            0.5
report2$FunctionReporter$pkg_graph$igraph
#> IGRAPH b97bc79 DN-- 5 4 --
#> + attr: name (v/c)
#> + edges from b97bc79 (vertex names):
#> [1] slugging_avg->at_bats      batting_avg ->at_bats
#> [3] OPS         ->slugging_avg OPS         ->batting_avg
With the reports and objects produced by pkgnet by default, there is plenty to inform us on the inner workings of an R package. However, we may want to know more. Since the igraph objects are available, we can leverage those graphs for further analysis.
In this section, let’s examine a larger R package, such as lubridate. If you would like to follow along with the examples in this section, run these commands in your terminal to download and install lubridate.[2]
# Create a temporary workspace
mkdir -p ~/pkgnet_example && cd ~/pkgnet_example
# Grab the lubridate source code
git clone https://github.com/tidyverse/lubridate
cd lubridate
# If you want the examples to match exactly
git reset --hard 9797d69abe1574dd89310c834e52d358137669b8
# Install it
R CMD INSTALL .
Let’s examine lubridate’s functions through the lens of each function’s total number of dependents (i.e., the other functions that depend on it) and its code’s unit test coverage. In our graph model for the FunctionReporter, the subgraph of paths leading into a given node is the set of functions that directly or indirectly depend on the function that node represents.
# Run pkgnet
library(pkgnet)
report2 <- CreatePackageReport(
    pkg_name = "lubridate"
    , pkg_path = "~/pkgnet_example/lubridate"
)
# Extract Nodes Table
funcNodes <- report2$FunctionReporter$nodes
# List Coverage For Most Depended-on Functions
mostRef <- funcNodes[order(numRecursiveRevDeps, decreasing = TRUE),
                     .(node, numRecursiveRevDeps, coverageRatio, totalLines)
                     ][1:10]
# Print the top ten rows
mostRef
#>             node numRecursiveRevDeps coverageRatio totalLines
#>  1:        month                  81             1          1
#>  2:           tz                  79             1          1
#>  3: reclass_date                  68             1          1
#>  4:         date                  67             1          1
#>  5:      is.Date                  60             1          1
#>  6:    is.POSIXt                  57             1          1
#>  7:         wday                  56             1          1
#>  8:   is.POSIXct                  55             1          1
#>  9:  .deprecated                  55             0         10
#> 10:      as_date                  52             1          1
Inspecting results such as these can help an R package developer decide which function to cover with unit tests next.
In this case, check_duration, one of the most depended-on functions (either directly or indirectly), is not covered by unit tests. However, it appears to be a simple one-line function that may not be necessary to cover in unit testing. check_interval, on the other hand, might benefit from some unit test coverage, as it is a larger, uncovered function with a similar number of dependents.
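The idea behind counting a function’s direct and indirect dependents can be sketched on the small baseballstats graph from earlier in this vignette. The code assumes that a count like numRecursiveRevDeps corresponds to the number of nodes with a path leading into a given node. It uses igraph::subcomponent() for illustration and is not pkgnet’s own implementation.

```r
# Same baseballstats edges as printed earlier in this vignette
# (each row reads "from depends on to")
edges <- data.frame(
    from = c("slugging_avg", "batting_avg", "OPS", "OPS"),
    to   = c("at_bats", "at_bats", "slugging_avg", "batting_avg")
)
g <- igraph::graph_from_data_frame(edges, directed = TRUE)

# For each node, count the nodes that can reach it (mode = "in"),
# excluding the node itself, which subcomponent() always includes.
node_names <- igraph::V(g)$name
num_rev_deps <- vapply(
    node_names,
    function(v) length(igraph::subcomponent(g, v, mode = "in")) - 1L,
    integer(1)
)

print(sort(num_rev_deps, decreasing = TRUE))
```

at_bats comes out on top with three dependents, which is why it would be the first candidate to check for test coverage in that package.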
Looking at that same large package, let’s say we want to explore options for consolidating functions. One approach might be to explore consolidating functions that share the same dependencies. In that case, we could use the igraph object to highlight functions with the same out-neighborhood via Jaccard similarity.
# Get igraph object
funcGraph <- report2$FunctionReporter$pkg_graph$igraph
funcNames <- igraph::vertex_attr(funcGraph, name = "name")
# Jaccard Similarity
sim <- igraph::similarity(graph = funcGraph
                          , mode = "out"
                          , method = "jaccard")
diag(sim) <- 0
sim[sim < 1] <- 0
simGraph <- igraph::graph_from_adjacency_matrix(adjmatrix = sim, mode = "undirected")
# Find groups with same out-neighbors (similarity == 1)
sameDeps <- igraph::max_cliques(graph = simGraph
                                , min = 2
                                )
# Write results
for (i in seq_along(sameDeps)) {
    cat(paste0("Group ", i, ": "))
    cat(paste(funcNames[as.numeric(sameDeps[[i]])], collapse = ", "))
    cat("\n")
}
#> Group 1: divisible_period, make_date
#> Group 2: parse_date_time2, fast_strptime
#> Group 3: .deprecated_fun, .deprecated_arg
#> Group 4: stamp_date, stamp_time
#> Group 5: epiweek, isoweek
#> Group 6: ms, hm
#> Group 7: quarter, semester
#> Group 8: am, .roll_hms
#> Group 9: modulo_interval_by_duration, modulo_interval_by_period
#> Group 10: .difftime_from_pieces, .duration_from_units
#> Group 11: divide_period_by_period, xtfrm.Period
#> Group 12: int_diff, %--%
#> Group 13: isoyear, epiyear
#> Group 14: nanoseconds, microseconds, picoseconds, milliseconds
#> Group 15: period_to_seconds, check_period, multiply_period_by_number, format.Period, divide_period_by_number, add_period_to_period
#> Group 16: myd, dmy, yq, ymd, dym, mdy, ydm
#> Group 17: hours, weeks, minutes, years, days, months.numeric, seconds, seconds_to_period
#> Group 18: C_force_tz, hour.default, mday.default, c.POSIXct, .mklt, yday.default, year.default, minute.default, second.default
#> Group 19: ehours, emilliseconds, eyears, eseconds, epicoseconds, enanoseconds, eminutes, olson_time_zones, edays, emicroseconds, eweeks
#> Group 20: dmy_h, ydm_hms, ymd_hms, dmy_hm, ymd_h, ydm_hm, ydm_h, dmy_hms, ymd_hm, mdy_hms, mdy_hm, mdy_h
Now, we have identified twenty different groups of functions within lubridate that share the exact same dependencies. We could explore each group of functions for potential consolidation.
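For readers unfamiliar with Jaccard similarity, here is a small worked example in base R. It uses the out-neighborhoods (direct dependencies) from the baseballstats graph printed earlier in this vignette. The jaccard() helper is hypothetical and only illustrates the set formula, not igraph’s internal implementation.

```r
# Out-neighborhoods (direct dependencies) taken from the baseballstats edges
# printed earlier in this vignette
out_neighbors <- list(
    OPS          = c("slugging_avg", "batting_avg"),
    batting_avg  = "at_bats",
    slugging_avg = "at_bats"
)

# Hypothetical helper: Jaccard similarity of two sets,
# |A intersect B| / |A union B|
jaccard <- function(a, b) {
    length(intersect(a, b)) / length(union(a, b))
}

# Identical dependencies give a similarity of 1
same_deps <- jaccard(out_neighbors$batting_avg, out_neighbors$slugging_avg)

# No shared dependencies give a similarity of 0
no_shared_deps <- jaccard(out_neighbors$OPS, out_neighbors$batting_avg)

print(c(same_deps = same_deps, no_shared_deps = no_shared_deps))
```

A similarity of exactly 1 is the condition the lubridate example filters on. Note that this simple helper returns NaN when both sets are empty.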
Want to know even more about the pkgnet package? Run pkgnet on itself!
install.packages("pkgnet")
pkgnetObj <- CreatePackageReport("pkgnet", c(DefaultReporters(), InheritanceReporter$new()))
Want to see pkgnet reports for other packages? Check out the pkgnet Gallery.
Want to ship a pkgnet report with your R package? Include it as a vignette in your package. See Publishing Your pkgnet Package Report.
[1] This follows the Unified Modeling Language (UML) framework, a widely used standard for software system modeling.
[2] Examples are from version 1.7.3 of lubridate.