| Title: | Computation of Node and Path-Level Risk Scores in Scientific Models |
|---|---|
| Description: | It leverages the network-like architecture of scientific models together with software quality metrics to identify chains of function calls that are more prone to generating and propagating errors. It operates on tbl_graph objects representing call dependencies between functions (callers and callees) and computes risk scores for individual functions and for paths (sequences of function calls) based on cyclomatic complexity, in-degree and betweenness centrality. The package supports variance-based uncertainty and sensitivity analyses after Puy et al. (2022) <doi:10.18637/jss.v102.i05> to assess how risk scores change under alternative risk definitions. |
| Authors: | Arnald Puy [aut, cre] (ORCID: <https://orcid.org/0000-0001-9469-2156>) |
| Maintainer: | Arnald Puy <[email protected]> |
| License: | GPL-3 |
| Version: | 0.2.0 |
| Built: | 2026-05-31 06:45:26 UTC |
| Source: | https://github.com/arnaldpuy/softwarerisk |
Given a directed call graph (tidygraph::tbl_graph) with a node attribute for
cyclomatic complexity, all_paths_fun():

- computes node-level metrics (in-degree, out-degree, betweenness),
- calculates a node risk score as a weighted combination of rescaled metrics,
- enumerates all simple paths from entry nodes (in-degree = 0) to sink nodes (out-degree = 0),
- computes path-level summaries and a path-level risk score,
- calculates the Gini index and the slope of node risk along each path.
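The path enumeration step can be illustrated with a minimal base-R sketch. It assumes a toy edge list with made-up function names and a hypothetical helper `walk()`; it is not the package's implementation, which operates on tbl_graph objects.

```r
# Toy call graph as an edge list (caller -> callee); names are made up.
edges <- data.frame(
  from = c("main", "main", "read", "fit", "fit"),
  to   = c("read", "fit", "parse", "parse", "report"),
  stringsAsFactors = FALSE
)
nodes   <- union(edges$from, edges$to)
entries <- setdiff(nodes, edges$to)    # in-degree = 0
sinks   <- setdiff(nodes, edges$from)  # out-degree = 0

# Depth-first search that never revisits a node already on the current path,
# so only simple paths are produced. Hypothetical helper for illustration.
walk <- function(path) {
  current <- path[length(path)]
  if (current %in% sinks) return(list(path))
  callees <- setdiff(edges$to[edges$from == current], path)
  unlist(lapply(callees, function(v) walk(c(path, v))), recursive = FALSE)
}

all_paths <- unlist(lapply(entries, function(v) walk(v)), recursive = FALSE)
path_str  <- vapply(all_paths, paste, character(1), collapse = " -> ")
path_str
```

Each element of `all_paths` is a character vector of function names ordered from the entry node to a sink.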

    all_paths_fun(
      graph,
      alpha = 0.6,
      beta = 0.3,
      gamma = 0.1,
      p = 1,
      eps = 1e-12,
      complexity_col = "cyclo",
      weight_tol = 1e-08
    )

| Argument | Description |
|---|---|
| graph | A directed tidygraph::tbl_graph. |
| alpha, beta, gamma | Numeric non-negative weights for the risk score, constrained such that alpha + beta + gamma = 1. |
| p | Numeric scalar. Power parameter for the weighted power mean. Must be finite and lie in the interval … |
| eps | Numeric. Small positive constant used for numerical stability in the geometric-mean limit (p → 0). |
| complexity_col | Character scalar. Name of the node attribute containing cyclomatic complexity. Default "cyclo". |
| weight_tol | Numeric tolerance for enforcing the weight-sum constraint. Default 1e-08. |
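The normalized node metrics are computed using scales::rescale() and denoted
by a tilde: $\tilde{C}_i$ (cyclomatic complexity), $\tilde{I}_i$ (in-degree) and $\tilde{B}_i$ (betweenness).
The risk score $r_i$ for node $i$ is computed as
the weighted power mean of normalized metrics:

$$ r_i = \left( \alpha\, \tilde{C}_i^{\,p} + \beta\, \tilde{I}_i^{\,p} + \gamma\, \tilde{B}_i^{\,p} \right)^{1/p}, $$

where $p$ is the power-mean parameter. When $p = 1$ this reduces to
a weighted sum (additive). In the limit $p \to 0$, this reduces
to a weighted geometric mean, implemented with a small constant $\varepsilon$ to
ensure numerical stability:

$$ r_i = \exp\left( \alpha \log \max(\tilde{C}_i, \varepsilon) + \beta \log \max(\tilde{I}_i, \varepsilon) + \gamma \log \max(\tilde{B}_i, \varepsilon) \right). $$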
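A minimal base-R sketch of the node risk score follows. The helper names `node_risk_sketch()` and `rescale01()` are hypothetical. The pairing of alpha, beta and gamma with complexity, in-degree and betweenness, and the use of pmax() to apply eps, are assumptions rather than confirmed details of the package.

```r
# Hypothetical helper, not part of the package: weighted power mean of three
# metrics that have already been rescaled to [0, 1].
node_risk_sketch <- function(cc, indeg, btw,
                             alpha = 0.6, beta = 0.3, gamma = 0.1,
                             p = 1, eps = 1e-12) {
  stopifnot(abs(alpha + beta + gamma - 1) < 1e-8)
  if (abs(p) < eps) {
    # Limit p -> 0: weighted geometric mean; eps keeps log() finite at zero.
    exp(alpha * log(pmax(cc, eps)) +
        beta  * log(pmax(indeg, eps)) +
        gamma * log(pmax(btw, eps)))
  } else {
    (alpha * cc^p + beta * indeg^p + gamma * btw^p)^(1 / p)
  }
}

# Min-max rescaling to [0, 1], which is what scales::rescale() does by default.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

cc    <- rescale01(c(2, 15, 40))   # made-up cyclomatic complexities
indeg <- rescale01(c(0, 3, 1))     # made-up in-degrees
btw   <- rescale01(c(0, 8, 2))     # made-up betweenness values

risk_additive  <- node_risk_sketch(cc, indeg, btw, p = 1)
risk_geometric <- node_risk_sketch(cc, indeg, btw, p = 0)
```

With the same weights, the geometric variant never exceeds the additive one, so it penalises nodes that score low on any single metric.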
The path-level risk score is calculated as

$$ P_k = 1 - \prod_{i=1}^{n_k} \left( 1 - r_{k(i)} \right), $$

where $r_{k(i)}$ is the risk of the $i$-th function in path $k$ and
$n_k$ is the number of functions in that path. The equation behaves like a
saturating OR-operator: $P_k$ is at least as large as the maximum individual
function risk and increases monotonically as more functions on the path become risky,
approaching 1 when several functions have high risk scores.
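The saturating behaviour can be checked with a small sketch, assuming the path score is one minus the product of the complements of the node risks. The helper `path_risk_sketch()` and the risk values are made up for illustration.

```r
# Hypothetical helper: OR-like aggregation of node risks along one path.
path_risk_sketch <- function(r) 1 - prod(1 - r)

low_risk_path   <- path_risk_sketch(c(0.1, 0.2))
mixed_risk_path <- path_risk_sketch(c(0.1, 0.2, 0.9))

# The path score is never below the largest node risk on the path.
c(low = low_risk_path, mixed = mixed_risk_path)
```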
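The Gini index of path $k$ is computed as

$$ G_k = \frac{\sum_{i=1}^{n_k} \sum_{j=1}^{n_k} \left| r_{k(i)} - r_{k(j)} \right|}{2\, n_k^2\, \bar{r}_k}, $$

where $\bar{r}_k$ is the mean node risk in path $k$.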
Finally, the trend of risk is defined by the slope $\theta_{1}$ of the regression

$$ r_{k(i)} = \theta_{0} + \theta_{1}\, i + e_{i}, $$

where $r_{k(i)}$ is the risk score of the function at position $i$ along
path $k$ (ordered from upstream to downstream execution) and $e_{i}$ is a residual term.
The returned paths tibble includes path_cc, a list-column where each
element is the vector of per-node cyclomatic complexity values along the path.
A named list with two tibbles:

- nodes: node-level metrics with columns name, cyclomatic_complexity,
  indeg (in-degree), outdeg (out-degree), btw (betweenness), risk_score.
- paths: path-level metrics with columns path_id, path_nodes,
  path_str, hops, path_risk_score, path_cc, gini_node_risk,
  risk_slope, risk_mean, risk_sum.

    # synthetic_graph is a tidygraph::tbl_graph with node attribute "cyclo"
    data(synthetic_graph)

    # additive risk (p = 1, default)
    out1 <- all_paths_fun(
      graph = synthetic_graph,
      alpha = 0.6, beta = 0.3, gamma = 0.1,
      p = 1,
      complexity_col = "cyclo"
    )

    # power-mean risk (p = 0 ~ weighted geometric mean)
    out2 <- all_paths_fun(
      graph = synthetic_graph,
      alpha = 0.6, beta = 0.3, gamma = 0.1,
      p = 0, eps = 1e-12,
      complexity_col = "cyclo"
    )

    out1$nodes
    out1$paths

Computes the Gini index (a measure of inequality) for a numeric vector.
Non-finite (NA, NaN, Inf) values are removed prior to computation.
If fewer than two finite values remain, the function returns 0.

    gini_index_fun(x)

| Argument | Description |
|---|---|
| x | Numeric vector. |
The Gini index ranges from 0 (perfect equality) to 1 (maximal inequality).
A numeric scalar giving the Gini index of x.
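For intuition, a Gini index can be sketched in base R from the mean absolute difference between all pairs of values. This is an assumed formulation with a hypothetical helper `gini_sketch()`; the package may use a different but related estimator, and the guard for a zero mean is an addition made here.

```r
# Hypothetical re-implementation, for illustration only.
gini_sketch <- function(x) {
  x <- x[is.finite(x)]                      # drop NA, NaN, Inf
  n <- length(x)
  if (n < 2 || mean(x) == 0) return(0)      # zero-mean guard is an assumption
  sum(abs(outer(x, x, "-"))) / (2 * n^2 * mean(x))
}

g_equal   <- gini_sketch(c(1, 1, 1, 1))
g_unequal <- gini_sketch(c(1, 2, 3, 4))
g_cleaned <- gini_sketch(c(NA, 1, 2, Inf, 3))
```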

    gini_index_fun(c(1, 1, 1, 1))
    gini_index_fun(c(1, 2, 3, 4))
    gini_index_fun(c(NA, 1, 2, Inf, 3))

Compute how much the risk score of the riskiest paths would decrease if selected high-risk nodes were made perfectly reliable (risk fixed to 0), and visualise the result as a heatmap.

    path_fix_heatmap(all_paths_out, n_nodes = 20, k_paths = 20)

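| Argument | Description |
|---|---|
| all_paths_out | A list returned by all_paths_fun(). |
| n_nodes | Integer, number of top-risk nodes (by risk_score) to include. Defaults to 20. |
| k_paths | Integer, number of top-risk paths (by path_risk_score) to include. Defaults to 20. |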
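For each of the top n_nodes nodes ranked by risk_score and
the top k_paths paths ranked by path_risk_score, the function
sets the risk of that node to 0 along the path (for all its occurrences)
and recomputes the path risk score under the independence assumption,
using

$$ P_k^{(-j)} = 1 - \prod_{i=1}^{n_k} \left( 1 - r_{k(i)}^{(-j)} \right), $$

where $r_{k(i)}^{(-j)}$ equals 0 if the $i$-th function of path $k$ is node $j$ and equals the node risk $r_{k(i)}$ otherwise.
The improvement

$$ \Delta R_{jk} = P_k - P_k^{(-j)}, $$

with $P_k$ the original risk score of path $k$, is used as the fill value in the heatmap cells.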
Bright cells indicate nodes that act as chokepoints for a given path. Rows with many bright cells correspond to nodes whose refactoring would improve many risky paths (global chokepoints), while columns with a few very bright cells correspond to paths dominated by a single risky node.
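The per-cell computation can be sketched in base R as follows. The node names, risk values and the helper `delta_fix()` are hypothetical and only illustrate the idea of zeroing one node's risk and recomputing the path score.

```r
node_risk <- c(parse = 0.7, fit = 0.4, report = 0.1)  # made-up node risks
path      <- c("fit", "parse", "report")              # one made-up path

path_risk <- function(r) 1 - prod(1 - r)

# Reduction in path risk when `node` is made perfectly reliable.
delta_fix <- function(node, path, node_risk) {
  r <- node_risk[path]
  r_fixed <- r
  r_fixed[path == node] <- 0   # all occurrences of the node on this path
  path_risk(r) - path_risk(r_fixed)
}

deltas <- sapply(names(node_risk), delta_fix, path = path, node_risk = node_risk)
deltas
```

In this toy path the largest reduction comes from fixing the riskiest node, which is the pattern a bright heatmap cell would flag.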
A list with two elements:

- delta_tbl: a tibble with columns node, path_id
  and deltaR, containing the reduction in path risk score
  when fixing the node in that path.
- plot: a ggplot2 object containing the heatmap.

    data(synthetic_graph)
    out <- all_paths_fun(
      graph = synthetic_graph,
      alpha = 0.6, beta = 0.3, gamma = 0.1,
      complexity_col = "cyclo"
    )
    res <- path_fix_heatmap(all_paths_out = out, n_nodes = 20, k_paths = 20)
    res

Plot the top n_paths paths ranked by their mean risk score,
with horizontal error bars representing the uncertainty range
(minimum and maximum risk) computed from the Monte Carlo samples
stored in uncertainty_analysis.

    path_uncertainty_plot(ua_sa_out, n_paths = 20)

| Argument | Description |
|---|---|
| ua_sa_out | A list returned by uncertainty_fun(). |
| n_paths | Integer, number of top paths (by mean risk) to include in the plot. Defaults to 20. |
This function is designed to work with the paths component of
the output of uncertainty_fun(). For each path, it summarises the
vector of path risk values by computing the mean, minimum and maximum values, and then displays
these summaries for the n_paths most risky paths.
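The summary step can be sketched in base R. The toy table stands in for the paths component, the draws are random numbers rather than real uncertainty results, and the column names risk_mean, risk_min and risk_max are chosen here for illustration.

```r
set.seed(42)

# Toy stand-in for the paths table: one vector of risk draws per path.
paths <- data.frame(path_id = 1:3)
paths$uncertainty_analysis <- list(
  runif(100, 0.0, 0.3),
  runif(100, 0.2, 0.6),
  runif(100, 0.5, 0.9)
)

# Mean, minimum and maximum of the draws for each path.
draws <- paths$uncertainty_analysis
path_summary <- data.frame(
  path_id   = paths$path_id,
  risk_mean = vapply(draws, mean, numeric(1)),
  risk_min  = vapply(draws, min, numeric(1)),
  risk_max  = vapply(draws, max, numeric(1))
)

# Keep the n_paths paths with the highest mean risk.
n_paths   <- 2
ranked    <- path_summary[order(path_summary$risk_mean, decreasing = TRUE), ]
top_paths <- head(ranked, n_paths)
top_paths
```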
A ggplot2 object.

    data(synthetic_graph)
    out <- all_paths_fun(
      graph = synthetic_graph,
      alpha = 0.6, beta = 0.3, gamma = 0.1,
      complexity_col = "cyclo"
    )
    results <- uncertainty_fun(all_paths_out = out, N = 2^10, order = "first")
    path_uncertainty_plot(ua_sa_out = results, n_paths = 20)

Visualizes the most risky entry-to-sink paths (by decreasing path_risk_score)
computed by all_paths_fun(). Edges that occur on the top paths are
highlighted, with edge colour mapped to the mean path risk and edge width
mapped to the number of top paths using that edge. Nodes on the top paths
are emphasized, with node size mapped to in-degree and node fill mapped to
binned cyclomatic complexity.

    plot_top_paths_fun(
      graph,
      all_paths_out,
      model.name = "",
      language = "",
      top_n = 10,
      alpha_non_top = 0.05
    )

| Argument | Description |
|---|---|
| graph | A directed tidygraph::tbl_graph. |
| all_paths_out | Output from all_paths_fun(). |
| model.name | Character scalar used in the plot title (e.g., model name). |
| language | Character scalar used in the plot title (e.g., language name). |
| top_n | Integer. Number of highest-risk paths to display (default 10). |
| alpha_non_top | Numeric between 0 and 1. Alpha (transparency) for edges that are not on the top-risk paths. Smaller values fade background edges more. |
The function selects the top_n paths by sorting paths_tbl on
path_risk_score (descending). For those paths, it:

- builds an edge list from path_nodes,
- marks graph edges that appear on at least one top path,
- computes path_freq (how many top paths include each edge),
- computes risk_mean_path (mean of risk_sum across top paths that
  include each edge),
- highlights nodes that appear on any top path.
Node fills are based on cyclomatic_complexity using breaks
(-Inf, 10], (10, 20], (20, 50], (50, Inf] as per Watson & McCabe (1996).
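The binning can be reproduced with base R's cut(), which uses right-closed intervals by default. The complexity values below are made up and the variable names are illustrative.

```r
cc <- c(3, 10, 11, 20, 35, 50, 120)   # made-up cyclomatic complexity values

# Right-closed bins: (-Inf, 10], (10, 20], (20, 50], (50, Inf]
cc_bin <- cut(cc, breaks = c(-Inf, 10, 20, 50, Inf))

# Number of functions falling in each complexity class.
bin_counts <- table(cc_bin)
bin_counts
```

A value of exactly 10 lands in the first class and 50 in the third, because each upper bound is inclusive.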
This function relies on external theming/label objects theme_AP() and
lab_expr being available in the calling environment or package namespace.
A ggplot object (invisibly). The plot is also printed as a side effect.
Watson, A. H. and McCabe, T. J. (1996). Structured Testing: A Testing Methodology Using the Cyclomatic Complexity Metric. NIST Special Publication 500-235, National Institute of Standards and Technology, Gaithersburg, MD. doi:10.6028/NIST.SP.500-235

    data(synthetic_graph)
    out <- all_paths_fun(
      graph = synthetic_graph,
      alpha = 0.6, beta = 0.3, gamma = 0.1,
      complexity_col = "cyclo"
    )
    p <- plot_top_paths_fun(
      synthetic_graph, out,
      model.name = "MyModel", language = "R", top_n = 10
    )
    p

Computes the slope of a simple linear regression of a numeric vector
against its index (seq_along(x)). Non-finite (NA, NaN, Inf) values
are removed prior to computation. If fewer than two finite values remain,
the function returns 0.

    slope_fun(x)

| Argument | Description |
|---|---|
| x | Numeric vector. |
The slope is estimated from the model

$$ x_i = \theta_0 + \theta_1\, i + e_i, $$

where $i = 1, \dots, n$. The function returns the estimated slope
$\hat{\theta}_1$.
This summary is useful for characterizing monotonic trends in ordered risk values along a path.
A numeric scalar giving the slope of the fitted linear trend.
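A base-R sketch of this kind of slope estimate is given here, using lm() from the stats package. The helper `slope_sketch()` is hypothetical, and it assumes the index is rebuilt with seq_along() after non-finite values are dropped, which may differ from the package's behaviour.

```r
# Hypothetical re-implementation, for illustration only.
slope_sketch <- function(x) {
  x <- x[is.finite(x)]            # drop NA, NaN, Inf
  if (length(x) < 2) return(0)
  i <- seq_along(x)               # position index of the remaining values
  unname(coef(lm(x ~ i))[2])      # slope of the fitted line
}

s_up      <- slope_sketch(c(1, 2, 3, 4))
s_down    <- slope_sketch(c(4, 3, 2, 1))
s_cleaned <- slope_sketch(c(NA, 1, 2, Inf, 3))
```

A positive value indicates that risk tends to increase downstream along the path, and a negative value indicates the opposite.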

    slope_fun(c(1, 2, 3, 4))
    slope_fun(c(4, 3, 2, 1))
    slope_fun(c(NA, 1, 2, Inf, 3))

A synthetic directed graph with cyclomatic complexity values to illustrate the functions of the package.

    data(synthetic_graph)

A tbl_graph with:

- 55 nodes
- 122 directed edges

The graph is stored as a tbl_graph object with:

- Node attributes: name, cyclo
- Directed edges defined by from → to
theme_AP() provides a minimalist, publication-ready theme based on
ggplot2::theme_bw(), with grid lines removed, compact legends, and
harmonized text sizes. It is designed for dense network and path-visualization
plots (e.g. call graphs, risk paths).

    theme_AP()

The theme:

- removes major and minor grid lines,
- uses transparent legend backgrounds and keys,
- standardizes text sizes for axes, legends, strips, and titles,
- reduces legend spacing for compact layouts.
This theme is intended to be composable:
it should be added to a ggplot object using + theme_AP().
A ggplot2::theme object.

    ggplot2::ggplot(mtcars, ggplot2::aes(mpg, wt)) +
      ggplot2::geom_point() +
      theme_AP()

Runs a full variance-based uncertainty and sensitivity analysis (UA/SA) for node
risk scores using the results returned by all_paths_fun() and the functions
provided by the sensobol package (Puy et al. 2022).

    uncertainty_fun(all_paths_out, N, order, eps = 1e-12)

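| Argument | Description |
|---|---|
| all_paths_out | A list produced by all_paths_fun(). |
| N | Integer. Base sample size used for Sobol' matrices. |
| order | Passed to sensobol (for example "first"); sets the order of the sensitivity indices to compute. |
| eps | Numeric. Small positive constant used for numerical stability in the geometric-mean limit of the node risk score. |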
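Uncertainty is induced by jointly sampling the weights $(\alpha, \beta, \gamma)$
(renormalized to sum to 1) and the power parameter $p$
used in the node-risk definition:

$$ r_i = \left( \alpha\, \tilde{C}_i^{\,p} + \beta\, \tilde{I}_i^{\,p} + \gamma\, \tilde{B}_i^{\,p} \right)^{1/p}, $$

where $\tilde{C}_i$, $\tilde{I}_i$ and $\tilde{B}_i$ are the normalized cyclomatic complexity, in-degree and betweenness of node $i$.
For each node, risk scores are repeatedly recalculated using the sampled parameter
combinations, producing a distribution of possible outcomes. Sobol' first-order
and/or total-order sensitivity indices are then computed for all four parameters
($\alpha$, $\beta$, $\gamma$ and $p$), quantifying how much of
the variance in the node risk score is attributable to each parameter.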
Parameter labels and the Sobol' design.
Internally the design samples four independent values
(a_raw, b_raw, c_raw, p_raw) because the Sobol' quasi-random sequence
requires independent uniform inputs. Before evaluating the risk model, the raw
draws are transformed: the three weight draws are normalised to sum to one,
yielding $\alpha$, $\beta$, $\gamma$; and p_raw is mapped linearly
to the sampling interval of $p$. The sensitivity indices are then attributed to the
transformed parameters, and the output labels them as alpha, beta, gamma,
and p rather than the internal raw names, so the results are directly
interpretable in terms of the model parameters.
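The transformation of the raw draws can be sketched in base R. This sketch uses plain runif() draws instead of a Sobol' quasi-random design, and the bounds in `p_range` are placeholders chosen for illustration, not values taken from the package.

```r
set.seed(1)

# Four independent uniform inputs per draw (stand-in for the Sobol' design).
raw <- matrix(runif(5 * 4), ncol = 4,
              dimnames = list(NULL, c("a_raw", "b_raw", "c_raw", "p_raw")))

# Normalise the three weight draws so that each row sums to one.
w <- raw[, c("a_raw", "b_raw", "c_raw")]
w <- w / rowSums(w)

# Map p_raw linearly from [0, 1] onto a placeholder interval for p.
p_range <- c(0, 1)   # hypothetical bounds, for illustration only
p <- p_range[1] + raw[, "p_raw"] * diff(p_range)

params <- cbind(alpha = w[, 1], beta = w[, 2], gamma = w[, 3], p = p)
params
```

Because the weights are divided by their row sum, alpha, beta and gamma are no longer independent of one another even though the raw draws are.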
Path-level uncertainty is obtained by propagating node-level uncertainty draws through the path aggregation function:

$$ P_k = 1 - \prod_{i=1}^{n_k} \left( 1 - r_{k(i)} \right), $$

where $r_{k(i)}$ are node risks along path $k$.
All uncertainty metrics are computed from the first N Sobol draws (matrix A), while sensitivity indices use the full Sobol' design.
For more information about the uncertainty and sensitivity analysis and the output of this function, see the sensobol package (Puy et al. 2022).
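The propagation of node-level draws to a path can be sketched in base R. The matrix of draws is random toy data with one column per node, and the node names and variable names are hypothetical.

```r
set.seed(7)

# Toy node-level draws: one row per draw, one column per node.
node_draws <- matrix(runif(4 * 3), nrow = 4,
                     dimnames = list(NULL, c("f", "g", "h")))

# Nodes visited by one path, in call order.
path_nodes <- c("f", "h")
path_node_draws <- node_draws[, path_nodes, drop = FALSE]

# Apply the path aggregation draw by draw (row-wise).
path_draws <- 1 - apply(1 - path_node_draws, 1, prod)
path_draws
```

Each element of `path_draws` is the path risk under one sampled parameter combination, so the vector as a whole describes the uncertainty in that path's score.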
The returned node table includes the following columns:

- name: name of the node.
- uncertainty_analysis: numeric vector of length N giving the
  uncertainty draws in the node risk score (from Sobol' matrix A).
- sensitivity_analysis: object returned by sensobol::sobol_indices()
  for that node, containing Sobol' indices labelled alpha, beta, gamma,
  and p. These correspond to the normalised weights and the power-mean
  exponent respectively. The indices are computed on the transformed
  parameters (see Details).
The returned paths table includes:

- path_id: path identifier.
- path_str: sequence of function calls for each path.
- hops: number of edges.
- uncertainty_analysis: numeric vector giving the uncertainty draws in the path risk score.
- gini_index: numeric vector giving the uncertainty draws in the Gini index.
- risk_trend: numeric vector giving the uncertainty draws in the risk trend.
A named list with:

- nodes: a tibble of node results.
- paths: a tibble of path results.
Puy, A., Lo Piano, S., Saltelli, A., and Levin, S. A. (2022). sensobol: An R Package to Compute Variance-Based Sensitivity Indices. Journal of Statistical Software, 102(5), 1–37. doi:10.18637/jss.v102.i05

    data(synthetic_graph)
    out <- all_paths_fun(
      graph = synthetic_graph,
      alpha = 0.6, beta = 0.3, gamma = 0.1,
      complexity_col = "cyclo"
    )

    # Power-mean risk (increase N to at least 2^10 for a proper UA/SA)
    results <- uncertainty_fun(all_paths_out = out, N = 2^2, order = "first")

    results$nodes
    results$paths
