Last updated: 2021-06-29

Checks: 7 0

Knit directory: globalIRmap/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200414) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 5eed8f5. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    R/.Rhistory
    Ignored:    analysis/.Rhistory
    Ignored:    globalIRmap_rep/
    Ignored:    renv/library/
    Ignored:    renv/staging/

Untracked files:
    Untracked:  .drake/
    Untracked:  .gitignore
    Untracked:  figtabres.docx
    Untracked:  schema.ini

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/methods_workflow.Rmd) and HTML (docs/methods_workflow.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 5eed8f5 messamat 2021-06-29 wflow_publish(c(“analysis/about.Rmd”, “analysis/index.Rmd”, “analysis/methods_gettingstarted.Rmd”,
Rmd 8abae5d messamat 2021-06-29 Update doc
html 8abae5d messamat 2021-06-29 Update doc
html 6e12f71 messamat 2021-06-14 Build site.
Rmd 5e433c3 messamat 2021-06-14 Publish new pages
html 9b48bde messamat 2021-01-06 Build site.
Rmd f1d9dcf messamat 2021-01-06 Start building up workflowr website, start incorporating mandrake (but wait as very unstable still), plan gauge selection documentation
html f1d9dcf messamat 2021-01-06 Start building up workflowr website, start incorporating mandrake (but wait as very unstable still), plan gauge selection documentation

Overall workflow

This study leverages the respective strengths of R (for data wrangling, statistics, and figure-making) and Python (for spatial analysis and mapping). As a result, re-producing it requires going back and forth between these two languages and platforms. At the broadest level, the main steps of this analysis were the following:
1. Pre-processing and formatting global river network environmental attributes in Python.
2. Pre-processing and formatting spatial datasets aside from hydro-environmental attributes in Python.
3. R analysis in R.
4. Final formatting of analysis outputs for mapping in Python.
5. Map results and comparisons in ArcMap.

Below, we briefly explain how each of these steps was implemented, but additional data not currently available publicly are needed to fully reproduce the analysis. Please contact and/or bernhard.lehner(at)mcgill.ca for additional information should you want to re-produce the results from this study. In addition please note that processing these data takes weeks of continuous computing on a normal workstation.

1. Pre-processing and formatting global river network environmental attributes

Main purpose: download and compute additional hydro-environmental attributes for the global river network than those already included in RiverATLAS.

Github repository structure for globalIRmap_HydroATLAS_py

Set-up

utility_functions.py:
- import key modules.
- defines utility functions used throughout the analysis.
- defines the basic folder structure of the analysis.

runUplandWeighting.py:
- define functions for routing data on river network

Download data

Downloading data requires the creation of a file called “configs.json” with login information for earthdata and alos. For guidance on formatting the json configuration file, see here.

Pre-format data

  • format_HydroSHEDS.py: create
    • a coastal band raster (~ 10 pixels inland at ~450 m resolution)
    • HydroSHEDS regions of contiguous land surfaces in raster and polygon format
  • format_MODISmosaic.py: extract and mosaic MODIS ocean mask.
  • format_GLAD.py: format surface water dynamics dataset, removing ocean pixels and aggregating data from 30 m resolution to 15 sec (~450 m) resolution (i.e., computing statistics of e.g. percentage area of seasonal surface water).
  • format_WorldClim2.py: resample WorldClim2 rasters (30 sec native resolution) to HydroSHEDS resolution (15 sec) and fill gaps.
  • format_GAIandCMIv2.py:
    • compute Climate Moisture Index (based on WorldClimv2 precipitation and GAIv2 potential evapotranspiration data)
    • resample GAI and CMI rasters to HydroSHEDS resolution (15 sec)
  • format_SoilGrids250m.py: mosaic tiles, compute aggregate texture values for (0-100 cm), reproject and aggregate rasters (250 m) to HydroSHEDS resolution (15 sec).
  • format_worldpop.py: aggregate (from 3 sec to 15 sec resolution)and mosaick country population rasters, associate each population pixel to a river reach (with long-term mean annual flow > 0.1 m3/s), and compute population that is closest to each reach.

Associate hydro-environmental attributes to RiverATLAS river reaches

  • runUplandWeighting_batch.py: route hydro-environmental characteristics along the global river network to yield rasters of the average value of a given hydro-environmental characteristic (e.g., global aridity index) across the entire upstream area of each pixel. Compute rasters for worldclim, GAI, CMI, soilgrids textures from 0 to 100 cm, and surface water dynamics.
  • runHydroATLASStatistics.py: create statistics tables of hydro-environmental attributes for every river reach in RiverATLAS. This code requires a fair amount of manual adjustment of local paths and must direct to a local master spreadsheet with the parameters of all statistics to compute. Please contact for more information and for an example of such a table.

Workflow summary

Execute:
1. scripts for downloading data in any order
2. format_MODISmosaic.py
3. format_HydroSHEDS.py
4. format_WorldClim2.py
5. other formatting scripts in any order
6. runUplandWeighting_batch.py
7. runHydroATLASStatistics.py

2. Pre-processing and formatting spatial datasets aside from hydro-environmental attributes

Main purpose: compile and pre-process global river network; download and spatially pre-process streamflow gauging stations (reference data for model training and testing), national hydrographic datasets, and on-the-ground visual observations of flow intermittence.

Github repository structure for globalIRmap_py

Set-up

utility_functions.py:
- imports key modules.
- defines utility functions used throughout the analysis.
- defines the basic folder structure of the analysis.

setup_localIRformatting.py: - defines folder structure for formatting data to compare modeled estimates of global flow intermittence to national hydrographic datasets (Comparison_databases) and to in-situ/field-based observations of flow intermittence (Insitu_databases). - defines functions used in formatting data for the comparisons

Download data

  • download_GSIM.py: download Global Streamflow Indices and Metadata (GSIM) archive from pangaea repositories.
  • download_format_IRdata.py: Download and format national hydrographic datasets and download on-the-ground observation of river intermittence.
    • U.S.A.: download National Hydrography Plus (NHDPlus) medium and high resolution, add attributes (drainage area, mean annual flow), export attribute table for analysis in R, divide datasets into subsets by drainage area and discharge size classes, subselect HydroATLAS basins that overlap the NHDPlus.
    • France (data were given by Ton Snelder): divide dataset into subsets by drainage area and stream order size classes, subselect HydroATLAS basins that overlap France.
    • Brazil: download national hydrographic dataset, identify first order streams through network analysis.
    • Australia: download Australian Geofabric, divide dataset into subsets by drainage area size classes, subselect HydroATLAS basins that overlap Australia.
    • Observatoire National Des Etiages (ONDE, France): download ONDE dataset for 2012-2019, download French “Carthage” hydrographic network for formatting.
    • Pacific Northwest PROSPER: download PROSPER dataset of flow state observations, download continuous parameter grids (CPG) of topography data in the Pacific Northwest for formatting.

Format data

  • format_RiverATLAS.py: format RiverATLAS river network.
    • Intersect RiverATLAS reaches with lakes
    • Spatially associate RiverATLAS reaches with HydroBASINS level 05
    • Export attribute table of RiverATLAS including those included in River ATLAS 1.0 and new hydro-environmental attributes computed for this study.
  • format_stations.py: subselect, format, and spatially join gauging stations with RiverATLAS river network.
    • Join GRDC stations to nearest river reach in RiverATLAS.
    • Manually check and correct the location of all those GRDC stations that are more than 50 meters or whose reported drainage area differs byh more than 10% from associated river reach.
    • Subset GSIM stations according to the criteria outlined in this website’s tab and the article’s supplementary information.
    • Snap GSIM stations to nearest RiverATLAS river reache within 200 m
    • Manually and correct the location of every GSIM station.
    • Flag all GRDC and GSIM stations within 3 km from a coastline
  • format_FROndeEaudata.py: format on-the-ground visual observations of flow intermittence from the Observatoire National Des Etiages (ONDE) across mainland France (for more explanations of the processing approach, see Section VIB in the Supplementary Information of the article).
    • Spatially join every point (site) in the ONDE network to the nearest river reach in the Carthage river network
    • Manually check and correct the location of all sites (based on site name, ID attribute, initial location)
    • Automatically join river reaches in the Carthage network with ONDE sites to RiverATLAS river network (see detailed process in Supplementary Information of the article)
    • Manually check and correct the location of each ONDE site/association between ONDE sites and RiverATLAS network.
    • Extract site attributes: how far down the RiverATLAS reach the site is as a percentage of the reach length, drainage area and discharge at the ONDE site and at the pourpoint (downstream end) of the corresponding RiverATLAS reach.
  • format_PNWdata.py:
    • Subset points to keep the same as those used by Jaeger et al. (2019) to which we added valid observations before 2004.
    • Spatially join observation points to NHDplus high resolution
    • Extract drainage area for each observation point
    • Spatially join points to closest RiverATLAS river reaches
    • Remove those that are over 500 m away from a RiverATLAS reach, with a drainage area < 10 km2, or that considerably differ in drainage area with nearest river reach.
    • Mannualy check and correct the location of most sites (see criteria in Supplementary Information).
    • Extract site attributes: how far down the RiverATLAS reach the site is as a percentage of the reach length, drainage area and discharge at the site and at the pourpoint (downstream end) of the corresponding RiverATLAS reach.

Workflow summary

Execute:
1. scripts for downloading data in any order
2. format_RiverATLAS.py
3. format_stations.py
4. format_FROndeEaudata.py
5. format_PNWdata.py

3. R analysis

Main purpose: QA/QC streamflow gauging station records; develop and validate random forest models, compare predictions to hydrographic datasets and on-the-ground observations, generate tables, make non-spatial figures and generate tabular predictions for global river network and spatial data for other mapping.

Github repository structure for globalIRmap

The structure of the Github repository stems from the fact that this project is formatted as an R package, relies on drake for organizing the analysis workflow, renv for dependency management, and includes all documents used for this workflowr website. All files and directories whose role is mainly structural (and therefore whose content can be ignored for the analysis) are marked with a X; directories are in bold, files are in italic.

  • R/ — core of the analysis (based on the recommended project structure for drake).
  • analysis/ X — Rmarkdown files used in building this website.
  • archived/ X — legacy scripts not used in final workflow.
  • assets/ X — files used for website (e.g., map in home tab).
  • docs/ X — HTML rendering of Rmarkdown files from analysis/ folder, used in building this website.
  • log/ X — log file for drake.
  • man/ X — Rmarkdown rendering of function documentations for package (automatically built with roxygen2).
  • renv/ X — files necessary for local dependency management by renv (including the project package library when restored).
  • shinyapp/globalIRmap_gaugesel/ X — Shiny app for this website.
  • .Rbuildignore X — files and directories to exclude from package building.
  • .Rprofile X — used to activate renv and workflowr for new R sessions launched in the project.
  • .gitattributes X — Github internal file.
  • DESCRIPTION X - Package description.
  • LICENSE - Package license.
  • NAMESPACE X - Specifies which functions are available outside of your package.
  • README.md X — README for Github.
  • _drake.R — configuration script for drake workflow, this specific file name is required by the drake package.
  • _workflowr.yml X — workflowr-specific configuration file (for website building).
  • drake_cache.csv X — CSV cache log file for drake targets.
  • figtabres.Rmd — Rmarkdown report of study results; create plots for manuscript.
  • globalIRmap.Rproj X — R project file.
  • interactive.R — script for user to manually run (!most importantly! containg r_make() statement to launch drake workflow/build targets in separate, clean R session (for reproduceability).
  • renv.lock X — renv lockfile, describing the state of the project’s library (installed packages and their version).

Set-up

After downloading the project from Github (see Getting started), launch the R project (globalIRmap.Rproj) and execute the following lines:

renv::restore() # respond y, restores all R packages with their specific version
remotes::install_github('messamat/globalIRmap') #install project package so that the help documentation can be accessed for project functions (e.g., ?format_gaugestats)

To check the steps of the analysis, see R/IRmapping_plan.R, which contains the drake “plan”, the high-level catalog of all the steps in the workflow (see the corresponding chapter in the drake user manual). This plan defines the order of functions to use, their inputs and outputs (usually, targets), and the relationship among targets and steps. The R object for a plan is a data.frame with columns named target and command; each row represents a step in the workflow; each command.Each command is a concise expression that makes use of our functions, and each target is the return value of the command (the target column has the names of the targets, not the values). In this analysis, functions from planutil_functions.R were used to create complex branches in the plan (re-running the entire model and result formatting pipeline but when defining non-perennial watercourses as those that cease to flow at least 30 days per year, rather than 1 day per year).

To get information on a function, simply type ?function in the R console. For instance, ?comp_GRDCqstats.

Running the analysis

Provided that your were given the necessary data, the entire analysis can simply be re-run with the following code found in interactive.R:

library(drake)
r_make() # recreates the analysis

r_make() is the central, most important function of the drake approach. It runs all the steps of the workflow in the correct order, skipping any work that is already up to date. Because of how drake tracks global functions and objects as dependencies of targets, the use of r_make() is needed to run the analysis pipeline in a clean reproducible environment. If all targets are up to date in the .drake/ directory, then nothing will be run.

Inspecting results

If you were provided intermediate targets (i.e., a .drake/ directory; or once you have re-run the analysis), you can load individual targets in the environment with the following commands (even if the targets are not up to date due to e.g. a change in source path).

loadd(globaltables_gad_id_cmj)
print(globaltables_gad_id_cmj)

tab <- readd(globaltables_gad_id_cmj)
print(tab)

To reproduce the tables and (non-map) figures in the manuscript, run the following line in the R console: rmarkdown::render('figtabres.Rmd', encoding = 'UTF-8'). Note that nearly all figures and tables were manually adjusted for aesthetic purpose prior to inclusion in manuscript, so that the rendition of figtabres.Rmd will not exactly match the final format in form (but it will match in content).

R analysis output files for subsequent analysis

The R analysis produces a suite of outputs necessary for subsequent mapping and analysis, notably:
- RiverATLAS_predbasic800_20210216.csv: comma-separated value table of model predictions for the RiverATLAS global river network.
- BasinATLAS_v10_lev03_errors.gpkg: polygons of BasinATLAS level 3 subdivisions with attributes. Used to produce Figure 3 in the manuscript main text and Figure S6 in the Supplementary Information.
- GRDCstations_predbasic800.gpkg and GRDCstations_predbasic800_mdur30.gpkg: points of gauging stations used in model development and validation with reference attributes and model predictions. Used to produce Extended Data Fig. 2 and Supplementary Information Fig. S3.
- pnwobs_IPR_predbasic800cat%Y%m%d%H%M%S.shp and ondeobs_IPR_predbasic800cat%Y%m%d%H%M%S.shp: on-the-ground observations of flow intermittence from ONDE (France) and PROSPER (U.S. Pacific Northwest) with reference attributes and model predictions. Used to produce Extended Data Fig. 6.

4. Final formatting of analysis outputs for mapping

Run predict_RF.py in the globalIRmap_py Github repository:
- Format global river network attributes for mapping.
- Divide and export RiverATLAS global river network (with predictions) into subsets by discharge and drainage area size classes for mapping.
- Join ONDE (France) and PROSPER (U.S. Pacific Northwest) on-the-ground observations with river network for mapping (Extended Data Fig. 6).

5. Map results and comparisons

Please contact Mathis Messager for ArcGIS map packages to reproduce specific maps from the manuscript.


sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     rstudioapi_0.13  whisker_0.4      knitr_1.29      
 [5] magrittr_1.5     R6_2.4.1         rlang_0.4.10     fansi_0.4.1     
 [9] stringr_1.4.0    tools_4.0.2      xfun_0.24        utf8_1.1.4      
[13] git2r_0.27.1     htmltools_0.5.0  ellipsis_0.3.2   rprojroot_1.3-2 
[17] yaml_2.2.1       digest_0.6.25    tibble_3.1.1     lifecycle_0.2.0 
[21] crayon_1.3.4     later_1.2.0      vctrs_0.3.8      promises_1.2.0.1
[25] fs_1.5.0         glue_1.4.0       evaluate_0.14    rmarkdown_2.7   
[29] stringi_1.4.6    compiler_4.0.2   pillar_1.6.1     backports_1.1.10
[33] httpuv_1.5.4     renv_0.9.3       pkgconfig_2.0.3