More on simple Bash Shell scripts (examples of “find” and “sed”)

When you conduct a large ensemble of computer simulations with several scenarios, you are going to deal with many data, including inputs and outputs.  You also need to create several directories and subdirectories where you can put or generate the inputs and outputs for your model.  For example, you may want to run a cropping system model across a large region, for 3500 grid cells, and you need to feed your model with the input files for each grid cell. Each grid cell has its own weather, soil, crop and management input files. Or you may want to run your model 100,000 times and each time use one set of crop parameters as an input, to conduct a sensitivity analysis. Another common practice of large simulations is looking for any hidden error that happens during the simulations. For example, your running jobs might look normal, without any obvious crash, but you may still get some kind of “error” or “warning” in your log files. So, you need to find those runs, correct them, delete the wrong files and rerun them to have a full set of outputs. These tasks are basic but could be challenging and very time-consuming if you do not know how to complete them efficiently. Linux environment provides facilities that make our lives easier as Dave said in his blog post, and Bernardo also provided some examples for this type of task. Here are a few more instances of simple but useful commands with “find” and “sed.”

find

Sometimes, you want to know how many files with a specific pattern exist in all the subdirectories in a folder. You can type below command at the upper-level folder. “f” means files, and in front of the “name,” we specify the pattern—for example, files that start with “218”. Or we can look for all the files that have the specific extension [i.e. *.csv] or have a specific strings in their name [i.e. *yield*].

find . -type f -name "218*"

Then we can transfer the listed lines of results [-l] to a count function [wc] with pipe [|]:

find . -type f -name "218*" |  wc -l

You may want to find and delete all files with the specific pattern [i.e. 218_wheat.csv] in all directories in which they might exist. So, after we find the files, we can execute [exec] the remove command [rm]:

find . -type f -name "218_wheat*" -exec rm {} \;

If these files are located in different directories and we don’t want to delete them all, we can also filter the find results by specifying the pattern of path [i.e. output] and then delete them:

find . -type f -path "*/output/*" -name "218_wheat *" -exec rm {} \;

Sometimes, we need to find specific paths instead of files. For example, I have a text file, and I want to copy that into the “CO2” folder, which is located in the “Inputs” folders of several scenario folders:

find . -type d -path "*/Inputs/*" -name "CO2" -exec cp ../CO2_concentration_8.5.txt {} \;

 “d” means directories, so we are searching for directories that contain “Inputs” folder and end with “CO2” folder. Then, we execute the copy command [cp], which copies the text file into the found directories.

If we are looking for a specific string inside some files, we can combine “find” and “grep.” For example, here I am looking for any error file [*.err] that starts with “218cs” if it contains this specific warning: “unable to find”

find . -type f -name “218cs*.err” –exec grep -i “unable to find” {} \;

Or we can look for files that do not contain “success.”

find . -type f -name 218cs*.err" -exec grep -L "success" {} \;

sed

“sed” is a powerful text editor. Most of the time it is used to replace specific string in a text file:

sed -i 's/1295/1360/' 218cs.txt

Here, we insert [i] and substitute [s] a new string [1360] to replace it with the original string [1295]. There might be few duplication of “1295” in a file, and we may just want to replace one of them—for example, one located at line 5:

sed -i '5s/1295/1360/' 218cs.txt

There might be more modifications that have to be done, so we can add them in one line using “-e”:

sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/' 218cs.txt

find + sed

If you want to find specific text files (i.e., all the 218cs.txt files, inside RCP8.5 folders) and edit some lines in them by replacing them with new strings, this line will do it:

find . -type f -path "*/RCP8.5/*" -name "218*" -exec sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/'  {} \;

Sometimes, you want to replace an entire line in a text file with a long string, like a path, or keep some space in the new line. For example, I want to replace a line in a text file with the following line, which has the combination of space and a path:

“FORCING1         /home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_”

For this modification, I am going to assign a name to this line and then replace it with the whole string that is located at line 119 in text files [global_param_default.txt], which are located in specific directories [with this pattern “RCP4.5/451812118”].

path_new="FORCING1	/home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_"
find . -type f -path "*RCP4.5/451812118/*" -name "global_param_*" -exec sed -i "119s|.*|$path_new|" global_param_default.txt {} +

Sometimes, you want to add a new line at the specific position (i.e., line 275) to some text files (i.e., global_param_default.txt).

find . -type f -name " global_param_*" -exec sed -i "275i\OUTVAR    OUT_CROP_EVAP  %.4f OUT_TYPE_FLOAT  1" {} \; 

Now, all of the “global_param_default” files have a new line with this content: “OUTVAR    OUT_CROP_EVAP  %.4f OUT_TYPE_FLOAT  1”.

It is also possible that you want to use a specific section of a path and use it as a name of a variable or a file. For example, I am searching for directories that contain an “output” folder. This path would be one of the them: ./453911731_CCF/output/ Now, I want to extract “453911731” and use it as a new name for a file [output__46.34375_-119.90625] that is already inside that path:

for P in $(find . -type d -name "output"); do (new_name="$(echo "${P}"| sed -r 's/..(.{9}).*/\1/')" && cd "${P}" && mv output__46.34375_-119.90625 $ new_name); done

With this command line, we repeat the whole process for each directory (with “output” pattern) by using “for,” “do,” and “done.” At the beginning of the command, the first search result, which is a string of path, is assigned to the variable “P” by adding $ and () around “find” section .Then, the result of “sed –r” is going to be assigned to another variable [new_name]; “-r” in the sed command enables extended regular expressions.

With the “sed” command, we are extracting 9 characters after “./” and removing everything after 9 characters. Each “.” matches any single character. Parentheses are used to create a matching group. Number 9 means 9 occurrences of the character standing before (in this case “.” any character), and “\1” refers to the first matched substring

“&&” is used to chain commands together. “cd” allows you to change into a specified path, which is stored in $P, and “mv” renames the file in this path from “output__46.34375_-119.90625” to “453911731,” which is stored in $new_name.

Performing Experiments on HPC Systems

A lot of the work we do in the Reed lab involves running computational experiments on High Performance Computing (HPC) systems. These experiments often consist of performing multi-objective optimization to aid decision making in complex systems, a task that requires hundreds of thousands of simulation runs and may not possible without the use of thousands of computing core. This post will outline some best practices for performing experiments on HPC systems.

1. Have a Plan

By nature, experiments run on HPC systems will consume a large amount of computational resources and generate large amounts of data. In order to stay organized, its important to have a plan for both how the computational resources will be used and how data will be managed.

Estimating your computational resources

Estimating the scale of your experiment is the first step to running on an HPC system. To make a reasonable estimate, you’ll need to gather the following pieces of information:

  • How long (in wall clock time) does a single model run take on your local machine?
  • How many function evaluations (for an MOEA run) or ensemble model runs will you need to perform?
  • How long do you have in wall clock time to run the experiment?

Using this information you can estimate the number of parallel processes that you will need to successfully run the experiment. Applications such as running the Borg MOEA are known as, “embarrassingly parallel” and scale quite well with an increase in processors, especially for problems with long function evaluation times (see Hadka and Reed, 2013 for more info). However, many other applications scale poorly, so it’s important to be aware of the parallel potential of your code. A helpful tip is to identify any inherently serial sections of the code which create bottlenecks to parallelization. Parallelizing tasks such as Monte Carlo runs and MOEA function evaluations will often result in higher efficiency than paralellizing the simulation model itself. For more resources on how to parallelize your code, see Bernardo’s post from last year.

Once you have an idea of the scale of your experiment, you’ll need to estimate the experiment’s computational expense. Each HPC resource has its own charging policy to track resource consumption. For example, XSEDE tracks charges in “service units” which are defined differently for each cluster. On the Stampede2 Cluster, a service unit is defined as one node-hour of computing time, so if you run on 100 nodes for 10 hours, you spend 1,000 service units regardless of how many core per node you utilize. On the Comet Cluster, a service unit is charged by the core-hour, so if you run 100 nodes for 10 hours and each utilizes 24 core, you’ll be charged 24,000 service units. Usually, the allocations you receive to each resource will be scaled accordingly, so even though Comet looks more expensive, you likely have a much large allocation to work with. I usually make an estimate of service units I need for an experiment and add another 20% as a factor of safety.

Data management:

Large experiments often create proportionately large amounts of data. Before you start, its important to think about where this data will be stored and how it will be transferred to and from the remote system. Many clusters have limits to how much you can store on different drives, and breaking these limits can cause performance issues for the system. System administrators often don’t take kindly to these performance issues and in extreme cases, breaking the rules may result in suspension or removal from a cluster. It helps to create an informal data management plan for yourself that specifies:

  1. How will you transfer large amounts of data to and from the cluster (tools such as Globus are helpful here).
  2. Where will you upload your code and how your files will be structured
  3. Where will you store data during your experimental runs. Often clusters have “scratch drives” with large or unlimited storage allocations. These drives may be cleaned periodically so they are not suitable for long term storage.
  4. Where will you store data during post processing. This may still be on the cluster if your post processing is computationally intensive or your local machine can’t handle the data size.
  5. Where will you store your experimental results and model data for publication and replication.

2. Test on your local machine

To make the most of your time on a cluster, its essential that you do everything you can to ensure your code is properly functioning and efficient before you launch your experiment. The biggest favor you can do for yourself is to properly test your code on a local machine before porting to the HPC cluster. Before porting to a cluster, I always run the following 4 checks:

  1. Unit testing: the worst case scenario after a HPC run is to find out there was a major error in your code that invalidates your results. To mitigate this risk as much as possible, it’s important to have careful quality control. One helpful tool for quality control is unit testing to examine every function or module in your code and ensure it is working as expected. For an introduction to unit testing, see Bernardo’s post on Python and C++.
  2. Memory checking: in low level code (think C, C++) memory leaks can be silent problem that throws off your results or crash your model runs. Sometimes, memory leaks can go undetected during small runs but add up and crash your system when run in large scale experiments. To test for memory leaks, make sure to use tools such as Valgrind before uploading your code to any cluster. This post features a nice introduction to Valgrind with C++.
  3. Timing and profiling: Profile your code to see which parts take longest and eliminate unnecessary sections. If you can, optimize your code to avoid computationally intensive pieces of code such as nested loops. There are numerous tools for profiling low level codes such as Callgrind and gprof. See this post for some tips on making your C/C++ code faster.
  4. Small MOEA tests: Running small test runs (>1000 NFE) will give you an opportunity to ensure that the model is properly interacting with the MOEA (i.e. objectives are connected properly, decision variables are being changed etc.). Make sure you are collecting runtime information so you can evaluate algorithmic performance

3. Stay organized on the cluster

After you’ve fully tested your code, it’s time to upload and port to the cluster. After transferring your code and data, you’ll likely need to use the command line to navigate files and run your code. A familiarity with Linux command line tools, bash scripting and command line editors such as vim can make your life much easier at this point. I’ve found some basic Linux training modules online that are very useful, in particular “Learning Linux Command Line” from Linked-in learning (formerly Lynda.com) was very useful.

Once you’ve got your code properly organized and compiled, validate your timing estimate by running a small ensemble of simulation runs. Compare the timing on the cluster to your estimates from local tests and reconfigure your model runs if needed. If performing an experiment with an MOEA, run a small number of NFE on a development or debug node confirm that the algorithm is running and properly parallelizing. Then run a single seed of the MOEA and perform runtime diagnostics to ensure things are working more or less as you expect. Finally, you’re ready to run the full experiment.

I’d love to hear your thoughts and suggestions

These tips have been derived from my experiences running on HPC systems, if anyone else has tips or best practices that you find useful, I’d love to hear about them in the comments.

So What’s the Rationale Behind the Water Programming Blog?

By Patrick M. Reed

In general, I’ve made an effort to keep this blog focused on the technical topics that have helped my students tackle various issues big and small. It helps with collaborative learning and maintaining our exploration of new ideas.

This post, however, represents a bit of departure from our normal posts in response to some requests for a suggested reading guide for my Fall 2019 AGU Paul A. Witherspoon Lecture entitled “Conflict, Coordination, and Control in Water Resources Systems Confronting Change” (on Youtube should you have interest). 

The intent is to take a step back and zoom out a bit to get the bigger picture behind what we’re doing as research group and much of the original motivation in initiating the Water Programming Blog itself. So below, I’ll provide a summary of papers that related to the topics covered in the lecture sequenced by the talk’s focal points at various points in time. Hope this provides some interesting reading for folks. My intent here is to keep this informal. The highlighted reading resources were helpful to me and are not meant to be a formal review of any form.

So let’s first highlight Paul Witherspoon himself, a truly exceptional scientist and leader (7 minute marker, slides 2-3).

  1. His biographical profile
  2. A summary of his legacy
  3. The LBL Memo Creating the Earth Sciences Division (an example of institutional change)

Next stop, do we understand coordination, control, and conflicting objectives in our institutionally complex river basins (10 minute marker, slides 6-8)? Some examples and a complex systems perspective.

  1. The NY Times Bomb Cyclone Example
  2. Interactive ProPublica and The Texas Tribune Interactive Boomtown, Flood Town (note this was written before Hurricane Harvey hit Houston)
  3. A Perspective on Interactions, Multiple Stressors, and Complex Systems (NCA4)

Does your scientific workflow define the scope of your hypotheses? Or do your hypotheses define how you need to advance your workflow (13 minute marker, slide 9)? How should we collaborate and network in our science?

  1. Dewey, J. (1958), Experience and Nature, Courier Corporation.
  2. Dewey, J. (1929), The quest for certainty: A study of the relation of knowledge and action.
  3. Hand, E. (2010), ‘Big science’ spurs collaborative trend: complicated projects mean that science is becoming ever more globalized–and Europe is leading the way, Nature, 463(7279), 282-283.
  4. Merali, Z. (2010), Error: Why Scientific Programming Does Not Compute, Nature, 467(October 14), 775-777.
  5. Cummings, J., and S. Kiesler (2007), Coordination costs and project outcomes in multi-university collaborations, Research Policy, 36, 1620-1634.
  6. National Research Council (2014), Convergence: facilitating transdisciplinary integration of life sciences, physical sciences, engineering, and beyond, National Academies Press.
  7. Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, and P. E. Bourne (2016), The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data, 3.
  8. Cash, D. W., W. C. Clark, F. Alcock, N. M. Dickson, N. Eckley, D. H. Guston, J. Jäger, and R. B. Mitchell (2003), Knowledge systems for sustainable development, Proceedings of the National Academy of Sciences, 100(14), 8086-8091.

Perspectives and background on Artificial Intelligence (15 minute marker, slides 10-16)

  1. Simon, H. A. (2019), The sciences of the artificial, MIT press.
  2. AI Knowledge Map: How To Classify AI Technologies
  3. Goldberg, D. E. (1989), Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Publishing Company, Reading, MA
  4. Coello Coello, C., G. B. Lamont, and D. A. Van Veldhuizen (2007), Evolutionary Algorithms for Solving Multi-Objective Problems, 2 ed., Springer, New York, NY.
  5. Hadka, D. M. (2013), Robust, Adaptable Many-Objective Optimization: The Foundations, Parallelization and Application of the Borg MOEA.

The Wicked Problems Debate (~22 minute marker, slides 17-19) and the emergence of post normal science and decision making under deep uncertainty.

  1. Rittel, H., and M. Webber (1973), Dilemmas in a General Theory of Planning, Policy Sciences, 4, 155-169.
  2. Buchanan, R. (1992), Wicked problems in design thinking, Design Issues, 8(2), 5-21.
  3. Kwakkel, J. H., W. E. Walker, and M. Haasnoot (2016), Coping with the Wickedness of Public Policy Problems: Approaches for Decision Making under Deep Uncertainty, Journal of Water Resources Planning and Management, 01816001.
  4. Ravetz, J. R., and S. Funtowicz (1993), Science for the post-normal age, Futures, 25(7), 735-755.
  5. Turnpenny, J., M. Jones, and I. Lorenzoni (2011), Where now for post-normal science?: a critical review of its development, definitions, and uses, Science, Technology, & Human Values, 36(3), 287-306.
  6. Marchau, V. A., W. E. Walker, P. J. Bloemen, and S. W. Popper (2019), Decision making under deep uncertainty, Springer.
  7. Mitchell, M. (2009), Complexity: A guided tour, Oxford University Press.

Lastly, the Vietnam and North Carolina application examples.

  1. Quinn, J. D., P. M. Reed, M. Giuliani, and A. Castelletti (2019), What Is Controlling Our Control Rules? Opening the Black Box of Multireservoir Operating Policies Using Time-Varying Sensitivity Analysis, Water Resources Research, 55(7), 5962-5984.
  2. Quinn, J. D., P. M. Reed, M. Giuliani, A. Castelletti, J. W. Oyler, and R. E. Nicholas (2018), Exploring How Changing Monsoonal Dynamics and Human Pressures Challenge Multireservoir Management for Flood Protection, Hydropower Production, and Agricultural Water Supply, Water Resources Research, 54(7), 4638-4662.
  3. Trindade, B. C., P. M. Reed, and G. W. Characklis (2019), Deeply uncertain pathways: Integrated multi-city regional water supply infrastructure investment and portfolio management, Advances in Water Resources, 134, 103442.
  4. Gold, D., Reed, P. M., Trindade, B., and Characklis, G., “Identifying Actionable Compromises: Navigating Multi-City Robustness Conflicts to Discover Cooperative Safe Operating Spaces for Regional Water Supply Portfolios.“, Water Resources Research,  v55, no. 11,  DOI:10.1029/2019WR025462, 9024-9050, 2019.