When you conduct a large ensemble of computer simulations with several scenarios, you are going to deal with many data, including inputs and outputs. You also need to create several directories and subdirectories where you can put or generate the inputs and outputs for your model. For example, you may want to run a cropping system model across a large region, for 3500 grid cells, and you need to feed your model with the input files for each grid cell. Each grid cell has its own weather, soil, crop and management input files. Or you may want to run your model 100,000 times and each time use one set of crop parameters as an input, to conduct a sensitivity analysis. Another common practice of large simulations is looking for any hidden error that happens during the simulations. For example, your running jobs might look normal, without any obvious crash, but you may still get some kind of “error” or “warning” in your log files. So, you need to find those runs, correct them, delete the wrong files and rerun them to have a full set of outputs. These tasks are basic but could be challenging and very time-consuming if you do not know how to complete them efficiently. Linux environment provides facilities that make our lives easier as Dave said in his blog post, and Bernardo also provided some examples for this type of task. Here are a few more instances of simple but useful commands with “find” and “sed.”
find
Sometimes, you want to know how many files with a specific pattern exist in all the subdirectories in a folder. You can type below command at the upper-level folder. “f” means files, and in front of the “name,” we specify the pattern—for example, files that start with “218”. Or we can look for all the files that have the specific extension [i.e. *.csv] or have a specific strings in their name [i.e. *yield*].
find . -type f -name "218*"
Then we can transfer the listed lines of results [-l] to a count function [wc] with pipe [|]:
find . -type f -name "218*" | wc -l
You may want to find and delete all files with the specific pattern [i.e. 218_wheat.csv] in all directories in which they might exist. So, after we find the files, we can execute [exec] the remove command [rm]:
find . -type f -name "218_wheat*" -exec rm {} \;
If these files are located in different directories and we don’t want to delete them all, we can also filter the find results by specifying the pattern of path [i.e. output] and then delete them:
find . -type f -path "*/output/*" -name "218_wheat *" -exec rm {} \;
Sometimes, we need to find specific paths instead of files. For example, I have a text file, and I want to copy that into the “CO2” folder, which is located in the “Inputs” folders of several scenario folders:
find . -type d -path "*/Inputs/*" -name "CO2" -exec cp ../CO2_concentration_8.5.txt {} \;
“d” means directories, so we are searching for directories that contain “Inputs” folder and end with “CO2” folder. Then, we execute the copy command [cp], which copies the text file into the found directories.
If we are looking for a specific string inside some files, we can combine “find” and “grep.” For example, here I am looking for any error file [*.err] that starts with “218cs” if it contains this specific warning: “unable to find”
find . -type f -name “218cs*.err” –exec grep -i “unable to find” {} \;
Or we can look for files that do not contain “success.”
find . -type f -name 218cs*.err" -exec grep -L "success" {} \;
sed
“sed” is a powerful text editor. Most of the time it is used to replace specific string in a text file:
sed -i 's/1295/1360/' 218cs.txt
Here, we insert [i] and substitute [s] a new string [1360] to replace it with the original string [1295]. There might be few duplication of “1295” in a file, and we may just want to replace one of them—for example, one located at line 5:
sed -i '5s/1295/1360/' 218cs.txt
There might be more modifications that have to be done, so we can add them in one line using “-e”:
sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/' 218cs.txt
find + sed
If you want to find specific text files (i.e., all the 218cs.txt files, inside RCP8.5 folders) and edit some lines in them by replacing them with new strings, this line will do it:
find . -type f -path "*/RCP8.5/*" -name "218*" -exec sed -i -e '5s/1295/1360/' -e '32s/1196/1200/' -e '10s/default/1420/' {} \;
Sometimes, you want to replace an entire line in a text file with a long string, like a path, or keep some space in the new line. For example, I want to replace a line in a text file with the following line, which has the combination of space and a path:
“FORCING1 /home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_”
For this modification, I am going to assign a name to this line and then replace it with the whole string that is located at line 119 in text files [global_param_default.txt], which are located in specific directories [with this pattern “RCP4.5/451812118”].
path_new="FORCING1 /home/fs02/pmr82_0001/tk662/calibration/451812118/forcings/data_"
find . -type f -path "*RCP4.5/451812118/*" -name "global_param_*" -exec sed -i "119s|.*|$path_new|" global_param_default.txt {} +
Sometimes, you want to add a new line at the specific position (i.e., line 275) to some text files (i.e., global_param_default.txt).
find . -type f -name " global_param_*" -exec sed -i "275i\OUTVAR OUT_CROP_EVAP %.4f OUT_TYPE_FLOAT 1" {} \;
Now, all of the “global_param_default” files have a new line with this content: “OUTVAR OUT_CROP_EVAP %.4f OUT_TYPE_FLOAT 1”.
It is also possible that you want to use a specific section of a path and use it as a name of a variable or a file. For example, I am searching for directories that contain an “output” folder. This path would be one of the them: ./453911731_CCF/output/ Now, I want to extract “453911731” and use it as a new name for a file [output__46.34375_-119.90625] that is already inside that path:
for P in $(find . -type d -name "output"); do (new_name="$(echo "${P}"| sed -r 's/..(.{9}).*/\1/')" && cd "${P}" && mv output__46.34375_-119.90625 $ new_name); done
With this command line, we repeat the whole process for each directory (with “output” pattern) by using “for,” “do,” and “done.” At the beginning of the command, the first search result, which is a string of path, is assigned to the variable “P” by adding $ and () around “find” section .Then, the result of “sed –r” is going to be assigned to another variable [new_name]; “-r” in the sed command enables extended regular expressions.
With the “sed” command, we are extracting 9 characters after “./” and removing everything after 9 characters. Each “.” matches any single character. Parentheses are used to create a matching group. Number 9 means 9 occurrences of the character standing before (in this case “.” any character), and “\1” refers to the first matched substring
“&&” is used to chain commands together. “cd” allows you to change into a specified path, which is stored in $P, and “mv” renames the file in this path from “output__46.34375_-119.90625” to “453911731,” which is stored in $new_name.