I often write custom C++ programs to manipulate large data files. There’s a time and place for this, since you get full control over every aspect of how your data looks going in and coming out. We’ve written about this before, and I think string processing is an important skill in any language. There’s a post about MATLAB (and another one here), some sample Bash scripting, and a post about Python, among other things. You should also see Matt’s series on Python data analysis, since I’m doing some shameless plugging!
Anyway… little did I know that something quite complicated in C++ can be done easily in Linux/Unix with “split”!
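To split a large file into smaller files of, say, 100 lines each, you use: “split -l 100 myLargerFile.txt” By default the pieces are named xaa, xab, and so on. There are also options to change the filenames of the output files (for example, a prefix passed as a second argument), and so forth.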
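Here is a small sketch of the split command in action, in case you want to try it somewhere safe first. The 250-line input file, the temporary folder, and the “chunk_” prefix are all made up for illustration, and I’m assuming seq and mktemp are available (they are on typical Linux systems).

```shell
# Work in a throwaway folder so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a hypothetical 250-line input file (a stand-in for myLargerFile.txt).
seq 1 250 > myLargerFile.txt

# Split into pieces of at most 100 lines each; "chunk_" is an arbitrary prefix.
split -l 100 myLargerFile.txt chunk_

# List the pieces with their line counts (wc also prints a total line).
wc -l chunk_*

# Concatenating the pieces in name order should reproduce the original file.
cat chunk_* | cmp - myLargerFile.txt && echo "pieces match the original"
```

With 250 lines and a limit of 100, you should end up with three pieces (chunk_aa, chunk_ab, chunk_ac), the last one holding the leftover 50 lines.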
grep allows you to find an expression in one or more files in a folder on Linux. I find it useful for programming. Say, for example, I want to look for the string “nrec” in a set of source code and header files. Maybe “nrec” is a variable and I forgot where I declared it (if this sounds a little too specific to be merely an example, you’re right. This is what I’m having to do right this second!). The grep command is:
grep -in "nrec" *.*
What this means is, search for the “nrec” expression in every file in the current folder whose name contains a dot (use * instead of *.* to match every file). There are two useful flags set here as well. “i” means that the search is case insensitive (that is, NREC and NrEc and nrec are all treated as equal). “n” means that the program will show me the line number of each occurrence of my desired phrase. There are other options that I’m not using, including “inverting” a search to find all lines that do NOT contain the phrase (-v), suppressing the file name (-h) or only showing the file name (-l), etc.
If you were curious, here’s a sample of the output:
iras.h:144: int num_flow_datapoints; //originally: NRec
(To explain: the first instance is in a header file, on line 144. I’m translating this code from one language to another, and originally the variable was called “nrec”. So in the header file I made a note that my variable is now called something else. In the second instance, I had copied the original code into my file as a placeholder, so now I know that I need to use my new name in its place. Also, the “i” flag in grep is helpful since Fortran is not case-sensitive, and here you can see there were two different case styles for this variable even in our simple example.)
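If you’d like to experiment with the flags described above without touching real code, here is a rough sketch. The two files (demo.h and demo.c) and their contents are invented for this example and are not from my actual project.

```shell
# Work in a throwaway folder with two made-up source files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'int num_points; /* originally: NRec */\nint other;\n' > demo.h
printf 'x = nrec + 1;\ny = 2;\n' > demo.c

# -i: ignore case, -n: show the line number of each match.
grep -in "nrec" *.*

# -l: print only the names of the files that contain a match.
grep -il "nrec" *.*

# -h: print the matching lines without the file name in front.
grep -ih "nrec" *.*

# -v: invert the search, printing the lines that do NOT match.
grep -iv "nrec" demo.h
```

The first command reports a hit in both files even though one spells it “NRec” and the other “nrec”, which is exactly why the “i” flag is handy here.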
For more info, please consult a casual reference such as this excellent post about Linux command line utilities, a similar blog post about grep, and of course the Linux man page for the command. Also look at 15 grep tips. As usual, remember that “man [insert command here]” gives you the low-down on any command you’d like to learn.
Thanks for reading and please comment with additional tips or questions!