Data Visualization Resources

Martha recently shared a great list of visualization resources with me, for me to augment and share with the group. I highly recommend spending some time with these books, which can be a valuable source of ideas.

  • John Tukey (Bell Labs, Princeton), primarily a mathematician [FFT algorithms and box plots]: his book Exploratory Data Analysis (1970) started this off with good design concepts using pencil & paper visualizations, with a focus on making visualization informative and easy to do.
  • Edward Tufte (www.edwardtufte.com): 4 worthwhile books for concepts and print design, all from Graphics Press: The Visual Display of Quantitative Information, Envisioning Information, Visual Explanations, and Beautiful Evidence. Do not spend the money to attend his short course unless you really want to see items from his collection of very old books.
  • William S. Cleveland (Bell Labs, Purdue Statistics): Visualizing Data (1993) and The Elements of Graphing Data (1994). These books are classics, but possibly now dated?
  • Colin Ware (ccom.unh.edu/vislab) The Data Visualization Research Lab. Also authored a very interesting textbook that considers visualization in the context of human perception: Information Visualization (2004, Elsevier)
  • Manuel Lima, Visual Complexity: mapping patterns of information, Princeton U Press, 2011 (where Martha found the ‘radial convergence diagram’)
  • Nathan Yau (flowingdata.com) 2 books: Visualize This (2011) and Data Points (2013). Infographics, but with strong statistics background. Books also include useful sections on parsing data in different formats and software recommendations; many suggestions compatible with ‘big data’ applications. (Parenthetical from Matt: “Infographics” are not popular with visualization folks. They are the sort of thing that turns up on the front page of USA Today, and to which Tufte especially objects. They can, however, be done well, as Yau’s site exhibits. The webcomic xkcd.com also occasionally features some really neat infographics.)
  • Andrew Gelman (andrewgelman.com) blog by Bayesian statistician, but see Statistical Graphics category of blog posts
  • Kaiser Fung (junkcharts.typepad.com) regular blog posts on ‘bad’ graphics with suggestions for improvements; also see the links under Common Interests
  • Stephen Few (www.perceptualedge.com/blog) Visual Business Intelligence; see the Examples with proposed fixes (recommended by Thorsten Wagener)
  • Robert Kosara (eagereyes.org) Visualization and Visual Communication (author works for Tableau Software, but the website is independent) More real life examples with commentary; see Lists of Influences for even more resources
  • Toby Segaran and Jeff Hammerbacher, eds., Beautiful Data, O’Reilly, 2009 compilation of articles from several authors dealing with specific problems, but be aware of the proprietary software referenced
  • Katy Börner, Atlas of Science, MIT Press, 2010 the ultimate ‘coffee table book’ for the interface between science visualization and art

Announcing version 1.0 of pareto.py

Jon and I are pleased to announce version 1.0 of pareto.py, the free, open-source ε-nondominated sorting utility. Find the ε-nondominated solutions in your data today! ε-nondominated sorting lets you specify the resolution of your objectives to obtain a set of meaningfully distinct efficient points in your data set. (Confused? See this post for more context.)

ε-nondominated sorting: Red ε-boxes are dominated, yellow solutions are ε-nondominated.

  • available on GitHub
  • licensed under LGPL
  • pure Python implementation with minimal dependencies (only standard library, does not require numpy)
  • ε-nondominated sorting
  • “importable”: use pareto.py from other Python scripts
  • “pipelineable”: include pareto.py in a Unix pipeline
  • tag-along variables: columns (even non-numeric ones) other than those to be sorted may be retained in the output
  • verbatim transcription of input: nondominated rows in the output appear exactly the same as they did in the input
  • flexible column specification: e.g. 0-5 6 10-7
  • mix minimization and maximization objectives
  • skip header rows, commented rows, and blank lines
  • annotate each row in the output with the file it came from, including stdin
  • add line numbers to annotations
  • index columns from the end of the row, allowing rows of varying lengths to be sorted together
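
To make the idea concrete, here is a minimal sketch of ε-nondominated sorting. It is not the pareto.py source: the function names (eps_sort, dominates) are made up for illustration, every objective is assumed to be minimized, and rows are assumed to be plain tuples of floats.

```python
import math


def dominates(a, b):
    """True if box a is no worse than box b everywhere and differs somewhere."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def eps_sort(rows, epsilons):
    """Return the epsilon-nondominated rows (all objectives minimized).

    Illustrative sketch only: each row is assigned to an epsilon-box, boxes
    are compared by dominance, and one row is kept per surviving box.
    """
    archive = {}  # box index tuple -> (squared distance to box corner, row)
    for row in rows:
        box = tuple(math.floor(v / e) for v, e in zip(row, epsilons))
        # Discard the row if an archived box already dominates its box.
        if any(dominates(other, box) for other in archive):
            continue
        # Squared distance from the row to the lower corner of its box.
        dist = sum((v - b * e) ** 2 for v, b, e in zip(row, box, epsilons))
        if box in archive:
            # Same box: keep whichever row sits closer to the corner.
            if dist < archive[box][0]:
                archive[box] = (dist, row)
            continue
        # New box: evict any archived boxes that it dominates.
        for other in [o for o in archive if dominates(box, o)]:
            del archive[other]
        archive[box] = (dist, row)
    return [row for _, row in archive.values()]


rows = [(0.12, 0.88), (0.18, 0.85), (0.55, 0.45), (0.93, 0.97), (0.95, 0.05)]
survivors = eps_sort(rows, (0.1, 0.1))
```

With ε = 0.1 on both objectives, the first two rows share a box, so only the one nearer the box corner survives, and (0.93, 0.97) is dropped because its box is dominated.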

(source code for the impatient)

(previous announcement: version 0.2)

Running tmux on the cluster

tmux is a terminal multiplexer — a program that lets you do more than one thing at a time in a terminal window. For example, tmux lets you switch between running an editor to modify a script and running the script itself without exiting the editor. As an added bonus, should you lose your ssh connection, your programs are still running inside tmux and you can bring them up again when you re-connect. If you’ve ever used GNU Screen, it’s the same idea.

Here’s what it looks like in the MinTTY terminal that comes with Cygwin.   In this example I’ve split the window across the middle, with an editor open in the bottom and a command prompt at the top.  (You can do vertical splits too.)

tmux terminal multiplexer showing a split window

Building tmux

I recently built tmux again. It’s pretty easy to do, and it takes up less than 7 megabytes of disk space. Here’s how:

Make a directory for your own installs. Mine is called ~/local

mkdir ~/local

If libevent 2.0 isn’t already on your system (it wasn’t on mine), you’ll have to build it first.

Make a directory for building things, if you haven’t already got one.

mkdir build
cd build

Download libevent. Be sure to replace 2.0.21 with the latest stable version.

wget https://github.com/downloads/libevent/libevent/libevent-2.0.21-stable.tar.gz

Extract libevent

tar xzvf libevent-2.0.21-stable.tar.gz
cd libevent-2.0.21-stable

Build libevent. I used an absolute path with the prefix to be on the safe side.

./configure --prefix=/path/to/my/home/directory/local
make
make install

Download and extract tmux. Get the latest version, which is 1.8 at the time of this writing.

wget http://downloads.sourceforge.net/tmux/tmux-1.8.tar.gz
tar xzvf tmux-1.8.tar.gz
cd tmux-1.8

If you rolled your own libevent, you’ll need to set appropriate CFLAGS and LDFLAGS when you run configure. Otherwise you can skip export CFLAGS and export LDFLAGS.

export CFLAGS=-I/path/to/my/home/directory/local/include
export LDFLAGS="-Wl,-rpath=/path/to/my/home/directory/local/lib -L/path/to/my/home/directory/local/lib"
./configure --prefix=/path/to/my/home/directory/local
make
make install

Then add tmux to your path, and you’re done. Do this by putting the following lines in your .bashrc:

export MYPACKAGES=/path/to/my/home/directory/local
export PATH=${MYPACKAGES}/bin:${PATH}

Announcing pareto.py: a free, open-source nondominated sorting utility

Jon Herman and I are pleased to announce version 0.2 of pareto.py, a free, open-source nondominated sorting utility. Released under the LGPL, pareto.py performs a nondominated sort on the data in any number of input files. To the best of our knowledge, this is the only standalone nondominated sorting utility available.

Implemented using only the Python standard library, pareto.py features:

  • epsilon-nondominated sorting
  • “importable” design: can be used from other Python scripts
  • tag-along variables: columns (even non-numeric ones) other than those to be sorted may be retained in the output
  • verbatim transcription of input: nondominated rows in the output appear exactly the same as they did in the input
  • optionally skip header rows, commented rows, and blank lines

We welcome feature requests, bug reports, and pull requests. If you’re in a hurry, here is the source code for version 0.2.

(Edit: bump to version 0.2)

Comparing Data Sets: Are Two Data Files the Same?

Jon and I were looking at nondominated sorting, and the question came up, “how do you validate a sorting routine?”  You’ve got to compare the resulting data, but doing so is not straightforward.  You can’t just diff the text files that come out, because there’s no guarantee everything comes out in the same order.  Sorting and then diff-ing the files still doesn’t help, because output formatting may differ between sorting routines.  So I wrote a Python script that evaluates whether two data tables are the same, to within a specified tolerance:

https://github.com/matthewjwoodruff/datacomparison
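
The core idea can be sketched in a few lines. This is not the code in the repository above: parse_table and tables_match are hypothetical names, the input is assumed to be whitespace-delimited numeric text, and the greedy row matching is quadratic and could in principle mis-pair rows that lie within tolerance of more than one candidate.

```python
def parse_table(text, delimiter=None):
    """Parse numeric rows from text, skipping blank and '#' comment lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rows.append(tuple(float(x) for x in line.split(delimiter)))
    return rows


def tables_match(a, b, tol=1e-6):
    """True if every row of a pairs with a distinct row of b within tol.

    Row order and number formatting are ignored; only the values matter.
    """
    if len(a) != len(b):
        return False
    unmatched = list(b)
    for row in a:
        for i, cand in enumerate(unmatched):
            same_width = len(row) == len(cand)
            if same_width and all(abs(x - y) <= tol for x, y in zip(row, cand)):
                del unmatched[i]  # each row of b may be used only once
                break
        else:
            return False  # no remaining row of b is close enough
    return True


first = parse_table("1.0 2.0\n3.50 4.0\n")
second = parse_table("# different order and formatting\n3.5e0 4.0000001\n1 2\n")
same = tables_match(first, second, tol=1e-6)
```

Here the two tables differ in row order and number formatting, yet compare as equal at a tolerance of 1e-6.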

Matlab and Matplotlib Plotting Examples

A while back, Jon published a set of Matlab plotting examples on GitHub.  I’m not sure if this has been publicized yet — I couldn’t find it with a quick search of the blog, so it couldn’t hurt to publicize it again!

I’ve started adding my own Python/Matplotlib equivalents to Jon’s Matlab examples, in the hopes of turning this into a comprehensive library of basic plotting examples.  I’m not done yet, but I’ll add more as time permits.  I take requests, so if there’s anything you’d like to see, please leave a note in the comments.

Here’s a quick list of what’s there now:

  • line plots (Matlab and Python)
  • line plots with shading between curves (Python)
  • stacked area plots (Matlab)
  • animated gifs like the one from Jon’s recent post (Matlab and Python)
  • histograms and cdf plots (Matlab)
  • parallel coordinate plots (Matlab)
  • gridded data contour and surface plots (Matlab and Python)
  • scatter plots (Matlab and Python)
  • non-gridded data contour plots (Matlab and Python)

Writing a Paper in Markdown Using Pandoc

I’ve struggled up to now with the tension between drafting papers in Word (easy for co-authors to use for marking up revisions) and using LaTeX to prepare them for publication (because Word fights you and actively thwarts your efforts the whole time if you try to make a paper look half-decent). When I start in Word and switch to LaTeX, there’s an awkward phase in the middle where I have to fix all of my quotation marks and em-dashes, and all of my equations, tables, and citations are completely broken.

Recently I discovered Pandoc, and I think it will streamline the transition quite a lot.  Pandoc is a document converter that converts several input formats to many output formats.  Here’s the list from running pandoc --help:


Input formats:  native, json, markdown, markdown_strict, markdown_phpextra, markdown_github, markdown_mmd, rst,  mediawiki, docbook, textile, html, latex

Output formats: native, json, docx, odt, epub, epub3, fb2, html, html5, s5, slidy, slideous, dzslides, docbook, opendocument, latex, beamer, context, texinfo, man, markdown, markdown_strict, markdown_phpextra, markdown_github, markdown_mmd, plain, rst, mediawiki, textile, rtf, org, asciidoc

That’s a lot of document formats!  In particular, it supports a “native” dialect of Markdown that it does a great job of translating both to LaTeX and to docx (Microsoft Word). Other nifty things you can do include:

  • Converting LaTeX to Word, including your BibTeX citations
  • Making PDFs from html (if you have LaTeX installed)
  • Writing Beamer presentations in Markdown (and exporting the LaTeX sources for the slides)
  • Using BibTeX citations in Markdown

I’m using Pandoc Markdown to draft my next paper, and while it’s not as full-featured as LaTeX for things like internal references, I find that it’s easier to write Word documents in Markdown than it is to write them in Microsoft Word!  To give just one example, it lets you caption figures properly.  Try making a figure in Word, adding a caption, and then moving or deleting the figure. The caption stays put.  Why on earth would I want that to happen? If I delete a figure in Pandoc Markdown, it takes extra effort to leave the caption behind.  In addition, when I switch my output format from Word to LaTeX source, Pandoc makes a figure environment with a \caption{} automatically.

My planned workflow is:

  1. Draft in Pandoc Markdown
  2. Convert Markdown to docx and share with co-authors
  3. Update Markdown sources based on revisions to the Word document
  4. Repeat 1-3 until the paper is mostly done
  5. Convert Markdown to LaTeX
  6. Final revisions and formatting in LaTeX
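
Steps 2 and 5 each come down to a single pandoc call. Here is a small sketch of how they could be scripted; the file names (paper.md, paper.docx, paper.tex) and the helper names are placeholders of mine, and it assumes pandoc is on your PATH and that you are happy for it to infer the formats from the file extensions.

```python
import subprocess


def pandoc_command(source, target):
    """Build the pandoc argument list for converting source to target.

    --standalone asks for a complete document (with a preamble, in the
    LaTeX case) rather than a fragment.
    """
    return ["pandoc", source, "--standalone", "-o", target]


def convert(source, target):
    """Run pandoc and raise CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(source, target), check=True)


# Step 2: Markdown to Word, for sharing with co-authors.
to_word = pandoc_command("paper.md", "paper.docx")
# Step 5: Markdown to LaTeX, for final formatting.
to_latex = pandoc_command("paper.md", "paper.tex")
```

Calling convert("paper.md", "paper.docx") would actually run the conversion; the two command lists are only built here, so nothing happens until you ask for it.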

I’ll follow up on this post as I progress with drafting the paper. Right now I’m enjoying Pandoc Markdown quite a lot, and I highly recommend it.

All of the Analysis Code for my Latest Study is on GitHub

I’ve published to GitHub all of the code I wrote for the paper I’m currently working on.  This includes:

  • Python PBS submission script
  • Python scripts to automate reference set generation using MOEAFramework
  • Python scripts to automate hypervolume calculation using MOEAFramework and the WFG hypervolume engine
  • Python / Pandas scripts for statistical summaries of the hypervolume data
  • Python scripts to automate Sobol’ sensitivity analysis using MOEAFramework and tabulate the results.  (If I were starting today, I’d have an SALib version too.)
  • Python / Pandas / Matplotlib figure generation scripts:
    • Control maps for hypervolume attainment
    • Radial convergence plots (“spider plots”) for Sobol’ global sensitivity analysis results
    • Bar charts for Sobol’ global sensitivity analysis results
    • CDF plots (dot / shaded bar, plus actual CDF plots) for hypervolume attainment
    • Parallel coordinate plots
    • Input file generation for AeroVis glyph plotting
    • Joint PDF plots for hypervolume attainment across multiple problems

Not all of the figures I mentioned will turn up in the paper, but I provide them as examples in case they prove helpful.

Connecting to an iPython HTML Notebook on the Cluster Using an SSH Tunnel

Magic

I didn’t have the time or inclination to try to set up the iPython HTML notebook on the conference room computer for yesterday’s demo, but I really wanted to use the HTML notebook. What to do?

Magic.

Magic in this case means running the iPython HTML notebook on the cluster, forwarding the HTTP port that the HTML notebook uses, and displaying the session in a web browser running locally. In the rest of this post, I’ll explain each of the moving parts.

iPython HTML Notebook on the Cluster

The ipython that comes for free on the cluster doesn’t support the HTML notebook because the python/2.7 module doesn’t have tornado or pyzmq. On the plus side, you do have easy_install, so setting up these dependencies isn’t too hard.

  1. Make a directory for your personal Python packages:
    mkdir /gpfs/home/asdf1234/local
  2. In your .bashrc, add
    export PYTHONPATH=$HOME/local/lib/python2.7/site-packages
  3. Install tornado and pyzmq into that directory:
    python -m easy_install --prefix /gpfs/home/asdf1234/local tornado
    python -m easy_install --prefix /gpfs/home/asdf1234/local pyzmq
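
If the imports fail afterwards, the usual culprit is a mismatch between the install prefix and PYTHONPATH. The following sketch shows how the two relate; site_packages_for and on_python_path are hypothetical helper names, and the lib/pythonX.Y/site-packages layout is the common POSIX one (some systems use lib64 instead).

```python
import posixpath
import sys


def site_packages_for(prefix, version=None):
    """Directory where a --prefix install usually puts packages (POSIX layout)."""
    major, minor = (version or sys.version_info)[:2]
    return posixpath.join(
        prefix, "lib", "python{}.{}".format(major, minor), "site-packages"
    )


def on_python_path(directory):
    """True if directory is one of the places this interpreter searches."""
    wanted = posixpath.normpath(directory)
    return wanted in [posixpath.normpath(p) for p in sys.path if p]


target = site_packages_for("/gpfs/home/asdf1234/local", (2, 7))
```

Running on_python_path(target) with the cluster’s Python should return True once the .bashrc change has taken effect in your session.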

If you have a local X server, you can check to see if this works:

ssh -Y asdf1234@cluster 
ipython notebook --pylab=inline

Firefox should pop up with the HTML notebook. It’s perfectly usable like this, but I also didn’t want to set up an X server on the conference room computer. This leads us to…

Forwarding Port 8888

By default, the HTML notebook serves HTTP on port 8888. If you’re sitting in front of the computer, you get to port 8888 by using the loopback address 127.0.0.1:8888. 127.0.0.1 is only available locally, but with SSH port forwarding we can reach the cluster’s 127.0.0.1:8888 from another machine.

Here’s how you do that with a command-line ssh client:

ssh -L8888:127.0.0.1:8888 asdf1234@cluster
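
The -L argument has three parts: the local port, then the address and port to connect to as seen from the remote machine. Here is a short sketch that spells this out and checks whether the local end of the tunnel is accepting connections; tunnel_command and port_is_open are illustrative names of mine, and the user and host are the same placeholders used above.

```python
import socket


def tunnel_command(user, host, local_port=8888, remote_port=8888,
                   remote_bind="127.0.0.1"):
    """Build an ssh command that forwards local_port to the remote side.

    The forward spec reads local_port:remote_bind:remote_port, where
    remote_bind is resolved on the remote machine.
    """
    forward = "{}:{}:{}".format(local_port, remote_bind, remote_port)
    return ["ssh", "-L", forward, "{}@{}".format(user, host)]


def port_is_open(port, host="127.0.0.1", timeout=1.0):
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


command = tunnel_command("asdf1234", "cluster")
```

Once the ssh session is up, port_is_open(8888) on your local machine should return True. The tunnel listens even before the notebook starts, so this only confirms the local end.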

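Here’s how you do it with PuTTY. Under Connection > SSH > Tunnels, enter 8888 as the source port and 127.0.0.1:8888 as the destination:

(screenshot: PuTTY tunnel settings)

Click “Add.”  You should see the forwarded port in the list:

(screenshot: PuTTY tunnel settings with the forwarded port added)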

Now open your connection and login to the remote machine. Once there, cd to the directory where your data is and type

ipython notebook --pylab=inline

If you’re not using X forwarding, this will open up the elinks text browser, which is woefully incapable of handling the HTML notebook. Fortunately that doesn’t sink the demo. You’ll see something like this:

(screenshot: the notebook opened in elinks)

This means that the iPython HTML notebook is up and running. If you actually want to use it, however, you need a better browser. Fortunately, we opened up SSH with a tunnel…

Open the Notebook in Your Browser

This was the one part of the demo that wasn’t under my control. You need a modern web browser, and I just had to hope that someone was keeping the conference room computer more or less up to date. My fallback plan was to use a text-mode ipython over ssh, but the notebook is much more fun! Fortunately for me, the computer had Firefox 14.

In your URL bar, type in

http://127.0.0.1:8888

If everything works, you’ll see this:

(screenshot: the notebook dashboard)

And you’re off to the races!

What Just Happened?

I said earlier that 127.0.0.1 is a special IP address that’s only reachable locally, i.e. on the machine you’re sitting in front of. Port 8888 on 127.0.0.1 is where ipython serves its HTML notebook, so you’d think the idea of using the HTML notebook over the network isn’t going to fly.

When you log in through ssh, however, it’s as if you are actually sitting in front of the computer you’re connected to. Every program you run, runs on that computer. Port forwarding takes this a step further: connections to port 8888 on your local machine are carried over the SSH session and delivered to the remote computer’s port 8888, as if they had been made on the remote computer itself.

Installing IPython

Intro

If you were at yesterday’s meeting, you’ve seen my enthusiasm for the ipython HTML notebook. This is a short post on how to get it set up.

Windows

iPython

Download the zip file from the iPython release archive. Pick the biggest release version number. Unzip it. Do

python setup.py install

Watch it complain about how you don’t have setuptools. Install setuptools as below and try again.

Setuptools

If you don’t have Python setuptools yet, go get it from PyPI, the Python Package Index.

Tornado

Now that you have setuptools:

python -m easy_install tornado

PyZMQ

python -m easy_install pyzmq

Linux (but not the cluster)

On Fedora, sudo yum install ipython. Probably sudo apt-get install ipython on Ubuntu, but let me know in the comments if I’m wrong about that and I’ll update the post. You might also need to install tornado and pyzmq, but I don’t recall having to do this. (I did my Fedora install a while ago.)

Starting the iPython HTML Notebook

ipython notebook --pylab=inline

A web browser should pop up. Click on “new notebook.”