Customizing color matrices in matplotlib

In this post I pass on some tricks for customizing matplotlib color matrices. In the past I have been guilty of beautifying some of my color matrices with Adobe Illustrator, rearranging labels, titles, colormaps, etc. This time, however, I had to generate far too many of them, and I could see the beautifying process becoming extremely painful. I will demonstrate how to produce the following three plots in one go with relatively few lines of code, in the hope of providing useful elements for your own plot customization.

plot1a.png

plo2.png

plt3.png

Plots 1 to 3 were generated with the following script, which I explain in detail later in this post:

import glob
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

#Listing your files
files = glob.glob('./attainment_matrices/*.out')

#Organizing your files
data_plot1=[np.genfromtxt(f) for f in files[8:12]]
data_plot2=[np.genfromtxt(f) for f in files[0:4]]
data_plot3=[np.genfromtxt(f) for f in files[16:20]]
data=[data_plot1,data_plot2,data_plot3]

#Organizing titles and labels
plot_titles=['Plot 1','Plot 2', 'Plot 3']
subplot_titles= ['Subplot 1','Subplot 2', 'Subplot 3','Subplot 4']
labels= ['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']
y_labels= [r'Y Title a$\longrightarrow$',r'Y Title b $\longrightarrow$',r'Y Title c $\longrightarrow$'] # raw strings keep the LaTeX backslashes intact
cmap_labels=[r'Colormap label a$\longrightarrow$', r'Colormap label b$\longrightarrow$', r'Colormap label c$\longrightarrow$']

# Some variables to adjust subplots if necessary
left = 0.125 # the left side of the subplots of the figure
right = 0.9 # the right side of the subplots of the figure
bottom = 0.3 # the bottom of the subplots of the figure
top = 0.82 # the top of the subplots of the figure
wspace = 0.2 # the amount of width reserved for blank space between subplots
hspace = 0.5 # the amount of height reserved for white space between subplots

#Font sizes
plot_fontsize=40
subplot_fontsize=32
tick_label_fontsize=22 # Ticks, colormap, x and y labels use this fontsize

#x-label adjustments
rotation= 45 # rotation of labels
adjust=0 #if you want the x labels to be displayed right at the middle then adjust=0.5

x=np.arange(0,5.5)
y=np.linspace(0,100,1001)

#colormaps
colormap=['Set3_r', 'YlGnBu','Paired']

# the j is the iteration variable for each subplot, and the l is the iteration variable
# for each plot.
for l in range(len(plot_titles)):
    fig, ax=plt.subplots(1,len(subplot_titles),sharey=True)
    plt.subplots_adjust(left=left, bottom=bottom, right=right, top=top, wspace=wspace, hspace=hspace)
    #setting the titles wrapped by a transparent grey box at position=(x,y)
    fig.suptitle(plot_titles[l], fontsize=plot_fontsize,
                 bbox={'facecolor':'grey', 'alpha':0.1, 'pad':12}, position=(0.1827, .95))

    for j in range(len(subplot_titles)):
        a= ax[j].pcolor(x,y,data[l][j], cmap=colormap[l])
        ax[j].set_title(subplot_titles[j], fontsize= subplot_fontsize, y=1.03)
        #Set the y-label only in the first subplot
        ax[0].set_ylabel(y_labels[l], fontsize=tick_label_fontsize)
        ax[j].set_xticks(x[:-1] + adjust, minor=False) # one tick per column, to match the 5 labels
        #ax[j].set_xlim(left=0, right=5)
        #ax[j].set_ylim(0,100)
        ax[j].set_xticklabels(labels[:], rotation=rotation)
        ax[j].tick_params(labelsize=tick_label_fontsize)

    #colorbar settings:
    leftc= 0.12504
    bottomc=.13
    width_c=.775
    height_c=0.04
    cbar_ax= fig.add_axes([leftc,bottomc,width_c,height_c])
    #cbar= fig.colorbar(a, cax=cbar_ax, orientation='horizontal')
    cbar = fig.colorbar(a,cax=cbar_ax, ticks=[0, 0.5, 1], orientation='horizontal')
    cbar.ax.set_xticklabels(['Low', 'Medium', 'High'])
    cbar.set_label(cmap_labels[l], fontsize=tick_label_fontsize, labelpad=25)
    cbar.ax.tick_params(labelsize=tick_label_fontsize)

# Show all three figures once the loop has finished
plt.show()

First, in lines 1 through 4 I import the required libraries. I use glob.glob in line 8 to list the files for the analysis with their full paths. The order in which glob returns the files depends on the file system, so if you want to check it before relying on the index slices used later, you can simply print the list as follows:

print(files)

And you should be able to see the order of the files like so:

['./data_directory/file1.out', './data_directory/file10.out', … './data_directory/file24.out']

I used the numpy genfromtxt function in lines 11-13 to load the data from the specified files, organizing it by the plot (1, 2 or 3) in which it would be used. I then collected these three lists into a single list in line 14 so I could loop over it later on.
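If you do not have the .out files at hand, the following is a minimal sketch for building stand-in data with the same nesting. It assumes, based on the x and y arrays defined in the script and on the colorbar ticks, that each matrix has 1000 rows and 5 columns with values between 0 and 1. The make_fake_matrix helper and the random values are hypothetical and only serve to make the script runnable.

```python
import numpy as np

# Cell edges used by pcolor in the script: 6 x-edges and 1001 y-edges
x = np.arange(0, 5.5)
y = np.linspace(0, 100, 1001)

def make_fake_matrix(rng, n_rows=1000, n_cols=5):
    """Hypothetical stand-in for np.genfromtxt(f): random values in [0, 1)."""
    return rng.random((n_rows, n_cols))

rng = np.random.default_rng(0)
# Same nesting as `data` in the script: 3 plots, each with 4 subplots
data = [[make_fake_matrix(rng) for _ in range(4)] for _ in range(3)]

# pcolor(x, y, C) expects C to have one row/column fewer than the edge arrays
print(data[0][0].shape, len(y) - 1, len(x) - 1)
```

Replacing the three genfromtxt lines with this data list should be enough to try the rest of the script on your own machine.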

I organized the titles of the main plots and subplots, the x and y labels, and the colormap labels in lines 17-21. All the parameters required to adjust the margins and spacing of the subplots are listed in lines 24 to 29. If you simply want x and y to be drawn at the same scale, so that square cells look square, you can pass aspect='equal' to plt.subplot() or, when using plt.subplots() as in this script, subplot_kw={'aspect': 'equal'}.

The font sizes for the plot titles, subplot titles, ticks and labels are specified in lines 32 to 34. The x-labels can be adjusted in multiple ways. In line 37 I set the rotation of the x-labels to 45 degrees. If you want the labels to be completely vertical, use rotation=90; if you want horizontal labels, you don’t need to specify a rotation parameter. I then used the adjust variable to set the position of each x-label: adjust=0 places the tick at the left corner of its bar, while adjust=0.5 centers it.
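The effect of the rotation and adjust variables can be tried in isolation with a sketch like the one below. The data are made up, the y grid is coarser than in the script, and the Agg backend is selected only so that the snippet runs without a display. Note that there is one tick per column, so the last cell edge is dropped before adding the offset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

labels = ['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']
x = np.arange(0, 5.5)            # 6 cell edges for 5 columns
y = np.linspace(0, 100, 11)      # coarse y edges, for illustration only
values = np.random.default_rng(0).random((len(y) - 1, len(x) - 1))  # made-up data

adjust = 0.5     # 0 puts the tick at the left edge, 0.5 at the centre of the column
rotation = 45    # use 90 for vertical labels

fig, ax = plt.subplots()
ax.pcolor(x, y, values, cmap='YlGnBu')
tick_positions = x[:-1] + adjust          # one tick per column
ax.set_xticks(tick_positions)
ax.set_xticklabels(labels, rotation=rotation)
fig.canvas.draw()                         # render the figure in memory
```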

In line 44, I list the colormaps to be used by each plot. The outer loop in line 48 iterates through the 3 plots, while the inner loop in line 55 iterates through the 4 subplots generated in each plot. In line 49 we specify the number of rows and columns of subplots that will be generated. I want them to share the y axis, hence sharey=True. If you want your subplots to also share the x axis, you would simply add sharex=True in line 49. The plt.subplots_adjust function in line 50 lets you specify the exact placement of your subplots, including the white space between them and their location in the figure canvas, using the variables defined in lines 24 to 29. In line 52, I set the title of the plot as a whole; since I have three different plots, I loop through the different titles. The title is shown in a transparent grey box at the upper left corner of the canvas, at the location given by position=(x, y).

Lines 56 to 64 contain the subplots’ code. I use the pcolor function to generate the color matrices, but there are other ways to create them, such as pcolormesh, imshow, contour, etc. In line 57 I loop through the subplot titles and assign their font size. Here, y=1.03 specifies the distance from the subplot title to the plot: the larger this value, the greater the distance. In line 59 I set the y-label. Since I only want the y-label to be shown in the leftmost subplot, I fix ax[0].set_ylabel(…); if you want each subplot to have its own y-label, you can loop through them with the subplot iteration variable j, as in ax[j].set_ylabel(…). Lines 61 to 62 (commented out) show how you could set the x and y axis limits. In line 63 I set the x tick labels; you could similarly set the y tick labels if necessary. The font size of all the ticks is set in line 64.

The colorbar settings are shown in lines 67 through 76. Observe how you can specify the position of the bottom left corner of the colorbar, and from there assign its width and height. Note that there are a couple of ways to create the colorbar. The first one, shown in line 72 (commented out), generates a colorbar with the default ticks. However, if you want to customize or add text to your colorbar, you have to do so as shown in lines 73-74. The ticks parameter in line 73 specifies the positions where the labels written in line 74 are displayed. You can set the colorbar label with .set_label; in line 75 I loop through the colormap labels for each plot and assign their font size. The labelpad parameter lets you specify the distance between the colorbar and its label. Finally, the font size of the colorbar ticks is specified in line 76.
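The colorbar steps can also be tested on their own. The sketch below uses made-up data and a single subplot. The vmin and vmax arguments are an addition that is not in the script above; they pin the color scale to the 0 to 1 range so that the three ticks always fall inside the colorbar.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

values = np.random.default_rng(1).random((10, 5))   # made-up data in [0, 1)

fig, ax = plt.subplots()
fig.subplots_adjust(bottom=0.3)                      # leave room under the plot
mesh = ax.pcolor(values, cmap='YlGnBu', vmin=0, vmax=1)

# [left, bottom, width, height] in figure coordinates (0 to 1)
cbar_ax = fig.add_axes([0.125, 0.13, 0.775, 0.04])
cbar = fig.colorbar(mesh, cax=cbar_ax, ticks=[0, 0.5, 1], orientation='horizontal')
cbar.ax.set_xticklabels(['Low', 'Medium', 'High'])   # text shown at the three ticks
cbar.set_label('Colormap label', fontsize=12, labelpad=10)
fig.canvas.draw()                                    # render the figure in memory
```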

I hope you can find some of the previous elements useful when designing your own color matrices ;).

A Guide to Using Git in PyCharm – Part 2

This post is part 2 of a multi-part series intended to describe how to use PyCharm’s integrated Git (version control) features.

In my previous PyCharm post (Part 1), I described how to get started using Git in PyCharm, including:

  • Creating a local git repository
  • Adding and committing files to a git repository
  • Making and committing changes to a python file

In this post, I will describe:

  • Creating branches
  • Merging branches

These are extremely valuable features of Git, as they allow you to create a new version of your code to modify and evaluate before permanently integrating the changes into the master version of your code.

The following tutorial assumes you already have PyCharm and Git installed, and have a GitHub account. It also assumes you have a git repository to start with. Follow the steps in my first blog post if you don’t already have a git repository.

Create a new branch by right-clicking on your repository project folder and selecting Git –> Repository –> Branches in the menu, as shown below.

figure-1

Select “New Branch” and then enter the name of the new branch in the box (see below for how I named the example branch):

figure-2

You can now see in the version control window (which you can open by clicking “Version Control” at the bottom of the screen) that a new branch called “example_branch_1” has been created from the master branch.

figure-3

Importantly, PyCharm has now placed you on the new branch (“example_branch_1”). Any changes you make and commit are made to this branch, not to master (the branch you started from).

First, right-click on the python file and select Git –> Add.

Now, make some modifications to the python file on this branch.

In the menu at the top of the screen, select VCS –> Commit changes.

In the menu that appears, provide a message to attach to your commit so you know later what changes you made. You can double click on the file name in the window (e.g., on the “blog_post_file.py” file) to review what changes have actually been made. If you like the changes, select “Commit”.

figure-4

In the menu at the very bottom of the screen to the right, you can now go back and “check out” the master branch. Just click master –> Checkout, as is shown below.

figure-6

What now appears on your screen is the previous version of the “blog_post_file.py” file, before the function was modified, as is shown in the image below.

figure-5

The version control menu at the bottom explains the structure (and dates/times) of the master and branches.

You can now go back to the example branch if you want using the same feature.

This feature allows you to quickly switch between branches that are being modified and see how they differ. (It is also accessible from the menu at the top of the screen through VCS –> Git –> Branches.)

If you want to merge your changes into your local master, do the following. With the master branch checked out, select “example_branch_1” in the same branch menu at the bottom right of the screen (shown below). This time select “Merge” rather than “Checkout”. Your changes will now be merged into the master branch, and the modified function will appear on both the master and the example branch.
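For readers who like to see what such a workflow looks like outside the IDE, here is a rough sketch that runs comparable git commands from Python in a throwaway repository. It is an illustration of the branch, commit, checkout and merge steps described above, not a description of what PyCharm runs internally. The git helper, the greet function and the commit messages are hypothetical, and the demo only runs if git is installed.

```python
import os
import shutil
import subprocess
import tempfile

def git(repo, *args):
    """Run a git command inside `repo` and return its output as text."""
    cmd = ["git", "-c", "user.name=Example", "-c", "user.email=example@example.com", *args]
    result = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return result.stdout.strip()

def branch_and_merge_demo(repo):
    path = os.path.join(repo, "blog_post_file.py")
    # Starting point: a repository with one committed file
    git(repo, "init")
    with open(path, "w") as f:
        f.write("def greet():\n    return 'hello'\n")
    git(repo, "add", "blog_post_file.py")
    git(repo, "commit", "-m", "Initial commit")
    main_branch = git(repo, "rev-parse", "--abbrev-ref", "HEAD")  # e.g. master

    # Create the new branch, switch to it, and commit a change there
    git(repo, "checkout", "-b", "example_branch_1")
    with open(path, "w") as f:
        f.write("def greet():\n    return 'hello from the branch'\n")
    git(repo, "commit", "-am", "Modify greet on the example branch")

    # Check out the original branch: the file goes back to its old version
    git(repo, "checkout", main_branch)
    with open(path) as f:
        before_merge = f.read()

    # Merge the example branch into the original branch
    git(repo, "merge", "example_branch_1")
    with open(path) as f:
        after_merge = f.read()
    return before_merge, after_merge

if shutil.which("git"):  # only run the demo when git is available
    with tempfile.TemporaryDirectory() as repo:
        before, after = branch_and_merge_demo(repo)
        print("modified before merge:", "branch" in before)
        print("modified after merge:", "branch" in after)
```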

figure-7

I will continue to discuss PyCharm’s Git features in future posts in this series.

Basic Machine Learning in Python with Scikit-learn

Machine learning has become a hot topic in the last few years, and for good reason. It provides data analysts with efficient ways of extracting information from data, which can then be used for analysis and modeling.

The Scikit-learn Python library has implementations of dozens of learning algorithms and is freely available for academic and commercial use under the terms of the BSD licence. Some of these algorithms can be extremely useful for our job as water systems analysts, so given the overwhelming number of algorithms implemented in Scikit-learn, I thought I would mention a few I find particularly useful for my research. For each method below I have included links with examples from the Scikit-learn website. Installation and usage instructions can be found on their website.

CART Trees

CART trees can be used for regression or classification. Like any tree, CART trees are considered poor (generally high variance) classifiers unless bootstrapped (bagged) or boosted (see supervised learning), but the resulting rules are easily interpretable.

CART Trees: http://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart
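To illustrate the interpretability mentioned above, here is a minimal sketch that fits a shallow classification tree and prints its rules as text. The iris data set, the depth limit and the train/test split are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# A shallow tree keeps the rule set short enough to read
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)   # fraction of test points classified correctly
rules = export_text(tree, feature_names=list(iris.feature_names))
print(f"test accuracy: {accuracy:.2f}")
print(rules)
```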

Dimensionality reduction

Principal Component Analysis (PCA) is perhaps the most widely used dimensionality reduction technique. It works by finding the basis that maximizes the data’s variance, allowing for the elimination of axes that have low variance. Among its uses are noise reduction, data visualization (as it largely preserves the distances between data points), and improving the computational efficiency of other algorithms by getting rid of redundant information. PCA can be used in its pure form, or it can be kernelized to handle data sets whose variance is maximum in a non-linear direction. Manifold learning is another way of performing dimensionality reduction, by unwinding the lower-dimensional manifold where the information lies.

PCA: http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_3d.html#sphx-glr-auto-examples-decomposition-plot-pca-3d-py

Kernel PCA: http://scikit-learn.org/stable/auto_examples/decomposition/plot_kernel_pca.html#sphx-glr-auto-examples-decomposition-plot-kernel-pca-py

Manifold learning: http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py
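As a small illustration of plain PCA, the sketch below builds a synthetic 3-D data set that varies almost entirely along a single direction and reduces it to two components. The data and the number of components are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 points in 3-D that vary mostly along one direction, plus a little noise
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t, -t]) + 0.05 * rng.normal(size=(200, 3))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # coordinates in the basis of largest variance

print(X_reduced.shape)
print(pca.explained_variance_ratio_)   # share of the variance captured by each axis
```

Here the first component should capture nearly all of the variance, which is the situation in which dropping the remaining axes loses very little information.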

Clustering

Clustering is used to group similar points in a data set. One example is the problem of finding customer niches based on the products each customer buys. The most famous clustering algorithm is k-means, which, like any other machine learning algorithm, works well on some data sets but not on others. There are several alternative algorithms, all of which are exemplified in the following two links:

Clustering algorithms comparison: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py

Gaussian Mixture Models (often find clusters similar to those of k-means, but also provide variances): http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py

Reducing the dimensionality of a data set with PCA or kernel PCA may speed up clustering algorithms.
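The following sketch runs k-means and a Gaussian mixture model on the same synthetic blobs, to show what each one returns. The data set and the choice of three clusters are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Three well-separated 2-D blobs (synthetic data)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

kmeans_labels = kmeans.labels_      # cluster index assigned to each point
gmm_labels = gmm.predict(X)         # most likely mixture component for each point

print(kmeans.cluster_centers_.shape)   # one centre per cluster
print(gmm.covariances_.shape)          # one covariance matrix per component
```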

Supervised learning

Supervised learning algorithms can be used for regression or classification problems (e.g. classifying a point as pass/fail) based on labeled data sets. The most “trendy” ones nowadays are neural networks, but support vector machines, boosted and bagged trees, and others are also options that should be considered and tested on your data set. Below are links to some of the supervised learning algorithms implemented in Scikit-learn:

Comparison between supervised learning algorithms: http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py

Neural networks: http://scikit-learn.org/stable/modules/neural_networks_supervised.html
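Testing several algorithms on your own data set can be as simple as the sketch below, which compares three classifiers with 5-fold cross-validation. The synthetic two-moons data set and the hyperparameters are assumptions made for the example, and the random forest stands in for bagged trees.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic two-class problem with a non-linear boundary
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

models = {
    "Support vector machine": SVC(),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Neural network": MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```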

Gaussian processes are also a supervised learning algorithm (regression), and can also be used for Bayesian optimization:

Gaussian processes: http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy_targets.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-noisy-targets-py
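A minimal Gaussian process regression sketch is shown below. The noisy sine data and the kernel (an RBF term plus a white-noise term) are assumptions made for the example. The predictive standard deviation returned alongside the mean is the quantity Bayesian optimization methods typically rely on when deciding where to sample next.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
# 30 noisy observations of a sine curve (synthetic data)
X_train = rng.uniform(0, 10, size=(30, 1))
y_train = np.sin(X_train).ravel() + 0.1 * rng.normal(size=30)

# Smooth signal (RBF) plus observation noise (WhiteKernel)
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X_train, y_train)

X_new = np.linspace(0, 10, 50).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)   # prediction and its uncertainty
print(mean.shape, std.shape)
```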