Machine learning has become a hot topic in the last few years and it is for a reason. It provides data analysts with efficient ways of extracting information from data, allowing it to be used for analysis and modeling purposes.

The Scikit-learn Python library has implementations of dozens of learning algorithms and is freely available for academic and commercial use under the terms of the BSD licence. Some of these algorithms can be extremely useful for our job as water systems analysts, so given the overwelming amount of algorithms implemented in Scikit-learn, I though I would mention a few I find particularly useful for my research. For each method below I included link swith an examples from the Scikit-learn’s website. Instalation and use instructions can be found in their website.

### CART Trees

CART trees that can used for regression or classification. Any any tree, CART trees are considered poor (generally high variance) classifiers unless bootstrapped or boosted (see supervised learning), but the resulting rules are easily interpretable.

**CART Trees**: http://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart

### Dimensionality reduction

Principal component Analysis (PCA) is perhaps the most widely used dimensionality reduction technique. It works by finding the basis the maximizes the data’s variance, allowing for the elimination of axis that have low variances. Among its uses are noise reduction, data visualization, as it preserves the distances between data points, and improvement of computational efficiency of other algorithms by getting rid of redundant information. PCA can me used in its pure form or it can be kernelized to handle data sets whose variance is maximum in a non-linear direction. Manifold learning is another way of performing dimensionality reduction by unwinding the lower dimensional manifold where the information lies.

**Manifold learning**: http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py

### Clustering

Clustering is used to group similar points in a data set. One example is the problem of find customer niches based on the products each customer buys. The most famous clustering algorithm is k-means, which, as any other machine learning algorithm, works well on some data sets but not in others. There are several alternative algorithms, all of which exemplified in the following two links:

**Clustering algorithms** comparison: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py

**Gaussian Mixture Models** (finds the same results as k-means but also provides variances): http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm-covariances-py

Reducing the dimentionality of a dataset with PCA or kernel PCA may speed up clustering algorithms.

### Supervised learning

Supervised learning algorithms can be used for regression or classification problems (e.g. classify a point as pass/fail) based on labeled data sets. The most “trendy” one nowadays is neural networks, but support vector machines, boosted and bagged trees, and others are also options that should be considered and tested on your data set. Bellow are links to some of the supervised learning algorithms implemented in Scikit-learn:

Comparison between **supervised learning algorithms**: http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py

**Neural networks:** http://scikit-learn.org/stable/modules/neural_networks_supervised.html

Gaussian Processes is also a supervised learning algorithm (regression) which is also be used for Bayesian optimization:

**Gaussian processes:** http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy_targets.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-noisy-targets-py

Wow, simple neural networks are right out of box now in sklearn! Good news for everyone 🙂 I think that ML algorithms are not stable and useful for geosciences, because of its low interpretation capability if you directly use ML for regression (modelling). But as an additional tool (pre or post-processing) they are promising! Thank you for your post!

Pingback: Water Programming Blog Guide (Part I) – Water Programming: A Collaborative Research Blog