Determining Whether Differences in Diagnostic Results are Statistically Significant (Part 2)

This blog post follows Part 1 of this series and provides a tutorial for the statistical analysis I performed on my diagnostic results. After completing the diagnostic steps briefly described in Part 1, I had 900 files that looked like this:

./SOW4/metrics/Borg_myLake4ObjStoch_1.metrics
Threshold: 0.75
Best: 0.9254680538114082
Attainment: 0.328
Controllability: 0.0
Efficiency: 1.0

There were 900 files because I evaluated 3 metrics for each of 50 seeds of each of 6 algorithms. To evaluate statistics, I first had to read in the information of interest from the files. The following snippet of Matlab code does that.

clc; clear all; close all;

algorithms = {'Borg', 'eMOEA', 'eNSGAII', 'NSGAII', 'MOEAD', 'GDE3'};
seeds = (1:1:50);
metrics = {'GenDist'; 'EpsInd'; 'Hypervolume';};
% work = sprintf('./working_directory/'); %specify the working directory
problem = 'Problem';

%Loop through metrics
for i=1:length(metrics)
    %Loop through algorithms
    for j=1:length(algorithms)
        %Loop through seeds
        for k=1:length(seeds)
         %open and read files
            filename = ['./' metrics{i} '_' num2str(75) '_' algorithms{j} '_' num2str(seeds(k)) '.txt'];
            fh = fopen(filename);
            if(fh == -1), disp('Error opening analysisFile!'); end
            values = textscan(fh, '%*s %f', 5, 'Headerlines',1);
            fclose(fh);

            values = values{1};

            threshold(k,j,i)       = values(1);
            best(k,j,i)            = values(2);
            if strcmp(metrics{i},'Hypervolume'); best(k,j,i)   = best(k,j,i)/(threshold(k,j,i)/(75/100)); end; %Normalize the best Hypervolume value to be between 0 and 1
            attainment(k,j,i)      = values(3);
            controllability(k,j,i) = values(4);
            efficiency(k,j,i)      = values(5);
        end

    end

end 

The above code loops through all metrics, algorithms, and seeds to open and read the corresponding files. Each of the numerical values is stored in an appropriately named three-dimensional array. For this tutorial, I am going to focus on a statistical analysis of the probability of attainment for the Hypervolume metric across all algorithms. The values of interest are stored in the attainment array.
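
Before running the formal tests, a quick visual comparison can serve as a sanity check. The snippet below is a minimal sketch; it assumes the attainment array built by the reading loop above and uses boxplot from the Statistics Toolbox (the same toolbox that provides kruskalwallis and ranksum).

% Quick visual comparison of Hypervolume attainment across algorithms
figure;
boxplot(attainment(:,:,3), 'Labels', algorithms);
ylabel('Probability of attaining the 75% Hypervolume threshold');
title('Hypervolume attainment by algorithm');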

Below is code that can be used for the Kruskal-Wallis nonparametric one-way ANOVA.

P = kruskalwallis(attainment(:,:,3),algorithms,'off');

The above code performs the Kruskal-Wallis test on the matrix of 50 rows and 6 columns storing the Hypervolume attainment values and determines whether there is a difference in performance between any two of the groups (columns); in this case, the 6 groups are the algorithms. The ‘off’ flag suppresses the graphical display; set it to ‘on’ if you want the figure. More information can be found in the Matlab help documentation.
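
As a simple check on that result before moving on (a sketch, using the conventional 5% significance level):

% Continue to pairwise tests only if Kruskal-Wallis detects a difference
if P < 0.05
    disp('At least two algorithms differ in Hypervolume attainment; run pairwise tests.');
else
    disp('No significant difference detected across algorithms.');
end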

If this returns a small p-value, indicating that there is a difference between at least two of your algorithms, it is worthwhile to continue to the Mann-Whitney U test. I used the following code to loop through the possible algorithm pairs and determine whether there was a statistical difference in the probability of attaining the 75% threshold for Hypervolume. I was not very concerned with the p-value itself, so it is constantly overwritten, but I stored the binary value indicating whether or not to reject the null hypothesis in a matrix. In this case, I used an upper-tailed test, as algorithms with higher probabilities of attaining a threshold are considered to perform better.

for i = 1:length(algorithms)
    for j = 1:length(algorithms)
        [P3, Mann_Whitney_U_Hyp(i,j)] = ranksum(attainment(:,i,3), ...
                                                attainment(:,j,3), ...
                                                'alpha', 0.05, 'tail', 'right');
    end
end

In the end, you get a table of zeros and ones like the one below.
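
Because the result is just a 6-by-6 matrix of zeros and ones, one way to view it with algorithm labels is a small sketch like the one below (it assumes a Matlab version that includes array2table, i.e., R2013b or later).

% Display the pairwise test results with algorithm names on rows and columns
array2table(Mann_Whitney_U_Hyp, 'RowNames', algorithms, 'VariableNames', algorithms)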

A 1 indicates that the null hypothesis should be rejected at the 5% level of significance.

The help documentation indicates that a 1 means the null hypothesis that the two medians are the same can be rejected in favor of the alternative hypothesis, which in this case was that the median of the first group is higher, so this table can be used to rank algorithm performance. As the loop went row then column, Borg is considered the best-performing algorithm in this case because its median is higher than the medians of all other algorithms at the 5% level of significance. The rest of my ranking goes NSGAII, eNSGAII, eMOEA, GDE3, and MOEA/D.
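
The ranking itself can also be automated from that matrix. The sketch below simply counts how many of the pairwise right-tailed tests each algorithm wins (row sums, ignoring the self-comparison on the diagonal) and sorts from most wins to fewest.

% Count pairwise wins for each algorithm and sort from best to worst
wins = sum(Mann_Whitney_U_Hyp, 2) - diag(Mann_Whitney_U_Hyp);
[~, order] = sort(wins, 'descend');
ranking = algorithms(order)   % best-performing algorithm listed first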

Hopefully, this is helpful to at least one person in the future. Feel free to use the snippets in your own code to automate the whole process. 🙂

Determining Whether Differences in Diagnostic Results are Statistically Significant (Part 1)

While working on my diagnostic assessment of the difficulty of the “Lake Problem”, I decided that I would like to check whether the differences in performance I observed were statistically significant. This part of the blog post gives a broad overview of the steps I took leading up to the statistical analysis as well as the analysis itself; Part 2 includes a tutorial. Looking through prior group papers, I drew initial inspiration from the following articles by Dave, although I think my final method differs slightly.

Hadka, D. and P. Reed. Diagnostic Assessment of Search Controls and Failure Modes in Many-Objective Evolutionary Optimization. Evolutionary Computation, 20(3):423-452, 2012.

Hadka, D. and P. Reed. Borg: An Auto-Adaptive Many-Objective Evolutionary Computing Framework. Evolutionary Computation, 21(2):231-259, 2013.

Below is a short summary of the steps preceding my statistical analysis with links to group resources for those who may need more background.  For everyone else, feel free to skip ahead.

Before I could determine the statistical significance of my results, I needed to perform a comparative study of algorithm performance. Following sections 8.1 to 8.8 and section 8.10 of the MOEAFramework manual (available on the MOEAFramework website), I generated approximate fronts for the problem for multiple random seeds (I used 50) using algorithms found within the MOEAFramework, calculated metrics for each seed across all parameterizations, and determined, for each seed, the best value attained across all parameterizations and the probability of attaining a specified threshold (I used 75%). I skipped section 8.9 and instead performed the analysis described in 8.10 on each seed’s metrics, since many samples are necessary to perform statistics on algorithm performance. Please note that I also included the Borg evolutionary algorithm in this study, which requires optimization outside of the framework; all other steps for Borg can be performed the same way as for the MOEAFramework algorithms.

In this case, I used Latin Hypercube Sampling to generate 500 samples of the feasible parameter space for each algorithm. I chose parameter ranges from Table 1 of the following paper:

Reed, P., et al. Evolutionary Multiobjective Optimization in Water Resources: The Past, Present, and Future. (Editor Invited Submission to the 35th Anniversary Special Issue), Advances in Water Resources, 51:438-456, 2013.
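
As an aside, if you would rather generate a comparable Latin hypercube sample directly in Matlab than with the MOEAFramework tools, lhsdesign from the Statistics Toolbox can do it. The sketch below uses hypothetical parameter names and bounds rather than the actual ranges from Table 1.

% Sketch: 500 Latin hypercube samples scaled to (placeholder) parameter ranges
nSamples = 500;
lowerBounds = [10, 0.001];   % hypothetical lower bounds, e.g., [populationSize, mutationRate]
upperBounds = [1000, 1.0];   % hypothetical upper bounds
unitSamples = lhsdesign(nSamples, length(lowerBounds));
samples = bsxfun(@plus, lowerBounds, bsxfun(@times, unitSamples, upperBounds - lowerBounds));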

Here is the shell script I used for calculating the best attained values and probability of attaining the 75% threshold for all algorithms and all seeds.

#!/bin/bash
NSEEDS=50
METHOD=Latin
PROBLEM=myLake4ObjStoch
ALGORITHMS=( NSGAII GDE3 eNSGAII MOEAD eMOEA Borg)

SEEDS=$(seq 1 ${NSEEDS})
JAVA_ARGS="-cp MOEAFramework-2.1-Demo.jar"
set -e

# Perform the analysis
echo ""
echo "Hypervolume Analysis:"
for ALGORITHM in ${ALGORITHMS[@]}
do
	for SEED in ${SEEDS}
	do
    NAME=${ALGORITHM}_${PROBLEM}_${SEED}_Hypervolume

    PBS="\
    #PBS -N ${NAME}\n\
    #PBS -l nodes=1\n\
    #PBS -l walltime=96:00:00\n\
    #PBS -o output/${NAME}\n\
    #PBS -e error/${NAME}\n\
    cd \$PBS_O_WORKDIR\n\
        java ${JAVA_ARGS} org.moeaframework.analysis.sensitivity.Analysis \
        -p ${ALGORITHM}_params.txt -i ${ALGORITHM}_${METHOD} -t 0.75 -o ./SOW6/Hypervolume/${ALGORITHM}_${PROBLEM}_${SEED}_Hypervolume.analysis \
        -c -e -m 0 --hypervolume 0.7986 ./SOW6/metrics/${ALGORITHM}_${PROBLEM}_${SEED}.metrics"
    echo -e $PBS | qsub
    done

done

The above shell script is written for Hypervolume, but other metrics can be specified easily by indicating a different number after the -m flag and removing the --hypervolume flag. If one wants to analyze Generational Distance, type 1 after the -m flag; for the Additive Epsilon Indicator, type 4. Typically, these are the three metrics of interest, as explained in this blog post: https://waterprogramming.wordpress.com/2013/06/25/moea-performance-metrics/. The --hypervolume flag allows you to specify the hypervolume of your reference set so that the individual hypervolume values are normalized to that of the best known approximation to the Pareto front.

Hypervolume can be calculated for a given approximation set using the following command.

java -cp MOEAFramework-2.1-Demo.jar org.moeaframework.analysis.sensitivity.SetHypervolume name_of_reference_set_file 

Finally, I calculated statistics on these values. I used the Kruskal-Wallis and Mann-Whitney U (also called the Wilcoxon rank-sum) tests that Dave used. One advantage of these tests is that they are non-parametric, meaning that no assumptions are made about the distributions from which the data come. Another advantage I found is that they are built into Matlab as the kruskalwallis and ranksum functions. I began with the Kruskal-Wallis test, as it tests the null hypothesis that independent samples from two or more groups come from distributions with equal medians and returns the associated p-value. This test is a useful screen, as there would not have been much point in continuing if it had indicated that the values for all algorithms were statistically indistinguishable.

I received very low p-values using Kruskal-Wallis, so I continued to the Mann-Whitney U test in every case. The Mann-Whitney U test provides a pair-wise comparison. Matlab's ranksum function has options to test one of three alternative hypotheses, illustrated in the short sketch after this list:

  1. medians are not equal (default, two-tailed test)
  2. median of X is greater than median of Y (right-tailed test)
  3. median of X is less than median of Y (left-tailed test)
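
Here is a small sketch of how those three alternatives map onto the 'tail' option of ranksum, where x and y are placeholders for the 50 metric values from two different algorithms:

% x and y are placeholder vectors, e.g., x = attainment(:,1,3); y = attainment(:,2,3);
[p_both,  h_both]  = ranksum(x, y);                   % 1. medians not equal (default)
[p_right, h_right] = ranksum(x, y, 'tail', 'right');  % 2. median of x greater than median of y
[p_left,  h_left]  = ranksum(x, y, 'tail', 'left');   % 3. median of x less than median of y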

I used one-tailed tests in order to determine whether one algorithm's performance was superior to another algorithm's performance. I was then able to rank algorithm performance on any given metric by counting the number of algorithms each algorithm outperformed according to this test. A detailed walkthrough is provided in Part 2.