Determining Whether Differences in Diagnostic Results are Statistically Significant (Part 2)

This blog post follows Part 1 of this series and provides a tutorial for the statistical analysis I performed on my diagnostic results. After completing the diagnostic steps briefly described there, I had 900 files that looked like this:

Threshold: 0.75
Best: 0.9254680538114082
Attainment: 0.328
Controllability: 0.0
Efficiency: 1.0

There were 900 files because I evaluated 3 metrics for each of 50 seeds for 6 algorithms. To compute statistics, I first had to read the information of interest from the files. The following snippet of Matlab code is useful for that.

clc; clear all; close all;

algorithms = {'Borg', 'eMOEA', 'eNSGAII', 'NSGAII', 'MOEAD', 'GDE3'};
seeds = (1:1:50);
metrics = {'GenDist'; 'EpsInd'; 'Hypervolume'};
% work = sprintf('./working_directory/'); %specify the working directory
problem = 'Problem';

%Preallocate storage: seeds x algorithms x metrics
threshold       = zeros(length(seeds), length(algorithms), length(metrics));
best            = zeros(size(threshold));
attainment      = zeros(size(threshold));
controllability = zeros(size(threshold));
efficiency      = zeros(size(threshold));

%Loop through metrics
for i = 1:length(metrics)
    %Loop through algorithms
    for j = 1:length(algorithms)
        %Loop through seeds
        for k = 1:length(seeds)
            %open and read files
            filename = ['./' metrics{i} '_' num2str(75) '_' algorithms{j} '_' num2str(seeds(k)) '.txt'];
            fh = fopen(filename);
            if fh == -1
                error('Error opening %s!', filename);
            end
            values = textscan(fh, '%*s %f', 5, 'Headerlines', 1);
            fclose(fh);

            values = values{1};

            threshold(k,j,i)       = values(1);
            best(k,j,i)            = values(2);
            %Normalize the best Hypervolume value to be between 0 and 1
            if strcmp(metrics{i}, 'Hypervolume')
                best(k,j,i) = best(k,j,i)/(threshold(k,j,i)/(75/100));
            end
            attainment(k,j,i)      = values(3);
            controllability(k,j,i) = values(4);
            efficiency(k,j,i)      = values(5);
        end
    end
end

The above code loops through all metrics, algorithms, and seeds to open and read the corresponding files. Each numerical value is stored in an appropriately named three-dimensional array. For this tutorial, I am going to focus on a statistical analysis of the probability of attainment for the Hypervolume metric across all algorithms. The values of interest are stored in the attainment array.

Below is code that can be used for the Kruskal-Wallis nonparametric one-way ANOVA.

P = kruskalwallis(attainment(:,:,3),algorithms,'off');

The above code performs the Kruskal-Wallis test on the matrix of 50 rows and 6 columns storing the Hypervolume attainment values and determines whether there is a difference in performance between any two of the groups (columns). In this case, the 6 groups are the algorithms. The ‘off’ flag suppresses the graphical display of results. More information can be found using Matlab help, as you probably already know.
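As a sketch of how the omnibus test gates the pairwise tests that follow (the 0.05 cutoff here is my choice for illustration, not a requirement):

```matlab
% Run the omnibus test, then only continue to pairwise comparisons
% if at least two groups appear to differ
P = kruskalwallis(attainment(:,:,3), algorithms, 'off');
if P < 0.05
    disp('At least two algorithms differ; run pairwise Mann-Whitney U tests.');
end
```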

If this returns a small p-value, indicating that there is a difference between at least two of your algorithms, it is worthwhile to continue to the Mann-Whitney U test (ranksum in Matlab). I used the following code to loop through the possible algorithm pairs and determine whether there was a statistical difference in the probability of attaining the 75% threshold for Hypervolume. I wasn’t very concerned about the p-value, so it is constantly overwritten, but I stored the binary value indicating whether or not to reject the null hypothesis in a matrix. In this case, I used an upper-tail test, as algorithms with higher probabilities of attaining a threshold are considered to perform better.

Mann_Whitney_U_Hyp = zeros(length(algorithms));
for i = 1:length(algorithms)
    for j = 1:length(algorithms)
        [P3, Mann_Whitney_U_Hyp(i,j)] = ranksum(attainment(:,i,3), attainment(:,j,3), ...
                                                'alpha', 0.05, 'tail', 'right');
    end
end

In the end, you get a table of zeros and ones like the one below.

A 1 indicates that the null hypothesis should be rejected at the 5% level of significance.

The help function indicates that a 1 means the null hypothesis that two medians are the same can be rejected; the alternative hypothesis tested in this case was that the median of the first group is higher. This table was therefore used to rank algorithm performance. Since the loop went row then column, Borg is considered the best-performing algorithm in this case, as its median is higher than the median of every other algorithm at the 5% level of significance. The rest of my ranking goes NSGAII, eNSGAII, eMOEA, GDE3, and MOEA/D.
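The ranking can also be automated by counting, for each algorithm, how many pairwise upper-tail tests it wins. This is a hypothetical sketch that was not part of my original script; it assumes the Mann_Whitney_U_Hyp matrix built in the loop above:

```matlab
% Row i holds the tests where algorithm i's median was significantly
% higher, so the row sum counts how many algorithms it beats
wins = sum(Mann_Whitney_U_Hyp, 2);
[~, order] = sort(wins, 'descend');
ranked = algorithms(order);   % best-performing algorithm first
```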

Hopefully, this is helpful to at least one person in the future. Feel free to use the snippets in your own code to automate the whole process. 🙂
