
starting som prediction fine-tuned class-performance visualisation

git-svn-id: https://svn.discofish.de/MATLAB/spmtoolbox/SVMCrossVal@112 83ab2cfd-5345-466c-8aeb-2b2739fb922d

Christoph Budziszewski authored on 21/01/2009 16:34:25
Showing 1 changed file
new file mode 100644
%SOM_DEMO4 Data analysis using the SOM.

% Contributed to SOM Toolbox 2.0, February 11th, 2000 by Juha Vesanto
% http://www.cis.hut.fi/projects/somtoolbox/

% Version 1.0beta juuso 071197
% Version 2.0beta juuso 090200 070600

clf reset;
f0 = gcf;
echo on

clc
%    ==========================================================
%    SOM_DEMO4 - DATA ANALYSIS USING THE SOM
%    ==========================================================

%    In this demo, the IRIS data set is analysed using the SOM. First,
%    the data is read from an ASCII file (please make sure it can be
%    found on the current path), normalized, and a map is trained.
%    Since the data also has labels, the map is labelled.

try
  sD = som_read_data('iris.data');
catch
  echo off
  warning('File ''iris.data'' not found. Using simulated data instead.')

  D = randn(50,4);
  D(:,1) = D(:,1)+5;     D(:,2) = D(:,2)+3.5;
  D(:,3) = D(:,3)/2+1.5; D(:,4) = D(:,4)/2+0.3;
  D2 = randn(100,4); D2(:,2) = sort(D2(:,2));
  D2(:,1) = D2(:,1)+6.5; D2(:,2) = D2(:,2)+2.8;
  D2(:,3) = D2(:,3)+5;   D2(:,4) = D2(:,4)/2+1.5;
  sD = som_data_struct([D; D2],'name','iris (simulated)',...
                       'comp_names',{'SepalL','SepalW','PetalL','PetalW'});
  sD = som_label(sD,'add',[1:50]','Setosa');
  sD = som_label(sD,'add',[51:100]','Versicolor');
  sD = som_label(sD,'add',[101:150]','Virginica');

  echo on
end

sD = som_normalize(sD,'var');
sM = som_make(sD);
sM = som_autolabel(sM,sD,'vote');
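The core operation behind SOM_AUTOLABEL, SOM_BMUS and SOM_HITS is the best-matching-unit (BMU) lookup: every sample is assigned to the codebook prototype nearest to it in Euclidean distance. A minimal language-agnostic sketch in Python/NumPy (the array shapes and tiny example data are illustrative assumptions, not SOM Toolbox internals):

```python
import numpy as np

def bmus(codebook, data):
    """Index of the best-matching unit (nearest prototype) per sample."""
    # squared Euclidean distances, shape (n_samples, n_units)
    d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def hits(codebook, data):
    """How many samples land on each unit (cf. som_hits)."""
    return np.bincount(bmus(codebook, data), minlength=len(codebook))

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
data = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [2.2, 1.9]])
print(bmus(codebook, data))  # -> [0 1 1 2]
print(hits(codebook, data))  # -> [1 2 1]
```

Vote-based labelling (the 'vote' mode above) then gives each unit the majority class label among the samples that hit it.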
53
+pause % Strike any key to visualize the map...
54
+
55
+clc 
56
+%    VISUAL INSPECTION OF THE MAP
57
+%    ============================
58
+
59
+%    The first step in the analysis of the map is visual inspection.
60
+%    Here is the U-matrix, component planes and labels (you may
61
+%    need to enlarge the figure in order to make out the labels).
62
+
63
+som_show(sM,'umat','all','comp',[1:4],'empty','Labels','norm','d');
64
+som_show_add('label',sM.labels,'textsize',8,'textcolor','r','subplot',6);
65
+
66
+%    From this first visualization, one can see that:
67
+%     - there are essentially two clusters
68
+%     - PetalL and PetalW are highly correlated
69
+%     - SepalL is somewhat correlated to PetalL and PetalW
70
+%     - one cluster corresponds to the Setosa species and exhibits
71
+%       small petals and short but wide sepals
72
+%     - the other cluster corresponds to Virginica and Versicolor
73
+%       such that Versicolor has smaller leaves (both sepal and
74
+%       petal) than Virginica
75
+%     - inside both clusters, SepalL and SepalW are highly correlated
76
+
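The U-matrix shown above encodes, for each unit, the distance from its prototype to its grid neighbours; high values mark cluster borders. A simplified sketch in Python/NumPy, assuming a rectangular (4-connected) lattice and returning only the per-unit averages rather than the interleaved matrix SOM_UMAT produces:

```python
import numpy as np

def umatrix(codebook, rows, cols):
    """Average distance from each prototype to its 4-connected grid
    neighbours; high values mark cluster borders (cf. som_umat)."""
    W = codebook.reshape(rows, cols, -1)
    U = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(W[r, c] - W[rr, cc]))
            U[r, c] = np.mean(dists)
    return U

# two tight groups of prototypes separated by a gap along one column
codebook = np.array([[0.0], [0.1], [5.0], [5.1]])
U = umatrix(codebook, rows=4, cols=1)
print(U.ravel())  # -> [0.1 2.5 2.5 0.1]; border units get the large values
```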
77
+pause % Strike any key to continue...
78
+
79
+%    Next, the projection of the data set is investigated. A
80
+%    principle component projection is made for the data, and applied
81
+%    to the map. The colormap is done by spreading a colormap on the
82
+%    projection. Distance matrix information is extracted from the
83
+%    U-matrix, and it is modified by knowledge of zero-hits
84
+%    (interpolative) units. Finally, three visualizations are shown:
85
+%    the color code, with clustering information and the number of
86
+%    hits in each unit, the projection and the labels.
87
+
88
+echo off
89
+
90
+f1=figure;
91
+[Pd,V,me,l] = pcaproj(sD,2); Pm = pcaproj(sM,V,me); % PC-projection
92
+Code = som_colorcode(Pm); % color coding
93
+hits = som_hits(sM,sD);  % hits
94
+U = som_umat(sM); % U-matrix
95
+Dm = U(1:2:size(U,1),1:2:size(U,2)); % distance matrix
96
+Dm = 1-Dm(:)/max(Dm(:)); Dm(find(hits==0)) = 0; % clustering info
97
+
98
+subplot(1,3,1)
99
+som_cplane(sM,Code,Dm);
100
+hold on
101
+som_grid(sM,'Label',cellstr(int2str(hits)),...
102
+	 'Line','none','Marker','none','Labelcolor','k');
103
+hold off 
104
+title('Color code')
105
+
106
+subplot(1,3,2)
107
+som_grid(sM,'Coord',Pm,'MarkerColor',Code,'Linecolor','k');
108
+hold on, plot(Pd(:,1),Pd(:,2),'k+'), hold off, axis tight, axis equal
109
+title('PC projection')
110
+
111
+subplot(1,3,3)
112
+som_cplane(sM,'none')
113
+hold on
114
+som_grid(sM,'Label',sM.labels,'Labelsize',8,...
115
+	 'Line','none','Marker','none','Labelcolor','r');
116
+hold off
117
+title('Labels')
118
+
119
+echo on
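The key idea in the projection step above is that PCAPROJ is fitted once on the data and the same basis and mean are then reused for the map codebook, so both land in the same 2-D coordinate system. A sketch of that pattern in Python/NumPy (function names and shapes are illustrative, not the Toolbox API):

```python
import numpy as np

def pca_project(data, n_dim=2):
    """Fit a PCA projection on `data`; return the projected data plus
    the basis and mean so the identical projection can be applied to
    another matrix (cf. pcaproj fitted on sD, reused for sM)."""
    mean = data.mean(axis=0)
    X = data - mean
    # right singular vectors of the centered data = principal axes
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:n_dim].T              # shape (n_features, n_dim)
    return X @ basis, basis, mean

def apply_projection(other, basis, mean):
    """Project another matrix (e.g. the map codebook) with the same basis."""
    return (other - mean) @ basis

rng = np.random.default_rng(0)
Pd, V, me = pca_project(rng.normal(size=(100, 4)), 2)
Pm = apply_projection(rng.normal(size=(12, 4)), V, me)
print(Pd.shape, Pm.shape)  # -> (100, 2) (12, 2)
```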
120
+
121
+%    From these figures one can see that: 
122
+%     - the projection confirms the existence of two different clusters
123
+%     - interpolative units seem to divide the Virginica
124
+%       flowers into two classes, the difference being in the size of
125
+%       sepal leaves
126
+    
127
+pause % Strike any key to continue...
128
+
129
+%    Finally, perhaps the most informative figure of all: simple
130
+%    scatter plots and histograms of all variables. The species
131
+%    information is coded as a fifth variable: 1 for Setosa, 2 for
132
+%    Versicolor and 3 for Virginica. Original data points are in the
133
+%    upper triangle, map prototype values on the lower triangle, and
134
+%    histograms on the diagonal: black for the data set and red for
135
+%    the map prototype values. The color coding of the data samples
136
+%    has been copied from the map (from the BMU of each sample). Note
137
+%    that the variable values have been denormalized.
138
+
139
+echo off
140
+
141
+% denormalize and add species information
142
+names = sD.comp_names; names{end+1} = 'species';
143
+D = som_denormalize(sD.data,sD); dlen = size(D,1);
144
+s = zeros(dlen,1)+NaN; s(strcmp(sD.labels,'Setosa'))=1;
145
+s(strcmp(sD.labels,'Versicolor'))=2; s(strcmp(sD.labels,'Virginica'))=3;
146
+D = [D, s];
147
+M = som_denormalize(sM.codebook,sM); munits = size(M,1);
148
+s = zeros(munits,1)+NaN; s(strcmp(sM.labels,'Setosa'))=1;
149
+s(strcmp(sM.labels,'Versicolor'))=2; s(strcmp(sM.labels,'Virginica'))=3;
150
+M = [M, s];
151
+
152
+f2=figure;
153
+
154
+% color coding copied from the map
155
+bmus = som_bmus(sM,sD); Code_data = Code(bmus,:); 
156
+
157
+k=1; 
158
+for i=1:5, for j=1:5, 
159
+    if i<j, i1=i; i2=j; else i1=j; i2=i; end
160
+    subplot(5,5,k); cla
161
+    if i<j,
162
+      som_grid('rect',[dlen 1],'coord',D(:,[i1 i2]),...
163
+	       'Line','none','MarkerColor',Code_data,'Markersize',2);
164
+      title(sprintf('%s vs. %s',names{i1},names{i2}))
165
+    elseif i>j,
166
+      som_grid(sM,'coord',M(:,[i1 i2]),...
167
+	       'markersize',2,'MarkerColor',Code);
168
+      title(sprintf('%s vs. %s',names{i1},names{i2}))
169
+    else
170
+      if i1<5, b = 10; else b = 3; end
171
+      [nd,x] = hist(D(:,i1),b); nd=nd/sum(nd); 
172
+      nm = hist(M(:,i1),x); nm = nm/sum(nm);
173
+      h=bar(x,nd,0.8); set(h,'EdgeColor','none','FaceColor','k'); 
174
+      hold on 
175
+      h=bar(x,nm,0.3); set(h,'EdgeColor','none','FaceColor','r'); 
176
+      hold off
177
+      title(names{i1})
178
+    end
179
+    k=k+1;
180
+  end
181
+end
182
+
183
+echo on
184
+
185
+%    This visualization shows quite a lot of information:
186
+%    distributions of single and pairs of variables both in the data
187
+%    and in the map. If the number of variables was even slightly
188
+%    more, it would require a really big display to be convenient to
189
+%    use.
190
+
191
+%    From this visualization we can conform many of the earlier
192
+%    conclusions, for example: 
193
+%     - there are two clusters: 'Setosa' (blue, dark green) and 
194
+%       'Virginica'/'Versicolor' (light green, yellow, reds)
195
+%       (see almost any of the subplots)
196
+%     - PetalL and PetalW have a high linear correlation (see
197
+%       subplots 4,3 and 3,4)
198
+%     - SepalL is correlated (at least in the bigger cluster) with
199
+%       PetalL and PetalW (in subplots 1,3 1,4 3,1 and 4,1)
200
+%     - SepalL and SepalW have a clear linear correlation, but it
201
+%       is slightly different for the two main clusters (in
202
+%       subplots 2,1 and 1,2)       
203
+
204
+pause % Strike any key to cluster the map...
205
+
206
+close(f1), close(f2), figure(f0), clf
207
+
208
+clc 
209
+%    CLUSTERING OF THE MAP
210
+%    =====================
211
+
212
+%    Visual inspection already hinted that there are at least two
213
+%    clusters in the data, and that the properties of the clusters are
214
+%    different from each other (esp. relation of SepalL and
215
+%    SepalW). For further investigation, the map needs to be
216
+%    partitioned.
217
+
218
+%    Here, the KMEANS_CLUSTERS function is used to find an initial
219
+%    partitioning. The plot shows the Davies-Boulding clustering
220
+%    index, which is minimized with best clustering.
221
+
222
+subplot(1,3,1)
223
+[c,p,err,ind] = kmeans_clusters(sM, 7); % find at most 7 clusters
224
+plot(1:length(ind),ind,'x-')
225
+[dummy,i] = min(ind)
226
+cl = p{i};
227
+
228
+%    The Davies-Boulding index seems to indicate that there are
229
+%    two clusters on the map. Here is the clustering info
230
+%    calculated previously and the partitioning result: 
231
+
232
+subplot(1,3,2)
233
+som_cplane(sM,Code,Dm)
234
+subplot(1,3,3)
235
+som_cplane(sM,cl)
236
+
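The Davies-Bouldin index used above rewards compact, well-separated clusters: for each cluster it takes the worst ratio of within-cluster scatter to between-centroid distance, then averages. A self-contained sketch in Python/NumPy (the toy data is an assumption for illustration):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: lower is better. S_i is the mean distance
    of cluster i's points to its centroid, M_ij the centroid distance."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    S = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                  for k, c in zip(ks, cents)])
    db = 0.0
    for i in range(len(ks)):
        # worst (S_i + S_j) / M_ij ratio over the other clusters
        db += max((S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(len(ks)) if j != i)
    return db / len(ks)

# two well-separated blobs score far lower than a random partition
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (50, 2))])
good = np.r_[np.zeros(50, int), np.ones(50, int)]
bad = rng.integers(0, 2, 100)
print(davies_bouldin(X, good) < davies_bouldin(X, bad))  # -> True
```

KMEANS_CLUSTERS evaluates this index for k = 1..7 partitions of the prototypes; the minimum picks the number of clusters.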
237
+%    You could use also function SOM_SELECT to manually make or modify
238
+%    the partitioning.
239
+
240
+%    After this, the analysis would proceed with summarization of the
241
+%    results, and analysis of each cluster one at a time.
242
+%    Unfortunately, you have to do that yourself. The SOM Toolbox does
243
+%    not, yet, have functions for those purposes.
244
+
245
+pause % Strike any key to continue...
246
+
247
+
248
+clf
249
+clc 
%    MODELING
%    ========

%    One can also build models on top of the SOM. Typically, these
%    models are simple local or nearest-neighbor models.

%    Here, the SOM is used for probability density estimation. Each map
%    prototype is the center of a Gaussian kernel, the parameters
%    of which are estimated from the data. The Gaussian mixture
%    model is estimated with the function SOM_ESTIMATE_GMM and the
%    probabilities can be calculated with SOM_PROBABILITY_GMM.

[K,P] = som_estimate_gmm(sM,sD);
[pd,Pdm,pmd] = som_probability_gmm(sD,sM,K,P);

%    Here is the probability density function value for the first data
%    sample (x = sD.data(1,:)) in terms of each map unit (m):

som_cplane(sM,Pdm(:,1))
colorbar
title('p(x|m)')
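The quantity being plotted, p(x|m), is simply the density of one Gaussian kernel per map unit, evaluated at the sample x. A minimal sketch in Python/NumPy, assuming isotropic kernels for simplicity (the Toolbox estimates per-dimension variances):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Isotropic d-dimensional Gaussian density N(x; mu, var*I)."""
    d = len(mu)
    q = ((x - mu) ** 2).sum() / var
    return np.exp(-0.5 * q) / (2 * np.pi * var) ** (d / 2)

def p_x_given_units(x, prototypes, variances):
    """p(x|m) for every map unit m: one Gaussian kernel centered on
    each prototype, as in the mixture built by som_estimate_gmm."""
    return np.array([gauss_pdf(x, mu, v)
                     for mu, v in zip(prototypes, variances)])

prototypes = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.array([1.0, 1.0])
pxm = p_x_given_units(np.array([0.1, -0.2]), prototypes, variances)
print(pxm.argmax())  # nearest prototype has the highest density -> 0
```

Summing the kernels with their mixture weights would give the overall density p(x), the `pd` output above.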
271
+
272
+pause % Strike any key to continue...
273
+
274
+%    Here, SOM is used for classification. Although the SOM can be
275
+%    used for classification as such, one has to remember that it does
276
+%    not utilize class information at all, and thus its results are
277
+%    inherently suboptimal. However, with small modifications, the
278
+%    network can take the class into account. The function
279
+%    SOM_SUPERVISED does this.
280
+
281
+%    Learning vector quantization (LVQ) is an algorithm that is very
282
+%    similar to the SOM in many aspects. However, it is specifically
283
+%    designed for classification. In the SOM Toolbox, there are
284
+%    functions LVQ1 and LVQ3 that implement two versions of this
285
+%    algorithm.
286
+
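The LVQ1 update rule mentioned above is simple: find the winning prototype for a training sample, then pull it toward the sample if their classes match and push it away otherwise. A sketch in Python/NumPy (learning rate and toy prototypes are illustrative assumptions):

```python
import numpy as np

def lvq1_step(protos, proto_labels, x, y, alpha=0.05):
    """One LVQ1 update: move the winning prototype toward the sample
    when the classes match, away from it when they do not."""
    w = np.linalg.norm(protos - x, axis=1).argmin()   # winner (BMU)
    sign = 1.0 if proto_labels[w] == y else -1.0
    protos[w] += sign * alpha * (x - protos[w])
    return w

protos = np.array([[0.0, 0.0], [4.0, 4.0]])
labels = ['A', 'B']
lvq1_step(protos, labels, np.array([1.0, 1.0]), 'A')
print(protos[0])  # pulled from (0,0) toward (1,1) -> [0.05 0.05]
```

LVQ3 refines this by updating the two nearest prototypes under a window condition, but the attract/repel idea is the same.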
%    Here, the function SOM_SUPERVISED is used to create a classifier
%    for the IRIS data set:

sM = som_supervised(sD,'small');

som_show(sM,'umat','all');
som_show_add('label',sM.labels,'TextSize',8,'TextColor','r')

sD2 = som_label(sD,'clear','all');
sD2 = som_autolabel(sD2,sM);       % classification
ok = strcmp(sD2.labels,sD.labels); % correct classifications
100*(1-sum(ok)/length(ok))         % error percentage (%)
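The classification and error computation above boil down to nearest-prototype labelling followed by a mismatch count. A sketch in Python/NumPy (the example prototypes and deliberately mislabelled third sample are assumptions for illustration):

```python
import numpy as np

def classify(protos, proto_labels, X):
    """Label each sample with the class of its nearest prototype,
    mirroring the som_autolabel-based classification step."""
    d2 = ((X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return [proto_labels[i] for i in d2.argmin(axis=1)]

def error_percentage(pred, true):
    ok = np.array([p == t for p, t in zip(pred, true)])
    return 100 * (1 - ok.mean())   # same formula as the MATLAB line

protos = np.array([[0.0, 0.0], [5.0, 5.0]])
plabels = ['Setosa', 'Virginica']
X = np.array([[0.2, 0.1], [4.8, 5.1], [0.3, 0.4], [5.2, 4.9]])
true = ['Setosa', 'Virginica', 'Virginica', 'Virginica']  # 3rd is mislabelled
pred = classify(protos, plabels, X)
print(error_percentage(pred, true))  # -> 25.0
```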
echo off