Aiding Neural Network Based Image Classification with
Fuzzy-Rough Feature Selection
Changjing Shang and Qiang Shen
Abstract—This paper presents a methodological approach
to developing image classifiers that exploits the technical potential
of both fuzzy-rough feature selection and neural network-based
classification. The use of fuzzy-rough feature selection allows the
induction of low-dimensionality feature sets from sample descriptions
of real-valued feature patterns of a (typically much) higher
dimensionality. A neural network trained on the induced subset of
features then delivers the runtime classification performance. Reducing
the feature set lowers the sensitivity of such a neural network-based
classifier to its structural complexity, and minimises the impact of
feature measurement noise on the classification accuracy. This work
is evaluated by applying the approach to classifying real medical cell
images, supported by comparative studies.
I. INTRODUCTION
Image classifiers implemented with a neural network have
enjoyed much success in many application domains. How-
ever, complex application problems such as real-life medical
image modelling and analysis have emphasised the issues of
feature set dimensionality reduction and feature semantics
preservation. In particular, to capture the essential characteristics
of a real image, many features may have to be extracted
without explicit a priori knowledge of what properties might best
represent the original image. Yet, generating more features
increases computational complexity and, at the same time, not
all such features may be essential to perform classification.
Due to measurement noise, the use of extra features may even
reduce the overall representational power of the feature set and
hence the classification accuracy. Thus, it is desirable to employ
a method that can determine the most significant features, based
on sample measurements, in order to simplify a neural
network-based classifier.
The above observation reflects a need common to many
real-world classification problems. For example, comparing
normal and abnormal blood vessel structures plays an impor-
tant role in pathology and medicine [13]. Recent development
of nuclear stains and Laser Scanning Confocal Microscopy
(LSCM) has allowed the study of the structure of blood
vessels at the cellular or sub-cellular level. Central to the
classification of cell images is the capture and analysis of
their underlying features. Many feature extraction methods
are available to yield various kinds of characteristic de-
scription of a given image. However, little knowledge is
available as to what features may be most useful to provide
Changjing Shang and Qiang Shen are with the Department of Computer
Science, Aberystwyth University, SY23 3DB, Wales, UK (email: {cns,
qqs}@aber.ac.uk).
the discrimination power between normal and abnormal cells
and between cells of different types.
Computationally, it is impractical to generate many fea-
tures and then to perform classification based on these
features for rapid diagnosis. A common practice is therefore
to generate a good number of features and select from them
the most informative ones off-line, and then to use those
selected only for classification on-line. For such medical
applications, the features produced ought to have an em-
bedded meaning and such meaning should not be altered
during the selection process. This makes it difficult to utilise
conventional dimensionality reduction techniques such as
Principal Components Analysis (PCA) [3]. This is because
PCA irreversibly destroys the underlying semantics of the
original feature set.
This paper presents an alternative approach to aid building
neural network-based classifiers by exploiting the potential
of fuzzy-rough sets [6], [15] for semantics-preserving feature
selection. The employment of a fuzzy-rough feature selec-
tion mechanism allows the induction of low-dimensionality
feature sets from sample descriptions of feature patterns of a
(typically much) higher dimensionality. Although crisp rough
sets [11] might be adopted for the same purpose [14], they
cannot work on real-valued image features unless further
preprocessing mechanisms such as data discretisation are used.
This would require boolean partitions over the domains of
the underlying features extracted from the original images.
Unfortunately, for medical diagnoses, this requirement is
generally very difficult to satisfy. Use of fuzzy-rough sets
considerably reduces such difficulties.
The rest of this paper is organised as follows. Section II in-
troduces the medical image classification problem considered
herein. This, from the viewpoint of real-world application,
justifies the need for the present research and sets up the
background for the experimental investigations to be reported
later. Section III describes the key techniques used in the
work, including feature extraction and fuzzy-rough feature
selection. For completeness, it also briefly outlines the struc-
ture and learning process of multi-layer feedforward neural
network-based classifiers in the present context. Section IV
shows the results of applying this work to the given medical
application, supported by comparative studies. The paper is
concluded in Section V with further work pointed out.
II. CELL IMAGES AND THEIR CLASSIFICATION
The samples of subcutaneous blood vessels used in this
research were taken from patients suffering from critical limb
ischaemia, immediately after leg amputation. The level of
978-1-4244-1819-0/08/$25.00 © 2008 IEEE
amputation was always taken to be in a non-ischaemic
area. The vessel segments obtained from this area represent
internal proximal (normal) arteries, whilst the distal portion
of the limb shows ischaemic (abnormal) ones.
Images were collected using an inverted (Nikon Diaphot)
microscope fitted with a Noran Odyssey LSCM and a ×40
objective [13]. Serial optical slices were taken along the z
axis (1 µm apart), starting with the LSCM focussing on the
top of a blood vessel in the x-y plane, and moving down from
the layer of adventitial cells, through the layer of smooth
muscle cells, to the last layer of endothelial cells. Nine of
these stacks were captured in different regions along the
vessel length from different tissue samples. The resulting
image database consists of 318 section images, each sized
512 × 512 with grey levels ranging from 0 to 255.
Among these images, 154 were obtained from 4 proximal,
non-ischaemic vessels and the rest from 5 distal, ischaemic
vessels.
Examples of the three types of cell image taken from
non-ischaemic resistance arteries are shown in Fig. 1. Their
counterparts taken from ischaemic resistance arteries are
shown in Fig. 2. Note that many such images for a
given problem case may appear rather similar to the eye,
making visual inspection and classification difficult. Building
an image classifier to automatically classify such images forms
the ultimate task of the present work.
III. TECHNIQUES EMPLOYED
A. Feature extraction with fractal models
To capture and represent many possibly essential charac-
teristics of a given image, fractal models [1], [7] are used
here for feature extraction. Of course, this does not affect
the underlying approach taken in this paper, as any other
feature extraction techniques may be equally applicable.
Fractal models are typically used to characterise the roughness
of an image surface at various scales; the fractal dimension of
such a surface is generally greater than its topological
(intuitive) dimension. Different
definitions and their associated computational algorithms
exist for determining the fractal dimensions (FDs). Within
this work, FDs are computed via the estimation of the
variograms of an image surface. A brief overview of this
approach is given below.
Without losing generality, an image Y = {y(s)} is here
assumed to be a Gaussian random field defined on an M × M
lattice Ω, where y(s) denotes the grey level of the pixel at
location s = (i, j), i, j = 0, 1, ..., M − 1. Given an image Y,
its fractal dimension D approximately satisfies the following:

    v(d) = c d^{6-2D} = c d^{a}    (1)

where a is termed the fractal index, c is a constant, and

    v(d) = E\{[y(s + d) - y(s)]^2\}    (2)

which is the variogram of the image, with d denoting the
distance between pairs of observations concerned.
(1) Adventitial
(2) Smooth muscle
(3) Endothelial
Fig. 1. Section cell images of proximal non-ischaemic subcutaneous blood
vessels, taken from a human lower limb.
Applying the least squares fitting algorithm [7] to model
(1), an estimate of the fractal index â (0 ≤ â ≤ 2) can be
obtained. This leads to an estimate of the fractal dimension
of Y such that

    \hat{D} = 3 - 0.5\hat{a}    (3)

The estimated FD has a strong intuitive appeal: if the surface
is very smooth, then the fractal dimension is two; if, however,
the surface is extremely rough and irregular, then the fractal
dimension approaches the limit of three.
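For concreteness, the variogram-based FD estimation of equations (1)-(3) can be sketched as follows. This is an illustrative reading of the procedure, not the authors' exact implementation: the pooling of horizontal and vertical pixel pairs, the distance range `max_dist` and the function name are assumptions.

```python
import numpy as np

def fractal_dimension(img, max_dist=8):
    """Estimate the fractal dimension of a grey-level image surface by
    least-squares fitting of the isotropic variogram model v(d) = c*d^a,
    from which D = 3 - 0.5*a (equation (3)). Illustrative sketch only."""
    img = img.astype(float)
    dists, varios = [], []
    for d in range(1, max_dist + 1):
        # Variogram v(d): mean squared grey-level difference of pixel
        # pairs at distance d; horizontal and vertical pairs are pooled
        # here as a simple isotropic approximation.
        diffs = np.concatenate([
            (img[:, d:] - img[:, :-d]).ravel(),
            (img[d:, :] - img[:-d, :]).ravel(),
        ])
        dists.append(d)
        varios.append(np.mean(diffs ** 2))
    # log v(d) = log c + a log d: fit a straight line to estimate a.
    a, _ = np.polyfit(np.log(dists), np.log(varios), 1)
    a = min(max(a, 0.0), 2.0)  # fractal index constrained to [0, 2]
    return 3.0 - 0.5 * a       # equation (3)
```

On a perfectly smooth ramp the estimate approaches two (a plane), while on white noise it approaches the limit of three, matching the intuition stated above.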
Note that in the above, the variogram of an image and
hence its FD are both estimated at a fixed image resolution
level. This is done without specifying any spatial direction
along which the set of pairs of observations is constructed.
That is, the image is assumed to be isotropic. By varying the
(1) Adventitial
(2) Smooth muscle
(3) Endothelial
Fig. 2. Section cell images of distal ischaemic subcutaneous blood vessels,
taken from a human lower limb.
resolution level [7] of the image, a set of isotropic fractal
features can therefore be generated.
By imposing a constraint over the direction along which
observations are obtained, a different variogram and fractal
dimension can be estimated over any fixed resolution level.
Such resulting fractal dimensions are termed directional
fractals (DFs), as opposed to the conventional isotropic FDs
that are measured over all possible directions. Obviously,
specifying N different directions leads to N different DFs,
assuming that the images under consideration are all aligned
with respect to a common coordinate origin.
In addition to FDs, in order to capture other potentially
significant information embedded in an image, conventional
statistical measures such as the mean and standard deviation
(STD) can also be utilised. In so doing, a given image
is represented by a feature pattern consisting of a certain
number of multi-resolution and directional fractals and of
simple statistical measures. Which of these features are
indeed essential to perform classification is, of course, another
matter. It is the determination of the most informative
features that forms the starting point of this research.
B. Fuzzy-rough sets and feature selection
Fuzzy-rough feature selection [6], [15] is concerned with
the reduction of information or decision systems through the
use of fuzzy-rough sets. Let I = (U, A) be an information
system, where U is a non-empty set of finite objects (the
universe of discourse) and A is a non-empty finite set of
attributes such that a : U → V_a for every a ∈ A, with V_a
being the set of values that attribute a may take. For decision
systems, A = {C ∪ D}, where C is the set of conditional
features and D is the set of decision values. Based on these
notions, the basic concepts of fuzzy-rough feature selection
most relevant to the present work are outlined below:
1) Fuzzy equivalence classes: Fuzzy equivalence classes
[4], [10], [15] are central to the fuzzy-rough set approach
in the same way that crisp equivalence classes are central
to classical rough sets. For decision problems, this means
that the decision values and the conditional values may all
be fuzzy. The concept of crisp equivalence classes can be
extended by the inclusion of a fuzzy similarity relation S
on the universe, which determines the extent to which two
elements are similar in S. The following properties hold as
usual:
• Reflexivity: μ_S(x, x) = 1
• Symmetry: μ_S(x, y) = μ_S(y, x)
• Transitivity: μ_S(x, z) ≥ μ_S(x, y) ∧ μ_S(y, z)
Using the fuzzy similarity relation, the fuzzy equivalence
class [x]_S for objects close to x can be defined:

    \mu_{[x]_S}(y) = \mu_S(x, y)    (4)
Obviously, this definition degenerates to the normal definition
of equivalence classes when S is crisp. Note that the family
of normal fuzzy sets produced by a fuzzy partitioning of the
universe of discourse can play the role of fuzzy equivalence
classes [4].
2) Fuzzy lower and upper approximations: These are
fuzzy extensions of their crisp counterparts. Informally, in
crisp rough set theory, the lower approximation of a set con-
tains those objects that belong to it with certainty. The upper
approximation of a set contains the objects that possibly
belong to it.
Formally, given a subset P of features, the fuzzy P-lower
and P-upper approximations are defined as:

    \mu_{\underline{P}X}(x) = \sup_{F \in U/P} \min(\mu_F(x), \inf_{y \in U} \max\{1 - \mu_F(y), \mu_X(y)\})    (5)

    \mu_{\overline{P}X}(x) = \sup_{F \in U/P} \min(\mu_F(x), \sup_{y \in U} \min\{\mu_F(y), \mu_X(y)\})    (6)

where U/P stands for the partition of the universe of
discourse U with respect to P, and F denotes a fuzzy
equivalence class belonging to U/P. Note that although the
universe of discourse in feature reduction is finite, this is not
the case in general, hence the use of sup and inf above.
Incidentally, it is the tuple \langle \underline{P}X, \overline{P}X \rangle that is called a
fuzzy-rough set.
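To make equations (5) and (6) concrete, the following sketch computes fuzzy lower and upper approximation memberships over a toy universe of four objects; the membership values for the equivalence classes in U/P and for the concept X are made-up numbers for illustration.

```python
import numpy as np

# Rows: fuzzy equivalence classes F in U/P; columns: objects of U.
# All membership values below are illustrative, not from the paper.
U_P = np.array([[1.0, 0.8, 0.1, 0.0],
                [0.0, 0.2, 0.9, 1.0]])
mu_X = np.array([1.0, 0.7, 0.3, 0.0])   # fuzzy concept X to approximate

def lower_approx(U_P, mu_X, x):
    """Membership of the P-lower approximation of X at object x, eq. (5)."""
    return max(min(mu_F[x], np.min(np.maximum(1.0 - mu_F, mu_X)))
               for mu_F in U_P)

def upper_approx(U_P, mu_X, x):
    """Membership of the P-upper approximation of X at object x, eq. (6)."""
    return max(min(mu_F[x], np.max(np.minimum(mu_F, mu_X)))
               for mu_F in U_P)
```

As expected, the lower approximation never exceeds the upper one: for object 0 above, the two values are 0.7 and 1.0 respectively.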
3) Partition of the Universe of Discourse: For an individual
feature a ∈ A, the partition of the universe by {a} is
defined by

    U/\mathrm{IND}(\{a\}) = \{\{x \mid a(x) = \alpha, x \in U\} \mid \alpha \in V_a\}    (7)

Clearly, this is the collection of fuzzy equivalence classes for
the feature a itself.
Of course, for feature selection purposes, it is necessary to
find the dependency between various subsets of the original
feature set. For instance, it may be necessary to be able to
determine the degree of dependency of the decision feature(s)
with respect to feature set P = {a, b}, a, b ∈ A. In the crisp
case, U/P contains sets of objects grouped together that are
indiscernible according to both features a and b. In the fuzzy
case, objects may belong to many equivalence classes, so the
cartesian product of U/IND({a}) and U/IND({b}) must
be considered in determining U/P. In general,
    U/P = \otimes\{a \in P : U/\mathrm{IND}(\{a\})\}    (8)
For example, if P = {a, b}, U/IND({a}) = {N_a, Z_a} and
U/IND({b}) = {N_b, Z_b}, then

    U/P = \{N_a \cap N_b,\ N_a \cap Z_b,\ Z_a \cap N_b,\ Z_a \cap Z_b\}
In so doing, each set in U/P denotes an equivalence class.
The extent to which an object belongs to such an equivalence
class is therefore calculated by the conjunction of its
constituent fuzzy equivalence classes, say F_i, i = 1, 2, ..., n:

    \mu_{F_1 \cap \ldots \cap F_n}(x) = \min(\mu_{F_1}(x), \mu_{F_2}(x), \ldots, \mu_{F_n}(x))    (9)
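The product construction of equations (8) and (9) can be illustrated with two hypothetical fuzzy partitions over a three-object universe; the membership values are invented for the example.

```python
import numpy as np
from itertools import product

# Per-feature fuzzy partitions, echoing the example above:
# U/IND({a}) = {N_a, Z_a} and U/IND({b}) = {N_b, Z_b}.
# The membership values are illustrative placeholders.
U_a = {'N_a': np.array([0.9, 0.3, 0.0]),
       'Z_a': np.array([0.1, 0.7, 1.0])}
U_b = {'N_b': np.array([0.8, 0.1, 0.4]),
       'Z_b': np.array([0.2, 0.9, 0.6])}

# U/P as the cartesian product of the two partitions (equation (8)),
# with memberships given by the min conjunction of equation (9).
U_P = {f'{na} ∩ {nb}': np.minimum(U_a[na], U_b[nb])
       for na, nb in product(U_a, U_b)}
```

This yields the four classes N_a ∩ N_b, N_a ∩ Z_b, Z_a ∩ N_b and Z_a ∩ Z_b listed above, each with per-object memberships.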
4) Fuzzy-rough feature dependency: The present research
builds on the notion of fuzzy lower approximation to en-
able reduction of datasets containing real-valued features.
Proposed as an extension of crisp rough feature selection, its
working is expected to become identical to the crisp approach
when dealing with discrete-valued features.
Thus, by the extension principle, the membership of an
object x ∈ U in the fuzzy positive region can be defined as
the union of the lower approximations:

    \mu_{POS_P(Q)}(x) = \sup_{X \in U/Q} \mu_{\underline{P}X}(x)    (10)
Object x will not belong to the positive region only if the
equivalence class it belongs to is not a constituent of the
positive region. This is equivalent to the crisp version where
objects belong to the positive region only if their underlying
equivalence class does so.
Using the definition of the fuzzy positive region, a useful
dependency function between a set of features Q and another
set P can be introduced, defined by:

    \gamma'_P(Q) = \frac{|\mu_{POS_P(Q)}(x)|}{|U|} = \frac{\sum_{x \in U} \mu_{POS_P(Q)}(x)}{|U|}    (11)
As with crisp rough sets, the dependency of Q on P
is the proportion of objects that are discernible out of the
entire dataset. In the present approach, this corresponds to
determining the fuzzy cardinality of \mu_{POS_P(Q)}(x) divided
by the total number of objects in the universe.
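Equations (10) and (11) can be sketched as follows, reusing the lower-approximation form of equation (5). The membership matrices are illustrative placeholders, with rows for equivalence (or decision) classes and columns for objects.

```python
import numpy as np

def positive_region(U_P, U_Q):
    """mu_POS_P(Q)(x): supremum over decision classes X in U/Q of the
    P-lower approximation membership of x, per equation (10). Matrices
    hold class memberships (rows: classes, columns: objects); this
    layout is an assumption made for the sketch."""
    n = U_P.shape[1]
    pos = np.zeros(n)
    for mu_X in U_Q:                       # each decision class X in U/Q
        for x in range(n):
            lower = max(min(mu_F[x], np.min(np.maximum(1.0 - mu_F, mu_X)))
                        for mu_F in U_P)   # equation (5)
            pos[x] = max(pos[x], lower)
    return pos

def dependency(U_P, U_Q):
    """gamma'_P(Q) = sum_x mu_POS(x) / |U|, per equation (11)."""
    pos = positive_region(U_P, U_Q)
    return pos.sum() / len(pos)
```

With crisp partitions the measure behaves exactly like the crisp rough-set dependency: if P induces the same partition as Q the dependency is 1, while a P that discerns only some of the objects yields a fractional value.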
5) Fuzzy-rough QUICKREDUCT algorithm: The fuzzy-rough
feature selection algorithm, named fuzzy-rough
QUICKREDUCT, is derived on the basis of the above fuzzy-rough
dependency measure [15]. It borrows ideas from
the crisp version of QUICKREDUCT originally proposed in
[2] to direct the search for a quality subset of features. The
algorithm is given in Fig. 3. Fundamentally, it employs the
fuzzy-rough dependency function γ′ to choose which features
to add to the current subset of features. The algorithm
terminates when the addition of any remaining feature does
not increase the dependency.
FRQUICKREDUCT(C, D)
C, the set of all conditional features;
D, the set of decision features.
(1)  R ← {}; γ′_best ← 0; γ′_prev ← 0
(2)  do
(3)      T ← R
(4)      γ′_prev ← γ′_best
(5)      ∀x ∈ (C − R)
(6)          if γ′_{R∪{x}}(D) > γ′_T(D)
(7)              T ← R ∪ {x}
(8)              γ′_best ← γ′_T(D)
(9)      R ← T
(10) until γ′_best == γ′_prev
(11) return R
Fig. 3. The fuzzy-rough QUICKREDUCT algorithm
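In Python, the greedy loop of Fig. 3 can be sketched as below. The dependency function gamma is assumed to be supplied from outside (closing over the dataset and decision features, e.g. implementing equation (11)); the signature and names are illustrative.

```python
def fr_quickreduct(C, gamma):
    """Fuzzy-rough QUICKREDUCT (Fig. 3): greedy forward selection.
    C is the set of conditional features; gamma maps a feature subset
    to its fuzzy-rough dependency degree (an assumed callable)."""
    R = set()                          # (1) R <- {}
    best = prev = 0.0
    while True:                        # (2) do
        T, prev = set(R), best         # (3)-(4)
        for x in C - R:                # (5) each remaining feature
            g = gamma(R | {x})
            if g > best:               # (6) strict improvement
                T, best = R | {x}, g   # (7)-(8)
        R = T                          # (9)
        if best == prev:               # (10) no feature helps: stop
            return R                   # (11)
```

For example, with a toy monotone dependency measure that saturates once features 1, 3 and 9 are all present, the loop adds exactly those three features and then terminates.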
As with the original algorithm, for a dimensionality of n,
the worst case dataset will result in (n² + n)/2 evaluations
of the dependency function. However, fuzzy-rough set-based
feature selection is used off-line for dimensionality reduc-
tion prior to any involvement of an on-line system (e.g. a
classifier) which will employ those features belonging to the
resultant feature subset. Thus, this operation has no negative
impact upon the run-time efficiency of the system.
C. Multilayer feedforward neural network for classification
Each of the classifiers implemented herein consists of
a feature extractor (see Section III-A) and a multilayer
2008 IEEE International Conference onFuzzy Systems (FUZZ 2008) 979
feedforward neural network (MFNN) based classifier, with
these two sub-systems connected in series.
It is well-known that an MFNN accomplishes classification
by mapping input feature patterns onto their underlying
image classes. The design of each MFNN classifier is thus
straightforward: The number of nodes in its input layer is set
to that of the dimensionality of a given feature set produced
by the feature extractor, and the number of nodes within its
output layer is set to the number of underlying classes of
interest. The internal structure of the network is designed to
be flexible and may contain one or two hidden layers. (The
actual number of hidden layers, and the number of nodes in
each, may be determined by experimental simulation given a
fixed number of input features.)
The training of an MFNN-based classifier is essential
to its runtime performance (done here by using the back-
propagation algorithm [12]). For this, feature patterns that
represent different images, coupled with their respective
underlying image class (i.e. cell type) indices, are selected
as the training data, with the input features being normalised
into the range of 0 to 1.
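As a minimal sketch of such an MFNN trained by backpropagation, a generic one-hidden-layer network with sigmoid units and batch gradient descent might look as follows. The learning rate, hidden-layer size, loss and initialisation are illustrative choices, not the configuration used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(42)

def train_mfnn(X, Y, hidden=10, lr=0.5, epochs=5000):
    """Train a one-hidden-layer feedforward network by backpropagation
    on inputs X (normalised to [0, 1]) and one-hot targets Y. Returns a
    prediction function. Generic sketch with squared-error loss."""
    n_in, n_out = X.shape[1], Y.shape[1]
    W1 = rng.normal(0.0, 0.5, (n_in, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, n_out)); b2 = np.zeros(n_out)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        H = sig(X @ W1 + b1)              # hidden activations
        O = sig(H @ W2 + b2)              # output activations
        dO = (O - Y) * O * (1.0 - O)      # output-layer delta
        dH = (dO @ W2.T) * H * (1.0 - H)  # back-propagated hidden delta
        W2 -= lr * (H.T @ dO); b2 -= lr * dO.sum(axis=0)
        W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(axis=0)
    return lambda Xn: sig(sig(Xn @ W1 + b1) @ W2 + b2)
```

The classifiers in this work map each selected feature pattern to one of six class outputs; the sketch uses the same input-hidden-output layout with the node counts left as parameters.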
In training an MFNN classifier, the feature extractor em-
ployed has the same functionality as its counterpart to be
used in the resulting classifier. However, it generates more
features at this stage (perhaps, many more), not knowing
which features are more informative to use. The extracted
features are passed through a subsystem that implements
fuzzy-rough feature selection, removing redundant and less
informative features. When applying such a trained classifier,
of course, only those features selected during the training
phase need to be extracted.
IV. EXPERIMENTAL RESULTS
A. Experimental background
The image database used is the one summarised in Section
II. Eighty-five images are used for training and the remaining
233 images are employed for testing.
During the training phase, for each image, five isotropic
features are created, each at one of the following resolutions:
9 (= log₂ 512), 8, 7, 6 and 5. That is, these isotropic
features are created at the five finest resolutions. To
measure the directional fractals, the following four directions
are used: horizontal (0°), first diagonal (45°), vertical (90°)
and second diagonal (135°). In addition, in an attempt to
capture basic statistical information, the readily available
mean and standard deviation (STD) are also utilised.
In so doing, a given image is represented by patterns of
11 features. For easy cross-referencing, Table I lists all the
features and their reference numbers.
Different MFNN classifiers were built to accomplish clas-
sification by mapping feature patterns of a different di-
mensionality onto their underlying cell types, with explicit
indications of whether they are normal or abnormal. There
are a total of six output classes for the present problem case,
representing adventitial, smooth muscle and endothelial cell
Ref. No. | Feature Meaning        | Ref. No. | Feature Meaning
1        | 0° direction           | 7        | 3rd finest resolution
2        | 45° direction          | 8        | 4th finest resolution
3        | 90° direction          | 9        | 5th finest resolution
4        | 135° direction         | 10       | Mean
5        | Finest resolution      | 11       | STD
6        | 2nd finest resolution  |          |
TABLE I
FEATURES AND THEIR REFERENCE NUMBERS.
types of normal tissues, and the same three types of abnormal
ones. To limit the simulation cost, only networks with one
hidden layer were considered. The number of hidden nodes
was determined by systematically varying it during training.
The structure of the best trained network, which resulted
in the least classification error over the training dataset with
respect to a predefined number of iterations, was then chosen
for use in testing.
B. Comparison with the use of unreduced features
It is important to show that, at least, the use of features
selected does not significantly reduce the classification ac-
curacy as compared to the use of the full set of original
features. For this problem, the fuzzy-rough feature selection
algorithm returns five features, namely, the 0° DF, the 90° DF,
the 5th finest resolution, the mean and the STD (i.e. features
1, 3, 9, 10 and 11), out of the original eleven. Table II lists
the classification error rates produced by the best trained
MFNNs.
MFNN     | Dim. | Features                | Structure    | Error
Reduced  | 5    | 1,3,9,10,11             | 5×10 + 10×6  | 7.55%
Original | 11   | 1,2,3,4,5,6,7,8,9,10,11 | 11×24 + 24×6 | 9.44%
TABLE II
FUZZY-ROUGH-SELECTED VS. ORIGINAL FULL SET OF FEATURES.
It is very interesting to note that the error rate of using the
five selected features is actually lower than that of using the
full feature set. Further, this improvement of performance is
obtained by a structurally much simpler network of 10 hidden
nodes, as opposed to the classifier that requires 24 hidden
nodes to achieve the optimal learning. This is indicative
of the power of fuzzy-rough feature selection in helping
reduce not only redundant feature measures but also the noise
associated with such measurement, reflecting the usefulness
of the present work.
C. Comparison with the use of randomly selected features
The above comparison ensured that no information loss is
incurred due to fuzzy-rough feature reduction. Actually, the
selection process helps to remove measurement noise as a
positive by-product. The question now is whether other
feature sets of dimensionality 5 would perform similarly to
the one identified via fuzzy-rough selection. To avoid a biased
answer, without resorting to exhaustive computation,
[Fig. 4 is a bar chart: for each of 30 randomly selected five-feature
subsets (e.g. {1,2,4,5,7}), the classification error rate (%) is plotted,
together with the error rate of the classifier using the fuzzy-rough (FR)
selected features and the average error of the random selections.]
Fig. 4. Fuzzy-rough vs. randomly selected features.
30 sets of five features randomly chosen were used to see
what classification results might be achieved.
Figure 4 shows the error rates of the corresponding 30
classifiers, along with the error rate of the classifier that uses
fuzzy-rough (FR) selected features. The average error of the
classifiers that each employ five randomly selected features
is 19.1%, far higher than that attained by the classifier which
utilises the FR-selected features of the same dimensionality.
This implies that random selection entails important
information loss in the course of feature reduction; this is
not the case for the fuzzy-rough selection-based approach.
D. Comparison with the use of PCA-selected features
This study aimed at examining the performance of using
different dimensionality reduction techniques. In particular,
classifiers that are aided with fuzzy-rough feature selection
are systematically compared to those supported by the use
of PCA. The results are summarised in Table III. In this
table, for the results of using PCA, feature number i, i ∈
{1, 2, ..., 11}, stands for the ith principal component, i.e. the
transformed feature corresponding to the ith largest
variance.
MFNN | Dim. | Features                | Structure    | Error
FR   | 5    | 1,3,9,10,11             | 5×10 + 10×6  | 7.7%
PCA  | 1    | 1                       | 1×12 + 12×6  | 57.1%
     | 2    | 1,2                     | 2×12 + 12×6  | 32.2%
     | 3    | 1,2,3                   | 3×12 + 12×6  | 31.3%
     | 4    | 1,2,3,4                 | 4×24 + 24×6  | 28.8%
     | 5    | 1,2,3,4,5               | 5×20 + 20×6  | 18.9%
     | 6    | 1,2,3,4,5,6             | 6×18 + 18×6  | 15.4%
     | 7    | 1,2,3,4,5,6,7           | 7×24 + 24×6  | 11.6%
     | 8    | 1,2,3,4,5,6,7,8         | 8×24 + 24×6  | 13.7%
     | 9    | 1,2,3,4,5,6,7,8,9       | 9×12 + 12×6  | 9.9%
     | 10   | 1,2,3,4,5,6,7,8,9,10    | 10×20 + 20×6 | 7.3%
     | 11   | 1,2,3,4,5,6,7,8,9,10,11 | 11×8 + 8×6   | 7.3%
TABLE III
FUZZY-ROUGH VS. PCA-RETURNED FEATURES.
These results show that, at the same dimensionality (i.e.
5), the classifier using the features selected by the fuzzy-rough
mechanism has a substantially higher classification
accuracy and, moreover, achieves this via a considerably
simpler network. Further, it is worth recalling that PCA
alters the underlying semantics of the features during its
transformation process. That is, the features marked with
1, 2, ..., 11 in Table III are not the original 11 features, but
their linear combinations.
If more principal features are employed, the error rate is
generally reduced. However, compared to the classifier
that uses FR-selected features, an MFNN using PCA-selected
features still generally underperforms until almost the full
set of principal features is used. Yet, the overall structural
complexity of all such classifiers is greater than that
of the fuzzy-rough based classifier: the best of them involves
11×8 + 8×6 = 136 weights, as compared to 5×10 +
10×6 = 110. Additionally, classifiers based on PCA-returned
features would require many more feature measurements to
achieve comparable classification results.
E. Comparison with the use of crisp rough-selected features
It is interesting to note that the results of applying fuzzy-
rough feature selection to aid the MFNN-based classification
appear to be very similar to those of using crisp rough set-based
selection [2]. In fact, only five features were chosen
when the crisp rough set-based method was used at its best
[14]. In particular, four of the five features were the same as
those chosen by fuzzy-rough selection, namely features 1, 9,
10 and 11, with the only difference being feature number 4
(instead of the present feature number 3).
However, the crisp approach requires an additional, and
rather subjectively defined, discretisation mechanism to
convert real-valued image features into discrete nominal
values prior to feature selection. Different discretisation
schemes may lead to a rather different choice of feature
subsets, often one with a higher dimensionality (rather
than 5). As opposed to this, fuzzy-rough feature selection
is directly applied to the real-valued features, with fuzzy
equivalence classes being automatically computed from the
feature values. In addition, the result that the same number
of features was obtained using the crisp rough set-based
approach may also have been affected by the characteristics
of the cell-type classification problem itself, because the
dimensionality of the original feature patterns is not very
large.
For a more scaled-up application, with the increase of the
dimensionality of the original feature patterns and the use of
different feature extraction mechanisms, subjective discreti-
sation may become much harder to optimise. This will then
lead to the loss of important information, thereby affecting
the selection of the smallest subset of quality features and
hence the subsequent complexity of the MFNN structure and
their classification accuracy. A more meaningful comparison
between these two approaches, however, remains an active
research topic.
V. CONCLUSIONS
This paper has presented an approach which supports
the potentially powerful neural network classification sys-
tems with a fuzzy-rough set-based feature reduction method.
Unlike transformation-based dimensionality reduction tech-
niques, this approach retains the underlying semantics of the
selected feature subset. This is very important to help ensure
that the classification results are understandable by the user.
Following this approach, the conventional multi-layer feed-
forward networks, which are sensitive to the dimensionality
of feature patterns, can be expected to become effective on
classification of images whose pattern representation may
otherwise involve a large number of features.
The work has been applied to the real problem of normal
and abnormal blood vessel classification involving different
cell types. Although the application problems encountered
are complex, the resulting selected features are manageable
and the classifier built upon such features generally out-
performs those using more features or an equal number of
features obtained by conventional approaches represented by
PCA. Experimental results have clearly demonstrated this.
Note that comparisons between the use of fuzzy-rough
selected features against that of those obtained by PCA
form a focus of this paper. This is mainly due to the
observation that PCA is a representative approach commonly
taken to perform dimensionality reduction. However, there
exist many alternative methods for dimensionality reduction
(e.g., [16], [17]) which may also outperform PCA for the
present application. Further comparisons to such alternative
techniques will therefore help reveal more details about the
strengths and limitations of the present approach. Work is
ongoing in this direction.
Finally, it is worth indicating that, although the present
research on fuzzy-rough feature selection is incorporated
with neural network-based classifiers, it can be extended to
work with other types of intelligent classification system,
such as classical decision trees [9] and fuzzy classifiers [5],
[8]. This forms a piece of very interesting further research.
ACKNOWLEDGEMENT
The authors are grateful to Dr Richard Jensen for his
contribution, while taking full responsibility for the views
expressed in this paper.
REFERENCES
[1] S. Chen, J. Keller and J. Crownover. On the calculation of fractal
features from images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 15 (1993) 1087–1090.
[2] A. Chouchoulas and Q. Shen. Rough set-aided keyword reduction for
text categorisation. Applied Artificial Intelligence, 15 (9) (2001)
843–873.
[3] P. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach.
Prentice Hall, 1982.
[4] D. Dubois and H. Prade. Putting rough sets and fuzzy sets together.
In: R. Slowinski (Ed.), Intelligent Decision Support. Kluwer Academic
Publishers, (1992) 203–232.
[5] C. Janikow. Fuzzy decision trees: issues and methods. IEEE Transactions
on Systems, Man, and Cybernetics, Part B: Cybernetics, 28 (1)
(1998) 1–14.
[6] R. Jensen and Q. Shen. New approaches to fuzzy-rough feature selection.
To appear in: IEEE Transactions on Fuzzy Systems.
[7] L. Kaplan. Extended fractal analysis for texture classification and
segmentation. IEEE Transactions on Image Processing, 8 (1999)
1572–1585.
[8] J. Marin-Blazquez and Q. Shen. From approximative to descriptive
fuzzy classifiers. IEEE Transactions on Fuzzy Systems, 10 (4) (2002)
484–497.
[9] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[10] S.K. Pal and A. Skowron (Eds.). Rough-Fuzzy Hybridization: A New
Trend in Decision Making. Springer Verlag, Singapore, 1999.
[11] Z. Pawlak. Rough Sets: Theoretical Aspects of Reasoning About Data.
Kluwer Academic Publishers, Dordrecht, 1991.
[12] D. Rumelhart, G. Hinton and R. Williams. Learning internal
representations by error propagation. In: D. Rumelhart and J. McClelland
(Eds.), Parallel Distributed Processing. MIT Press, 1986.
[13] C. Shang, J. McGrath, C. Daly and J. Barker. Modelling and
classification of vascular smooth muscle cell images. IEE Electronics
Letters, 36 (18) (2000) 1532–1533.
[14] C. Shang and Q. Shen. Rough feature selection for neural network
based image classification. International Journal of Image and
Graphics, 2 (4) (2002) 541–555.
[15] Q. Shen and R. Jensen. Selecting informative features with fuzzy-rough
sets and its application for complex systems monitoring.
Pattern Recognition, 37 (7) (2004) 1351–1363.
[16] J. Tenenbaum, V. de Silva and J. Langford. A global geometric
framework for nonlinear dimensionality reduction. Science, 290 (5500)
(2000) 2319–2323.
[17] F. Young and R. Hamer. Theory and Applications of Multidimensional
Scaling. Erlbaum Associates, Hillsdale, 1994.