Interval-valued Fuzzy-Rough Feature Selection and Application for Handling Missing Values in Datasets

Richard Jensen and Qiang Shen
Dept. of Computer Science, Aberystwyth University, Ceredigion, Wales, UK
{rkj,qqs}@aber.ac.uk

Abstract

One of the many successful applications of rough set theory has been to the area of feature selection. The rough set ideology of using only the supplied data and no other information has many benefits, whereas most other methods require supplementary knowledge. Fuzzy-rough set theory has recently been proposed as an extension of this, in order to better handle the uncertainty present in real data. However, following this approach, there has been no investigation (theoretical or otherwise) into how to deal with missing values effectively, another problem encountered when using real-world data. This paper proposes an extension of the fuzzy-rough feature selection methodology, based on interval-valued fuzzy sets, as a means to counter this problem via the representation of missing values in an intuitive way.

1 Introduction

Lately there has been great interest in developing computational intelligence methodologies capable of dealing with imprecision and uncertainty, and the substantial amount of research currently being done in areas related to fuzzy and rough sets [12] is representative of this. The success of rough set theory is due in part to three aspects of the theory. Firstly, only the facts hidden in data are analysed. Secondly, no additional information about the data, such as thresholds or expert knowledge of a particular domain, is required for data analysis. Thirdly, it finds a minimal knowledge representation for the data. As rough set theory handles only one type of imperfection found in data, it is complementary to other concepts for this purpose, such as fuzzy set theory. The two fields may be considered analogous in the sense that both can tolerate inconsistency and uncertainty; the difference lies in the type of uncertainty and their approach to it: fuzzy sets are concerned with vagueness, rough sets with indiscernibility.

Many deep relationships have been established and, moreover, most recent studies have concluded that the two methodologies are complementary, especially in the context of granular computing [1]. It is therefore desirable to extend and hybridise the underlying concepts to deal with additional aspects of data imperfection. Such developments offer a high degree of flexibility and provide robust solutions and advanced tools for data analysis [9]. However, there has been no investigation into how such hybridisations may model and cope with missing values in datasets. This is a severely limiting factor for the application of these powerful techniques. In this paper, a further extension to fuzzy-rough set theory is proposed, interval-valued fuzzy-rough sets, in order to alleviate this problem. As a result, a new feature selection method is developed that not only handles missing values, but also alleviates the problem of defining overly-specific type-1 fuzzy similarity relations (i.e. those with crisp membership functions) through the use of interval-valued fuzzy sets.

The remainder of this paper is structured as follows. In Section 2, the fuzzy-rough hybridisation process is reviewed by briefly recalling its ingredients (fuzzy sets and rough sets) as well as its resulting end products (fuzzy-rough sets).
Section 3 focuses on the proposed approach, interval-valued feature selection. Initial experimental results that demonstrate the potential of the approach are presented in Section 4. Finally, Section 5 concludes the paper and outlines some ideas for future work.

2 Theoretical Background

2.1 Fuzzy Sets

Recall that a fuzzy set [14] in $U$ is a $U \to [0,1]$ mapping, while a fuzzy relation in $U$ is a fuzzy set in $U \times U$. For all $y \in U$, the $R$-foreset of $y$ is the fuzzy set $Ry$ defined by $Ry(x) = R(x,y)$ for all $x \in U$. If $R$ is reflexive and symmetric, i.e., $R(x,x) = 1$ and $R(x,y) = R(y,x)$ hold for all $x$ and $y$ in $U$, then $R$ is called a fuzzy tolerance relation.

Fuzzy logic connectives play an important role in the hybridisation process. A triangular norm (t-norm for short) $\mathcal{T}$ is any increasing, commutative and associative $[0,1]^2 \to [0,1]$ mapping satisfying $\mathcal{T}(1,x) = x$ for all $x \in [0,1]$. Common examples of t-norms include the minimum, the product, and $\mathcal{T}_L$ defined by $\mathcal{T}_L(x,y) = \max(0, x+y-1)$ for $x, y \in [0,1]$. An implicator is any $[0,1]^2 \to [0,1]$ mapping $\mathcal{I}$ that is decreasing in its first and increasing in its second component, and that satisfies $\mathcal{I}(0,0) = 1$ and $\mathcal{I}(1,x) = x$ for all $x \in [0,1]$.

2.1.1 Interval-valued Fuzzy Sets

An interval-valued fuzzy set $\tilde{A}$ in $U$ is a set of ordered triples of the form

$$\tilde{A} = \{\langle x, \mu_A^-(x), \mu_A^+(x)\rangle \mid x \in U\} \quad (1)$$

where $\mu_A^-(x), \mu_A^+(x) \in [0,1]$ are the lower and upper membership functions, which satisfy $0 \le \mu_A^-(x) \le \mu_A^+(x) \le 1$ for all $x \in U$. The lower and upper membership functions correspond to the lower and upper bounds of a closed interval describing the membership of $x$ in $\tilde{A}$.

2.2 Rough Sets

Let $I = (U, \mathbb{A})$ be an information system, where $U$ is a non-empty set of finite objects (the universe of discourse) and $\mathbb{A}$ is a non-empty finite set of attributes such that $a : U \to V_a$ for every $a \in \mathbb{A}$. $V_a$ is the set of values that attribute $a$ may take. With any $P \subseteq \mathbb{A}$ there is an associated equivalence relation $IND(P)$:

$$IND(P) = \{(x,y) \in U^2 \mid \forall a \in P,\ a(x) = a(y)\} \quad (2)$$

The partition of $U$ generated by $IND(P)$ is denoted $U/IND(P)$ (or $U/P$ for simplicity) and can be calculated as follows:

$$U/IND(P) = \bigotimes \{U/IND(\{a\}) \mid a \in P\} \quad (3)$$

where $\otimes$ is specifically defined as follows for sets $A$ and $B$:

$$A \otimes B = \{X \cap Y \mid X \in A,\ Y \in B,\ X \cap Y \neq \emptyset\} \quad (4)$$

If $(x,y) \in IND(P)$, then $x$ and $y$ are indiscernible by attributes from $P$. The equivalence classes of the $P$-indiscernibility relation are denoted $[x]_P$.

Let $X \subseteq U$. $X$ can be approximated using only the information contained within $P$ by constructing the $P$-lower and $P$-upper approximations of $X$:

$$\underline{P}X = \{x \in U \mid [x]_P \subseteq X\} \quad (5)$$
$$\overline{P}X = \{x \in U \mid [x]_P \cap X \neq \emptyset\} \quad (6)$$

The tuple $\langle \underline{P}X, \overline{P}X \rangle$ is called a rough set.

2.2.1 Feature Selection

Let $P$ and $Q$ be sets of attributes inducing equivalence relations over $U$; then the positive region can be defined as:

$$POS_P(Q) = \bigcup_{X \in U/Q} \underline{P}X \quad (7)$$

The positive region contains all objects of $U$ that can be classified to classes of $U/Q$ using the information in attributes $P$. Based on this definition, dependencies between attributes can be determined. For $P, Q \subseteq \mathbb{A}$, it is said that $Q$ depends on $P$ in a degree $k$ ($0 \le k \le 1$), denoted $P \Rightarrow_k Q$, if

$$k = \gamma_P(Q) = \frac{|POS_P(Q)|}{|U|} \quad (8)$$

If $k = 1$, $Q$ depends totally on $P$; if $0 < k < 1$, $Q$ depends partially (in a degree $k$) on $P$; and if $k = 0$, $Q$ does not depend on $P$.

The reduction of attributes is achieved by comparing equivalence relations generated by sets of attributes. Attributes are removed so that the reduced set provides the same predictive capability of the decision attribute as the original.
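To ground these definitions, here is a minimal sketch (not part of the original paper) of how the positive region (7) and dependency degree (8) might be computed for a toy decision table; the dataset, attribute names and dictionary-based representation are illustrative assumptions.

```python
from collections import defaultdict

# Toy decision table: each row is an object; 'd' is the decision attribute.
# All values are purely illustrative.
data = [
    {'a': 0, 'b': 1, 'd': 'yes'},
    {'a': 0, 'b': 1, 'd': 'yes'},
    {'a': 1, 'b': 0, 'd': 'no'},
    {'a': 1, 'b': 1, 'd': 'no'},
    {'a': 1, 'b': 1, 'd': 'yes'},
]

def partition(rows, attrs):
    """Equivalence classes of IND(attrs): objects (row indices) that
    agree on every attribute in attrs fall into the same class."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(i)
    return list(classes.values())

def positive_region(rows, P, Q):
    """POS_P(Q): union of the P-lower approximations of the classes
    of U/Q (equations 5 and 7)."""
    pos = set()
    for X in partition(rows, Q):
        for eq_class in partition(rows, P):
            if eq_class <= X:          # [x]_P is a subset of X
                pos |= eq_class
    return pos

def dependency(rows, P, Q):
    """gamma_P(Q) = |POS_P(Q)| / |U|  (equation 8)."""
    return len(positive_region(rows, P, Q)) / len(rows)

print(dependency(data, ['a', 'b'], ['d']))  # 0.6: objects 3 and 4 conflict
```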
A reduct $R_{min}$ is defined as a minimal subset $R$ of the initial attribute set $C$ such that, for a given set of decision attributes $D$, $\gamma_R(D) = \gamma_C(D)$. From the literature, $R$ is a minimal subset if $\gamma_{R \setminus \{a\}}(D) \neq \gamma_R(D)$ for all $a \in R$. This means that no attributes can be removed from the subset without affecting the dependency degree. Hence, a minimal subset by this definition may not be the global minimum (a reduct of smallest cardinality). The goal of rough set-based feature selection is to discover reducts of smallest cardinality.

2.3 Fuzzy-Rough Sets

It is not possible in the original rough set theory to say whether two attribute values are similar, and to what extent they are the same; for example, two close values may only differ as a result of noise, but in rough set theory they are considered to be as different as two values of a different order of magnitude. It is therefore desirable to develop techniques that provide a means of knowledge modelling for crisp and real-valued datasets, utilising the extent to which values are similar. This can be achieved through the use of fuzzy-rough sets [6]. Fuzzy-rough sets encapsulate the related but distinct concepts of vagueness (for fuzzy sets) and indiscernibility (for rough sets), both of which occur as a result of uncertainty in knowledge.

Definitions for the fuzzy lower and upper approximations can be found in [13], where a $\mathcal{T}$-transitive fuzzy similarity relation is used to approximate a fuzzy concept $X$:

$$\mu_{\underline{R_P}X}(x) = \inf_{y \in U} \mathcal{I}(\mu_{R_P}(x,y), \mu_X(y)) \quad (9)$$
$$\mu_{\overline{R_P}X}(x) = \sup_{y \in U} \mathcal{T}(\mu_{R_P}(x,y), \mu_X(y)) \quad (10)$$

Here, $\mathcal{I}$ is a fuzzy implicator and $\mathcal{T}$ a t-norm. $R_P$ is the fuzzy similarity relation induced by the subset of features $P$:

$$\mu_{R_P}(x,y) = \mathcal{T}_{a \in P}\{\mu_{R_a}(x,y)\} \quad (11)$$

$\mu_{R_a}(x,y)$ is the degree to which objects $x$ and $y$ are similar for feature $a$, and may be defined in many ways, for example:

$$\mu_{R_a}(x,y) = 1 - \frac{|a(x) - a(y)|}{|a_{max} - a_{min}|} \quad (12)$$
$$\mu_{R_a}(x,y) = \exp\left(-\frac{(a(x) - a(y))^2}{2\sigma_a^2}\right) \quad (13)$$
$$\mu_{R_a}(x,y) = \max\left(\min\left(\frac{a(y) - (a(x) - \sigma_a)}{a(x) - (a(x) - \sigma_a)},\ \frac{(a(x) + \sigma_a) - a(y)}{(a(x) + \sigma_a) - a(x)}\right),\ 0\right) \quad (14)$$

where $\sigma_a^2$ is the variance of feature $a$. As these relations do not necessarily display $\mathcal{T}$-transitivity, the fuzzy transitive closure can be computed for each attribute. The choice of relation is largely determined by the intended application. For feature selection, a relation such as (14) may be appropriate, as this permits only small differences between attribute values of differing objects. For classification tasks, such as fuzzy-rough nearest neighbours [8], a more gradual and inclusive relation such as (12) should be used.

2.3.1 Feature Selection

In a similar way to the original crisp rough set approach, the fuzzy positive region can be defined as [10]:

$$\mu_{POS_{R_P}(D)}(x) = \sup_{X \in U/D} \mu_{\underline{R_P}X}(x) \quad (15)$$

An important issue in data analysis is discovering dependencies between attributes. The fuzzy-rough degree of dependency of $D$ on the attribute subset $P$ can be defined in the following way:

$$\gamma'_P(D) = \frac{\sum_{x \in U} \mu_{POS_{R_P}(D)}(x)}{|U|} \quad (16)$$

A fuzzy-rough reduct $R$ can be defined as a minimal subset of features that preserves the dependency degree of the entire dataset, i.e. $\gamma'_R(D) = \gamma'_C(D)$. Based on this, a fuzzy-rough greedy hill-climbing algorithm can be constructed that uses equation (16) to gauge subset quality. In [10], it has been shown that the dependency function is monotonic and that fuzzy discernibility matrices may also be used to discover reducts.
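As an illustration of how equations (9), (11), (12), (15) and (16) fit together, the following sketch computes the fuzzy-rough dependency degree for a toy real-valued dataset. The choice of the minimum t-norm, the Łukasiewicz implicator and NumPy as a vehicle are assumptions for illustration, not prescriptions from the paper.

```python
import numpy as np

# Toy real-valued data: rows are objects, columns are conditional features.
X = np.array([[0.1, 0.9],
              [0.2, 0.8],
              [0.9, 0.1],
              [0.8, 0.3]])
y = np.array([0, 0, 1, 1])          # crisp decision classes

def feature_similarity(col):
    """Relation (12): 1 - |a(x) - a(y)| / |a_max - a_min|."""
    rng = col.max() - col.min()
    return 1.0 - np.abs(col[:, None] - col[None, :]) / rng

def subset_relation(X, P):
    """Relation (11), using the minimum t-norm over the features in P."""
    sims = [feature_similarity(X[:, a]) for a in P]
    return np.minimum.reduce(sims)

def fuzzy_dependency(X, y, P):
    """Equations (9), (15) and (16) with the Lukasiewicz implicator
    I(x, y) = min(1, 1 - x + y)."""
    R = subset_relation(X, P)
    pos = np.zeros(len(y))
    for cls in np.unique(y):
        membership = (y == cls).astype(float)        # crisp decision concept
        implication = np.minimum(1.0, 1.0 - R + membership[None, :])
        lower = implication.min(axis=1)              # eq. (9): inf over y
        pos = np.maximum(pos, lower)                 # eq. (15): sup over classes
    return pos.sum() / len(y)                        # eq. (16)

print(fuzzy_dependency(X, y, [0, 1]))   # dependency of the decision on both features
```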
However, there is no mechanism for modelling missing values in this framework, which is therefore limited in its application to real-world datasets. This motivates the work proposed in the following section.

3 Interval-valued FRFS

Central to traditional fuzzy-rough feature selection is the fuzzy tolerance relation. From this, the fuzzy-rough lower approximations are constructed, which then form the fuzzy positive regions utilised in the degree of dependency measure. Thus, the starting point for the process, type-1 fuzzy tolerance, is critical for its success. It is recognised that type-1 approaches are unable to address particular types of uncertainty due to their requirement of totally crisp membership functions [11]. An interval-valued approach may therefore be able to better handle this uncertainty and, at the same time, model the uncertainty inherent in missing values. Currently, there is no way to handle such values in fuzzy-rough set theory. Thus, the starting point for the proposed work is the interval-valued tolerance relation $\tilde{R}_a(x,y) = [\mu_{R_a}^-(x,y), \mu_{R_a}^+(x,y)]$. The constituent fuzzy relations for individual attributes can be defined via an upper ($\mu_{R_a}^+$) and lower ($\mu_{R_a}^-$) membership function, for example:

$$\mu_{R_a}^+(x,y) = \left(1 - \frac{|a(x) - a(y)|}{|a_{max} - a_{min}|}\right)^m \quad (17)$$
$$\mu_{R_a}^-(x,y) = 1 - \frac{|a(x) - a(y)|}{|a_{max} - a_{min}|} \quad (18)$$

for $m \in (0,1]$. If $m = 1$, equation (17) degenerates to a standard type-1 fuzzy tolerance relation. As with type-1 fuzzy-rough feature selection, composition of relations is achieved by conjunctively combining the individual fuzzy relations $\tilde{R}_a$ with a t-norm $\mathcal{T}$:

$$\tilde{R}_P(x,y) = \mathcal{T}_{a \in P}\{\tilde{R}_a(x,y)\} = \left[\mathcal{T}_{a \in P}\{\mu_{R_a}^-(x,y)\},\ \mathcal{T}_{a \in P}\{\mu_{R_a}^+(x,y)\}\right] \quad (19)$$

Based on the definitions above, the interval-valued $P$-lower and $P$-upper approximations of a concept $\tilde{X}$ are here defined as

$$\mu_{\underline{\tilde{R}_P}X}(x) = \inf_{y \in U} \mathcal{I}(\tilde{R}_P(x,y), \tilde{X}(y)) \quad (20)$$
$$\mu_{\overline{\tilde{R}_P}X}(x) = \sup_{y \in U} \mathcal{T}(\tilde{R}_P(x,y), \tilde{X}(y)) \quad (21)$$

where $\tilde{R}_P(x,y)$ is an interval-valued fuzzy tolerance relation. The tuple $\langle \underline{\tilde{R}_P}X, \overline{\tilde{R}_P}X \rangle$ is called an interval-valued fuzzy-rough set.

3.1 Feature Selection

For the purposes of feature selection, the work presented in this paper only considers the use of the interval-valued lower approximation:

$$\mu_{\underline{\tilde{R}_P}X}(x) = \inf_{y \in U} \mathcal{I}(\tilde{R}_P(x,y), \tilde{X}(y)) = \inf_{y \in U} \left[\mathcal{I}(\mu_{R_P}^+(x,y), \mu_X^-(y)),\ \mathcal{I}(\mu_{R_P}^-(x,y), \mu_X^+(y))\right]$$

This provides a measure of the uncertainty of membership of an object to a given concept $\tilde{X}$ as a result of the underlying uncertainty in the similarity between this object and others in the universe. Note that the use of $\mu_{R_P}^-$ and $\mu_{R_P}^+$ is reversed due to the properties of fuzzy implication: since an implicator is decreasing in its first argument, the upper relation bound yields the lower bound of the resulting interval. The resulting interval is a lower and upper bound on the true membership degree. Based on this, the interval-valued positive region is defined:

$$\mu_{\widetilde{POS}_P(D)}(x) = \sup_{\tilde{X} \in U/D} \mu_{\underline{\tilde{R}_P}X}(x) \quad (22)$$

From this, the interval-valued degree of dependency of decision features $D$ on a feature subset $P$ is defined as:

$$\tilde{\gamma}_P(D) = \frac{|\widetilde{POS}_P(D)|}{|U|} = \left[\frac{\sum_{y \in U} \mu^-_{POS_P(D)}(y)}{|U|},\ \frac{\sum_{y \in U} \mu^+_{POS_P(D)}(y)}{|U|}\right] \quad (23)$$

In [5], a normalised version of the dependency was introduced for FRFS. The equivalent interval-valued normalised version is as follows:

$$\tilde{\gamma}_P(D) = \frac{|\widetilde{POS}_P(D)|}{|\widetilde{POS}_C(D)|} = \left[\frac{1}{|U|}\sum_{y \in U} \frac{\mu^-_{POS_P(D)}(y)}{\mu^-_{POS_C(D)}(y)},\ \frac{1}{|U|}\sum_{y \in U} \frac{\mu^+_{POS_P(D)}(y)}{\mu^+_{POS_C(D)}(y)}\right] \quad (24)$$

Here, $\tilde{\gamma}_P(D)$ is an interval $[\gamma_P^-(D), \gamma_P^+(D)]$ that describes the extent to which the features in $P$ are predictive of the decision feature(s). $P$ is called a fuzzy decision superreduct to degree $\alpha$ if $\tilde{\gamma}_P(D) \geq \alpha$, and a fuzzy decision reduct to degree $\alpha$ if, moreover, for all $P' \subset P$, $\tilde{\gamma}_{P'}(D) < \alpha$.
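The interval-valued computation can be sketched by carrying a (lower, upper) pair of type-1 structures through the same pipeline. The following illustrative code (under the same assumptions as the previous sketch: crisp decisions, minimum t-norm, Łukasiewicz implicator) implements equations (17)-(23), including the bound reversal under implication; it is a sketch, not a definitive implementation.

```python
import numpy as np

def interval_relation(col, m=0.9):
    """Equations (17)/(18): lower and upper tolerance for one attribute.
    For m < 1 and base in [0, 1], base ** m >= base, so the second
    component is indeed the upper bound."""
    rng = col.max() - col.min()
    base = 1.0 - np.abs(col[:, None] - col[None, :]) / rng
    return base, base ** m

def interval_subset_relation(X, P, m=0.9):
    """Equation (19): componentwise minimum t-norm over the features in P."""
    los, ups = zip(*(interval_relation(X[:, a], m) for a in P))
    return np.minimum.reduce(los), np.minimum.reduce(ups)

def interval_dependency(X, y, P, m=0.9):
    """Equations (20), (22) and (23); note the bound reversal:
    the lower bound of the approximation uses the upper relation bound."""
    R_lo, R_up = interval_subset_relation(X, P, m)
    imp = lambda a, b: np.minimum(1.0, 1.0 - a + b)    # Lukasiewicz implicator
    pos_lo, pos_up = np.zeros(len(y)), np.zeros(len(y))
    for cls in np.unique(y):
        memb = (y == cls).astype(float)
        low_lo = imp(R_up, memb[None, :]).min(axis=1)  # lower bound uses R_up
        low_up = imp(R_lo, memb[None, :]).min(axis=1)  # upper bound uses R_lo
        pos_lo = np.maximum(pos_lo, low_lo)
        pos_up = np.maximum(pos_up, low_up)
    return pos_lo.mean(), pos_up.mean()                # eq. (23)

X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.3]])
y = np.array([0, 0, 1, 1])
print(interval_dependency(X, y, [0, 1]))   # (lower, upper) dependency interval
```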
Note that if a type-1 fuzzy tolerance relation is used, these definitions degenerate to their traditional fuzzy-rough counterparts. Core features (i.e. those that cannot be removed without introducing inconsistencies) may be determined by considering the change in dependency of the full set of conditional features when individual attributes are removed:

$$Core(C) = \{a \in C \mid \tilde{\gamma}_{C \setminus \{a\}}(D) \neq \tilde{\gamma}_C(D)\} \quad (25)$$

Subset search can be conducted by whichever mechanism is appropriate; for example, greedy hill-climbing, genetic algorithms [2], ant colony optimisation [7], etc. Here, the standard greedy hill-climbing approach is adopted. The dependency degree is monotonic in both the lower and upper bounds, and so search continues until $[1,1]$ is reached (no uncertainty) or there is no improvement in dependency.

3.2 Missing Values

The underlying interval-valued tolerance relation can be modified in order to model the uncertainty resulting from missing values. If an object contains a missing value for a particular feature, then the resulting degree of similarity with other objects is unknown. In an interval-valued context, this can be modelled by returning the unit interval when an attribute value is missing for one or both objects:

$$\tilde{R}_a(x,y) = \begin{cases} \tilde{R}_a(x,y) & \text{if } a(x) \neq * \text{ and } a(y) \neq * \\ [0,1] & \text{otherwise} \end{cases} \quad (26)$$

where missing values are denoted by $*$. Again, relations are composed via a t-norm. The resulting interval-valued lower approximations, positive regions and dependency can then be used to gauge subset quality in an identical manner to the approach of Section 3.1. The resulting algorithm is called FRFS-II; a sketch of the modified relation is given after the experimental section below.

4 Initial Experimentation

To test the robustness of the proposed approach in the presence of missing values, four benchmark datasets were used, obtained from [3] and containing no missing values initially. Values were randomly corrupted to become missing based on a supplied probability of mutation, and the FRFS-II algorithm was run using $m = 0.9$ for the tolerance relation. This was carried out ten times for each dataset. Table 1 shows the dataset details, the probabilities of value corruption, and the resulting ranges of the numbers of missing values; e.g. for the water 2 dataset with a probability of mutation of 0.0005%, datasets were produced with 4 to 14 missing values present. For comparison, the type-1 FRFS algorithm was run on the original, uncorrupted data. As the corruption probability increases, the amount of missing data greatly increases, with the final column showing a degree of missing values not often seen in real data but useful for testing the robustness of the approach.

The resulting reduct sizes (averaged over the repeated runs) can be seen in Table 2. All discovered reducts produced a dependency degree of 1 for the uncorrupted data and hence were all fuzzy-rough reducts. This shows the resilience of the approach when faced with corrupted data, although this may not be the case in general, as it will depend on the extent of missing values and the similarity relation chosen. For small numbers of missing values, FRFS-II manages to locate reducts identical or equivalent to those of the original FRFS approach (which was applied to the uncorrupted data). As the extent of data corruption increases, the task becomes more difficult, with FRFS-II selecting more features to compensate for the lack of information but still producing valid fuzzy-rough reducts.

Table 1: Number of missing values (range) per corruption probability

Dataset   Objects  Features  0.0005%  0.001%  0.005%  0.01%    0.1%
Heart     270      14        1-3      1-6     12-24   26-43    321-387
Water 2   390      39        4-14     9-30    62-88   123-175  1446-1533
Water 3   390      39        3-13     12-23   65-98   128-163  1462-1531
Wine      178      14        1-5      1-4     7-15    20-30    206-246

Table 2: Average reduct size for various data corruption probabilities

Dataset   Features  Original FRFS  0.0005%  0.001%  0.005%  0.01%  0.1%
Heart     14        8              8        8       9.1     10.3   14
Water 2   39        7              7        7       7       7.2    11.5
Water 3   39        7              7        7       7.2     7.9    11.8
Wine      14        6              6        6       6       6      9.1
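To make the missing-value mechanism of Section 3.2 concrete, the sketch below adapts the per-attribute interval relation so that, per equation (26), the unit interval is returned whenever either value is absent. Encoding missing values as NaN is an illustrative choice, not the paper's; the function is a drop-in substitute for interval_relation() in the earlier sketch, after which the lower approximations, positive regions and dependency intervals are computed exactly as before (this is the essence of FRFS-II).

```python
import numpy as np

def interval_relation_mv(col, m=0.9):
    """Equation (26): per-attribute interval tolerance relation that
    returns the unit interval [0, 1] wherever either value is missing.
    Missing values are encoded as np.nan (an illustrative assumption)."""
    known = ~np.isnan(col)
    rng = np.nanmax(col) - np.nanmin(col)
    base = 1.0 - np.abs(col[:, None] - col[None, :]) / rng  # nan where missing
    both_known = known[:, None] & known[None, :]
    lower = np.where(both_known, base, 0.0)       # lower bound of unit interval
    upper = np.where(both_known, base ** m, 1.0)  # upper bound of unit interval
    return lower, upper

# Usage: the row for an object with a missing value is [0, 1] everywhere.
col = np.array([0.1, 0.2, np.nan, 0.8])
lo, up = interval_relation_mv(col)
print(lo[2], up[2])   # [0. 0. 0. 0.] and [1. 1. 1. 1.]
```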
5 Conclusion

This paper has proposed an interval-valued approach to fuzzy-rough feature selection that successfully handles uncertainty that cannot be modelled by a type-1 approach. In particular, the method can handle missing values effectively and in an intuitive way.

Future work will involve further experimental investigation, particularly to see whether the trend observed in this paper is maintained for other datasets. This will include an analysis of the impact of the choice of similarity relation and the parameter $m$. There will also need to be an evaluation of the effectiveness of the discovered reducts for the task of classification. Recent work in the area of type-1 FRFS for this purpose has shown that such reducts improve performance [10], and a similar improvement is expected here. Additionally, the work in [4] proposed novel t-norms and implicators for interval-valued fuzzy sets; these should be of great benefit for interval-valued FRFS.

References

[1] A. Bargiela, W. Pedrycz, Granular Computing: An Introduction, Kluwer Academic Publishers, 2002.
[2] A.T. Bjorvand, J. Komorowski, Practical Applications of Genetic Algorithms for Efficient Reduct Computation, Proc. 15th IMACS World Congress on Scientific Computation, Modelling and Applied Mathematics, A. Sydow (ed.), vol. 4, pp. 601-606, 1997.
[3] C.L. Blake, C.J. Merz, UCI Repository of Machine Learning Databases, Irvine, University of California, 1998. http://archive.ics.uci.edu/ml/
[4] C. Cornelis, G. Deschrijver, E.E. Kerre, Advances and challenges in interval-valued fuzzy logic, Fuzzy Sets and Systems, vol. 157, no. 5, pp. 622-627, 2006.
[5] C. Cornelis, G. Hurtado Martín, R. Jensen, D. Ślęzak, Feature Selection with Fuzzy Decision Reducts, Proc. 3rd Int. Conf. on Rough Sets and Knowledge Technology (RSKT 2008), pp. 284-291, 2008.
[6] D. Dubois, H. Prade, Rough fuzzy sets and fuzzy rough sets, International Journal of General Systems, vol. 17, pp. 191-209, 1990.
[7] R. Jensen, Performing Feature Selection with ACO, in Swarm Intelligence and Data Mining, A. Abraham, C. Grosan, V. Ramos (eds.), Studies in Computational Intelligence, vol. 34, pp. 45-73, 2006.
[8] R. Jensen, C. Cornelis, A New Approach to Fuzzy-Rough Nearest Neighbour Classification, Proc. 6th Int. Conf. on Rough Sets and Current Trends in Computing, to appear, 2008.
[9] R. Jensen, Q. Shen, Computational Intelligence and Feature Selection: Rough and Fuzzy Approaches, IEEE Press/Wiley & Sons, 2008.
[10] R. Jensen, Q. Shen, New approaches to fuzzy-rough feature selection, IEEE Transactions on Fuzzy Systems, to appear, 2008.
[11] J.M. Mendel, R.I. John, Type-2 Fuzzy Sets Made Simple, IEEE Transactions on Fuzzy Systems, vol. 10, no. 2, pp. 117-127, 2002.
[12] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341-356, 1982.
[13] A.M. Radzikowska, E.E.
Kerre, A comparative study of fuzzy rough sets, Fuzzy Sets and Systems, vol. 126, pp. 137-156, 2002.
[14] L.A. Zadeh, Fuzzy sets, Information and Control, vol. 8, pp. 338-353, 1965.