Fuzzy Rule Interpolation With $K$-Neighbors for TSK Models

When a fuzzy system is presented with an incomplete (or sparse) rule base, fuzzy rule interpolation (FRI) offers a useful mechanism to infer conclusions for unmatched observations. However, most existing FRI methodologies are established for Mamdani inference models rather than for Takagi–Sugeno–Kang (TSK) ones. This article presents a novel approach for computing interpolated outcomes with TSK models, using only a small number of rules neighboring an unmatched observation. Compared with existing methods, the new approach helps improve the computational efficiency of the overall interpolative reasoning process, while minimizing the adverse impact on accuracy induced by firing rules of low similarity to the new observation. For problems that involve a rule base of a large size, where the closest neighboring rules may be rather similar to one another, a rule-clustering-based method is introduced. It derives an interpolated conclusion by first clustering rules into different groups with a clustering algorithm and then utilizing only those rules that are each selected from one of a given, small number of closest rule clusters. Systematic experimental examinations are carried out to verify the efficacy of the introduced techniques, in comparison with state-of-the-art methods, over a range of benchmark regression problems, while employing different clustering algorithms (which also shows the flexibility in ways of implementing the novel approach).

terms can be readily described and their associative relationships explicitly represented, enabling an inference process that resembles human reasoning.
Different types of fuzzy rule inference system exist in the literature. Takagi–Sugeno–Kang (TSK) models [3] are among the most widely applied conventional ones. Within such a model, fuzzy sets are used to depict the values of rule antecedents and polynomials to describe the consequents. Firing a rule of this form results in crisp conclusions. As such, TSK models are very suitable for solving regression and prediction problems over continuously valued domains.
While being generally powerful, conventional rule-based systems all suffer from an important limitation, be they fuzzy or not. That is, if the input domain is not completely covered, a novel observation may not always match any rule in the given rule base. In this case, they are unable to produce any conclusion by applying any of the classical rule-firing methods. Such rule bases are termed sparse ones (although this is often taken to simply imply an incomplete rule base) in the literature. To rectify, or at least to reduce the adverse impact of, this limitation, fuzzy rule interpolation (FRI) has been introduced [4]. If a newly presented input or observation does not match any of the rules available, FRI can help by generating an intermediate rule through the approximation of those rules close to the observation, from which a potentially relevant conclusion may then be obtained. FRI has been widely applied to practical problems with satisfactory results, including computer vision and image processing [5], medical diagnosis [6], risk analysis [7], and cyber and network security [8].
A good number of FRI approaches have been established over the past few decades. Just considering the popular family of transformation-based FRI techniques [9] that generally follow the seminal work on linear interpolation [10], there have been many distinct FRI mechanisms reported in the literature, including adaptive interpolation [11], higher order interpolation [12], and weighted FRI techniques [13]. However, these exemplified techniques are all developed for Mamdani models rather than for TSK models.
TSK inference extension (TSK+) [14] is a recently proposed fuzzy interpolative reasoning method for fuzzy systems employing TSK models. Instead of relying upon computing matching degrees, it uses a distance metric-based similarity measure to perform interpolative reasoning, by manipulating all rules contained within the rule base. In so doing, when an input matches no rule, a certain output is still obtained. Although TSK+ offers a useful means for interpolative inference, it has its own shortcomings. Particularly, it is not sufficiently efficient for many practical applications, given that it fundamentally requires the use of all given rules, incurring significant computational overheads. Besides, redundant or possibly irrelevant rules are also included in any attempt to compute the output. This may introduce undesirable biases into the final interpolated outcomes, thereby reducing the system accuracy.
To address these limitations, a different approach, interpolation with (just) K closest neighbors, is proposed in this article. The work presents two advancements in developing FRI methodologies for TSK models, through two novel implementations of the approach: 1) interpolation with K closest rules (KCR) for sparse rule bases of a small size, and 2) interpolation with K closest rule clusters (CRC) for those of a large size. The underlying principle for both is to perform interpolation using a small number, K, of distinctive rules close to an unmatched observation. This follows the common practice of the state-of-the-art techniques developed for Mamdani models. In so doing, rules with low similarities to an unmatched observation are not fired; hence, their adverse impact on model accuracy is minimized, while less computation is incurred. In addition, the problem of lacking diversity among the rules involved in subsequent interpolation, caused by the situation where a large-sized sparse rule base is present but the K nearest rules may be very similar to one another, is resolved. To provide flexibility in implementing the proposed approach, ensuring that it does not rely on any specific clustering algorithm, the implementation of the second method is herein systematically evaluated using five different clustering techniques.
To enable a fair comparison over different methods, a range of experimental studies are systematically carried out on different benchmark datasets. Statistical analyses of the results demonstrate that KCR improves the performance over TSK+ and CRC for cases involving sparse rule bases of a small size, and that CRC outperforms TSK+ and KCR for cases involving rule bases of a large size. In both cases, KCR and CRC offer superior results over the existing approach TSK+. The most recent work reported in [15] provides a novel approach that makes it possible to automatically select fuzzy rules for interpolation. Having recognized this, experimental studies are also conducted to compare KCR and CRC with this existing approach, with favorable results.
The remainder of this article is structured as follows. For academic completeness, Section II briefly reviews the inference process of conventional fuzzy systems that use a TSK model, the procedure of TSK+, and the five popular clustering algorithms that are subsequently used for rule clustering. Sections III and IV detail the two aforementioned methods, KCR and CRC, respectively. Section V presents and discusses the experimental results, in comparison with state-of-the-art alternatives. Finally, Section VI concludes this article and points out interesting future work.

II. BACKGROUND
This section presents the directly relevant background material, including an outline of TSK fuzzy inference systems and TSK+, and a brief description of five popular clustering algorithms each of which may be used to facilitate the rule clustering-based interpolation method.

A. TSK Model-Based Fuzzy Reasoning Systems
TSK fuzzy systems were originally presented by Takagi, Sugeno, and Kang in 1985 [3]. A TSK model uses fuzzy sets to represent rule antecedents and polynomials to represent the consequents, with the computed conclusions expressed as crisp values. As presented in Fig. 1, for an unknown observation, the TSK model-based system first calculates the matching degrees between the observation and the rule antecedents in each rule. The weight of a certain rule is then determined by an integrating operation (usually implemented by a minimum operator) on the resulting individual matching degrees. The final outcome is computed as the weighted average of the corresponding rule consequents. A more detailed description of a TSK system and its working is outlined in Algorithm A1, as provided in Appendix A.
If a new observation does not match any rule, the weight $\alpha_i$ calculated for each rule will be 0. This implies that neither a subconclusion nor a final result can be derived using traditional approximate reasoning. Thus, any attempt to utilize the conventional TSK model will fail. This is due to the fact that the system utilizes a sparse rule base or, more precisely, an incomplete fuzzy rule base, namely one that does not cover the full problem domain. As indicated previously, FRI has been developed to compute the corresponding conclusions for such unmatched observations, by exploiting the approximation of the rules close to them, although the substantial majority of the existing techniques are aimed at interpolative reasoning with Mamdani models [4].
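The conventional TSK firing scheme outlined above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, triangular membership functions are assumed (as used later in this article), and the minimum operator serves as the integrating operation. Returning `None` exposes the sparse rule base failure discussed in this section.

```python
def tri_mu(x, a, b, c):
    """Membership degree of crisp x under a triangular fuzzy set (a, b, c)."""
    if a < x < b:
        return (x - a) / (b - a)
    if b <= x < c:
        return (c - x) / (c - b)
    return 1.0 if x == b else 0.0

def tsk_infer(rules, obs):
    """Conventional TSK inference: weighted average of fired rule consequents.

    rules: list of (antecedents, coeffs), where antecedents is one (a, b, c)
    triangle per input variable and coeffs = [a0, a1, ..., an].
    Returns None when no rule fires, i.e., the sparse rule base case.
    """
    num = den = 0.0
    for ants, coeffs in rules:
        # matching degrees per antecedent, integrated by the minimum operator
        w = min(tri_mu(x, *mf) for x, mf in zip(obs, ants))
        # crisp consequent of the rule's polynomial
        y = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], obs))
        num += w * y
        den += w
    return num / den if den > 0 else None
```

For an observation falling outside every rule's antecedent support, all weights are zero and no conclusion is produced, which is precisely the situation FRI is designed to handle.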

B. TSK Inference Extension
TSK+ [14] represents the current state of the art, providing a fuzzy inference mechanism that extends the original TSK inference to cope with sparse rule bases. Different from the conventional TSK approach that works based on calculating the matching or overlapping degrees with an observation, TSK+ works by employing a similarity measure modified from the Euclidean distance metric [16] to assess any potential relationships between the observation and every given rule. The similarities so measured between the observation and each rule are always positive (with a similarity degree larger than zero). Therefore, each and every rule in the given rule base will be involved in the computation of the inferred conclusion. In so doing, if an observation matches no rules at all in the rule base, an approximate conclusion is still estimated. The inference procedure of TSK+ is summarized in Algorithm A2, in Appendix A.
Fundamentally, TSK+ applies all given rules. This clearly is not efficient, especially for problems that involve a large (however incomplete) rule base. More importantly, redundant or even possibly irrelevant rules are also included in every attempt to compute the final output. This may introduce a certain undesirable bias or noise into the computation process for the generation of the conclusion, further to the introduction of significant computational overheads.

C. Clustering Algorithms
A brief introduction to five popularly used and readily available clustering algorithms is given here. Each of these can be adopted to implement the rule clustering procedure that is required in the subsequent development when facing a large-sized rule base. Any one of them may be employed to carry out the intended task, but they are collectively reviewed to facilitate comparison, in an effort to make an informed choice of the potentially most suitable one.
1) K-Means [17]: As one of the most widely used clustering algorithms, it clusters instances into K groups by iteratively updating cluster centers and assigning instances to their closest centers. The underlying objective function (known as the inertia or within-cluster sum-of-squares error) is defined as
$$J(U, V) = \sum_{i=1}^{K} \sum_{j=1}^{N} \| x_{ji} - v_i \|^2$$
where $U$ denotes the set of data instances, $x_{ji}$ expresses that the $j$th instance belongs to the $i$th cluster, $V$ stands for the set of cluster centers with $v_i \in V$, $\| x_{ji} - v_i \|^2$ denotes the squared Euclidean distance between the instance $x_{ji}$ and the center $v_i$, $K$ is the number of cluster centers, and $N$ is the number of instances.
This algorithm requires the number of clusters to be prespecified. Many methodologies have been introduced in the literature to determine the number of clusters, K, such as those through the use of the Elbow method [18], the Silhouette-based technique [19], and the Bayesian Information Criterion [20]. The Elbow method is fast and effective, as it determines the value of K simply based on the criterion that adding another cluster does not lead to much improved modeling outcome (with respect to the abovementioned objective function), and hence, it is employed in this article.
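For concreteness, a minimal K-means sketch is given below. The deterministic farthest-point initialization is an illustrative choice for reproducibility, not part of the classical algorithm; running the function for increasing `k` and watching where the returned inertia stops improving markedly is exactly the Elbow criterion mentioned above.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points (lists of floats)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50):
    """Plain K-means; returns (centers, labels, inertia)."""
    # deterministic farthest-point initialization (illustrative assumption)
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    for _ in range(iters):
        # assignment step: each instance goes to its closest center
        labels = [min(range(k), key=lambda i: dist2(p, centers[i])) for p in points]
        # update step: each center becomes the mean of its members
        for i in range(k):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    # within-cluster sum-of-squares error, i.e., the objective J(U, V)
    inertia = sum(dist2(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, inertia
```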
2) Gaussian Mixture Models (GMMs) [21]: Being another classical clustering algorithm, this method works by presuming that the distribution of instances conforms to a linear combination of multiple Gaussian distribution functions. This combination is defined by
$$p(x_j) = \sum_{i=1}^{K} \pi_i \, \mathcal{N}(x_j \mid \mu_i, \sigma_i)$$
where $x_j$ denotes an instance to be clustered, $\pi_i$ represents the prior probability that an instance belongs to the $i$th cluster, and $\mu_i$ and $\sigma_i$ stand for the mean and standard deviation of the $i$th Gaussian model, respectively. Theoretically, this model can fit any type of data distribution. The expectation-maximization (EM) algorithm [22] is the most commonly used algorithm to construct GMMs. In particular, to identify the optimal partitions of GMMs, the log-likelihood function to be maximized is given by
$$\ln L = \sum_{j=1}^{N} \ln \left( \sum_{i=1}^{K} \pi_i \, \mathcal{N}(x_j \mid \mu_i, \sigma_i) \right).$$
The corresponding parameters are updated iteratively such that
$$\mu_i = \frac{1}{N_i} \sum_{j=1}^{N} \gamma_{ji} x_j, \qquad \sigma_i^2 = \frac{1}{N_i} \sum_{j=1}^{N} \gamma_{ji} (x_j - \mu_i)^2, \qquad \pi_i = \frac{N_i}{N}$$
where $N_i = \sum_{j=1}^{N} \gamma_{ji}$ and $\gamma_{ji}$ represents the conditional probability that the instance $x_j$ belongs to the $i$th cluster, which is calculated by
$$\gamma_{ji} = \frac{\pi_i \, \mathcal{N}(x_j \mid \mu_i, \sigma_i)}{\sum_{l=1}^{K} \pi_l \, \mathcal{N}(x_j \mid \mu_l, \sigma_l)}.$$

3) Fuzzy C-Means (FCM) [23]: This conventional clustering algorithm has been popularly applied in dealing with various problems (e.g., fuzzy rule base generation [13] and social network modeling [24]). Unlike crisp clustering methods (say, K-means and GMMs), FCM allows a data instance to belong to different clusters at the same time with different membership degrees. It works by assigning each instance a membership degree to every cluster, based on the measurement of the distance between the instance and an individual cluster center. The closer an instance is to a cluster center, the higher the membership degree.
To identify the optimal partitions of FCM, the objective function to be minimized is defined by
$$J(U, V) = \sum_{i=1}^{K} \sum_{j=1}^{N} u_{ij}^{w} \, \| x_j - v_i \|^2$$
where $x_j$ denotes an instance, $w$ is a parameter that signifies the fuzziness weight of each element, $K$ stands for the number of cluster centers, $N$ is the number of instances, $V$ is the set of cluster centers with $v_i \in V$, $U$ is the matrix of membership degrees, $u_{ij} \in U$ represents the membership degree of the instance $x_j$ belonging to the cluster with the center $v_i$, and $\| x_j - v_i \|^2$ expresses the distance between the instance $x_j$ and the center $v_i$. The procedure of this algorithm is summarized in Algorithm A3, in Appendix A. Particularly, the membership $u_{ij}$ and the center $v_i$ are updated iteratively such that
$$u_{ij} = \frac{1}{\sum_{l=1}^{K} \left( \frac{\| x_j - v_i \|}{\| x_j - v_l \|} \right)^{\frac{2}{w-1}}} \quad (7)$$
$$v_i = \frac{\sum_{j=1}^{N} u_{ij}^{w} x_j}{\sum_{j=1}^{N} u_{ij}^{w}}. \quad (8)$$
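The alternating membership and center updates can be sketched as a minimal pure-Python loop; the deterministic farthest-point initialization and the small distance floor (to avoid division by zero) are illustrative assumptions, not part of the standard algorithm description.

```python
def dist(p, q):
    """Euclidean distance, floored to avoid division by zero."""
    return max(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5, 1e-12)

def fcm(points, k, w=2.0, iters=100):
    """Fuzzy C-means sketch; returns (centers, memberships U)."""
    # deterministic farthest-point initialization (illustrative assumption)
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(list(far))
    n, dims = len(points), len(points[0])
    U = [[0.0] * k for _ in points]
    for _ in range(iters):
        # membership update: u_ij = 1 / sum_l (d_i / d_l)^(2/(w-1))
        for j, p in enumerate(points):
            d = [dist(p, c) for c in centers]
            for i in range(k):
                U[j][i] = 1.0 / sum((d[i] / d[l]) ** (2.0 / (w - 1.0)) for l in range(k))
        # center update: v_i = sum_j u_ij^w x_j / sum_j u_ij^w
        for i in range(k):
            denom = sum(U[j][i] ** w for j in range(n))
            centers[i] = [sum(U[j][i] ** w * points[j][d_] for j in range(n)) / denom
                          for d_ in range(dims)]
    return centers, U
```

Unlike the K-means assignment, each instance here retains a graded membership in every cluster, which is what later allows the rule learning procedure of Section V to threshold memberships.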

4) Kernel Fuzzy C-Means (K-FCM) [25]:
Being an extension to the standard FCM, it works by applying a kernel-induced distance measure to replace the original Euclidean distance. A kernel function is a nonlinear mapping that transforms a low-dimensional input data space into a feature space of a much higher dimension, aiming at turning the original nonlinear problem into a potentially linear one so as to facilitate problem solving [26]. The following Gaussian radial basis function is one of the commonly used kernel functions, and the one employed in this article because no additional parameters are required:
$$K(x_j, v_i) = \exp\left( -\frac{\| x_j - v_i \|^2}{\sigma^2} \right)$$
where $x_j$ denotes an instance, $v_i$ stands for a cluster center, and $\sigma^2$ represents the variance of the instances.
As it is a direct extension of FCM, the algorithm procedure is omitted here. However, note that, adapted from its original, the underlying objective function is now
$$J(U, V) = 2 \sum_{i=1}^{K} \sum_{j=1}^{N} u_{ij}^{w} \left( 1 - K(x_j, v_i) \right)$$
with (7) and (8), respectively, transformed to
$$u_{ij} = \frac{\left( 1 - K(x_j, v_i) \right)^{-\frac{1}{w-1}}}{\sum_{l=1}^{K} \left( 1 - K(x_j, v_l) \right)^{-\frac{1}{w-1}}}, \qquad v_i = \frac{\sum_{j=1}^{N} u_{ij}^{w} K(x_j, v_i) \, x_j}{\sum_{j=1}^{N} u_{ij}^{w} K(x_j, v_i)}.$$

5) Suppressed Fuzzy C-Means (S-FCM) [27]: This is another extension of FCM, motivated by improving its convergence speed [28]. It is developed on the basis of the rival-checked fuzzy c-means clustering algorithm [29], which speeds up FCM with a competitive learning capacity. The underlying mechanism of this approach is to magnify the largest membership degree $u_{pj}$ while suppressing the others. To achieve this, a membership modifying mechanism is added after each iterative update of the membership degrees $U$, such that
$$u_{pj} \leftarrow 1 - \alpha \sum_{i \ne p} u_{ij} = 1 - \alpha + \alpha u_{pj}, \qquad u_{ij} \leftarrow \alpha u_{ij} \ \ (i \ne p)$$
where $u_{pj} > u_{ij}$ for all $i \ne p$, and $0 \le \alpha \le 1$.
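The suppression step can be sketched in isolation as a small helper acting on one instance's (normalized) membership row; the function name is illustrative.

```python
def suppress(row, alpha):
    """S-FCM modification of one membership row: magnify the largest
    membership degree u_pj and suppress all others by the factor alpha."""
    p = max(range(len(row)), key=row.__getitem__)   # index of the winner
    out = [alpha * u for u in row]                  # suppress the rivals
    # for a normalized row this equals 1 - alpha + alpha * u_pj
    out[p] = 1.0 - alpha * (sum(row) - row[p])
    return out
```

Setting `alpha = 1` leaves the row unchanged (plain FCM), while `alpha = 0` collapses it to a crisp assignment, which is what accelerates convergence.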

III. INTERPOLATION WITH KCR
From the specification as well as the application of TSK+, it is easy to see that those rules nearest to an unmatched observation generally have a much higher similarity degree than the others. This indicates that the interpolated outcomes may be mainly determined by those closest rules, with the rest typically contributing substantially less. Note that, in the FRI literature, it is often assumed that the interpolated conclusion is estimated by a certain aggregation of the rules neighboring the observation [4]. That is, the nearest rules are (normally correctly) considered to contain the most relevant information, while those rules far away from the observation are less relevant. Indeed, distant rules may introduce adverse biases into the results, with their use becoming counterproductive. As far-away rules generally have relatively small similarity measures against the observation, such biases do not necessarily impose much influence upon the interpolated results, but they do induce significant computational overheads if there are many rules in the rule base (despite its incompleteness). Thus, such biases, and the use of remote rules, should be minimized, for both the efficiency and the effectiveness of the interpolative reasoning process.
To address the aforementioned issue, a significantly revised inference procedure, termed KCR, is introduced in this section. The underpinning idea is that only the K nearest neighboring rules to a given unmatched input are exploited in performing the interpolation, rather than involving all the rules in the sparse rule base.

A. KCR Algorithm
Without losing generality, let a TSK sparse fuzzy rule base consist of $m$ rules, each involving $n$ antecedent variables and being defined by
$$R_i: \text{IF } x_1 \text{ is } A_{i1} \text{ and } \cdots \text{ and } x_n \text{ is } A_{in}, \text{ THEN } y = a_{i0} + a_{i1} x_1 + \cdots + a_{in} x_n \quad (14)$$
where $A_{i1}, \ldots, A_{in}$ are the fuzzy sets, respectively, taken by the rule antecedent variables $x_1, \ldots, x_n$, and $a_{i0}, a_{i1}, \ldots, a_{in}$ are the parameters specifying the polynomial of the rule's consequent.
Given an observation $O(B_1, \ldots, B_n)$, the KCR algorithm can be summarized as shown in Algorithm 1. Its main steps are:
1) Computation of the distances between $O$ and each rule $R_i$ in the rule base, over the representative values of their antecedents.
2) Selection of the $K$-nearest rules using Quickselect [30] (which is exploited purely for efficiency).
3) Computation of similarity measures $S_i$ between $O$ and each $R_i$ that is one of the selected $K$-nearest rules.
4) Integration of the resultant $K$ similarities to obtain an interpolated rule with consequent parameters
$$a_k = \sum_{i=1}^{K} \frac{S_i}{\sum_{l=1}^{K} S_l} \, a_{ik}, \quad k = 0, 1, \ldots, n.$$
5) Execution of the interpolated rule to yield the interpolated consequent
$$f(B_1, \ldots, B_n) = a_0 + a_1 B_1 + \cdots + a_n B_n.$$
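A minimal sketch of this procedure is given below. It is illustrative rather than the paper's implementation: rules are represented by the representative values of their antecedents plus their consequent coefficients, `heapq.nsmallest` stands in for the Quickselect step, and the simple inverse-distance similarity (Similarity-d, introduced in the next subsection) is assumed.

```python
import heapq
import math

def kcr(rules, obs, k):
    """KCR sketch: interpolate from the K rules nearest to an unmatched observation.

    rules: list of (rep, coeffs), where rep is the vector of representative
    values of a rule's antecedent fuzzy sets and coeffs = [a0, a1, ..., an].
    obs: representative values of the observation.
    """
    # steps 1-2: distances to all rules, then the K nearest (Quickselect stand-in)
    nearest = heapq.nsmallest(k, rules, key=lambda r: math.dist(r[0], obs))
    # step 3: Similarity-d between the observation and each selected rule
    sims = [1.0 / (1.0 + math.dist(r[0], obs)) for r in nearest]
    total = sum(sims)
    # step 4: similarity-weighted integration of the consequent polynomials
    coeffs = [sum(s * r[1][i] for s, r in zip(sims, nearest)) / total
              for i in range(len(nearest[0][1]))]
    # step 5: execute the interpolated rule on the observation
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], obs))
```

For example, with two neighboring rules sharing the consequent $y = 2x$, an unmatched observation between them is interpolated back onto that same polynomial, while a distant third rule is never fired.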

B. Similarity Measurement
Algorithm 1 works via the employment of similarity measures. To offer flexibility in practical applications, as well as to provide an opportunity for comparative evaluation of the algorithm, two distinct similarity measurement methods are introduced here.

1) Similarity With Distance Factor (Similarity-DF): This is the same similarity measurement as that utilized in TSK+. It is a revised version of the measurement presented in [16], with a distance factor (DF) employed to reinforce its sensitivity. Note that, as empirically shown in the literature (e.g., [31]), the choice of membership function type has little impact upon the outcomes of fuzzy rule-based inference, provided that the membership functions are appropriately tuned with training data. From this observation, and also for computational simplicity, this work utilizes triangular membership functions to represent fuzzy values unless otherwise stated.
For illustration, suppose that two normalized fuzzy sets are given, represented by the triangular membership functions $A = (a_1, a_2, a_3)$ and $A' = (a'_1, a'_2, a'_3)$ [32]. Then, the similarity degree $S(A, A')$ between these two fuzzy values is defined and computed by
$$S(A, A') = \left( 1 - \frac{1}{3} \sum_{i=1}^{3} |a_i - a'_i| \right) \cdot DF, \qquad DF = 1 - \frac{1}{1 + e^{-s \cdot d + \beta}}$$
where $d$ represents the Euclidean distance between the gravity centers (namely, the representative values [9]) of the two fuzzy values, $s$ is an adjustable parameter that determines the sensitivity of the similarity measure to the distance (the larger the value of $s$, the more sensitive $DF$ is to $d$), and $\beta$ is a sufficiently large integer (which is empirically set to 5) to ensure that $DF$ is approximately normalized to 1 when $d$ becomes 0.
2) Similarity Based on Distance Measures (Similarity-d): As one of the most widely applied similarity measurement methods [33], this measures the similarity between two fuzzy sets directly through the inverse of the distance metric between them:
$$S(A, A') = \frac{1}{1 + d}$$
where, again, $d$ represents the Euclidean distance between the representative values of $A$ and $A'$. Following the abovementioned specifications of these similarity measurement methods, it is clear that the larger the value of $S(A, A')$, the nearer and hence the more similar the two fuzzy sets $A$ and $A'$. In particular, $S(A, A')$ reaches the maximum value if and only if $A$ and $A'$ are identical. Owing to their generality, both are effective and applicable to capture and reveal the similarities between fuzzy sets (as experimentally verified later, Similarity-DF generally performs better than Similarity-d, at the cost of a little extra computation).
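The two measures can be sketched for triangular fuzzy sets as follows; this is an illustrative rendering under the common convention that the representative value of a triangle is the mean of its three parameters, with the default $s = 1$ and $\beta = 5$ only as placeholder settings.

```python
import math

def rep(A):
    """Representative value (gravity center) of a triangular fuzzy set (a1, a2, a3)."""
    return sum(A) / 3.0

def sim_df(A, B, s=1.0, beta=5.0):
    """Similarity-DF between two normalized triangular fuzzy sets."""
    d = abs(rep(A) - rep(B))
    # distance factor: close to 1 when d = 0, decays as d grows
    df = 1.0 - 1.0 / (1.0 + math.exp(-s * d + beta))
    return (1.0 - sum(abs(a - b) for a, b in zip(A, B)) / 3.0) * df

def sim_d(A, B):
    """Similarity-d: inverse-distance similarity between representative values."""
    return 1.0 / (1.0 + abs(rep(A) - rep(B)))
```

As expected, identical sets score 1 under Similarity-d (and approximately 1 under Similarity-DF, by the choice of $\beta$), and both scores decrease monotonically as the sets move apart.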

C. KCR Complexity
In Algorithm 1, two distance metrics are exploited. The first is for the use of Euclidean metric to efficiently determine nearest rules, without resorting to the more complicated similarity measurement. The second utilization takes place to find the similarity measures. However, in this latter use, it only plays a small part in helping capture the essential relationships between an observation and the rules. Importantly, the similarity measurement and hence, the second round of application of distance metric is only applied K times for K selected rules rather than for all the rules in the rule base. Typically, K is substantially smaller than the number of the rules available. As such, the proposed approach significantly reduces the running time that would otherwise have to be taken if TSK+ is applied. Furthermore, the employment of Quickselect helps reduce the effort required to select the nearest rules.
Indeed, given $m$ rules each involving $n$ antecedent attributes, the time complexity of KCR is $O(mK + nK)$, where $O(mK)$ is the time complexity taken to implement the selection of the $K$ closest rules. In comparison, TSK+ has a complexity of $O(mn)$, since all rules are fired to derive the final conclusion. Note, however, that generally $K$ is much smaller than $m$ or $n$. Thus, the proposed approach has a significantly lower time complexity.

IV. INTERPOLATION WITH K CRC
KCR is efficient. However, when KCR is applied to solve problems that involve a large-sized sparse rule base, the K nearest rules with the greatest similarity degrees may be rather more similar amongst themselves than to the rest. This may be expected intuitively, as illustrated in Fig. 2. Thus, if only these K rules are taken to implement interpolation, the results will also be similar to the linear combination of their consequents, regardless of what the similarity measures may be. Of course, this potential problem is not unique to KCR; it can arise in TSK+ as well, despite the fact that all rules are involved in rule interpolation there. This is simply because the similarities of the K nearest rules are much larger than those measured over the rest. That is, the final interpolated result in TSK+ is also mainly determined by those nearest rules.
The abovementioned analysis prompts the need to extend the diversity of rules selected for use in the rule interpolation process, in an effort to avoid the involvement of far too many similar rules. Driven by this consideration, a clustering-aided interpolative reasoning process is presented here, termed CRC hereafter.
In CRC, to maximize computational efficiency, all fuzzy values appearing in the rule antecedents are approximately represented using their representative values, and the rules are first clustered into different groups over these values, by the use of a clustering method. Rules in the same cluster can be intuitively regarded as containing similar information about the mappings between the antecedents and the consequents; after all, they have been deemed to belong to the same cluster. Then, K clusters nearest to an unmatched observation are selected, with K being a small number, where the distance between a cluster and an observation is determined by the Euclidean distance between the cluster center and the observation. From each selected cluster, the rule that is the closest to the observation is then taken as an element of a set of K nearest rules to be used for interpolation. Thus, rules that do not necessarily have the highest similarity measures are able to contribute to the creation of the final interpolated consequent.
Note that through the aforedescribed rule selection process, the rule that is of the overall closest distance to the observation is always included to participate in rule interpolation. This is obvious as it always is the representative of a certain cluster of rules since it has the greatest similarity to the observation amongst the entire rule base. Fig. 3 presents an illustrative example of rule clusters and rules selected from the closest clusters that are produced by the CRC algorithm.

A. CRC Algorithm
The abovementioned intuition for the development of the CRC procedure is summarized as given in Algorithm 2, where the similarity measurements are obtained in the same way as with KCR. In this algorithm, it is assumed, without losing generality, that a sparse rule base containing $m$ rules [with each specified as per (14)] and an observation $O(B_1, \ldots, B_n)$ (which does not match any of the rules) are given. Note that any of the five clustering algorithms outlined in Section II-C (and indeed many other clustering methods if preferred) can be employed here to perform rule clustering. The main steps of Algorithm 2 are:
1) Clustering of the $m$ rules into groups over the representative values of their antecedents.
2) Selection of the $K$ clusters whose centers are closest to $O$.
3) Selection, from each chosen cluster, of the rule closest to $O$, returning $K$ rules and their corresponding similarities.
4) Integration of the $K$ similar rules to obtain an interpolated rule, resulting in consequent parameters
$$a_k = \sum_{i=1}^{K} \frac{S_i}{\sum_{l=1}^{K} S_l} \, a_{ik}, \quad k = 0, 1, \ldots, n.$$
5) Execution of the interpolated rule with $O$, resulting in the final consequent outcome $f(B_1, \ldots, B_n) = a_0 + a_1 B_1 + \cdots + a_n B_n$, which is returned.
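A minimal sketch of the CRC procedure is given below. It is illustrative only: rules carry the representative values of their antecedents plus consequent coefficients, cluster labels are assumed to be supplied by any separately run clustering algorithm (as the section allows), and Similarity-d is used as the similarity measure.

```python
import math
from collections import defaultdict

def crc(rules, labels, obs, k):
    """CRC sketch: pick the K closest rule clusters, one closest rule per cluster.

    rules: list of (rep, coeffs) as in KCR; labels[i] is the cluster label of
    rule i; obs is the vector of representative values of the observation.
    """
    clusters = defaultdict(list)
    for r, l in zip(rules, labels):
        clusters[l].append(r)

    def center(members):
        dims = len(members[0][0])
        return [sum(m[0][d] for m in members) / len(members) for d in range(dims)]

    # K clusters whose centers are closest to the observation
    ranked = sorted(clusters.values(), key=lambda ms: math.dist(center(ms), obs))[:k]
    # one rule per selected cluster: the member closest to the observation
    chosen = [min(ms, key=lambda r: math.dist(r[0], obs)) for ms in ranked]
    sims = [1.0 / (1.0 + math.dist(r[0], obs)) for r in chosen]
    total = sum(sims)
    # similarity-weighted integration, then execution of the interpolated rule
    coeffs = [sum(s * r[1][i] for s, r in zip(sims, chosen)) / total
              for i in range(len(chosen[0][1]))]
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], obs))
```

Note how, with `k = 1`, only the overall closest cluster contributes, whereas larger `k` draws in one representative rule from each additional cluster, which is exactly the diversity mechanism motivating CRC.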

B. CRC Complexity
The time complexity of the proposed CRC procedure is $O(KC + KG + nK)$, where $O(KC)$ and $O(KG)$ represent the complexity incurred for the selection of $K$ clusters and that for the selection of $K$ rules (one from each cluster), respectively; $C$ is the number of clusters; and $G$ is the largest number of rules contained within any single cluster. Compared with KCR, whose time complexity is $O(mK + nK)$, CRC can also help reduce the computational cost incurred to perform similarity measurement. CRC does not require the computation of the distances between the observation $O$ and all given rules, but only those from $O$ to the centers of the clusters and those to the rules in the $K$ selected clusters. In so doing, $O(KC + KG)$ is, in general, smaller than $O(mK)$. In other words, the time complexity of CRC can generally be lower than that of KCR. Thus, the complexity of the proposed approach is further reduced in comparison to that of TSK+.

C. Integration of KCR and CRC
Both the KCR and CRC procedures can be integrated into a single algorithm, in conjunction with the conventional inference mechanism for TSK models. This integration is straightforward, as shown in Fig. 4. It works by using just a small number of closest rules to infer the final conclusion whenever the conventional method fails because an observation matches no rules.

V. EXPERIMENTAL EVALUATION
The performance of the abovementioned novel FRI approach for TSK models is experimentally evaluated, in comparison to the state-of-the-art techniques, including the aforementioned TSK+ and the automated rule selection (AutoRS) based method [15], over ten benchmark datasets. The robustness and effectiveness of the presented approach are also demonstrated by observing the consistency and efficiency of utilizing different clustering methods in supporting CRC.

A. Experimental Setup

1) Datasets Used:
The datasets used include one nonlinear mathematical model and nine real-world benchmark datasets (for regression problems) that have been taken from the UCI machine learning [34], function approximation [35], and evolutionary learning repositories [36]. The details of these benchmark datasets are summarized in Table I. For illustration, the threshold for determining large-sized sparse rule bases is empirically set to 95 rules, in order to help evaluate the performance of KCR and CRC in relation to the sizes of the sparse rule bases concerned. Note that the Polynomial dataset in the table is produced by randomly sampling from a 3-D nonlinear function. This nonlinear function has been used to produce a benchmark dataset in [14] and [37], and the random sampling method has been frequently employed in the literature (e.g., [38] and [39]).
2) Performance Evaluation Criteria: To enable thorough evaluation and fair comparison, experimental results are reported using the averages obtained from 10×10-fold cross-validation per dataset. Training sets are used to create sparse fuzzy rule bases (see below) and testing sets to assess the performance, in terms of the root-mean-square error (RMSE, in relation to the ground truth). The smaller the value of RMSE, the higher the accuracy of the approach.
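The evaluation metric can be stated concretely as a one-liner (the function name is illustrative):

```python
def rmse(pred, truth):
    """Root-mean-square error between predictions and the ground truth."""
    return (sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)) ** 0.5
```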
3) Sparse Rule Base Generation: In this experimental study, sparse fuzzy rule bases are artificially created from the dense fuzzy rule bases that are induced from the original datasets. This enables a challenging comparison of the FRI results (against those attainable using a full set of rules, although in real applications full coverage of the problem domain is not assumed). In particular, a sparse rule base is generated by randomly removing a number of rules from the original dense rule base that has been learned by employing a data-driven learning method. To emphasize the sparsity of the available knowledge, in order to compare against conventional approximate reasoning and state-of-the-art FRI mechanisms (both running on TSK models), only 80% of the rules are retained to form the sparse fuzzy rule base for each problem case.
The following simple data-driven fuzzy rule learning procedure is employed to generate the original dense rule base: the data instances in a given training dataset are clustered into different categories using FCM [23]. Since FCM allows an instance to belong to more than one cluster with different membership values, the worst-case rule-learning assumption is made here, with the least biased threshold of 0.5 membership used to determine whether an attribute takes on a certain fuzzy set as its value. The polynomial consequent of an emerging rule is learned through the popular linear regression approach as described in [40].
For computational simplicity, apart from fairness in comparison, as indicated previously, only triangular membership functions are used throughout to represent fuzzy values. The three parameters of a triangular membership function are implemented by the infimum, center, and supremum of the corresponding cluster. Note that, if fine-tuned membership functions are available and employed, improved performance attainable by all interpolation approaches examined can be expected. The sizes of the resulting sparse rule bases are also indicated in Table I. For easy illustration, consider the sparse rule base generated from the Polynomial dataset, whose inference outcomes are shown in Fig. 5. The two subfigures reflect the results viewed from different inspection angles of the same inference process. In particular, Fig. 5(a) gives a side view of the outcomes of running on the entire sparse rule base, and Fig. 5(b) shows a bird's-eye view. As there are substantial amounts of space that are not covered by the learned rules, plenty of observations have matched no rule, resulting in missing values in the output domain. These two subfigures collectively demonstrate the poor outcome of just exploiting the incomplete knowledge in the given problem domain, without the support of FRI.

4) Algorithmic Parameters:
For completeness, the parameters used to implement KCR and CRC in the experiments on the different datasets are listed in Table II. Note that the number of selected rules, K, for both KCR and CRC is determined by a trial-and-error process. In particular, the processes of determining K for KCR and CRC are exemplified on the polynomial dataset, as shown in Figs. 6 and 7, respectively. For CRC, the Elbow method is employed to determine the number of clusters required.
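The Elbow method is a heuristic and the paper does not specify its exact realization; one common implementation, sketched below under the assumption that the within-cluster SSE has been computed for a range of candidate cluster counts, picks the point of maximum perpendicular distance from the line joining the curve's endpoints.

```python
import numpy as np

def elbow_point(ks, sse):
    """Return the k at the elbow of an SSE-vs-k curve, taken as the point
    farthest (perpendicularly) from the line through the curve's endpoints."""
    ks = np.asarray(ks, float)
    sse = np.asarray(sse, float)
    p1 = np.array([ks[0], sse[0]])
    p2 = np.array([ks[-1], sse[-1]])
    d = p2 - p1
    d /= np.linalg.norm(d)                 # unit direction of the endpoint line
    pts = np.stack([ks, sse], axis=1) - p1
    proj = np.outer(pts @ d, d)            # projection of each point onto the line
    dist = np.linalg.norm(pts - proj, axis=1)
    return int(ks[np.argmax(dist)])
```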

5) Automated Rule Selection:
A key issue in this work is the determination of the number of neighboring rules, K, used to perform FRI. In the implementation, K may be a fixed number empirically justified by a trial-and-error process, as indicated earlier. However, AutoRS [15] provides a novel approach that automatically selects fuzzy rules for subsequent rule interpolation. Therefore, the proposed approach is also compared with AutoRS, which represents the most recent development in the area of FRI. The details of AutoRS are beyond the scope of this article, but its main procedure is summarized in Algorithm A4, presented in Appendix A.

B. Results and Discussion
In the following presentation of experimental results, the outcomes of cross-validation are summarized in tabular form (namely, Tables III-V), where the best results are shown in bold.

1) On Use of Different Similarity Measurement Methods: Table III presents the means and standard deviations of the interpolation results produced by KCR and CRC (with FCM employed as the clustering method). Each FRI mechanism is supported by either of the two similarity measurement methods introduced previously. As can be seen, the FRI algorithms employing Similarity-DF produce better results than those using Similarity-d on all but one dataset. In particular, the effectiveness of utilizing Similarity-DF is more obvious on the more complex datasets. However, note that KCR with Similarity-d leads to the best inference outcome on the Dee dataset. One possible reason is that the selected rules have similar weights concerning the conclusion while their distances to the observation differ. Given the generally superior results of Similarity-DF, only Similarity-DF is adopted to measure similarity in the remaining experimental investigations.
2) On Rule Bases of a Small Size: Table IV shows the means and standard deviations of the FRI results, averaged over 10×10-fold cross-validation, for each of the eight compared approaches on the datasets involving 90 rules or fewer. In this table, the notation "CRC with C" stands for a procedure that employs the clustering method "C" to create rule clusters, with "C" being one of the following: K-means, GMM, FCM, K-FCM, or S-FCM. The comparison with conventional TSK models is not included here, because TSK models alone cannot derive any conclusion when an observation matches none of the rules in the rule base. Naturally, if compared, all rule interpolation algorithms would significantly outperform the direct utilization of TSK models involving a sparse rule base across all problems.
As can be seen from Table IV, for models comprising a sparse rule base of a small size, on average, the overall best results are obtained by KCR. AutoRS also works well, and TSK+ has slightly lower accuracy (despite its higher computational complexity), because all rules are involved, introducing adverse biases.
Independent of which of the five clustering methods is used, the CRC algorithm does not work well on small-sized sparse fuzzy rule bases. This is to be expected, as the potentially highly relevant rules are most likely to have been clustered into one single cluster, while the other clusters of rules offer little useful information towards the conclusion. Thus, all rules bar one contribute misleading information to the calculation of the final results, leading to inaccurate interpolated outcomes.
3) On Rule Bases of a Large Size: Table V presents the means and standard deviations of the interpolation results returned by each of the eight compared methods, regarding the five TSK models involving sparse rule bases of a large size. All algorithms are ranked according to the means and standard deviations of their results on each dataset, with identical results sharing the same rank. The lower the ranking value, the higher the model accuracy. The bottom row of the table gives the total rank values, calculated as the sum of the individual ranks across all seven benchmark datasets.
Clearly, CRC consistently outperforms the rest on these more complex datasets. Examining the experimental results more closely, CRC supported by K-FCM attains highly impressive results on the Quake, Delta_ail, and Delta_elv datasets, significantly improving upon the accuracy of the existing methods. Interestingly, this indicates that the kernel function [see (9)] applied in K-FCM can make the clustering results more suitable for the distribution of the corresponding sparse rule bases. As reflected by these experimental outcomes, the utilization of any of the five clustering algorithms enables CRC to outperform the algorithms that do not involve clustering on the sparse rule bases, when these rule bases are of a significant size. This demonstrates the significance of clustering-aided FRI. Indeed, these results positively reflect the intuition that crowds of similar rules have a negative impact upon the accuracy of the final interpolated conclusions. With the support of clustering, CRC successfully avoids involving far too many similar rules and extends the diversity of the rules used for subsequent interpolation. These results also confirm that TSK+, which exploits all (the sparse) rules to derive the final outcome, does not resolve the problem well. Importantly, the narrow-banded standard deviation values shown in Table V demonstrate the robustness of the proposed approach.
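The rule-selection step that distinguishes CRC can be sketched as follows, assuming each rule is summarized by a representative vector and that cluster labels have already been produced by any of the clustering algorithms considered above; the Euclidean metric and all names are illustrative assumptions.

```python
import numpy as np

def crc_select(rule_reps, labels, observation, k_clusters=3):
    """Pick the k_clusters rule clusters whose centroids lie closest to the
    observation, then the single closest rule from each, so that the rules
    fed into interpolation are diverse rather than near-duplicates."""
    rule_reps = np.asarray(rule_reps, float)
    labels = np.asarray(labels)
    obs = np.asarray(observation, float)
    centroids = {c: rule_reps[labels == c].mean(axis=0) for c in np.unique(labels)}
    nearest = sorted(centroids, key=lambda c: np.linalg.norm(centroids[c] - obs))
    selected = []
    for c in nearest[:k_clusters]:
        members = np.where(labels == c)[0]
        dists = np.linalg.norm(rule_reps[members] - obs, axis=1)
        selected.append(int(members[np.argmin(dists)]))
    return selected
```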

4) Further Examination on Performance:
In general, CRC performs excellently for large-sized sparse rule bases, as demonstrated earlier. However, to systematically exploit the proposed approach as per Fig. 4, a mechanism is required to determine the threshold at which to decide whether KCR or CRC is to be used (although, when it is unrealistic to identify such a threshold, both methods may be applied to provide suggestions that are still useful for interpolative decision-making). According to Table I, the sparse rule bases for the stock and laser datasets contain 90 and 100 rules. Comparing Tables IV and V against the outcomes achieved on the other datasets, the algorithms with and without clustering produce more similar results on these two datasets. Therefore, borrowing the underlying idea of the Elbow method, the threshold for large-sized sparse rule bases can be set at 95 rules (the average of 90 and 100). This is the empirical basis upon which the preceding experiments have been conducted.
Apart from the issue of determining the threshold, there are occasional situations where the generality of CRC outperforming KCR for large datasets does not hold. This is evident in the cases where CRC supported by K-FCM or S-FCM is utilized to carry out interpolative reasoning with the sparse rule bases induced from the polynomial dataset. Fig. 8 depicts the distribution of the results in a box-plot, across all methods investigated. As revealed by this figure, CRC with K-FCM or S-FCM significantly underperforms in comparison to CRC with K-means, GMM, or the original version of FCM. To analyze the causes of this, the rules selected by each method are listed in Table VI, represented by their indices. It can be seen that rule 36 is the nearest to the particular observation and is always selected by all rule selection methods. The rules selected by K-FCM and S-FCM are clearly distinct from the ones that lead to satisfactory results. The likely reason is that these two clustering algorithms fail to derive appropriate rule clusters on this dataset, adversely affecting the subsequent rule selection.
Another interesting observation is that the rules selected by KCR are the top three rules also selected by AutoRS. However, with another two rules selected (and employed) on top of these three, AutoRS performs relatively worse than KCR. Note that rule 21, selected by AutoRS, is also taken by top performers like CRC with GMM or FCM (both of which happen to utilize the same selected rules). However, AutoRS underperforms in comparison to these two CRC implementations. This is probably due to the fact that those three rules (of indices 36, 21, and 57) jointly offer the best information for producing accurate outcomes. Although rule 21 is also taken by AutoRS, it is treated as the one of the lowest weight (being the last of the five rules selected in order). Thus, its potential contribution is limited, while adding computation costs to reach the (less well) interpolated result.

Fig. 9. Inference results (a) using KCR and (b) using CRC with FCM, on the polynomial dataset.
To qualitatively visualize the strengths of the proposed approach, Fig. 9 illustrates the best inference results produced by the method without clustering (KCR) and the one with clustering (CRC with FCM), over the polynomial dataset. Viewed alongside Fig. 5, it can be seen that the novel approach introduced herein enables appropriate conclusions to be generated for data instances unmatched by the sparse rules. Additionally, for this particular dataset, CRC with FCM produces smoother results than KCR.

VI. CONCLUSION AND FUTURE WORK
This article has presented a novel approach for performing FRI with TSK models. The work has been motivated by the observation that existing FRI approaches are almost exclusively devised for reasoning with Mamdani models, while the very few developed for TSK models are inefficient. It makes the following two particular contributions to the FRI literature: 1) for models involving a sparse rule base of a small size, the implementation of the approach (KCR) derives accurate interpolation using only a small number of the rules closest to an unmatched input; and 2) for models comprising a sparse rule base of a large size, the implementation (CRC) first employs a clustering method to categorize the rules into groups and then performs interpolation by utilizing just one closest rule from each of a small number of the resulting rule clusters. This article has presented the results of systematic comparative experimental studies over a range of benchmark datasets, demonstrating the efficacy of both implementations.
This original work offers many opportunities for further development. For example, while five clustering methods have been considered to support the realization of CRC, they may not generate the most appropriate categories for all sparse rule bases. More clustering algorithms, particularly modified FCM algorithms (e.g., the global FCM [41] and the possibilistic FCM [42]), may be adopted as alternatives to strengthen performance. Another point regards the specification of the required algorithmic parameters, such as the number of closest rules and that of the nearest clusters. Currently, these are set empirically; creating an automated method to determine them from the training data requires significant further research. Also, as AutoRS offers a means for automated selection of the number of closest rules for interpolation, how it may be integrated with the proposed approach to minimize human intervention is worth investigating.
Another important issue is determining whether a given rule base is large or small, in order to facilitate an informed choice of which FRI method to use. While the size of a rule base is typically related to that of the dataset concerned, it may also be affected by a number of other factors (e.g., the number of domain features and the distribution of data objects). An automated mechanism to decide on a rule base's size is clearly desirable. In the present implementations, all rule antecedent variables are treated equally. This gives rise to a further interesting piece of active research, aiming to extend weighted representations, as per the most recent work of Li et al. [7] and Li et al. [13], to accommodate interpolation with TSK models. Furthermore, for many real-world problems, the inputs are usually time-dependent, and the requirements of fuzzy systems may change over time. Therefore, designing a novel system that can dynamically maintain and enrich a sparse fuzzy rule base is also desirable.

APPENDIX
The procedures of the fuzzy rule inference methods and their associated algorithms are outlined here for easy reference.

2) Computation of the overall similarity between each rule R_i and observation O: S(R_i, O) = S(A_i1, B_1) ∧ · · · ∧ S(A_in, B_n).
3) Integration of all similarity measures, obtaining the interpolated rule with the following consequent parameters.
4) Computation of the interpolated outcome with observation O as input to the interpolated rule: f(B_1, . . ., B_n) = a_0 + a_1 B_1 + · · · + a_n B_n.
5) Return: f(B_1, . . ., B_n).
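The final steps of the outlined procedure can be sketched as follows; the normalized similarity-weighted averaging used to integrate the consequent parameters is an assumption consistent with the outline, not necessarily the paper's exact formula.

```python
import numpy as np

def interpolate_tsk(consequents, similarities, observation):
    """Integrate the selected rules' consequent parameters using their
    similarities as normalized weights, then evaluate the interpolated
    polynomial a0 + a1*B1 + ... + an*Bn on the observation."""
    C = np.asarray(consequents, float)   # one row [a0, a1, ..., an] per rule
    w = np.asarray(similarities, float)
    w = w / w.sum()                      # normalized similarity weights
    a = w @ C                            # interpolated consequent parameters
    x = np.concatenate([[1.0], np.asarray(observation, float)])
    return float(a @ x)
```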

Algorithm A4: AutoRS for Interpolation.
Input: rule base {R_i}; observation O. Output: selected rule set for interpolation, U(R).
1) Initialization of candidate rule sets U(R_k): for each antecedent attribute, iteratively add rules from the nearest to the furthest until the emerging rule set satisfies either of the following conditions, where Rep(A_ik) stands for the representative value of the k-th antecedent fuzzy set of the i-th fuzzy rule, and Rep(B_k) for that of the k-th feature of the unmatched observation O.
2) Assignment of the initial U(R) to the largest candidate rule set.