Sentiment Classification of Drug Reviews Using Fuzzy-rough Feature Selection

Sentiment analysis mines people’s opinions and attitudes regarding a certain issue from source materials. Recently, it has drawn significant attention in a number of application areas. The sentiment analysis of healthcare in general, and of users’ drug experience in particular, could shed significant light on how to improve public health and support sound decision making. However, one of the major challenges in sentiment classification lies in the very large number of extracted features. Fuzzy-rough feature selection provides a means by which discrete or real-valued noisy data can be effectively reduced without human intervention. This paper proposes an implementation for automatic sentiment classification of drug reviews employing fuzzy-rough feature selection. Experimental results demonstrate that the employment of fuzzy-rough feature selection can indeed significantly reduce the complexity of the feature space and the classification run-time overheads while maintaining classification accuracy.


I. INTRODUCTION
Sentiment analysis, also called opinion mining, is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, events, and topics [1]. It is a type of subjectivity analysis which examines sentiment in a given textual unit with the objective of understanding the sentiment polarities (i.e., positive, negative, or neutral) of the opinions toward various aspects of a subject [2]. A lot of sentiment analysis work has been done on reviews of electronic products, movies, and restaurants, but not extensively in the public health and medical domains, possibly owing to concerns regarding privacy and ethical issues [2], [3], [4]. However, with the ever-increasing popularity of social media, online communities and forums make it possible for users to express their experiences and opinions on drugs anonymously and freely. Such reviews cover multiple aspects, including effectiveness, side effects, conditions, costs and dosages, and can be leveraged to obtain valuable insights to improve general public health and medical care [5].
Analyzing sentiments concerning various aspects of drug reviews can provide a wealth of information regarding user preferences and experiences, which helps medical professionals make informed decisions, e.g., through strengthened public health monitoring [5]. With the popularity of clinical decision support systems such as therapy recommender systems, which aim at helping to find an optimal personalized therapy option for a given patient, both clinicians and the systems will greatly benefit from patients' feedback on therapy [6]. The sentiment analysis of post-marketing drug surveillance on the effectiveness and potential risks of adverse drug reactions also plays a major role in drug safety once a drug has been released [7].
Sentiment classification can generally be handled with a number of different approaches, including lexicon-based and machine learning-based ones. A fundamental requirement for the lexicon-based approach is a readily available list of pre-labeled words of sentiment expressions in natural language text [4]. All the words in an unknown text are then compared to the words in the predefined lexicon, resulting in a polarity score computed as the difference between the numbers of positively and negatively assigned words [8]. However, lexicon-based techniques often do not consider the possibly different meanings of words in different contexts. They usually underperform machine learning-based methods, which implement sentiment analysis with a computational model learned by converting raw text into numerical features, typically via the bag of words [9] or alternative mechanisms such as the term frequency-inverse document frequency method [10].
One of the major challenges in sentiment classification, especially for machine learning techniques, lies in the very large number of extracted features that may be irrelevant, redundant or even misleading [11]. Feature selection is an important step in sentiment analysis that aims at selecting an optimum feature subset without altering the original meanings of the features [12]. It helps minimise noise and redundancy in the feature space, which would otherwise increase the likelihood of model overfitting and prevent quality features from being incorporated [13]. Efficiency is another benefit of feature selection. Having a reduced number of features helps to reduce run-time overheads, which also implies relaxed memory and storage requirements. In particular, fuzzy-rough feature selection (FRFS) [14], [15] provides a means by which discrete or real-valued noisy data (or a mixture of both) can be effectively reduced without the need for user-supplied information. It has been developed and applied in a number of applications (e.g., [16], [17]). However, little attention has been paid to the application of FRFS to drug review analysis.
Inspired by the above observations, this paper proposes an approach for the automatic sentiment classification of drug reviews employing FRFS. In particular, the proposed work first extracts a list of significant words as features from the original drug review documents, whereby the bag of words and the term frequency-inverse document frequency techniques are employed for the generation of discrete and real-valued features, respectively. FRFS is then exploited to identify a minimal representation of the original information by returning a subset of the previously extracted features, which is subsequently utilised as input to a number of popular classification algorithms. The case study is conducted on data collected from a popular website that provides drug information to both consumers and healthcare professionals, from which users' reviews on drug experience and the associated ratings are retrieved. The ratings are converted to three sentiment labels, i.e., positive, neutral and negative.
The remainder of this paper is organised as follows. Section II introduces the background of recent advances on drug review analysis and the basic concepts of FRFS. Section III describes the pipeline whereby FRFS is employed to perform sentiment analysis on drug reviews. Section IV presents and discusses comparative experimental results. Section V concludes the paper and outlines ideas for further development.
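The lexicon-based polarity score described above can be illustrated with a minimal sketch. The two word lists and the function names are hypothetical placeholders introduced only for illustration; a real system would rely on a curated sentiment lexicon.

```python
# Hypothetical toy lexicon (an assumption, not taken from the cited work).
POSITIVE = {"effective", "relief", "better", "great"}
NEGATIVE = {"nausea", "worse", "pain", "dizzy"}


def polarity_score(text):
    """Return (#positive words) - (#negative words) found in the text."""
    tokens = text.lower().split()
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def polarity_label(score):
    """Map the signed score to one of three polarity labels."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Because each word is scored in isolation, a phrase such as "no pain" would still count against the review, which is the context insensitivity noted above.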

II. BACKGROUND
Owing to the significance of mining drug reviews, which could benefit various healthcare stakeholders, a number of advancements have been reported in the recent literature. For example, a framework [8] has been developed to track user experiences regarding drugs and cosmetics on social media data, where learning classifiers are utilised to predict sentiment orientations. To account for the fact that a single sentence may contain multiple clauses discussing multiple aspects of a drug, a clause-level sentiment analysis algorithm [2] has been introduced, which adopts a pure linguistic approach based on prior sentiment scores assigned to individual words, while simultaneously taking the grammatical relations between, and semantic annotation of, words into consideration. Apart from sentiment analysis concerning overall satisfaction, the side effects and effectiveness reported in user reviews on drugs have also been investigated [5]. Multiple facets of sentiment in the context of medicine, including drug reviews, are characterised in [4], whereby a quantitative assessment is conducted with respect to word usage and sentiment distribution.
However, none of the existing literature in the relevant areas considers feature selection techniques that may extract feature subsets of minimal knowledge representation without degrading the performance of the learning classifiers. Yet, much work has been established on feature selection in general. In particular, FRFS has originated from the development of rough set theory, which is able to find a minimal knowledge representation through data-driven learning without thresholds or expert knowledge [15]. Being complementary to rough sets, which are concerned with indiscernibility, fuzzy set theory is concerned with vagueness and has been successfully used in a wide range of domains [18], [19], [20], [21]. The hybridization of both theories, especially in the area of feature selection, has led to the popularity of FRFS with robust solutions. Specifically, a fuzzy-rough set is defined by two fuzzy sets, i.e., a fuzzy lower and a fuzzy upper approximation, obtained by extending the corresponding crisp rough set notions. In the crisp case, elements either belong to the lower approximation with absolute certainty or not at all. In the fuzzy-rough case, elements may have a membership in the range [0, 1], allowing greater flexibility in handling uncertainty.
Let $IS = (U, A)$ be an information system, where $U$ is a nonempty finite set of objects (the universe) and $A$ is a nonempty finite set of attributes such that $a : U \to V_a$ for every $a \in A$, where $V_a$ is the set of values that the attribute $a$ may take. For decision systems, $A = C \cup D$, where $C$ is the set of input features and $D$ is the set of decision features. The fuzzy lower and upper approximations are defined as:
$$\mu_{R_B \downarrow X}(x_i) = \inf_{x_j \in U} I\big(\mu_{R_B}(x_i, x_j), \mu_X(x_j)\big),$$
$$\mu_{R_B \uparrow X}(x_i) = \sup_{x_j \in U} T\big(\mu_{R_B}(x_i, x_j), \mu_X(x_j)\big),$$
where $X$ is the fuzzy concept being approximated, $I$ is a fuzzy implicator, $T$ is a t-norm, $R_B$ is the fuzzy similarity relation induced by the subset of features $B$, and $x_i, x_j \in U$ are two arbitrary objects. In particular,
$$\mu_{R_B}(x_i, x_j) = T_{a \in B}\{\mu_{R_a}(x_i, x_j)\},$$
where $\mu_{R_a}(x_i, x_j)$ is the degree to which the objects $x_i$ and $x_j$ are similar for the feature $a \in A$. Many similarity relations can be constructed for this purpose, for example:
$$\mu_{R_a}(x_i, x_j) = \exp\left(-\frac{(a(x_i) - a(x_j))^2}{2\sigma_a^2}\right),$$
where $\sigma_a^2$ is the variance of the feature $a$, and $a(x_i)$ is the value of $a$ for the object $x_i$. The choices for $I$, $T$ and the fuzzy similarity relation have great influence upon the resultant fuzzy partitions.
FRFS employs a quality measure termed the fuzzy-rough dependency function $\gamma_B(Q)$, which measures the dependency between two sets of attributes $B$ and $Q$ and is defined by:
$$\gamma_B(Q) = \frac{\sum_{x \in U} \mu_{POS_{R_B}(Q)}(x)}{|U|},$$
where the fuzzy positive region, which contains all objects of $U$ that can be classified into classes of $U/Q$ using the information in $B$, is defined as:
$$\mu_{POS_{R_B}(Q)}(x) = \sup_{X \in U/Q} \mu_{R_B \downarrow X}(x).$$
That is to say, $\gamma_B(D)$ may be viewed as a measure of quality for a given feature subset $B \subseteq C$ with respect to the set of decision features $D$. A fuzzy-rough reduct $R$ can then be defined as a subset of features that preserves the dependency degree of the entire data set, i.e., $\gamma_R(D) = \gamma_C(D)$.
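To make the dependency measure more concrete, the following is a minimal pure-Python sketch of how the fuzzy-rough dependency of a feature subset could be computed for crisp class labels. It rests on several assumptions that are not fixed by the text: a Gaussian similarity relation scaled by the feature variance, the minimum t-norm for combining features, and the Łukasiewicz implicator $I(x, y) = \min(1, 1 - x + y)$. The function names are illustrative only and do not correspond to any particular library.

```python
import math
from statistics import pvariance


def feature_similarity(column, i, j, variance):
    """Gaussian similarity of objects i and j on a single feature (assumed form)."""
    if variance == 0:
        return 1.0  # a constant feature cannot distinguish any two objects
    return math.exp(-((column[i] - column[j]) ** 2) / (2 * variance))


def similarity_relation(rows, subset):
    """mu_{R_B}(x_i, x_j): minimum t-norm over the features in the subset B."""
    n = len(rows)
    columns = {a: [row[a] for row in rows] for a in subset}
    variances = {a: pvariance(columns[a]) for a in subset}
    return [
        [
            min(
                (feature_similarity(columns[a], i, j, variances[a]) for a in subset),
                default=1.0,  # empty subset: every pair is fully similar
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


def dependency(rows, labels, subset):
    """gamma_B(D): mean membership of all objects in the fuzzy positive region."""
    n = len(rows)
    relation = similarity_relation(rows, subset)
    classes = set(labels)
    total = 0.0
    for i in range(n):
        positive = 0.0
        for c in classes:
            # Lower approximation with the Lukasiewicz implicator (assumption).
            lower = min(
                min(1.0, 1.0 - relation[i][j] + (1.0 if labels[j] == c else 0.0))
                for j in range(n)
            )
            positive = max(positive, lower)  # sup over the decision classes
        total += positive
    return total / n
```

With these choices, a feature that separates the classes well yields a dependency close to 1, whereas a feature on which objects of different classes look alike yields a dependency close to 0.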

III. SENTIMENT CLASSIFICATION OF DRUG REVIEWS
This section describes the major modules of the framework developed for sentiment classification of drug reviews using FRFS. Figure 1 depicts the flow chart of the general pipeline. This includes the learning phase, where the sentiment classification model is trained with a collection of online user reviews on drug experience. In particular, a preprocessing step is performed on the training data to produce clean documents from which text features can be extracted. FRFS is then executed to search for a minimal knowledge representation. The subset of features returned by FRFS is fed into a given learning algorithm, which performs classifier learning. Once trained, only those features selected by FRFS during the training phase are utilized in the learned sentiment model. Within the application or testing phase, the functionality of the preprocessing module is the same as that of its counterpart in the training phase, but the feature extraction module is simpler as it only needs to produce those features selected by FRFS during training.

A. Pre-processing
Once the raw drug reviews are collected, a number of preprocessing steps are necessary to generate clean documents for further processing. These include the following:
1) Tokenize the reviews such that each review is represented as a collection of words for text analysis;
2) Convert all text data to lowercase, so that words in different cases are treated the same to remove redundancy;
3) Erase punctuation and symbols, which can safely be ignored without sacrificing the meaning of the sentence;
4) Remove stop words such as 'and' and 'the' that do not add much meaning to a sentence;
5) Lemmatize the words to reduce them to their dictionary forms such that, for example, 'am', 'are' and 'is' are all converted to 'be'.
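The five steps above can be sketched as a small pipeline. This is only an illustration under simplifying assumptions: the stop-word list and the lemma lookup table are tiny hypothetical stand-ins for the full linguistic resources that a real implementation would use, and tokenization is reduced to whitespace splitting.

```python
import re

# Hypothetical, deliberately tiny resources used only for illustration.
STOP_WORDS = {"and", "the", "a", "of", "to"}
LEMMAS = {"am": "be", "are": "be", "is": "be", "pills": "pill"}


def preprocess(review):
    """Return the cleaned list of lemmas for a single raw review."""
    tokens = review.split()                                 # 1) tokenize
    tokens = [t.lower() for t in tokens]                    # 2) lowercase
    tokens = [re.sub(r"[^a-z0-9]", "", t) for t in tokens]  # 3) strip punctuation
    tokens = [t for t in tokens if t]                       # drop emptied tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]     # 4) remove stop words
    return [LEMMAS.get(t, t) for t in tokens]               # 5) lemmatize
```

For instance, `preprocess("The pills ARE great, and the pain is gone!")` yields the tokens `pill`, `be`, `great`, `pain`, `be`, `gone`.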

B. Text Feature Extraction
Once the preprocessing of raw reviews is completed, each resultant tokenized review is represented as a vector whose length is equal to the number of unique terms in the resulting corpus. The value of each term in the corresponding document is determined by the application of either the bag of words (BoW) method or the term frequency-inverse document frequency (tf-idf) method. BoW is a simplifying representation used in natural language processing whereby each drug review is represented as a multiset of words, and the frequency of occurrence of each word is subsequently used as the feature value for training the sentiment model. Different from BoW, the tf-idf method is intended to reflect how important a word is to a document in a certain collection of documents. The tf-idf value of a term $t$ in a document $d$ is defined as:
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D),$$
where $\mathrm{tf}(t, d)$ is the term frequency, i.e., the raw count of $t$ in $d$, and $\mathrm{idf}(t, D)$ is the inverse document frequency, which measures how common or rare a particular word is across all documents $D$. In particular, $\mathrm{idf}(t, D)$ can be defined as:
$$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|},$$
where $N = |D|$ is the total number of documents considered and $|\{d \in D : t \in d\}|$ is the number of documents in which the term $t$ appears. In general, the tf-idf value increases proportionally to the number of times a word appears in the document concerned and is offset by the number of documents in the corpus that contain the word.
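As a small worked illustration of the two feature-value schemes, the sketch below builds BoW count vectors and tf-idf vectors from already tokenized documents. It assumes raw counts for the term frequency and a natural logarithm without smoothing for the inverse document frequency; the base of the logarithm and any smoothing are assumptions, and the function names are illustrative.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of tokenized documents."""
    vocabulary = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


def tf_idf(docs):
    """Return (vocabulary, tf-idf matrix) using tf = raw count, idf = ln(N / df)."""
    vocabulary, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term (always >= 1).
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    matrix = [[row[j] * idf[j] for j in range(len(vocabulary))] for row in counts]
    return vocabulary, matrix
```

Note that a term appearing in every document receives an idf of zero under this definition, so it contributes nothing to the tf-idf vector regardless of how often it occurs.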

C. Fuzzy-Rough Feature Selection (FRFS)
Following the above generation of feature values, an artificial data set can be constructed for subsequent feature selection as generally illustrated in Table I, where $l_i$ is the label or sentiment of a certain document $d_i$ and $v_{ij}$ is the value of the term $t_j$, $j \in \{1, \ldots, M\}$, in $d_i$, $i \in \{1, \ldots, N\}$. Note that $v_{ij}$ can be computed by the use of either BoW or tf-idf.
As outlined previously in Section II, the evaluation of $\gamma_R(D)$ underpins the popular hill climbing-based FRFS algorithm that is termed fuzzy-rough QuickReduct [14]. It works by adding to the current candidate feature subset the feature that leads to the highest improvement in fuzzy-rough dependency. The application of fuzzy-rough QuickReduct to drug review analysis is shown in Algorithm 1. It terminates when the addition of any remaining feature does not result in an increase in dependency. Note that with QuickReduct, for a problem with a total of $M$ terms in the corpus, the worst case will result in $(M^2 + M)/2$ evaluations of the dependency function. However, as FRFS is used for dimensionality reduction prior to any involvement of a given application, which will exploit only those features belonging to the resultant reduct, this operation has no negative impact upon the run-time efficiency of the system.
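The greedy forward search described above can be sketched as follows. This is not a reproduction of Algorithm 1 or of the Weka implementation; it is a generic hill-climbing loop in which the subset evaluator `gamma` is passed in as a function, so that a fuzzy-rough dependency function could be supplied in practice. The function and parameter names are assumptions, and the evaluation counter is included only to illustrate the worst-case bound.

```python
def quickreduct_forward(features, gamma):
    """Greedy forward selection.

    `gamma` maps a list of features to a quality score (e.g. a fuzzy-rough
    dependency degree). Returns the selected subset together with the number
    of candidate subsets that were evaluated during the search.
    """
    selected = []
    remaining = list(features)
    best_score = gamma(selected)  # score of the empty subset (not counted)
    evaluations = 0
    while remaining:
        scored = []
        for feature in remaining:
            scored.append((gamma(selected + [feature]), feature))
            evaluations += 1
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining feature increases the dependency
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, evaluations
```

If every step adds one feature until none remain, the loop evaluates $M + (M - 1) + \ldots + 1 = (M^2 + M)/2$ candidate subsets, which matches the worst case stated above.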

D. Sentiment Classification
The execution of FRFS on the originally preprocessed data results in a set of instances $D_i = (v_{i1}, \ldots, v_{iM}, l_i)$ whose dimensionality is typically significantly reduced from that of the artificially generated data set. Here, $D_i$ is the $i$-th review with the sentiment label $l_i$ and is composed of $M$ attributes. To demonstrate the efficiency and effectiveness of the feature subset returned by FRFS, four popular machine learning algorithms are employed to train the sentiment classification model, including:
• Naive Bayes [22] is a probabilistic learning classifier, based on a direct application of Bayes' theorem with strong independence assumptions. It has often been adopted as a baseline in text classification problems.
• C4.5 Decision Tree [23] is one of the most popular learning classifiers. The tree is constructed by starting with a given data set at the root node and iteratively expanding each of the branches, until all instances in the branch belong to the same class or no further information gain is provided by adding more features.
• Random Forest [24] is a meta estimator that combines the idea of bagging and random selection of features to fit a number of decision tree classifiers. It works on various sub-samples of a data set and averages the results to improve the final classification accuracy and control over-fitting.
• JRip [25] is a popular crisp classification rule learning algorithm that follows a divide-and-conquer strategy. Crisp rules are created incrementally one at a time, followed by an immediate simplification procedure. Once a set of rules for a given class is completed, an optimisation process is further imposed to fine-tune the rules.
Through the use of any of the above learning mechanisms, the trained model is then utilised to perform classification given an unknown document.
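To illustrate how a classifier consumes the reduced feature vectors, the following is a minimal sketch of a multinomial Naive Bayes classifier with Laplace smoothing operating on count features. It is an assumption-laden illustration rather than the Weka implementation used in this work, whose treatment of feature values may differ; the function names and the smoothing parameter `alpha` are hypothetical.

```python
import math


def train_naive_bayes(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    n_features = len(rows[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        members = [row for row, label in zip(rows, labels) if label == c]
        log_prior[c] = math.log(len(members) / len(rows))
        # Laplace-smoothed total count of each feature within class c.
        totals = [sum(row[j] for row in members) + alpha for j in range(n_features)]
        denominator = sum(totals)
        log_likelihood[c] = [math.log(t / denominator) for t in totals]
    return log_prior, log_likelihood


def predict_naive_bayes(model, row):
    """Return the class with the highest posterior log score for one vector."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(v * w for v, w in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)
```

The independence assumption shows up in the sum over features: each selected term contributes to the score separately, which is why removing noisy terms beforehand can directly affect the prediction.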

IV. EXPERIMENT
This section presents and discusses the results of an experimental investigation into the proposed framework, starting with a brief introduction to the data set used.

A. Data Set
The data set used for drug review analysis is collected from Druglib.com (which provides drug-related information for both consumers and healthcare professionals) by scraping the raw HTML files using the Beautiful Soup library in Python. This data set, initially exploited in [5], includes a total of 4142 reviews and has been split into fixed training and testing partitions based on a stratified random sampling scheme with a splitting ratio of 75% to 25%. In particular, the reviews given by the users are collected from the perspectives of side effects, effectiveness and comments, which are then merged together to form the overall reviews. The ratings, in the range of 1 to 10, are supplied by users to reflect their overall satisfaction. They are converted into sentiment labels as Positive for ratings from 7 to 10, Neutral for ratings of 5 or 6, and Negative for ratings from 1 to 4, following the exact convention adopted when the collected data was first applied [5]. The sentiment distribution is summarised in Figure 2.
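The rating-to-label convention and the stratified 75%/25% partitioning can be sketched as below. The data set itself comes with fixed partitions, so the split function is only an illustration of the sampling scheme; its name, the seed parameter and the rounding of per-class sizes are assumptions.

```python
import random
from collections import defaultdict


def rating_to_label(rating):
    """Map a 1-10 user rating to a sentiment label (convention of [5])."""
    if not 1 <= rating <= 10:
        raise ValueError("rating must lie between 1 and 10")
    if rating >= 7:
        return "Positive"
    if rating >= 5:
        return "Neutral"
    return "Negative"


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Return (train indices, test indices), sampled separately per label."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))  # per-class training size
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```

Sampling within each label keeps the proportions of Positive, Neutral and Negative reviews approximately equal in both partitions.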

B. Experimental Results
The extraction of a collection of raw training drug reviews results in a total number of 903 features, which can be summarised as shown in Figure 3 by a word cloud that creates a visual representation of the text data, where the prominence of individual terms is reflected by size and font.
The fuzzy-rough QuickReduct algorithm is then performed to search for a subset of features that maintains the same dependency as the complete set of features does (full dependency in theory, or as high a dependency as possible in practice). In particular, two different strategies may be taken while searching for the feature subset, namely, the forward approach that starts with an empty candidate set and iteratively adds the feature that results in the highest dependency improvement to the current subset, and the backward approach that starts with the full feature set and iteratively removes a feature whose removal does not lead to any decrease in dependency.
By the use of BoW for specifying feature values, the completion of the feature selection process leads to a significant reduction in the resultant feature space. FRFS helps to reduce the original 903 features to just 42 when the forward search strategy is adopted, and to 56 when the backward search is employed. These results are depicted by the corresponding word clouds shown in Figures 4 and 5. Importantly, although both returned feature subsets are substantially smaller, the resulting word clouds still identify key words such as 'take' and 'day' that are also prominent in the word cloud generated by the use of the complete feature set. The two reduced clouds also share many words in common with the list of top 10 frequent terms extracted in the established work of [4]. The performance of sentiment classification employing the two feature subsets is then verified using the four popular machine learning approaches that have been briefly explained in Section III-D. Note that the Weka Machine Learning Toolkit [26] is utilised to implement both the learning classifiers and fuzzy-rough QuickReduct, all involving the use of default parameters unless otherwise explicitly specified. Table II lists the experimental results when BoW is employed to compute the values of extracted features, where Trn shows the training accuracy (%), Tst shows the accuracy on the independent hold-out test set (%), and Time presents the run-time overheads (in seconds) of constructing the model while using the corresponding feature subset as indicated.
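For comparison with the forward strategy, the backward strategy can be sketched as a single elimination pass. As before, this is an illustrative sketch rather than the Weka procedure used in the experiments: the subset evaluator `gamma` is supplied by the caller, the function name is hypothetical, and features are examined once in their given order, which is an assumption about the search order.

```python
def backward_elimination(features, gamma):
    """Start from the full feature set and drop features that are not needed.

    `gamma` maps a list of features to a quality score (e.g. a fuzzy-rough
    dependency degree). A feature is removed whenever the remaining subset
    still reaches the score of the complete feature set.
    """
    selected = list(features)
    full_score = gamma(selected)
    for feature in list(features):
        candidate = [f for f in selected if f != feature]
        if gamma(candidate) >= full_score:
            selected = candidate  # removal causes no decrease in dependency
    return selected
```

Every candidate evaluated here contains almost all of the original features, which gives some intuition for why the backward search becomes expensive when the original feature set is very large.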

C. Experimental Analysis
Whilst being one of the most commonly used baselines in text classification, Naive Bayes has generally performed the worst in comparison to the others in this study. However, its testing performances using the feature subsets searched forward and backward improve by over 2% and 4%, respectively. Using the C4.5 Decision Tree, it is also shown that the employment of the feature subset searched backward is able to outperform the use of the full original feature set by nearly 4%. For Random Forest, the two testing performances using fuzzy-rough QuickReduct feature selection are exactly the same as that using the complete set of features. Together, these results may appear counter-intuitive, as the use of fewer features actually performs better. However, this is not a surprise as many of the features contain noise. Their removal can indeed help improve the classification accuracy, as generally reported in the relevant feature selection literature [27].
A similar phenomenon occurs when tf-idf is used to calculate feature values, as shown in Table III. The results of utilising the feature subsets returned by both forward and backward search are shown in Figures 4 and 5. Collectively, these results demonstrate that fuzzy-rough QuickReduct works well for both discrete and real-valued features. It is therefore not a surprise to notice that the averaged testing performances of employing the two reduced feature subsets are also better than those attainable by the use of the complete feature set. That is to say, the reduced feature subset generated by removing those features that may be redundant or even noisy is able to improve upon, or at least maintain, the existing performances. More importantly, this does not come at the cost of additional run-time overheads; on the contrary, the run time is reduced by more than 10 times on average in comparison to that required if the complete feature set is adopted. An interesting observation from the above is that the performances using the feature subset returned by backward search are generally better than those using the subset returned by forward search. However, when the number of original features is very large, searching backward is computationally more expensive and may even be impractical. After all, running FRFS is mainly to reduce computational costs. Hence, in practical applications of the proposed framework, it is necessary to trade off effectiveness against efficiency when deciding whether forward or backward search is to be employed. As a general guideline, unless it is affordable to run FRFS in a backward search manner, the forward search-based version should be used.
As an initial work to test the efficacy of running FRFS on collected drug reviews (on the scale of hundreds of text features), the above experimental investigation shows promising results. In particular, the classification performance is maintained while the run-time overheads are substantially reduced. The proposed framework can be expected to bring more significant cost-efficiency savings to real-world healthcare analysis on large-scale data, although this is subject to further experimental confirmation. Note again that running FRFS for dimensionality reduction is independent of the application problem. FRFS is run during training, prior to the application phase, which then exploits only those features belonging to the resultant reduct; hence, it has no negative impact upon the run-time efficiency of the trained system.

V. CONCLUSION
The exploitation of drug reviews is able to shed light on the understanding of users' preferences and drug experience, which may be exploited to help medical professionals with decision making and to improve public health. This paper has proposed a sentiment classification approach using fuzzy-rough feature selection with a focussed application on drug review analysis. The case study has shown that, supported by FRFS, popular machine learning approaches to learning classifiers are able to produce preserved or even improved performance while significantly reducing the feature space and run-time overheads, as compared to the results achievable without FRFS.
Considering the different effects of the feature subsets returned by different search mechanisms, future work will explore search strategies that may be most beneficial to the overall performance. It would also be interesting to investigate the use of alternative approaches for learning classifiers (e.g., recently proposed fuzzy rule-based models such as the one in [28]), which may work better while dealing with the uncertainty inherent in natural language processing.

Fig. 2. Label distribution of training and testing data sets

TABLE II. RESULTS WITH BOW

TABLE III. RESULTS WITH TF-IDF