Choosing the right data mining method to find the effect of each parameter over the target

https://datascience.stackexchange.com/questions/2473

16-10-2019

Question

I am dealing with a lot of categorical data and would like to use an appropriate data mining method, in any tool (preferably R), to find the effect of each categorical parameter on my target variable. To give a brief idea of the data: my target variable denotes the product type (say, disposables and non-disposables), and I have parameters such as root cause, symptom, customer name, product name, etc. Since the target can be treated as a binary value, I tried to find the combinations of values leading to the desired categories using Apriori. However, I have more than 2 categories in that attribute, and I want to use all of them and find the effect of the parameters above on each category. I would really like to try an SVM and use hyperplanes to separate the data and get an n-dimensional view, but I do not have enough knowledge to validate the technique or the functions I am using for the analysis. Currently I have about 9000 records, each representing a complaint from a user. The dataset has many columns, which I am using to determine the target variable [ myForumla <- Target~. ]. I also tried with just 4 categorical columns, but I am not getting a proper result.

Can categorical variables alone be used to build an SVM model and visualize it with n hyperplanes? Is there an appropriate data mining technique for dealing with purely categorical data?

Solution

You can try Bayesian belief networks (BBNs). BBNs handle categorical variables easily and give you a picture of the multivariable interactions. Furthermore, you can use sensitivity analysis to observe how each variable influences your class variable.

Once you learn the structure of the BBN, you can identify the Markov blanket of the class variable. The variables in the Markov blanket of the class variable are a subset of all the variables, and you can use optimization techniques to see which combination of values in this Markov blanket maximizes your class prediction.
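To make the Markov blanket idea concrete, here is a minimal sketch in plain Python. It assumes the network structure has already been learned and is given as a mapping from each node to its parents. The function name `markov_blanket` and the example graph are hypothetical, chosen only for illustration, and are not learned from any real data.

```python
def markov_blanket(parents, target):
    """Return the Markov blanket of `target` in a directed acyclic graph.

    `parents` maps each node to the set of its parent nodes.
    The blanket consists of the target's parents, its children,
    and the other parents of those children.
    """
    children = {node for node, ps in parents.items() if target in ps}
    blanket = set(parents.get(target, set())) | children
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(target)
    return blanket


# Hypothetical structure, for illustration only.
example_dag = {
    "customer_name": set(),
    "product_name": set(),
    "root_cause": {"product_name"},
    "target": {"root_cause"},
    "symptom": {"target", "customer_name"},
}

blanket = markov_blanket(example_dag, "target")
print(sorted(blanket))
```

In this toy graph, `product_name` falls outside the blanket: once `root_cause` is known, it adds no further information about the target.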

Other tips

Have you tried Random Forest for feature selection on categorical features? Random Forest chooses each split using an impurity measure (such as Gini impurity or information gain) and can report how important each variable is. It also computes proximities between cases. From the Random Forest documentation: "After each tree is built, all of the data are run down the tree, and proximities are computed for each pair of cases. If two cases occupy the same terminal node, their proximity is increased by one. At the end of the run, the proximities are normalized by dividing by the number of trees. Proximities are used in replacing missing data, locating outliers, and producing illuminating low-dimensional views of the data." For more, see: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#varimp
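The proximity calculation described in the quote can be sketched in a few lines of plain Python. This is only an illustration of the counting and normalization step: it assumes the forest has already been grown and that, for each tree, you know which terminal node every case lands in. The function name `proximity_matrix` and the leaf assignments below are made up for the example.

```python
def proximity_matrix(leaf_ids):
    """Compute pairwise proximities from terminal-node assignments.

    leaf_ids[t][i] is the terminal node that case i reaches in tree t.
    Two cases gain one unit of proximity for every tree in which they
    share a terminal node; the totals are divided by the number of trees.
    """
    n_trees = len(leaf_ids)
    n_cases = len(leaf_ids[0])
    prox = [[0.0] * n_cases for _ in range(n_cases)]
    for tree in leaf_ids:
        for i in range(n_cases):
            for j in range(n_cases):
                if tree[i] == tree[j]:
                    prox[i][j] += 1.0
    return [[value / n_trees for value in row] for row in prox]


# Hypothetical leaf assignments: 3 trees, 4 cases.
leaves = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 2, 3],
]

prox = proximity_matrix(leaves)
for row in prox:
    print([round(value, 2) for value in row])
```

Cases that never share a terminal node get proximity 0, and every case has proximity 1 with itself.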

All the answers above are good paths to follow. But if you want to choose among multiple algorithms (all in R =D) and understand why you would pick each one, try this:

https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Feature_Selection

Last tip: the Feature Ranking approach on that page is exactly what you want. ;-)
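As a rough illustration of what feature ranking means for purely categorical data, the sketch below scores each column by its information gain with respect to the target and sorts the columns by that score. It is one possible ranking criterion, not necessarily the one used on the linked page. The function names and the four toy records are invented for the example.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in target entropy after splitting on a categorical feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, feature_names, target_name):
    """Return (feature, information gain) pairs, highest gain first."""
    labels = [row[target_name] for row in rows]
    scores = {
        name: information_gain([row[name] for row in rows], labels)
        for name in feature_names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy records, invented for illustration only.
rows = [
    {"root_cause": "leak", "symptom": "noise", "target": "disposable"},
    {"root_cause": "leak", "symptom": "heat", "target": "disposable"},
    {"root_cause": "crack", "symptom": "noise", "target": "non-disposable"},
    {"root_cause": "crack", "symptom": "heat", "target": "non-disposable"},
]

ranking = rank_features(rows, ["root_cause", "symptom"], "target")
print(ranking)
```

One caveat: information gain tends to favor columns with many distinct values (customer name is a likely example), so a normalized variant such as gain ratio may be worth considering for such columns.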

License: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange