Question

I have a set of data in a .tsv file available here. I have written several classifiers to decide whether a given website is ephemeral or evergreen.

My initial approach was rapid prototyping: I tried a random classifier, a 1R classifier, some feature engineering, linear regression, logistic regression, Naive Bayes, and so on.

I did all of this in a jumbled-up, incoherent manner, however. What I would like to know is: if you were given a set of data (for the sake of argument, the data posted above), how would you analyse it to find a suitable classifier? What would you look at initially to extract meaning from the dataset?

Is what I have done reasonable in this age of high-level programming, where I can run five or six algorithms on my data in a night? Is a rapid prototyping approach the best idea here, or is there a more reasoned, logical approach that can be taken?

At the moment, I have cleaned up the data by removing all the meaningless rows (there are only a few of these, so they can simply be discarded). I have also written a script to cross-validate my classifiers, so I have a metric to test for bias/variance and to check overall algorithm performance.

Where do I go from here? What aspects do I need to consider? What should I be thinking about at this point?


Solution

You could throw in some elements of theory. For example:

  • the Naive Bayes classifier assumes that all variables are independent. Maybe that's not the case for your data? This classifier is nevertheless fast and easy to use, so it remains a good choice for many problems even when the variables are not truly independent.
  • linear regression gives too much weight to samples that are far away from the classification boundary, which is usually a bad idea.
  • logistic regression is an attempt to fix this problem, but it still combines the input variables linearly; in other words, the boundary between the classes is a hyperplane in the input-variable space (see the sketch after this list).
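
To make this concrete, here is a minimal scikit-learn sketch contrasting Naive Bayes with logistic regression on the same features. The file name train.tsv and the label column are assumptions about your data, not something taken from your post:

```python
# Minimal sketch: compare Naive Bayes and logistic regression on the same
# numeric features. "train.tsv" and the "label" column are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("train.tsv", sep="\t")
y = df["label"]
X = df.drop(columns=["label"]).select_dtypes("number")  # numeric features only

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for clf in (GaussianNB(), LogisticRegression(max_iter=1000)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```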

When I study a dataset, I typically start by drawing the distribution of each variable for each class of samples to find the most discriminating variables.
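Something along these lines would do it (a matplotlib sketch; the label column name is again an assumption about your file):

```python
# Sketch: overlay the per-class distribution of each numeric feature to spot
# the most discriminating variables. Column names are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.tsv", sep="\t")
features = df.drop(columns=["label"]).select_dtypes("number").columns

for col in features:
    for cls, group in df.groupby("label"):
        plt.hist(group[col].dropna(), bins=30, alpha=0.5, density=True,
                 label=f"class {cls}")
    plt.title(col)
    plt.legend()
    plt.show()
```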

Then, for each class of samples, I usually plot one input variable against another to study the correlations between the variables. Are there non-linear correlations? If so, I might choose classifiers that can handle them. Are there strong correlations between two input variables? If so, one of the two could be dropped to reduce the dimensionality of the problem.
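A quick way to do both checks (again a sketch; feature_a and feature_b are placeholder column names) is a per-class scatter plot plus a correlation matrix:

```python
# Sketch: scatter one feature against another for each class, then inspect the
# correlation matrix. "feature_a"/"feature_b" are placeholder column names.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.tsv", sep="\t")

for cls, group in df.groupby("label"):
    plt.scatter(group["feature_a"], group["feature_b"], alpha=0.3,
                label=f"class {cls}")
plt.xlabel("feature_a")
plt.ylabel("feature_b")
plt.legend()
plt.show()

# Highly correlated pairs are candidates for dropping one of the two features.
print(df.drop(columns=["label"]).select_dtypes("number").corr())
```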

These plots will also allow you to spot problems in your dataset.

But in the end, trying many classifiers and optimizing their parameters for the best cross-validation results, as you have done, is a pragmatic and valid approach, and it has to be done at some point anyway.
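If you go that route, scikit-learn's GridSearchCV makes the parameter search systematic. A sketch, with an illustrative (not tuned) parameter grid and the same assumed column names as above:

```python
# Sketch: cross-validated parameter search for one classifier. The grid values
# are illustrative, not tuned recommendations.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

df = pd.read_csv("train.tsv", sep="\t")
y = df["label"]
X = df.drop(columns=["label"]).select_dtypes("number")

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                    cv=5, scoring="roc_auc")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```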

I understand from the tags on this post that you have used the classifiers from scikit-learn. In case you have not noticed yet, this package also provides powerful tools for cross validation: http://scikit-learn.org/stable/modules/cross_validation.html
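For instance, a stratified cross-validation of a single classifier is essentially one call (shown with the current API; the docs page above covers the same functionality):

```python
# Sketch of scikit-learn's built-in cross-validation helpers.
# "train.tsv" and the "label" column are assumptions about your data.
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("train.tsv", sep="\t")
y = df["label"]
X = df.drop(columns=["label"]).select_dtypes("number")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)
print(scores.mean(), scores.std())
```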

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow