Question

Suppose that $\Omega$ is a finite probability space, with measure $P$ (we can take $P$ to be uniform). Let $C \in \{\pm 1 \}$ be a random variable on $\Omega$, the classifier. Let

$$H(C) = H(C, \Omega, P) = -\sum_{i \in \{\pm 1\}} P(C = i) \log P(C = i)$$

be the entropy of $C$.

For a random variable $Y$ on $\Omega$ and $r \in \mathbb{R}$, define the level set $\{Y = r\} = V_r$, and define

$$H(C | Y) = \sum_{r \in \mathbb{R}} P(Y = r) \, H(C|_{V_r}, V_r, P_{V_r})$$

where $P_{V_r}(B) = P(B \cap V_r) / P(V_r)$ is the conditional measure on $V_r$. Here $H(C|_{V_r}, V_r, P_{V_r})$ is the entropy of $C$ restricted to the subset $V_r$, computed with this conditional probability measure.
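For concreteness, here is a minimal sketch of these definitions in Python, assuming a uniform measure and representing $\Omega$ as the indices of a list; the helper names `entropy` and `cond_entropy` are mine, not part of the question.

```python
import math
from collections import Counter

def entropy(values):
    """H of a random variable, given its list of values on Omega (uniform P)."""
    n = len(values)
    # Standard convention: H = -sum_r P(Y = r) log P(Y = r), with 0 log 0 = 0.
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def cond_entropy(c_values, y_values):
    """H(C | Y) = sum_r P(Y = r) * H(C restricted to the level set V_r)."""
    n = len(c_values)
    levels = {}
    for c, y in zip(c_values, y_values):
        levels.setdefault(y, []).append(c)  # group Omega into the level sets V_r
    return sum(len(v) / n * entropy(v) for v in levels.values())
```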

As in a decision tree classifier, I want to determine the subset $A \subset \Omega$ that maximizes $H(C) - H(C|1_A)$. That is, I want to maximize the information gain. What is the "best" algorithm for doing so? What is the complexity of this problem? Are there good probabilistic algorithms?
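As a point of reference, the brute-force search is exponential: with $|\Omega| = n$ there are $2^n$ candidate subsets. A naive sketch, reusing the helpers above and encoding each $A$ as a bitmask:

```python
def best_split_bruteforce(c_values):
    """Exhaustive search over all subsets A; only feasible for tiny Omega."""
    n = len(c_values)
    h_c = entropy(c_values)
    best_gain, best_mask = float("-inf"), 0
    for mask in range(2 ** n):                          # each subset A as a bitmask
        indicator = [(mask >> i) & 1 for i in range(n)]  # the values of 1_A
        gain = h_c - cond_entropy(c_values, indicator)   # H(C) - H(C | 1_A)
        if gain > best_gain:
            best_gain, best_mask = gain, mask
    return best_mask, best_gain
```

On an input like `[1, 1, -1, 1]` this finds a perfect split (either $\{C = 1\}$ or its complement), whose gain is the maximal value $H(C)$.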

Some computation (expanding $H(C|1_A)$ and dropping the constant term $H(C)$) reformulates this as finding the subset $A$ that maximizes

$$\sum_e \sum_{i = \pm 1} P(C^i \cap A^e) \log P(C^i \cap A^e) - \sum_e P(A^e) \log P(A^e),$$

where $A^e$ denotes either $A$ or $A^c$, and $C^i = \{C = i\}$.
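This reformulation can be checked numerically. The sketch below (reusing `entropy` and `cond_entropy` from above) evaluates the displayed objective directly, with the $0 \log 0 = 0$ convention handled by a small helper `plogp` (my name), and asserts that it differs from the information gain only by the constant $H(C)$.

```python
def plogp(p):
    """p log p with the convention 0 log 0 = 0."""
    return p * math.log(p) if p > 0 else 0.0

def objective(c_values, indicator):
    """sum_e sum_i P(C^i ∩ A^e) log P(C^i ∩ A^e) - sum_e P(A^e) log P(A^e)."""
    n = len(c_values)
    total = 0.0
    for e in (0, 1):                                   # e selects A^c or A
        total -= plogp(sum(1 for a in indicator if a == e) / n)
        for i in (-1, 1):                              # the two classes C = ±1
            total += plogp(sum(1 for c, a in zip(c_values, indicator)
                               if c == i and a == e) / n)
    return total

# The objective equals -H(C | 1_A), so gain(A) = H(C) + objective(A) and the
# two maximizations pick out the same subset A:
c, ind = [1, 1, -1, 1], [1, 0, 0, 1]
gain = entropy(c) - cond_entropy(c, ind)
assert abs(gain - (entropy(c) + objective(c, ind))) < 1e-12
```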

