Complexity of / best algorithm for finding the dichotomy that maximizes information gain?
04-11-2019
Question
Suppose that $X$ is a finite set with a probability measure $P$. I want to find the subset $A \subset X$ so that the information gain of conditioning on $\{A, A^c\}$ is maximal. That is, I want to find $A$ that maximizes
$$H(X) -H(X|\{A,A^c\}) = H(X) - (P(A)H(A) + P(A^c) H(A^c)),$$
where $H(A)$ denotes the entropy of the conditional probability distribution $\mu$ on $A$ defined by $\mu(B) = P(B \cap A) / P(A)$. (If $P(A) = 0$, set $P(A) H(A) = 0$; in any case, such an $A$ won't be a maximizer.)
Since this splits $X$ into two sets, I am calling this a dichotomy, and the question is of finding the dichotomy that produces the largest information gain.
Question: What is the complexity of this problem? (As D.W. points out below, a reasonable corresponding decision problem is: for $t \geq 0$, is there an $A$ such that the information gain is $\geq t$? This decision problem is in NP, and we can ask whether it is NP-hard, etc.) Is there a good heuristic algorithm for making this choice? What if I ask for a probably approximately correct algorithm?
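To make the objective concrete, here is a small sketch of the naive baseline: evaluate the information gain directly from the definition above, and find the optimum by exhaustive search over the $2^{|X|-1}$ dichotomies (fixing one element in $A$ to avoid counting each split twice). The function names (`entropy`, `information_gain`, `best_dichotomy`) are my own, and this is obviously only feasible for small $X$; the question is precisely whether one can do better.

```python
from itertools import combinations
from math import log2

def entropy(probs):
    """Shannon entropy (bits) of a distribution given as probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(P, A):
    """Information gain H(X) - H(X | {A, A^c}) for the dichotomy {A, A^c},
    where P is a dict mapping elements of X to their probabilities."""
    pA = sum(P[x] for x in A)
    pAc = 1.0 - pA
    gain = entropy(P.values())  # H(X)
    if pA > 0:
        gain -= pA * entropy(P[x] / pA for x in A)       # P(A) H(A)
    if pAc > 0:
        Ac = [x for x in P if x not in A]
        gain -= pAc * entropy(P[x] / pAc for x in Ac)    # P(A^c) H(A^c)
    return gain

def best_dichotomy(P):
    """Exhaustive search over the 2^(|X|-1) subsets containing a fixed
    element; returns (A, gain) with maximal information gain."""
    xs = list(P)
    best_A, best = None, float("-inf")
    rest = xs[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            A = frozenset((xs[0],) + combo)
            g = information_gain(P, A)
            if g > best:
                best_A, best = A, g
    return best_A, best
```

For instance, on the uniform distribution over four points, the optimum is any two-element $A$, with information gain exactly 1 bit.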
I am asking this question because I'm studying decision trees in machine learning as well as coding theory, and this seems like a basic question in both settings.