Hmm… This Bayesian stuff is very interesting. I’ll check it out.
So let’s see: P(A|B) is P(A^B)/P(B).
According to Bayes’ rule, P(B|A) = P(A^B)/P(A) = P(A^B)/P(B) * P(B)/P(A) = P(A|B)*P(B)/P(A).
So P(B|A) is just P(A|B) times the probability of B divided by the probability of A. That’s because now we’re dividing by P(A) rather than P(B). For example, P(Muslim|terrorist) might be quite high, say 80%, but P(terrorist|Muslim) is far below 1%. That’s simply because there are vastly more Muslims than terrorists, most of whom hold perfectly peaceful jobs. If P(A|B)=1, we have what we call logically B->A.
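As a quick sanity check of that arithmetic, here is a small Python sketch with entirely made-up counts, chosen only so that A is rare and B is common; it just confirms that P(A|B) and P(B|A) can be wildly different while Bayes’ rule still connects them.
Code:
# Toy sanity check of Bayes' rule with hypothetical counts (all numbers made up).
total = 1_000_000   # population size
n_B   = 10_000      # members of the large group B
n_A   = 100         # members of the small group A
n_AB  = 80          # members of both A and B

p_A  = n_A / total
p_B  = n_B / total
p_AB = n_AB / total

p_A_given_B = p_AB / p_B   # 0.008 -- small, because B is a big group
p_B_given_A = p_AB / p_A   # 0.8   -- large, because A is a tiny group

# Bayes' rule: P(B|A) = P(A|B) * P(B) / P(A); both routes must agree.
assert abs(p_B_given_A - p_A_given_B * p_B / p_A) < 1e-12
print(p_A_given_B, p_B_given_A)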
Actually that’s not quite correct. In bayesian theory, P(A|B) means “the probability (degree of confidence) that A is true GIVEN that B is assumed to be true.” It does not mean “the probability that B implies A,” nor, far worse, Popper’s self-inconsistent “propensity” interpretation that it means “the probability that B causes A.”
The logical relation B->A has the somewhat counterintuitive boolean representation (not(B and (not A))), which can also be written as ((not B) or A). That is because B->A only demands that when B is true, A must also be true, so ((B=True) and (A=False)) means B->A must be False, whereas if B is false, the implication relationship does not say anything about whether or not A must be true.
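A four-row truth table makes that concrete; the following Python sketch simply enumerates it and checks that the two boolean forms agree.
Code:
# Truth-table check that B -> A is (not (B and (not A))), i.e. ((not B) or A).
for B in (False, True):
    for A in (False, True):
        implies = (not B) or A
        assert implies == (not (B and (not A)))
        print("B =", B, " A =", A, " B->A =", implies)
# Only the row B=True, A=False makes B->A false; when B is false,
# the implication is vacuously true regardless of A.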
Quote
Say B is the proposition that a defendant is guilty of, say, some gruesome murder. Say A is a piece of evidence that would certainly be true if B is true; say A is that the defendant’s clothes will be covered in blood. So P(A|B)=1.
Then P(B|A) = P(A|B)*P(B)/P(A) = 1*P(B)/P(A). Wait a minute. If P(A) is very small, then yes, P(B|A) should go up significantly. If A is common, then the evidence is merely circumstantial.
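To put rough numbers on that intuition (all figures hypothetical): with P(A|B) = 1 the posterior reduces to P(B)/P(A), so the rarer the evidence, the bigger the boost.
Code:
# Hypothetical numbers: a small prior probability of guilt, and two scenarios
# for how common the evidence A is in the population at large.
p_B = 0.001                   # prior P(guilty)
p_A_given_B = 1.0             # the evidence is certain if guilty

for p_A in (0.5, 0.002):      # evidence common vs. evidence rare
    p_B_given_A = p_A_given_B * p_B / p_A
    print("P(A) =", p_A, " P(B|A) =", round(p_B_given_A, 3))
# Common evidence (P(A)=0.5)  -> posterior 0.002: barely moves, "circumstantial".
# Rare evidence  (P(A)=0.002) -> posterior 0.5: a huge update.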
Where does it say that P(B) gets stuck at 1 once our prior is 1? I’ve got to take a look.
That’s not the clearest way of looking at it. Try the example that follows the background paragraphs below.
In bayesian probability theory, all probabilities are conditional on your background information, which consists of the things you assume to be a priori true (your axioms) and whatever empirical data you have acquired by experience; I’ll write this as the logical predicate “Exp,” for “Experience plus A Priori Assumptions,” or just “E” for short.
Bayesian theory takes it as axiomatic that the probability of a statement that is always False (i.e., a logical contradiction) is zero independent of any condition X, P(False|X) = 0, and likewise the probability of a statement that is always True (a tautology) is unity independent of any condition X, P(True|X) = 1.
Also, you must explicitly specify your “Universe of Discourse” up front, i.e., the set of alternative hypotheses {H1,H2,…,Hn} that you intend to consider. The hypotheses defining the Universe of Discourse are usually taken to be mutually exclusive, i.e., if one hypothesis is true, then all the other hypotheses must be false (this can always be arranged by the logical equivalent of “orthogonalization”), and exhaustive, i.e., no other explanation will be considered. (This latter assumption is not a restriction, since one can always tack on the “catch-all” hypothesis “There is some other explanation that I haven’t thought of yet,” which, depending on your degree of humility or arrogance, can be given an a priori probability anywhere from quite significant to quite small, as long as it is greater than 0 and less than 1.)
Since {H1,H2,…,Hn} are assumed exhaustive and mutually exclusive, exactly one hypothesis must always be true, so it’s taken as an axiom that the logical disjunction H1+H2+…+Hn (“+” means “logical OR”) must be true with certainty, implying that P(H1+H2+…+Hn|X) == 1. Also, since by mutual exclusivity exactly one of the hypotheses can be true while the others must be false, we take it as axiomatic that P(H1+H2+…+Hn|X) = P(H1|X) + P(H2|X) + … + P(Hn|X) == 1.
Since by the first axiom of bayesian probability, P(A + (not A)|X) = 1 for all X, and since only one of A or (not A) can be true, it immediately follows that P(not A|X) = 1 – P(A|X) for all X.
Finally, there is the “chain rule” for factoring joint probabilities into conditionals: P(A&B|X) == P(A|X&B) * P(B|X) == P(B|X&A) * P(A|X). (For readability, this is more often written as P(A,B|X) == P(A|X,B) * P(B|X) == P(B|X,A) * P(A|X), and the “AND commas” are even dropped when that won’t cause ambiguity.)
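These rules are easy to check numerically; the Python sketch below uses an arbitrary made-up joint table over two propositions and verifies the complement rule and the chain rule on it.
Code:
# A made-up joint distribution over two binary propositions A and B,
# already conditioned on some background X; the four entries sum to 1.
joint = {  # (A, B) -> P(A,B|X)
    (True,  True):  0.10,
    (True,  False): 0.30,
    (False, True):  0.20,
    (False, False): 0.40,
}

def p(event):  # P(event|X) for an event given as a predicate on (A, B)
    return sum(pr for (a, b), pr in joint.items() if event(a, b))

p_A  = p(lambda a, b: a)
p_B  = p(lambda a, b: b)
p_AB = p(lambda a, b: a and b)

p_A_given_B = p_AB / p_B     # P(A|X,B), computed by restricting to B
p_B_given_A = p_AB / p_A     # P(B|X,A), computed by restricting to A

# Complement rule: P(not A|X) = 1 - P(A|X)
assert abs(p(lambda a, b: not a) - (1 - p_A)) < 1e-12
# Chain rule: both factorizations reproduce the same joint probability.
print(p_A_given_B * p_B, p_B_given_A * p_A, p_AB)   # all 0.10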
It turns out that the above axioms completely define all of bayesian probability theory, and that from them it’s possible to compute the probability of any statement that can be expressed in terms of the set of hypotheses {H1,H2,…,Hn} and the “background predicate” E representing your axioms and experience. Furthermore, a careful analysis shows that they represent the unique extension of boolean logic to truth-values intermediate between 0 and 1, and that any other set of rules will fail to be consistent with logic. (I’m leaving out some technical details here, as the proof of this theorem turns out to be remarkably subtle.)
Ah, I see. So P(Something|X) effectively sets up a new probability universe conditioned on X. Wow, I’d forgotten that part of probability from when I was in school.
Quote
A number of useful corollaries can be proved from the above axioms, for propositions A, B, and X:
- P(A|X,A) == 1, since it’s given that A is assumed to be true, and by definition P(True|X) = 1;
- P(A,A|X) = P(A|X), since logically A&A == A
- P(B|X,A,A) = P(B|X,A), since logically A&A == A;
- P(A|X,B) = P(A|X), if A and B are logically independent, since then knowing B tells us nothing about A;
- P(A,B|X) = P(A|X) * P(B|X), if A and B are logically independent (follows from the chain rule plus the above);
- P(A+B|X) = P(A|X) + P(B|X) – P(A,B|X), which allows us to treat correlations;
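The corollaries can be checked the same way; in the sketch below the joint over A and B is built from independent marginals (hypothetical values), so the independence corollaries hold exactly, and the inclusion-exclusion rule P(A+B|X) = P(A|X) + P(B|X) - P(A,B|X) can be verified directly.
Code:
# Build a joint distribution in which A and B are independent by construction;
# the marginal probabilities are arbitrary made-up values.
pA, pB = 0.3, 0.6
joint = {(a, b): (pA if a else 1 - pA) * (pB if b else 1 - pB)
         for a in (True, False) for b in (True, False)}

def p(event):
    return sum(pr for (a, b), pr in joint.items() if event(a, b))

p_A, p_B = p(lambda a, b: a), p(lambda a, b: b)
p_AB     = p(lambda a, b: a and b)
p_AorB   = p(lambda a, b: a or b)

# Independence corollaries: P(A|X,B) = P(A|X)  and  P(A,B|X) = P(A|X) * P(B|X)
assert abs(p_AB / p_B - p_A) < 1e-12
assert abs(p_AB - p_A * p_B) < 1e-12
# Inclusion-exclusion: P(A+B|X) = P(A|X) + P(B|X) - P(A,B|X)
assert abs(p_AorB - (p_A + p_B - p_AB)) < 1e-12
print("corollaries hold:", p_A, p_B, p_AB, p_AorB)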
Bayes’ Theorem follows directly from the chain-rule axiom: P(A|BX) = P(B|AX) * P(A|X) / P(B|X). However, this is not the most useful form for reasoning about how to update the a priori probabilities of your hypotheses given new information. Denote your empirical data or new information by the logical predicate “D.” Assume that you also have some “statistical model” that predicts the probability P(D|Hi,E) (your degree of confidence or how “unsurprised” you would be) that you would see data D given your past experience E and assuming that hypothesis “Hi” is true; P(D|Hi,E) is often called the “data likelihood” of hypothesis “Hi.” Bayes’ Theorem allows you to invert P(D|Hi,E) to give the updated or “a posteriori” probability of hypothesis “Hi,” P(Hi|D,E) = P(D|Hi,E) * P(Hi|E) / P(D|E) in terms of the “data likelihood” for “Hi,” the a priori probability P(Hi|E), and a quantity we don’t seem to have, P(D|E), the probability one would observe the data “D” given only our experience, sometimes called the “evidence” provided by the data. However, there is a clever trick: since by hypothesis H1+H2+…+Hn = True, and since P(D&True|X) = P(True|D,X) * P(D|X) = 1*P(D|X) = P(D|X) for all D and X, it follows that:
Code:
P(D|E) = P(D&(H1+H2+...+Hn)|E) = P(D&H1 + D&H2 + ... + D&Hn|E) = P(D,H1|E) + P(D,H2|E) + ... + P(D,Hn|E)
= P(D|H1,E) * P(H1|E) + P(D|H2,E) * P(H2|E) + ... + P(D|Hn,E) * P(Hn|E)
and now we have expressed P(D|E) entirely in terms of things we know. Hence, bayesian theory allows one to revise one’s a priori probabilities P(Hi|E) to incorporate new data “D” into one’s set of assumptions and empirical experience “E,” provided one has a statistical model for estimating the likelihood of observing data “D”:
Great, I see. So P(D|E) is the probability of D given only our background experience. To know that, we need some understanding beyond E itself (a statistical model) of what’s likely and what’s not. I get that.
Quote
Code:
P(Hi|D,E) = P(D|Hi,E) * P(Hi|E) / (Sum(k=1..n) P(D|Hk,E) * P(Hk|E))
Note that if the a priori probability P(Hi|E) is zero for some specified “i” (i.e., Hi is a priori false), no amount of data can ever budge it from zero (i.e. false), and that if it’s one (i.e. a priori true), no amount of data can ever budge it from one (i.e. true), since if one P(Hi|E) is one, then all the others must be zero, by the axiom Sum(i=1..n) P(Hi|E) == 1. Hence, one must take an “agnostic” attitude to learn from experience, because if one dogmatically rejects a given hypothesis (or blindly accepts it on faith), no amount of experimental evidence to the contrary can ever alter that a priori probability.
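In code, the whole update is just “multiply each prior by its data likelihood, then renormalize.” The sketch below, with arbitrary made-up numbers, also illustrates the point about dogmatic priors: a prior of exactly one (with the rest zero) never moves, no matter what the likelihoods are.
Code:
def bayes_update(priors, likelihoods):
    """posterior_i = P(D|Hi,E) * P(Hi|E) / sum_k P(D|Hk,E) * P(Hk|E)."""
    numer = [lk * pr for lk, pr in zip(likelihoods, priors)]
    evidence = sum(numer)                      # this is P(D|E)
    return [n / evidence for n in numer]

# Hypothetical three-hypothesis universe of discourse.
priors      = [0.5, 0.3, 0.2]                  # P(Hi|E), made up
likelihoods = [0.1, 0.7, 0.4]                  # P(D|Hi,E), made up
print(bayes_update(priors, likelihoods))       # H2 gains, H1 loses

# Dogmatic priors are immovable: no data can budge a prior of exactly 1 (or 0).
print(bayes_update([1.0, 0.0, 0.0], likelihoods))   # stays [1.0, 0.0, 0.0]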
Now for the example: Suppose that you are walking down a street in an arid town, and you notice that the sidewalk in front of a house is wet. From prior experience you know that people tend to sprinkle their lawns about three days a week, whereas it only rains once a week, so a priori you expect that P(Sprinkler|Exp) > P(Rain|Exp), with a priori odds of about 3 to 1. Let’s assume for the moment that you can’t think of any third explanation, so your Universe of Discourse will consist of the two propositions “It was raining earlier,” and “The sprinkler was on earlier.” From experience, you know a priori that P(Wet|Sprinkler,Exp) and P(Wet|Rain,Exp) are both close to unity, i.e., if the sprinkler was on, the sidewalk will probably get wet, and if it was raining, the sidewalk will also probably get wet, but if all the information you have is that one given sidewalk in front of one given house is wet, one can’t say much more than P(Sprinkler|Wet,Exp) > P(Rain|Wet,Exp), since people sprinkle more often than it rains.
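Putting rough, purely hypothetical numbers on the single-sidewalk case: with 3-to-1 prior odds for sprinkling and both data likelihoods near one, the posterior odds barely move.
Code:
# Hypothetical figures: prior odds 3:1 for sprinkler over rain, and a wet
# sidewalk is nearly certain under either hypothesis.
p_sprinkler, p_rain = 0.75, 0.25
l_sprinkler, l_rain = 0.90, 0.95        # P(Wet|hypothesis,Exp), assumed

evidence = l_sprinkler * p_sprinkler + l_rain * p_rain     # P(Wet|Exp)
print("P(Sprinkler|Wet,Exp) =", round(l_sprinkler * p_sprinkler / evidence, 3))  # ~0.74
print("P(Rain|Wet,Exp)      =", round(l_rain * p_rain / evidence, 3))            # ~0.26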
Now, suppose you look up and down the sidewalk, and notice that the sidewalks in front of all the houses are wet. From experience, you know that rainstorms seldom rain on only one house while avoiding others, so you suspect that it probably rained — but how confident can you be of that conclusion?
We can estimate the relative data likelihoods using the chain-rule for conditional probabilities:
Code:
P(Wet_1, Wet_2, ..., Wet_N | X, E) = P(Wet_1 | X, E, Wet_2, ..., Wet_N) * P(Wet_2 | X, E, Wet_3, ..., Wet_N) * ... * P(Wet_N | X, E)
where “X” is either “Rain” or “Sprinkler,” and “E” is your experience and assumptions.
First, suppose that it rained: then you know from experience that Wet_1 = Wet_2 = … = Wet_N; hence, since P(A&A|X) = P(A|X), P(Wet_1 & Wet_2 & … & Wet_N | Rain, Exp) will not be appreciably different from any individual P(Wet_i | Rain, Exp), which is furthermore close to unity; hence, the data likelihood that all the sidewalks will be wet if it rained is close to unity, in agreement with common sense.
By contrast, you know from experience that people decide to water their lawns more or less independently, so P(Wet_i|Sprinkler,Exp,Wet_j) = P(Wet_i|Sprinkler,Exp) for all i != j; hence
Code:
P(Wet_1, Wet_2, ..., Wet_N | Sprinkler, Exp) = P(Wet_1 | Sprinkler, Exp, Wet_2, ..., Wet_N) * P(Wet_2 | Sprinkler, Exp, Wet_3, ..., Wet_N) * ... * P(Wet_N | Sprinkler, Exp)
= P(Wet_1 | Sprinkler, Exp) * P(Wet_2 | Sprinkler, Exp) * ... * P(Wet_N | Sprinkler, Exp)
~= ( P(Wet | Sprinkler, Exp) )**N
where the last step assumes that most people water their lawns with about the same frequency. It thus follows that, even if P(Wet | Sprinkler, Exp) is close to unity, it will not take a very large number of houses N before the data likelihood becomes very small — which is consistent with both experience and common sense that it’s unlikely that every resident on the block will water their lawn on the same day (unless it’s extremely hot!).
Plugging these and similar estimates of the data likelihoods for the two hypotheses into Bayes’ Theorem, it’s fairly straightforward to show that, if all the sidewalks are wet, then the a posteriori probability for rain becomes quite large, even though the a priori probability of rain was much smaller than for sprinkling.
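Here is a minimal numerical sketch of that conclusion, with all parameters made up: the rain likelihood stays near one no matter how many sidewalks are wet, while the sprinkler likelihood picks up roughly one factor of the per-house sprinkling probability (taken as 3/7, from the three-days-a-week figure) for every additional house.
Code:
# Hypothetical parameters: prior odds 3:1 for sprinkler; if it rained, all
# sidewalks are almost certainly wet; under the sprinkler hypothesis, each of
# the other houses waters independently about 3 days out of 7.
p_sprinkler, p_rain = 0.75, 0.25
p_wet_rain     = 0.95          # P(all N wet | Rain, Exp), roughly independent of N
p_wet_neighbor = 3.0 / 7.0     # chance an unrelated house sprinkled that day

def p_rain_given_all_wet(N):
    like_rain      = p_wet_rain
    like_sprinkler = 0.9 * p_wet_neighbor ** (N - 1)   # house 1 wet, plus N-1 others
    evidence = like_rain * p_rain + like_sprinkler * p_sprinkler
    return like_rain * p_rain / evidence

for N in (1, 3, 5, 10):
    print(N, round(p_rain_given_all_wet(N), 4))
# The posterior for rain climbs rapidly with N, despite the 3:1 prior against it.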
Conversely, if only one sidewalk is wet and all the others are dry, then sprinkling becomes even more likely relative to rain, although a more careful analysis will convince you that something odd must be going on, since under the sprinkler hypothesis you would expect roughly 3 sidewalks out of every 7 to be wet, not just one sidewalk out of N.
Finally, if we had included the “catch all” hypothesis that something we haven’t thought of has happened, then in the case that only 1 sidewalk out of N was wet, it would be the “catch-all” that would have gotten the highest posterior probability — even if one had assumed that its a priori probability was small — suggesting that it’s time to re-think your set of hypotheses.
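To see how the catch-all can win in the one-wet-sidewalk case, here is a hedged sketch. The catch-all’s likelihood is pure invention (imagine “someone hosed down exactly that one sidewalk”), but any modest value for it swamps the competing likelihoods once N is not tiny.
Code:
# Hypothetical setup: 10 houses, exactly one sidewalk wet; made-up priors of
# 0.24 for rain, 0.72 for sprinkler, and 0.04 for the catch-all hypothesis.
N = 10
priors = {"rain": 0.24, "sprinkler": 0.72, "catch-all": 0.04}

p_wet_neighbor = 3.0 / 7.0     # per-house sprinkling probability, assumed
likelihoods = {
    "rain":      1e-6,         # rain essentially never wets just one sidewalk
    "sprinkler": p_wet_neighbor * (1 - p_wet_neighbor) ** (N - 1),
    "catch-all": 0.3,          # invented: e.g. somebody hosed that one sidewalk
}

evidence = sum(likelihoods[h] * priors[h] for h in priors)
for h in priors:
    print(h, round(likelihoods[h] * priors[h] / evidence, 3))
# With these made-up numbers the catch-all gets the largest posterior,
# a hint that the original two-hypothesis universe of discourse was too small.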
For an elementary introduction to bayesian probability theory, I recommend “Data Analysis: A Bayesian Tutorial” by D.S. Sivia. For a detailed discussion of both the philosophy and practice of bayesian probabilistic reasoning, I recommend “Probability Theory: The Logic of Science” by E.T. Jaynes. For free repositories of many papers and tutorials online, see http://bayes.wustl.edu/ (which contains the first several chapters of Jaynes’ book and a complete but unpublished draft of an earlier book), and http://www.astro.cornell.edu/staff/loredo/bayes/, which contains tutorials and links to other Bayesian websites.
This is very enlightening. Now I’m starting to see where “faith” kicks in. Once people are convinced that something is true, nothing will shake that belief.
Okay, so we have two hypotheses: Hr (for rain) and Hs (for sprinkler). Say I see that a lawn is wet. Say I come from the Middle East, where it rains only once a year. So I would believe that the sprinkler must have been on. Now, this is close to “faith”: I already believe, with great prejudice, that it ain’t rain.
But then I see all the other houses are wet too.
Now let’s see how things work.
Look, I will edit this much later. I need time to think.
I think, for simplicity’s sake, let’s call the first neighbor Wet0.
That way we consider only two possibilities for that house: rain, or its sprinkler (Sprinkler0).
Also for simplicity’s sake, let’s write
P(A|B, W0, E) as Pwe(A|B), where Pwe is the probability measure in which E and W0 are already part of the background assumptions. That should keep the clutter out.
I think there should be an easier way to see Pwe(R | W1 W2 W3 W4… WN). I’ll come back to this one.
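In the meantime, here is a rough numerical sketch of the scenario above, with all numbers invented: even a strong prior prejudice against rain, say 2%, gets overwhelmed once every sidewalk on the block is wet.
Code:
# Arid-country prior: only 2% for rain, 98% for sprinkler (hypothetical).
p_rain, p_sprinkler = 0.02, 0.98
p_wet_rain     = 0.95          # P(all wet | Rain), assumed near 1
p_wet_neighbor = 3.0 / 7.0     # per-house sprinkling probability, assumed

for N in (1, 5, 10, 15):       # number of wet sidewalks observed
    like_rain      = p_wet_rain
    like_sprinkler = 0.9 * p_wet_neighbor ** (N - 1)
    post_rain = like_rain * p_rain / (like_rain * p_rain + like_sprinkler * p_sprinkler)
    print(N, round(post_rain, 4))
# Unless the prior for rain is exactly zero, enough wet sidewalks will flip it.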
