Bayes' Theorem

The Bayesian thing is pretty cool once you wrap your head around it...presuming that I have... To start, I found a nice page that describes what it means from my particularly practical standpoint.

Basic Bayes Background

This is pretty much verbatim from section 13.5.1 of AIMA, page 496. I'm going to use disease for cause and symptom for effect because the subsequent examples are medically motivated. The Bayesian equation is:

P(disease | symptom) = (P(symptom | disease) * P(disease)) / P(symptom)

Which reads as:

The Probability of having a disease, given that you have a symptom
                                       EQUALS
The Probability of having that symptom, given that you have that disease
                                       TIMES
The Probability of that disease
                                       (both) DIVIDED BY
The Probability of having that symptom

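In the meningitis example given, with a stiff neck as the symptom, this becomes:

  P(symptom | disease) = 0.7
  P(disease)           = 0.00002
  P(symptom)           = 0.01

Plugging those numbers into the Bayes equation gives: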

(0.7 * 0.00002) / 0.01 =  0.0014

So about 0.14% of the people who have stiff necks also have meningitis.
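
As a quick check on the arithmetic, here is a minimal Python sketch of the equation. The function name bayes and its parameter names are made up for illustration; they do not come from AIMA.

```python
def bayes(p_symptom_given_disease, p_disease, p_symptom):
    """Return P(disease | symptom) using Bayes' rule.

    All names here are illustrative, not taken from AIMA.
    """
    if p_symptom <= 0:
        raise ValueError("P(symptom) must be positive")
    return p_symptom_given_disease * p_disease / p_symptom


# Meningitis numbers from the calculation above.
p_meningitis_given_stiff_neck = bayes(0.7, 0.00002, 0.01)
print(f"{p_meningitis_given_stiff_neck:.4f}")  # 0.0014
```

The same function works for any cause/effect pair, as long as the three inputs describe the same population.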

Running the Numbers

The Yudkowsky page is way long. Reduced to the minimum it says that, for a seemingly binary test, there are four possible results:

  1. True Positives -- Positive results that are correct;
  2. False Negatives -- Negative results that are actual positives;
  3. False Positives -- Positive results that are actual negatives;
  4. True Negatives -- Negative results that are correct.

In order to figure out what a test result means you really need to know the False bits, plus one other piece of information: the expected results, or Prior. For the Breast Cancer example on the Yudkowsky page it goes like this:
1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening.  What is the probability that she actually has breast cancer?
To recap:
  1. Actual rate in the population: 1%   = 0.01
  2. True Positives:                80%  = 0.80
  3. False Negatives:               20%  = 0.20
  4. False Positives:               9.6% = 0.096
Out of a Population of 10,000 where the Rate is 0.01, there should be 100 who are Real Positives and 9900 who are Real Negatives, and the test results will be:

  1. True Positives:  0.80  * 100  =   80
  2. False Negatives: 0.20  * 100  =   20
  3. False Positives: 0.096 * 9900 =  950 (950.4, rounded)
  4. True Negatives:  0.904 * 9900 = 8950 (8949.6, rounded)

(Note that the numbers magically add up to 10,000!) So, 1030 [80 + 950] will be Positive Tests, but a positive result is really positive only about [(80 / 1030) = 0.0776] ~7.8% of the time. Without knowing the "expected prior probability" of the result we cannot evaluate its actual probability.

This also works nicely in the degenerate cases. If you have a perfect test with 0% False Positive/Negative results, you get the expected 100 True Positives and 9900 True Negatives. And if you have 0% Real Positives in the population you get the expected 960 False Positives and 9040 True Negatives.
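
The head counts can be reproduced with a short Python sketch. The names screening_counts and chance_positive_is_real are hypothetical, and the code keeps fractional people (950.4 rather than 950), so the ratio comes out a hair different from the rounded figure.

```python
def screening_counts(population, prior, sensitivity, false_positive_rate):
    """Expected head counts for the four outcomes of a binary test.

    sensitivity is the True Positive rate; all names are illustrative.
    """
    real_positives = population * prior
    real_negatives = population - real_positives
    true_positives = sensitivity * real_positives
    false_positives = false_positive_rate * real_negatives
    return {
        "true_positives": true_positives,
        "false_negatives": real_positives - true_positives,
        "false_positives": false_positives,
        "true_negatives": real_negatives - false_positives,
    }


def chance_positive_is_real(counts):
    """Fraction of Positive Tests that are Real Positives."""
    positive_tests = counts["true_positives"] + counts["false_positives"]
    return counts["true_positives"] / positive_tests


# Mammography example: 10,000 women, 1% prior, 80% hit rate, 9.6% false alarms.
mammography = screening_counts(10_000, 0.01, 0.80, 0.096)
print(f"{chance_positive_is_real(mammography):.3f}")  # 0.078

# Degenerate cases: a perfect test, and a population with no Real Positives.
perfect_test = screening_counts(10_000, 0.01, 1.0, 0.0)
no_disease = screening_counts(10_000, 0.0, 0.80, 0.096)
```

Here perfect_test holds 100 True Positives and 9900 True Negatives, and no_disease holds 960 False Positives and 9040 True Negatives, matching the degenerate cases.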

So, why's that Bayesian?

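Let's do it another way then. Here's what we know in a different light:

  P(symptom | disease) = 0.80   (the True Positive rate)
  P(disease)           = 0.01   (the rate in the population)
  P(symptom)           = 0.103  (1030 Positive Tests out of 10,000)

Plugging that into Bayes we get: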

(P(symptom | disease) * P(disease)) / P(symptom) = (0.80 * 0.01) / 0.103

Which, incredibly enough, gives us:

P(disease | symptom) = 0.0776  or 7.8% of Test Positives who are Real Positives (!!!)
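
The denominator does not have to come from counting heads. A minimal sketch, assuming P(symptom) is built from the law of total probability (sick people who test positive plus healthy people who test positive); the function name posterior is hypothetical:

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """P(disease | positive), with P(positive) from total probability."""
    p_positive = (p_pos_given_disease * prior
                  + p_pos_given_healthy * (1.0 - prior))
    return p_pos_given_disease * prior / p_positive


# Mammography example: P(positive) works out to 0.10304 before rounding.
print(f"{posterior(0.01, 0.80, 0.096):.3f}")  # 0.078
```

This gives the same ~7.8% without picking a population size first, which is why the counting method and the Bayes equation agree.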

Finding the Prior

Estimating the expected Prior probability gives you a better handle on the problem, and a way to revise the actual results. But I think you can use this all to calculate the Real Positive value when it is not known. Like this.

The things we know are:
  1. Population = 10,000
  2. PositiveTest = FalsePositive + TruePositive = 1030
  3. FalsePositive = 0.096 * (Population - RealPositive) = ((0.096 * Population) - (0.096 * RealPositive))
  4. TruePositive = 0.80 * RealPositive
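
Substituting 3 and 4 into 2 and solving for RealPositive:

PositiveTest = ((0.096 * Population) - (0.096 * RealPositive)) + (0.80 * RealPositive)
PositiveTest - (0.096 * Population) = RealPositive * (0.80 - 0.096)
RealPositive = (PositiveTest - (0.096 * Population)) / (0.80 - 0.096)
RealPositive = (1030 - (0.096 * 10,000)) / 0.704 = 99.43 ~= 100

The small gap from 100 comes from rounding the False Positives down from 950.4 to 950; with PositiveTest = 1030.4 the result is exactly 100.
!!! I think I did that right anyway !!!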
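
The same rearrangement as a minimal Python sketch. The name estimate_real_positives is hypothetical, and the guard against equal rates is an addition: if the True Positive and False Positive rates match, the test says nothing and the division blows up.

```python
def estimate_real_positives(positive_tests, population,
                            sensitivity, false_positive_rate):
    """Solve PositiveTest = fpr * (Population - Real) + sens * Real for Real."""
    spread = sensitivity - false_positive_rate
    if spread == 0:
        raise ValueError("equal rates: the test carries no information")
    return (positive_tests - false_positive_rate * population) / spread


# With the rounded count of 1030 Positive Tests.
print(round(estimate_real_positives(1030, 10_000, 0.80, 0.096), 2))  # 99.43

# With the unrounded 1030.4 the estimate lands on 100, give or take float noise.
unrounded_estimate = estimate_real_positives(1030.4, 10_000, 0.80, 0.096)
```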

Just for the Exercise

Chapter 13 of the AIMA book covers this sort of probability reasoning and Exercise 13.15, AIMA page 508, is exactly this problem:
After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease and that the test is 99% accurate (i.e., the probability of testing positive when you do have the disease is 0.99, as is the probability of testing negative when you don't have the disease). The good news is that this is a rare disease, striking only 1 in 10,000 people of your age. What are the chances that you actually have the disease?
To recap:
  1. Actual rate in the population: 1/10,000 = 0.0001
  2. True Positives:  99% = 0.99
  3. True Negatives:  99% = 0.99
  4. False Positives: 1%  = 0.01
  5. False Negatives: 1%  = 0.01
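Out of a Population of 1,000,000 where the Rate is 0.0001, there should be 100 who are Real Positive and 999,900 who are Real Negative. The test results will be:

  1. True Positives:  0.99 * 100     =      99
  2. False Negatives: 0.01 * 100     =       1
  3. False Positives: 0.01 * 999,900 =   9,999
  4. True Negatives:  0.99 * 999,900 = 989,901

So 10,098 [99 + 9,999] will be Positive Tests, and the probability of being a Real Positive is 99/10,098 ~= 0.0098 or 0.98%. Why bother going to the doctor at all?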
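
For the exercise the numbers are clean enough to do exactly. A minimal sketch using Python's fractions module; the name posterior_exact is hypothetical, and specificity here means the True Negative rate:

```python
from fractions import Fraction


def posterior_exact(prior, sensitivity, specificity):
    """P(disease | positive) using exact rational arithmetic."""
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)


# Exercise 13.15: 1 in 10,000 prior, 99% True Positives, 99% True Negatives.
answer = posterior_exact(Fraction(1, 10_000), Fraction(99, 100), Fraction(99, 100))
print(answer, float(answer))  # 1/102, about 0.0098
```

So 99/10,098 reduces to exactly 1 in 102.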