Naive Bayes

(Figure: left: prior, right: posterior)

Quiz:

Suppose you have a bag with three standard 6-sided dice with face values [1,2,3,4,5,6] and two non-standard 6-sided dice with face values [2,3,3,4,4,5]. Someone draws a die from the bag, rolls it, and announces it was a 3. What is the probability that the die that was rolled was a standard die?

Input your answer as a fraction or as a decimal with at least three digits of precision.

  1. First, we compute the prior probability of each die:
P(standard die) = \frac{3}{5}
\\
P(other die) = \frac{2}{5}
  2. Next, we compute for each die the likelihood of it showing a 3:
P(3 \mid standard die) = \frac{1}{6}
\\
P(3 \mid other die) = \frac{2}{6}
  3. Finally, everything is plugged into the Bayes formula:
P(standard die \mid 3) = \frac{\frac{3}{5} \cdot \frac{1}{6}}{\frac{3}{5} \cdot \frac{1}{6} + \frac{2}{5} \cdot \frac{2}{6}} = \frac{3}{7} \approx 0.4286

or as a Python script:

# priors
p_normal = 3/5
p_anders = 2/5

# likelihoods of rolling a 3
p_normal_3 = 1/6
p_anders_3 = 2/6

P_A = (p_normal*p_normal_3)/((p_normal*p_normal_3) + (p_anders*p_anders_3))
print(P_A)

# Result: 0.42857142857142855
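Since all the inputs are exact fractions, the same computation can be verified with Python's `fractions` module (a quick sanity check, not part of the original script):

```python
from fractions import Fraction

# same numbers as above, but as exact fractions
p_normal = Fraction(3, 5)    # prior for a standard die
p_anders = Fraction(2, 5)    # prior for a non-standard die
p_normal_3 = Fraction(1, 6)  # likelihood of a 3 on a standard die
p_anders_3 = Fraction(2, 6)  # likelihood of a 3 on a non-standard die

p_a = (p_normal * p_normal_3) / (p_normal * p_normal_3 + p_anders * p_anders_3)
print(p_a)  # 3/7
```

This confirms the decimal result above is exactly 3/7.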

Python – sklearn

Using the `CountVectorizer` class from the Python module sklearn, you can easily create a so-called "bag of words". All words are extracted from a text and stored in a list:

from sklearn.feature_extraction.text import CountVectorizer

documents1 = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']

count_vector = CountVectorizer(stop_words = 'english') # common English stop words are not counted
count_vector.fit(documents1)
print(count_vector.get_feature_names_out())

# Output:
['hello' 'home' 'money' 'tomorrow' 'win']

This bag-of-words list can be displayed in different ways:

from sklearn.feature_extraction.text import CountVectorizer
documents2 = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']

count_vector = CountVectorizer(stop_words = 'english')
count_vector.fit(documents2)
print("\n get_feature_names_out")
column_names = count_vector.get_feature_names_out()
print(column_names)
doc_matrix = count_vector.transform(documents2)  # sparse document-term matrix
print("\n transform documents")
print(doc_matrix)
doc_array = doc_matrix.toarray()
print("\n doc_array")
print(doc_array)

import pandas as pd
frequency_matrix = pd.DataFrame(doc_array)
frequency_matrix.columns = column_names
print("\n frequency_matrix")
print(frequency_matrix)
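As a sketch of how such a frequency matrix could feed an actual Naive Bayes classifier, sklearn's `MultinomialNB` can be trained directly on the transformed documents. The labels below are invented for illustration (1 = spam-like) and are not part of the original example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']
labels = [0, 1, 0, 0]  # hypothetical labels: 1 = spam-like, 0 = normal

count_vector = CountVectorizer(stop_words='english')
X = count_vector.fit_transform(documents)  # document-term frequency matrix

clf = MultinomialNB()
clf.fit(X, labels)

# classify a new document using the same fitted vectorizer
print(clf.predict(count_vector.transform(['win money now'])))  # [1]
```

The classifier combines the class priors with the per-word likelihoods, exactly as in the dice example above, just with one likelihood per vocabulary word.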

Python – Bayes Theorem simple example

We assume the following:

  • P(D) is the probability of a person having Diabetes. Its value is 0.01, or in other words, 1% of the general population has diabetes (disclaimer: these values are assumptions and are not reflective of any actual medical study).
  • P(Pos) is the probability of getting a positive test result.
  • P(Neg) is the probability of getting a negative test result.
  • P(Pos|D) is the probability of getting a positive result on a test done for detecting diabetes, given that you have diabetes. This has a value 0.9. In other words the test is correct 90% of the time. This is also called the Sensitivity or True Positive Rate.
  • P(Neg|~D) is the probability of getting a negative result on a test done for detecting diabetes, given that you do not have diabetes. This also has a value of 0.9 and is therefore correct 90% of the time. This is also called the Specificity or True Negative Rate.

The Bayes formula is as follows:

P(A|B) = P(B|A) * P(A) / P(B)

  • P(A) is the prior probability of A occurring independently. In our example this is P(D). This value is given to us.
  • P(B) is the prior probability of B occurring independently. In our example this is P(Pos).
  • P(A|B) is the posterior probability that A occurs given B. In our example this is P(D|Pos). That is, the probability of an individual having diabetes, given that this individual got a positive test result. This is the value that we are looking to calculate.
  • P(B|A) is the conditional probability of B occurring, given A (the likelihood). In our example this is P(Pos|D). This value is given to us.

Putting our values into the formula for Bayes theorem we get:

P(D|Pos) = P(D) * P(Pos|D) / P(Pos)

The probability of getting a positive test result P(Pos) can be calculated using the Sensitivity and Specificity as follows:

P(Pos) = (P(D) * Sensitivity) + (P(~D) * (1 - Specificity))

Here we calculate the probability of getting a positive test result:

'''
Calculate probability of getting a positive test result, P(Pos)
'''
# P(D)
p_diabetes = 0.01

# P(~D)
p_no_diabetes = 0.99

# Sensitivity or P(Pos|D)
p_pos_diabetes = 0.9

# Specificity or P(Neg|~D)
p_neg_no_diabetes = 0.9

# P(Pos)
p_pos = (p_diabetes * p_pos_diabetes) + (p_no_diabetes * (1-p_neg_no_diabetes))

print('The probability of getting a positive test result P(Pos) is: ',format(p_pos))

# Result: 0.10799999999999998

Now we calculate the probability of having diabetes, given a positive test result:

'''
Compute the probability of an individual having diabetes, given that the individual got a positive test result.
In other words, compute P(D|Pos).

The formula is: P(D|Pos) = P(D) * P(Pos|D) / P(Pos)
'''
# P(D|Pos)
p_diabetes_pos = p_diabetes * p_pos_diabetes / p_pos
print('Probability of an individual having diabetes, given that that individual got a positive test result is:\
',format(p_diabetes_pos)) 

# Result: 0.08333333333333336

Finally, we calculate the probability of not having diabetes, given a positive test result:

'''
Compute the probability of an individual not having diabetes, given that the individual got a positive test result.
In other words, compute P(~D|Pos).

The formula is: P(~D|Pos) = P(~D) * P(Pos|~D) / P(Pos)

Note that P(Pos|~D) can be computed as 1 - P(Neg|~D). 

Therefore:
P(Pos|~D) = p_pos_no_diabetes = 1 - 0.9 = 0.1
'''
# P(Pos|~D)
p_pos_no_diabetes = 0.1

# P(~D|Pos)
p_no_diabetes_pos = p_no_diabetes * p_pos_no_diabetes / p_pos
print ('Probability of an individual not having diabetes, given that that individual got a positive test result is:'\
,p_no_diabetes_pos)

# Result: 0.9166666666666669

The analysis shows that even with a positive test result, there is only an 8.33% probability that you actually have diabetes, and a 91.67% probability that you do not. This of course assumes that only 1% of the overall population has diabetes, which is merely an assumption.
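As a sanity check, the two posteriors from the example above must sum to one. A compact recomputation, using the same values as before:

```python
# values from the example above
p_diabetes = 0.01          # P(D)
p_no_diabetes = 0.99       # P(~D)
sensitivity = 0.9          # P(Pos|D)
specificity = 0.9          # P(Neg|~D)

p_pos = p_diabetes * sensitivity + p_no_diabetes * (1 - specificity)
p_diabetes_pos = p_diabetes * sensitivity / p_pos
p_no_diabetes_pos = p_no_diabetes * (1 - specificity) / p_pos

print(p_diabetes_pos + p_no_diabetes_pos)  # ~1.0
```

Since a positive result must come either from someone with diabetes or from someone without it, P(D|Pos) and P(~D|Pos) together cover all cases.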
