# Naive Bayes

*(Figure: left: prior – right: posterior)*

Quiz: Suppose you have a bag with three standard 6-sided dice with face values [1,2,3,4,5,6] and two non-standard 6-sided dice with face values [2,3,3,4,4,5]. Someone draws a die from the bag, rolls it, and announces it was a 3. What is the probability that the die that was rolled was a standard die? Input your answer as a fraction or as a decimal with at least three digits of precision.

First we compute the prior probability of each die:

P(\text{standard die}) = \frac{3}{5} \\
P(\text{other die}) = \frac{2}{5}

Next we compute, for each die, the likelihood of it showing a 3:

P(3 \mid \text{standard die}) = \frac{1}{6} \\
P(3 \mid \text{other die}) = \frac{2}{6}

Finally, everything is plugged into Bayes' formula:

P(\text{standard die} \mid 3) = \frac{\frac{3}{5} \cdot \frac{1}{6}}{\frac{3}{5} \cdot \frac{1}{6} + \frac{2}{5} \cdot \frac{2}{6}} = \frac{3}{7} \approx 0.429

Or as a Python script:

```python
p_normal = 3/5      # prior: standard die
p_anders = 2/5      # prior: non-standard die
p_normal_3 = 1/6    # likelihood of a 3 on the standard die
p_anders_3 = 2/6    # likelihood of a 3 on the non-standard die

P_A = (p_normal * p_normal_3) / ((p_normal * p_normal_3) + (p_anders * p_anders_3))
print(P_A)  # result: 0.42857142857142855
```

## Python – sklearn

With the `CountVectorizer` class from the Python module sklearn, you can easily build a so-called "bag of words".
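As an aside, the quiz result above (3/7) can also be checked empirically with a small Monte Carlo simulation. This sketch is not part of the original notes, and all variable names in it are made up here:

```python
import random

random.seed(0)

standard = [1, 2, 3, 4, 5, 6]
other = [2, 3, 3, 4, 4, 5]
bag = [standard] * 3 + [other] * 2  # three standard dice, two non-standard dice

rolls_of_3 = 0
standard_given_3 = 0
for _ in range(100_000):
    die = random.choice(bag)    # draw a die from the bag
    face = random.choice(die)   # roll it
    if face == 3:
        rolls_of_3 += 1
        if die is standard:
            standard_given_3 += 1

estimate = standard_given_3 / rolls_of_3
print(estimate)  # close to 3/7 ≈ 0.4286
```

With more trials, the estimate converges to the analytical answer 3/7.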
Here all the words in the texts are extracted and stored in a list:

```python
from sklearn.feature_extraction.text import CountVectorizer

documents1 = ['Hello, how are you!',
              'Win money, win from home.',
              'Call me now.',
              'Hello, Call hello you tomorrow?']

# common English stop words are not counted
count_vector = CountVectorizer(stop_words='english')
count_vector.fit(documents1)
print(count_vector.get_feature_names_out())
# Output: ['hello' 'home' 'money' 'tomorrow' 'win']
```

This bag-of-words list can be displayed in different ways:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

documents2 = ['Hello, how are you!',
              'Win money, win from home.',
              'Call me now.',
              'Hello, Call hello you tomorrow?']

count_vector = CountVectorizer(stop_words='english')
count_vector.fit(documents2)

print("\n get_feature_names_out")
column_names = count_vector.get_feature_names_out()
print(column_names)

# sparse (document, word) -> count representation
doc_matrix = count_vector.transform(documents2)
print("\n transform documents")
print(doc_matrix)

# dense array of word counts per document
doc_array = doc_matrix.toarray()
print("\n doc_array")
print(doc_array)

# labelled frequency table
frequency_matrix = pd.DataFrame(doc_array, columns=column_names)
print("\n frequency_matrix")
print(frequency_matrix)
```

## Python – Bayes Theorem simple example

We assume the following:

- P(D) is the probability of a person having diabetes. Its value is 0.01; in other words, 1% of the general population has diabetes (disclaimer: these values are assumptions and are not reflective of any actual medical study).
- P(Pos) is the probability of getting a positive test result.
- P(Neg) is the probability of getting a negative test result.
- P(Pos|D) is the probability of getting a positive result on a test done for detecting diabetes, given that you have diabetes. This has a value of 0.9; in other words, the test is correct 90% of the time.
This is also called the Sensitivity or True Positive Rate.
- P(Neg|~D) is the probability of getting a negative result on a test done for detecting diabetes, given that you do not have diabetes. This also has a value of 0.9 and is therefore correct 90% of the time. This is also called the Specificity or True Negative Rate.

Bayes' formula is as follows:

P(A|B) = P(B|A) * P(A) / P(B)

- P(A) is the prior probability of A occurring independently. In our example this is P(D). This value is given to us.
- P(B) is the prior probability of B occurring independently. In our example this is P(Pos).
- P(A|B) is the posterior probability that A occurs given B. In our example this is P(D|Pos). That is, the probability of an individual having diabetes, given that this individual got a positive test result. This is the value that we are looking to calculate.
- P(B|A) is the likelihood of B occurring, given A. In our example this is P(Pos|D). This value is given to us.

Putting our values into the formula for Bayes' theorem we get:

P(D|Pos) = P(D) * P(Pos|D) / P(Pos)

The probability of getting a positive test result, P(Pos), can be calculated using the Sensitivity and Specificity as follows:

P(Pos) = (P(D) * Sensitivity) + (P(~D) * (1 - Specificity))

Here the probability of getting a positive test result is computed:

```python
'''
Calculate probability of getting a positive test result, P(Pos)
'''
# P(D)
p_diabetes = 0.01

# P(~D)
p_no_diabetes = 0.99

# Sensitivity or P(Pos|D)
p_pos_diabetes = 0.9

# Specificity or P(Neg|~D)
p_neg_no_diabetes = 0.9

# P(Pos)
p_pos = (p_diabetes * p_pos_diabetes) + (p_no_diabetes * (1 - p_neg_no_diabetes))
print('The probability of getting a positive test result P(Pos) is: ', format(p_pos))
# Result: 0.10799999999999998
```

Now we compute the probability of an individual having diabetes, given that this individual got a positive test result, P(D|Pos).
The formula is:

P(D|Pos) = P(D) * P(Pos|D) / P(Pos)

```python
# P(D|Pos)
p_diabetes_pos = (p_diabetes * p_pos_diabetes) / p_pos
print('Probability of an individual having diabetes, '
      'given that that individual got a positive test result is: ', format(p_diabetes_pos))
# Result: 0.08333333333333336
```

Finally, we compute the probability of an individual not having diabetes, given that this individual got a positive test result, P(~D|Pos). The formula is:

P(~D|Pos) = P(~D) * P(Pos|~D) / P(Pos)

Note that P(Pos|~D) can be computed as 1 - P(Neg|~D), therefore:

P(Pos|~D) = 1 - 0.9 = 0.1

```python
# P(Pos|~D)
p_pos_no_diabetes = 0.1

# P(~D|Pos)
p_no_diabetes_pos = (p_no_diabetes * p_pos_no_diabetes) / p_pos
print('Probability of an individual not having diabetes, '
      'given that that individual got a positive test result is: ', p_no_diabetes_pos)
# Result: 0.9166666666666669
```

The analysis shows that even with a positive test result there is only an 8.33% probability that you actually have diabetes, and a 91.67% probability that you do not. This of course relies on the assumption that only 1% of the overall population has diabetes.

Links: images from the Udacity course "Intro to Machine Learning with TensorFlow"; unsere-schule.org
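The two posterior calculations above can be wrapped into one small reusable helper. This is just a sketch; the function name `bayes_posterior` and its parameter names are our own, not from the course material:

```python
def bayes_posterior(prior, sensitivity, specificity):
    """Return P(condition | positive test) via Bayes' theorem.

    prior       -- P(D), prevalence of the condition
    sensitivity -- P(Pos|D), true positive rate
    specificity -- P(Neg|~D), true negative rate
    """
    # total probability of a positive test: true positives + false positives
    p_pos = prior * sensitivity + (1 - prior) * (1 - specificity)
    return prior * sensitivity / p_pos

# Reproduces the numbers from the diabetes example:
p = bayes_posterior(prior=0.01, sensitivity=0.9, specificity=0.9)
print(p)      # P(D|Pos)  ≈ 0.0833
print(1 - p)  # P(~D|Pos) ≈ 0.9167
```

Since P(~D|Pos) = 1 - P(D|Pos), a single helper covers both results.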