Markdown update

MatteoFasulo · Mar 8, 2024 · a011423
1 parent 61376d6
commit a011423
Showing 1 changed file with 8 additions and 5 deletions.
diff --git a/notebook.ipynb b/notebook.ipynb
@@ -495,7 +495,7 @@
     "\n",
     "We have split the dataset into `train` and `test` for the sake of evaluating our classifier.\n",
     "\n",
-    "The training set contains $85%$ of the samples, while the test set contains $15%$ of the samples and the split is performed using the `train_test_split` function of the `sklearn` library. Thanks to the `stratify` parameter of the procedure, we ha mantained the same distribution of the target variable (`Heart disease`) in both the train and test set.\n",
+    "The training set contains $85%$ of the samples, while the test set contains $15%$ of the samples; the split is performed using the `train_test_split` function of the `sklearn` library. Thanks to the `stratify` parameter of the procedure, we have maintained the same distribution of the target variable (`Heart disease`) in both the train and test sets.\n",
     "\n",
     "The training set will be used to learn the parameters of the Bayesian Network, while the test set will be used to evaluate the performance of the model in terms of accuracy and other metrics."
    ]
@@ -685,7 +685,7 @@
     "# 5. Improving the Naïve Bayes network with Hill Climbing\n",
     "<a class=\"anchor\" id=\"ch5\"></a>\n",
     "\n",
-    "The first approach to learn the structure of the Bayesian Network is to use an **Hill Climbing** algorithm.\n",
+    "The first approach we used to learn the structure of the Bayesian Network is the **Hill Climbing** algorithm.\n",
     "\n",
     "The Hill Climbing algorithm is a greedy search algorithm which starts from an empty network (or an already built one) and adds or removes edges to maximize a score [[14]](#14). The score is usually the **Bayesian Information Criterion (BIC)**, which is a trade-off between the likelihood of the data and the complexity of the model. Since it is a *local search algorithm*, it strongly depends on the starting network, and there is a very high risk of getting stuck in a local maximum. To avoid this, the algorithm is usually run multiple times with different initializations and the best network is selected.\n",
     "\n",
@@ -815,7 +815,7 @@
    "id": "ec50e431",
    "metadata": {},
    "source": [
-    "From the comparison of the scoring parameter of the Hill Climbing below one can notice that they have an overall similar `ROC AUC`. However, the networks are different from each other and each network is not fully explainable. This means that:\n",
+    "Comparing the scoring methods used for the Hill Climbing above, one can notice that the resulting networks have an overall similar `ROC AUC`. However, the networks are different from each other and none of them is fully explainable. This means that:\n",
     "- the choice of the scoring method does not influence the performance much, even if the structures of the networks are different\n",
     "- since the semantics are problematic, we need to add some constraints through blacklisting/whitelisting. This means that the Hill Climbing will not explore the connections listed in the blacklist, while it will use the connections listed in the whitelist.\n",
     "\n",
@@ -1164,7 +1164,7 @@
     "\n",
     "- Using a structure learning algorithm on this relatively small dataset could lead to overfitting. We have seen that in the sixth paragraph.\n",
     "- In general, we think that the people who participate in this kind of survey are not a good representation of the whole population (for example, there isn't even one underage person). In other words, we think that the dataframe could be biased by definition, and letting an algorithm learn on such a biased dataset could lead to overfitting.\n",
-    "- Generally, a good `roc_auc` score is anything above 0.80. We assume that the network we are going to build can easily achieve that score (since even the Naive Bayes one can get these kind of results), and even if it doesn't reach the score obtained before, it is fully explainable and more general, and it can be used in a real world scenario.\n",
+    "- Generally, a good `roc_auc` score is anything above 0.80 [[31]](#31). We assume that the network we are going to build can easily achieve that score (since even the Naive Bayes one can get this kind of result), and even if it doesn't reach the score obtained before, it is fully explainable and more general, and it can be used in a real-world scenario.\n",
     "- Bayesian Networks are based on the conditional independence assumption: if some edge doesn't represent an actual cause-effect relation, the whole structure may be weak"
    ]
   },
@@ -2732,7 +2732,10 @@
     "Understanding Blood Pressure Readings - American Heart Association. https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings\n",
     "\n",
     "<a id=30>[30]</a>\n",
-    "Lipid Panel - hopkinsmedicine.org. https://www.hopkinsmedicine.org/health/treatment-tests-and-therapies/lipid-panel"
+    "Lipid Panel - hopkinsmedicine.org. https://www.hopkinsmedicine.org/health/treatment-tests-and-therapies/lipid-panel\n",
+    "\n",
+    "<a id=31>[31]</a>\n",
+    "Receiver Operating Characteristic Curve in Diagnostic Test Assessment - ScienceDirect. https://www.sciencedirect.com/science/article/pii/S1556086415306043"
    ]
   }
  ],