WEBVTT
00:00:00.000 --> 00:00:06.000
Kylie Ying has worked at many interesting places such as MIT, CERN, and Free Code Camp.
00:00:06.000 --> 00:00:10.880
She's a physicist, engineer, and basically a genius. And now she's going to teach you
00:00:10.880 --> 00:00:14.720
about machine learning in a way that is accessible to absolute beginners.
00:00:15.280 --> 00:00:21.600
What's up you guys? So welcome to Machine Learning for Everyone. If you are someone who
00:00:21.600 --> 00:00:27.520
is interested in machine learning and you think you fall under "everyone," then this video
00:00:27.520 --> 00:00:33.040
is for you. In this video, we'll talk about supervised and unsupervised learning models,
00:00:33.040 --> 00:00:39.200
we'll go through maybe a little bit of the logic or math behind them, and then we'll also see how
00:00:39.200 --> 00:00:46.960
we can program it on Google Colab. If there are certain things that I have done, and you know,
00:00:46.960 --> 00:00:50.960
you're somebody with more experience than me, please feel free to correct me in the comments
00:00:50.960 --> 00:00:58.000
and we can all as a community learn from this together. So with that, let's just dive right in.
00:00:58.000 --> 00:01:02.160
Without wasting any time, let's just dive straight into the code and I will be teaching you guys
00:01:02.160 --> 00:01:11.040
concepts as we go. So this here is the UCI machine learning repository. And basically,
00:01:11.040 --> 00:01:15.280
they just have a ton of data sets that we can access. And I found this really cool one called
00:01:15.280 --> 00:01:22.560
the magic gamma telescope data set. So in this data set, if you want to read all this information,
00:01:22.560 --> 00:01:28.320
to summarize what I think is going on: there's this gamma telescope, and we have all
00:01:28.320 --> 00:01:34.240
these high energy particles hitting the telescope. Now there's a camera, there's a detector that
00:01:34.240 --> 00:01:40.400
actually records certain patterns of you know, how this light hits the camera. And we can use
00:01:40.400 --> 00:01:46.640
properties of those patterns in order to predict what type of particle caused that radiation. So
00:01:46.640 --> 00:01:54.880
whether it was a gamma particle, or something else, like a hadron. Down here, these are all of
00:01:54.880 --> 00:02:00.000
the attributes of those patterns that we collect in the camera. So you can see that there's, you
00:02:00.000 --> 00:02:06.480
know, some length, width, size, asymmetry, etc. Now we're going to use all these properties to
00:02:06.480 --> 00:02:12.400
help us discriminate the patterns and whether or not they came from a gamma particle or hadron.
00:02:13.200 --> 00:02:19.520
So in order to do this, we're going to come up here, go to the data folder. And you're going
00:02:19.520 --> 00:02:28.240
to click this magic04.data file, and we're going to download that. Now over here, I have a Colab
00:02:28.240 --> 00:02:34.320
notebook open. So you go to colab dot research dot google.com, you start a new notebook. And
00:02:34.320 --> 00:02:43.120
I'm just going to call this the magic data set. So actually, I'm going to call this for code camp
00:02:43.120 --> 00:02:52.240
magic example. Okay. So with that, I'm going to first start with some imports. So I will import,
00:02:52.240 --> 00:03:04.560
you know, I always import NumPy, I always import pandas. And I always import matplotlib.
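A minimal sketch of that setup cell, assuming the conventional aliases (matplotlib.pyplot as plt, since plt is used later on):

    # standard imports used throughout the notebook
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
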
00:03:06.080 --> 00:03:11.360
And then we'll import other things as we go. So yeah,
00:03:14.080 --> 00:03:19.200
we run that. In order to run the cell, you can either click this play button here, or you can,
00:03:19.200 --> 00:03:24.320
on my computer, it's just shift enter, and that will run the cell. And here, I'm just going
00:03:24.320 --> 00:03:29.120
to, you know, let you guys know, okay, this is where I found the data set.
00:03:30.000 --> 00:03:34.080
So I've copied and pasted this actually, but this is just where I found the data set.
00:03:35.200 --> 00:03:40.640
And in order to import that downloaded file that we got onto the computer, we're going to go
00:03:40.640 --> 00:03:49.120
over here to this folder thing. And I am literally just going to drag and drop that file into here.
00:03:50.800 --> 00:03:55.840
Okay. So in order to take a look at, you know, what does this file consist of,
00:03:55.840 --> 00:03:59.840
do we have the labels? Do we not? I mean, we could open it on our computer, but we can also just do
00:04:00.960 --> 00:04:06.640
pandas read CSV. And we can pass in the name of this file.
00:04:06.640 --> 00:04:14.560
And let's see what it returns. So it doesn't seem like we have the label. So let's go back to here.
00:04:16.160 --> 00:04:23.600
I'm just going to make the columns, the column labels, all of these attribute names over here.
00:04:23.600 --> 00:04:29.120
So I'm just going to take these values and make that the column names.
00:04:29.120 --> 00:04:36.080
All right, how do I do that? So basically, I will come back here, and I will create a list called
00:04:36.080 --> 00:04:50.560
cols. And I will type in all of those things. With f size, f conc. And we also have f conc one.
00:04:50.560 --> 00:05:06.080
We have f symmetry, f m three long, f m three trans, f alpha. Let's see, we have f dist and class.
00:05:09.840 --> 00:05:16.640
Okay, great. Now in order to label those as these columns down here in our data frame.
00:05:16.640 --> 00:05:22.880
So basically, this command here just reads some CSV file that you pass in. CSV stands for comma
00:05:22.880 --> 00:05:31.520
separated values, and turns that into a pandas data frame object. So now if I pass in names here,
00:05:31.520 --> 00:05:38.800
then it basically assigns these labels to the columns of this data set. So I'm going to set
00:05:38.800 --> 00:05:44.960
this data frame equal to df. And then if we call head, it's just like,
00:05:44.960 --> 00:05:50.800
give me the first five things. Now you'll see that we have labels for all of these. Okay.
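A sketch of the cells just described, assuming the downloaded file keeps its original name, magic04.data; the column names follow the UCI attribute list:

    # column names taken from the UCI attribute description
    cols = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
            "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

    # the raw file has no header row, so attach the names ourselves
    df = pd.read_csv("magic04.data", names=cols)
    df.head()  # first five rows, now with labeled columns
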
00:05:52.000 --> 00:05:57.520
All right, great. So one thing that you might notice is that over here, the class labels,
00:05:57.520 --> 00:06:05.280
we have G and H. So if I actually go down here, and I do data frame class unique,
00:06:07.200 --> 00:06:11.520
you'll see that I have either G's or H's, and these stand for gammas or hadrons.
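A sketch of that check, plus the zero/one conversion walked through over the next few cues (note the letters are lowercase "g" and "h" in the raw file, so match on those):

    df["class"].unique()  # e.g. array(['g', 'h'], dtype=object)

    # 1 where the class is gamma, 0 where it's hadron
    df["class"] = (df["class"] == "g").astype(int)
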
00:06:11.520 --> 00:06:17.440
And our computer is not so good at understanding letters, right? Our computer is really good at
00:06:17.440 --> 00:06:23.280
understanding numbers. So what we're going to do is we're going to convert this to zero for G and
00:06:23.280 --> 00:06:35.680
one for H. So here, I'm going to set this equal to this, whether or not that equals G. And then
00:06:35.680 --> 00:06:42.560
I'm just going to say as type int. So what this should do is convert this entire column,
00:06:43.360 --> 00:06:48.720
if it equals G, then this is true. So I guess that would be one. And then if it's H, it would
00:06:48.720 --> 00:06:52.800
be false. So that would be zero, but I'm just converting G and H to one and zero, it doesn't
00:06:52.800 --> 00:07:02.240
really matter. Like, if G is one and H is zero or vice versa. Let me just take a step back right
00:07:02.240 --> 00:07:09.440
now and talk about this data set. So here I have some data frame, and I have all of these different
00:07:09.440 --> 00:07:18.240
values for each entry. Now, you know, each of these is one sample, it's one example,
00:07:18.240 --> 00:07:23.200
it's one item in our data set, it's one data point, all of these things are kind of the same
00:07:23.200 --> 00:07:29.120
thing when I mentioned, oh, this is one example, or this is one sample or whatever. Now, each of
00:07:29.120 --> 00:07:36.240
these samples, they have, you know, one value for each of these labels
00:07:36.240 --> 00:07:41.600
up here, and then it has the class. Now what we're going to do in this specific example is try to
00:07:41.600 --> 00:07:50.800
predict for future, you know, samples, whether the class is G for gamma or H for hadron. And
00:07:50.800 --> 00:08:00.320
that is something known as classification. Now, all of these up here, these are known as our features,
00:08:00.320 --> 00:08:05.760
and features are just things that we're going to pass into our model in order to help us predict
00:08:05.760 --> 00:08:12.880
the label, which in this case is the class column. So for you know, sample zero, I have
00:08:14.240 --> 00:08:19.520
10 different features. So I have 10 different values that I can pass into some model.
00:08:19.520 --> 00:08:26.720
And I can spit out, you know, the class the label, and I know the true label here is G. So this is
00:08:26.720 --> 00:08:35.440
actually supervised learning. All right. So before I move on, let me just give you a quick
00:08:35.440 --> 00:08:43.360
little crash course on what I just said. This is machine learning for everyone. Well, the first
00:08:43.360 --> 00:08:49.760
question is, what is machine learning? Well, machine learning is a sub domain of computer science
00:08:49.760 --> 00:08:56.000
that focuses on certain algorithms, which might help a computer learn from data, without a
00:08:56.000 --> 00:09:01.360
programmer being there telling the computer exactly what to do. That's what we call explicit
00:09:01.360 --> 00:09:08.480
programming. So you might have heard of AI and ML and data science, what is the difference between
00:09:08.480 --> 00:09:14.720
all of these. So AI is artificial intelligence. And that's an area of computer science, where the
00:09:14.720 --> 00:09:22.080
goal is to enable computers and machines to perform human like tasks and simulate human behavior.
00:09:23.600 --> 00:09:31.600
Now machine learning is a subset of AI that tries to solve one specific problem and make predictions
00:09:31.600 --> 00:09:39.840
using certain data. And data science is a field that attempts to find patterns and draw insights
00:09:39.840 --> 00:09:45.840
from data. And that might mean we're using machine learning. So all of these fields kind of overlap,
00:09:45.840 --> 00:09:52.560
and all of them might use machine learning. So there are a few types of machine learning.
00:09:52.560 --> 00:09:58.400
The first one is supervised learning. And in supervised learning, we're using labeled inputs.
00:09:58.400 --> 00:10:05.360
So this means whatever input we get, we have a corresponding output label, in order to train
00:10:05.360 --> 00:10:12.960
models and to learn outputs of different new inputs that we might feed our model. So for example,
00:10:12.960 --> 00:10:19.040
I might have these pictures, okay, to a computer, all these pictures are just pixels, they're pixels
00:10:19.040 --> 00:10:27.440
with a certain color. Now in supervised learning, all of these inputs have a label associated with
00:10:27.440 --> 00:10:32.880
them, this is the output that we might want the computer to be able to predict. So for example,
00:10:32.880 --> 00:10:39.200
over here, this picture is a cat, this picture is a dog, and this picture is a lizard.
00:10:41.600 --> 00:10:47.840
Now there's also unsupervised learning. And in unsupervised learning, we use unlabeled data
00:10:47.840 --> 00:10:57.920
to learn about patterns in the data. So here are my input data points. Again, they're just
00:10:57.920 --> 00:11:04.960
images, they're just pixels. Well, okay, let's say I have a bunch of these different pictures.
00:11:05.760 --> 00:11:09.920
And what I can do is I can feed all these to my computer. And I might not, you know,
00:11:09.920 --> 00:11:14.480
my computer is not going to be able to say, Oh, this is a cat, dog and lizard in terms of,
00:11:14.480 --> 00:11:19.680
you know, the output. But it might be able to cluster all these pictures, it might say,
00:11:19.680 --> 00:11:26.080
Hey, all of these have something in common. All of these have something in common. And then these
00:11:26.080 --> 00:11:31.680
down here have something in common, that's finding some sort of structure in our unlabeled data.
00:11:33.680 --> 00:11:40.160
And finally, we have reinforcement learning. And in reinforcement learning, well, usually
00:11:40.160 --> 00:11:46.480
there's an agent that is learning in some sort of interactive environment, based on rewards and
00:11:46.480 --> 00:11:54.720
penalties. So let's think of a dog, we can train our dog, but there's not necessarily, you know,
00:11:54.720 --> 00:12:02.880
any wrong or right output at any given moment, right? Well, let's pretend that dog is a computer.
00:12:03.600 --> 00:12:08.240
Essentially, what we're doing is we're giving rewards to our computer, and telling our computer,
00:12:08.240 --> 00:12:15.200
Hey, this is probably something good that you want to keep doing. Well, in this terminology, the computer is the agent.
00:12:16.880 --> 00:12:21.760
But in this class today, we'll be focusing on supervised learning and unsupervised learning
00:12:21.760 --> 00:12:29.120
and learning different models for each of those. Alright, so let's talk about supervised learning
00:12:29.120 --> 00:12:35.120
first. So this is kind of what a machine learning model looks like you have a bunch of inputs
00:12:35.120 --> 00:12:40.960
that are going into some model. And then the model is spitting out an output, which is our prediction.
00:12:41.920 --> 00:12:48.400
So all these inputs, this is what we call the feature vector. Now there are different types
00:12:48.400 --> 00:12:53.920
of features that we can have, we might have qualitative features. And qualitative means
00:12:53.920 --> 00:13:01.360
categorical data, there's either a finite number of categories or groups. So one example of a
00:13:01.360 --> 00:13:07.440
qualitative feature might be gender. And in this case, there's only two here, it's for the sake of
00:13:07.440 --> 00:13:13.200
the example, I know this might be a little bit outdated. Here we have a girl and a boy, there are
00:13:13.200 --> 00:13:19.840
two genders, there are two different categories. That's a piece of qualitative data. Another
00:13:19.840 --> 00:13:25.600
example might be okay, we have, you know, a bunch of different nationalities, maybe a nationality or
00:13:25.600 --> 00:13:33.280
a nation or a location, that might also be an example of categorical data. Now, in both of
00:13:33.280 --> 00:13:43.200
these, there's no inherent order. It's not like, you know, we can rank the US one and France two, Japan
00:13:43.200 --> 00:13:51.840
three, etc. Right? There's not really any inherent order built into either of these categorical
00:13:51.840 --> 00:14:00.240
data sets. That's why we call this nominal data. Now, for nominal data, the way that we want
00:14:00.240 --> 00:14:06.640
to feed it into our computer is using something called one hot encoding. So let's say that, you
00:14:06.640 --> 00:14:13.120
know, I have a data set, some of the items in our data, some of the inputs might be from the US,
00:14:13.120 --> 00:14:19.200
some might be from India, then Canada, then France. Now, how do we get our computer to recognize that
00:14:19.200 --> 00:14:24.560
we have to do something called one hot encoding. And basically, one hot encoding is saying, okay,
00:14:24.560 --> 00:14:30.240
well, if it matches some category, make that a one. And if it doesn't just make that a zero.
00:14:31.120 --> 00:14:40.160
So for example, if your input were from the US, you might have [1, 0, 0, 0]. India, you know,
00:14:40.160 --> 00:14:46.880
[0, 1, 0, 0]. Canada, okay, well, the item representing Canada is one, and then France, the item representing
00:14:46.880 --> 00:14:52.240
France is one. And then you can see that the rest are zeros, that's one hot encoding.
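A small, made-up example of one-hot encoding with pandas (the country values here are just for illustration, not from the MAGIC data set):

    # a tiny nominal feature
    nations = pd.Series(["US", "India", "Canada", "France", "US"])

    # one indicator column per category: 1 (or True in newer pandas versions)
    # where the row matches that category, 0 everywhere else
    one_hot = pd.get_dummies(nations)
    print(one_hot)
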
00:14:54.480 --> 00:15:00.480
Now, there are also a different type of qualitative feature. So here on the left,
00:15:00.480 --> 00:15:07.440
there are different age groups, there's babies, toddlers, teenagers, young adults,
00:15:08.640 --> 00:15:15.840
adults, and so on, right. And on the right hand side, we might have different ratings. So maybe
00:15:15.840 --> 00:15:26.160
bad, not so good, mediocre, good, and then like, great. Now, these are known as ordinal pieces of
00:15:26.160 --> 00:15:33.600
data, because they have some sort of inherent order, right? Like, being a toddler is a lot closer to
00:15:33.600 --> 00:15:41.680
being a baby than being an elderly person, right? Or good is closer to great than it is to really
00:15:41.680 --> 00:15:48.560
bad. So these have some sort of inherent ordering system. And so for these types of data sets,
00:15:48.560 --> 00:15:54.400
we can actually just mark them from, you know, one to five, or we can just say, hey, for each of these,
00:15:54.400 --> 00:16:02.960
let's give it a number. And this makes sense. Because, like, for example, the thing that I
00:16:02.960 --> 00:16:09.760
just said, how good is closer to great than good is close to not good at all. Well, four is closer
00:16:09.760 --> 00:16:14.560
to five than four is close to one. So this actually kind of makes sense. And it'll make sense for the
00:16:14.560 --> 00:16:22.400
computer as well. Alright, there are also quantitative pieces of data and quantitative
00:16:22.960 --> 00:16:29.040
pieces of data are numerical valued pieces of data. So this could be discrete, which means,
00:16:29.040 --> 00:16:34.160
you know, they might be integers, or it could be continuous, which means all real numbers.
00:16:34.160 --> 00:16:40.800
So for example, the length of something is a quantitative piece of data, it's a quantitative
00:16:40.800 --> 00:16:46.560
feature, the temperature of something is a quantitative feature. And then maybe how many
00:16:46.560 --> 00:16:53.680
Easter eggs I collected in my basket, this Easter egg hunt, that is an example of discrete quantitative
00:16:53.680 --> 00:17:02.080
feature. Okay, so these are continuous. And this one over here is discrete. So those are the things
00:17:02.080 --> 00:17:08.400
that go into our feature vector, those are our features that we're feeding this model, because
00:17:08.400 --> 00:17:14.800
our computers are really, really good at understanding math, right at understanding numbers,
00:17:14.800 --> 00:17:19.680
they're not so good at understanding things that humans might be able to understand.
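And a small, made-up sketch of the ordinal encoding described a little earlier: ordered categories map straight to integers that preserve their order (the exact mapping is just an illustration):

    ratings = pd.Series(["bad", "good", "great", "not so good", "mediocre"])
    order = {"bad": 1, "not so good": 2, "mediocre": 3, "good": 4, "great": 5}
    print(ratings.map(order))  # ordered integers the model can work with
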
00:17:21.760 --> 00:17:29.680
Well, what are the types of predictions that our model can output? So in supervised learning,
00:17:29.680 --> 00:17:35.440
there are some different tasks, there's one classification, and basically classification,
00:17:35.440 --> 00:17:42.000
just saying, okay, predict discrete classes. And that might mean, you know, this is a hot dog,
00:17:42.800 --> 00:17:48.640
this is a pizza, and this is ice cream. Okay, so there are three distinct classes and any other
00:17:48.640 --> 00:17:56.480
pictures of hot dogs, pizza or ice cream, I can put under these labels. Hot dog, pizza, ice cream.
00:17:56.480 --> 00:18:03.440
Hot dog, pizza, ice cream. This is something known as multi class classification. But there's also
00:18:03.440 --> 00:18:10.640
binary classification. And binary classification, you might have hot dog, or not hot dog. So there's
00:18:10.640 --> 00:18:14.240
only two categories that you're working with: something that is something, and something that
00:18:14.240 --> 00:18:23.680
isn't. That's binary classification. Okay, so yeah, other examples. So if something has positive or negative
00:18:23.680 --> 00:18:28.960
sentiment, that's binary classification. Maybe you're predicting whether your pictures are of cats or
00:18:28.960 --> 00:18:35.040
dogs. That's binary classification. Maybe, you know, you are writing an email filter, and you're
00:18:35.040 --> 00:18:40.560
trying to figure out if an email is spam or not spam. So that's also binary classification.
00:18:41.760 --> 00:18:46.240
Now for multi class classification, you might have, you know, cat, dog, lizard, dolphin, shark,
00:18:46.960 --> 00:18:53.520
rabbit, etc. We might have different types of fruits like orange, apple, pear, etc. And then
00:18:53.520 --> 00:18:59.440
maybe different plant species. But multi class classification just means more than two. Okay,
00:18:59.440 --> 00:19:06.320
and binary means we're predicting between two things. There's also something called regression
00:19:06.320 --> 00:19:11.360
when we talk about supervised learning. And this just means we're trying to predict continuous
00:19:11.360 --> 00:19:15.760
values. So instead of just trying to predict different categories, we're trying to come up
00:19:15.760 --> 00:19:24.400
with a number that, you know, is on some sort of scale. So some examples might
00:19:24.400 --> 00:19:31.040
be the price of Ethereum tomorrow, or it might be okay, what is going to be the temperature?
00:19:31.760 --> 00:19:37.440
Or it might be what is the price of this house? Right? So these things don't really fit into
00:19:37.440 --> 00:19:43.920
discrete classes. We're trying to predict a number that's as close to the true value as possible
00:19:43.920 --> 00:19:51.760
using different features of our data set. So that's exactly what our model looks like in
00:19:51.760 --> 00:19:59.280
supervised learning. Now let's talk about the model itself. How do we make this model learn?
00:19:59.920 --> 00:20:05.120
Or how can we tell whether or not it's even learning? So before we talk about the models,
00:20:05.680 --> 00:20:10.320
let's talk about how can we actually like evaluate these models? Or how can we tell
00:20:10.320 --> 00:20:19.040
whether something is a good model or bad model? So let's take a look at this data set. So this data
00:20:19.040 --> 00:20:26.640
set here is from the Pima Indians diabetes data set. And here we have the
00:20:26.640 --> 00:20:32.640
number of pregnancies, different glucose levels, blood pressure, skin thickness, insulin, BMI,
00:20:32.640 --> 00:20:37.520
age, and then the outcome: whether or not they have diabetes, one if they do, zero if they don't.
00:20:37.520 --> 00:20:46.640
So here, all of these are quantitative features, right, because they're all on some scale.
00:20:48.720 --> 00:20:56.160
So each row is a different sample in the data. So it's a different example, it's one person's data,
00:20:56.160 --> 00:21:04.240
and each row represents one person in this data set. Now this column, each column represents a
00:21:04.240 --> 00:21:11.600
different feature. So this one here is some measure of blood pressure levels. And this one
00:21:11.600 --> 00:21:17.120
over here, as we mentioned is the output label. So this one is whether or not they have diabetes.
00:21:19.040 --> 00:21:23.760
And as I mentioned, this is what we would call a feature vector, because these are all of our
00:21:23.760 --> 00:21:33.520
features in one sample. And this is what's known as the target, or the output for that feature
00:21:33.520 --> 00:21:41.280
vector. That's what we're trying to predict. And all of these together is our features matrix x.
00:21:42.640 --> 00:21:51.920
And over here, this is our labels or targets vector y. So I've condensed this to a chocolate
00:21:51.920 --> 00:21:58.000
bar to kind of talk about some of the other concepts in machine learning. So over here,
00:21:58.000 --> 00:22:08.160
we have our x, our features matrix, and over here, this is our label y. So each row of this
00:22:08.160 --> 00:22:15.200
will be fed into our model, right. And our model will make some sort of prediction. And what we do
00:22:15.200 --> 00:22:21.920
is we compare that prediction to the actual value of y that we have in our label data set, because
00:22:21.920 --> 00:22:26.960
that's the whole point of supervised learning is we can compare what our model is outputting to,
00:22:26.960 --> 00:22:31.920
oh, what is the truth, actually, and then we can go back and we can adjust some things. So the next
00:22:31.920 --> 00:22:41.040
iteration, we get closer to what the true value is. So that whole process here, the tinkering that,
00:22:41.040 --> 00:22:46.400
okay, what's the difference? Where did we go wrong? That's what's known as training the model.
00:22:47.680 --> 00:22:54.080
Alright, so take this whole, you know, chunk right here, do we want to really put our entire
00:22:54.080 --> 00:23:02.320
chocolate bar into the model to train our model? Not really, right? Because if we did that, then
00:23:02.320 --> 00:23:10.240
how do we know that our model can do well on new data that we haven't seen? Like, if I were to
00:23:10.240 --> 00:23:18.000
create a model to predict whether or not someone has diabetes, let's say that I just train all my
00:23:18.000 --> 00:23:23.120
data, and I see that all my training data does well, I go to some hospital, I'm like, here's my
00:23:23.120 --> 00:23:28.560
model. I think you can use this to predict if somebody has diabetes. Do we think that would
00:23:28.560 --> 00:23:41.040
be effective or not? Probably not, right? Because we haven't assessed how well our model can
00:23:41.040 --> 00:23:46.880
generalize. Okay, it might do well after you know, our model has seen this data over and over and
00:23:46.880 --> 00:23:54.960
over again. But what about new data? Can our model handle new data? Well, how do we get our
00:23:54.960 --> 00:24:02.320
model to assess that? So we actually break up our whole data set that we have into three different
00:24:02.320 --> 00:24:07.760
types of data sets, we call it the training data set, the validation data set and the testing data
00:24:07.760 --> 00:24:15.760
set. And you know, you might have 60% here, 20%, and 20%, or 80, 10, and 10. It really depends on how
00:24:15.760 --> 00:24:22.000
much data you have, I think either of those would be acceptable. So what we do is then we feed
00:24:22.000 --> 00:24:28.960
the training data set into our model, we come up with, you know, this might be a vector of predictions
00:24:28.960 --> 00:24:36.080
corresponding with each sample that we put into our model, we figure out, okay, what's the difference
00:24:36.080 --> 00:24:42.880
between our prediction and the true values, this is something known as loss. Loss is, you know,
00:24:42.880 --> 00:24:50.080
what's the difference here, in some numerical quantity, of course. And then we make adjustments,
00:24:50.080 --> 00:24:57.600
and that's what we call training. Okay. So then, once you know, we've made a bunch of adjustments,
00:24:58.480 --> 00:25:06.000
we can put our validation set through this model. And the validation set is kind of used as a reality
00:25:06.000 --> 00:25:14.560
check during or after training to ensure that the model can handle unseen data still. So every
00:25:14.560 --> 00:25:19.600
single time after we train one iteration, we might stick the validation set in and see, hey, what's
00:25:19.600 --> 00:25:25.680
the loss there. And then after our training is over, we can assess the validation set and ask,
00:25:25.680 --> 00:25:32.400
hey, what's the loss there. But one key difference here is that we don't have that training step,
00:25:32.400 --> 00:25:38.080
this loss never gets fed back into the model, right, that feedback loop is not closed.
00:25:38.800 --> 00:25:45.920
Alright, so let's talk about loss really quickly. So here, I have four different types of models,
00:25:45.920 --> 00:25:52.960
I have some sort of data that's being fed into the model, and then some output. Okay, so this output
00:25:52.960 --> 00:26:02.720
here is pretty far from you know, this truth that we want. And so this loss is going to be high. In
00:26:02.720 --> 00:26:07.840
model B, again, this is pretty far from what we want. So this loss is also going to be high,
00:26:07.840 --> 00:26:15.760
let's give it 1.5. Now this one here, it's pretty close, I mean, maybe not exact, but pretty close
00:26:15.760 --> 00:26:23.840
to this one. So that might have a loss of 0.5. And then this one here is maybe further than this,
00:26:23.840 --> 00:26:30.320
but still better than these two. So that loss might be 0.9. Okay, so which of these model
00:26:30.320 --> 00:26:40.080
performs the best? Well, model C has the smallest loss, so it's probably model C. Okay, now let's
00:26:40.080 --> 00:26:45.680
take model C. After you know, we've come up with these, all these models, and we've seen, okay, model
00:26:45.680 --> 00:26:52.880
C is probably the best model. We take model C, and we run our test set through this model. And this
00:26:52.880 --> 00:27:00.720
test set is used as a final check to see how generalizable that chosen model is. So if I,
00:27:00.720 --> 00:27:05.680
you know, finish training my diabetes data set, then I could run it through some chunk of the
00:27:05.680 --> 00:27:11.520
data and I can say, oh, like, this is how we perform on data that it's never seen before at
00:27:11.520 --> 00:27:19.600
any point during the training process. Okay. And that loss, that's the final reported performance
00:27:19.600 --> 00:27:27.200
of my test set, or this would be the final reported performance of my model. Okay.
00:27:29.280 --> 00:27:34.880
So let's talk about this thing called loss, because I think I kind of just glossed over it,
00:27:34.880 --> 00:27:41.600
right? So loss is the difference between your prediction and the actual, like, label.
00:27:43.200 --> 00:27:50.640
So this would give a slightly higher loss than this. And this would even give a higher loss,
00:27:50.640 --> 00:27:56.960
because it's even more off. In computer science, we like formulas, right? We like formulaic ways
00:27:57.600 --> 00:28:03.280
of describing things. So here are some examples of loss functions and how we can actually come
00:28:03.280 --> 00:28:10.160
up with numbers. This here is known as L one loss. And basically, L one loss just takes the
00:28:10.160 --> 00:28:18.080
absolute value of whatever your you know, real value is, whatever the real output label is,
00:28:18.640 --> 00:28:26.160
subtracts the predicted value, and takes the absolute value of that. Okay. So the absolute
00:28:26.160 --> 00:28:34.000
value is a function that looks something like this. So the further off you are, the greater your loss is,
00:28:35.520 --> 00:28:42.480
right in either direction. So if your real value is off from your predicted value by 10,
00:28:42.480 --> 00:28:47.520
then your loss for that point would be 10. And then this sum here just means, hey,
00:28:47.520 --> 00:28:53.040
we're taking all the points in our data set. And we're trying to figure out the sum of how far
00:28:53.040 --> 00:29:01.600
everything is. Now, we also have something called L two loss. So this loss function is quadratic,
00:29:01.600 --> 00:29:08.560
which means that if it's close, the penalty is very minimal. And if it's off by a lot,
00:29:08.560 --> 00:29:15.840
then the penalty is much, much higher. Okay. And here, instead of the absolute value, we just square
00:29:15.840 --> 00:29:26.000
the difference between the two. Now, there's also something called binary cross entropy loss.
00:29:26.960 --> 00:29:32.720
It looks something like this. And this is for binary classification, this might be the
00:29:32.720 --> 00:29:38.960
loss that we use. So this loss, you know, I'm not going to really go through it too much.
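The three loss functions mentioned here, written out; y_i is the true label, \hat{y}_i is the prediction, and the sums run over the data set:

    L1 loss:                \sum_i \lvert y_i - \hat{y}_i \rvert
    L2 loss:                \sum_i ( y_i - \hat{y}_i )^2
    Binary cross-entropy:   -\frac{1}{N} \sum_i \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
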
00:29:38.960 --> 00:29:47.840
But you just need to know that loss decreases as the performance gets better. So there are some
00:29:47.840 --> 00:29:53.680
other measures of accuracy or performance as well. So for example, accuracy, what is accuracy?
00:29:55.440 --> 00:30:02.560
So let's say that these are pictures that I'm feeding my model, okay. And these predictions
00:30:02.560 --> 00:30:11.360
might be apple, orange, orange, apple, okay, but the actual is apple, orange, apple, apple. So
00:30:12.240 --> 00:30:17.680
three of them were correct. And one of them was incorrect. So the accuracy of this model is
00:30:17.680 --> 00:30:25.600
three quarters or 75%. Alright, coming back to our colab notebook, I'm going to close this a little
00:30:25.600 --> 00:30:33.040
bit. Again, we've imported stuff up here. And we've already created our data frame right here. And
00:30:33.040 --> 00:30:39.600
this is all of our data. This is what we're going to use to train our models. So down here,
00:30:40.560 --> 00:30:49.040
again, if we now take a look at our data set, you'll see that our classes are now zeros and ones.
00:30:49.040 --> 00:30:53.120
So now this is all numerical, which is good, because our computer can now understand that.
00:30:53.120 --> 00:31:00.720
Okay. And you know, it would probably be a good idea to maybe kind of plot, hey, do these things
00:31:00.720 --> 00:31:10.240
have anything to do with the class. So here, I'm going to go through all the labels. So for label
00:31:10.240 --> 00:31:15.840
in the columns of this data frame. So this just gets me the list. Actually, we have the list,
00:31:15.840 --> 00:31:20.880
right? It's called cols, so let's just use that, it might be less confusing. We want everything up to the last
00:31:20.880 --> 00:31:26.560
thing, which is the class. So I'm going to take all these 10 different features. And I'm going
00:31:26.560 --> 00:31:37.040
to plot them as a histogram. So basically, if I
00:31:37.040 --> 00:31:45.600
take that data frame, and I say, okay, for everything where the class is equal to one, so these are all
00:31:45.600 --> 00:31:55.280
of our gammas, remember, now, for that portion of the data frame, if I look at this label, so now
00:31:55.280 --> 00:32:03.440
these, okay, what this part here is saying is, inside the data frame, get me everything where
00:32:03.440 --> 00:32:08.480
the class is equal to one. So that's all all of these would fit into that category, right?
00:32:09.120 --> 00:32:14.080
And now let's just look at the label column. So the first label would be f length, which would
00:32:14.080 --> 00:32:20.480
be this column. So this command here is getting me all the different values that belong to class one
00:32:20.480 --> 00:32:27.200
for this specific label. And that's exactly what I'm going to put into the histogram. And now I'm
00:32:27.200 --> 00:32:34.960
just going to tell, you know, matplotlib: make the color blue, label this as, you know, gamma
00:32:37.040 --> 00:32:43.280
set alpha, why do I keep doing that, alpha equal to 0.7. So that's just like the transparency.
00:32:43.280 --> 00:32:48.400
And then I'm going to set density equal to true, so that when we compare it to
00:32:50.000 --> 00:32:56.960
the hadrons here, we'll have a baseline for comparing them. Okay, so the density being true
00:32:56.960 --> 00:33:05.360
just basically normalizes these distributions. So you know, if you have 200 of one type,
00:33:05.360 --> 00:33:12.080
and then 50 of another type, well, if you drew the histograms, it would be hard to compare because
00:33:12.080 --> 00:33:17.600
one of them would be a lot bigger than the other, right. But by normalizing them, we kind of are
00:33:17.600 --> 00:33:24.240
distributing them over how many samples there are. Alright, and then I'm just going to put a title
00:33:24.240 --> 00:33:31.680
on here and make that the label, and then set the y label. Because it's density, the y label is probability.
00:33:32.800 --> 00:33:36.320
And the x label is just going to be the label.
00:33:36.320 --> 00:33:44.640
What is going on. And I'm going to include a legend and PLT dot show just means okay, display
00:33:44.640 --> 00:33:54.800
the plot. So if I run that... oh, it should just be up to the last item. So we want a list, right, not just the last
00:33:54.800 --> 00:34:02.240
item. And now we can see that we're plotting all of these. So here we have the length. Oh, and I
00:34:02.240 --> 00:34:11.200
made this gamma. So this should be hadron. Okay, so the gammas in blue, the hadrons are in red. So
00:34:11.200 --> 00:34:16.560
here we can already see that, you know, maybe if the length is smaller, it's probably more likely
00:34:16.560 --> 00:34:24.320
to be gamma, right. And we can kind of you know, these all look somewhat similar. But here, okay,
00:34:24.320 --> 00:34:34.640
clearly, if there's more asymmetry, or if you know, this asymmetry measure is larger, then it's
00:34:34.640 --> 00:34:44.480
probably hadron. Okay, oh, this one's a good one. So f alpha seems like hadrons are pretty evenly
00:34:44.480 --> 00:34:48.960
distributed. Whereas if this is smaller, it looks like there's more gammas in that area.
00:34:48.960 --> 00:34:54.480
Okay, so this is kind of the data that we're working with, we can kind of see what's going on.
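A sketch of the plotting loop just described, assuming the df and cols built earlier:

    # for each feature, overlay the normalized distributions of the two classes
    for label in cols[:-1]:  # every column except "class"
        plt.hist(df[df["class"] == 1][label], color="blue", label="gamma",
                 alpha=0.7, density=True)
        plt.hist(df[df["class"] == 0][label], color="red", label="hadron",
                 alpha=0.7, density=True)
        plt.title(label)
        plt.ylabel("Probability")
        plt.xlabel(label)
        plt.legend()
        plt.show()
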
00:34:55.920 --> 00:35:02.080
Okay, so the next thing that we're going to do here is we are going to create our train,
00:35:03.120 --> 00:35:12.880
our validation, and our test data sets. I'm going to set train valid and test to be equal to
00:35:12.880 --> 00:35:20.800
this. So NumPy dot split, I'm just splitting up the data frame. And if I do this sample,
00:35:20.800 --> 00:35:29.360
where I'm sampling everything, this will basically shuffle my data. Now, I want to pass in where
00:35:29.360 --> 00:35:38.320
exactly I'm splitting my data set, so the first split is going to be maybe at 60%. So I'm going
00:35:38.320 --> 00:35:44.720
to say 0.6 times the length of this data frame. And then cast that to an integer, that's going
00:35:44.720 --> 00:35:50.560
to be the first place where you know, I cut it off, and that'll be my training data. Now, if I
00:35:50.560 --> 00:35:57.360
then go to 0.8, this basically means everything between 60% and 80% of the length of the data
00:35:57.360 --> 00:36:03.760
set will go towards validation. And then, like, everything from 80 to 100 is going to be
00:36:03.760 --> 00:36:12.080
my test data. So I can run that. And now, if we go up here, and we inspect this data, we'll see that
00:36:12.080 --> 00:36:20.480
these columns seem to have values in like the 100s, whereas this one is 0.03. Right? So the scale of
00:36:20.480 --> 00:36:28.240
all these numbers is way off. And sometimes that will affect our results.
00:36:28.240 --> 00:36:35.920
So one thing that we would want to do
00:36:35.920 --> 00:36:46.240
is scale these so that they are, you know, so that it's now relative to maybe the mean and the
00:36:46.240 --> 00:36:54.400
standard deviation of that specific column. I'm going to create a function called scale data set.
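Before getting to that helper, here is a minimal sketch of the shuffled 60/20/20 split described a moment ago:

    # shuffle all the rows, then cut the frame at 60% and 80% of its length;
    # the three pieces become the train, validation, and test sets
    train, valid, test = np.split(df.sample(frac=1),
                                  [int(0.6 * len(df)), int(0.8 * len(df))])
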
00:36:54.400 --> 00:37:04.880
And I'm going to pass in the data frame. And that's what I'll do for now. Okay, so the x values are
00:37:04.880 --> 00:37:14.320
going to be, you know, I take the data frame. And let's assume that the columns are going to be,
00:37:14.320 --> 00:37:20.000
you know, that the label will always be the last thing in the data frame. So what I can do is say
00:37:20.000 --> 00:37:28.560
data frame, dot columns all the way up to the last item, and get those values. Now for my y,
00:37:30.000 --> 00:37:34.240
well, it's the last column. So I can just do this, I can just index into that last column,
00:37:34.800 --> 00:37:46.640
and then get those values. Now, in, so I'm actually going to import something known as
00:37:46.640 --> 00:37:55.200
the standard scaler from sklearn. So if I come up here, I can go to sklearn dot preprocessing.
00:37:56.080 --> 00:38:04.880
And I'm going to import StandardScaler, I have to run that cell, I'm going to come back down here.
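For reference, a sketch of where the scale_dataset helper is headed over the next few cells (the three-value return at the end is an assumption; this part of the transcript cuts off before the function is finished):

    from sklearn.preprocessing import StandardScaler

    def scale_dataset(dataframe):
        # features are every column except the last; the label is the last column
        x = dataframe[dataframe.columns[:-1]].values
        y = dataframe[dataframe.columns[-1]].values

        # rescale each feature column to zero mean and unit standard deviation
        scaler = StandardScaler()
        x = scaler.fit_transform(x)

        # stack features and label side by side; y must be reshaped to 2D first
        data = np.hstack((x, np.reshape(y, (-1, 1))))

        return data, x, y  # assumed return values, for illustration
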
00:38:04.880 --> 00:38:10.880
And now I'm going to create a scaler using that StandardScaler.
00:38:10.880 --> 00:38:21.120
And with the scaler, what I can do is actually just fit and transform x. So here, I can say x
00:38:21.120 --> 00:38:31.600
is equal to scaler dot fit transform of x. So what that's doing is saying, okay, take x and
00:38:31.600 --> 00:38:36.800
fit the standard scaler to x, and then transform all those values. And that's
00:38:36.800 --> 00:38:45.040
going to be our new x. Alright. And then I'm also going to just create, you know, the whole data as
00:38:45.040 --> 00:38:53.920
one huge 2d NumPy array. And in order to do that, I'm going to call H stack. So H stack is saying,
00:38:53.920 --> 00:38:58.400
okay, take an array, and another array and horizontally stack them together. That's what
00:38:58.400 --> 00:39:03.440
the H stands for. So by horizontally stacking them together, we just, like, put them side by side,
00:39:03.440 --> 00:39:09.200
okay, not on top of each other. So what am I stacking? Well, I have to pass in something
00:39:10.000 --> 00:39:20.400
so that it can stack x and y. And now, okay, so NumPy is very particular about dimensions,
00:39:20.400 --> 00:39:27.120
right? So in this specific case, our x is a two dimensional object, but y is only a one dimensional
00:39:27.120 --> 00:39:35.440
thing, it's only a vector of values. So in order to now reshape it into a 2d item, we have to call
00:39:35.440 --> 00:39:45.200
NumPy dot reshape. And we can pass in the dimensions of its reshape. So if I pass in negative
00:39:45.200 --> 00:39:51.040
one comma one, that just means okay, make this a 2d array, where the negative one just means infer