Data Science in Practice - IISc Computational Data Science
Hello, everyone, thank you for joining. We'll be starting the session in the next five minutes. We are waiting for a few more people to join. Hey, can you hear me? Yes, Professor Deepak, you are audible.
Good evening, everybody. You can type your questions in the chat. We will take questions from the chat if you have any.
Good evening, everyone. A very warm welcome to all of you, especially for taking out time on a weekday and joining us for the session today. I can see Professor Deepak and Professor Shashi have joined us for the session. Welcome, professors. Before starting, I would like to introduce myself. My name is Arnaud Kriti, and I work as the program manager at TalentSprint. I take care of the admissions of the IISc TalentSprint advanced program in computational data science. Professor Shashi and Professor Deepak, would you like to give a short intro about yourselves before we start off with the session?
Yeah, sure. I am Professor Deepak Subramani. I'm an Assistant Professor in the Department of Computational and Data Sciences at the Indian Institute of Science. My research focuses on using machine learning and artificial intelligence for application problems in the environment, earth sciences, and autonomous driving and autonomous underwater vehicles. I obtained my PhD from the Massachusetts Institute of Technology (MIT) in the US about three and a half years back. Since then, I've been working at IISc in the CDS department. So, Shashi, you can go ahead.
Good evening to all. I'm Shashi. Currently I'm working as chair and associate professor at the Department of Computational and Data Sciences. I trained as an applied mathematician, so I work more on computational math, finite elements and CFD. In recent times I have mainly been concentrating on data-driven applications to CFD and practical applications of scientific computing. I also have an interest in parallel computation, computation at scale, and so on. I'm the program coordinator for this data science program, and I'm looking forward to seeing you all in the future. I think we can continue with the program. Do you have anything more to say for the intro?
Professor Deepak, we can probably start off with the session right now, and then we can take the questions and answers at the end of the session.
Wonderful. So let's get started. Welcome to the master class on data science in practice, part of the advanced program in computational data science, which IISc is jointly offering with our partner TalentSprint. The goal for today is to give you an idea of the kind of classes you will attend as part of this program, and to motivate you to study the entire subject of data science. The plan is that we'll have a class for about 40 to 45 minutes, and then you can ask questions towards the end and we can have a discussion about joining the program. Along the way, I will launch a few poll questions that test your understanding of the subject, so that there is some engagement with you too. If you have any questions along the way, feel free to post them on the chat. I will take them as the lecture proceeds and explain if it fits into the picture.

So let's get started. We have all seen this gun, right? Wherever we go, there is somebody standing with this gun and making a measurement. The particular gun that you see on your screen measures the temperature in Celsius. But typically, if you open Swiggy or Zomato, it tells you that the rider's temperature is 98.2 Fahrenheit. So what is the Celsius value in Fahrenheit? To answer that question, let us write a program in which we take the temperature in Celsius as input and apply a logic we know from primary school or high school, which relates the two temperature scales, Fahrenheit and Celsius. The relationship is pretty simple and straightforward.
It is simply nine by five times Celsius plus 32, that is, F = (9/5) C + 32. That's the relationship between Fahrenheit and Celsius. So we just write this program and say that, given a Celsius temperature, I know how to calculate the Fahrenheit value using this formula. What have we done here? We have provided rules to the computer for determining what the Fahrenheit value should be given an input in Celsius. This is one way of solving the problem: to get Fahrenheit from Celsius, we know the formula, we provide the rules, and the computer gives us the outcome.

Now let's consider another scenario. We don't know that F equals nine by five C plus 32. Instead, what we have is a table. This table provides values in Celsius and Fahrenheit which somebody has calculated by some magic; some genie, some robot, some oracle has calculated this and given it to you in the form of a table. Celsius is 34, the corresponding Fahrenheit is 93.2, and so on. The question we ask is: can we build a machine that accepts Celsius as input and provides Fahrenheit as output? Earlier, when the Celsius was given, we had a formula, nine by five C plus 32; we applied it and got Fahrenheit. Now, let's say we don't know what that formula is. Given this table, our task is to build a machine that accepts this input and gives Fahrenheit as output, but we are not going to provide the rules that relate Celsius and Fahrenheit. So how do we do this? That's the question which we ask in data science.
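The first scenario, where the rule is known and handed to the computer, can be sketched in a few lines of Python. This is only an illustration; the function name `celsius_to_fahrenheit` is chosen here and does not come from the lecture.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Classical programming: we supply the rule F = (9/5) * C + 32."""
    return celsius * 9 / 5 + 32


if __name__ == "__main__":
    # The input values below are arbitrary examples.
    for c in (34.0, 37.0, 40.0):
        print(f"{c:.1f} C -> {celsius_to_fahrenheit(c):.1f} F")
```

Here the data (Celsius) and the rule are both given, and the program produces the answer (Fahrenheit).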
The whole field of data science is about such questions: what is the relationship between two columns of your data frame or of the Excel sheet that you get? And the answer is machine learning: how to build a machine that learns from a set of data that is given to you and discovers the rules that exist between the two columns, or two different variables, in the system. That's the goal of doing machine learning.
So let's look at the two concepts we just saw. In classical programming, which we did before getting the table, the data (the Celsius values) is given and the rules are given. The answer, which is the value in Fahrenheit, is produced according to the rules that we pre-programmed; we write a program to do that. In machine learning, on the other hand, we are given both Celsius and Fahrenheit, that is, some input and the corresponding output. The machine learning algorithm must now learn what the relationship between the input and the output is. That's the goal of machine learning, and this is the difference between the classical programming approach and the machine learning approach.

This is pretty much what data science or machine learning is all about: given data, make sense of the data and find some relationship between some input and output. At the very core, that's the idea. But I know many of you have a question: is what I just described machine learning? Is it data science? Is it deep learning? What is AI in all of this context? To clear that up, let's look at where exactly the concept of machine learning that I just described fits into data science in practice.
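As a rough sketch of the machine learning side of this contrast, the snippet below recovers the Celsius to Fahrenheit rule from a small table alone. It uses an ordinary least-squares line fit, which is one simple choice made for this illustration and not the method the lecture goes on to teach. The table values are assumed to come from the "oracle" and the helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Data and answers are given; the rule is not.
celsius = [30.0, 34.0, 36.0, 38.0, 40.0]
fahrenheit = [86.0, 93.2, 96.8, 100.4, 104.0]

slope, intercept = fit_line(celsius, fahrenheit)
print(f"learned rule: F = {slope:.3f} * C + {intercept:.3f}")
```

The learned slope and intercept play the role of the rule that classical programming would have been given up front.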
So let's consider the whole realm of data science. It is like an umbrella term which involves many things. Within data science, one of the important things we do is problem formulation. What did we do just now? The problem we had was to find the Fahrenheit value from Celsius, so we formulate the problem: we want to find the Fahrenheit value from the Celsius value. Then we need to get a lot of data that helps us answer this question. In this case it's very easy; you just need an Excel sheet and you can get the data. But let's say it's a more complicated scenario, which many of you are dealing with in your businesses, in your careers, in your line of work. Then we need to perform what is called data engineering. Devices like the Internet of Things, all the connected devices, smartphones, smart TVs, smart fridges, smart watches, collect a lot of data. All of that has to be processed and brought into one picture like the one I showed you previously, with the input and the output. That process of getting data into a form that can be analyzed is what data engineering is all about. Once we do the data engineering and collect the data, the question comes: what have we collected? How do I visualize it? I need to look at what relationships are forming, what the clusters are, what patterns are forming.
So data visualization is another important aspect of the whole realm of data science. Then, let's say you build a model and make a big deal out of it: you get a very high accuracy, your root mean squared error is low on the test data, and you build a very good model such that when somebody gives you a new Celsius value, you can tell very accurately what the Fahrenheit should be. That's not enough. You must convince your audience, convince the security guard at the front of your apartment complex or your workplace, that the model you are giving is actually correct. You must tell a story with that data to your consumer or end user. That whole realm is part of this data science field.
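The root mean squared error mentioned above can be computed as in the following minimal sketch. The function name `rmse` and the sample numbers are assumptions made for illustration; the "predictions" are invented values standing in for some model's output on held-out test data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true values and predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical test data: true Fahrenheit values for 35, 37 and 39 Celsius,
# and made-up model predictions for the same inputs.
true_fahrenheit = [95.0, 98.6, 102.2]
predicted_fahrenheit = [95.1, 98.5, 102.4]
print(f"test RMSE: {rmse(true_fahrenheit, predicted_fahrenheit):.3f} F")
```

A low value on data the model has not seen is one of the signals behind the "very good model" described above, although it does not replace the storytelling step.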
In all of this, I skipped one portion, which is actually building the predictive model: the model that takes as input whatever data can be collected and gives as output a quantity of interest. That is done using machine learning. So machine learning is a subset of data science that deals with using data to build a model that predicts the relationship between the input and the output that we are interested in. In this whole field, where is AI coming in? What is artificial intelligence? Where is deep learning coming in? So far this is just machine learning, which is data-driven modeling. Earlier it was simply modeling; without data, we used to do the same. That's how we got that nine by five C plus 32 formula. How did we get it? Somebody studied how Celsius and Fahrenheit are related, and then they gave you that formula. That's procedural programming. We have been solving problems even without data science and machine learning. What has really changed now is the advent of big data. When big data comes into the picture, there is lots and lots of data, and that allows you to discover relationships between data points that were hitherto impossible to find. Previously we could not do that because we did not have sufficient data; now we do. I said that in machine learning the data and the answers, the input and output, are given and the rules are learned. One such way of learning the rules is what is called a neural network. Today we will see what is called a decision tree.
Another way of learning those rules is the so-called neural network. When the neural network intersects with lots and lots of data, that small area of intersection of neural networks with big data is the field of deep learning. So deep learning is nothing but using lots and lots of data and a specific machine learning technique called neural networks to do some tasks which were not possible before, or to increase the accuracy of such tasks. Deep learning is that subset where machine learning intersects with big data, using neural networks as the particular tool.

In all this, one thing was missing, which is AI. Where does artificial intelligence come into the picture? If I write a procedural program, say Fahrenheit is equal to Celsius times nine by five plus 32, that is artificial intelligence. Your intelligence has been transferred to the computer in the form of a procedure, an if-else condition or a for loop. All of that is really artificial intelligence, and there is nothing wrong in calling such a program an artificial intelligence program. But what we really mean when we talk about artificial intelligence in today's context, the big data context, is that the rules we specified are learned from the data itself, without us telling the computer explicitly what those rules should be. Specifically, it is about the tasks which humans are capable of doing: speaking, understanding and identifying. Identifying pictures, which is vision, and speech are two of the main tasks which humans can do. From a picture, a human can identify whether the animal is a cat or a dog. When a computer can do that particular task without you telling it explicitly what makes a cat or a dog, that is where artificial intelligence comes into the picture.

Remember that artificial intelligence exists even outside the realm of data science; it has its own existence. One or two areas within artificial intelligence, such as speech and vision, use a lot of data, and that is where this whole idea of big data and neural networks comes into the picture. But in 90 to 95% of the business use cases that we deal with, we deal with tabular data, for which machine learning, together with the existing data visualization techniques and data engineering, gives you very powerful results. We will see one such technique today, which is called decision trees.
So, to summarize, data science is nothing but an umbrella term. Traditional data analytics is related to this modern data science and is almost equal in meaning, but there are some key differences. The main difference is that whatever we do here, in this column, is done with the lots and lots of data available today, thanks to the data collection infrastructure that many businesses have put in place, and it is done using modern high-end computational infrastructure, not with pen and paper or in Excel sheets. That is where the whole field of computational data science comes into practice. Computational data science in practice involves solving problems such as the very trivial example I gave, Celsius to Fahrenheit conversion; you can solve many such problems, and you will see very briefly today what kinds of problems those are.

Again, to summarize: a machine learning model exists; you give it some input and you get an output. There are in fact two general abstract settings in which machine learning models work. One setting is what I showed you so far, which is called supervised learning. We are given input and output pairs, x and y, like the Celsius and Fahrenheit table, and we supervise the machine learning model to learn the relationship between the input and the output. The inputs are called features and the outputs are called labels. This x and y data is available to us, and we need a machine learning model for predicting the output for existing data or when a new Celsius value comes into the picture.
That is, we want to make a prediction. The other kind of approach is when you don't really know what the output you're trying to predict is; you just have a lot of unlabeled data, sometimes also called raw data. With this raw data, we want to generate some insight: how close the data points are, what clusters are involved, whether they are connected together, whether one data point looks very close to the other data points. Problems where the output is not known are called unsupervised learning problems. In common unsupervised learning problems, y can be some clusters in x. Or we ask what the density of x is: is it a standard density function? Is one particular data point an anomaly or not? If it's an anomaly, detect it and throw it away. Anomaly detection and similar tasks are called unsupervised learning tasks.

So the whole realm of machine learning is divided into supervised learning and unsupervised learning. In supervised learning, examples of both input and output are given to us, and we use them to supervise the model to learn the relationship between the input and the output. In the unsupervised learning setting, you are only given the left part; only the x is given to you. The y is all about questions like: are some points in this x together? Is some point in x far away? Is there a density distribution over them? There is no supervision for the model to learn what it should be. That is the kind of unsupervised learning task that we do. Now consider supervised learning, where both input and output are given.
Machines can in fact do two such supervised learning tasks very well, and all the problems that we solve will be broken down into these two tasks. One is called regression. Regression is about predicting a continuous outcome variable based on the values of some predictors that you give: y, the Fahrenheit, is equal to some function of the Celsius. That's what we wanted to do. Some examples are market forecasting, like predicting what the price of a stock will be, what the price of rice will be, what the price of gold or of a metal will be; and population growth prediction, like how the population will grow, or whether COVID will grow and what wave three is going to look like. These are all examples of regression tasks, where we are trying to predict a continuous variable from multiple regressors, that is, multiple inputs. Advertising popularity and quantity of sales are also continuous variables that must be predicted. When your outcome variable is continuous, that is what is called a regression problem.

Then there is the classification task, which is about identifying whether your output belongs to a class or not: whether this image is a cat or not, whether this image is a dog or not. In that case the outcome is binary. Or it can be multi-class: whether it's a cat, a dog, or neither. What is important there is that it's not a continuous variable that you are trying to predict; it is a class, whether it is 0, 1 or 2, whether it is cat, dog or horse.
So what you're trying to predict is some class. For example, whether I have COVID or not, or whether this customer will churn, that is, go away, or not. Or take the participants here: our program manager would like to know how many of the 79 of you who are attending will join our course. If I can develop a classification model and give it to her, saying that given all these inputs, all the features of the 80 participants here, this is who will actually convert, she would pay anything to get that machine learning model. Other examples are whether the market will go up or down, or credit cards: what would SBI like to know? If somebody swipes a card at the POS, they want to know immediately whether it's a fraudulent transaction and whether to block it. So the question of whether a particular transaction is fraudulent or not is a classification problem that must be solved.
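To make the idea of learning a class label concrete, here is a minimal sketch of a one-split classifier (a "decision stump", which is a decision tree of depth one) on a single feature. Everything in it is hypothetical: the transaction amounts, the fraud labels and the helper names `fit_stump` and `predict` are invented for illustration, and real fraud detection uses many more features than the amount.

```python
def fit_stump(values, labels):
    """Find the threshold on one feature that misclassifies the fewest samples.

    The learned rule is: predict class 1 if value > threshold, else class 0.
    """
    points = sorted(set(values))
    # Candidate thresholds are midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    best_threshold, best_errors = None, len(values) + 1
    for threshold in candidates:
        errors = sum((v > threshold) != bool(y) for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def predict(value, threshold):
    """Apply the learned rule to a new sample."""
    return 1 if value > threshold else 0


# Hypothetical labelled data: transaction amount and whether it was fraud (1) or not (0).
amounts = [200, 450, 300, 9000, 12000, 150, 8000, 500]
is_fraud = [0, 0, 0, 1, 1, 0, 1, 0]

threshold = fit_stump(amounts, is_fraud)
print("learned threshold:", threshold)
print("new transaction of 10000 flagged:", predict(10000, threshold))
```

The output here is a class (0 or 1) rather than a continuous number, which is what separates classification from regression.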
Okay, so that was supervised learning. In the unsupervised learning setting there are mainly three tasks that we do. One, which I already explained, is clustering: whether two data points are together. Take the 85 of you: how many of you are in one kind of segment? Whether you are freshers, or have five to 10 years of experience, or 10 to 15 years of experience; I can make a segmentation like that. To provide targeted marketing to all those customer segments, I must perform a clustering and find out, among all of you, what the number of clusters is and how many targeted marketing campaigns I must run. And I must recommend: with the 84 of you, we must be able to recommend which program is suitable for many of you. That recommender system is also based on clustering the crowd that is here.

Then there is the question of finding the density of the distribution that I have, and whether a new data point belongs to the existing density or not. For example, fraud detection in some cases is like anomaly detection: we want to predict whether a data point is anomalous. How might we do that? I must know what a normal point looks like, which means I must know the distribution of the normal points. For example, take Virat Kohli. We know that Virat Kohli makes a mean score of about 50-odd, that is his average, and the standard deviation of his scoring is maybe about 30 runs or something like that. We know the distribution of the scores that he usually makes.
So if he gets out on the first or second ball, that might be an anomalous data point, because in the distribution that we know of Kohli's batting, he does not usually get out on the first or second ball. But how do we know that? We look at the previous history of the ball on which he gets out, we estimate the density of that distribution, and based on that we can say whether this particular event was anomalous or not, whether he is out of form or not. Those kinds of questions can be answered through anomaly detection. And finally, there is dimensionality reduction. When we are dealing with a large number of features, especially in an image, we don't want to look at all the parts of the image; we want to look at a small portion of the domain. Dimensionality reduction helps us do that.

So far, we have discussed the general setting of what machine learning does. We looked at how machine learning fits into the whole of data science. We saw one trivial example of how machine learning can be applied to Celsius to Fahrenheit conversion. Then we showed what machine learning looks like: there is an x, there is an ML model, there is a y. If x and y are both given, that is called supervised learning; when only x is given and we are trying to estimate what y is, that is unsupervised learning. We saw examples of supervised learning tasks, which are classification and regression, and examples of unsupervised learning tasks, which are clustering, anomaly detection and dimensionality reduction. This is what we have done so far.
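The cricket example of anomaly detection can be sketched as follows. This is a simplified stand-in for density estimation: it summarizes the history with a mean and standard deviation and flags a new observation that lies many standard deviations away. The history values are invented for illustration (they are not real match data), and the function name `is_anomalous` and the limit of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(value, history, z_limit=3.0):
    """Flag a value that is more than z_limit standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past observations are identical; anything different is unusual.
        return value != mu
    return abs(value - mu) / sigma > z_limit


# Hypothetical history: the ball number on which a batsman was dismissed
# in ten past innings.
dismissal_ball = [35, 48, 52, 41, 60, 38, 45, 55, 47, 50]

print("out on ball 1 is anomalous:", is_anomalous(1, dismissal_ball))
print("out on ball 44 is anomalous:", is_anomalous(44, dismissal_ball))
```

No labels are used here; the notion of "normal" comes entirely from the distribution of past observations, which is why this counts as an unsupervised task.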
So, to get an estimate of how you are all doing, let me launch a poll. Where is the option? How can I launch it, from the Q&A? Okay, launching the audience poll. I want all participants to look through the questions and answer them from what you have understood so far. Please do that. Yes, 60% of the audience has voted right now. Okay, you can close it. Okay, it's at 70.
Right. So, what did we say so far? In machine learning, data and rules are provided to obtain answers: no. Data and rules are provided to obtain answers in classical programming, where we know what the relationship between the input and output is, and we provide that rule together with the data to get answers. In machine learning, data and answers are given and we want to discover the rules; we want to fit a machine learning model that finds the rules between them. So for the first question, the answer is false.

Then, an example of supervised machine learning. The two tasks of supervised machine learning are regression and classification. The other three options given there, clustering, density estimation or normalization, are examples of unsupervised learning. In fact, a computer can do five different tasks very well: two are supervised learning tasks, regression and classification, and three are unsupervised learning tasks. Keep that in mind.

Credit card fraud detection is an example of what? That's a trick question. It can be done as a supervised learning problem as well as an unsupervised learning problem. I gave both as examples during the lecture, and both are correct: you can do credit card fraud detection as a classification problem as well as an unsupervised anomaly detection problem. The purpose of this question is to drive home the point that any business case you deal with may not have a unique way of solving it.
Within the context of machine learning itself, there might be two different approaches you can take to answer one particular question, and that has been beautifully reflected in the audience poll as well: 50% of you believe it is supervised learning and 50% unsupervised learning, and both are correct. Both groups can in fact develop a machine learning model that will do credit card fraud detection correctly.

AI is a subset of machine learning: no. Machine learning is a subset of AI. It's how we described the entire context: data science is there, machine learning is within it, and deep learning is within machine learning, specifically where neural networks meet that big data circle. Many other parts also belong to the whole context of data science: how to do all of this on a high performance computing machine, how to do data engineering, how to do data visualization, how to tell stories with your data, how to formulate the problem. All of these are part of data science.

This also answers a question that has been going around: what is the difference between the data science course and an AI or deep learning course? An AI and deep learning course focuses on just that one tiny bit, which is using neural networks to do the two tasks, classification and regression. Those are the two tasks which the computer can do, and using neural networks with different architectures and different losses to do them is what AI and deep learning is all about.
But solving a business case, which is what you do about 90 to 95% of the time in industry, is all about knowing what data to use, whether data science is the right approach, and what story the data tells you. All of that is data science, which is the broader umbrella term. Okay, so with that, we'll move to the second part of our program today.

So far, I have discussed what machine learning is about and hopefully convinced you why you should know machine learning within the context of data science, and about all the other parts of data science. We have other webinars, which we gave before, that go into more detail with specific examples. Now I'm going to discuss one particular classification task and how to do this classification using one particular machine learning model. Remember what we have: we have the machine learning model, we have x, we have y. You are given x and y, and we want to learn the machine learning model that relates x to y. One particular machine learning model that learns this relationship between x and y to perform supervised learning is called a decision tree. We will learn about decision tree models today.

Let me put this in context. What is the problem that we will be dealing with over the next 15 to 20 minutes? It is the classification problem in the Iris data set. The picture you see here shows three different species of the iris flower: setosa, versicolor and virginica. Each is characterized by its sepal length, sepal width, petal length and petal width. This is the sepal and this is the petal, and each has a length and a width associated with it. The Iris data set is a data set somebody has created by taking pictures of these iris flowers and measuring the sepal and petal. About 150 samples are there with these four attributes.
So these four attributes are the X that I have been talking about: the sepal length, petal length, sepal width, and petal width.
The y that I've been talking about is the species, iris setosa, versicolor, or virginica. Those are the three classes; that is the y on which I'm going to perform the classification. Now I need to develop a machine learning model that takes these as inputs and gives an output. One simple way of doing that is to by-heart. This is what we all do very well, right? In 10th class CBSE social studies, we by-heart: on page number three there is an answer about the Battle of Plassey, so we by-heart it. We go to the CBSE exam, they ask us to write about the Battle of Plassey, and we just put down what we read. That's by-hearting. One way in which machine learning models do the same is what is called instance-based learning, which is shown on the left side. What I have plotted on the x axis is the petal length, and on the y axis the petal width. The reds and the greens are the two different classes that I'm trying to distinguish between, whether it is an iris or not. The blue square that you see over there is a new data point, and I want to make a prediction of whether it belongs to iris or not. So what do I do? I know how the green triangles look and how the red circles look. I just see which of them this blue square is closest to: I look at four of its closest neighbors, and then just take a vote among them. Right?
Let's say I learned about four different battles. A question came about something, and I thought, okay, this looks very close to the Battle of Plassey, so let me answer it as the Battle of Plassey. That is also an example of by-hearting and answering. That is what is called instance-based classification. In fact, there is a machine learning model which does exactly that, called the k-nearest neighbor classification algorithm. It is what I just described: for this blue square, it finds the nearest neighbors in the feature space and takes a vote among those nearest neighbors. Finding the four nearest neighbors and taking a vote among them is what is called a k-nearest neighbor classifier, here with k equal to four. That is one way of doing it. The other way of doing machine learning is to learn the rules. We said that a machine learning model learns the rules between the data and the answers. The rule that we are trying to learn over here is the decision boundary. What is the decision boundary that I must put there, so that, based on that decision boundary, I can make a prediction of whether a point is a red circle or a green triangle? That is the classification problem, and how to draw that decision boundary is the question. So we want to learn the relationship; this is the student who understands the concept. And yes, the blue square is in fact a new flower for which classification needs to be done; that is exactly correct. Blue is the new data point that has come in, and that answers the question from the chat.
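The nearest-neighbor vote described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the session: the function name knn_predict and the six measurements are made up for the example, and k is set to four to match the "four closest neighbors" in the talk.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=4):
    """Classify `query` by a majority vote among its k nearest training points."""
    # Rank training points by Euclidean distance to the query in feature space.
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    # The label with the most votes among the k neighbors wins.
    return votes.most_common(1)[0][0]

# Hypothetical (petal length, petal width) values in cm, invented for illustration.
points = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (5.1, 1.8)]
labels = ["red", "red", "red", "green", "green", "green"]

blue_square = (4.6, 1.3)  # the new data point to classify
prediction = knn_predict(points, labels, blue_square, k=4)
print(prediction)
```

Nothing is learned ahead of time here: the training points are simply stored, and all the work happens when a new point arrives. With an even k such as four, a two-against-two tie is possible; this sketch then falls back to the label of the nearer neighbor.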
So this is instance-based versus model-based learning. Now we are going to look at the model-based approach, where we try to learn that relationship using one particular model called a decision tree. Decision trees are data-driven models that can be used for both classification and regression. Today we will focus only on classification, but they can be used for regression as well.
The decision tree is a very versatile machine learning algorithm that is capable of fitting complex data. You can work with a lot of features and a lot of nonlinear data; it can do all that. Decision trees are trained by a greedy optimization algorithm called CART, the classification and regression tree algorithm. This is the algorithm that trains a decision tree. The plan of action now is to see the commands in scikit-learn that we need to train a decision tree, look at what a decision tree looks like, and then I will explain how a decision tree makes its decision. Typically, in a full-length class, you will understand the math behind what goes on here. Because we don't have a lot of time, we'll just see the algorithm, what the decision tree looks like, and how you can train it in scikit-learn. Scikit-learn is the Python package that we will use for machine learning in our course. It is a very powerful Python package, and you will become experts in using scikit-learn by the end of the course. In scikit-learn it is actually very easy to train machine learning models. First, it has lots of datasets which you can just load and use. So what you see here is, first, the data set loader, then the decision tree classifier, and then I simply load the iris dataset. This is actually an object-oriented programming concept. We have one full module in the course on how to do object-oriented programming. We have, in fact, just created an instance of a class, and then I'm just accessing its data with a dot, right?
So that is an attribute accessed with a dot, and with that I am defining the x and the y. Then it is very easy to train the machine learning model. That line, tree_clf = DecisionTreeClassifier(max_depth=2), instantiates a class, and then .fit trains the machine learning model between x and y, and you get a trained ML model. Our task now is to understand what happens behind .fit. That's what the course is all about, and that's what we will try to do very briefly here. What you see on the right-hand side is what our decision tree looks like. At each node of this decision tree, based on the features, which are petal length, petal width, sepal length, and sepal width, we ask a question based on one feature and one threshold. What you see over here is a trained decision tree model; after doing .fit, this is what the decision tree model has been trained to. In this decision tree model we start from the top and move down. At the top, the question asked is whether the petal length is less than or equal to 2.45 or not.
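The slide with the training commands is not part of the transcript, so the following is only a sketch of what such code typically looks like in scikit-learn. It assumes that only the two petal features are used as x (the talk's plots use petal length and petal width); the variable names and the random_state value are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the built-in iris data set (150 samples, 4 features, 3 classes).
iris = load_iris()
X = iris.data[:, 2:]  # assumption: keep only petal length and petal width (cm)
y = iris.target       # class index: 0 = setosa, 1 = versicolor, 2 = virginica

# Instantiate the classifier, limiting the tree to two levels of questions.
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)

# .fit runs the CART training algorithm on the data and the answers.
tree_clf.fit(X, y)

# Ask the trained tree about a new flower: petal length 5.0 cm, petal width 1.5 cm.
new_flower = [[5.0, 1.5]]
predicted_class = iris.target_names[tree_clf.predict(new_flower)[0]]
print(predicted_class)
```

The three steps (load the data, instantiate the class, call .fit) are the whole training workflow; everything discussed next is about what .fit does internally.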
Once again, okay. So at the top, the question being asked is whether the petal length is less than or equal to 2.45 or not. If it is true, we move along the true branch; if it is false, we move along the false branch. If it is true, it makes a prediction that the class is setosa. If it is false, it goes to the right branch, and there it asks another question: is the petal width less than or equal to 1.75 or not? If that is true, it goes to the left branch from there and makes a prediction that the class is versicolor. Otherwise it makes a prediction that the class is virginica. So when a new data point comes, the way the tree makes its decision is to ask all these questions at the different nodes until it reaches a leaf node. What you see over here are the leaf nodes, and at a leaf node it makes a prediction of what class this particular data point belongs to. What the CART algorithm, the greedy algorithm, does is decide which feature to ask about and what threshold to use at each of those nodes. That is what DecisionTreeClassifier's .fit does. Unfortunately, we don't have time today to go into detail about exactly how it does that, but if you decide to join our class, you will learn all about it. So this is what happens in the decision tree classifier. It's very simple; that's pretty much how we all make our decisions. We see that decision trees are a very intuitive machine learning model.
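Once trained, the two-level tree described above is nothing more than a pair of nested questions. As a hand-written sketch (the function name is made up, and the root threshold is assumed to be 2.45 cm, the value this well-known iris tree uses), it can be written out like this:

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Walk the two-level tree from the root down to a leaf."""
    # Root node: one feature (petal length) and one threshold.
    if petal_length_cm <= 2.45:
        return "setosa"        # leaf on the true branch
    # Second node, reached on the false branch: petal width and its threshold.
    if petal_width_cm <= 1.75:
        return "versicolor"    # leaf on the true branch
    return "virginica"         # leaf on the false branch

print(classify_iris(1.4, 0.2))
print(classify_iris(5.0, 1.5))
print(classify_iris(5.8, 2.2))
```

The difference from ordinary programming is that nobody typed in the features or the thresholds: the training algorithm chose them from the data.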
Whenever we are faced with such questions, we also ask such questions. How do we decide to buy a house? We ask, okay, what is the distance of this house from my workplace? Is it less than or equal to some value? Yes means that house is under consideration; no means it is not. That is how we also make decisions, so a decision tree is a very intuitive way of doing things. When a COVID-positive patient comes to the emergency room, the same decision tree runs in the doctor's mind: is the temperature less than or equal to this value, or greater than this value? If it is less, then you ask the question, is the saturation low or not? If saturation is low, then you ask what kind of assistance might be needed. All of this is a very intuitive thing to do, and that's what the decision tree model does. So when I call the .fit command in scikit-learn, this is the tree that it produces using the CART algorithm behind the scenes. I told you that it's actually a model-based classification where we are trying to fit a decision boundary. That's exactly what I've shown here. There are three classes, iris setosa, versicolor, and virginica, and a decision tree makes these kinds of orthogonal decision boundaries. From this figure it is very clear that if the combination of petal length and width is somewhere here, it is going to be classified as the yellow class; if it is somewhere here, as the blue class; and if it is somewhere on the left side, as the red class.
So the decision tree classifier makes these kinds of decision boundaries. It learns the relationship between the input data and the output; that's what the .fit command does.
So why do we like decision trees? The reason is that very good model interpretation is possible for decision trees. What I mean by model interpretation is that whatever rules the decision tree provides are easy to interpret. We just saw the kind of questions it asks, less than or equal to, or greater than; these can be understood manually, and you can even apply them by yourself. Such models are called white box models or transparent models, and they give you an explainable AI system. It is actually an artificial intelligence system: once you train a decision tree to classify whether a patient needs a doctor's attention or not, it is replacing a human and performing artificial intelligence. And we did all of that using data. We trained the decision tree model using data: we just gave the data and the answers, and the rules were learned. The rules that were learned, in this particular case, are the decision tree model. Explainable AI models are pretty good in the business world, where we want to explain why something happened, and decision trees are excellent for that. In addition, there are four features, but many of you may have noticed that I showed only petal length and petal width as the two axes whenever I was visualizing. Why did I do that? Because once a decision tree is trained, it can tell you the feature importance, that is, which features are most important for making this classification. And it so happens that in this particular case the petal length and petal width combined have about 80 to 85% of the explanatory power. So you just need petal length and petal width to make a classification to a high degree of accuracy.
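In scikit-learn, a fitted tree exposes these scores through its feature_importances_ attribute. The sketch below is an illustration under assumptions, not the session's code: it trains a small tree on all four features and prints one score per feature. The exact numbers depend on the model and its settings, so they need not match the 80 to 85% figure quoted in the talk.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Train on all four features this time, so the tree can choose among them.
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(iris.data, iris.target)

# One score per feature; the scores are normalized to sum to 1.
importances = dict(zip(iris.feature_names, tree_clf.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")

# Combined share of the two petal features.
petal_share = sum(score for name, score in importances.items() if name.startswith("petal"))
print(f"petal features combined: {petal_share:.2f}")
```

A feature that the tree never uses for a split gets a score of zero, which is one quick way to see which measurements the model actually relies on.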
On the other hand, neural networks, or ensemble methods called random forests, are sometimes called black box models or opaque models. In a neural net it is very difficult to explain why a particular instance was classified into a particular class. There is ongoing research on explaining what a neural network does, but by the time it arrives, your business will be over; it's still some time away. In that sense, machine learning is extremely important for building explainable AI systems. How feature importance is calculated is a whole lecture; we have a full section on how to do that, so today we don't have time to get into it. Typically, after intuitively understanding how the decision tree model works, we get into the math of it. At each node, the question we ask is: how do I decide which feature to use and which threshold to use? It is based on how much impurity is reduced. There is an optimization that is performed at each node, and the objective function of that optimization is what is written over here. The objective function J is calculated, and the decision variable there is t_k, the threshold. In our course we have a whole section on optimization and the foundations of data science: how do we choose the features, what is an objective function, how do I perform the training, how do I perform the optimization? All of that we will learn. You need to learn all of that to understand exactly what is happening over here.
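The slide with the objective function is not reproduced in the transcript. For reference, the standard CART cost for classification, which is assumed here to be what the slide shows, uses the Gini impurity of a node and a weighted sum over the two children of a split on feature k at threshold t_k:

```latex
% Gini impurity of node i, where p_{i,c} is the fraction of class-c samples in node i
G_i = 1 - \sum_{c} p_{i,c}^{2}

% Cost of splitting on feature k at threshold t_k:
% m samples in the node, m_left and m_right samples in the two children
J(k, t_k) = \frac{m_{\text{left}}}{m}\, G_{\text{left}} + \frac{m_{\text{right}}}{m}\, G_{\text{right}}
```

A pure node, with all samples from one class, has G = 0, so minimizing J means looking for the question that leaves the two children as pure as possible.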
How J is written and how we minimize it are details, but at the end of it, what you really need to understand is that at each node, deciding which question to ask, that is, which feature to ask about and what threshold to put, is the answer of an optimization algorithm. That optimization algorithm is the CART algorithm, the classification and regression tree algorithm, and it is a greedy algorithm. Why is it greedy? Because at each node, it picks the question without worrying about what happens to its children below. It just makes an optimal decision for itself at that time. It will not choose a split such that the child's split plus its own split, the two together, are optimal. It doesn't care about that. It just makes an optimal split for itself, then below it makes an optimal split, and so on. That is why it is called a greedy algorithm. There was one question on why it is called a greedy algorithm.
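The greedy choice made at a single node can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it assumes Gini impurity as the measure, tries midpoints between observed values as candidate thresholds, and uses six made-up measurements. The function names are invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 = pure node)."""
    m = len(labels)
    if m == 0:
        return 0.0
    return 1.0 - sum((count / m) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Try every feature k and threshold t; keep the split with the lowest weighted impurity."""
    m = len(rows)
    best = None  # (cost, feature index, threshold)
    for k in range(len(rows[0])):
        values = sorted({row[k] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2
            left = [lab for row, lab in zip(rows, labels) if row[k] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[k] > t]
            cost = len(left) / m * gini(left) + len(right) / m * gini(right)
            # Greedy: judge this split on its own, ignoring any splits further down.
            if best is None or cost < best[0]:
                best = (cost, k, t)
    return best

# Hypothetical (petal length, petal width) rows in cm, invented for illustration.
rows = [(1.4, 0.2), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (5.9, 2.1), (5.6, 2.4)]
labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

cost, feature, threshold = best_split(rows, labels)
print(f"ask about feature {feature} with threshold {threshold:.2f} (cost {cost:.3f})")
```

A full tree builder would call best_split again on the left and right subsets, each time looking only at the node in front of it, which is exactly the greedy behavior described above.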
Right. So the last point that we must remember is that decision trees are very prone to overfitting. What do we mean by overfitting? We have used some past data to train this decision tree. If we build the decision tree in such a way that it makes a perfect prediction for everything in the past data, then that is a sure-shot recipe for failure in future predictions. It means the tree has learned excessively on that data set. I go back to our friend, the 10th class student who has mugged up the NCERT textbook. He has overtrained himself on the NCERT textbook. Now, if a new question suddenly comes, he doesn't know the decision boundary very well. He will answer only with what is there in the textbook, and that might be wrong. We all know that. That is the problem of overfitting. So how do we deal with overfitting in a decision tree? Since the decision tree asks a question at each node, if we simply use a tree that has as many leaves as the number of training points, we get a perfectly overfit example. It does not learn anything; it just memorizes, it just by-hearts what is there in the data set. To avoid overfitting, we need to restrict the decision tree's degrees of freedom. We should not allow it to by-heart the textbook. How do we do that? There are quantities called hyperparameters, which we must tune in scikit-learn.
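As a rough sketch of what restricting the degrees of freedom looks like in scikit-learn, the example below compares a tree grown without limits against one constrained by two hyperparameters, max_depth and max_leaf_nodes. The specific values (2 and 3) and the variable names are arbitrary choices for illustration, not recommendations from the session.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# No restrictions: the tree keeps splitting until it cannot improve any further.
free_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Restricted degrees of freedom: cap the depth and the number of leaves.
small_tree = DecisionTreeClassifier(max_depth=2, max_leaf_nodes=3, random_state=0).fit(X, y)

print("unrestricted:", free_tree.get_depth(), "levels,", free_tree.get_n_leaves(), "leaves")
print("restricted:  ", small_tree.get_depth(), "levels,", small_tree.get_n_leaves(), "leaves")

# Accuracy on the same data the trees were trained on.
print("training accuracy:", free_tree.score(X, y), "vs", small_tree.score(X, y))
```

A higher training accuracy for the unrestricted tree is not evidence that it is the better model; it mostly shows how closely it has followed the training data. Judging that properly needs data the tree has not seen.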
In scikit-learn, at a very high level, the idea is to control the depth that you go to and the number of leaves that you allow. If you restrict the number of times you make a split and the total number of leaves that are present, that should be enough to constrain the decision tree model. In our full course there is a full assignment in which you will learn how to do this. No discussion is complete without knowing what the issues are with the model that we are learning, and decision trees also have a lot of issues. One of the issues is that a decision tree always produces these orthogonal decision boundaries. So if you rotate the data a little bit, it learns unnecessarily convoluted decision boundaries. Decision trees are also very sensitive to small variations in the training data: if you remove an outlier, the decision tree might change rather dramatically. The reason is that it is a greedy algorithm. If there was an outlier somewhere, it may have made a very bad split at one of those nodes, and that brings it into a problem. Further, scikit-learn's implementation itself is stochastic, in that it randomly selects the features to evaluate, so you might not get the same model when you run it repeatedly. These are some of the issues that we must be aware of when dealing with decision trees. The way to overcome these problems is to use what is called a random forest, which is an ensemble of decision trees. Ensemble means multiple, and a forest has a lot of trees. So we train a lot of trees and make a forest, with each tree trained slightly differently, on different parts of the data set.
Say the first tree learns from the first 10 data points, the second tree learns from 11 to 20, and the third tree learns from 21 to 30. Each tree learns from its own data points, and once you put all these trees together inside one model, that is what is called an ensemble model. That is what is called a random forest, and random forests are powerful models for dealing with tabular data. So, you can launch the next poll. What we have discussed so far is one particular machine learning model that takes an input x and gives an output y to perform the classification task. It can also do the regression task, but we looked at the classification task.
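The random forest idea mentioned above can be tried with scikit-learn's RandomForestClassifier. This is a minimal sketch with arbitrary settings (100 trees, depth 2). One difference from the simplified picture in the talk: by default each tree is trained on a random sample of the data drawn with replacement, not on consecutive blocks of ten points.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# An ensemble of 100 shallow trees, each trained on its own random sample of the data.
forest = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
forest.fit(iris.data, iris.target)

# The individual trained trees are kept in estimators_.
n_trees = len(forest.estimators_)

# The forest combines the answers of all its trees into one prediction.
sample = [[5.1, 3.5, 1.4, 0.2]]  # sepal length, sepal width, petal length, petal width (cm)
prediction = iris.target_names[forest.predict(sample)[0]]
print(n_trees, "trees predict:", prediction)
```

Because the final answer pools many slightly different trees, one bad split caused by an outlier in a single tree has much less influence on the result.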
Okay. So what did we do in that classification task when we were trying to learn the decision boundary? The decision tree has a tree structure. At each node it asks a question: which feature must I look at, and what is the threshold? Based on whether the answer is true or false, it goes down, and it keeps going down until it reaches a leaf. At the leaf it makes a decision about which particular class the data point belongs to. Building that entire structure is what is called decision tree training, and it is done using the classification and regression tree algorithm. That's how the decision tree operates. And a decision tree can be used for making explanations. So, you can launch the poll.
Okay. So one of the questions is: how do we ensure that the training data is sufficient for a correct decision? That is a very good question. There is no machine learning algorithm that will work for bad data. There is no substitute for good data; you need to have really good data. How do we ensure that the training data is sufficient for a correct decision? That comes when you prepare the data and see the distribution that is there. First we have to analyze the data. In fact, the whole process of doing machine learning has several different steps. I didn't go through them because that has already been done in one of our publicly available webinars, which I myself have given, so you can see them there. Before we even get into fitting a model between x and y, we must look at what x looks like and what y looks like. That is when we decide whether the training data that we have is sufficient or not. The next question, on overfitting, is from Kalyana Varma. When we are training this machine learning model between the input and the output, x and y are given to you. If the model is trained in such a way that for the x and y seen during training it predicts the answer very closely, but when you give it a new data point, when the blue square comes, it gives terrible results, then it does not generalize beyond the data it has seen. This is the problem of overfitting. Overfitting means that you have overfit to the training data. Remember our 10th class friend? He has mugged up the NCERT textbook, right?
He has by-hearted everything. He has overfit himself to the NCERT textbook without learning the decision boundaries in a way that generalizes beyond what the NCERT textbook has told him. Now, when you ask a question from another textbook, or a slightly more applied question, he is not able to answer. That is the outcome of overtraining a decision tree model, okay.
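The textbook analogy can be turned into a toy experiment. The sketch below is a deliberately extreme illustration with made-up measurements, not a real training procedure: a "by-heart" model that only stores the training answers is compared with a single learned rule (petal length at most 2.45 cm, assumed from the tree discussed earlier) on points that were not in the training set.

```python
def train_by_heart(points, labels):
    """'Training' that only memorizes: store every point together with its answer."""
    return dict(zip(points, labels))

def accuracy(predict, points, labels):
    """Fraction of points for which the prediction matches the true label."""
    correct = sum(predict(p) == lab for p, lab in zip(points, labels))
    return correct / len(points)

# Hypothetical (petal length, petal width) values in cm, invented for illustration.
train_points = [(1.4, 0.2), (1.3, 0.2), (4.7, 1.4), (5.9, 2.1)]
train_labels = ["setosa", "setosa", "other", "other"]
test_points = [(1.5, 0.2), (4.4, 1.3), (5.6, 2.4)]  # new questions, not in the "textbook"
test_labels = ["setosa", "other", "other"]

memory = train_by_heart(train_points, train_labels)

def by_heart(point):
    # Knows the answer only if this exact point was seen during training.
    return memory.get(point)

def rule(point):
    # A single decision boundary on petal length.
    return "setosa" if point[0] <= 2.45 else "other"

by_heart_train = accuracy(by_heart, train_points, train_labels)
by_heart_test = accuracy(by_heart, test_points, test_labels)
rule_train = accuracy(rule, train_points, train_labels)
rule_test = accuracy(rule, test_points, test_labels)
print("by-heart:", by_heart_train, "on training,", by_heart_test, "on new points")
print("rule:    ", rule_train, "on training,", rule_test, "on new points")
```

Both models look perfect on the training data; only the new points reveal that the memorizing one has learned nothing that generalizes.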
Right. So, how feature importance is determined: sorry, we cannot do that now. It's a whole discussion that we will take up later. Okay, so I think you have seen your poll results. A decision tree does not suffer from the overfitting problem? No, it does suffer from the overfitting problem; we need to constrain the freedom of the decision tree. The decision tree boundaries are orthogonal? Yes, they are orthogonal. The algorithm that is used to train a decision tree is the CART algorithm, the classification and regression tree algorithm. The decision tree is what is called a transparent model or a white box model. It's an explainable model, not an opaque model. Neural networks, on the other hand, are opaque models; there is some ongoing research to explain neural networks, but it's a long way from coming to fruition. So that is what the decision tree is all about. Summarizing everything, we looked at what a data science problem is, how the data science field is laid out, and where machine learning, deep learning, and neural networks are located within it. And we touched upon the different skills that a data scientist must have. These skills involve not only machine learning but also data visualization, data stories, and data engineering. Problem formulation itself is also a very big part: taking data science to a business problem is very, very important. Then we saw what machine learning is all about. We said that in machine learning there are majorly two types of tasks. Actually there are three; there is reinforcement learning also, but we did not touch upon that.
The two main tasks in machine learning are supervised and unsupervised learning. Supervised learning is when both x and y, the input and the output, are given to you. Both the data and the answers are given, and you are trying to learn the rules, that is, the machine learning model that relates the input to the output. If you have the data but you don't know what you're looking for, and you're just mining the data for patterns, that is called an unsupervised learning task. Within supervised learning, classification and regression are the two sub-problems. If you learn how to do classification and regression, you can break down any data science problem into different parts and then solve it. Then unsupervised learning has clustering, anomaly detection, and dimensionality reduction; these are the three main tasks in unsupervised learning. Then we got to know a little bit more about what a classification problem looks like. We looked at the iris data set, which is a small data set of 150 samples with four features and three classes. And we tried to learn a model-based classification algorithm called the decision tree. We saw what a decision tree looks like: at each node it asks a question, and based on the yes or no, it keeps going down until it makes a decision. What the decision tree's training algorithm, the CART algorithm, does is build this tree and tell you what question to ask at each of those nodes, based on a feature and a threshold. It does that using an optimization algorithm.
To get into the details, we need to know what optimization algorithms are. And based on that tree, it makes a prediction. That is what we looked at today. That brings us to the end of what we had planned as the teaching part of today's session. I think I have answered all the questions that are there. Thanks for attending. So, do you want to add anything else?
So if you all have any further queries, probably you can type your questions in the chat box, we can address them and then we can probably end the session as well.
Oh, so there are some questions. Okay. Sorry, were you looking only at the chat? Okay, from the queue: aren't all machine learning algorithms greedy optimization problems? They are not. Not all machine learning algorithms are greedy optimization algorithms; there are other algorithms as well. Next: a decision tree seems to work more like if-then-else or case statements in programming. Yes, that's how it makes those decisions, but what machine learning adds is that it tells you which feature to look at at which node and what the threshold is. Those we don't need to give. We have only defined the structure, a decision tree structure, and based on that structure it learns the questions. Okay. Is there only one parameter at each node? Yes, only one feature is looked at, at each node, and each node has two branches, which is what is called a binary decision tree. If only one feature is looked at, could we not use multiple features, like petal and sepal, at each node? The CART algorithm in scikit-learn does not do that; it asks about only one feature. You can do multiple features, but in practice what has been seen is that using one feature at a node and using a forest of such trees, that is, multiple trees, is found to be better. Next, labeled data and unlabeled data. Labeled data is data which has an outcome: each data point has an answer to it. If the data point does not have an answer, that is unlabeled data; you don't know what it means. You don't know which group a customer belongs to, or, for a particular combination of petal length, petal width, sepal length, and sepal width, you don't know whether it belongs to setosa or not. Iris setosa, versicolor, or virginica is the label for that data set.
So that is what it means. Will an empirical formula derived by intuition come under the umbrella of data science? Okay, that is a philosophy question. You can call that data science, you can call it computational science, you can call it theory or empirical, so that is philosophy. It's a good question, but we don't want to take that now. Okay, the recording: I'm not sure, but that will be taken care of. Okay, so this was good. So, yeah.
We'll be having the cohort starting in the month of September, and applications are already open. In fact, there is very high demand for the program: 70% of the admissions are already filled. Right now we also have a few scholarships available. In case you are planning to apply, the process is that you need to fill in the form online and write your statement of purpose in the form, and we can forward your profile for shortlisting. It's going to be a 10-month program, and it's a weekend-based course, so it is designed for working professionals. We will be having these classes by the faculty only on Saturdays and Sundays. Everything is taught right from the basics, and once students are comfortable, we will introduce you to the advanced techniques. Basically, everyone will learn how to build mathematical models to solve problems.
So every topic that we teach will have a similar structure. There will be an intuitive understanding of what goes on, and how to use that with the programming language, which is Python. There is an entire module zero, called the bridge module, in which, if you don't know how to use Python, you will learn it before you even start coming to the sessions by the IISc professors. TalentSprint will be offering that module zero. Then you will learn all the mathematics that is needed, and you will get training in all the different parts of data science that I showed.
The sessions will be live and interactive, like what we just saw. You will have assignments: if this were a regular class, after this you would get an assignment in which you practice the concepts that were shared. You will have mentors assigned to you, with whom you will have one-on-one or many-on-one interactions. You will have your cohort, and you will learn from each other. And all the faculty will have office hours, so if there are questions that could not be answered by the mentors or your peers and you want to ask us, you can book an office hour slot and ask them.
Typically how it works is that with the assignment you learn the theory, and then the next week there will be a mini project. It is a mini project in the sense that it is a structured project with questions and answers, which you have to do with your team, so that you get hands-on experience in doing this. So we do both theory and hands-on, not just theory and a high-level picture. And in the theory it is not just mathematics; we also give you the intuitive understanding of what this is.
And it is not just intuition; we tell you the mathematics also. I did not go into the details today because you need some more background, but in our regular class I would have gone into how that is calculated and how the optimization is done. So I hope to see many of you in the program. Good luck to all of you.
We are also getting one common query: am I eligible for the program? So let us quickly discuss the eligibility as well. If you have a four-year degree, are comfortable with basic maths at the 10+2 standard level, and have basic coding knowledge (it could be C++, Java, or Python), you are eligible for the program with a minimum of one year of work experience, which can be in any domain. So far we have successfully launched two cohorts, and more than 200 people are already undergoing the program. The average experience of the people in the batch is 12 to 13 years, going up to 20 to 25 years or so. Team leads, project managers, architects, software developers, and generally professionals from IT backgrounds, engineering profiles such as mechanical, and computer science backgrounds enroll for the program, and they come from various sectors: IT, telecom, manufacturing, healthcare. A person should just have a zeal to learn. If you're ready to dedicate your time, you can easily manage the program. This is an advanced course, so it would give you an added advantage, because you will also be learning here to build mathematical models and solve problems. That's right. So yeah, you want to be closer. I'm getting a question: what is the fee for the program? The actual program fee is four lakhs plus 18% GST, but right now a few scholarships are being offered. We can offer up to one lakh as a scholarship right now, for which people need to apply before the 18th of July. There are also various interest-free EMI options available, which students can avail.
The regular classes are live classes. Nothing is pre-recorded; all the classes are live.
Once the class is over, you will get LMS access, through which the recording will be uploaded within one working day. Okay, yeah. Thank you, Professor, for taking out your time, especially on a weekday, and for such an enriching session. I would also like to thank all of you for joining us for the session. If you have any queries, I have posted my number in the chat box and posted the link as well. You can reach out, and we can have a one-on-one call to discuss any further queries you have. Thank you and good day. Hope to see you all in cohort three. Bye.
Watch the entire interview here https://www.youtube.com/watch?v=k80S-lvQDiA