Webinar on Learning to See: Machine Learning in Computer Vision
So today,
we are just going to take up one topic, have a very high-level introduction, and then
chat with you if you have any questions on it. That's the idea. Okay, so here it goes. It's a very
layman's introduction to this topic of computer vision and machine learning. We will get into a little bit of detail towards the end, but
I'll spare the details for your questions, in case you have any. Okay, so let's get started. So, what is computer vision? What has machine learning to do with it? And why do I call it learning to see?
Well, vision is one of the most fundamental perceptions of human beings. We have, of course, the five senses. But out of all those senses, if you look at the brain mapping of it and see how much area or volume of your brain is dedicated to each of the senses, you would notice that vision is by far the most predominant sense for human beings. And that's for a reason: when we operate in this world, visual input is extremely useful. It is useful not just to navigate yourself in this world, but also to tell prey from predator, to understand where you can walk, to recognize people, all kinds of things you do with your vision. So it's an extremely important
sense for human beings. And accordingly, the modern world we have developed is highly visual in nature. Most of the instructions are visual; you hardly ever hear an audio instruction telling you to stop at this red light. It's usually just a red light, that's it. So vision is extremely important for an AI system too: if it wants to operate in this world, it's inevitable that it has to
have the capability of vision, to be able to see. And towards this, the area of computer vision, which aims at deriving usable knowledge of the world from images, has been around for quite some time, several decades now.
Almost three quarters of a century maybe. But
it is only in the very recent past, with the advances in deep learning and machine learning, that computer vision has come of age and become able to create solutions. So let's try to at least get a feel of what computer vision is, what the challenges are, and how solutions to these kinds of problems are derived, from a very high-level perspective. Okay, so my outline for today's talk is this: I'll give you a quick introduction, talk about some of the challenges in computer vision, and then talk about three different approaches to solutions. There are even more, but these are the key ones. One is solutions derived from the geometry of the world: we operate in a 3D world, images are 2D entities, and there is some relationship between the two. How do we derive solutions using that? Second, how do we derive solutions using learning? And third, can we combine these two ideas, geometry and computational learning, together to create solutions? Then we'll quickly look at a couple of applications. And since I'm a faculty member, I'm fond of giving assignments, so at the end I'll give you an assignment to work on, as some food for thought. Okay,
here we go.
So, what is computer vision? Well, you can say that it's the understanding of visual input, whether images or videos, by computers: basically making sense of the images, what is there in the image. Another way to talk about understanding, and this particular aspect is very important, is: can you describe that image? If you can describe the image, then you understand it. In fact, for human beings, understanding has a very close correlation with language. When you say you understand something, it's not just saying that you understand what you're seeing. It actually means that you can understand the components of what you're seeing and describe the relationships between them and what is going on. So there are a lot of language-related aspects involved in it.
Now, it's an interesting question to ask, does computer vision mimic human vision? Well, definitely in the goals it does. We want computer vision to be able to do things similar to human vision. But
is it necessary that it also follows it for its
solution methods? Well, we'll see that it's not necessary. Okay. Why do we try to emulate human vision? Well, because human vision is one of the best. When I say human vision, I'm not just talking about the eyes. Our eyes are definitely
not the best in the animal kingdom; there are other animals which can see better than us. See, in the sense of capturing rays of light. But the ability to process it is far greater for human beings, and hence it's a far more sophisticated
system that we have. The problem is that we don't understand it well. All of us are experts in vision: we can look around and see things very easily, we do things very quickly. But if I asked you how you do these things, it's very difficult for us to understand or to explain. So that is another interesting aspect of computer vision.
So should computers process input like human beings? Well, not necessarily. Many times we do draw inspiration from human vision, but we don't necessarily always limit computer vision to what human vision can do. In fact, nowadays computer vision has surpassed human vision in several tasks. Okay, so let's just look at a couple of issues with human vision and see if this makes sense. If you look at the two pictures on the left, at a first quick glance of those two faces, yes, they are upside down, but nothing looks wrong with them. However, if you were to look at them right side up, then you suddenly see that, whoa, there's something wrong with the left face. The eyes are not right, the lips are not right. You see these things very quickly. Whereas when it was upside down, unless you were looking at the minute details, you don't realize there is something wrong with it. And that's an interesting point. It very clearly tells you that human vision takes some shortcuts to figure out whether the world around it is okay or not. Okay. Another similar problem you can see on the right: there is a checkerboard kind of thing on which there is a cylinder, and there is some light coming from the top right and casting a shadow here. Now the question is, can you look at these cells A and B and tell me which one is brighter? Which one has a lighter color, and which one has a darker color? I'm not talking about the letters A and B; I'm talking about the two cells on which A and B are written. And it's pretty obvious to most of us that A is the black cell, dark in color, and B is much brighter. However closely you observe these two, you can't figure out any way in which this could be different. It is very clear to us that B is much brighter than A.
But this is another problem with human vision. What if I just mask out the rest of that
checkerboard? Let's try to do that: I just masked out the rest of the checkerboard. Now look at A and B and tell me if one is brighter than the other or not. If you look at it, they're actually identical in terms of color. I did not do anything; I just put a mask on top of the image. I can go back to the previous one here.
And I did not change the image at all in between; it was just a simple JPEG image put there. You can very clearly see that the context has been misleading us: A and B are actually of the same intensity. But when you see it in context, the context of a checkerboard at a particular angle, a cylinder, light coming from one side, all these things together, your brain starts interpreting B as a white cell in a shadow and A as a dark cell in bright light. Because of that, the human brain kind of separates out the light and tells you what the actual color of that cell is, or at least what it thinks the color is. So we don't really see what is out there; we see an interpretation of the world by our brain, and that sometimes can be really good and sometimes can be bad. Okay, another popular example down here asks the question: of the two horizontal lines here and here, which one is longer? At a quick look, of course, this one looks shorter, but if you try to match the edges, you will see that actually both of them are the same length. Okay. So we have seen several of these optical illusions, but my point is that the human visual system is not the best. Now, another example here: if you look at this one, unless you are looking at it on a mobile phone, you should see some kind of motion on the screen. You'll see the circles kind of rotating here and there.
But the problem is that if you focus at any point on this, you don't see that point moving; you see other parts moving. Actually, there is no movement in this slide at all. This is not a video; this is a simple JPEG image. Nothing is moving here. Your brain is playing tricks with your eyes and making it look like something is moving. Anyway, I think we have had enough
optical illusions to tell us that our human vision is not necessarily the best. A computer vision system will just look at it and say there is absolutely no motion, because it wouldn't get fooled like this. Okay? My point is that the human vision system has learned a lot of
information, and we use that learned information to interpret these images. And it is both a
gift for the human vision system as well as an impediment to actually understanding what is out there in the world.
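The checker-shadow point can be made concrete with a tiny sketch. The pixel values here are hypothetical, not taken from the actual illusion image; the point is simply that a computer compares raw numbers, so identical intensities are reported as identical regardless of the surrounding context that fools us.

```python
# Hypothetical 8-bit grayscale patches around cells A and B of a
# checker-shadow image. The centre value is the cell's intensity;
# the surrounding values are the context that misleads human vision.
patch_a = [
    [200, 200, 200],
    [200, 120, 200],   # cell A: bright surround, centre = 120
    [200, 200, 200],
]
patch_b = [
    [60, 60, 60],
    [60, 120, 60],     # cell B: dark surround, centre = 120
    [60, 60, 60],
]

def centre_intensity(patch):
    """Read the raw pixel value at the centre of a 3x3 patch."""
    return patch[1][1]

# A computer compares numbers, not interpretations: the two cells
# are reported as exactly equal, context notwithstanding.
print(centre_intensity(patch_a) == centre_intensity(patch_b))  # True
```

A human observer would swear the two centres differ; the program, which never builds an interpretation of lighting and shadow, reports them as equal.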
Okay, so let's keep that apart, and then come back to vision and ask some basic questions. What are the fundamental
problems that are out there in computer vision? Okay. If you look at human vision, we tend to have the urge to do certain things when we look at an image. For example, if I look at this image on the left, the first urge that we'll talk about is the urge to group. When I look at this image on the left, of the road, the car, and the trees, I don't see individual pixels in there. I don't say, okay, that pixel is a road pixel, a black pixel, a yellow pixel. No, I just see the road. I combine all those pixels together and say, oh, that's one entity; the car together, that's one entity; the trees together, that's one entity. You group those pixels into regions. Similarly, if you look at the image on the right side, you don't see individual pixels of yellow, blue, green, black, whatever; you just see, okay, that's a person there, another person, another person, that aircraft there, the sky behind, the greenery. You just group those things together and directly think of them as single objects. And this kind of grouping is extremely important for us to make sense of what we see. For the human brain, it's almost impossible not to do it:
you look at something, and immediately your brain does the grouping. To be able to stop that is very difficult; you can train your brain a lot to do some of that, but it's extremely difficult. If any of you are artists who try to look at a scene and draw it, one of the ways in which you can train your brain to draw well is to stop your brain from doing these grouping and recognition kinds of tasks. Okay. So this is what we actually see: you see these groups instead of the
individual pixels that they are made of. Second is our urge to recognize. You just look at the first picture, and your brain says that is Subhash Chandra Bose; the second, you'd say, is the Qutub Minar, a peacock, or Mark Zuckerberg, whatever it is. It just immediately says what it is, rather than you having to look at all the bits and pieces and figure out what is what: okay, there is a cylindrical structure with some curves on it, then go back and match it against every cylindrical structure you have seen in the world, and finally figure out, oh, this is the Qutub Minar. You don't do that. You just look at it, and out comes from your brain: oh, that's the Qutub Minar, or that's a peacock. You probably have never seen a peacock in this specific orientation in your life, but it doesn't matter; within a fraction of a second, you recognize it. And the point is that you cannot stop your brain from recognizing; that's why it's called an urge to recognize. The third is an urge to measure. You look at something: how far away objects are, how many things are there. You're throwing something: how far away is the javelin going to fall? All of these are aspects of vision that are very important for us, measuring distances, counting things, and so on. And you do estimate these things approximately. I just look at this
tomb on the top right: how far away is it? Probably a few hundred meters away; you have some estimate of it, maybe 200 meters away. So this kind of estimation we keep doing all the time.
Okay. So this is the third aspect of vision. So, your urge to group, your urge to recognize, and your urge to measure: these are three fundamental things that your human brain does automatically. And in fact, these are the same problems: if you can teach the computer to do these things, then you will actually get a very good vision system. So that's what we want to achieve.
But the problem is that
what is so easy for you, so obvious for you, and almost involuntary for you, is not at all easy for computers.
And why is it so difficult for computers? One particular exercise that we can do is to put ourselves in the shoes of the computer. Okay, let's take this picture. All of us just look at it, and immediately our brain says, oh, that's Charlie Chaplin. Okay.
But imagine what a computer sees when it looks at this picture. Okay? If
it were to look at, let's say, the small region around the eyes of this picture, what does the computer see when the computer sees this?
Okay, something like that. I haven't put the exact numbers there, but a whole bunch of numbers; it's approximately correct. Some range of numbers that you see there.
Now, if that picture on the left is not given to you, and only those arrays of numbers are given to you, will you look at them and say, well, that's an eye? Very difficult for you to do. So now you understand what the computer is looking at: the computer has to look at these numbers and figure out whether these pixels should be grouped together. What is the color? What is the shape? What is surrounding it? Can you take a black region, a white region around it, and another black line around that, put all those things together, and call it the eye of a person?
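What the computer "sees" can be sketched as follows: a small grid of made-up grayscale values standing in for the region around the eye, together with the most primitive ground-up operation, labelling each pixel as dark or bright, from which any higher-level grouping would have to be assembled. All the numbers here are hypothetical.

```python
# A hypothetical 6x6 grayscale patch (values 0-255), standing in for
# the region around the eye: this grid of numbers is all the computer
# actually receives.
patch = [
    [210, 205, 198, 201, 207, 212],
    [203,  90,  45,  52,  95, 205],
    [199,  40,  15,  18,  48, 200],
    [202,  95,  50,  44,  92, 208],
    [206, 201, 195, 199, 204, 210],
    [211, 208, 203, 206, 209, 213],
]

# The most primitive "ground up" step: label each pixel dark or bright.
# Higher-level structure (a dark blob surrounded by bright skin) has to
# be assembled from decisions like this one, layer by layer.
def binarize(patch, threshold=128):
    return [[1 if v < threshold else 0 for v in row] for row in patch]

mask = binarize(patch)
dark_pixels = sum(sum(row) for row in mask)
print(dark_pixels)  # size of the dark blob, in pixels
```

Deciding that those dark pixels together form "an eye" still takes many more layers of reasoning on top of this; that is the sense in which the computer must build everything from the ground up.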
That is what the computer has to do: it has to build up this information from the ground up, bit by bit. And that's the reason why it's so difficult. The information present is very low level, and the higher-level information that we are used to getting is actually the result of several layers of processing that happen in our visual system. We do this involuntarily, and we can't stop it. Another problem is that things appear in a large variety of shapes and sizes. For example,
what are these? Apparently, these are all chairs. Okay? Now, if I asked you for a definition of what a chair is, you might say that it has four legs, it has a flat surface on which you can sit, maybe there is some support at the back. You say all that, and then ask the question: which of these pictures fit that definition?
Maybe almost none of them. Okay, none of them fit the definition of a chair. But still, all of them are chairs. For some of them, of course, you need to think a little bit to figure out how to sit on them, but these are indeed chairs. So that is the second difficulty of computer vision: the variety of objects is extremely large. Okay.
Technically, this is called intra-class variation. Intra means within; inter is usually between. Within a class, within the class of chairs, there is so much variety that for a computer to learn what is a chair and what is not is not easy. Basically, you need to ask the question of this property of sit-ability: can you sit on this object? Then it's a chair, maybe. Okay. So that's the problem. It's very difficult for computers to learn this from a set of rules. You cannot code this in; you cannot write a set of lines of code which would say, okay, check for this, then check for this, then check for this, if-then-else. You can't build a program like that. To do this, you need a completely different approach, and that is where machine learning comes in. That's why we want the computer to learn to see, rather than teach it, line by line, step-by-step ways of how to see. Okay, so we'll get to that in a minute.
So we can infer a lot from pictures. The question is, can we make computers do the same? Okay. And
the last aspect that I pointed out before is: do we really understand how we infer? We don't really understand how we are doing it, and that's the problem. That's the reason why we cannot make the computer understand by coding; you need to let the computer learn. Okay, we'll get back to that. So now, given all these challenges, let's look at three different approaches to solving the problem of vision. Okay, the first clue for doing anything, especially measurement of the world, comes from geometry. The fact that the world is a 3D world, that the image you're looking at is a 2D object,
and that there is some kind of relationship between the 3D world and the 2D
image that you see, is what is used. If you look at a camera which captures images, the simplest variant of it is what is called a pinhole camera: you have a light-tight box, you have a single hole in the front, and light enters through that hole and forms an image at the back. Most of you might have seen this kind of pinhole camera; you might have made one in your school time.
This is the simplest form of a camera, and this is what we use in computer vision to create a mathematical model. So here's the simplest mathematical model for a camera. The blue box on the right side is a light-tight box with a small
hole in the front of it, and there is an object out in the world. Whatever light falls on this pencil gets reflected from it. So there's light falling here, and the light reflecting off this point will be yellow in color, because the object is yellow. Several rays of light go out into the world, but one of those rays will happen to pass through this pinhole and fall on the other side of the camera, and hence that part of the screen will be lit in yellow. Similarly, light from here will have a pinkish color, and that's what will appear here. Here the object is dark in color, so very little light falls here, and this part will also look dark. So if you look at this, you can see that an image of this object out here is being formed at the back of the camera. It will be upside down, but that's okay, that's just a detail. This is what happens in the camera. Now that you have this picture, you can create some very simple geometry out of it. All of you probably remember your high school geometry: there is one triangle like this, and a second triangle out here. If you ask your kids, they would say, oh, those are similar triangles. Okay? If they are in school,
around the age of eight to ten, that's when they learn it. Since these two triangles are similar, I can use the ratio of their sides: Y divided by Z, the base over the height of the first triangle, is equal to y divided by f for the second triangle. From there, I can write the equation y = f × Y / Z, where f is the distance between the pinhole and the back of the camera, Z is the distance to the object, Y is the height of the object, and y is the height of the image of the object. A very simple equation: y = f Y / Z. Now, what can we do with it? Well,
with just this much, you can actually start solving real-world problems. So let's take a simple problem. I'm not going to go through the details, but just imagine this: assume you have a person, say 1.75 meters tall, standing at a distance of seven meters from the camera, and the camera has a focal length of 50 millimeters, which is the distance from the pinhole to the back. The sensor at the back is three centimeters tall and has a resolution of 3000 pixels. The question is: what is the height of the person in pixels? You can just use the equation that we learned:
1.75 divided by 7 equals the size of the image of the person in millimeters divided by 50. So you can use that to find the size of the person's image. In this particular case, I'm not going to do it; you can try it. But what I'm saying is that with this very simple geometry equation, you can already start answering questions. You can do the reverse as well: I can measure the height of the person in the image, which is what a computer vision system can do, and if I know that this was this particular person and his height is 1.75 meters,
my computer can tell how far away the person is. So we can already start answering questions like how far away this person is from the camera, using the same simple equation. Okay. You can also ask other kinds of questions which can be solved using the same equation. I'm not going to get into the details, but: if I move the camera up by one meter, how much does the person move down in the image? That can also be solved with the same equation. You can also ask: how much does the sun move in the image? If I take my camera, take a picture of the sun, and move the camera by one meter, how much will the sun move down in the image? The same similar-triangles equation applies, except that instead of the person being seven meters away, the sun is 150 million kilometers away. You will notice that when you plug that number in, the sun will hardly move. This is the reason why, when you are sitting in a train at night and looking out, you see the mountains which are far away, or the moon that is rising: as you move, the moon seems to be following you, whereas the trees are passing by. That is basically coming out of the same equation. So there are a lot of interesting things that you can connect with a very simple geometrical fact. My point is that geometry is one of the important cues with which you can create computer vision solutions.
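The similar-triangles relation y = f·Y/Z translates directly into a few lines of code. This is a minimal sketch in Python; the function names and the unit-conversion helper are mine, and the numbers are the ones assumed in the talk's example (a 1.75 m person at 7 m, a 50 mm focal length, a 30 mm sensor with 3000 pixels).

```python
def image_height_mm(object_height_m, distance_m_, focal_length_mm):
    """Pinhole projection: y = f * Y / Z."""
    return focal_length_mm * object_height_m / distance_m_

def mm_to_pixels(height_mm, sensor_height_mm=30.0, sensor_pixels=3000):
    """Convert a size on the sensor to pixels using its resolution."""
    return height_mm * sensor_pixels / sensor_height_mm

# The worked example from the talk: a 1.75 m person, 7 m away, f = 50 mm.
y_mm = image_height_mm(1.75, 7.0, 50.0)      # 12.5 mm on the sensor
print(mm_to_pixels(y_mm))                    # 1250.0 pixels

# The inverse problem: known height + measured image size -> distance.
def distance_m(object_height_m, image_h_mm, focal_length_mm):
    return focal_length_mm * object_height_m / image_h_mm

print(distance_m(1.75, 12.5, 50.0))          # 7.0 metres

# Why the sun "follows" you: for Z = 150 million km (in metres below),
# a 1 m camera movement shifts the image by a vanishingly small amount.
sun_shift = image_height_mm(1.0, 150e9, 50.0)
print(sun_shift)  # effectively zero
```

Running this gives 1250 pixels for the person's height and recovers the 7 m distance from the measured image size; the sun's image shifts by a fraction of a nanometre, which is exactly the parallax effect described above.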
The second approach is probably most closely related to the title of this talk, learning to see. Again, I'm not going to get into the details of
how we do it. But
one of the classical problems in computer vision that we already talked about is segmentation: can you group pixels together and form objects of the same kind? In this case, there is a picture of an animal, a llama, and I want to separate the pixels that belong to the llama from the background. The problem is that the pixels of the background are of a similar color to the body of the llama. So how do you separate them and get this thing out from the rest? Okay, there are several approaches to solve this, and
the popular approach that is currently out there is based on learning. And this is where
we try to learn to segment. That is, if you're given an input image like this, let's say a
road scene with cars and so on, can the computer output something like what's on the right side, where the road pixels are grouped together, the sidewalks are grouped together, the buildings and other things are grouped together, the sky is grouped together, and the cars are separately grouped? If it's possible to do that, it actually helps you a lot if, say, you want to do autonomous navigation or driver assistance kinds of problems. But how do you do that? Well, there is a set of planes that you see in between; this is what is called a deep neural network. We will not get into the details of that, but the fundamental approach here is learning. And how do you teach the computer to do this task? Well, you start with some training data: a whole bunch of images of this type and their corresponding segmentations on the right side. It comes as pairs, an image and its segmentation. So you have pairs of these as training data. Then you feed these at both ends of the neural network and do some computations from front to back and back to front
in this neural network. Through this process, you tweak the parameters inside the neural network, and it will slowly start learning how to do this. By applying the image on the left and demanding that this particular output be created at the other end, you can go from both ends and tweak the parameters inside. That, from a very high level, is what is going on when we say learning: you are given a problem and its solution, and like a student, the network learns how to go from the problem to the solution. Here, the problem is the image, the solution is the segmentation, and you show the network several examples, actually millions of examples of this
kind, and it will slowly learn how to segment.
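To make the idea of learning from (image, segmentation) pairs concrete, here is a deliberately tiny stand-in for the process, not the actual deep network: a single-parameter "segmenter" that labels each pixel from its intensity alone, with the parameter tuned so that the outputs match the example segmentations. The data and names here are hypothetical.

```python
import random

random.seed(0)

def make_pair():
    """Hypothetical training pair: bright pixels are foreground (1),
    dark pixels are background (0)."""
    image = [random.randint(0, 255) for _ in range(100)]
    label = [1 if v > 150 else 0 for v in image]   # ground-truth segmentation
    return image, label

def accuracy(threshold, pairs):
    """How well a given threshold reproduces the example segmentations."""
    correct = total = 0
    for image, label in pairs:
        for v, y in zip(image, label):
            correct += ((1 if v > threshold else 0) == y)
            total += 1
    return correct / total

# "Training": sweep the single parameter and keep the value that best
# reproduces the example segmentations. This plays the role, in
# miniature, of tweaking a network's weights against millions of
# (image, segmentation) pairs.
pairs = [make_pair() for _ in range(20)]
best = max(range(256), key=lambda t: accuracy(t, pairs))
print(accuracy(best, pairs))  # matches the training pairs
```

A real deep network has millions of parameters rather than one, and learns far richer cues than raw intensity, but the loop is the same: show problem and solution pairs, and adjust parameters until the outputs match.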
Okay, so that's the second class of
approaches, where you try to learn, and this is just one example. This deep-neural-network-based
approach can be used to do recognition, it can be used to do segmentation, it can be used to do measurement, it can be used to do 3D reconstruction
of the world; a whole bunch of things can be done using this deep-learning-based approach, because it's a very generic approach: you have input and output, and you let the machine learn how to go from one to the other, rather than telling the machine how to do it. Let it learn from examples. And that has been the predominant approach that has produced
remarkable results in the recent past. Okay. And the third approach that I mentioned is combining geometry and computation, through this area called computational geometry. One example I will take here, this is from our own work in the lab, is the problem of capturing a 360-degree stereo video. I want to capture not just a small portion of the world like a regular camera; I want to see all around me. And not just that: I want to see all around me in 3D, in stereo. Okay. So what is this problem like? When we look at the world, my head can rotate around, I can look around, or my body can turn around; I can also look up and down. But I hardly ever twist my head around the viewing axis; this roll we hardly ever do, so we can ignore it. Pitching up and down you can also ignore if you have a sufficiently tall image; you can see from top to bottom. The problem is looking around you. Okay, so the question we ask is: if we want to capture those 360-degree images in stereo, what should we do? If you look at the two eyes of a person and imagine them to be two cameras, you can think of them as sitting on the edge of a circle. When you turn your head around, it is equivalent to turning these two cameras, so at any single instant each is looking in one direction. Okay. Now, when you turn your head,
these two cameras will rotate around in a circle. So the left eye will be rotating around in a circle like this, looking in one direction, and the right eye will be looking in the other direction.
Of course, I haven't drawn this properly; I drew the left and right eyes on two different circles, but actually these two circles are one on top of the other. Okay? The picture is that these two circles should be concentric. In other words, it should be something like this: the right-eye views will all look like this, and the left-eye views will look like this. So what you really need to capture, if you want both left- and right-eye views of a 360-degree
scene in stereo, is the views from all these cameras on the circle,
for the right eye and the left eye. All these images have to be captured if you want to really capture the stereo panorama. The problem is that if I put a bunch of cameras like this, each camera will just see the camera in front of it, and it won't be able to see the world. Okay, so that's the issue. So how do we solve this? For this kind of problem, you cannot just use computation to get around it, because unless you actually capture the information, you will not be able to solve it. So the capturing of information itself becomes important. One of the solutions that people have proposed is that if I take a camera and, instead of putting it on the circle, put it some distance away from the circle, then I can kind of simulate these two cameras: whatever ray of light I want to see can be approximated with the rays of light that this outside camera sees. Okay, and you put a bunch of cameras like this all around in a circle, and based on that you try to capture the
visuals that you need. Such solutions have been created. The problem is that this takes a whole bunch of computation to solve.
A second approach could be to take a single camera, rotate it, and cut out strips corresponding to the left eye and the right eye from the whole bunch of images you get as the camera rotates. From there, you can compute 360-degree panoramic views for both the left eye and the right eye. This is a solution that has existed for a long time, and people used to do this. The problem is that with a rotating camera, you cannot capture videos: the world is moving, and if I rotate the camera, by the time it comes back around, the world will have moved, and things don't look right. Okay, so you want to be able to see not just all 360 degrees, but all 360 degrees at the same time; only then can you actually create videos. So we tried to create a solution: we took a camera and put some fancy optics on top of it which capture all this information at the same time. The fact that we have computation means that the image we get can look like what you see on the right side, and you take that and do some computation on top of it to reconstruct a whole scene. So this scene is actually reconstructed from that weird-looking image on the right side. The actual solution looks something like this: the camera has a mirror-like surface, and there are cameras inside. The rays of light that are incident on that mirror get reflected back into the camera inside, and it can do the rest. Okay. And in fact, this particular solution that we created in our lab was quite successful.
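The rotating-camera strip method described above can be sketched in a few lines: for each rotation step, take a vertical strip to one side of the frame's centre for the left eye and to the other side for the right eye, then concatenate the strips into two panoramas. This is an illustrative sketch with made-up frame sizes, not the lab's actual pipeline.

```python
# Sketch of the classic rotating-camera stereo panorama: as the camera
# rotates, cut a vertical strip offset from the centre column, one side
# for each eye, and concatenate the strips. Frames are hypothetical 2D
# arrays (rows x columns of intensities).

FRAME_W = 9
STRIP_OFFSET = 2   # how far from the centre column each eye's strip sits

def strip(frame, column):
    """Extract a one-pixel-wide vertical strip at the given column."""
    return [row[column] for row in frame]

def stereo_panorama(frames):
    """One strip per rotation step, for each eye."""
    centre = FRAME_W // 2
    left_eye, right_eye = [], []
    for frame in frames:
        left_eye.append(strip(frame, centre + STRIP_OFFSET))
        right_eye.append(strip(frame, centre - STRIP_OFFSET))
    return left_eye, right_eye

# Tiny synthetic capture: 4 rotation steps, 3-row frames whose pixel
# values encode (step, column) so the stitching is easy to inspect.
frames = [[[10 * step + col for col in range(FRAME_W)] for _ in range(3)]
          for step in range(4)]
left, right = stereo_panorama(frames)
print(len(left), len(right))  # one strip per rotation step, per eye
```

The key limitation mentioned in the talk is visible in the structure: each strip comes from a different rotation step, so a moving scene would be smeared across the panorama, which is why a capture-everything-at-once design was needed for video.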
My student who was working on it actually started a company, and they are now selling this kind of camera. It can be attached to a robot and used for navigation, because it knows the distance to different objects. Remember, it is seeing 360-degree stereo, so it can see all around and also measure distance; you can find out how far away each person is from the camera. And because it produces an actual image, it can also recognise human beings, so it knows there are people here and here. That is useful for the robot in this example, a UV disinfectant robot, which is supposed to shine UV light and disinfect places, especially useful in the current pandemic situation. The problem is that if people are around, shining UV light at them is not good for them. So the robot has to keep moving around and looking for places where there are no human beings, and only then irradiate the place with UV light to disinfect it.
Because you get 360-degree stereo, you can also use this for virtual reality applications, or for surveillance: attach the camera at the top and see the world in 3D. All kinds of applications come out of it. So that's a quick look at three different directions of computer vision.
I'll just spend a few minutes quickly going through a few examples of applications that
computer vision can
give you. We talked about segmentation, where you try to group things together, and one application of segmentation is what you see on the left side. This is a scan of the human body, with the lungs being segmented out. Once you segment the lungs, you can also measure their size and volume, and based on that you can create objective measurements of lung capacity and so on. This is used in automated medical diagnostics. On the right side, you see some satellite images, actually a picture of the Amazon forest, based on the vegetation there. It's not a real-colour image; it's false colour, in the sense that it is captured in spectral bands outside the visible range. For computer vision systems, that doesn't matter: you don't necessarily need to work with the visual spectrum; you can work with infrared, ultraviolet, even radio waves. So it can be used to figure out how much vegetation there is in a region. You can also use computer vision to take a large number of images and do 3D reconstruction of monuments. For example, we did a project in Hampi, where we were trying to create a 3D model of the Vittala temple there.
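The segment-then-measure idea from the lung example can be shown with a toy sketch. Real medical pipelines use learned models and careful calibration; this minimal illustration, with made-up intensity values and a hypothetical voxel size, only shows the principle that segmenting a structure lets you measure it objectively.

```python
import numpy as np

def segment_and_measure(image, threshold, voxel_volume_mm3=1.0):
    """Toy intensity-threshold segmentation: voxels below the
    threshold are treated as the structure of interest (e.g. the
    air-filled lung is darker than tissue in a CT scan), and its
    volume is estimated by counting voxels."""
    mask = image < threshold          # boolean segmentation mask
    volume = mask.sum() * voxel_volume_mm3
    return mask, volume

# A tiny fake "scan": dark values (30) are the structure of interest.
scan = np.array([[100,  30,  30],
                 [100,  30, 100],
                 [100, 100, 100]])
mask, vol = segment_and_measure(scan, threshold=50, voxel_volume_mm3=2.0)
print(vol)  # 3 voxels below threshold -> 6.0 mm^3
```

The same count-the-mask trick is what turns a vegetation segmentation of a satellite image into an area estimate.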
So that is what we can achieve: you can create 3D models. There is actually a 3D model there that you can rotate, 3D print and so on, which is quite useful. Another very useful application, which has been around for a long time, is automated inspection. If you are inspecting a PCB with a large number of small connections, you want to verify that every single connection is intact and nothing is wrong. Computer vision is a very effective solution for that: if you have people do it, they get tired very soon and start missing things; it is actually torture for people to do this task. A computer vision system can do it day and night without any problems. Biometrics is another very popular application of computer vision, where you recognise people based on their face, their iris, their fingerprint, and several other signals, even speech. There have been a lot of cases of biometrics being used to recognise people; of course, all of us know about the Aadhaar system, which we use here in India, and that is fundamentally built upon biometrics. Broadcasting and entertainment have a lot of mixed-reality applications, where things are drawn on the ground that convey information: when a goal kick is being taken, how far the ball is from the goal line can be shown very effectively on the ground, or lines you need to cross to go to the next level. You have probably seen recently, in the javelin throw event, the gold and silver marks shown as lines on the ground; they don't physically exist, they are just added from outside. It looks almost real nowadays, but there is no real line out there. There are also a lot of applications in movie entertainment: capturing and creating 3D models of actors and actresses and imparting their motions to characters is another very popular application of computer vision. Several others: surveillance, automated assembly, mail sorting, face detection in cameras, robot navigation, content-based image retrieval, entertainment. This is an area that is ever growing, so if you are getting into it, there is definitely a lot you can do. Why do we do this? Well, unlike human vision, if you get it right, it has very high reliability: once it gets it right, it tends to keep getting it right, and it doesn't get tired, so high repeatability. More objective evaluation: you can actually measure things more accurately. Cost is lower, you can operate at higher speeds, and you gain the ability to operate in hazardous environments; in some cases you cannot send human beings into those environments, but you can send a robot with a camera on it, and that can do interesting work.
So this is where the world is in terms of computer vision. There are no general-purpose vision systems right now. A human can do a whole bunch of things on their own; computer vision systems tend to be specialists. If it's a vision system built for a self-driving car, it only does self-driving, nothing else. It cannot look at a fruit and figure out whether it's an apple or an orange, because it's not built for that. So these are usually very specialised systems, unlike the human vision system. That is an interesting point to note; maybe in the future there will be generalised vision systems. Things like automated road safety have been gaining a lot of attention, and if you are from India, then of course far more interesting things happen on our roads. Computer vision systems have to deal with scenes like this rather than nicely delineated roads, and that is a big challenge. We have been running some competitions that try to improve the state of the art in computer vision, especially for navigation in unstructured environments like this.
Okay, so that's the end of it. Before we finish, I'll give you one problem to think about, but first, if you have any questions, I'll take those. At the end, I'll leave you with one problem to think about if you are interested in thinking deeper. It is not a solved problem; if you ask me how to solve it, I also don't know, but of course you can always think of potential directions. Before getting into that, any questions from anyone? I'd be happy to answer.
Thank you very much. We already have a couple of questions. What I'll do is read out the questions for you one by one, and we'll try to take as many as possible, because we also have a problem to be discussed as an assignment. Keeping the time constraints in mind, we'll address as many as we can. To start with, Ashwin asks: "I do not have any programming background; can I join the AI/ML program?" In that context, Ashwin, some basics of Python, or a basic understanding of any programming language, is something we can definitely consider for admission; you will just have to spend a little more time while working through the exercises. I hope I'm right on this, Professor? Yes. Yes. And we have a question from Amarnath Mukherjee: does the Hawk-Eye system use computer vision? Absolutely. Hawk-Eye is one of the examples of computer vision, where they use a set of cameras placed around the ground and combine the images from all of them; all of these cameras have to be synced to the nanosecond, so they are extremely fast, calibrated, synchronised cameras. They detect the ball and track it in each of the images, and then combine the information from all the cameras together to figure out exactly where the ball pitched and which way it is moving. It's a very precise solution, fundamentally built from geometry; we talked about the basics of geometry, and it uses geometry-based solutions to do this measurement. A very successful solution for that purpose. Yes, it is a computer vision solution.
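At its core, combining the synchronised, calibrated Hawk-Eye views to locate the ball is multi-view triangulation, the geometry referred to above. Here is a minimal sketch of linear (DLT) triangulation; the camera matrices and observations below are toy values, not anything from a real Hawk-Eye rig.

```python
import numpy as np

def triangulate(projections, points_2d):
    """Linear (DLT) triangulation of one 3-D point from several
    calibrated cameras. projections: list of 3x4 camera matrices P;
    points_2d: matching (u, v) observations of the same point.
    Each view contributes two rows u*P[2]-P[0] and v*P[2]-P[1];
    the 3-D point is the least-squares null vector of the stack."""
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.array(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenise

# Two toy cameras observing the point (1, 2, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])            # camera at origin
P2 = np.hstack([np.eye(3), np.array([[-1.], [0.], [0.]])])  # shifted along x
X_true = np.array([1.0, 2.0, 5.0])
uv1 = (X_true[0] / X_true[2], X_true[1] / X_true[2])
uv2 = ((X_true[0] - 1) / X_true[2], X_true[1] / X_true[2])
print(np.round(triangulate([P1, P2], [uv1, uv2]), 3))  # ~[1. 2. 5.]
```

A real system adds lens calibration, ball detection in each frame, and tracking over time, but the geometric heart is this least-squares intersection of viewing rays.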
Hope that we've been able to address that question. So moving to the next question.
Harish is asking what kind of hardware and software is required to implement AI-based vision. Okay, good question. All of us already have sufficient compute power for deploying an AI solution: our mobile phones are usually enough. A phone has a camera, a processor and a display, and most vision solutions can be deployed on one. But many of the modern solutions are based on deep learning, and some of them can be pretty heavy, so often you send the data to a server and run it there. In practice, deploying a solution doesn't require that much: if you want to try things out, most laptops are good enough, and you don't need specialised computing for deployment. But training systems usually requires larger computation power. And when I say learning here, I mean machine learning, not your learning. For your own learning, laptops are more than sufficient; but when you actually try to teach computers with huge amounts of data, you require serious compute: machines with multiple GPUs and so on. Usually you get that from the web. Many options are available free of cost, like Google Colab, for several hours at a time; otherwise you can rent systems from clouds like Amazon's.
So moving on to the next question.
"How much accuracy in night vision analytics?" I'm not sure we fully understand the question. Could you please type the question once again with a little more detail? We'll try to address it then.
We have one more question: suggest a good book for computer vision. Professor, would you want to add something here?

Sure. For most of the basic material in computer vision, there is a nice book by Richard Szeliski; he was at Microsoft Research, though I'm not sure if he's still there. If you search for "computer vision Szeliski" (that's S-z-e-l-i-s-k-i), you will find it; there is a website where the PDF is freely available, so you can download and use it. If you are interested in the geometry part, there is a specific book called Multiple View Geometry, by Richard Hartley and Andrew Zisserman, the two authors, and that is considered something of a bible of geometry-based computer vision. I think these two should be more than sufficient. There are newer books coming out, especially on deep learning in computer vision, which is a new area, and I cannot say that one is definitely better than another. But there is a course that Stanford offers which, if you are interested in computer vision and deep learning together, is a useful one; I think it is CS231n. You can look that up and get a lot of information from there. And of course our AI/ML program covers mostly the ML part, but several of the examples we pick up in the course are vision-based, so you will get a good feel for it.
Thank you, Professor. Moving on to the next question: a participant is asking how algorithms are created for the specific approaches you were discussing in the slides.

That's a difficult question to answer in a setting like this; if we had a specific example, we could talk about it, but that would again be a very long discussion. Computer vision draws upon several fundamental branches: primarily linear algebra, probability and statistics, and a bit of calculus. Sometimes a bit of optics comes in as well, and sometimes a bit of signal processing. So there are a lot of fundamental areas from which ideas are drawn to solve computer vision problems; it's not a single simple direction. Many times people focus on one or two of these basic areas and try to solve problems using only those, because gaining expertise in all of them at the highest level is not easy. That's why you will see people specialise in certain kinds of applications, because of the background knowledge they build up. So approaches vary significantly. But if you ask me for the one area of mathematics that has most positively influenced computer vision, it is the area of optimization.
Rajiv Shah has asked: are there any computer vision solutions already available for sorting things based on size or quality, say in the food industry? There are several. When I took my first computer vision course, that was in 1999, the first year of my PhD, we did a case study on one such system developed by IBM, called VeggieVision. The goal was: can you take a bunch of vegetables in a plastic cover, like in a supermarket, put it on a weighing scale with a camera on top, and have the camera look through the cover, reflections on the plastic and all, and figure out what vegetable it is? And not just the vegetable, but the specific variety: it's not enough to say this is a mango, you need to say whether it is a Banganapalle or an Alfonso or whatever, because the cost is different. Can you do that automatically, was the question, and they did develop a system that can do it. So solutions existed at least 20 years back, and I'm sure there are several others now. I haven't followed up on the specific problem you mention, but I've seen many related solutions around it in the food industry, for sorting things like cashews and even rice. Yes.
The next question asks: can we use it for medical research and data analysis?

Yes. Medical data of course includes text data, and you can do a lot with textual data; machine learning can definitely be used there. But specifically on computer vision: yes, we use a lot of computer vision for automated processing of things like pathology slides, which pathologists would normally examine under microscopes; there are systems that image those slides. We are currently doing a project where we scan a large number of these slides and bring them into a database, so they can be made available to people for study; one of our centres is working on exactly that. Then there are a lot of applications around the human vision system itself: you can take images of the retina and try to predict things like glaucoma, or even diabetes; you can see these in the retina even before symptoms appear in the rest of the body. So computer vision for medical diagnostics, and even screening applications, is a huge area of potential. We have a centre, supported by the central government, that has just started doing data collection in the healthcare domain; healthcare and mobility are what our centre focuses on.
So it's a big area, yes. Now, Chakrapani is asking: is robotic surgery also built on computer vision? Yes. And remember, computer vision is not just the visual spectrum: you can use X-rays, you can use MRI, all of those, to understand what is going on inside. Even during surgery, you don't necessarily look at the body with only visual cameras; you can use other kinds of imaging to look inside. So yes, most robotic surgery solutions are guided by computer vision systems in several cases, usually for assistance purposes. Currently, in robotic surgery, many times the surgeon is actually doing the operation: the surgeon sits in a room with a set of specialised controls and attachments on their hands, and their motion is translated into the motion of the robot. The communication between the two, and other kinds of help, are provided by computer vision systems. There are examples of systems where you look at the patient through a small glass plate, and a projection system projects onto the glass so that you see the patient along with the patient's internals overlaid on the view. So when you operate, you know exactly where the problem area is; you can cut exactly there and nowhere else, which reduces bleeding and recovery time. So yes, in robotic surgery, computer vision has been used for assistance. Whether it will completely take over and do surgery by itself, I'm not sure. One area where I know that is already being done is ophthalmic surgery like LASIK, where the computer, looking at the eye, will decide what part to remove and how, using a laser. That is done by computer vision, yes.
We have a question from Ramesh. He's asking: how effective is explainability in computer vision problems?

A very important question, at least I would say so, because I'm interested in it. One of the areas I work on in machine learning is explainable AI: you take a decision from an AI system and ask the question, why did you do that?
The problem is deeper than you might initially realise. The question of "why", and what kind of answer is satisfactory for a human being, is itself interesting. When I ask "why did you do this?", I need an answer based on a certain set of primitives that we agree upon. If I ask somebody why they crossed the road, I need an answer in terms of the entities that are typically on a road. If the system answers, "the 20 to 30 pixels from the left side of the view were bright in colour and the 34th pixel was not so bright, and hence I crossed the road", well, it is explaining why the robot or the car crossed the road, it is trying to explain to you, but that is not explanation enough for you. You need an explanation in terms of entities appropriate to that particular problem. So there is a lot of connection between vision and language in the area of explainable AI. It's a very interesting area.
Sometimes we might be asking for things that are not necessarily feasible. It is possible that at a certain point computer vision systems become so capable that they make decisions based on primitives beyond our comprehension; that is also possible, but currently we are not there yet. So yes, explainability is a very useful thing. But once computer vision systems reach a level much higher than humans, what kind of explainability do you want? You want to understand how your mobile phone works, but do you want the biggest experts in the field to explain exactly how the electromagnetic waves form the rays that a beamforming antenna communicates to the cell-phone tower? You don't need that level. So the right level of explainability is a very tricky question; that's what I'm trying to say. I like this area quite a lot, so if I start talking, I will not stop. I'll just stop here.
Professor, I guess we are already past the schedule. As we see many more questions coming in, we'll figure out a way to have a separate Q&A session on this specific topic. For now, let's continue with the problem you wanted to share with us. For the rest of the questions, I've made a note of them all, and we'll respond to the participants on their email IDs with the respective answers.
Sure, okay. So I'll give you one problem to think about; I was trying to see whether I can give you a research problem by the end, and here it is. We all have AI systems giving us directions: Google Maps and the navigation systems in our cars tell us, "300 metres ahead, take a left turn", and so on, and you keep hearing these commands from your favourite voice telling you what to do and what not to do. That is one of the popular problems out there, but I'm asking you to look at a slightly different, related one. The classical parts already exist. The first, of course, is finding the path: where to go and which route to take. If you are a computer science person, you know this: you find the shortest path in a graph. The issues there are primarily with respect to scale: if you have a huge map, how do you find the path efficiently? But that's not what I'm talking about. I'm asking: can a computer system give directions to a human driver well? That also exists in a basic form: all these navigation systems will tell you, "so many metres ahead, take a left turn, stay on this road, stay in the right two lanes". But can you make it intelligent? Can you make it give directions like a human being?
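The classical routing piece mentioned above, shortest path in a graph, is standard. Here is a minimal Dijkstra sketch on a made-up toy road network (node names and distances are invented for illustration, not real map data):

```python
import heapq

def shortest_path(graph, start, goal):
    """Classic Dijkstra on a weighted road graph.
    graph: {node: [(neighbour, distance_m), ...]}
    Returns (path, total_distance)."""
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    # Walk back from goal to start to recover the route.
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    path.append(start)
    return list(reversed(path)), dist[goal]

roads = {"A": [("B", 300), ("C", 500)],
         "B": [("C", 100), ("D", 400)],
         "C": [("D", 200)]}
print(shortest_path(roads, "A", "D"))  # (['A', 'B', 'C', 'D'], 600)
```

The research problem posed here is everything this sketch leaves out: perceiving the world, the traffic, and the driver, and phrasing the directions the way a human would.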
Now, I recently took a long road trip home during the pandemic, so I had to drive about 1000 kilometres one way. And it is frustrating that at every small bifurcation of a major highway, it says "keep going straight"; a hundred times it tells you just to follow the same highway. Can it be a bit more intelligent? Can it learn my driving habits? Can it see what is out there, and based on all of that, give more meaningful, more useful, less annoying driving directions? For that, you don't just need to know your location and how far away the next intersection is; you need to actually see the world: the amount of traffic, what blockages there are. You also need to understand the driver a bit. So it's a very interesting problem. There are sub-problems like determining your exact direction, deciding when and how far ahead of the intersection it should ask the driver to turn, and handling what happens when the driver does not follow the instructions; all of these are solvable. But think about it from a computer vision perspective, where you have not just a GPS but a vision system that can actually look at the world and give far more interesting and useful driving directions. Instead of saying "340 metres ahead, take a left turn", can it say, "you see that tree over there? Take a left after that"? That would be a far better driving direction. Can a computer vision system do that? If you can create a solution like that, well, you become the next big solution provider, the next big company, I guess.
Even more interesting: in the future, if there are multiple cars, and this intelligence can span multiple cars, how do you give directions to each of the drivers so that they don't collide with each other, or so that people are routed in the most efficient way? I think soon we will all become, quote unquote, "auto pilots", where the system tells us directions and we just follow them, until at some point the car takes over the driving itself. But until then, based on the trend I see around me, people follow Google Maps more and more. I hardly ever hear anybody asking for an address any more; they just say "send us your location", punch it into Google Maps, and follow the directions from the voice in the box that keeps talking to them. So that's it. It's a very interesting problem, and a lot of interesting computer vision aspects can be brought into it. Just food for thought for you.
Professor, I think you would first need to help us understand how to do research on this particular problem. Could the system also, say, if I tell it I want to buy some medicines, reroute me so that I stop in front of a pharmacy on the way?

Yeah, it can do all these kinds of interesting things.

In that direction, yes, we need to start thinking; that's when we can look at each and every aspect and arrive at our own solution. It's a wonderful problem; I'd need at least a day to fully understand it.

The routing part is already solved, but in a very mathematical way, not from a computer vision perspective; it does not give us user-friendly directions. That's my point. The question is: can you make it user-friendly?

Okay. Coming to the concluding part of the session, I'm just launching
the poll for the participants. I've launched a poll where you can submit your feedback; all participants, please take two minutes and submit it. In the meanwhile, I have shared my contact details on the screen. For any further queries, you can reach out to me on the email ID or phone number shown, and we can take it forward from there with one-on-one sessions on anything related to the content, the program, the duration, or the fee structure; I'm the program counsellor, so you can reach me on this number. Last but not least, thank you very much, Professor; it was a wonderful session today, and I wish we had more time for more learning, but we had to come to a conclusion. Thank you, participants, as well, for being very supportive. I'm extremely sorry that we could not take all the questions, as we also need to abide by the time constraints, but we look forward to catching up in one more session, or many more sessions, going forward, so that we can have a few more interactions along the same lines. If everyone can just submit the poll, we will end the session. Thank you very much, Professor, for your time, and I'm extremely sorry for taking 15 minutes extra. We look forward to one more session with you. Thank you very much.

Thank you, and thanks, everyone. Good luck.
Watch the entire interview here youtube.com/watch?v=ou8K2QOpnXg