Lex Fridman Podcast: Meta's Chief AI Scientist Yann LeCun [Summary + Transcript]

Fireside by Fireflies

Dive into a thoughtful conversation as Lex Fridman sits down with Yann LeCun, the Chief AI Scientist at Meta. Together, they discuss whether proprietary AI systems pose a greater risk than Artificial General Intelligence (AGI), LLM limitations, and the future of AI.

Here are the key takeaways of the podcast:

Lex Fridman Podcast: Meta's Chief AI Scientist Yann LeCun—Summary powered by Fireflies.ai

Outline
  • Chapter 1: Building Tasks Model (07:43)
  • Chapter 2: Utilizing LLMs for Visual Representations (10:27)
  • Chapter 3: Masking Technique in Text Processing (12:18)
  • Chapter 4: Planning and Abstract Representation (15:11)
  • Chapter 5: Training Generative Models on Video (19:14)
  • Chapter 6: Self-Supervised Learning and Image/Video Representation (24:56)
  • Chapter 7: The Process of Masking in Video (38:39)
  • Chapter 8: Capturing Physics-Based Constraints in Video (40:21)
  • Chapter 9: The Concept of Hierarchical Planning (46:21)
  • Chapter 10: Self-Supervised Learning and Its Demonstrations (51:58)
  • Chapter 11: The Power of GPTs (58:47)
  • Chapter 12: The Importance of Planning in Uttering Words (01:14:02)
  • Chapter 13: Building a System That Can Plan (01:15:48)
  • Chapter 14: The Compatibility Between Images and Videos (01:27:48)
  • Chapter 15: Future Generations of AI Systems and Their Capabilities (01:58:30)
  • Chapter 16: The Challenges in Learning from Video (02:05:03)
  • Chapter 17: The Control and Limitations of LLMs (02:12:32)
  • Chapter 18: The Process of Planning a Sequence of Actions (02:35:22)
  • Chapter 19: The Concept of Hierarchical Representation of Action Plans (02:37:15)

Notes
  • Conversation between Lex Fridman and Yann LeCun on the Lex Fridman podcast.
  • Discussion on the importance of language and its role in conveying wisdom and information.
  • Yann LeCun talks about the process of accomplishing tasks and the importance of planning action sequences.
  • Mention of LLMs (Large Language Models) and their potential use in digesting visual representations of images, videos, or audio.
  • Yann LeCun explains the mechanism of training a neural net by removing words from a text, masking them, and predicting the missing words.
  • Discussion on the abstract representation of thoughts and reactions in response to text or actions.
  • Yann LeCun explains his belief that we plan our responses before producing them and discusses the possibility of building this capability through predicting words.
  • Yann LeCun highlights the importance of learning good representations of images or videos for AI.
  • Lex Fridman asks Yann LeCun about his experiences with failures in the field of AI and machine learning.
  • Yann LeCun speaks about the importance of fine-tuning AI systems to produce good answers in response to a variety of prompts.
  • Yann LeCun discusses the idea of planning what to say before you say it and its importance for AI systems.
  • Conversation on the process of turning abstract representations of thoughts into text.
  • Yann LeCun discusses using a world model to plan a sequence of actions to reach a particular objective.
  • Discussion on fine-tuning AI systems using human feedback and different methods to achieve this.
  • Yann LeCun highlights the importance of AI systems being able to point users to appropriate references and resources.
  • Yann LeCun discusses the potential role of Large Language Models in small businesses.
  • Yann LeCun emphasizes the importance of AI systems being able to produce basic things everyone agrees on and avoid producing harmful information.
  • Yann LeCun shares his optimism about new ideas appearing in the field of AI and the potential for progress.
  • Yann LeCun discusses the need for AI systems to have memory and the techniques required to develop this.
  • Lex Fridman and Yann LeCun discuss the rapid dissemination of information in today's technology-driven world.
  • Yann LeCun speaks about the importance of hierarchical representation of action plans in AI.

Want to read the full podcast? Read the time-stamped transcript:

Lex Fridman Podcast: Meta's Chief AI Scientist Yann LeCun—Transcript powered by Fireflies.ai

00:00
Yann LeCun
I see the danger of this concentration of power through proprietary AI systems as a much bigger danger than everything else.

00:08
Yann LeCun
What works against this is people who.

00:12
Yann LeCun
Think that for reasons of security, we should keep AI systems under lock and.

00:17
Yann LeCun
Key because it's too dangerous to put.

00:19
Yann LeCun
It in the hands of everybody.

00:22
Yann LeCun
That would lead to a very bad.

00:24
Yann LeCun
Future in which all of our information diet is controlled by a small number of companies through proprietary systems.

00:32
Lex Fridman
I believe that people are fundamentally good, and so if AI, especially open source AI, can make them smarter, it just empowers the goodness in humans.

00:44
Yann LeCun
So I share that feeling. Okay, I think people are fundamentally good, and in fact, a lot of doomers are doomers because they don't think that people are fundamentally good.

00:57
Lex Fridman
The following is a conversation with Yann LeCun, his third time on this podcast. He is the chief AI scientist at Meta, professor at NYU, Turing Award winner, and one of the seminal figures in the history of artificial intelligence. He and Meta AI have been big proponents of open sourcing AI development and have been walking the walk by open sourcing many of their biggest models, including Llama 2 and eventually Llama 3. Also, Yann has been an outspoken critic of those people in the AI community who warn about the looming danger and existential threat of AGI. He believes that AGI will be created one day, but it will be good. It will not escape human control, nor will it dominate and kill all humans.

01:52
Lex Fridman
At this moment of rapid AI development, this happens to be somewhat a controversial position, and so it's been fun seeing Yann get into a lot of intense and fascinating discussions online, as we do in this very conversation. This is the Lex Fridman podcast. To support it, please check out our sponsors in the description. And now, dear friends, here's Yann LeCun. You've had some strong statements, technical statements, about the future of artificial intelligence recently, throughout your career, actually, but recently as well. You've said that autoregressive LLMs are not the way we're going to make progress towards superhuman intelligence. These are the large language models like GPT-4, like Llama 2 and 3 soon, and so on. How do they work, and why are they not going to take us all the way?

02:47
Yann LeCun
For a number of reasons. The first is that there is a number of characteristics of intelligent behavior. For example, the capacity to understand the world, understand the physical world, the ability to remember and retrieve things, persistent memory, the ability to reason, and the ability to plan. Those are four essential characteristics of intelligent systems or entities: humans, animals. LLMs can do none of.

03:21
Yann LeCun
Those, or they can only do them.

03:24
Yann LeCun
In a very primitive way, and they don't really understand the physical world. They don't really have persistent memory, they can't really reason, and they certainly can't plan. And so if you expect the system to become intelligent just without having the possibility of doing those things, you're making a mistake. That is not to say that autoregressive

03:48
Yann LeCun
LLMs are not useful. They're certainly useful, or that they're not interesting, or

03:55
Yann LeCun
that we can't build a whole ecosystem of applications around them. Of course we can. But as a

04:02
Yann LeCun
path towards human-level intelligence, they are missing essential components.

04:08
Yann LeCun
And then there is another tidbit or fact that I think is very interesting. Those LLMs are trained on enormous amounts.

04:16
Yann LeCun
Of text, basically the entirety of all.

04:19
Yann LeCun
Publicly available text on the Internet, right? That's typically on the order of ten to the 13 tokens. Each token is typically two bytes, so that's two times ten to the 13 bytes as training data. It would take you or me 170,000 years to just read through this at 8 hours a day. So it seems like an enormous amount of knowledge, right, that those systems can accumulate. But then you realize it's really not that much data. If you talk to developmental psychologists and they tell you a four year old has been awake for 16,000 hours in.

04:56
Yann LeCun
His or her life, and the amount.

05:00
Yann LeCun
Of information that has reached the visual cortex of that child in four years.

05:09
Yann LeCun
Is about ten to the 15 bytes.

05:11
Yann LeCun
And you can compute this by estimating that the optic nerve carries about 20 megabytes per second, roughly. And so ten to the 15 bytes for a four year old versus two times ten to the 13 bytes for 170,000 years worth of reading.

05:29
Yann LeCun
What that tells you is that through.

05:32
Yann LeCun
Sensory input, we see a lot more information than we

05:35
Yann LeCun
do through language.

05:38
Yann LeCun
And that despite our intuition, most of what we learn and most of our knowledge is through our observation and interaction with the real world, not through language. Everything that we learn in the first few years of life, and certainly everything that animals learn, has nothing to do with language.
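
A quick sanity check of the arithmetic in this passage, using the rough figures quoted above (the reading speed is an assumed value, not from the episode):

```python
# Rough numbers quoted in the conversation; the reading speed is an assumption.
text_tokens = 1e13                      # ~all publicly available text, in tokens
text_bytes = text_tokens * 2            # ~2 bytes per token -> 2e13 bytes

tokens_per_minute = 300                 # assumed adult reading speed
reading_years = text_tokens / tokens_per_minute / 60 / 8 / 365
print(f"reading time: {reading_years:,.0f} years")   # on the order of 170,000+ years

awake_seconds = 16_000 * 3600           # a four-year-old's ~16,000 waking hours
optic_nerve_bytes_per_s = 20e6          # ~20 megabytes per second into the visual cortex
visual_bytes = awake_seconds * optic_nerve_bytes_per_s
print(f"visual: {visual_bytes:.0e} bytes vs text: {text_bytes:.0e} bytes")  # ~1e15 vs 2e13
```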

05:56
Lex Fridman
So it would be good to maybe push against some of the intuition behind what you're saying. So it is true. There's several orders of magnitude more data coming into the human mind much faster, and the human mind is able to learn very quickly from that, filter the data very quickly. Somebody might argue your comparison between sensory data versus language, that language is already very compressed. It already contains a lot more information than the bytes it takes to store them if you compare it to visual data. So there's a lot of wisdom in language; there's words, and the way we stitch them together, it already contains a lot of information.

06:36
Lex Fridman
So is it possible that language alone already has enough wisdom and knowledge in there to be able to, from that language, construct a world model, an understanding of the world, an understanding of the physical world that you're saying all LLMs lack?

06:56
Yann LeCun
So it's a big debate among philosophers and also cognitive scientists, like, whether intelligence needs to be grounded in reality. I'm clearly in the camp that, yes, intelligence cannot appear without some grounding in some reality. It doesn't need to be physical reality. It could be simulated, but the environment is just much richer than what you can express in language. Language is a very approximate representation of our percepts and our mental models, right? I mean, there's a lot of tasks that we accomplish where we manipulate a mental model of the situation at hand, and that has nothing to do with language. Everything that's physical, mechanical, whatever. When we build something, when we accomplish a task, the task of grabbing something, et cetera, we plan our action sequences. And we do this by essentially imagining the result of the outcome of the sequence.

07:58
Yann LeCun
Of actions that we might imagine.

08:01
Yann LeCun
And that requires mental models that don't have much to do with language. And that's, I would argue, most of our knowledge is derived from that interaction with the physical world. So a lot of my colleagues who are more interested in things like computer vision are really on that camp that.

08:22
Yann LeCun
AI needs to be embodied, essentially, and.

08:25
Yann LeCun
Then other people coming from the NLP side, or maybe some other motivation, don't necessarily agree with that. And philosophers are split as well, and the complexity of the world is hard to imagine. It's hard to represent all the complexities that we take completely for granted in the real world that we don't even imagine require intelligence, right? This is the old Moravec paradox, from the pioneer of robotics, Hans Moravec, who said, you know, how is it that with computers, it seems to be easy to do high level, complex tasks like playing chess and solving integrals and doing things like that? Whereas the thing we take for granted that we do every day, like, I.

09:14
Yann LeCun
Don't know, learning to drive a car.

09:16
Yann LeCun
Or grabbing an object, we can't do with computers.

09:21
Yann LeCun
And.

09:24
Yann LeCun
We have LLMs that can pass.

09:26
Yann LeCun
The bar exam, so they must be smart. But then they can't learn to drive.

09:32
Yann LeCun
In 20 hours like any 17 year old. They can't learn to clear up the dinner table and fill up the dishwasher.

09:40
Yann LeCun
Like any ten year old can learn in one shot.

09:43
Yann LeCun
Why is that? What are we missing? What type of learning or reasoning architecture or whatever are we missing that.

09:54
Yann LeCun
Basically.

09:55
Yann LeCun
Prevents us from having level five self-driving cars and domestic robots?

10:00
Lex Fridman
Can a large language model construct a world model that does know how to drive and does know how to fill a dishwasher, but just doesn't know how to deal with visual data at this time? So it can operate in a space of concepts?

10:17
Yann LeCun
Yeah, that's what a lot of people are working on. So the short answer is no. And the more complex answer is you.

10:24
Yann LeCun
Can use all kind of tricks to.

10:27
Yann LeCun
Get an LLM to basically digest visual representations of images or video or audio, for that matter. And a classical way of doing this is you train a vision system in some way.

10:49
Yann LeCun
And we have a number of ways.

10:50
Yann LeCun
To train vision systems, either supervised, semi-supervised, self-supervised, all kinds of different ways that will turn any image into a high level representation. Basically a list of tokens that are really similar to the kind of tokens.

11:05
Yann LeCun
That typical LLM takes as an input. And then you just feed that to.

11:13
Yann LeCun
The LLM in addition to the text. And you just expect the LLM to kind of, during training, to kind of be able to use those representations to help make decisions. I mean, there's been work along those lines for quite a long time, and now you see those systems, right? I mean, there are LLMs that have some vision extension, but they're basically hacks in the sense that those things are not trained end-to-end to really understand the world. They're not trained with video, for example. They don't really understand intuitive physics, at least not at the moment.

11:50
Lex Fridman
So you don't think there's something special to you about intuitive physics, about sort of common sense reasoning about the physical space, about physical reality? That to you is a giant leap that llms are just not able to do.

12:02
Yann LeCun
We're not going to be able to do this with the type of llms that we are working with today. And there's a number of reasons for this. But the main reason is the way llms are trained is that you take.

12:15
Yann LeCun
A piece of text, you remove some.

12:18
Yann LeCun
Of the words in that text, you mask them, you replace them by blank markers, and you train a gigantic neural net to predict the words that are missing. And if you build this neural net.

12:29
Yann LeCun
In a particular way so that it can only look at words that are.

12:33
Yann LeCun
To the left of the one it's trying to predict, then what you have is a system that basically is trying to predict the next word in a text, right? So then you can feed it a.

12:42
Yann LeCun
Text, a prompt, and you can ask.

12:44
Yann LeCun
It to predict the next word. It can never predict the next word.

12:47
Yann LeCun
Exactly.

12:47
Yann LeCun
And so what it's going to do is produce a probability distribution over all the possible words in your dictionary. In fact, it doesn't predict words. It predicts tokens that are kind of subword units. And so it's easy to handle the uncertainty in the prediction there, because there is only a finite number of possible words in the dictionary, and you can just compute the distribution over them. Then what the system does is that.

13:13
Yann LeCun
It picks a word from that distribution.

13:16
Yann LeCun
Of course, there's a higher chance of picking words that have a higher probability within that distribution. So you sample from that distribution to actually produce a word, and then you.

13:25
Yann LeCun
Shift that word into the input. And so that allows the system now.

13:29
Yann LeCun
To predict the second word.

13:31
Yann LeCun
Right. And once you do this, you shift.

13:33
Yann LeCun
It into the input, et cetera. That's called autoregressive prediction, which is why those LLMs should be called autoregressive LLMs, but we just call them LLMs.
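
A minimal sketch of the autoregressive loop just described, assuming a hypothetical `model` that maps a token sequence to next-token logits (an illustration, not any particular LLM's code):

```python
import torch

def generate(model, prompt_tokens, n_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        logits = model(torch.tensor(tokens).unsqueeze(0))    # shape (1, T, vocab_size)
        probs = torch.softmax(logits[0, -1], dim=-1)         # distribution over the next token
        next_token = torch.multinomial(probs, num_samples=1).item()  # sample from it
        tokens.append(next_token)                            # shift the sample into the input
    return tokens
```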

13:46
Yann LeCun
And there is a difference between this.

13:49
Yann LeCun
Kind of process and a process by.

13:51
Yann LeCun
Which before producing a word.

13:54
Yann LeCun
When you talk, when you and I.

13:55
Yann LeCun
Talk, you and I are bilinguals.

13:58
Yann LeCun
We think about what we're going to say, and it's relatively independent of the.

14:02
Yann LeCun
Language in which we're going to say it.

14:04
Yann LeCun
When we talk about, I don't know, let's say a mathematical concept or something, the kind of thinking that we're doing and the answer that we're planning to produce is not linked to whether we're going to say it in French, Russian, or English.

14:19
Lex Fridman
Chomsky just rolled his eyes, but I understand. So you're saying that there's a bigger abstraction that goes before language and maps onto language.

14:30
Yann LeCun
Right.

14:30
Yann LeCun
It's certainly true for a lot of thinking that we do.

14:33
Lex Fridman
Is that obvious that we don't? You're saying your thinking is the same in French as it is in English?

14:40
Yann LeCun
Yeah, pretty much.

14:41
Lex Fridman
Pretty much. Or how flexible are you if there's a probability distribution?

14:48
Yann LeCun
Well, it depends what kind of thinking. Right. If it's like producing puns, I'm much better in French than English at that.

14:56
Lex Fridman
No. Is there an abstract representation of puns? Is your humor an abstract representation? Like when you tweet and your tweets are sometimes a little bit spicy, is there an abstract representation in your brain of a tweet before it maps onto English?

15:11
Yann LeCun
There is an abstract representation of imagining the reaction of a reader to that text.

15:18
Lex Fridman
Or you start with laughter and then figure out how to make that happen.

15:21
Yann LeCun
Or figure out a reaction you want to cause and then figure out how to say it, right? So that it causes that reaction. But that's really close to language. But think about mathematical concept or imagining something you want to build out of wood or something like this, right? The kind of thinking you're doing has absolutely nothing to do with language, really. It's not like you have necessarily like an internal monologue in any particular language. You're imagining mental models of the thing, right? If I ask you to imagine what this water bottle will look like if.

15:56
Yann LeCun
I rotate it 90 degrees, that has nothing to do with language.

16:04
Yann LeCun
Clearly there is a more abstract level of representation in which we do most of our thinking and we plan what we're going to say if the output is uttered words, as opposed to an output being muscle actions, right?

16:26
Yann LeCun
We plan our answer before we produce it.

16:29
Yann LeCun
And llms don't do that.

16:30
Yann LeCun
They just produce one word after the other, instinctively, if you want.

16:35
Yann LeCun
It's a bit like the subconscious actions.

16:40
Yann LeCun
Where you're distracted, you're doing something, you're.

16:43
Yann LeCun
Completely concentrated, and someone comes to you and asks you a question, and you kind of answer the question. You don't have time to think about the answer, but the answer is easy, so you don't need to pay attention. You sort of respond automatically. That's kind of what an LLM does, right? It doesn't think about its answer, really. It retrieves it because it's accumulated a lot of knowledge. So it can retrieve some things, but it's going to just spit out one token after the other without planning the answer.

17:13
Lex Fridman
But you're making it sound like one-token-after-the-other, one-token-at-a-time generation is bound to be simplistic. But if the world model is sufficiently sophisticated, that one token at a time.

17:32
Yann LeCun
The.

17:32
Lex Fridman
Most likely thing it generates is a sequence of tokens, is going to be a deeply profound thing.

17:39
Yann LeCun
Okay.

17:39
Yann LeCun
But then that assumes that those systems actually possess an internal world model.

17:44
Lex Fridman
So really it goes to the. I think the fundamental question is, can you build a really complete world model? Not complete, but one that has a deep understanding of the world?

17:58
Yann LeCun
Yeah.

17:59
Yann LeCun
So can you build this, first of all by prediction?

18:03
Lex Fridman
Right?

18:04
Yann LeCun
And the answer is probably yes.

18:06
Yann LeCun
Can you build it by predicting words?

18:10
Yann LeCun
And the answer is most probably no, because language is very poor in terms.

18:17
Yann LeCun
Of weak or low bandwidth, if you want. There's just not enough information there.

18:21
Yann LeCun
So building world models means observing the world and understanding why the world is.

18:30
Yann LeCun
Evolving the way it is. And then the extra component of a world model is something that can predict how the world is going to evolve as a consequence of an action you might take, right? So a world model really is: here is my idea of the state of the world at time t. Here is an action I might take. What is the predicted state of the.

18:53
Yann LeCun
World at time t plus one.

18:55
Yann LeCun
Now, that state of the world does not need to represent everything about the.

19:00
Yann LeCun
World, it just needs to represent enough.

19:02
Yann LeCun
That's relevant for this planning of the action, but not necessarily all the details.
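
In schematic form, the world model being described is just a function from an abstract state and an action to the next abstract state. A toy sketch, where `WorldModel` is a hypothetical stand-in and real versions operate on learned representations rather than raw observations:

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Predict the (abstract) state of the world at time t+1 from state and action at time t."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state, action):
        # state: representation of the world at time t (what matters, not every detail)
        # action: the action the agent is considering
        return self.net(torch.cat([state, action], dim=-1))   # predicted state at time t+1
```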

19:08
Yann LeCun
Now, here is the problem.

19:10
Yann LeCun
You're not going to be able to.

19:11
Yann LeCun
Do this with generative models.

19:14
Yann LeCun
So a generative model that's trained on video, and we've tried to do this for ten years, you take a video, show, a system, a piece of video, and then ask it to predict the.

19:24
Yann LeCun
Remainder of the video, basically predict what's.

19:27
Lex Fridman
Going to happen one frame at a time. Do the same thing as sort of the autoregressive LLMs do, but for video, right?

19:34
Yann LeCun
Either one frame at a time or a group of frames at a time. But yeah, a large video model, if you want. The idea of doing this has been floating around for a long time. And at FAIR, some of our colleagues and I have been trying to do.

19:51
Yann LeCun
This for about ten years.

19:54
Yann LeCun
And you can't really do the same trick as with llms, because llms, as I said, you can't predict exactly which word is going to follow a sequence of words, but you can predict a distribution over words. Now, if you go to video, what you would have to do is predict.

20:12
Yann LeCun
A distribution over all possible frames in a video. And we don't really know how to do that properly.

20:20
Yann LeCun
We do not know how to represent distributions over high dimensional continuous spaces in ways that are useful. And there lies the main issue. And the reason we can't do this is because the world is incredibly more complicated and richer in terms of information than text. Text is discrete, video is high dimensional and continuous. A lot of details in this. So if I take a video of this room, and the video is a camera panning around, there is no way I can predict everything that's going to be in the room as I pan around. The system cannot predict what's going to be in the room as the camera is panning. Maybe it's going to predict. This is a room where there is a light and there is a wall and things like that.

21:09
Yann LeCun
It can't predict what the painting on the wall looks like or what the texture of the couch looks like, certainly not the texture of the carpet. So there's no way it can predict all those details.

21:19
Yann LeCun
So the way to handle this is.

21:23
Yann LeCun
One way possibly to handle this, which we've been working for a long time, is to have a model that has what's called a latent variable. And the latent variable is fed to a neural net, and it's supposed to represent all the information about the world.

21:35
Yann LeCun
That you don't perceive yet, and that.

21:40
Yann LeCun
You need to augment the system for the prediction, to do a good job at predicting pixels, including the fine texture.

21:49
Yann LeCun
Of the carpet on the couch and.

21:53
Yann LeCun
The painting on the wall, that has.

21:57
Yann LeCun
Been a complete failure, essentially.

21:59
Yann LeCun
And we've tried lots of things. We tried just straight neural nets. We tried GANs, we tried VAEs, all kinds of regularized autoencoders. We tried many things. We also tried those kind of methods.

22:15
Yann LeCun
To learn good representations of images or video that could then be used as.

22:23
Yann LeCun
Input to, for example, an image classification system. And that also has basically failed, like all the systems that attempt to predict missing parts of an image or video from a corrupted version of it, basically. So take an image or a video, corrupt it, or transform it in some way, and then try to reconstruct the complete video or image from the corrupted version. And then hope that internally the system will develop good representations of images that you can use for object recognition, segmentation, whatever it is.

22:59
Yann LeCun
That has been essentially a complete failure.

23:02
Yann LeCun
And it works really well for text. That's the principle that is used for llms.

23:06
Yann LeCun
Right?

23:07
Lex Fridman
So where is the failure exactly? Is it that it's very difficult to form a good representation of an image, like a good embedding of all the important information in the image? Is it in terms of the consistency of image to image to image that forms the video? If we do a highlight reel of all the ways you failed, what's that look like?

23:30
Yann LeCun
Okay, so the reason this doesn't work.

23:33
Yann LeCun
Is, first of all, I have to.

23:36
Yann LeCun
Tell you exactly what doesn't work, because there is something else that does work. So the thing that does not work is training a system to learn representations.

23:46
Yann LeCun
Of images by training it to reconstruct.

23:51
Yann LeCun
A good image from a corrupted version of it.

23:53
Yann LeCun
Okay?

23:53
Yann LeCun
That's what doesn't work.

23:55
Yann LeCun
And we have a whole slew of.

23:57
Yann LeCun
Techniques for this that are variants of denoising autoencoders, something called MAE, developed by some of my colleagues at FAIR: masked autoencoder. So it's basically like the LLMs or things like this, where you train a system by corrupting text, except you corrupt images, you remove patches from it, and you train a gigantic neural net to reconstruct. The features you get are not good. And you know they're not good because if you now train the same architecture, but you train it supervised with labeled data, with textual descriptions of images, et cetera, you do get good representations. And the performance on recognition tasks is much better than if you do this.

24:40
Yann LeCun
Self supervised pretraining. So the architecture is good. The architecture is good.

24:45
Yann LeCun
The architecture of the encoder is good.

24:47
Yann LeCun
Okay?

24:47
Yann LeCun
But the fact that you train the.

24:49
Yann LeCun
System to reconstruct images does not lead.

24:52
Yann LeCun
It to produce, to learn good generic.

24:55
Lex Fridman
Features of images when you train in.

24:56
Yann LeCun
A self supervised way.

24:58
Yann LeCun
Self supervised by reconstruction.

25:00
Lex Fridman
Yeah, by reconstruction.
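
For concreteness, a heavily simplified sketch of the reconstruction-style objective being criticized here (MAE-like; `encoder` and `decoder` are hypothetical modules, and real implementations add positional information, mask tokens, and so on):

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(encoder, decoder, patches, mask_ratio=0.75):
    # patches: (batch, n_patches, patch_dim) -- an image cut into patches
    n = patches.shape[1]
    keep = torch.randperm(n)[: int(n * (1 - mask_ratio))]  # randomly chosen visible patches
    latent = encoder(patches[:, keep])                      # encode only the visible part
    reconstruction = decoder(latent)                        # try to reproduce the full image
    return F.mse_loss(reconstruction, patches)              # loss measured in pixel space
```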

25:01
Yann LeCun
Okay, so what's the alternative? The alternative is joint embedding.

25:07
Lex Fridman
What is joint embedding? What are these architectures that you're so excited about?

25:11
Yann LeCun
Okay, so now instead of training a system to encode the image and then training it to reconstruct the full image from a corrupted version, you take the full image.

25:21
Yann LeCun
You take the corrupted or transformed version.

25:25
Yann LeCun
You run them both through encoders, which in general are identical, but not necessarily. And then you train a predictor on top of those encoders to predict the representation of the full input from the representation of the corrupted one.

25:46
Yann LeCun
Okay?

25:47
Yann LeCun
So joint embedding, because you're taking the.

25:50
Yann LeCun
Full input and the corrupted version or.

25:52
Yann LeCun
Transform version, run them both through encoders.

25:55
Yann LeCun
You get a joint embedding.

25:57
Yann LeCun
And then you're saying, can I predict the representation of the full one from.

26:02
Yann LeCun
The representation of the corrupted one? Okay?

26:06
Yann LeCun
And I call this a JEPA. So that means joint embedding predictive architecture, because it's joint embedding. And there is this predictor that predicts the representation of the good guy from the bad guy. And the big question is, how do you train something like this?
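
A schematic of the JEPA forward pass as described, with the loss living in representation space rather than pixel space (`encoder`, `predictor`, and `corrupt` are hypothetical; on its own this objective can collapse, which is exactly the training problem discussed next):

```python
import torch.nn.functional as F

def jepa_loss(encoder, predictor, corrupt, x):
    target = encoder(x).detach()                 # representation of the full input (no gradient)
    prediction = predictor(encoder(corrupt(x)))  # predicted from the corrupted/masked input
    return F.mse_loss(prediction, target)        # prediction error in embedding space, not pixels
```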

26:20
Yann LeCun
And until five years ago or six.

26:23
Yann LeCun
Years ago, we didn't have particularly good.

26:25
Yann LeCun
Answers for how you train those things.

26:27
Yann LeCun
Except for one called contrastive learning. And the idea of contrastive learning is you take a pair of images that are, again, an image and a corrupted version or degraded version somehow, or transformed version of the original one, and you train the predicted representation to be the same as that. If you only do this, the system collapses. It basically completely ignores the input and produces representations that are constant.

26:59
Yann LeCun
So the contrastive methods avoid this.

27:03
Yann LeCun
And those things have been around since the early nineties. I had a paper on this in 1993. The idea is you also show pairs of images that you.

27:12
Yann LeCun
Know are different, and then you push.

27:15
Yann LeCun
Away the representations from each other. So you say not only do representations of things that we know are the same, should be the same, or should be similar. But representation of things that we know.

27:25
Yann LeCun
Are different, should be different, and that prevents the collapse.
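
A sketch of a contrastive objective of the kind described (InfoNCE-style, shown only to illustrate positives and negatives; `z1` and `z2` are assumed to be batches of embeddings for matched pairs):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature        # similarity of every embedding with every other
    labels = torch.arange(z1.size(0))         # the matched pair is the positive...
    return F.cross_entropy(logits, labels)    # ...everything else in the batch is pushed away
```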

27:28
Yann LeCun
But it has some limitation. And there's a whole bunch of techniques that have appeared over the last six, seven years that can revive this type of method, some of them from FAIR, some of them from Google and other places. But there are limitations to those contrastive methods. What has changed in the last three.

27:51
Yann LeCun
Four years is now we have methods.

27:53
Yann LeCun
That are non contrastive, so they don't require those negative contrastive samples of images.

28:00
Yann LeCun
That we know are different.

28:02
Yann LeCun
You train them only with images that are different versions or different views of the same thing, and you rely on some other tricks to prevent the system from collapsing. And we have half a dozen different.

28:14
Yann LeCun
Methods for this now.

28:15
Lex Fridman
So what is the fundamental difference between joint embedding architectures and LLMs? So, can JEPA take us to AGI? Though I should say that you don't like the term AGI. And we'll probably argue, I think every single time I've talked to you, we've argued about the G in.

28:38
Yann LeCun
It.

28:39
Lex Fridman
I get it. We'll probably continue to argue about it. It's great because you like French, and ami is, I guess, friend in French. Yes, and AMI stands for advanced machine intelligence.

28:55
Yann LeCun
Right.

28:57
Lex Fridman
But either way, can JEPA take us to that, towards that advanced machine intelligence?

29:02
Yann LeCun
Well, so it's a first step. Okay, so first of all, what's the difference with generative architectures like llms?

29:10
Yann LeCun
So llms or vision systems that are.

29:15
Yann LeCun
Trained by reconstruction generate the inputs. They generate the original input that is.

29:23
Yann LeCun
Non corrupted, non transformed. Right? So you have to predict all the pixels. And there is a huge amount of.

29:31
Yann LeCun
Resources spent in the system to actually predict all those pixels, all the details.

29:37
Yann LeCun
In a JEPA, you're not trying to.

29:39
Yann LeCun
Predict all the pixels. You're only trying to predict an abstract representation of the inputs, right? And that's much easier in many ways. So what the JEPA system, when it's being trained, is trying to do is extract as much information as possible from the input, but yet only extract information that is relatively easily predictable. Okay, so there's a lot of things in the world that we cannot predict. Like, for example, if you have a self driving car driving down the street or road, there may be trees around the road, and it could be a windy day. So the leaves on the tree are kind of moving in kind of semi.

30:18
Yann LeCun
Chaotic, random ways that you can't predict.

30:21
Yann LeCun
And you don't care. You don't want to predict. So what you want is your encoder to basically eliminate all those details. It will tell you there is moving leaves, but it's not going to keep the details of exactly what's going on. And so when you do the prediction in representation space, you're not going to have to predict every single pixel of every leaf. And that not only is a lot simpler, but also it allows the system to essentially learn an abstract representation of the world where what can be modeled and predicted is preserved and the rest is viewed as noise and eliminated by the encoder. So it kind of lifts the level of abstraction of the representation. If you think about this is something we do absolutely all the time. Whenever we describe a phenomenon, we describe it at a particular level of abstraction.

31:09
Yann LeCun
And we don't always describe every natural phenomenon in terms of quantum field theory.

31:15
Yann LeCun
That would be impossible.

31:17
Yann LeCun
So we have multiple levels of abstraction to describe what happens in the world, starting from quantum field theory to atomic theory and molecules and chemistry materials and all the way up to kind of concrete objects in the real world and things like that.

31:34
Yann LeCun
So we can't just only model everything.

31:39
Yann LeCun
At the lowest level. And that's what the idea of JEPA is really about. Learn abstract representations in a self supervised manner, and you can do it hierarchically as well. So that, I think, is an essential component of an intelligent system. And in language, we can get away without doing this because language is already to some level abstract and already has eliminated a lot of information that is not predictable.

32:07
Yann LeCun
So we can get away without doing.

32:09
Yann LeCun
The joint embedding, without lifting the abstraction level and by directly predicting words.

32:16
Lex Fridman
So joint embedding, it's still generative, but it's generative in this abstract representation space. And you're saying language. We were lazy with language because we already got the abstract representation for free. And now we have to zoom out. Actually think about generally intelligent systems. We have to deal with a full mess of physical reality, of reality. And you do have to do this step of jumping from the full, rich, detailed reality to an abstract representation of that reality based on which you can then reason and all that kind of stuff.

32:57
Yann LeCun
Right?

32:57
Yann LeCun
And the thing is, those self supervised algorithms that learn by prediction, even in.

33:03
Yann LeCun
Representation space, they learn more concepts.

33:09
Yann LeCun
If the input data you feed them is more redundant, the more redundancy there is in the data, the more they're able to capture some internal structure of it. And so there is way more redundancy and structure in perceptual input, sensory input, like vision, than there is in text, which is not nearly as redundant. This is back to the question you were asking a few minutes ago. Language might represent more information, really? Because it's already compressed. You're right about that, but that means.

33:38
Yann LeCun
It'S also less redundant, and so self.

33:41
Yann LeCun
Supervised learning will not work as well.

33:43
Lex Fridman
Is it possible to join the self supervised training on visual data and self supervised training on language data? There is a huge amount of knowledge, even though you talk down about those ten to the 13 tokens represent the entirety, a large fraction of what us humans have figured out, both the shit talk on Reddit and the contents of all the books and the articles, and the full spectrum of human intellectual creation. So is it possible to join those two together?

34:22
Yann LeCun
Well, eventually, yes, but I think if.

34:25
Yann LeCun
We do this too early, we run the risk of being tempted to cheat. And in fact, that's what people are doing at the moment. With vision language model, we're basically cheating. We're using language as a crutch to help the deficiencies of our vision systems, to kind of learn good representations from images and video. And the problem with this is that we might improve our vision language system.

34:53
Yann LeCun
A bit, I mean, our language models.

34:55
Yann LeCun
By feeding them images. But we're not going to get to the level of even the intelligence or level of understanding of the world of.

35:04
Yann LeCun
A cat or a dog, which doesn't have language.

35:07
Yann LeCun
They don't have language, and they understand the world much better than any LLM. They can plan really complex actions and sort of imagine the result of a bunch of actions.

35:17
Yann LeCun
How do we get machines to learn that before we combine that with language?

35:22
Yann LeCun
Obviously, if we combine this with language, this is going to be a winner. But before that, we have to focus on how do we get systems to learn how the world works.

35:33
Lex Fridman
So this kind of joint embedding, predictive architecture for you, that's going to be able to learn something like common sense, something like what a cat uses to predict how to mess with its owner most optimally by knocking over a thing.

35:50
Yann LeCun
That's the hope. In fact, the techniques we're using are non contrastive. So not only is the architecture non generative, the learning procedures we're using are non contrastive. We have two sets of techniques. One set is based on distillation, and there's a number of methods that use this principle. One by DeepMind called BYOL, a couple by FAIR, one called VICReg, and another one called I-JEPA. And VICReg, I should say, is not a distillation method, actually, but I-JEPA and BYOL certainly are. And there's another one also called DINO, also produced at FAIR. And the idea of those things is that you take the full input, let's say an image, you run it through.

36:36
Yann LeCun
An encoder, produces a representation, and then.

36:41
Yann LeCun
You corrupt that input or transform it, run it through essentially what amounts to the same encoder with some minor differences, and then train a predictor. Sometimes the predictor is very simple, sometimes doesn't exist, but train a predictor to predict a representation of the first uncorrupted input from the corrupted input. But you only train the second branch. You only train the part of the network that is fed with the corrupted input. The other network, you don't train, but since they share the same weight, when you modify the first one, it also modifies the second one. And with various tricks, you can prevent this system from collapsing with the collapse of the type I was explaining before, where the system basically ignores the input.

37:28
Yann LeCun
So that works very well.

37:32
Yann LeCun
The two techniques we've developed at FAIR, DINO and I-JEPA, work really well for that.
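
One common form of the distillation trick behind BYOL/DINO-style methods, sketched under the assumption of a hypothetical `make_encoder()`: gradients flow only through the student branch, while the teacher is a slowly moving average of the student (the exact details that prevent collapse vary between methods):

```python
import copy
import torch

student = make_encoder()                       # hypothetical encoder constructor (any nn.Module)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                    # the teacher branch is never trained directly

@torch.no_grad()
def update_teacher(momentum=0.996):
    # Exponential moving average: the teacher slowly tracks the student's weights.
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)
```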

37:39
Lex Fridman
So what kind of data are we talking about here?

37:41
Yann LeCun
So there's several scenarios. One scenario is you take an image.

37:47
Yann LeCun
You corrupt it by changing the cropping.

37:52
Yann LeCun
For example, changing the size a little bit, maybe changing the orientation, blurring it, changing the colors, doing all kinds of horrible things to it, but basically horrible things that sort of degrade the quality a little bit and change the framing, crop the image. And in some cases, in the case of I-JEPA, you don't need to do any of this.

38:13
Yann LeCun
You just mask some parts of it. You just basically remove some regions, like.

38:19
Yann LeCun
A big block, essentially, and then run through the encoders and train the entire system, encoder and predictor, to predict the representation of the good one from the representation of the corrupted one. So the I-JEPA doesn't need to know that it's an image, for example, because the only thing it needs to.

38:39
Yann LeCun
Know is how to do this masking.

38:42
Yann LeCun
Whereas with DINO, you need to know it's an image because you need to do things like geometric transformations and blurring and things like that that are really image specific. A more recent version of this that we have is called V-JEPA. So it's basically the same idea as I-JEPA, except it's applied to video. So now you take a whole video.

39:00
Yann LeCun
And you mask a whole chunk of it.

39:02
Yann LeCun
And what we mask is actually kind of a temporal tube. So, like a whole segment of each frame in a video over the entire video.

39:10
Lex Fridman
And that tube is like, statically positioned throughout the frames, literally straight tube.

39:16
Yann LeCun
The tube, yeah. Typically is 16 frames or something, and we mask the same region over the entire 16 frames. It's a different one for every video, obviously. And then again, train that system so as to predict the representation of the full video from the partially masked video.
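
A small sketch of the "temporal tube" masking just described: the same spatial block is removed from every frame of the clip (the block size here is illustrative):

```python
import torch

def tube_mask(video, block=32):
    # video: (T, C, H, W), e.g. a 16-frame clip
    T, C, H, W = video.shape
    y = torch.randint(0, H - block + 1, (1,)).item()
    x = torch.randint(0, W - block + 1, (1,)).item()
    masked = video.clone()
    masked[:, :, y:y + block, x:x + block] = 0.0   # the same region masked across all frames
    return masked
```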

39:33
Yann LeCun
And that works really well.

39:35
Yann LeCun
It's the first system that we have that learns good representations of videos, so that when you feed those representations to a supervised classifier head, it can tell you what action is taking place in.

39:47
Yann LeCun
A video with pretty good accuracy.

39:50
Yann LeCun
So it's the first time we get something of that quality.

39:56
Lex Fridman
That's a good test, that a good representation is formed. That means there's something to this.

40:00
Yann LeCun
Yeah.

40:01
Yann LeCun
We have also preliminary results that seem to indicate that the representations allow our system to tell whether a video is physically possible or completely impossible, because some object disappeared, or an object suddenly jumped from one location to another, or changed shape or something.

40:21
Lex Fridman
So it's able to capture some physics based constraints about the reality represented in the video, about the appearance and the disappearance of objects.

40:33
Yann LeCun
Yeah, that's really new.

40:35
Lex Fridman
Okay. But can this actually get us to this kind of world model that understands enough about the world to be able to drive a car?

40:49
Yann LeCun
Possibly.

40:50
Yann LeCun
This is going to take a while before we get to that point. And there are systems already, robotic systems, that are based on this idea. And what you need for this is a slightly modified version of this, where.

41:05
Yann LeCun
Imagine that you have.

41:09
Yann LeCun
A video and.

41:11
Yann LeCun
A complete video, and what you're doing to this video is that you are either translating it in time towards the.

41:18
Yann LeCun
Future, so you only see the beginning of the video, but you don't see the latter part of it that is in the original one. Or you just mask the second half of the video, for example. And then you train a Jepa system of the type I described to predict the representation of the full video from the shifted one. But you also feed the predictor with an action.

41:39
Yann LeCun
For example, the wheel is turned ten.

41:42
Yann LeCun
Degrees to the right or something. Right. So if it's a dash cam in.

41:48
Yann LeCun
A car and you know the angle.

41:50
Yann LeCun
Of the wheel, you should be able to predict to some extent what's going to happen to what you see. You're not going to be able to predict all the details of objects that appear in the view, obviously, but at an abstract representation level, you can probably predict what's going to happen.

42:08
Yann LeCun
So now what you have is an.

42:11
Yann LeCun
Internal model that says, here is my idea of state of the world at.

42:14
Yann LeCun
Time t. Here is an action I'm taking.

42:17
Yann LeCun
Here is a prediction of the state of the world at time t plus.

42:20
Yann LeCun
One, t plus delta, t, t plus.

42:22
Yann LeCun
2 seconds, whatever it is. If you have a model of this.

42:25
Yann LeCun
Type, you can use it for planning.

42:27
Yann LeCun
So now you can do what LLMs.

42:30
Yann LeCun
Cannot do, which is planning what you're going to do, so as to arrive.

42:35
Yann LeCun
At a particular outcome or satisfy a particular objective, right? So you can have a number of objectives. I can predict that if I have an object like this and I open my hand, it's going to fall, right? And if I push it with a particular force on the table, it's going to move. If I push the table itself, it's probably not going to move with the same force. So we have this internal model of the world in our mind, which allows us to plan sequences of actions to.

43:11
Yann LeCun
Arrive at a particular goal.

43:16
Yann LeCun
So now if you have this world model, we can imagine a sequence of actions, predict what the outcome of the sequence of action is going to be, measure to what extent the final state satisfies a particular objective, like moving the bottle to the left of the table, and then plan a sequence of actions that will minimize this objective at runtime. We're not talking about learning. We're talking about inference time. So this is planning, really, and in optimal control. This is a very classical thing. It's called model predictive control. You have a model of the system you want to control that can predict the sequence of states corresponding to a sequence of commands. And you're planning a sequence of commands so that according to your world model, the end state of the system will satisfy an objective that you fix.

44:11
Yann LeCun
This is the way rocket trajectories have been planned since computers have been around, so since the early 60s, essentially.
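
A toy sketch of the model-predictive-control loop described here: imagine candidate action sequences, roll each one through the world model, and keep the sequence whose predicted end state best satisfies the objective (`world_model` and `objective` are hypothetical, and real planners optimize the sequence rather than sampling at random):

```python
import torch

def plan(world_model, objective, state, horizon=10, n_candidates=256, action_dim=4):
    best_cost, best_actions = float("inf"), None
    for _ in range(n_candidates):
        actions = torch.randn(horizon, action_dim)   # one candidate sequence of actions
        s = state
        for a in actions:
            s = world_model(s, a)                    # predicted state after taking action a
        cost = float(objective(s))                   # how far is the end state from the goal?
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions                              # executed, then re-planned, at runtime
```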

44:19
Lex Fridman
So, yes, for model predictive control. But you also often talk about hierarchical planning. Can hierarchical planning emerge from this somehow?

44:28
Yann LeCun
Well, so, no. You will have to build a specific architecture to allow for hierarchical planning. So hierarchical planning is absolutely necessary if you want to plan complex actions. If I want to go from, let's say, from New York to Paris, it's the example I use all the time, and I'm sitting in my office at NYU. My objective that I need to minimize.

44:50
Yann LeCun
Is my distance to Paris at a.

44:52
Yann LeCun
High level, a very abstract representation of my location. I will have to decompose this into two sub goals.

44:59
Yann LeCun
First one is go to the airport.

45:02
Yann LeCun
Second one is catch a plane to Paris. Okay, so my sub goal is now going to the airport. My objective function is my distance to the airport.

45:12
Yann LeCun
How do I go to the airport? Well, I have to go in the.

45:16
Yann LeCun
Street and hail a taxi, which you can do in New York. Okay, now I have another sub goal.

45:22
Yann LeCun
Go down on the street.

45:24
Yann LeCun
Well, that means going to the elevator, going down the elevator, walk out the street. How do I go to the elevator?

45:32
Yann LeCun
I have to stand up from my.

45:35
Yann LeCun
Chair, open the door of my office.

45:37
Yann LeCun
Go to the elevator, push the button. How do I get up from my chair?

45:42
Yann LeCun
You can imagine going down, all the way down to basically what amounts to millisecond by millisecond muscle control.

45:50
Yann LeCun
Okay.

45:51
Yann LeCun
And obviously you're not going to plan your entire trip from New York to Paris in terms of millisecond by millisecond muscle control first. That would be incredibly expensive, but it will also be completely impossible because you don't know all the conditions, what's going to happen, how long it's going to take to catch a taxi or to go to the airport with traffic. You would have to know exactly the condition of everything to be able to do this planning, and you don't have the information. So you have to do this hierarchical planning so that you can start acting and then sort of replanning as you go.

46:28
Yann LeCun
And nobody really knows how to do this in AI.

46:33
Yann LeCun
Nobody knows how to train a system to learn the appropriate multiple levels of representation so that hierarchical planning works.
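
A toy illustration of the hierarchical decomposition in the New York-to-Paris example. The decomposition table is written by hand here; learning such multi-level representations automatically is exactly the open problem being pointed to:

```python
# Hand-written goal decomposition (illustrative only).
DECOMPOSE = {
    "go from New York to Paris": ["go to the airport", "catch a plane to Paris"],
    "go to the airport": ["go down to the street", "hail a taxi"],
    "go down to the street": ["go to the elevator", "take the elevator down", "walk out"],
    "go to the elevator": ["stand up from the chair", "open the office door", "walk to the elevator"],
}

def expand(goal, depth=0):
    print("  " * depth + goal)
    for sub_goal in DECOMPOSE.get(goal, []):   # goals with no entry are left as primitive actions
        expand(sub_goal, depth + 1)

expand("go from New York to Paris")
```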

46:41
Lex Fridman
Does something like that already. So, like, can you use an LLM, state of the art LLM, to get you from New York to Paris by doing exactly the kind of detailed set of questions that you just did, which is, can you give me a list of ten steps I need to do to get from New York to Paris? And then for each of those steps, can you give me a list of ten steps how I make that step happen? And for each of those steps, can you give me a list of ten steps to make each one of those until you're moving your individual muscles? Maybe not. Whatever you can actually act upon using your own mind.

47:21
Yann LeCun
Right.

47:21
Yann LeCun
So there's a lot of questions that are sort of implied by this. Right. So the first thing is LLMs will be able to answer some of those questions down to some level of abstraction under the condition that they've been trained with similar scenarios in their training set.

47:37
Lex Fridman
They would be able to answer all of those questions, but some of them may be hallucinated, meaning non factual.

47:44
Yann LeCun
Yeah, true. I mean, they will probably produce some answer, except they're not going to be able to really kind of produce millisecond by millisecond muscle control of how you stand up from your chair. Right. But down to some level of abstraction where you can describe things by words. They might be able to give you a plan, but only under the condition that they've been trained to produce those kind of plans.

48:03
Yann LeCun
Right.

48:04
Yann LeCun
They're not going to be able to plan for situations that they never encountered before. They basically are going to have to regurgitate the template that they've been trained on.

48:13
Lex Fridman
Like, just for the example of New York to Paris, is it going to start getting into trouble? Which layer of abstraction do you think you'll start? Because I can imagine almost every single part of that an LLM will be able to answer somewhat accurately, especially when you're talking about New York and Paris.

48:29
Yann LeCun
I mean, certainly an LLM would be able to solve that problem if you fine-tuned it for it. So I can't say that an LLM cannot do this. It can do this if you train it for it. There's no question down to a certain level where things can be formulated in terms of words. But if you want to go down to how do you climb down the stairs or just stand up from your chair in terms of words, you can't do it. That's one of the reasons you need experience of the physical world, which is much higher bandwidth than what you can.

49:08
Yann LeCun
Express in words in human language.

49:11
Lex Fridman
So everything we've been talking about on the joint embedding space, is it possible that's what we need for interaction with physical reality on the robotics front, and then just the LLMs are the thing that sits on top of it. For the bigger reasoning about the fact that I need to book a plane ticket and I need to know, I know how to go to the websites and so on.

49:33
Yann LeCun
Sure.

49:34
Yann LeCun
And a lot of plans that people know about that are relatively high level are actually learned. Most people don't invent the plans by themselves. We have some ability to do this, of course, obviously, but most plans that.

49:56
Yann LeCun
People use are plans that they've been trained on.

49:59
Yann LeCun
Like they've seen other people use those plans, or they've been told how to do things right. You can't invent how you take a person who's never heard of airplanes and tell them, how do you go from New York to Paris? And they're probably not going to be able to kind of deconstruct the whole plan unless they've seen examples of that before. So certainly LLMs are going to be.

50:20
Yann LeCun
Able to do this, but then how.

50:23
Yann LeCun
You link this from the low level of actions that need to be done with things like JEPA that basically lift the abstraction level of the representation without attempting to reconstruct every detail of the situation.

50:37
Yann LeCun
That's what we need JEPAs for.

50:40
Lex Fridman
I would love to sort of linger on your skepticism around autoregressive LLMs. So one way I would like to test that skepticism is: everything you say makes a lot of sense, but if I apply everything you said today, and in general, to, like, I don't know, ten years ago, maybe a little bit less, no, let's say three years ago, I wouldn't be able to predict the success of LLMs. So does it make sense to you that autoregressive LLMs are able to be so damn good?

51:20
Yann LeCun
Yes.

51:21
Lex Fridman
Can you explain your intuition? Because if I were to take your wisdom and intuition at face value, I would say there's no way autoregressive LLMs, one token at a time, would be able to do the kind of things they're doing.

51:36
Yann LeCun
No, there's one thing that autoregressive LLMs, or that LLMs in general, not just the autoregressive ones, but including the BERT-style bidirectional ones, are exploiting. And it's self supervised learning. And I've been a very strong advocate of self supervised learning for many years.

51:53
Yann LeCun
So those things are an incredibly impressive.

51:58
Yann LeCun
Demonstration that self supervised learning actually works. The idea didn't start.

52:06
Yann LeCun
With BERT, but BERT was really kind.

52:08
Yann LeCun
Of a good demonstration of this. So the idea is that you take a piece of text, you corrupt it, and then you train some gigantic neural net to reconstruct the parts that are missing. That has produced an enormous amount of benefits. It allowed us to create systems that.

52:29
Yann LeCun
Understand language, systems that can translate hundreds.

52:34
Yann LeCun
Of languages in any direction, systems that are multilingual. So it's a single system that can be trained to understand hundreds of languages.

52:42
Yann LeCun
And translate in any direction and produce.

52:47
Yann LeCun
Summaries, and then answer questions and produce text. And then there's a special case of it, which is the autoregressive trick, where you constrain the system to not elaborate a representation of the text from looking.

53:02
Yann LeCun
At the entire text, but only predicting.

53:06
Yann LeCun
A word from the words that come before.

53:08
Yann LeCun
Right.

53:08
Yann LeCun
And you do this by constraining the architecture of the network, and that's what you can build an autoregressive LLM from. So there was a surprise many years ago with what's called decoder-only LLMs, so systems of this type that are just trying to produce words from the previous ones. And the fact that when you scale them up, they tend to really kind of understand more about language when you train them on lots of data and you make them really big, that was kind of a surprise. And that surprise occurred quite a while back, with work from Google, Meta, OpenAI, et cetera, going back to the GPT kind of work, generative pretrained transformers.
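
To make the two training setups he contrasts concrete, here is a minimal sketch in Python; the toy sentence, the mask positions, and the prefix/target pairs are illustrative assumptions, not anything from the conversation.

```python
# Toy illustration of the two self-supervised text objectives discussed above.
sentence = "the cat sat on the mat".split()

# BERT-style masking: corrupt the text, then train a bidirectional model to
# reconstruct the missing words from context on both sides.
masked_input = ["the", "[MASK]", "sat", "on", "the", "[MASK]"]
reconstruction_targets = {1: "cat", 5: "mat"}   # positions the model must fill in

# Autoregressive (decoder-only) objective: each word is predicted only from
# the words that come before it, never from the ones that come after.
autoregressive_pairs = [(sentence[:i], sentence[i]) for i in range(1, len(sentence))]

print(masked_input, reconstruction_targets)
for prefix, target in autoregressive_pairs:
    print(f"context={prefix} -> predict {target!r}")
```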

53:56
Lex Fridman
You mean like GPT-2? There's a certain place where you start to realize scaling might actually keep giving us an emergent benefit.

54:06
Yann LeCun
Yeah, I mean, there was work from various places, but if you want to kind of place it in the GPT timeline, that would be around GPT-2.

54:19
Lex Fridman
Because you said it. You're so charismatic, and you said so many words. But self supervised learning, yes. But again, the same intuition you're applying to saying that autoregressive LLMs cannot have a deep understanding of the world. If we just apply that same intuition, does it make sense to you that they're able to form enough of a representation of the world to be damn convincing, essentially passing the original Turing test with flying colors?

54:50
Yann LeCun
Well, we're fooled by their fluency, right? We just assume that if a system is fluent in manipulating language, then it has all the characteristics of human intelligence.

55:00
Yann LeCun
But that impression is false.

55:04
Yann LeCun
We're really fooled by it.

55:06
Lex Fridman
What do you think Alan Turing would say without understanding anything, just hanging out with it?

55:11
Yann LeCun
Alan Turing would decide that his Turing.

55:12
Yann LeCun
Test is a really bad test.

55:14
Yann LeCun
Okay, this is what the AI community has decided many years ago, that the Turing test was a really bad test of intelligence.

55:21
Lex Fridman
What would Hans Moravec say about the large language models?

55:26
Yann LeCun
Hans Moravec would say the Moravec paradox still applies.

55:32
Lex Fridman
You don't think he would be really impressed?

55:34
Yann LeCun
No, of course, everybody would be impressed. But it's not a question of being impressed or not. It's a question of knowing what the limit of those systems can do. Again, they are impressive. They can do a lot of useful things. There's a whole industry that is being built around them. They're going to make progress, but there is a lot of things they cannot do, and we have to realize what they cannot do and then figure out how we get there. And I'm saying this from basically ten years of research.

56:08
Yann LeCun
On the idea of self supervised learning.

56:12
Yann LeCun
Actually, that's going back more than ten years. But the idea of self supervised learning. So basically capturing the internal structure of a set of inputs without training the system for any particular task, right? Learning representations. The conference I co-founded 14 years ago is called the International Conference on Learning Representations. That's the entire issue that deep learning is dealing with, right? And it's been my obsession for almost 40 years now. So learning representations is really the thing. For the longest time, we could only do this with supervised learning. And then we started working on what we used to call unsupervised learning, and sort of revived the idea of unsupervised learning in the early 2000s with Yoshua Bengio and Geoff Hinton. Then we discovered that supervised learning actually works pretty well if you can collect enough data.

57:03
Yann LeCun
And so the whole idea of unsupervised.

57:06
Yann LeCun
Self supervised learning kind of took a backseat for a bit, and then I kind of tried to revive it in a big way, starting in 2014, basically, when we started FAIR, and really pushing for finding new methods to do self supervised learning, both for text and for images and for video and audio. And some of that work has been incredibly successful. I mean, the reason why we have multilingual translation, things to do content moderation on Meta, for example, on Facebook, that are multilingual, that understand whether a piece of text is hate speech or not, is due to that progress using self supervised learning for NLP, combining this with transformer architectures and blah, blah. But that's the big success of self supervised learning.

57:55
Yann LeCun
We had similar success in speech recognition, a system called wav2vec, which is also a joint embedding architecture, by the way, trained with contrastive learning. And that system also can produce speech recognition systems that are multilingual with mostly unlabeled data, and only need a few minutes of labeled data to actually do speech recognition. That's amazing. We have systems now, based on those combinations of ideas, that can do real time translation of hundreds of languages into each other.

58:27
Lex Fridman
Speech to speech to speech, even including just fascinating languages that don't have written forms.

58:34
Yann LeCun
That's right.

58:34
Lex Fridman
They're spoken only.

58:35
Yann LeCun
That's right. We don't go through text. It goes directly from speech to speech using an internal representation of kind of speech units that are discrete. It's called textless NLP. We used to call it that. Incredible success there. And then, for ten years, we tried to apply this idea to learning representations of images by training a system to predict videos, learning intuitive physics by training a system to predict what's going to happen in the video. And we tried and failed and failed. With generative models, with models that predict.

59:07
Yann LeCun
Pixels, we could not get them to.

59:11
Yann LeCun
Learn good representations of images, we could not get them to learn good representations of videos. We tried many times. We published lots of papers on it. They kind of sort of work, but not really great. They started working.

59:24
Yann LeCun
We abandoned this idea of predicting every.

59:27
Yann LeCun
Pixel, and basically just doing joint embedding and predicting in representation space. That works. So there's ample evidence that we're not going to be able to learn good representations of the real world using generative models. So I'm telling people, everybody's talking about generative AI. If you're really interested in human-level AI, abandon the idea of generative AI.

59:51
Lex Fridman
Okay, but you really think it's possible to get far with the joint embedding representation? So there's common sense reasoning, and then there's high-level reasoning. I feel like those are two different things. The kind of reasoning that LLMs are able to do. Okay, let me not use the word reasoning, but the kind of stuff that LLMs are able to do seems fundamentally different than the common sense reasoning we use to navigate the world. It seems like we're going to need both.

01:00:23
Yann LeCun
Would you be able to get with.

01:00:25
Lex Fridman
The joint embedding, with the JEPA type of approach, looking at video? Would you be able to learn, let's see, how to get from New York to Paris, or how to understand the state of politics in the world today? Right. These are things where various humans generate a lot of language and opinions on in the space of language, but don't visually represent that in any clearly compressible way.

01:00:56
Yann LeCun
Right.

01:00:56
Yann LeCun
Well, there's a lot of situations that might be difficult for a purely language based system to know.

01:01:05
Yann LeCun
Okay.

01:01:06
Yann LeCun
You can probably learn from reading text, the entirety of the publicly available text in the world that I cannot get from New York to Paris by snapping my fingers. That's not going to work. Right? Yes, but there's probably sort of more complex scenarios of this type which an.

01:01:22
Yann LeCun
LLM may never have encountered and may.

01:01:26
Yann LeCun
Not be able to determine whether it's possible or not. So that link from the low level.

01:01:34
Yann LeCun
To the high level.

01:01:35
Yann LeCun
The thing is that the high level that language expresses is based on the common experience of the low level, which LLMs currently do not have. When we talk to each other, we know we have a common experience of the world. A lot of it is similar.

01:01:55
Yann LeCun
And.

01:01:57
Yann LeCun
LLMs don't have that.

01:01:58
Lex Fridman
But see, it's present. You and I have a common experience of the world in terms of the physics of how gravity works and stuff like this. And that common knowledge of the world, I feel like, is there in the language. We don't explicitly express it, but if you have a huge amount of text, you're going to get this stuff that's between the lines. In order to form a consistent world model, you're going to have to understand how gravity works, even if you don't have an explicit explanation of gravity. So even though, in the case of gravity, there is an explicit explanation of gravity on Wikipedia, the stuff that we think of as common sense reasoning, I feel like, to generate language correctly, you're going to have to figure that out. Now, you could say, as you do, there's not enough text.

01:02:54
Yann LeCun
Sorry. Okay.

01:02:56
Lex Fridman
You don't think so?

01:02:57
Yann LeCun
No, I agree with what you just said, which is that to be able.

01:03:00
Yann LeCun
To do high level common sense, to.

01:03:03
Yann LeCun
Have high level common sense, you need to have the low level common sense to build on top of.

01:03:10
Yann LeCun
And that's.

01:03:10
Yann LeCun
Not there in LLMs. LLMs are purely trained from text. So then the other statement you made, I would not agree with, the idea that implicit in all languages in the world is the underlying reality. There's a lot about underlying reality which is not expressed in language.

01:03:26
Lex Fridman
Is that obvious to you?

01:03:28
Yann LeCun
Yeah, totally.

01:03:30
Lex Fridman
So, like, all the conversations we have, okay, there's the dark web, meaning whatever, the private conversations like DMs and stuff like this, which is much larger probably than what's available, what LLMs are trained on.

01:03:46
Yann LeCun
You don't need to communicate the stuff.

01:03:48
Yann LeCun
That is common, but the humor, all of it.

01:03:51
Lex Fridman
No, you do. You don't need to, but it comes through. If I accidentally knock this over, you'll probably make fun of me, and in the content of you making fun of me will be an explanation of the fact that cups fall, and that gravity works in this way. And then you'll have some very vague information about what kind of things explode when they hit the ground, and then maybe you'll make a joke about entropy or something like this, and we'll never be able to reconstruct this again. Okay, you'll make a little joke like this, and there'll be a trillion other jokes. And from the jokes, you can piece together the fact that gravity works and mugs can break and all this kind of stuff. You don't need to see it. It'll be very inefficient.

01:04:36
Lex Fridman
It's easier for it to knock the thing over, but I feel like it would be there if you have enough of that data.

01:04:46
Yann LeCun
I just think that most of the information of this type that we have accumulated when we were babies is just not present in text, in any description, essentially.

01:04:59
Lex Fridman
And the sensory data is a much richer source for getting that kind of understanding.

01:05:04
Yann LeCun
I mean, that's the 16,000 hours of wake time of a four-year-old, and 10^15 bytes going through vision. Just vision, right? There is a similar bandwidth of touch and a little less through audio. And then text, language, doesn't come in until like a year into life. And by the time you are nine years old, you've learned about gravity. You know about inertia, you know about gravity, you know about stability, you know about the distinction between animate and inanimate objects. By 18 months, you know about why people want to do things, and you help them if they can't. There's a lot of things that you learn mostly by observation, really, not even through interaction. In the first few months of life, babies don't really have any influence on the world. They can only observe, right?

01:05:57
Yann LeCun
And you accumulate, like, a gigantic amount of knowledge just from that.

01:06:03
Yann LeCun
So that's what we're missing from current AI systems.

01:06:07
Lex Fridman
I think in one of your slides you have this nice plot that is one of the ways you show that LLMs are limited.

01:06:13
Yann LeCun
I wonder if you could talk about.

01:06:15
Lex Fridman
Hallucinations, from your perspective. Why do hallucinations happen with large language models?

01:06:23
Yann LeCun
And why.

01:06:24
Lex Fridman
And to what degree is that a fundamental flaw of large language models?

01:06:29
Yann LeCun
Right. So because of the autoregressive prediction, every time an LLM produces a token or a word, there is some level of probability for that word to take you out of the set of reasonable answers. And if you assume, which is a very strong assumption, that the probability of.

01:06:49
Yann LeCun
Such error is that those errors are.

01:06:54
Yann LeCun
Independent across a sequence of tokens being produced, what that means is that every time you produce a token, the probability that you stay within the set of correct answer decreases and it decreases exponentially.

01:07:08
Lex Fridman
So there's a strong, like you said, assumption there that if there's a nonzero probability of making mistake, which there appears to be, then there is going to be a kind of drift.

01:07:18
Yann LeCun
Yeah, and that drift is exponential.

01:07:21
Yann LeCun
It's like errors accumulate. Right?

01:07:24
Yann LeCun
So the probability that the answer will be nonsensical increases exponentially with the number of tokens.
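
A rough numerical illustration of that argument follows; the per-token error rate is an assumed, made-up figure, and the independence of errors is exactly the strong assumption flagged above.

```python
# If each generated token independently has probability p_err of leaving the
# set of acceptable continuations, the chance that the whole answer stays
# acceptable decays exponentially with its length.
p_err = 0.01   # assumed per-token error probability (illustrative only)

for n_tokens in (10, 100, 500, 1000):
    p_still_correct = (1.0 - p_err) ** n_tokens
    print(f"{n_tokens:5d} tokens -> P(answer still in the correct set) = {p_still_correct:.3f}")
```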

01:07:31
Lex Fridman
Is that obvious to you, by the way? Well, mathematically speaking, maybe. But isn't there a kind of gravitational pull towards the truth? Because on average, hopefully the truth is well represented in the training set.

01:07:49
Yann LeCun
No, it's basically a struggle against the curse of dimensionality. So the way you can correct for this is that you fine tune the system by having it produce answers for all kinds of questions that people might come up with. And people are people. So a lot of the questions that they have are very similar to each other. So you can probably cover 80% or whatever of questions that people will ask.

01:08:17
Yann LeCun
By collecting data, and then you fine.

01:08:22
Yann LeCun
Tune the system to produce good answers for all of those things.

01:08:25
Yann LeCun
And it's probably going to be able.

01:08:27
Yann LeCun
To learn that because it's got a lot of capacity to learn. But then there is the enormous set of prompts that you have not covered during training, and that set is enormous. Like, within the set of all possible prompts, the proportion of prompts that have been used for training is absolutely tiny. It's a tiny, tiny subset of all possible prompts. And so the system will behave properly on the prompts that has been either trained, pretrained, or fine tuned.

01:09:01
Yann LeCun
But then there is an entire space of things that it cannot possibly have.

01:09:06
Yann LeCun
Been trained on, because the number is just gigantic. So whatever training the system has been subjected to, to produce appropriate answers, you can break it by finding a.

01:09:20
Yann LeCun
Prompt that will be outside of the.

01:09:23
Yann LeCun
Set of prompts it has been trained on.

01:09:25
Yann LeCun
Or things that are similar, and then.

01:09:27
Yann LeCun
It will just spew complete nonsense.

01:09:30
Lex Fridman
When you say prompt, do you mean that exact prompt, or do you mean a prompt that's, like, in many parts, very different? Is it that easy to ask a question or to say a thing that hasn't been said before on the Internet?

01:09:46
Yann LeCun
I mean, people have come up with things where you put essentially a random sequence of characters in a prompt, and that's enough to kind of throw the system into a mode where it's going to answer something completely different than it would have answered without this. So that's a way to jailbreak the system, basically, get it go outside of its conditioning. Right?

01:10:09
Lex Fridman
That's a very clear demonstration of it. But, of course, that goes outside of what is designed to do. If you actually stitch together reasonably grammatical sentences, is it that easy to break it?

01:10:26
Yann LeCun
Yeah, some people have done things like you write a sentence in English, or you ask a question in English, and it produces a perfectly fine answer, and then you just substitute a few words by the same word in another language, and all of a sudden the answer is complete nonsense.

01:10:44
Lex Fridman
Yes. So I guess what I'm saying is, like, which fraction of prompts that humans are likely to generate are going to break the system?

01:10:54
Yann LeCun
So the problem is that there is a long tail.

01:10:57
Yann LeCun
Yes.

01:10:58
Yann LeCun
This is an issue that a lot of people have realized in social networks and stuff like that, which is that there is a very long tail of things that people will ask, and you can fine tune the system for the 80% or whatever of the things that most people will ask. And then this long tail is so large that you're not going to be able to fine tune the system for all the conditions. And in the end, the system ends up being kind of a giant lookup table, right? Essentially, which is not really what you want. You want systems that can reason, certainly that can plan. So the type of reasoning that takes place in LLMs is very primitive. And the reason you can tell it's primitive is because the amount of computation.

01:11:39
Yann LeCun
That is spent per token produced is constant. So if you ask a question and that question has an answer in a.

01:11:48
Yann LeCun
Given number of tokens, the amount of computation devoted to computing that answer can be exactly estimated. It's the size of the prediction network, with its 36 layers or 92 layers or whatever it is, multiplied by the number of tokens.

01:12:05
Yann LeCun
That's it.

01:12:06
Yann LeCun
And so essentially it doesn't matter if.

01:12:09
Yann LeCun
The question being asked is simple to.

01:12:14
Yann LeCun
Answer, complicated to answer, impossible to answer because it's undecidable.

01:12:18
Yann LeCun
Well, there's something the amount of computation.

01:12:22
Yann LeCun
The system will be able to devote to the answer is constant, or is proportional to the number of tokens produced in the answer, right? This is not the way we work. The way we reason is that when.

01:12:34
Yann LeCun
We're faced with a complex problem or.

01:12:37
Yann LeCun
A complex question, we spend more time trying to solve it and answer it, right, because it's more difficult.
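
A back-of-the-envelope way to see the constant-compute-per-token point; the layer count and per-layer cost below are stand-in numbers, not a real model's figures.

```python
# A fixed transformer spends the same compute on every generated token,
# regardless of how hard the question is: cost = (layers x per-layer cost) x tokens.
def generation_cost(n_layers: int, cost_per_layer: float, n_answer_tokens: int) -> float:
    cost_per_token = n_layers * cost_per_layer     # does not depend on the question
    return cost_per_token * n_answer_tokens

easy_question_cost = generation_cost(n_layers=92, cost_per_layer=1.0, n_answer_tokens=20)
hard_question_cost = generation_cost(n_layers=92, cost_per_layer=1.0, n_answer_tokens=20)
print(easy_question_cost == hard_question_cost)    # True: same 20-token answer, same budget
```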

01:12:43
Lex Fridman
There's a prediction element, there's an iterative element where you're, like, adjusting your understanding of a thing by going over and over, there's a hierarchical element, and so on. Does this mean it's a fundamental flaw of LLMs, or does it mean that there's more parts to that question? Now you're just behaving like an LLM, immediately answering. No, that it's just the low-level world model, on top of which we can then build some of these kinds of mechanisms, like you said, persistent long-term memory or reasoning, and so on. But we need that world model that comes from language. Maybe it is not so difficult to build this kind of reasoning system on top of a well-constructed world model.

01:13:36
Yann LeCun
Okay, whether it's difficult or not, the near future will say, because a lot of people are working on reasoning and planning abilities for dialogue systems, even if we restrict ourselves to language, just having the ability to plan your answer before.

01:13:54
Yann LeCun
You answer in terms that are not.

01:13:58
Yann LeCun
Necessarily linked with the language you're going to use to produce the answer, right? So this idea of this mental model that allows you to plan what you're going to say before you say it, that is very important. I think there are going to be a lot of systems over the next few years that are going to have this capability, but the blueprint of those systems will be extremely different from autoregressive LLMs. So it's the same difference as the difference between what psychologists call system one and system two in humans, right? So system one is the type of task that you can accomplish without deliberately.

01:14:37
Yann LeCun
Consciously think about how you do them.

01:14:41
Yann LeCun
You've done them enough that you can just do it subconsciously, right. Without thinking about them. If you are an experienced driver, you can drive without really thinking about it, and you can talk to someone at the same time or listen to the radio, right? If you are a very experienced chess.

01:14:58
Yann LeCun
Player, you can play against a non.

01:15:00
Yann LeCun
Experienced chess player without really thinking. Either you just recognize the pattern and you play, right? That's the system one. So all the things that you do.

01:15:09
Yann LeCun
Instinctively without really having to deliberately plan.

01:15:12
Yann LeCun
And think about it, and then there is all the tasks where you need to plan. So if you are a not so experienced chess player, or you are experienced, but you play against another experienced chess player, you think about all kinds of options, right? You think about it for a while, right? And you're much better if you have time to think about it than you are if you play blitz with limited time.

01:15:36
Yann LeCun
So this type of deliberate planning, which.

01:15:40
Yann LeCun
Uses your internal world model, that's system two. This is what LLMs currently cannot do. Well, how do we get them to do this, right? How do we build a system that can do this kind of planning or.

01:15:54
Yann LeCun
Reasoning that devotes more resources to complex.

01:15:57
Yann LeCun
Problems than to simple problems? And it's not going to be autoregressive prediction of tokens. It's going to be more something akin to inference of latent variables in.

01:16:12
Yann LeCun
What.

01:16:12
Yann LeCun
Used to be called probabilistic models or graphical models and things of that type. So basically, the principle is like this: the prompt is like observed variables, and what the model does is basically measure to what extent an.

01:16:35
Yann LeCun
Answer is a good answer for a prompt. Okay?

01:16:38
Yann LeCun
So think of it as some gigantic neural net, but it's got only one output, and that output is a scalar.

01:16:43
Yann LeCun
Number, which is, let's say, zero if.

01:16:47
Yann LeCun
The answer is a good answer for the question, and a large number if the answer is not a good answer for the question. Imagine you had this model. If you had such a model, you could use it to produce good answers.

01:16:58
Yann LeCun
The way you would do is produce.

01:17:02
Yann LeCun
The prompt and then search through the space of possible answers for one that minimizes that number.

01:17:09
Yann LeCun
That's called an energy based model, but.

01:17:11
Lex Fridman
That energy based model would need the model constructed by the LLM.

01:17:18
Yann LeCun
Well, so really what you need to do would be to not search over possible strings of text that minimize that energy. But what you would do is do this in abstract representation space. So in sort of the space of abstract thoughts, you would elaborate a thought using this process of minimizing the output of your model, which is just a scalar, it's an optimization process, right? So now the way the system produces.

01:17:48
Yann LeCun
Its answer is through optimization, by minimizing.

01:17:53
Yann LeCun
An objective function, basically, right? And we're talking about inference, we're not talking about training, right? The system has been trained already. So now we have an abstract representation of the thought of the answer, a representation of the answer. We feed that to basically an autoregressive decoder, which can be very simple, that turns this into a text that expresses this thought. Okay? So that, in my opinion, is the blueprint of future dialogue systems.

01:18:21
Yann LeCun
They will think about their answer, plan.

01:18:23
Yann LeCun
Their answer by optimization before turning it.

01:18:26
Yann LeCun
Into text, and that is Turing complete.

01:18:31
Lex Fridman
Can you explain exactly what the optimization problem there is like, what's the objective function? Just linger on it. You kind of briefly described it, but over what space are you optimizing?

01:18:43
Yann LeCun
The space of representations, those abstract representations. Abstract representation. So you have an abstract representation inside the system.

01:18:51
Yann LeCun
You have a prompt, the prompt goes.

01:18:52
Yann LeCun
To an encoder, produces a representation, perhaps goes through a predictor that predicts a representation of the answer, of the proper answer. But that representation may not be a.

01:19:02
Yann LeCun
Good answer because there might be some.

01:19:05
Yann LeCun
Complicated reasoning you need to do. Right? So then you have another process that takes the representation of the answers and modifies it so as to minimize a cost function that measures to what extent the answer is a good answer for the question. Now we sort of ignore the fact for the issue for a moment of how you train that system to measure whether answer is a good answer for a question.

01:19:35
Lex Fridman
But suppose such a system could be created, right? But what's the process, this kind of search like process?

01:19:42
Yann LeCun
It's an optimization process. You can do this if the entire system is differentiable. That scalar output is the result of running the answer, the representation of the answer, through some neural net. Then, by gradient descent, by back propagating gradients, you can figure out how to modify the representation of the answer so as to minimize that.
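
As a minimal sketch of that inference-by-optimization idea, assuming a tiny untrained energy network and a made-up dimensionality for the abstract space:

```python
# Inference as gradient descent on an energy, instead of token-by-token sampling.
import torch
import torch.nn as nn

d = 64                                          # assumed size of the abstract representation space
prompt_encoder = nn.Linear(d, d)                # placeholder for the prompt encoder
energy_net = nn.Sequential(                     # scalar "how bad is this answer for this prompt"
    nn.Linear(2 * d, 128), nn.ReLU(), nn.Linear(128, 1))
for p in energy_net.parameters():               # the model is frozen at inference time
    p.requires_grad_(False)

x = prompt_encoder(torch.randn(d)).detach()     # representation of the observed prompt
z = torch.zeros(d, requires_grad=True)          # abstract representation of the answer, optimized at inference time
optimizer = torch.optim.SGD([z], lr=0.1)

for _ in range(100):                            # the "thinking" loop: minimize the energy over z
    energy = energy_net(torch.cat([x, z])).squeeze()
    optimizer.zero_grad()
    energy.backward()
    optimizer.step()

# z would then go to a (possibly very simple) decoder that turns the abstract
# answer into text, in whatever language you like.
```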

01:20:05
Lex Fridman
So that's still a gradient based, it's gradient based inference.

01:20:08
Yann LeCun
So now you have a representation of the answer in abstract space. Now you can turn it into text.

01:20:14
Yann LeCun
Right.

01:20:15
Yann LeCun
And the cool thing about this is that the representation now can be optimized through gradient descent, but also is independent of the language in which you're going.

01:20:25
Yann LeCun
To express the answer. Right.

01:20:27
Lex Fridman
So you're operating in the abstract representation space. I mean, this goes back to the joint embedding, that it's better to work in the space of, I don't know, to romanticize the notion, like the space of concepts versus the space of concrete sensory information. Right. Okay, but can this do something like reasoning, which is what we're talking about?

01:20:51
Yann LeCun
Well, not really, only in a very simple way. I mean, basically you can think of those things as doing the kind of optimization I was talking about, except they optimize in the discrete space, which is the space of possible sequences of tokens, and they do this optimization in a horribly inefficient way, which is generate a lot of hypothesis and then select the best ones.

01:21:13
Yann LeCun
And that's incredibly wasteful in terms of.

01:21:17
Yann LeCun
Computation, because you basically have to run your LLM for every possible generated sequence, and it's incredibly wasteful. So it's much better to do an optimization in continuous space, where you can do gradient descent, as opposed to generating tons of things and then selecting the best.

01:21:38
Yann LeCun
You just iteratively refine your answer to.

01:21:41
Yann LeCun
Go towards the best. Right? That's much more efficient. But you can only do this in continuous spaces with differentiable functions.

01:21:48
Lex Fridman
You're talking about the reasoning-like ability to think deeply or to reason deeply. How do you know what is an answer that's better or worse based on deep reasoning?

01:22:04
Yann LeCun
Right?

01:22:05
Yann LeCun
So then we're asking the question of conceptually, how do you train an energy based model? Right? So energy based model is a function.

01:22:11
Yann LeCun
With a scalar output, just a number. You give it two inputs, x and y, and it tells you whether y.

01:22:19
Yann LeCun
Is compatible with x or not x.

01:22:20
Yann LeCun
You observe, let's say it's a prompt.

01:22:22
Yann LeCun
An image, a video, whatever. And y is a proposal for an.

01:22:27
Yann LeCun
Answer, a continuation of the video, whatever, and it tells you whether y is compatible with x.

01:22:34
Yann LeCun
And the way it tells you that y is compatible with x is that the output of that function would be zero if Y is compatible with x.

01:22:41
Yann LeCun
And would be a positive number, nonzero.

01:22:44
Yann LeCun
If Y is not compatible with x. Okay, how do you train a system.

01:22:48
Yann LeCun
Like this, at a completely general level, is you show it pairs of x.

01:22:54
Yann LeCun
And y's that are compatible. A question and the corresponding answer, and you train the parameters of the big.

01:23:00
Yann LeCun
Neural net inside to produce zero. Okay, now that doesn't completely work because.

01:23:07
Yann LeCun
The system might decide, well, I'm just.

01:23:09
Yann LeCun
Going to say zero for everything.

01:23:11
Yann LeCun
So now you have to have a process to make sure that for a.

01:23:15
Yann LeCun
Wrong y, the energy would be larger than zero.

01:23:18
Yann LeCun
And there you have two options. One is contrastive methods. So the contrastive method is you show an x and a bad y, and you tell the system, well, give a high energy to this. Like, push up the energy, right? Change the weights in the neural net that computes the energy so that it goes up. So that's contrastive methods. The problem with this is, if the space of y is large, the number of such contrastive samples you're going to.

01:23:44
Yann LeCun
Have to show is gigantic. But people do this.

01:23:50
Yann LeCun
They do this. When you train a system with RLHF, basically what you're training is what's called a reward model, which is basically an objective function that tells you whether an answer is good or bad.

01:24:02
Yann LeCun
And that's basically exactly what this is.

01:24:06
Yann LeCun
So we already do this to some extent. We're just not using it for inference, we're just using it for training.

01:24:13
Yann LeCun
There is another set of methods which.

01:24:16
Yann LeCun
Are non contrastive, and I prefer those. And those non contrastive methods basically say, okay, the energy function needs to have low energy on pairs of x y's that are compatible, that come from your training set. How do you make sure that the energy is going to be higher everywhere else? And the way you do this is.

01:24:38
Yann LeCun
By having a regularizer, a criterion, a.

01:24:43
Yann LeCun
Term in your cost function that basically.

01:24:46
Yann LeCun
Minimizes the volume of space that can take low energy. And the precise way to do this.

01:24:53
Yann LeCun
Is all kinds of different specific ways to do this, depending on the architecture. But that's the basic principle. So that if you push down the energy function for particular regions in the xy space, it will automatically go up in other places because there's only a limited volume of space that can take low energy.

01:25:11
Yann LeCun
Okay.

01:25:11
Yann LeCun
By the construction of the system or by the regularizing function.
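
Sketched in code, and under the assumption that the energies and representations are already computed elsewhere, the two training options might look like this; the margin, the regularizer weight, and the squared-norm penalty are crude stand-ins for the volume-limiting criteria he alludes to.

```python
# Two ways to keep the energy from collapsing to zero everywhere.
import torch
import torch.nn.functional as F

def contrastive_loss(e_good, e_bad, margin=1.0):
    # Push the energy of compatible (x, y) pairs down, and the energy of
    # mismatched pairs up, at least until they clear a margin.
    return e_good.mean() + F.relu(margin - e_bad).mean()

def regularized_loss(e_good, z, reg_weight=0.1):
    # Non-contrastive flavor: only push down on compatible pairs, plus a
    # regularizer on the internal representation that limits how much of the
    # space can be given low energy (a simple squared norm is used here only
    # as an illustrative placeholder).
    return e_good.mean() + reg_weight * z.pow(2).mean()

e_good = torch.rand(8)          # energies of compatible pairs from the training set
e_bad = torch.rand(8) + 0.5     # energies of mismatched (contrastive) pairs
z = torch.randn(8, 32)          # internal representations, for the regularized variant
print(contrastive_loss(e_good, e_bad).item(), regularized_loss(e_good, z).item())
```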

01:25:16
Lex Fridman
We've been talking very generally, but what is a good x and a good y? What is a good representation of x and y? Because we've been talking about language. And if you just take language directly, that presumably is not good. So there has to be some kind of abstract representation of ideas.

01:25:36
Yann LeCun
Yeah, you can do this with language.

01:25:38
Yann LeCun
Directly by just x is a text.

01:25:42
Yann LeCun
And y is a continuation of that text.

01:25:43
Yann LeCun
Yes.

01:25:45
Yann LeCun
Or x is a question, y is.

01:25:47
Lex Fridman
The answer. But you're saying that's not going to cut it. I mean, that's just going to do what LLMs are doing.

01:25:52
Yann LeCun
Well, no, it depends on how the internal structure of the system is built. If the internal structure of the system is built in such a way that inside of this system there is a latent variable, let's call it z, that.

01:26:07
Yann LeCun
You can manipulate so as to minimize.

01:26:10
Yann LeCun
The output energy, then that z can be viewed as a representation of a good answer that you can translate into a y.

01:26:18
Yann LeCun
That is a good answer.

01:26:20
Lex Fridman
So this kind of system could be trained in a very similar way.

01:26:24
Yann LeCun
Very similar way. But you have to have this way of preventing collapse, of ensuring that there is high energy for things you don't train it on. And currently it's very implicit in LLM. It's done in a way that people don't realize it's being done. But it is being done is due to the fact that when you give.

01:26:44
Yann LeCun
A high probability to a word, automatically.

01:26:49
Yann LeCun
You give low probability to other words.

01:26:51
Yann LeCun
Because you only have a finite amount.

01:26:54
Yann LeCun
Of probability to go around, right? They have to sum to one. So when you minimize the cross entropy or whatever, when you train your LLM to predict the next word, you're increasing the probability your system will give to the correct word, but you're also decreasing the probability it will give to the incorrect words. Now, indirectly, that gives a high probability to sequences of words that are good, and a low probability to sequences of words that are bad. But it's very indirect. And it's not obvious why this actually works at all, because you're not doing it on a joint probability of all the symbols in a sequence. You're just kind of factorizing that probability in terms of conditional probabilities over successive tokens.

01:27:41
Lex Fridman
So how do you do this for visual data?

01:27:43
Yann LeCun
So we've been doing this with JEPA architectures, basically the joint embedding, I-JEPA. So there, the compatibility between two things is: here's an image or a video, and here is a corrupted, shifted, or transformed version of that image or video, or masked.

01:28:00
Yann LeCun
Okay.

01:28:01
Yann LeCun
And then the energy of the system.

01:28:04
Yann LeCun
Is the prediction error of the representation.

01:28:12
Yann LeCun
The predicted representation of the good thing versus the actual representation of the good thing, right? So you run the corrupted image through the system, predict the representation of the good, uncorrupted input, and then compute the prediction error. That's the energy of the system. So this system will tell you, if this is a good image and this is a corrupted version, it will give you zero energy if those two things are effectively, one of them is a corrupted version of the other, and give you a high energy if the two images are completely different.
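
A minimal sketch of that energy computation, with untrained placeholder networks and made-up dimensions standing in for the real encoders and predictor:

```python
# JEPA-style energy: predict the representation of the clean input from the
# corrupted view, and use the prediction error in representation space as the energy.
import torch
import torch.nn as nn

d_in, d_repr = 1024, 256
target_encoder = nn.Linear(d_in, d_repr)    # encodes the good, uncorrupted input
context_encoder = nn.Linear(d_in, d_repr)   # encodes the corrupted/masked view
predictor = nn.Linear(d_repr, d_repr)       # predicts the clean representation from the corrupted one

def energy(clean, corrupted):
    target = target_encoder(clean)
    predicted = predictor(context_encoder(corrupted))
    return (predicted - target).pow(2).mean()   # low if the two views are compatible, high otherwise

clean = torch.randn(d_in)
corrupted = clean * (torch.rand(d_in) > 0.5)    # crude stand-in for masking half the input
print(float(energy(clean, corrupted)))
```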

01:28:46
Lex Fridman
And hopefully that whole process gives you a really nice compressed representation of reality, of visual reality.

01:28:54
Yann LeCun
And we know it does because then we use those representations as input to a classification system.

01:28:59
Lex Fridman
That system works really nicely. Okay, well, so to summarize, you recommend, in a spicy way that only Yann LeCun can, you recommend that we abandon generative models in favor of joint embedding architectures?

01:29:14
Yann LeCun
Yes.

01:29:15
Lex Fridman
Abandon autoregressive generation?

01:29:17
Yann LeCun
Yes.

01:29:17
Lex Fridman
Abandon probable. This feels like court testimony. Abandon probabilistic models in favor of energy based models, as we talked about. Abandon contrastive methods in favor of regularized methods. And let me ask you about this. You've been for a while a critic of reinforcement learning.

01:29:36
Yann LeCun
Yes.

01:29:37
Lex Fridman
So the last recommendation is that we abandon RL in favor of model predictive control, as you were talking about, and only use RL when planning doesn't yield the predicted outcome. And we use RL in that case to adjust the world model or the critic.

01:29:55
Yann LeCun
Yes.

01:29:56
Lex Fridman
So you mentioned RLHF, reinforcement learning with human feedback. Why do you still hate reinforcement learning?

01:30:05
Yann LeCun
I don't hate reinforcement learning, and I think it should not be abandoned completely, but I think its use should be minimized because it's incredibly inefficient in terms of samples. And so the proper way to train a system is to first have it learn good representations of the world and world models from mostly observation, maybe a.

01:30:29
Yann LeCun
Little bit of interactions, and then steered based on that.

01:30:33
Lex Fridman
If the representation is good, then the adjustments should be minimal.

01:30:36
Yann LeCun
Yeah. Now there's two things.

01:30:38
Yann LeCun
If you've learned a world model, you.

01:30:40
Yann LeCun
Can use a world model to plan a sequence of actions to arrive at a particular objective.

01:30:45
Yann LeCun
You don't need RL unless the way you measure whether you succeed might be inexact.

01:30:52
Yann LeCun
Your idea of whether you're going to fall from your bike.

01:30:58
Yann LeCun
Might be wrong.

01:30:59
Yann LeCun
Or whether the person you're fighting in MMA was going to do something and then does something else. So there are two ways you can be wrong. Either your objective function does not reflect the actual objective function you want to.

01:31:15
Yann LeCun
Optimize, or your world model is inaccurate. Right.

01:31:19
Yann LeCun
So the prediction you were making about what was going to happen in the world is inaccurate. So if you want to adjust your.

01:31:26
Yann LeCun
World model while you are operating in the world, or your objective function, that is basically in the realm of RL.

01:31:35
Yann LeCun
This is what RL deals with to some extent. Right. So adjust your world model and the.

01:31:41
Yann LeCun
Way to adjust your world model, even.

01:31:43
Yann LeCun
In advance, is to explore parts of the space where your world model, where.

01:31:48
Yann LeCun
You know that your world model is inaccurate.

01:31:50
Yann LeCun
That's called curiosity, basically, or play, right? When you play, you kind of explore parts of the state space that you don't want to do for real because it might be dangerous. But you can adjust your world model without killing yourself, basically. So that's what you want to use RL for. When it comes time to learning a particular task, you already have all the good representations, you already have your world model, but you need to adjust it for the situation at hand.

01:32:25
Yann LeCun
That's when you use RL.

01:32:26
Lex Fridman
Why do you think RLHF works so well? This enforcement, learning with human feedback, why did it have such a transformational effect on large language models before?

01:32:38
Yann LeCun
What's had the transformational effect is human feedback. There are many ways to use it, and some of it is just purely supervised. Actually, it's not really reinforcement learning.

01:32:47
Lex Fridman
So it's the HF.

01:32:49
Yann LeCun
It's the HF. And then there are various ways to use human feedback, right? So you can ask humans to rate answers, multiple answers that are produced by the model. And then what you do is you train an objective function to predict that rating. And then you can use that objective function to predict whether an answer is good, and you can back propagate gradients through this to fine tune your system so that it only produces highly rated answers.

01:33:20
Yann LeCun
Okay, so that's one way.

01:33:22
Yann LeCun
So that's like in RL, that means training what's called a reward model, right? So something that basically a small neural net that estimates to what extent an.

01:33:33
Yann LeCun
Answer is good, right?

01:33:35
Yann LeCun
It's very similar to the objective I was talking about earlier for planning, except now it's not used for planning, it's used for fine tuning your system. I think it would be much more efficient to use it for planning, but currently it's used to fine tune the.

01:33:51
Yann LeCun
Parameters of the system.
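
A simplified sketch of that reward-model idea, assuming frozen embeddings stand in for the prompt and the generated answer; the shapes and ratings below are made up.

```python
# Learn a scalar "how good is this answer for this prompt" from human ratings,
# then backpropagate through it to nudge the generator toward highly rated answers.
import torch
import torch.nn as nn

d = 128
reward_model = nn.Sequential(nn.Linear(2 * d, 64), nn.ReLU(), nn.Linear(64, 1))

prompt_emb = torch.randn(4, d)                      # stand-in prompt representations
answer_emb = torch.randn(4, d, requires_grad=True)  # stand-in answer representations
human_rating = torch.tensor([1.0, 0.0, 1.0, 0.5])   # made-up human scores

pred = reward_model(torch.cat([prompt_emb, answer_emb], dim=-1)).squeeze(-1)
loss = (pred - human_rating).pow(2).mean()          # supervised step: match the human ratings
loss.backward()

# The same scalar can later be used to fine tune the system that produced
# answer_emb, or, as suggested above, as a planning objective at inference time.
print(answer_emb.grad.shape)
```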

01:33:52
Yann LeCun
Now there are several ways to do this. Some of them are supervised. You just ask a human person like, what is a good answer for this?

01:34:01
Yann LeCun
Right? And you just type the answer.

01:34:05
Yann LeCun
I mean, there's lots of ways that.

01:34:07
Yann LeCun
Those systems are being adjusted.

01:34:10
Lex Fridman
Now, a lot of people have been very critical of the recently released Google's Gemini 1.5 for being, essentially, in my words, I could say, super woke, in the negative connotation of that word. There are some almost hilariously absurd things that it does, like it modifies history, like generating images of a black George Washington. Or, perhaps more seriously, something that you commented on Twitter, which is refusing to comment on or generate images or even descriptions of Tiananmen Square or the Tank Man, one of the most sort of legendary protest images in history. Of course, these images are highly censored by the Chinese government, and therefore everybody started asking questions of, what is the process of designing these LLMs? What is the role of censorship in these? All that kind of stuff. So you commented on Twitter, saying that open source is the answer. Yeah, essentially.

01:35:26
Lex Fridman
So can you explain?

01:35:29
Yann LeCun
I actually made that comment on just about every social network I can, and I've made that point multiple times in various forums. Here's my point of view on this. People can complain that AI systems are.

01:35:46
Yann LeCun
Biased, and they generally are biased by.

01:35:50
Yann LeCun
The distribution of the training data that.

01:35:52
Yann LeCun
They've been trained on.

01:35:55
Yann LeCun
That reflects biases in society, and that.

01:36:01
Yann LeCun
Is potentially offensive to some people or potentially not. And some techniques to debias then become offensive to some people.

01:36:15
Yann LeCun
Because of historical incorrectness and things like that.

01:36:23
Yann LeCun
And so you can ask the question. You can ask two questions.

01:36:26
Yann LeCun
The first question is, is it possible.

01:36:28
Yann LeCun
To produce an AI system that is not biased? And the answer is absolutely not. And it's not because of technological.

01:36:36
Yann LeCun
Challenges, although there are technological challenges to that.

01:36:41
Yann LeCun
It's because bias is in the eye of the beholder.

01:36:46
Yann LeCun
Different people may have different ideas about.

01:36:49
Yann LeCun
What constitutes bias for a lot of things.

01:36:53
Yann LeCun
I mean, there are facts that are indisputable, but there are a lot of opinions or things that can be expressed in different ways. And so you cannot have an unbiased system. That's just an impossibility.

01:37:08
Yann LeCun
And so.

01:37:11
Yann LeCun
What'S the answer to this?

01:37:12
Yann LeCun
And the answer is the same answer that we found in liberal democracy about the press. The press needs to be free and diverse.

01:37:25
Yann LeCun
We have free speech for a good.

01:37:26
Yann LeCun
Reason, is because we don't want all.

01:37:31
Yann LeCun
Of our information to come from a.

01:37:34
Yann LeCun
Unique source, because that's opposite to the.

01:37:38
Yann LeCun
Whole idea of democracy and progress, of ideas and even science, right? In science, people have to argue for different opinions, and science makes progress when people disagree and they come up with an answer and a consensus forms, right? And it's true in all democracies around the world.

01:37:57
Yann LeCun
So there is a future, which is.

01:38:02
Yann LeCun
Already happening, where every single one of our interactions with the digital world will be mediated by AI systems.

01:38:10
Yann LeCun
AI systems.

01:38:11
Yann LeCun
Right.

01:38:12
Yann LeCun
We're going to have smart glasses.

01:38:14
Yann LeCun
You can already buy them from meta.

01:38:16
Yann LeCun
The Ray-Ban Meta, where you can talk.

01:38:20
Yann LeCun
To them, and they are connected with an LLM, and you can get answers to any question you have. Or you can be looking at a monument, and there is a camera in the glasses, and you can ask it, like, what can you tell me about this building or this monument? You can be looking at a menu in a foreign language, and it will translate it for you. Or we can do real time translation if we speak different languages. So a lot of our interactions with the digital world are going to be mediated by those systems in the near.

01:38:50
Yann LeCun
Future, increasingly, the search engines that we're.

01:38:56
Yann LeCun
Going to use are not going to.

01:38:57
Yann LeCun
Be search engines, they're going to be dialogue systems that we just ask a question and it will answer and then.

01:39:05
Yann LeCun
Point you to perhaps the appropriate references for it. But here is the thing. We cannot afford those systems to come from a handful of companies on the.

01:39:14
Yann LeCun
West coast of the US.

01:39:17
Yann LeCun
Because those systems will constitute the repository of all.

01:39:20
Yann LeCun
Human knowledge, and we cannot have that.

01:39:24
Yann LeCun
Be controlled by a small number of people. Right?

01:39:27
Yann LeCun
It has to be diverse for the.

01:39:29
Yann LeCun
Same reason the press has to be diverse. So how do we get a diverse set of AI assistants? It's very expensive and difficult to train a base model, right, a base LLM, at the moment. In the future, it might be something different, but at the moment, that's an LLM.

01:39:47
Yann LeCun
So only a few companies can do this properly. And if some of those subsystems are.

01:39:54
Yann LeCun
Open source, anybody can use them, anybody can fine tune them. If we put in place some systems.

01:40:01
Yann LeCun
That allows any group of people, whether they are individual citizens, groups of citizens, government organizations, NGOs, companies, whatever, to take those open source.

01:40:22
Yann LeCun
Systems, AI systems, and fine tune them for their own purpose.

01:40:25
Yann LeCun
On their own data, then we're going.

01:40:28
Yann LeCun
To have a very large diversity of different AI systems that are specialized for all of those things, right?

01:40:34
Yann LeCun
So I tell you, I talked to.

01:40:36
Yann LeCun
The French government quite a bit, and.

01:40:38
Yann LeCun
The French government will not accept that.

01:40:41
Yann LeCun
The digital diet of all their citizens be controlled by three companies on the.

01:40:46
Yann LeCun
West coast of the US, that's just not acceptable.

01:40:49
Yann LeCun
It's a danger to democracy, regardless of how well intentioned those companies are. And it's also a danger to local.

01:40:59
Yann LeCun
Culture, to values, to language. Right?

01:41:05
Yann LeCun
I was talking with the founder of Infosys in India. He's funding a project to fine tune Llama 2, the open source model produced by Meta, so that Llama 2 speaks all 22 official languages of India. It's very important for people in India. I was talking to a former colleague of mine, Moustapha Cissé, who used to be a scientist at FAIR and then moved back to Africa, created a research lab for Google in Africa, and now has a new startup called Kera. And what he's trying to do is basically have LLMs that speak the local languages in Senegal so that people can have access to medical information, because they don't have access to doctors. It's a very small number of doctors per capita. I mean, you can't have any of this unless you have open source platforms.

01:41:57
Yann LeCun
So with open source platforms, you can have AI systems that are not only diverse in terms of political opinions or things of that type, but in terms.

01:42:05
Yann LeCun
Of language, culture, value systems, political opinions.

01:42:15
Yann LeCun
Technical abilities in various domains. And you can have an industry, an ecosystem of companies, that fine tune those open source systems for vertical applications in industry, right? You have, I don't know, a publisher that has thousands of books, and they want to build a system that allows a customer to just ask a question about the content of any of their books. You need to train on their proprietary data, right? You have a company, we have one within Meta, it's called Metamate, and it's basically an LLM that can answer any question about internal stuff about the company.

01:42:51
Yann LeCun
Very useful. A lot of companies want this, right?

01:42:55
Yann LeCun
A lot of companies want this not just for their employees, but also for their customers to take care of their customers.

01:43:00
Yann LeCun
So the only way you're going to have an AI industry, the only way.

01:43:04
Yann LeCun
You'Re going to have AI systems that.

01:43:06
Yann LeCun
Are not uniquely biased is if you.

01:43:09
Yann LeCun
Have open source platforms on top of which any group can build specialized systems. So the inevitable direction of history is that the vast majority of AI systems will be built on top of open source platforms.

01:43:28
Lex Fridman
So that's a beautiful vision. So meaning, like, a company like Meta or Google or so on should take only minimal fine tuning steps after building the foundation, pre-trained model, as few steps as possible, basically. Can Meta afford to do that?

01:43:51
Yann LeCun
No.

01:43:51
Lex Fridman
So I don't know if you know this, but companies are supposed to make money somehow, and open source is like giving it away. I don't know. Mark made a video, Mark Zuckerberg, a very sexy video, talking about 350,000 Nvidia H100s. The math of that, just for the GPUs, that's 100 billion.

01:44:19
Yann LeCun
Plus the.

01:44:20
Lex Fridman
Infrastructure for training, everything. So I'm no business guy, but how do you make money on that? So the vision you paint is a really powerful one, but how is it possible to make money?

01:44:32
Yann LeCun
Okay, so you have several business models, right? The business model that meta is built.

01:44:39
Yann LeCun
Around is you offer a service and.

01:44:46
Yann LeCun
The financing of that service is either through ads or through business customers. So for example, if you have an LLM that can help a mom and pop pizza place by talking to the.

01:45:01
Yann LeCun
Customer through WhatsApp, and so the customers can just order a pizza and the.

01:45:06
Yann LeCun
System will just ask them like what topping do you want? Or what size, blah blah, the.

01:45:12
Yann LeCun
Business will pay for that. Okay, that's a model.

01:45:19
Yann LeCun
And otherwise, if it's a system that is on the more kind of classical services, it can be ad supported, or there's several models. But the point is, if you have a big enough potential customer base, and.

01:45:34
Yann LeCun
You need to build that system anyway for them, it doesn't hurt you to.

01:45:41
Yann LeCun
Actually distribute it in open source.

01:45:43
Lex Fridman
Again, I'm no business guy, but if you release the open source model, then other people can do the same kind of task and compete on it, basically provide fine tuned models for businesses. Is the bet that Meta is making, by the way, I'm a huge fan of all this, but is the bet that Meta is making, like, we'll do a better job of it?

01:46:05
Yann LeCun
Well, no, the bet is more we already have a huge user base and customer base, right? So it's going to be useful to them. Whatever we offer them is going to be useful. And there is a way to derive revenue from this.

01:46:22
Yann LeCun
And it doesn't hurt that we provide that system or the base model, right.

01:46:29
Yann LeCun
The foundation model, in open source, for others to build applications on top of it too. If those applications turn out to be useful for our customers, we can just buy it from them. It could be that they will improve the platform. In fact, we see this already. I mean, there are literally millions of downloads of Llama 2 and thousands of people who have provided ideas about how to make it better. So this clearly accelerates progress, to make the system available to sort of a wide community of people. And there are literally thousands of businesses who are building applications with it. So Meta's ability to derive revenue from this technology is not impaired by the distribution of base models in open source.

01:47:26
Lex Fridman
The fundamental criticism that Gemini is getting is that, as you pointed out, on the west coast, just to clarify, we're currently on the east coast, where I would suppose Meta AI headquarters would be. So those are strong words about the west coast. But I guess the issue that happens is, I think it's fair to say that most tech people have a political affiliation with the left wing. They lean left. And so the problem that people are criticizing Gemini with is that, in that debiasing process that you mentioned, their ideological lean becomes obvious. Is this something that could be escaped? You're saying open source is the only way. Have you witnessed this kind of ideological lean that makes engineering difficult?

01:48:22
Yann LeCun
No, I don't think the issue has to do with the political leaning of the people designing those systems.

01:48:29
Yann LeCun
It has to do with the acceptability.

01:48:33
Yann LeCun
Or political leanings of their customer base or audience. Right. So a big company cannot afford to offend too many people. So they're going to make sure that whatever product they put out is safe, whatever that means.

01:48:53
Yann LeCun
It's very possible to overdo it, and.

01:48:56
Yann LeCun
It's also very possible to, well, it's impossible to do it properly for everyone. You're not going to satisfy everyone. So that's what I said before. You cannot have a system that is unbiased, that is perceived as unbiased by everyone. You push it one way, one set of people are going to see it as biased, and then you push it the other way, and another set of people is going to see it as biased. And then, in addition to this, there's the issue of, if you push the system perhaps a little too far in one direction, it's going to be non-factual, right? You're going to have black Nazi soldiers.

01:49:31
Lex Fridman
We should mention image generation of black Nazi soldiers, which is not factually accurate.

01:49:38
Yann LeCun
Right.

01:49:39
Yann LeCun
And can be offensive for some people. As you know, it's going to be impossible to kind of produce systems that are unbiased for everyone. So the only solution that I see is diversity.

01:49:52
Lex Fridman
And diversity in the full meaning of that word, diversity in every possible way. Yeah. Marc Andreessen just tweeted today. Let me do a TLDR. The conclusion is: only startups and open source can avoid the issue that he's highlighting with big tech. He's asking, can big tech actually field generative AI products? One, ever-escalating demands from internal activists, employee mobs, crazed executives, broken boards, pressure groups, extremist regulators, government agencies, the press, in quotes, experts, and everything corrupting the output. Two, constant risk of generating a bad answer or drawing a bad picture or rendering a bad video. Who knows what it's going to say or do at any moment. Three, legal exposure, product liability, slander, election law, many other things, and so on. Anything that makes Congress mad. Four, continuous attempts to tighten the grip on acceptable output.

01:50:58
Lex Fridman
Degrade the model, in terms of how good it actually is, how usable, pleasant to use, and effective, and all that kind of stuff. And five, publicity of bad text, images, and video actually puts those examples into the training data for the next version.

01:51:14
Yann LeCun
So on.

01:51:15
Lex Fridman
So he just highlights how difficult this is, with all kinds of people being unhappy. As you said, you can't create a system that makes everybody happy.

01:51:24
Yann LeCun
Yes.

01:51:25
Lex Fridman
So if you're going to do the fine-tuning yourself and keep it closed source, essentially, the problem there is trying to minimize the number of people who are going to be offended. And you're saying that's almost impossible to do.

01:51:41
Yann LeCun
Right.

01:51:42
Lex Fridman
And the better way is to do open source, basically, yeah.

01:51:48
Yann LeCun
Marc is right about a number of things that he lists that indeed scare large companies. Certainly congressional investigations are one of them. Legal liability, making things that get people to hurt themselves or hurt others. Big companies are really careful about not producing things of this type.

01:52:17
Yann LeCun
Because they.

01:52:19
Yann LeCun
Don't want to hurt anyone, first of all. And then second, they want to preserve their business. So it's essentially impossible for systems like this, which can inevitably formulate political opinions, and opinions about various things that may be political or not but that people may disagree about, moral issues, questions about religion, or cultural issues that people from different communities would disagree with in the first place. So there's only a relatively small number of things, basic principles, that people will agree on, but beyond that, if you want those systems.

01:53:00
Yann LeCun
To be useful, they will necessarily have.

01:53:03
Yann LeCun
To offend a number of people inevitably.

01:53:08
Lex Fridman
And so open source is just better.

01:53:10
Yann LeCun
And then diversity is better. Right?

01:53:13
Lex Fridman
And open source enables diversity.

01:53:15
Yann LeCun
That's right. Open source enables diversity.

01:53:18
Lex Fridman
That's going to be a fascinating world, where, if it's true that the open source world leads the way, if Meta leads the way and creates this kind of open source foundation model world, governments will have a fine-tuned model, and potentially people who vote left and right will have their own models and preferences and be able to choose. And it will potentially divide us even more, but that's on us humans. We get to figure it out. Basically, the technology enables humans to human more effectively, and all the difficult ethical questions that humans raise, it will just leave up to us to figure out.

01:54:02
Yann LeCun
Yeah, I mean, there are some limits, the same way there are limits to free speech. There has to be some limit to the kind of stuff that those systems might be authorized to produce, some guardrails. So that's one thing I've been interested in, which is, in the type of.

01:54:20
Yann LeCun
Architecture that we were discussing before, where.

01:54:23
Yann LeCun
The output of a system is the result of an inference to satisfy an objective. That objective can include guardrails, and we can put guardrails in open source systems. I mean, if we eventually have systems that are built with this blueprint, we can put guardrails in those systems that guarantee there is a minimum set of guardrails making the system non-dangerous and non-toxic, et cetera, basic things that everybody will agree on. And then the fine-tuning that people add, or the additional guardrails that people add, will cater to their community, whatever it is.
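To make this objective-driven idea concrete, here is a minimal, hypothetical sketch in Python, not Meta's or LeCun's actual system: the output is whatever candidate minimizes a task cost plus weighted guardrail penalties, so the guardrails are enforced at inference time as part of the objective. The cost functions, the limit, and the weights below are illustrative assumptions.

```python
import numpy as np

def task_cost(z, target):
    # How far the candidate output is from what the task asks for.
    return np.sum((z - target) ** 2)

def guardrail_cost(z, limit=1.0):
    # Penalty that is zero inside the allowed region and grows outside it.
    return np.sum(np.maximum(np.abs(z) - limit, 0.0) ** 2)

def infer(target, steps=500, lr=0.05, guard_weight=10.0, limit=1.0):
    # Inference as optimization: search for the output z that best satisfies
    # the task objective while respecting the guardrail term.
    z = np.zeros_like(target, dtype=float)
    for _ in range(steps):
        grad_task = 2.0 * (z - target)
        overshoot = np.maximum(np.abs(z) - limit, 0.0)
        grad_guard = 2.0 * overshoot * np.sign(z)
        z -= lr * (grad_task + guard_weight * grad_guard)
    return z

# The second coordinate of the target violates the guardrail, so the answer
# is pulled back toward the allowed region instead of matching it exactly.
print(infer(np.array([0.5, 2.0])))
```

In a real objective-driven architecture the objective would be defined over learned representations and the guardrail terms would encode safety constraints, but the mechanism, an output produced by minimizing a cost that includes guardrails, is the same idea being described here.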

01:55:05
Lex Fridman
The fine tuning will be more about the gray areas of what is hate speech, what is dangerous, and all that.

01:55:10
Yann LeCun
Kind of stuff with different value systems.

01:55:13
Lex Fridman
Value systems. But still, even with the objectives of how to build a bioweapon, for example, something I think you've commented on, or at least there's a paper where a collection of researchers is trying to understand the social impacts of these LLMs. And I guess one nice threshold is: does the LLM make it any easier than a search would, like a Google search?

01:55:39
Yann LeCun
Right. So the increasing number of studies on this seems to point to the fact.

01:55:48
Yann LeCun
That it doesn't help.

01:55:49
Yann LeCun
So, having an LLM doesn't help you design or build a bioweapon or a chemical weapon. If you already have access to a search engine and a library, the sort of increased information you get, or the ease with which you get it doesn't really help you. That's the first thing. The second thing is, it's one thing to have a list of instructions of how to make a chemical weapon, for example, or bioweapon.

01:56:17
Yann LeCun
It's another thing to actually build it.

01:56:19
Yann LeCun
And it's much harder than you might think, and an LLM will not help you with that. In fact, nobody in the world, not even countries, uses bioweapons, because most of the time they have no idea how to protect their own populations against them. So it's too dangerous, actually, to ever use, and it's, in fact, banned by international treaties. Chemical weapons are different. They're also banned by treaties, but it's the same problem: they're difficult to use in situations where they don't turn against the perpetrators. But we could ask Elon Musk.

01:56:58
Yann LeCun
I can give you a very precise.

01:57:00
Yann LeCun
List of instructions of how you build a rocket engine. And even if you have a team of 50 engineers that are really experienced building it, you're still going to have to blow up a dozen of them before you get one that works. And it's the same with.

01:57:17
Yann LeCun
Chemical weapons.

01:57:18
Yann LeCun
Or bioweapons or things like this. It requires expertise in the real world that an LLM is not going to.

01:57:24
Lex Fridman
Help you with. And it requires even the common-sense expertise that we've been talking about: how to take language-based instructions and materialize them in the physical world requires a lot of knowledge that's not in the instructions.

01:57:41
Yann LeCun
Yeah, exactly. A lot of biologists have posted on this, actually, in response to those things saying, like, do you realize how hard it is to actually do the lab work? This is not trivial.

01:57:51
Lex Fridman
Yeah, and that's where Hans Moravec comes to light once again. Just to linger on Llama: Mark announced that Llama 3 is coming out eventually. I don't think there's a release date.

01:58:03
Yann LeCun
But what are you most excited about.

01:58:06
Lex Fridman
First of all, Llama 2 that's already out there, and maybe the future Llama 3, 4, 5, 6, 10, just the future of open source under Meta?

01:58:17
Yann LeCun
Well, a number of things. There are going to be various versions of Llama that are improvements of previous Llamas: bigger, better, multimodal, things like that. And then in future generations, systems that are capable of planning, that really understand how the world works, maybe trained from video so they have some world model, maybe capable of the type of reasoning and planning I was talking about earlier. How long is that going to take? When is the research going in that direction going to feed into the product line, if you want, of Llama? I don't know. I can't tell you. And there are a few breakthroughs that we have to basically go through before we can get there. But you'll be able to monitor our progress because we publish our research, right.

01:59:07
Yann LeCun
So last week we published the V-JEPA work, which is sort of a first step towards training systems from video. And then the next step is going to be world models based on this type of idea, training from video. There's similar work taking place at DeepMind, and also at UC Berkeley, on world models from video. A lot of people are working on this. I think a lot of good ideas are appearing. My bet is that those systems are going to be JEPA-like; they're not going to be generative models, and we'll see what the future will tell. There's really good work by.

01:59:54
Yann LeCun
A gentleman.

01:59:54
Yann LeCun
Called Danijar Hafner, who is now at DeepMind, who's worked on models of this type that learn representations and then use them for planning or learning tasks by reinforcement learning. And there's a lot of work at Berkeley by Levine and a bunch of other people of that type, whom I'm collaborating with, actually, in the context of some grants with my NYU hat, and then collaborations also through Meta, because the lab at Berkeley is associated with Meta in some way, so with FAIR. So I think it's very exciting. I'm super excited about it. I haven't been that excited about the direction of machine learning and AI since ten years ago, when FAIR was started, and before that, 30, 35 years ago, when we were working on convolutional nets and the early days of neural nets.

02:00:52
Yann LeCun
So I'm super excited because I see a path towards potentially human-level intelligence with systems that can understand the world, remember, plan, and reason. There is some set of ideas to make progress there that might have a chance of working, and I'm really excited about this. What I'd like is that we somehow get onto a good direction and perhaps succeed before my brain turns to white sauce, or before I need to retire.

02:01:28
Lex Fridman
Yeah, you're also excited by.

02:01:34
Lex Fridman
Is it.

02:01:34
Lex Fridman
Beautiful to you, just the amount of GPUs involved, sort of the whole training process on this much compute? Just zooming out, looking at Earth: humans together have built these computing devices and are able to train this one brain, which we then open source, like giving birth to this open source brain trained on this gigantic compute system. There are just the details of how to train on that, how to build the infrastructure and the hardware, the cooling, all of this kind of stuff. Or is most of your excitement in the theory aspect of it, meaning the software?

02:02:19
Yann LeCun
Well, I used to be a hardware guy many years ago. Yes, decades ago.

02:02:23
Yann LeCun
Hardware has improved a little bit.

02:02:26
Lex Fridman
Yeah.

02:02:27
Yann LeCun
I mean, certainly scale is necessary, but not sufficient.

02:02:32
Lex Fridman
Absolutely.

02:02:32
Yann LeCun
So we certainly need the computation. I mean, we're still far, in terms of compute power, from what we would need to match the compute power of the human brain. This may occur in the next couple of decades, but we're still some ways away. And certainly in terms of power efficiency, we're really far. So there's a lot of progress to make in hardware. And right now, a lot of the progress, well, there's a bit coming from silicon technology, but a lot of it is coming from architectural innovation, and quite a bit.

02:03:07
Yann LeCun
Coming from more efficient ways of implementing.

02:03:11
Yann LeCun
The architectures that have become popular, basically combinations of transformers and ConvNets. Right. There's still some ways to go until we saturate. We're going to have to come up with new principles, new fabrication technology, new basic components, perhaps based on different principles than classical digital CMOS.

02:03:41
Lex Fridman
Interesting. So you think in order to build.

02:03:46
Lex Fridman
AMI, we potentially might need some hardware innovation too?

02:03:52
Yann LeCun
Well, if we want to make it ubiquitous, yeah, certainly, because we're going to have to reduce the power consumption. A GPU today is half a kilowatt to a kilowatt, and the human brain is about 25 watts. And a GPU is way below the compute power of the human brain; you need something like 100,000 or a million of them to match it.

02:04:17
Yann LeCun
We are off by a huge factor here.

02:04:21
Lex Fridman
You often say that AGI is not coming soon, meaning like, not this year, not the next few years, potentially farther away. What's your basic intuition behind that?

02:04:35
Yann LeCun
So, first of all, it's not going.

02:04:37
Yann LeCun
To be an event.

02:04:38
Yann LeCun
The idea, you know, popularized by science fiction and Hollywood, that somehow somebody is going to discover the secret to AGI, or human-level AI, or AMI, whatever you want to call it, and then turn on a machine, and then we have AGI. That's just not going to happen. It's not going to be an event.

02:04:59
Yann LeCun
It's going to be gradual progress. Are we going to have systems that can learn from video how the world works and learn good representations? Yeah.

02:05:10
Yann LeCun
Before we get them to the scale and performance that we observe in humans.

02:05:14
Yann LeCun
It's going to take quite a while.

02:05:15
Yann LeCun
It's not going to happen in one day. Are we going to get systems that can have a large amount of associative memory.

02:05:24
Yann LeCun
So they can remember stuff?

02:05:26
Yann LeCun
Yeah, but same, it's not going to happen tomorrow. I mean, there are some basic techniques that need to be developed. We have a lot of them, but getting this to work together in a full system is another story. Are we going to have systems that can reason and plan, perhaps along the lines of the objective-driven AI architectures that I described before? Yeah, but before we get this to work properly, it's going to take a while. And before we get all those things to work together, and then on top of this have systems that can learn hierarchical planning, hierarchical representations, systems that can be configured for a lot of different situations at hand, the way the human brain can.

02:06:04
Yann LeCun
All of this is going to take.

02:06:06
Yann LeCun
At least a decade and probably much.

02:06:08
Yann LeCun
More, because there are a lot of.

02:06:10
Yann LeCun
Problems that we're not seeing right now, that we have not encountered, and so we don't know if there is an easy solution within this framework. So it's not just around the corner. I mean, I've been hearing people for the last 12 to 15 years claiming that AGI is just around the corner and being systematically wrong. And I knew they were wrong when.

02:06:33
Yann LeCun
They were saying it.

02:06:34
Yann LeCun
I called their bullshit.

02:06:36
Lex Fridman
Why do you think people have been claiming that? First of all, from the birth of the term artificial intelligence, there has been eternal optimism that's perhaps unlike other technologies. Is Moravec's paradox the explanation for why people are so optimistic about AGI?

02:06:56
Yann LeCun
I don't think it's just Moravec's paradox. Moravec's paradox is a consequence of realizing that the world is not as easy as we think. So, first of all, intelligence is not a linear thing that you can measure.

02:07:09
Yann LeCun
With a scalar, with a single number.

02:07:13
Yann LeCun
Can you say that humans are smarter than orangutans?

02:07:18
Yann LeCun
In some ways, yes.

02:07:19
Yann LeCun
But in some ways, orangutans are smarter than humans in a lot of domains. That allows them to survive in the forest, for example.

02:07:26
Lex Fridman
So IQ is a very limited measure of intelligence. Intelligence is bigger than what IQ, for example, measures?

02:07:33
Yann LeCun
Well, IQ can measure approximately something for humans.

02:07:41
Yann LeCun
Because humans kind of come in relatively kind of uniform form.

02:07:48
Yann LeCun
Right? But it only measures one type of ability that may be relevant for some tasks but not others. But then if you are talking about other intelligent entities for which the basic things that are easy to them is.

02:08:07
Yann LeCun
Very different, then it doesn't mean anything. So intelligence is a collection of skills.

02:08:17
Yann LeCun
And an ability to acquire new skills efficiently.

02:08:22
Yann LeCun
Right.

02:08:23
Yann LeCun
And the collection of skills that an.

02:08:26
Yann LeCun
Intelligent, particular intelligent entity possess or is.

02:08:30
Yann LeCun
Capable of learning quickly is different from the collection of skills of another one. And because it's a multidimensional thing, the set of skills is high dimensional space. You cannot compare two things as to whether one is more intelligent than the other.

02:08:45
Yann LeCun
It's multidimensional.

02:08:48
Lex Fridman
So you push back against what are called AI doomers a lot. Can you explain their perspective and why you think they're wrong?

02:08:59
Yann LeCun
Okay, so AI doomers imagine all kinds of catastrophe scenarios of how AI could escape our control and basically kill us all. And that relies on a whole bunch of assumptions that are mostly false. The first assumption is that the emergence of superintelligence is going to be an event: at some point we're going to figure out the secret, and we'll turn on a machine that is superintelligent, and because we'd never done it before, it's going to take over the world and kill us all. That is false.

02:09:33
Yann LeCun
It's not going to be an event.

02:09:35
Yann LeCun
We're going to have systems that are.

02:09:37
Yann LeCun
Like as smart as a cat, have.

02:09:40
Yann LeCun
All the characteristics of human level intelligence, but their level of intelligence would be.

02:09:46
Yann LeCun
Like a cat or a parrot maybe, or something.

02:09:51
Yann LeCun
And then we're going to work our way up to make those things more intelligent. And as we make them more intelligent, we're also going to put some guardrails in them and learn how to put in guardrails so they behave properly. And we're not going to do this with just one; it's not going to be one effort, it's going to be lots of different people doing this. And some of them are going to succeed at making intelligent systems that are controllable and safe and have the right guardrails. And if some others go rogue, then we can use the good ones to go against the rogue ones. So it's going to be my smart AI police against your rogue AI.

02:10:25
Yann LeCun
So it's not going to be like we're going to be exposed to a single rogue AI that's going to kill us all. That's just not happening. Now, there is another fallacy, which is the fact that because the system is intelligent, it necessarily wants to take over.

02:10:40
Yann LeCun
And there is several arguments that make.

02:10:43
Yann LeCun
People scared of this, which I think are completely false as well. One of them is that, in nature, it seems that the more intelligent species are the ones that end up dominating the others, and even extinguishing the others, sometimes by design, sometimes just by mistake. There is a sort of thinking by which you say, well, if AI systems are more intelligent than us, surely they're going to eliminate us, if not by design then simply because they don't care about us. And that's just preposterous for a number of reasons. The first reason is they're not going to be a species. They're not going to be a species that competes with us. They're not going to have the desire to dominate, because the desire to dominate is something that has to be hardwired into an intelligent system.

02:11:41
Yann LeCun
It is hardwired in humans.

02:11:44
Yann LeCun
It is hardwired in baboons, in chimpanzees.

02:11:47
Yann LeCun
In wolves, not in orangutans.

02:11:51
Yann LeCun
This desire to dominate or submit or attain status in.

02:11:58
Yann LeCun
Other ways is specific to social species. Non social species like orangutans don't have it, right?

02:12:06
Yann LeCun
And they are as smart as we are, almost.

02:12:08
Yann LeCun
Right.

02:12:09
Lex Fridman
And to you, there's no significant incentive for humans to encode that into the AI systems. And to the degree they do, there will be other AIs that sort of punish them for it, outcompete them.

02:12:23
Yann LeCun
Well, there are all kinds of incentives to make AI systems submissive to humans, right? I mean, this is the way we're going to build them, right? So then people say, oh, but look at LLMs. LLMs are not controllable.

02:12:33
Yann LeCun
And they're right, LLMs are not controllable.

02:12:36
Yann LeCun
But objective-driven AI, so systems that derive their answers by optimization of an objective, means they have to optimize this objective, and that objective can include guardrails.

02:12:48
Yann LeCun
One guardrail is obey humans.

02:12:52
Yann LeCun
Another guardrail is don't obey humans if it's hurting other humans within.

02:12:57
Lex Fridman
I've heard that before somewhere. I don't remember.

02:12:59
Yann LeCun
Yes, maybe in a book.

02:13:02
Lex Fridman
Yeah, but speaking of that book, could there be unintended consequences also from all of this?

02:13:09
Yann LeCun
No, of course. So this is not a simple problem, right? I mean, designing those guardrails so that the system behaves properly is not going to be a simple issue for which there is a silver bullet, for which you have a mathematical proof that the system can be safe. It's going to be a very progressive, iterative design process where we put those guardrails in such a way that the system behaves properly. And sometimes they're going to do something that was unexpected because a guardrail wasn't right, and we're going to correct them so that they do it right. The idea that somehow we can't get it slightly wrong, because if we get it slightly wrong we all die, is ridiculous. We're just going to go progressively. And the analogy I've used many times is turbojet design.

02:14:00
Yann LeCun
How did we figure out how to make turbojets so unbelievably reliable? Right? I mean, those are like incredibly complex pieces of hardware that run at really high temperatures for 20 hours at a time sometimes. And we can fly halfway around the world on a two engine jetliner at near the speed of sound. How incredible is this?

02:14:28
Yann LeCun
It's just unbelievable. And did we do this because we.

02:14:34
Yann LeCun
Invented, like a general principle of how to make turbojets safe? No, it took decades to kind of fine tune the design of those systems so that they were safe.

02:14:43
Yann LeCun
Is there a separate group within General Electric or Snecma or whatever that is specialized in turbojet safety? No, the design is all about safety.

02:14:58
Yann LeCun
Because a better turbojet is also a safer turbojet, so a more reliable one. It's the same for AI. Do you need specific provisions to make AI safe? No, you need to make better AI systems, and they will be safe because they are designed to be more useful and more controllable.

02:15:16
Lex Fridman
So let's imagine an AI system that's able to be incredibly convincing and can convince you of anything. I can at least imagine such a system, and I can see such a system being weaponized, because it can control people's minds. We're pretty gullible; we want to believe a thing. And you could have an AI system that controls that, and you could see governments using that as a weapon. So do you think, if you imagine such a system, there's any parallel to something like nuclear weapons?

02:15:53
Yann LeCun
No.

02:15:54
Lex Fridman
So why is that technology different? So you're saying there's going to be gradual development. It might be rapid, but it'll be iterative, and then we'll be able to kind of respond and so on.

02:16:09
Yann LeCun
So that AI system designed by Vladimir Putin or whatever or his minions, is going to be trying to talk to every American to convince them to vote for whoever pleases Putin or whatever.

02:16:32
Yann LeCun
Or.

02:16:33
Yann LeCun
Rile people up against each other as they've been trying to do. They're not going to be talking to you. They're going to be talking to your AI assistant, which is going to be as smart as theirs.

02:16:47
Yann LeCun
Right?

02:16:48
Yann LeCun
Because, as I said, in the future, every single one of your interactions with the digital world will be mediated by your AI assistant. So the first thing you're going to ask is: is this a scam? Is this thing telling me the truth? It's not even going to be able to get to you, because it's only going to talk to your AI assistant. It's going to be like a spam filter, right? You're not even seeing the spam email; it's automatically put in a folder that you never see. It's going to be the same thing. That AI system that tries to convince you of something is going to be talking to your AI assistant, which is going to be at least as smart as it is, and it's going to say, this is spam.

02:17:29
Yann LeCun
It's not even going to bring it to your attention.

02:17:32
Lex Fridman
So to you, it's very difficult for any one AI system to take such a big leap ahead to where it can convince even the other AI systems. There's always going to be this kind of race where nobody's way ahead.

02:17:46
Yann LeCun
That's the history of the world.

02:17:48
Yann LeCun
History of the world is whenever there is a progress, someplace, there is a countermeasure. It's a cat and mouse game.

02:17:57
Lex Fridman
Mostly, yes. But this is why nuclear weapons are so interesting, because that was such a powerful weapon that it mattered who got it. You could imagine Hitler, Stalin, or Mao getting the weapon first, and that having a different kind of impact on the world than the United States getting the weapon first. To you, with nuclear weapons, you don't imagine a breakthrough discovery and then a Manhattan Project-like effort for AI?

02:18:35
Yann LeCun
No.

02:18:36
Yann LeCun
As I said, it's not going to be an event.

02:18:39
Yann LeCun
It's going to be continuous progress. And whenever one breakthrough occurs, it's going to be widely disseminated really quickly, probably first within industry. I mean, this is not a domain where government or military organizations are particularly innovative; they're in fact way behind. So this is going to come from industry, and this kind of information disseminates extremely quickly. We've seen this over the last few years, right? Even take AlphaGo: this was reproduced within three months, even without particularly detailed information.

02:19:17
Yann LeCun
Right? Yeah.

02:19:18
Lex Fridman
This is an industry that's not good at secrecy.

02:19:21
Yann LeCun
No, but even if there is, just the fact that you know that something is possible makes you realize that it's worth investing the time to actually do it. You may be the second person to do it, but you'll do it. And the same for all the innovations of self-supervised learning, transformers, decoder-only architectures, LLMs. For those things, you don't need to know exactly the details of how they work to know that it's possible, because it's deployed and then it gets reproduced.

02:19:54
Yann LeCun
And then people who work for those companies move.

02:20:00
Yann LeCun
They go from one company to another and the information disseminates. What makes the success of the US tech industry, and Silicon Valley in particular, is exactly that. It's because information circulates really quickly and disseminates very quickly. And so the whole region sort of is ahead because of that circulation of information.

02:20:24
Lex Fridman
Maybe just to linger on the psychology of AI doomers, you give, in the classic Yann LeCun way, a pretty good example of when a new technology comes to be. You say: an engineer says, I invented this new thing. I call it a ball pen. And then the Twittersphere responds: OMG, people could write horrible things with it, like misinformation, propaganda, hate speech. Ban it.

02:20:51
Yann LeCun
Now.

02:20:52
Lex Fridman
Then writing doomers come in, akin to the AI doomers: imagine if everyone can get a ball pen. This could destroy society. There should be a law against using ball pens to write hate speech. Regulate ball pens now. And then the pencil industry mogul says: yeah, ball pens are very dangerous. Unlike pencil writing, which is erasable, ball pen writing stays forever. Government should require a license for pen manufacturers. I mean, this does seem to be part of human psychology when it comes up against new technology. What deep insights can you speak to about this?

02:21:37
Yann LeCun
Well, there is a natural fear of new technology and the impact it can have on society. And people have kind of instinctive reaction.

02:21:49
Yann LeCun
To.

02:21:51
Yann LeCun
The world they know being threatened by major transformations that are either cultural phenomena or technological revolutions. And they fear for their culture, they fear for their job, they fear for the future of their children. And their way of life, right?

02:22:14
Yann LeCun
So any change is feared.

02:22:17
Yann LeCun
And you see this along history, like any technological revolution or cultural phenomenon, was always accompanied by.

02:22:27
Yann LeCun
Groups or reaction in.

02:22:29
Yann LeCun
The media that basically attributed all the problems, the current problems of society, to that particular change, right? Electricity was going to kill everyone at some point. The train was going to be a horrible thing because you can't breathe past 50. So there's a wonderful website called the Pessimists Archive, which has all the newspaper clips of all the horrible things people imagined would arrive because of either technological innovation or a cultural phenomenon. There are wonderful examples of jazz or comic books being blamed for unemployment, or for young people not wanting to work anymore, and things like that, right? And that has existed for centuries.

02:23:32
Yann LeCun
And it's knee jerk reactions.

02:23:38
Yann LeCun
The question is, do we embrace change.

02:23:42
Yann LeCun
Or do we resist it? And what are the real dangers as.

02:23:47
Yann LeCun
Opposed to the imagined ones?

02:23:51
Lex Fridman
So, I think one thing people worry about with big tech, something we've been talking about over and over but worth mentioning again, is how powerful AI will be, and they worry about it being in the hands of one centralized power, of just a handful of companies in central control. And so that's the skepticism with big tech: these companies can make a huge amount of money and control this technology, and by so doing take advantage of, abuse, the little guy in society.

02:24:28
Yann LeCun
Well, that's exactly why we need open source platforms.

02:24:31
Lex Fridman
Yeah, I just wanted to nail the point home more and more.

02:24:37
Yann LeCun
Yes.

02:24:38
Lex Fridman
So let me ask you on your.

02:24:40
Lex Fridman
Like I said, you do get a.

02:24:41
Lex Fridman
Little bit flavorful on the Internet. Joscha Bach tweeted something that you LOL'd at, in reference to HAL 9000. Quote: I appreciate your argument, and I fully understand your frustration, but whether the pod bay doors should be opened or closed is a complex and nuanced issue. So you're at the head of Meta AI. This is something that really worries me, that our AI overlords will speak down to us with corporate speak of this nature, and you sort of resist that with your way of being. Is this something you can comment on, sort of, working at a big company: how can you avoid the over-fearing that, I suppose through caution, creates harm?

02:25:41
Yann LeCun
Yeah.

02:25:42
Yann LeCun
Again, I think the answer to this is open source platforms and then enabling a widely diverse set of people to.

02:25:51
Yann LeCun
Build AI assistants that represent the diversity.

02:25:55
Yann LeCun
Of cultures, opinions, languages and value systems across the world. So that you're not bound to just be brainwashed by a particular way of thinking because of a single AI entity. I think it's really important question for society and the problem I'm seeing is.

02:26:18
Yann LeCun
That, which is why I've been so vocal and sometimes a little sardonic about it.

02:26:25
Lex Fridman
Never stop.

02:26:26
Yann LeCun
Never stop.

02:26:26
Yann LeCun
Yeah, we love it. It is because I see the danger of this concentration of power through proprietary AI systems as a.

02:26:37
Yann LeCun
Much bigger danger than everything else, that.

02:26:40
Yann LeCun
If we really want diversity of opinion.

02:26:45
Yann LeCun
AI systems, in the future where.

02:26:49
Yann LeCun
We'll all be interacting through AI systems, we need those to be diverse for the preservation of diversity of ideas and creeds and political opinions and whatever.

02:27:05
Yann LeCun
And.

02:27:06
Yann LeCun
The preservation of democracy.

02:27:07
Yann LeCun
And what works against this is people.

02:27:12
Yann LeCun
Who think that for reasons of security, we should keep AI systems under lock.

02:27:18
Yann LeCun
And key, because it's too dangerous to.

02:27:20
Yann LeCun
Put it in the hands of everybody, because it could be used by terrorists.

02:27:25
Yann LeCun
Or something. That would lead, potentially, to.

02:27:34
Yann LeCun
A very bad future in which all of our information diet is controlled by a small number of companies through proprietary systems.

02:27:44
Lex Fridman
Do you trust humans with this technology to build systems that are, on the whole, good for humanity?

02:27:53
Yann LeCun
Isn't that what democracy and free speech is all about?

02:27:56
Lex Fridman
I think so.

02:27:57
Yann LeCun
Do you trust institutions to do the right thing? Do you trust people to do the right thing? And, yeah, there's bad people who are going to do bad things, but they're not going to have superior technology to the good people. So then it's going to be my good AI against your bad AI, right?

02:28:12
Yann LeCun
I mean, it's the examples that we.

02:28:15
Yann LeCun
Were just talking about of maybe some rogue country will build some AI system that's going to try to convince everybody.

02:28:23
Yann LeCun
To go into a civil war or.

02:28:27
Yann LeCun
Something, or elect favorable ruler, but then they will have to go past our AI systems.

02:28:36
Lex Fridman
An AI system with a strong russian accent will be trying to convince our.

02:28:40
Yann LeCun
And doesn't put any articles in their sentences.

02:28:45
Lex Fridman
Well, it'll be, at the very least, absurd. Since we talked about the physical reality, I'd love to ask about your vision of the future with robots in this physical reality. So many of the kinds of intelligence you've been speaking about would empower robots to be more effective collaborators with us humans. Since Tesla's Optimus team has been showing off some progress on humanoid robots, I think it really reinvigorated the whole industry. I think Boston Dynamics has been leading for a very long time. So now there are all kinds of companies: Figure AI, obviously Boston Dynamics, Unitree. There are a lot of them. It's great. I love it. So do you think there will be millions of humanoid robots walking around soon?

02:29:43
Yann LeCun
Not soon, but it's going to happen. The next decade, I think, is going to be really interesting for robots. The emergence of the robotics industry has been in waiting for 10, 20 years without really emerging, other than for kind of pre-programmed behavior and stuff like that.

02:30:05
Yann LeCun
And the main issue is, again, the.

02:30:07
Yann LeCun
Moravec paradox, like, you know, how do we get those systems to understand how the world works and plan actions? We can do it for really specialized tasks. And the way Boston Dynamics goes about it is basically with a lot of handcrafted dynamical models and careful planning in advance, which is very classical robotics with a lot of innovation and a little bit of perception. But it's still not there; they can't build a domestic robot, right? And we're still some distance away from completely autonomous Level 5 driving, and we're certainly very far away from having Level 5 autonomous driving by a system that.

02:30:54
Yann LeCun
Can train itself by driving 20 hours.

02:30:57
Yann LeCun
Like any 17 year old.

02:31:01
Yann LeCun
So until we have, again, world models.

02:31:07
Yann LeCun
Systems that can train themselves to understand how the world works, we're not going to have significant progress in robotics. So a lot of the people working on robotic hardware at the moment are betting or banking on the fact that AI is going to make sufficient progress.

02:31:27
Lex Fridman
Towards that, and they're hoping to discover a product in it, too. Before you have a really strong world model, there'll be an almost strong world model, and people are trying to find a product in a clumsy robot, I suppose, like not a perfectly efficient robot. So there's the factory setting where humanoid robots can help automate some aspects of the factory. I think that's a crazy difficult task because of all the safety required and all this kind of stuff. I think in the home is more interesting. But then you start to think, I think you mentioned loading the dishwasher, right? Yeah, I suppose that's one of the main problems you're working on.

02:32:07
Yann LeCun
I mean, there's cleaning the house, clearing up the table after a meal, washing the dishes, all those tasks, cooking, all the tasks that in principle could be automated, but are actually incredibly sophisticated, really complicated.

02:32:28
Lex Fridman
But even just basic navigation around a space full of uncertainty, that sort of works?

02:32:33
Yann LeCun
Like, you can sort of do this.

02:32:34
Yann LeCun
Now, navigation is fine.

02:32:37
Lex Fridman
Well, navigation in a way that's compelling to us humans is a different thing.

02:32:42
Yann LeCun
Yeah, it's not necessarily going to be. I mean, we have demos, actually, because there is a so-called embodied AI group at FAIR, and they've not been building their own robots, but using commercial robots. And you can tell a robot dog, go to the fridge, and it can actually open the fridge, and it can probably pick up a can in the fridge and stuff like that and bring it to you. So it can navigate, it can grab objects, as long as it's been trained to recognize them, and vision systems work pretty well nowadays. But it's not like.

02:33:20
Yann LeCun
A completely general robot that would be sophisticated enough to do things like clearing.

02:33:28
Yann LeCun
Up the dinner table.

02:33:30
Yann LeCun
Yeah.

02:33:31
Lex Fridman
To me, that's an exciting future: getting humanoid robots, robots in general, into the home more and more, because that gets humans to really directly interact with AI systems in the physical space. And in so doing, it allows us to philosophically and psychologically explore our relationships with robots, which can be really interesting. So I hope you make progress on the whole JEPA thing soon.

02:33:54
Yann LeCun
Well, I hope things kind of work as planned. Again, we've been kind of working on this idea of self supervised learning from video for ten years, and only made significant progress in the last two or three.

02:34:12
Lex Fridman
And actually, you've mentioned that there's a lot of interesting breakthroughs that can happen without having access to a lot of compute. So if you're interested in doing a PhD in this kind of stuff, there's a lot of possibilities still to do innovative work. So what advice would you give to an undergrad that's looking to go to grad school and do a PhD?

02:34:32
Yann LeCun
So basically, I've listed them already. This idea of how do you train a world model?

02:34:37
Yann LeCun
By observation. And you don't have to train necessarily.

02:34:41
Yann LeCun
On gigantic data sets. It could turn.

02:34:46
Yann LeCun
Out to be necessary to actually train on large data sets to have emergent properties like we have with LLMs, but I think there are a lot of good ideas that can be pursued without necessarily scaling up. Then there is: how do you do planning with a learned world model? If the world the system evolves in is not the physical world, but is, let's say, the world of the Internet, or some sort of world where an action consists of doing a search in a search engine, or interrogating a database, or running a simulation, or calling a calculator, or solving a differential equation, how do you get a system to actually plan a sequence of actions to give the solution to a problem? The question of planning is not just a question of planning physical actions.
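As a toy illustration of planning over non-physical actions, here is a small hypothetical Python sketch: each action is a tool call (a search, a calculator), a stand-in "world model" predicts the state each call produces, and a brute-force search finds a short sequence of calls that satisfies the objective. The tools, the hard-coded results, and the exhaustive search are all illustrative assumptions, not a description of any real system; a learned world model and a far more efficient planner would replace them in practice.

```python
from itertools import product

def world_model(state, action):
    """Predict the next state after a tool call (here, trivially deterministic)."""
    facts = dict(state)
    if action == "search_population":
        facts["population"] = 67_000_000        # pretend result of a search-engine call
    elif action == "search_area":
        facts["area_km2"] = 643_801             # pretend result of another search
    elif action == "calculator_divide":
        if "population" in facts and "area_km2" in facts:
            facts["density"] = facts["population"] / facts["area_km2"]
    return tuple(sorted(facts.items()))

def goal(state):
    # Objective: end up in a state where the density has been computed.
    return any(key == "density" for key, _ in state)

def plan(start, actions, max_depth=3):
    """Exhaustive search over short action sequences, simulated with the world model."""
    for depth in range(1, max_depth + 1):
        for seq in product(actions, repeat=depth):
            state = start
            for a in seq:
                state = world_model(state, a)
            if goal(state):
                return seq
    return None

actions = ["search_population", "search_area", "calculator_divide"]
print(plan(start=(), actions=actions))
# e.g. ('search_population', 'search_area', 'calculator_divide')
```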

02:35:35
Yann LeCun
It could be planning actions to use tools for a dialogue system, or for any kind of intelligent system. And there's some work on this, but not a huge amount. There's some work at FAIR, one called Toolformer, which was a couple of years ago, and some more recent work on planning. But I don't think we have a good solution for any of that. Then there is the question of hierarchical planning. So the example I mentioned of planning.

02:36:07
Yann LeCun
A trip from New York to Paris, that's hierarchical.

02:36:11
Yann LeCun
But almost every action that we take involves hierarchical planning in some sense, and we really have absolutely no idea how to do this. There's zero demonstration of hierarchical planning in AI where the various levels of representations that are necessary have been learned. We can do two-level hierarchical planning when we design the two levels. So, for example, you have a dog-like robot, right? You want it to go from the living room to the kitchen. You can plan a path that avoids.

02:36:49
Yann LeCun
The obstacle, and then you can send.

02:36:53
Yann LeCun
This to a lower-level planner that figures out how to move the legs to follow that trajectory, right? So that works. But that two-level planning is designed by hand, right?

02:37:05
Yann LeCun
We specify what the proper levels of.

02:37:09
Yann LeCun
Abstraction, the representation at each level of abstraction have to be. How do you learn this?

02:37:14
Yann LeCun
How do you learn that?

02:37:15
Yann LeCun
Hierarchical representation of action plans? With ConvNets and deep learning, we can train a system to learn hierarchical representations of percepts. What is the equivalent when what you're trying to represent are action plans? For action plans, yeah.
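To make the hand-designed two-level hierarchy concrete, here is a minimal, hypothetical Python sketch along the lines of the dog-like robot example: a high-level planner searches a coarse grid map for a collision-free path, and a separate low-level controller turns each waypoint into small step commands, standing in for leg control. The grid, the search method, and the step interface are illustrative assumptions, not anyone's actual robot stack; the point is that both levels of abstraction are specified by hand rather than learned.

```python
from collections import deque

def plan_path(grid, start, goal):
    """High level: breadth-first search over free grid cells (0 = free, 1 = obstacle)."""
    queue, came_from = deque([start]), {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in came_from):
                came_from[nxt] = cell
                queue.append(nxt)
    path, cell = [], goal
    while cell is not None:          # walk back from goal to start
        path.append(cell)
        cell = came_from.get(cell)
    return list(reversed(path))

def follow_waypoint(pose, waypoint):
    """Low level: emit a unit step command toward the next waypoint (stands in for leg control)."""
    dr, dc = waypoint[0] - pose[0], waypoint[1] - pose[1]
    return ("step", (max(-1, min(1, dr)), max(-1, min(1, dc))))

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
path = plan_path(grid, start=(0, 0), goal=(2, 0))   # living room to kitchen, say
pose = (0, 0)
for waypoint in path[1:]:
    print(follow_waypoint(pose, waypoint))
    pose = waypoint
```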

02:37:32
Lex Fridman
So you want basically a robot dog or humanoid robot that turns on and travels from New York to Paris all.

02:37:39
Yann LeCun
By itself, for example.

02:37:42
Lex Fridman
All right. It might have some trouble at the.

02:37:46
Yann LeCun
Yeah, no, but even doing something fairly.

02:37:48
Yann LeCun
Simple, like a household task, like cooking or something.

02:37:53
Lex Fridman
Yeah, there's a lot involved. It's a super complex task, and once again, we take it for granted. What hope do you have for the future of humanity? We're talking about so many exciting technologies, so many exciting possibilities. What gives you hope when you look out over the next 10, 20, 50, 100 years? If you look at social media, there are wars going on, there's division, there's hatred, all this kind of stuff that's also part of humanity. But amidst all that, what gives you hope?

02:38:28
Yann LeCun
I love that question. We can make humanity smarter with AI. Okay.

02:38:40
Yann LeCun
I mean, AI basically will amplify human intelligence. It's as if every one of us will have a staff of smart AI assistants.

02:38:52
Yann LeCun
They might be smarter than us. They'll do our bidding, perhaps execute tasks in ways that are much better than we could do ourselves, because they'd be smarter than us.

02:39:07
Yann LeCun
And so it's like everyone would be the boss of a staff of super smart virtual people. So we shouldn't feel threatened by this any more than we should feel threatened by being the manager of a group of people, some of whom are more intelligent than us. I certainly have a lot of experience with this, of having people working with.

02:39:33
Yann LeCun
Me who are smarter than me.

02:39:35
Yann LeCun
That's actually a wonderful thing. So having machines that are smarter than us, that assist us in all of our tasks, our daily lives, whether professional or personal, I think would be an absolutely wonderful thing, because intelligence is the commodity that is most in demand. That's really what I mean. All the mistakes that humanity makes are because of lack of intelligence, really, or lack of knowledge, which is related. So making people smarter can only be better. I mean, for the same reason that public education is a good thing, and.

02:40:13
Yann LeCun
Books are a good thing, and the.

02:40:15
Yann LeCun
Internet is also a good thing intrinsically. And even social networks are a good thing. If you run them properly, it's difficult, but you can, because it helps the communication of information and knowledge and the transmission of knowledge. So AI is going to make humanity smarter.

02:40:37
Yann LeCun
And the analogy I've been using is.

02:40:41
Yann LeCun
The fact that perhaps an equivalent event in the history of humanity to what might be provided by generalization of AI assistant is the invention of the printing press. It made everybody smarter. The fact that people could have access to books. Books were a lot cheaper than they.

02:41:05
Yann LeCun
Were before, and so a lot more.

02:41:08
Yann LeCun
People had an incentive to learn to.

02:41:10
Yann LeCun
Read, which wasn't the case before. And.

02:41:16
Yann LeCun
People became smarter.

02:41:18
Yann LeCun
It enabled the Enlightenment, right?

02:41:21
Yann LeCun
There wouldn't be an enlightenment without the printing press.

02:41:24
Yann LeCun
It enabled philosophy, rationalism, escape from religious doctrine, democracy, science. And certainly without this, there wouldn't have.

02:41:44
Yann LeCun
Been the American Revolution or the French Revolution, and we'd still be under feudal regimes, perhaps. And so it completely transformed the world, because people became smarter and learned about things.

02:42:01
Yann LeCun
Now, it also created 200 years of.

02:42:05
Yann LeCun
Essentially religious conflicts in Europe, right? Because the first thing that people read was the Bible, and they realized that perhaps there was a different interpretation of the Bible than what the priests were telling them. And so that created the Protestant movement and created a rift. And in fact, the Catholic Church didn't like the idea of the printing press, but they had no choice. So it had some bad effects and some good effects. I don't think anyone today would say that the invention of the printing press had an overall negative effect, despite the fact that it created 200 years of.

02:42:41
Yann LeCun
Religious conflicts in Europe. Now, compare this.

02:42:46
Yann LeCun
And I thought I was very proud of myself for coming up with this analogy, but I realized someone else came up with the same idea before me. Compare this with what happened in the Ottoman Empire. The Ottoman Empire banned the printing press for 200 years, and it didn't ban it for all languages, only for Arabic. You could actually print books in Latin.

02:43:14
Yann LeCun
Or Hebrew or whatever in the Ottoman.

02:43:17
Yann LeCun
Empire, just not in Arabic. And I thought it was because the rulers just wanted to preserve control over the population and the dogma, the religious dogma and everything. But after talking with the UAE minister of AI, Omar Al Olama, he told me no, there was another reason. And the other reason was to preserve the corporation of calligraphers, right?

02:43:53
Yann LeCun
There's like an art form which is.

02:43:57
Yann LeCun
Writing those beautiful Arabic poems or whatever religious texts in this way. And it was a very powerful corporation of scribes, basically, that ran a big chunk of the empire. They couldn't put them out of business, so they banned the printing press.

02:44:16
Yann LeCun
In part to protect that business.

02:44:21
Yann LeCun
Now, what's the analogy for AI today? Who are we protecting by banning AI? Who are the people who are asking.

02:44:26
Yann LeCun
That AI be regulated to protect their jobs?

02:44:31
Yann LeCun
And of course, it's a real question what the effect of a technological transformation like AI will be on the job market and the labor market. There are economists who are much more expert at this than I am, but when I talk to them, they tell us we're not going to run out of jobs. This is not going to cause mass unemployment. This is just going to be a gradual shift of different professions. The professions that are going to be hot ten or fifteen years from now, we have no idea today what they're going to be. The same way, if we go back 20 years in.

02:45:11
Yann LeCun
The past, who could have thought 20.

02:45:14
Yann LeCun
Years ago that the hottest job, even.

02:45:17
Yann LeCun
Like 5 or 10 years ago, was mobile app developer? Smartphones hadn't even been invented.

02:45:23
Lex Fridman
Most of the jobs of the future might be in the metaverse.

02:45:26
Yann LeCun
Well, it could be, yeah.

02:45:28
Lex Fridman
But the point is, you can't possibly predict. But you're right, you made a lot of strong points. And I believe that people are fundamentally good. And so if AI, especially open source AI, can make them smarter, it just empowers the goodness in humans.

02:45:48
Yann LeCun
So I share that feeling, okay. I think people are fundamentally good, and in fact, a lot of doomers are doomers, because they don't think that people.

02:45:58
Yann LeCun
Are fundamentally good and they either don't.

02:46:02
Yann LeCun
Trust people or they don't trust the institution to do the right thing so that people behave properly.

02:46:10
Lex Fridman
Well, I think both you and I believe in humanity, and I think I speak for a lot of people in saying thank you for pushing the open source movement, pushing to make research in AI open source and available to people, and also the models themselves, making them open source. So thank you for that. And thank you for speaking your mind in such colorful and beautiful ways on the Internet. I hope you never stop. You're one of the most fun people I know and get to be a fan of. So Yann, thank you for speaking to me once again, and thank you for being you.

02:46:43
Yann LeCun
Thank you, Lex.

02:46:45
Lex Fridman
Thanks for listening to this conversation with Yann LeCun. To support this podcast, please check out our sponsors in the description. And now let me leave you with some words from Arthur C. Clarke: the only way to discover the limits of the possible is to go beyond them into the impossible. Thank you for listening, and hope to see you next time. Bye.

Source | Yann Lecun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416

