Introduction
Song Han is an associate professor in Electrical Engineering and Computer Science whose research focuses on efficient AI computing. His work spans high-resolution computer vision for autonomous vehicles, more efficient image generation, improved GPT performance, and novel methods for training machine learning models. He also leads the Efficient AI team at NVIDIA Research, focused on optimizing GPU-accelerated AI systems.
In this episode, President Sally Kornbluth speaks with Song Han about efficient AI, why it’s so energy-hungry in the first place, and the benefits of lighter models.
Transcript
Sally Kornbluth: Hello. I'm Sally Kornbluth, president of MIT, and I'm thrilled to welcome you to my podcast, Curiosity Unbounded. Here at MIT, our endless curiosity fuels a steady stream of inspiring discoveries and innovations. This podcast is your chance to join me as I talk with pioneers who are pushing the frontiers of knowledge and inventing real world solutions for a better future. Today, my guest is Song Han. Song is an associate professor of electrical engineering and computer science at MIT, whose research focuses on boosting the efficiency of AI computing. He develops techniques to shrink and speed up large AI models, cutting their energy use, lowering their cost, and enabling them to run faster on everything from cloud servers to personal devices. His work has advanced high resolution computer vision for autonomous vehicles, led to more efficient AI image generation, improved GPT performance, and yielded new methods for training machine learning models.
All things we're thinking about, all things that are in the news, all of the time, and if our listeners are like me, all things they're incredibly curious about. So, Song, welcome to the show.
Song Han: Thank you, Sally. I'm very honored to be here. Excited to talk about efficient AI stuff.
Sally Kornbluth: Excellent. So what first inspired you to make AI models more efficient?
Song Han: Actually, that dates back to a decade ago, when I was doing my PhD at Stanford. We were trying to accelerate large neural networks with my advisor, Professor Bill Dally. Initially, we were trying to use a hardware solution, to build accelerators for that. And we actually found that, before doing hardware acceleration, there's a huge opportunity to shrink these models from the software perspective, to make them a lot smaller, with a lower memory footprint. And then we can design specialized hardware to accelerate the small, compressed model, combining the efforts from software and hardware to do this co-design, and make it a lot smaller and a lot more efficient.
Sally Kornbluth: You know, it's interesting. The first time I heard you talk about this was right after, uh, DeepSeek came out. It was DeepSeek, that's what it was called?
Song Han: Yeah.
Sally Kornbluth: That hit the news. And you came and talked, uh, I think to some leadership about it. And I think we all thought that there was some revolution in technology, but it was actually sort of an optimization of many different elements. So what's happening under the hood for the large AI systems that makes them so energy intensive? And how do you think about compression? In other words, what are the elements of compression?
Song Han: Yeah. So actually, there are two reasons why it's so energy consuming. One is compute. One is data movement. These large neural networks require a lot of arithmetic compute, so they require a lot of energy from that perspective. And secondly, moving the data is even more expensive, including moving the weights, the activations, the KV cache, moving it from machine to machine, GPU to GPU, memory to cache. It's even more expensive. So those are the two reasons. And model compression can improve the efficiency because it not only reduces the model size, but also shrinks the amount of arithmetic compute, and also reduces the amount of data movement, so you get a compound benefit for saving energy.
Sally Kornbluth: So there's a lot of details in the large models that are nice to have, but not essential for AI to function. Is that correct? And so you're actually sort of eliminating data, in a way, when you compress?
Song Han: Oh yeah, exactly. That's the motivation behind what's called pruning. Pruning a neural network is like pruning a tree: you have so many branches, and some of them are the trunk. You cannot prune it. Some of them are like side branches, and you can remove them safely without hurting the accuracy, without hurting the tree. Similarly for neural networks, we can prune at different granularities. Also, we can quantize: rather than using full precision, like 32 bits, to represent a number, we can use only 8 bits or even 4 bits. And finally, we can use distillation techniques, having a large model teach a small model, so that the small model can learn from the large model and get to an accuracy close to the larger model's, so that we can use the small model in production, in practice.
Sally Kornbluth: I see. I see. So when I was introducing you, I mentioned your interest in sort of image generation. Is it much more difficult to compress image generation AI, than what we think of as just traditional large language model, sort of, text AI?
Song Han: Exactly. Indeed. There are two levels of difficulty. One is people need high-resolution images, so you're not just predicting a word, a few tokens, but you are predicting pixel-wise, the whole image, like a high-resolution 4K image. That's a lot of pixels and a lot of tokens. So you need the machine learning model to generate a lot of tokens for a high-resolution image. And also for videos, you need to generate long videos, and that adds the complexity of generating super long video, making it robust, making it not blurry, so a lot of token consumption. So we developed a technique called a deep compression autoencoder, to shrink the number of tokens, so that you need to generate fewer tokens, and try to be lazy, but still you can recover it with very high quality.
Sally Kornbluth: Yeah, that's interesting. You know, social media is rife with these now, kind of, artificial videos, and people debating whether it's real or AI, and it's sort of horrifying to think of using tons of energy to, you know, generate cat videos, for example. (laughs) And so the way people use AI, I think they don't think at all about energy consumption or compression. They just wanna do what they wanna do with it.
Song Han: Right. Actually, there's a huge opportunity to make it more efficient by leveraging the nature of video, the temporal similarity and also the spatial similarity. We can get rid of tons of compute by using that; it's called sparse attention. Attention is all you need, but you only pay attention where you need to, and you don't pay attention where it's not needed or is redundant, and that can save a lot of energy.
Sally Kornbluth: Yeah, no, very interesting. So besides consuming less energy, what are some of the advantages of these sort of lighter weight AI models?
Song Han: Yeah, so it's not just saving energy, but also improving productivity and revenue. For example, a data center usually has a fixed power budget, right? And the more compute you can squeeze out of this fixed power envelope, the more productivity, the more intelligence you can generate, the more tokens you can generate, and the more revenue you can generate, the more users you can serve. So it's also productivity. And also, once the efficiency is beyond a threshold, you can realize real-time AI. For example, you don't need to wait for the AI to give you the information offline, but you can interactively experience it. An example is NVIDIA's DLSS, Deep Learning Super Sampling.
Sally Kornbluth: Yes.
Song Han: You can do real time translation from a gaming style video to a, like, real world-like-
Sally Kornbluth: I see.
Song Han: ... uh, video. So this kind of real-time capability is a very good payoff when the neural network gets efficient enough.
Sally Kornbluth: I mean, it sounds like you've been working on these things since you were a graduate student, and I'm just curious, are the capabilities that we're seeing emerge surprising to you? Had you anticipated this level of progress and pace, or are you also amazed, as I am, about the capabilities of these systems?
Song Han: I'm pretty amazed, thanks to the scaling law, right, since 2022 and the ChatGPT moment. It was a lot different with this generative AI compared with, uh, before that. I started my PhD in 2012, that's when AlexNet just came out as the first wave, and a decade later, since 2022, I think that's the second wave, with the scaling law, where people are squeezing out a lot more intelligence by training on a much larger amount of compute, a much larger amount of data, and the result is just so astonishing.
Sally Kornbluth: It really is amazing. What do you think the most, sort of, exciting capabilities are that have emerged in the last couple of years?
Song Han: I think the most exciting things are the self-reflection capability, the post-training capability, and the reasoning capability. So previously, pre-training is like you train a student through college, right? And after he goes to find a specific job, he needs, uh, specific training, right? So we can do post-training for alignment, and also do supervised fine-tuning followed by reinforcement learning, to align the system and get a lot of new knowledge in, in a specific domain. And AI these days, with test-time scaling, I think that's a very cool feature, since GPT-4o, where you can just increase the compute at inference time, and then do self-reflection to verify whether the previously generated result hallucinates or is true or not, and do continued generation. By scaling the inference time, the inference compute, the accuracy can be increased a lot. And I think that's a very exciting milestone.
Sally Kornbluth: It's so interesting, because I think the average user begins to forget that they're not dealing with a person. And that these AIs now have different personalities, different modes of interacting with the environment. I realize that that's, uh, an illusion in a way, but I think it's definitely influencing people's perception of AI and what they can do with it.
Song Han: Yeah. Those are all thanks to post-training.
Sally Kornbluth: Yeah.
Song Han: So that you can align the AI with, uh, different characteristics.
Sally Kornbluth: Exactly. Exactly. So you've actually been working now to make things run faster, more efficiently, et cetera, and potentially, you know, on a variety of devices. So what's the impact, for instance, of my now being able to pull out my phone and run AI on my phone, as opposed to having to either sit in my office, or before that to be someone who had access to large computing resources?
Song Han: Yeah, I think in the future it will be a hybrid mode. Some super gigantic, powerful AI will sit in the cloud, in the data center. In the meantime, there will be a bunch of smaller language models, or generative models, sitting on mobile devices. They can talk, they can interact, depending on our prompt. If it is simple enough, the local device will be capable enough to give you a real-time answer, and if it is more challenging or more difficult, it'll be routed to the data center and come back. And we'll see a lot of on-device AI scenarios, latency-critical scenarios, for example, self-driving cars, or robots, these physical AI applications. They need real time, they need to be very robust, not relying on the internet-
Sally Kornbluth: But really miniaturized.
Song Han: ... and miniaturized with a fixed power budget.
Sally Kornbluth: Yes.
Song Han: You don't want a whole trunk full of computers sitting in the trunk. So that's where I think efficient AI is very critical.
Sally Kornbluth: Do you think there's gonna be, you know, just as we're talking, I'm thinking, do you think there's gonna be a sort of market for much more specialty small models that can operate quickly? So for example, if you go on to, you know, any of the generative AI models, you can do very extensive travel planning. I can imagine very small, fast, sophisticated models that are marketing direct to consumer, for example, just for planning trips, or just for asking medical questions, or d- I mean, I guess some of that is emerging, but do you think that's gonna become more common?
Song Han: Yeah, that's vertical AI. Once you focus on a very specific domain, you can make it very focused and shrink it to only what is useful, without having a lot of redundancy, which gives you the opportunity to make it very dedicated, very specialized, while being super efficient.
Sally Kornbluth: Yeah, no, it's interesting. So you focused a little bit on making AI handle long documents or long videos very efficiently. What kind of things might that unlock for everyday users? How would that help me, for instance, or one of our listeners?
Song Han: Yeah. Uh, we did some long-context work such as StreamingLLM. Our natural world is long context. Like, we have tens of years of memories, and a very long textbook covering a whole semester, or an hour-long video, or a whole year of emails we may want to reference. So our world is naturally long context, and having the long-context capability is difficult from the computing perspective, because it occupies a large amount of memory, and is very slow to index, to search, to locate. And very interestingly, conventional large language models suffer from the lost-in-the-middle issue: they can remember the beginning very well, and the end very well, but in the middle, they tend to forget. So we proposed a few techniques that try to shrink the memory consumption, to be able to memorize a very long document and continuously provide the interaction without the memory exploding. That work is called StreamingLLM, and is now part of OpenAI's gpt-oss, the open-source version of the GPT model. A very helpful technique.
Sally Kornbluth: I mean, we've all had the experience, I guess, of, you know, you've watched some long video or read some long book, and you're like, "I know there was a scene or a piece of information that conveyed this, but I could never find it again." So I assume that, in terms of people processing large amounts of information, seeing something very long, that those kinds of things will also help in speeding up data location, processing, et cetera?
Song Han: Exactly. For example, in a, in a car, you wanna prompt it, "When did I hit another pedestrian?" For example.
Song Han: You just locate-
Sally Kornbluth: I never wanna process that.
Song Han: ... Yeah.
Sally Kornbluth: But okay.
Song Han: You just wanna locate a specific event that happened. Like in your home, you wanna see, "Did someone leave a package at my home? When did that happen?" And it can locate that for you, and it's continuous interaction. Or maybe finding a relationship between two different events. They might be a long time apart from each other, but if you can understand the whole video as a whole, that can enable such applications.
Sally Kornbluth: That would save my husband and I a lot of conversation during TV shows saying, "What did they do? Who are they again?" (laughs) You know? It'd be nice to be able to figure that out quickly. I'm just joking, that's a trivial example. But, so you've developed these AI model compression techniques that have already been downloaded millions of times. And what does that kind of adoption mean for the field? I mean, are the users leading to iteration for you? Are you getting feedback from all of the folks that are using these compressed models?
Song Han: Right, that's a good point. Our 4-bit quantization technique, called AWQ, has been downloaded more than 60 million times, not only by academia but also industry. Like NVIDIA, a lot of companies have integrated it into their products, and that means efficient AI is not just good to have, it's a must-have. And even as models continue scaling up, the efficiency demand will just be more important. We can unlock a lot of new capabilities, and we get a lot of feedback from industry, like NVIDIA, and we can continue to iterate on these algorithms.
Sally Kornbluth: Right. And they tell you what they'd like to see, presumably?
Song Han: Yeah.
Sally Kornbluth: Yeah.
Song Han: For example, extending from, uh, language model to visual language model, that's a good example for that.
Sally Kornbluth: Yes, yes. Yeah, that's really interesting. You know, you mentioned about small devices linking to huge cloud servers, but presumably the ultimate, you mentioned things like self-driving cars, et cetera. Ultimately, I might be able to own a phone that has everything it needs in a contained way. And I think that addresses one of the concerns a lot of people have, which is about security and AI. You can glean a lot from an individual by the kinds of questions they're asking-
Song Han: Yes.
Sally Kornbluth: ... the generative AI. So how do you think about that? Do you think, as you build these models, do you think about how secure they are, and how they might be used for less than, uh, salutary purposes?
Song Han: Absolutely. I think as AI models are getting to multiple modalities, people have a bigger demand for privacy, like they are listening to you, to your meetings, and looking at your cameras and reading your emails and notes, right? We want that information to stay local. Nowadays, we see these three-billion-parameter, seven-billion-parameter models are getting very capable from the algorithm perspective. And the silicon, these chips, are also getting more powerful. Even mobile chips can run these three-billion, seven-billion-parameter models in real time without a problem. So I'm very optimistic that with this co-design, software and hardware co-design, we can make the gap smaller and smaller, so that one day we can have a lot of applications running locally, helping people solve the privacy concern.
Sally Kornbluth: Yes. So, you know, it's interesting. How do you conceive of these things day to day? When you try to approach these problems, I always like to sort of give people a feeling for, like, what the actual work you do looks like. So on a day-to-day basis, how do you operate? What are you trying to achieve in small pieces, to kind of make a larger successful, uh, compressed model, or smaller compressed model?
Song Han: Mm-hmm. We usually approach this problem from different angles, different perspectives. One angle is software and hardware co-design. We will think about: which is a model or algorithm problem? Which is a hardware or system problem? For example, a 4-bit quantization algorithm needs to be coupled super well with the 4-bit inference kernel libraries, the code that implements those algorithms. So that's one angle. The other angle is training versus inference: which techniques should target the training, accelerating the training, and which should target the acceleration of inference? And these days, we see more demand for efficient inference, which is a good sign, which means AI models are not just sitting in the lab for each iteration, but actually going into mass production. So the inference optimization demand is, you know, even bigger.
We also look at different applications, generation versus understanding. Generation means we can generate a lot of images and videos. Understanding means understanding images and videos and seeing what's happening. And they can help each other, by using the understanding model to label the data fed to the training of the generation model. So these are the synergies between these perspectives. We also look at different technology perspectives, for example, sparse versus dense, high precision versus low precision. There's a very large design space, and I think the key thing, day to day, is kind of co-design, right? Opening the whole space: if you have complete freedom in the ecosystem, what can you do to make AI more efficient, from training to inference, system to algorithm, hardware to software, generation to understanding? Put them all together, unlock the space, and just let your imagination soar.
Sally Kornbluth: You know, it's interesting. I think probably a decade ago when we had students come to MIT, they felt that they could just learn to code and get a great job. And what you're describing is expertise in software, expertise in hardware, understanding of the context in which the AI, for instance, is gonna be operating, et cetera. So what do you advise students who are sort of thinking about entering this area now? Like, how should they be thinking about what they need to know to operate successfully in this area?
Song Han: Mm-hmm. Yeah, I think it's a great moment to rethink education and teaching, and what students should learn. 80 or 90% of the coding during our day-to-day research, nowadays, we find AI tools can do super well. But the fundamental understanding to connect different concepts together, and the design-space exploration to know what the possibilities are, I think that's something newer students should learn. Not only algorithms, not only how to train models, but also the systems, the kernels, how to actually implement this stuff. AI is a very special animal, where it's not a fixed workload from the computing perspective. It can be dense, it can be sparse, it can be full precision, it can be quantized, and there are just so many co-design opportunities, which makes it more important to learn the whole stack, from computer architecture to operating systems, high-performance computing to compilers, to machine learning, to artificial intelligence, from NLP to vision. This whole stack, I think, is getting tighter and tighter.
Sally Kornbluth: Yes. So, so what someone needs to know to really operate fluidly within the space, in some ways it's getting larger, even if we think about ... I think sometimes people think about AI and education, "Oh, we have to know a lot less because the AI will do it for us." But in reality, if you wanna work in this area, you have to have a much deeper understanding if you're really gonna be creative.
Song Han: Exactly. Having the capability to connect the dots, and for each dot, AI might be able to do it very well, very deeply-
Sally Kornbluth: Exactly.
Song Han: ... a capability to connect the dots across design space.
Sally Kornbluth: That's really interesting. So talking about teaching, can you talk a little bit about your teaching platform, efficientML.ai, and tell me a little bit about it and actually what motivated you to start this effort?
Song Han: Oh yeah, I'd love to. I initiated this efficientml.ai course three or four years ago, to try to disseminate the knowledge on efficient AI. Many companies are actually using it as a training tutorial for onboarding new employees to deploy models, 'cause I see the huge demand for efficient AI and deployment from the cloud to the edge, but there's a big shortage of such talent. So we wanna disseminate it to more students and equip them with these capabilities. Some of the students are now in large companies, some of them have become professors, some of them have also started their own companies, and I see it's getting very fruitful these days, and we'll continue that effort and keep it public. The lecture materials, videos, and slides are all public, and everybody can access them.
Sally Kornbluth: What do you think about the sort of industry, academia ... I wouldn't even call it divide. It's not really a divide, but the different contributions that industry and academia are gonna make to the future of AI? You know, what do things look like 10 years from now? Are we still doing deep levels of AI research in universities? How are we collaborating with all of the companies that are emerging? How do you think about that?
Song Han: I think companies offer lots of new problems and also lots of crucial resources. Like, I'm very thankful for NVIDIA donating GPU and cloud compute resources to us. Jensen Huang is very ambitious and very innovative, and that impacts our research a lot. And I think, for academia, we have a lot of freedom to explore different, like, crazy ideas, like going to 4 bits or even 2 bits.
Sally Kornbluth: Yeah, yeah.
Song Han: And making it 99% sparse, all zeros. So these crazy ideas, we can explore them, open-source them, and contribute back to industry. And I think this relationship will just keep getting tighter and more fruitful.
Sally Kornbluth: Mm-hmm. That's great. That's great. So what do you do outside of this? Do you have, uh, hobbies? Do you, tell me a little bit about things besides AI that you find of interest?
Song Han: Oh, yeah. There's a coming ski trip, uh, with our lab, lots of, uh, fun activities-
Sally Kornbluth: Downhill.
Song Han: ... Downhill. Uh, ski at the speed of GPUs.
Sally Kornbluth: I guess that explains a lot, 'cause I'm more of a cross-country kind of analog skier. Nice and slow and flat.
Song Han: Yeah, that's fun. Especially in the Northeast.
Sally Kornbluth: Very fun. Very fun. And for listeners that really wanna understand a lot more about AI, beyond sort of the headlines, where do you think they should start? In other words, how do people understand a little bit more what tools are available to them, how they might use them, how they might think about them, how they can use it in their own lives?
Song Han: Yeah, I think these days there are so many great tools. Just get your hands dirty and try a few tools, then use the tools, write the code-
Sally Kornbluth: Yeah.
Song Han: ... and start implementing stuff, uh, take the Efficient ML lectures, and we have lots of hands-on projects. After doing the projects, basically you can deploy a seven-billion-parameter model locally on your laptop.
Sally Kornbluth: Oh, really?
Song Han: By completing the course.
Sally Kornbluth: Wow. Very cool. Yeah, no, I think the future's gonna be very interesting in this regard. And, you know, my own ... I don't do AI research, obviously, but I've started using the tools, and they're really quite amazing, and I think people will find ways to make their own work better, kind of allaying some of the fears of AI replacing what they're doing, but actually using AI as an extender of their own creativity.
Song Han: Yeah, absolutely. And for science, I think there's a huge, uh, momentum.
Sally Kornbluth: Exactly. Exactly. Well, this has been very, uh, enlightening to me. I'm sure our audience is gonna love hearing all about this, and you're gonna get more than 60 million, uh, downloads, I'm sure, as time progresses. And so I wanna thank you for joining us, and I wanna thank our audience for listening to Curiosity Unbounded. I very much hope our audience will join us again, and I very much hope you'll join us again. So, I'm Sally Kornbluth. Stay curious.
Song Han: Thank you, Sally.
Curiosity Unbounded is a production of MIT News and the Institute Office of Communications, in partnership with the Office of the President. This episode was researched, written, and produced by Christine Daniloff and Melanie Gonick. Our sound engineer is Dave Lishansky. For show notes, transcripts, and other episodes, please visit news.mit.edu/podcasts/curiosity-unbounded. Please find us on YouTube, Spotify, Apple, or wherever you get your podcasts. To learn about the latest developments and updates from MIT, please visit news.mit.edu. You can follow us on Facebook and Instagram at CuriosityUnboundedPodcast.
Glossary:
GPU: A GPU (graphics processing unit) is a type of computer chip that can handle lots of calculations at the same time. It was originally made for graphics, but because it has thousands of small processing cores, it’s great at the kind of math that AI systems need. That makes GPUs essential for training AI models and running them quickly once they’re built.
CPU: A CPU (central processing unit) is the main “brain” of a computer. It executes instructions, performs calculations, and manages data flow. Found in everything from smartphones to servers, it runs the operating system, applications, and the core tasks that keep a device working.
Tokens: Tokens are the fundamental, atomic units of data—words, parts of words, or characters—that AI models use to process, understand, and generate information.
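As a rough illustration of the idea, here is a toy Python tokenizer that splits text into word and punctuation units. (Real models use learned subword tokenizers, such as byte-pair encoding, so this is only a sketch of how text becomes discrete units.)

```python
import re

def toy_tokenize(text: str) -> list[str]:
    # Keep runs of word characters together; treat punctuation as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = toy_tokenize("Efficient AI runs on-device.")
# → ['Efficient', 'AI', 'runs', 'on', '-', 'device', '.']
```

A model never sees the raw text, only the sequence of these units (mapped to numeric IDs), which is why "number of tokens" is the natural unit for measuring AI compute.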
Pruning: In computing, pruning is a technique used to reduce the size and complexity of a model or circuit by removing unnecessary or redundant components. It is applied in machine learning to simplify decision trees and neural networks, and in digital circuit design to reduce power consumption and area. The goal is to maintain performance while improving efficiency.
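A short illustrative sketch of the tree analogy, in Python with NumPy. This shows magnitude pruning, one simple pruning criterion (a toy example, not the actual implementation discussed in the episode): the weights with the smallest absolute values, the "side branches," are zeroed out.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest magnitudes."""
    k = int(weights.size * sparsity)  # how many weights to remove
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    # Keep the "trunk" (large weights); zero the "side branches."
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

w = np.array([[0.9, -0.05, 0.3],
              [-0.01, 0.7, -0.2]])
pruned = magnitude_prune(w, sparsity=0.5)
# → [[0.9, 0.0, 0.3], [0.0, 0.7, 0.0]]: half the weights are now zero.
```

In a real deployment the zeroed weights are then skipped by sparse kernels or removed structurally, which is where the memory and energy savings come from.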
Quantization: Quantization is a technique used to make AI models smaller and faster without changing what they can do in a noticeable way.
Imagine an AI model is like a huge spreadsheet full of very precise numbers—lots of decimal places. Storing and processing all those exact numbers takes a lot of memory and energy.
Quantization works by rounding those numbers to simpler, smaller ones (for example, going from numbers with many decimal points to whole numbers). The AI still works the same for most tasks, but now the model:
- Takes up less space
- Uses less energy
- Runs faster on regular hardware like phones and laptops
It’s a bit like compressing a high-resolution photo so it loads faster—you lose a tiny amount of detail, but it’s usually not noticeable, and everything works more smoothly.
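A minimal sketch of the rounding idea in Python, using simple symmetric 8-bit quantization. (Production methods, like the AWQ technique mentioned in this episode, are more sophisticated; this just shows the core mechanic.)

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map float values onto integer levels in [-127, 127], plus one scale factor."""
    scale = float(np.max(np.abs(x))) / 127.0  # one float stored alongside the ints
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the integers."""
    return q.astype(np.float32) * scale

x = np.array([0.82, -0.31, 0.004, 1.27], dtype=np.float32)
q, scale = quantize_int8(x)   # q = [82, -31, 0, 127], one byte per value
x_hat = dequantize(q, scale)  # close to x; worst-case error is scale / 2
```

Storing `q` takes a quarter of the memory of `x` (int8 versus float32), at the cost of a small rounding error: the photo-compression trade-off described above.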
Model compression: Model compression is a set of techniques in computing that reduces the size of AI models to make them more memory-efficient and faster for inference, while maintaining their accuracy.
Accelerated AI computing: Accelerated AI computing uses specialized hardware, like GPUs, to perform the massive parallel computations required for AI tasks like training large language models and running generative AI applications. This approach is significantly faster and more energy-efficient than using traditional CPUs alone, making complex AI applications feasible and scalable for modern businesses and researchers in fields ranging from scientific modeling to cloud services.
Parallel computing: Parallel computing is the simultaneous execution of multiple calculations or processes to solve a problem more quickly. It works by breaking down a large, complex task into smaller, independent parts that can be processed concurrently on multiple processors. This approach significantly speeds up computation and is crucial for handling large datasets and complex problems in areas like scientific simulations and data analysis.
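The split-and-combine idea can be sketched in a few lines of Python (illustrative only: Python threads share one interpreter, so this shows how a task decomposes into independent parts; the real speedups come from many CPU cores, processes, or GPU cores working at once):

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_sum(chunk: list[int]) -> int:
    # Each worker processes one independent piece of the larger task.
    return sum(chunk)

data = list(range(1_000_000))
chunks = [data[i::4] for i in range(4)]          # split into 4 independent parts
with ThreadPoolExecutor(max_workers=4) as pool:  # process them concurrently
    partials = list(pool.map(chunk_sum, chunks))
total = sum(partials)                            # combine the partial results
# total equals sum(data): splitting and recombining gives the same answer.
```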
Inference (in computing): Inference refers to when an AI model is actually being used to make predictions or generate outputs — after it has already been trained.
Examples of inference in everyday life:
- When you ask a chatbot a question and it gives you an answer.
- When your phone recognizes a face in a photo.
- When an app translates a sentence you type.
Parameters: Parameters are the internal numerical values an AI model learns from data during training; a model's size is usually described by its parameter count. So:
- A model with 100 million parameters is powerful but relatively small.
- Modern models like ChatGPT have tens or hundreds of billions of parameters, meaning they’ve learned from far more data and can handle much more complex tasks.