Multicore may not be so scary

Research suggests that the free operating system Linux will keep up with the addition of more ‘cores,’ or processing units, to computer chips.

Computer chips have stopped getting faster. To keep improving chips’ performance, manufacturers have turned to adding more “cores,” or processing units, to each chip. In principle, a chip with two cores can run twice as fast as a chip with only one core, a chip with four cores four times as fast, and so on.

But breaking up computational tasks so that they run efficiently on multiple cores is a difficult task, and it only gets harder as the number of cores increases. So a number of ambitious research projects, including one at MIT, are reinventing computing, from chip architecture all the way up to the design of programming languages, to ensure that adding cores continues to translate to improved performance.

To managers of large office networks or Internet server farms, this is a daunting prospect. Is the computing landscape about to change completely? Will information-technology managers have to relearn their trade from scratch?

Probably not, say a group of MIT researchers. In a paper they’re presenting on Oct. 4 at the USENIX Symposium on Operating Systems Design and Implementation in Toronto, the researchers argue that, for at least the next few years, the Linux operating system should be able to keep pace with changes in chip design.

Linux is an open-source operating system, meaning that any programmer who chooses to may modify its code, adding new features or streamlining existing ones. By the same token, however, anyone who publicly distributes those modifications must make the source code available, which keeps Linux itself free of charge and makes it popular among managers of large data centers. Programmers around the world have contributed thousands of hours of their time to the continuing improvement of Linux.

Clogged counter

To get a sense of how well Linux will run on the chips of the future, the MIT researchers built a system in which eight six-core chips simulated the performance of a 48-core chip. Then they tested a battery of applications that placed heavy demands on the operating system, activating the 48 cores one by one and observing the consequences.

At some point, the addition of extra cores began slowing the system down rather than speeding it up. But that performance drag had a surprisingly simple explanation. In a multicore system, multiple cores often perform calculations that involve the same chunk of data. As long as the data is still required by some core, it shouldn’t be deleted from memory. So when a core begins to work on the data, it ratchets up a counter stored at a central location, and when it finishes its task, it ratchets the counter down. The counter thus keeps a running tally of the total number of cores using the data. When the tally gets to zero, the operating system knows that it can erase the data, freeing up memory for other procedures.

As the number of cores increases, however, tasks that depend on the same data get split up into smaller and smaller chunks. The MIT researchers found that the separate cores were spending so much time ratcheting the counter up and down that they weren’t getting nearly enough work done. Slightly rewriting the Linux code so that each core kept a local count, which was only occasionally synchronized with those of the other cores, greatly improved the system’s overall performance.
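The per-core counting fix can be sketched in miniature. Below is a simplified Go illustration of the idea (the type and method names are my own, not the kernel's actual code): each shard keeps a local count on its own cache line, and the shards are only summed when the true total is needed, so increments rarely touch memory another core is using.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// shard holds one counter, padded so each shard sits on its own
// cache line and concurrent increments don't contend.
type shard struct {
	n int64
	_ [7]int64 // padding: keep neighboring shards off this cache line
}

// shardedCounter keeps one counter per shard (think: per core).
type shardedCounter struct {
	shards [8]shard
}

// inc bumps only the local counter for the given shard.
func (c *shardedCounter) inc(shardID int) {
	atomic.AddInt64(&c.shards[shardID%len(c.shards)].n, 1)
}

// total sums all shards; called only when the true count is needed.
func (c *shardedCounter) total() int64 {
	var sum int64
	for i := range c.shards {
		sum += atomic.LoadInt64(&c.shards[i].n)
	}
	return sum
}

func main() {
	var c shardedCounter
	var wg sync.WaitGroup
	for s := 0; s < 8; s++ {
		wg.Add(1)
		go func(s int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.inc(s)
			}
		}(s)
	}
	wg.Wait()
	fmt.Println(c.total()) // 8000
}
```

Reading the exact total still requires visiting every shard, but increments, which vastly outnumber reads in the reference-counting case above, stay local and cheap.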

On the job

“That basically tells you how scalable things already are,” says Frans Kaashoek, one of three MIT computer-science professors who, along with four students, conducted the research. “The fact that that is the major scalability problem suggests that a lot of things already have been fixed. You could imagine much more important things to be problems, and they’re not. You’re down to simple reference counts.” Nor, Kaashoek says, do Linux contributors need a trio of MIT professors looking over their shoulders. “Our claim is not that our fixes are the ones that are going to make Linux more scalable,” Kaashoek says. “The Linux community is completely capable of solving these problems, and they will solve them. That’s our hypothesis. In fact, we don’t have to do the work. They’ll do it.”

Kaashoek does say, however, that while the problem with the reference counter was easy to repair, it was not easy to identify. “There’s a bunch of interesting research to be done on building better tools to help programmers pinpoint where the problem is,” he says. “We have written a lot of little tools to help us figure out what’s going on, but we’d like to make that process much more automated.”

"The big question in the community is, as the number of cores on a processor goes up, will we have to completely rethink how we build operating systems," says Remzi Arpaci-Dusseau, a professor of computer science at the University of Wisconsin. "This paper is one of the first to systematically address that question."

Someday, Arpaci-Dusseau says, if the number of cores on a chip gets "significantly beyond 48," new architectures and operating systems may become necessary. But "for the next five, eight years," he says, "I think this paper answers pretty definitively that we probably don't have to completely rethink things, which is great, because it really helps direct resources and research toward more relevant problems."

Arpaci-Dusseau points out, too, that the MIT researchers "showed that finding the problems is the hard part. What that hints at for the rest of the community is that building techniques — whether they're software techniques or hardware techniques or both — that help to identify these problems is going to be a rich new area as we go off into this multicore world."

Topics: Computer science and technology, Linux, Multicore, Parallel computing


This is a time of transition for the next generation; it's a good thing our reliable Linux will still be reliable in this whole new world of multicore chips.

But it needs to get smarter. Not just the operating system but the applications, too, need to be designed in a multicore-friendly way. We have to use constructs like fork-join and immutable objects to increase the parallelizability of our apps.
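As a rough illustration of the fork-join style mentioned above, here is a minimal Go sketch (the function name is my own) that forks a sum over an immutable input slice into independent chunks and then joins the partial results; because the input is never mutated, the workers need no locks at all.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelSum forks a sum over xs into `workers` independent chunks,
// then joins the partial results. xs is treated as immutable.
func parallelSum(xs []int, workers int) int {
	partial := make([]int, workers)
	chunk := (len(xs) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if lo > len(xs) {
			lo = len(xs)
		}
		if hi > len(xs) {
			hi = len(xs)
		}
		wg.Add(1)
		go func(w, lo, hi int) { // fork
			defer wg.Done()
			for _, x := range xs[lo:hi] {
				partial[w] += x // each worker writes only its own slot
			}
		}(w, lo, hi)
	}
	wg.Wait() // join
	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	xs := make([]int, 100)
	for i := range xs {
		xs[i] = i + 1
	}
	fmt.Println(parallelSum(xs, 4)) // 1 + 2 + ... + 100 = 5050
}
```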

As stated in the paper, "if 25% of a program is serial, then any number of cores can provide no more than 4-times speedup." That is just Amdahl's law at work.

So we must design our programs to keep the serial portion to a minimum.
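The quoted 4x ceiling follows directly from Amdahl's law, and a few lines of Go make the limit concrete (the function name here is illustrative):

```go
package main

import "fmt"

// speedup applies Amdahl's law: with serial fraction s of the work,
// n cores give at most 1 / (s + (1-s)/n); as n grows, this tends to 1/s.
func speedup(s float64, n int) float64 {
	return 1 / (s + (1-s)/float64(n))
}

func main() {
	// A program that is 25% serial caps out at 1/0.25 = 4x,
	// no matter how many cores are thrown at it.
	fmt.Printf("48 cores: %.2fx\n", speedup(0.25, 48))
	fmt.Printf("1M cores: %.2fx\n", speedup(0.25, 1000000))
}
```

Even 48 cores buy only about a 3.76x speedup when a quarter of the work is serial, and a million cores barely reach 4x.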

For example, the paper's MapReduce workload spends 3% of its runtime in the kernel with one core, but this rises to 16% at 48 cores.

So it seems that almost every layer of software must be redesigned for multiple cores. But that's impractical, and it may not be necessary for every algorithm.

But I do want to stress that every program written from now on should give serious consideration to increasing parallelism, so that every new app is ready for the multicore world.

If you want to see a radically different approach to multicore, see the LoseThos operating system.

In LoseThos, you explicitly assign work to the other cores from core 0, so it is not an SMP design.

It doesn't use graphics acceleration, so graphics are done on the cores themselves, which is real-time and very demanding work compared to batch, non-real-time processing.

It seems unlikely to me that adding more cores to a shared-memory architecture is a scalable solution; it intrinsically runs into a major bottleneck.

Of course we have some tricks to compensate, such as adding more CPU cache. But even then, we're just shifting the problem down the line. The cache-synchronization protocols require either a larger interconnect network to sync with all the remote processors or a shared bus, and either approach is intrinsically unscalable.

There are two obvious solutions which come to mind:

1. Don't use shared memory (i.e., run a cluster of computers in parallel).

2. Abandon x86 memory-consistency semantics, where atomic operations are guaranteed to be atomic across all processors.

Either of these is scalable hardware-wise. In my opinion, though, solution 2 somewhat defeats the purpose of having a shared-memory system in the first place. If the OS and/or app must explicitly synchronize memory accesses anyway, then having shared memory is largely irrelevant.

That leaves solution 1: large clusters of separate systems working in conjunction with one another. The goal would therefore be to build an interconnect with extremely low latency, high performance, and guaranteed delivery within a local network topology.

So my prediction is that while manufacturers will continue to multiply the cores, the inherent overhead will mean that we eventually abandon the shared memory approach and turn to clusters as the way forward. (Of course by then, they could manufacture entire clusters on a single die. The novice would not really know the difference.)

The ability to break algorithms into multi-core units is all for nothing if we can't synchronize operations across multiple cores.

Using shared memory for lock storage will eventually become a problem to solve (as shown in the article above), since the latency of acquiring and releasing resource locks keeps climbing as cores are added. Read-write locks are very good in this regard because they only necessitate synchronization on acquire, but even so the problem remains.

I anticipate that we will need to add a synchronized register file: a shared on-chip lock cache from which all locks are acquired and released. There would be no need to jump out into main memory. Furthermore, this would significantly reduce MMU operations and cache invalidation/miss problems. New CPU instructions would be required to use the locking cache, but of course old code should work as before.

A simple 1Kx64 shared on-chip cache using compare-exchange-type operations could easily support a huge number of application locks (mutex, read/write, or spin). Yes, its size is a limit, but I'm sure OS developers could build lock-paging algorithms to allow for unlimited virtual locks.
