The ILP Wall and pipelines

Published online on February 15, 2012 by Russell Fish III Future of computing – Part 3: The ILP Wall and pipelines [Part 1 begins a look at the future of computing and, in particular, what happens when multicore processing “hits…

Published online on February 15, 2012 by Russell Fish III

Future of computing – Part 3: The ILP Wall and pipelines

[Part 1 begins a look at the future of computing and, in particular, what happens when multicore processing “hits the Memory Wall.” Part 2 turns its attention to the “Power Wall” – the increasing heat and power issues associated with increased performance.]

Instruction Level Parallelism means executing multiple instructions or pieces of instructions at the same time to make the computer run faster. Computers have hit the parallelism wall. This paper will analyze the problem, but first a little history.

Henry Ford pioneered assembly lines for automobile manufacturing.An assembly line consists of workstations that perform a single
production step before passing the partially completed automobile to the next workstation. All steps at all workstations are performed at the same time so many partially completed automobiles are assembled in parallel. The number of assembly steps determines the rate of production. The more steps in the line, the more cars are built in parallel, and the faster the finished vehicles come out the end.

At Ford’s original factory, a Ford Model T could be assembled in 93 minutes from start to finish. There were 31 workstations on the line, and a completed car rolled off every three minutes.

In the early 1960s, Seymour Cray, Jim Thornton, and Dean Roush of CDC began thinking about performing multiple computer instructions at the same time in order to speed up execution.
The resulting product was the CDC-6600 containing 10 parallel arithmetic, logic, and floating point units. Due to this massive parallelism, it was as much as 10 times as fast as its fastest competitor. When introduced in 1964 it was considered by many to be the first supercomputer. Even though it was extremely complex and difficult to program, more than 100 were sold, many for the booming field of nuclear weapons research.

The 6600 executed multiple instructions at the same time.This might be compared to the early days of automobile assembly where factories consisted of many individuals who built a complete car. The 6600 did not yet have an assembly line.

Seymour Cray immediately began work on its successor, which would use a new type of parallelism. Cray recognized that most of the logic in the 6600 execution units was idle at any moment in time. He surmised that by dividing instructions into steps, he could organize his execution units such that one step of several instructions could be performed at the same time. The architecture appeared very similar to an automotive assembly line, and today we know it as an execution pipeline. The resulting product executed programs 10 times the speed of the 6600 and became the CDC 7600.

In 1972 Cray left CDC to form Cray Research.Its first product was one of the most successful supercomputers in history, the Cray-1, lovingly called “The Beast.” The massively parallel beast featured 12 pipelined functional units and weighed 5.5 tons including Freon refrigeration. The first Cray-1 cost about $8 million and was installed at Los Alamos National Laboratory in 1976.

Seventeen years later Intel introduced the Pentium microprocessor featuring the P5 superscalar pipelined architecture.The 1993 Pentium approximated the performance of the Cray-1 and cost about $500. The benefits of parallelism had come to microprocessors.

The Good, the Bad, and the Ugly
The benefits of computer parallelism did not come for free.Parallel architectures compared to scalar architectures werelogically more complex, required more transistors, dissipated more heat, and were more difficult to program. The greater the parallelism, the greater the pain in each of these areas.

An execution pipeline breaks each instruction into a series of steps. The steps of many instructions are performed at the same time so many instructions are executed in parallel. The number of steps of the pipeline determines the number of instructions executing at the same time. The more instructions executing in parallel, the faster the programs run, just like on a physical assembly line … at least in theory.

An automotive assembly line produces only one model. Some special purpose computer pipelines execute only a single instruction over and over.An example is a pipelined multiplier sometimes used in high speed digital signal processing. The multiplier operation might require 4 clock cycles to complete, but if 4 multipliers are connected like an assembly line, a new multiply can be started each clock. A portion of four multiplies can be performed at the same time, and a multiply product can be produced on each clock. The results of this pipelined multiplier will be delivered 4 times faster than a single multiplier.

General purpose computers on the other hand execute many different instructions. A pipeline can accelerate a general purpose computer program if the many different instructions can be crafted so they each have the same number of steps.

One of the byproducts of reduced instruction set computer (RISC) architectures was the regular structure of their instructions.A RISC machine need not be pipelined, but RISC instructions could be pipelined much easier than the variable length instructions common in complex instruction set computers (CISC).

A simple RISC pipeline might break its instructions into 5 steps.

Each step required one clock period.Since one step from five different instructions was executed during each clock period, one instruction completed each clock, unless there was a “problem.”

In a perfect world, a CPU with a 5 stage pipeline would execute programs 5 times faster than a non-pipelined CPU. In the real world, not so much.

Two Steps Forward, One Step Back
Computer programs are sequential.That means that one instruction executes after another.Pipelines execute pieces of multiple instructions in parallel. It is therefore very easy for the pipeline to get the logical intent of the program out of sequence. A simple example is two consecutive operations that change the value of a register. The first instruction may execute the operation that changes the value of the register but the second instruction has fetched the register before the changed value has been written back.

There are several ways to resolve this synchronization problem. A simple instruction decoder can detect the possible problem and stall the pipeline logic long enough for the register value to be written back. The stall also slows down execution of the program.

A more sophisticated decoder can detect the problem and try to execute another instruction or instructions while waiting for the synchronization problem to be resolved. This is called “out of order execution.”1 It adds significant complexity to an architecture, but can do useful work during the computer cycles that would be stalled otherwise.

A computer’s architecture includes a fixed number of registers. These high-speed memory locations can be used to perform operations much faster than ordinary memory. Compilers optimize programs in order to perform operations using registers as often as possible. The primary disadvantage of registers is that they require instruction bits to select them. For example, five opcode bits can select 32 registers in an instruction. For architectures that specify two operands and a destination register, 15 bits of an instruction are consumed just to select registers. For this reason, very few CPU architectures include more than 32 registers.

If a register is in use by an instruction in the pipeline, that register is unavailable until the instruction has completed its modification of the register. Another instruction in the pipeline that uses the same register must wait until the register becomes available. Another technique to avoid stalling the CPU during this wait is called “register renaming.”2 A CPU may only have 16 registers defined by its instruction set, but in reality there might be 64 unnamed physical registers present.For example, the instruction decoder might be able to determine that register R4 is used by two different sections of a program, but one section does not depend on the results of the other. The instruction decoder could then rename one of the unnamed physical registers creating an additional temporary R4 register. When the execution sequence was completed, the temporary R4 register would be “retired.” Retired means the duplicate register would return to its unnamed state.

An even nastier problem occurs following a conditional branch; the deeper the pipeline, the nastier the problem.One Xeon version reportedly has a 31 deep pipeline.3 Each instruction takes 31 steps to complete. This means that the Xeon could possibly load 30 instructions into the pipeline, begin executing them, and have to stop. Upon fetching the 31st instruction a conditional branch that had been previously fetched could cause the program to jump to a different program area.

The situation is called “branch mis-prediction.” The entire pipeline must be reloaded and all the steps done on those 30 instructions will be lost.Furthermore, the power required to execute those 30 instructions will be wasted as heat.The situation will be even more wasteful if a memory access was required by those lost instructions.

Dozens of strategies have been invented to minimize the problems caused by branches within pipelines. The simplest branch prediction strategy is to assume that a branch will always be taken. This can be effective since branches are commonly found at the bottom of software loops, and loops usually run many iterations before completing.

All branches are not found in loops.Also loops can be nested such that sometimes the inner branch will not be taken but the outer branch will be. Sophisticated statistical circuitry can correctly predict many common patterns. That prediction circuitry increases logical complexity, requires thousands of transistors, and increases chip power dissipation.

To further consume transistors and dissipate power, some architectures do “speculative execution.”4 Such a CPU will fetch, decode, and execute instructions from multiple possible execution paths using multiple pipelines. When the state of the conditional branch is determined, the instructions from the unused execution path will be discarded and the correct pipeline selected. An aggressive speculative execution architecture can overcome inefficient branch prediction at the cost of lots of redundant circuitry.

Simultaneous Multithreading
A multi-issue deeply pipelined architecture trades off code acceleration for complexity and redundancy. Much of the redundant circuitry is idle during program execution.Sometimes this logic can be put to useful work with “Simultaneous Multithreading.”5 Intel calls their version “Hyper-threading.”6 Hyper-threading attempts to use the idle logic to run instructions from another program.For example, if one program is using the integer ALU and another one the floating point ALU, it might be possible to allocate steps of the pipeline to two or more independent programs.

Multithreading is implemented by creating multiple sets of registers and latches to hold the state of the multiple programs. The correct register set can be selected for the right pipeline resource by a very smart compiler. From the programmer’s perspective the multiple register sets appear to be multiple CPU cores. Unsophisticated computer buyers often view multithreading the same way. Multithreading sounds sort of like multi-core. Multi-core is good, so multithreading must be good too.

In reality, it is very difficult to find threads from multiple programs that can efficiently run on one execution unit. The improvement is very application dependent. Multithreading can actually slow down computer operation by causing additional cache misses requiring memory access.

According to Intel’s marketing description, Hyper-threading can:7

Run demanding applications simultaneously while maintaining system responsiveness
Provide faster response times for Internet and e-commerce applications, enhancing customer experiences
Increase the number of transactions that can be processed simultaneously
Utilize existing 32-bit applications technologies while maintaining 64-bit future readiness

A more objective source of Hyper-threading benchmarks found that in some cases, “… the slowdown due to Hyper-Threading was less than five percent.”8

A further analysis of the various problems and solutions regarding instruction level parallelism can be found in Ch 2 of John Hennessy and David Patterson’s book.9

What Does It All Mean?
The implementation of extensive CPU parallelism combined with Denard Scaling and Moore’s Law process enhancement delivered nearly half a century of unprecedented productivity growth directly tied to computer performance.

However that parallelism has reached an impenetrable wall of logical complexity and power consumption that renders further improvement incremental at best and may actually cause execution speeds to regress.

The modern microprocessor era kicked off with the penetrating 1980 RISC processor analysis by Patterson and Dave Ditzel.10 The insights of that study directly led to the SPARC CPU and indirectly to MIPS, the Pentium, and all of its descendants.

However the latest edition of Patterson’s book11 is definitely pessimistic:

“…the exploitation of ILP (instruction level parallelism) – namely, speculation – is inherently inefficient.”

“…we have reached – and, in some cases, possibly even surpassed – the point of diminishing returns in our attempt to exploit ILP.”

“…lower CPIs (clocks per instruction) are likely to lead to lower ratios of performance per watt, simple due to overhead.”

“The complexities of implementing these capabilities is likely to mean sacrifices in the maximum clock rate.”

“…the widest-issue processor…is the Itanium 2, but it also has the slowest clock rate, despite the fact that is consumes the most power!”

“…modern microprocessors are primarily power limited.”

To be properly and profoundly depressed regarding the state of computing, an engineer must read the recent works of Patterson, Kogge, Sterling, Dobberpuhl, Moore, Yelick, Colwell, or the several hundred or so lesser lights. These are some of the smartest individuals of the planet, and they essentially predict the end of computer improvement.

Patterson said it best in his recent interview with the New York Times regarding yet another dismal study. “I’m pretty sure it means we need to innovate, since we don’t want to die!”12

Another Path
The Venray Technology engineers, computer architects, and programmers have been confined by Patterson’s Walls for our entire careers. We have spent the past 7 years finding a path to break out.

In language that rings true 30 years later Patterson & Ditzel concluded their seminal work with these words, “While the trend towards architectural complexity may be one path towards improved computers, this paper proposes another path.”13

We at Venray Technology believe we may have found another path.

Related links:
The future of computers – Part 1: Multicore and the Memory Wall
Future of computers – Part 2: The Power Wall

About the author:
Russell Fish III’s three-decade career dates from the birth of the microprocessor. One or more of his designs are licensed into most computers, cell phones, and video games manufactured today.

Russell and Chuck Moore created the Sh-Boom Processor which was included in the IEEE’s “25 Microchips That Shook The World”. He has a BSEE from Georgia Tech and an MSEE from Arizona State.