performance - Why do compilers have to be "smarter" for RISC instruction set architecture

2014-07
  • Computernerd

    A Reduced Instruction Set Computer (RISC) aims to reduce the number of instructions, thereby improving performance. The only downside to this approach is that the compilers have to be "smarter".

    What did my lecturer mean when she said "compilers have to be smarter", and why is this so?

  • Answers
  • Daniel R Hicks

    RISC, when honestly stated, stands for "Reduced Instruction Set Complexity" -- the number of instructions is not necessarily reduced, but each instruction is simpler, in terms of the machine cycles required to execute it and in terms of the number of gates (or microcode store) devoted to implementing it.

    The theory (which is at least partially realized) is that by reducing the amount of control logic, more chip space is available for registers and data path. Hence RISC machines typically have 2-4 times as many registers as their CISC counterparts.

    This leaves the compiler to do the work of that omitted control logic, including "scheduling" operations (sequencing them) so that, say, you don't do two adds back-to-back but do an add then a shift (and on different registers) so both the adder and the shifter are optimally utilized. And the compiler must also manage the register set, to optimize movement into and out of registers, minimizing storage accesses. Plus the compiler must know how to best utilize the odd instructions (such as a "shift left one and mask with literal"), as these usually have some (perhaps strange) scenario where they are relatively powerful.
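    The scheduling task described above can be sketched in a toy form. This is a hypothetical illustration (all names and the greedy strategy are made up for the example, not any real compiler's algorithm): given independent operations, interleave them so the adder and the shifter are never both idle.

```python
# Toy sketch of compiler instruction scheduling: avoid issuing two
# operations on the same functional unit back-to-back when an
# independent operation for a different unit is still pending.

def schedule(ops):
    """Greedy scheduler over (unit, register) pairs: prefer an op
    whose unit differs from the previously issued one."""
    pending = list(ops)
    issued = []
    last_unit = None
    while pending:
        pick = next((op for op in pending if op[0] != last_unit), pending[0])
        pending.remove(pick)
        issued.append(pick)
        last_unit = pick[0]
    return issued

ops = [("add", "r1"), ("add", "r2"), ("shift", "r3"), ("shift", "r4")]
print(schedule(ops))
# adds and shifts now alternate instead of running back-to-back
```

    A real scheduler must also respect data dependencies between operations; this sketch assumes all four are independent.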

    As a result, the instructions generated by a good RISC compiler are virtually impossible to decipher. Even if you know the instruction set well, figuring out that some value from half an hour ago is still in register 12 is difficult at best, even if it weren't for the convoluted shift and mask operations occurring all the time.

    (For those who apparently don't believe I know what I'm talking about, I was first involved with RISC on the IBM 801, in the early 70s, and I was on a first-name basis with George Radin and Marty Hopkins.)

  • ultrasawblade

    Since there are fewer instructions in a RISC CPU, there's less of a chance that a single high-level statement will translate nicely to a single machine language opcode.

    A synonym for RISC CPU is "load-store architecture." Basically, it means that RISC instructions that actually do work generally operate on registers only. If you want to work on values stored in RAM, you have to issue explicit LOAD instructions, whereas CISC CPUs like x86 have instructions that do that automatically. RISC CPUs have historically had more registers than x86, and good code will manage the available registers well to avoid unnecessary memory accesses, meaning a compiler needs to take that into account.
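    The register-management point can be made concrete with a small counting sketch (the variable names and the two policies are hypothetical, chosen just to show the effect): on a load-store machine every value must be LOADed into a register before use, so keeping an already-loaded value in a register saves a memory access.

```python
# Sketch of why register allocation matters on a load-store machine:
# count the explicit LOADs needed for a sequence of variable uses.

def count_loads(uses, keep_in_registers):
    """If keep_in_registers, a variable is loaded from memory once
    and reused from its register on every later use."""
    loads = 0
    in_regs = set()
    for var in uses:
        if keep_in_registers and var in in_regs:
            continue          # value is still sitting in a register
        loads += 1            # explicit LOAD from memory
        in_regs.add(var)
    return loads

uses = ["x", "y", "x", "x", "y"]
print(count_loads(uses, keep_in_registers=False))  # naive: 5 loads
print(count_loads(uses, keep_in_registers=True))   # allocated: 2 loads
```

    A real allocator has only a finite register file and must sometimes spill values back to memory; this sketch assumes enough registers for everything.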

    Another thing is that RISC CPUs typically only provide the minimum "infrastructure" needed for linkage.

    For example, x86 CPUs have a notion of a "stack", where you can push values, and later "pop" them off (there are PUSH and POP instructions). There is also a CALL instruction - it will push the current instruction pointer on the stack and then jump to the destination address - typically a subroutine or function. A RET instruction can then be later issued to pop that saved instruction pointer off and resume from the original function. Nesting subroutines is convenient, and you can use PUSH and POP to pass parameters to subroutines easily.

    On MIPS, for example, all you have is jal, for "Jump and Link" - it puts the current instruction pointer in a register and then jumps to the destination address. If you want something like a stack or the x86 CALL instruction, you have to build it manually. This requires more intelligence from the compiler.
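    The linkage difference can be sketched in Python (a toy model, not real machine code: the `jal` function, the `ra` variable standing in for the $ra link register, and the software stack are all illustrative): jal only copies the return address into a register, so to nest calls the compiler must save that register itself.

```python
# Toy model of MIPS-style linkage: jal records the return address in
# a link register; nesting calls requires the compiler to spill that
# register to a software-managed stack, which x86 CALL does for free.

stack = []   # the compiler's manually managed call stack
ra = None    # stands in for the $ra link register

def jal(target, return_address):
    """Jump-and-link: record where to come back to, then 'jump'."""
    global ra
    ra = return_address
    return target()

def nested_callee():
    return "inner"

def callee():
    # Before calling another function we must save $ra ourselves,
    # or the inner jal would overwrite our own return address.
    stack.append(ra)
    result = jal(nested_callee, "back_in_callee")
    restored = stack.pop()    # recover our return address
    return (result, restored)

print(jal(callee, "back_in_main"))
# → ('inner', 'back_in_main')
```

    On real MIPS this save/restore is exactly the spill of $ra into the stack frame that compilers emit in every non-leaf function's prologue.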

  • Fazer87

    CISC (Complex Instruction Set Computing) processors have a larger range of instructions available to them than RISC (Reduced Instruction Set Computing) processors.

    An example of multiplication in CISC would be: MUL 1:3, 4:2 (multiply the values at 1:3 and 4:2). This command would load the value at position 1:3 into a register, load the value at 4:2, multiply them together, and store the result back to 1:3.

    A RISC CPU would have to:

    • LOAD A, 1:3
    • LOAD B, 4:2
    • PROD A, B
    • STORE 1:3, A

    ...4 RISC operations to 1 CISC Operation.
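    The four-step sequence above can be emulated directly (the memory addresses and values are the made-up ones from the example): two explicit LOADs, a register-to-register multiply, then an explicit STORE.

```python
# Toy emulation of the four-instruction RISC sequence above, using
# the hypothetical addresses from the example.

mem = {"1:3": 6, "4:2": 7}   # made-up memory contents
regs = {}

regs["A"] = mem["1:3"]          # LOAD A, 1:3
regs["B"] = mem["4:2"]          # LOAD B, 4:2
regs["A"] = regs["A"] * regs["B"]  # PROD A, B
mem["1:3"] = regs["A"]          # STORE 1:3, A

print(mem["1:3"])  # → 42
```

    The single CISC MUL performs all four of these steps itself, hidden inside its microcode.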

    Because RISC requires more operations even for the simplest multiplication, imagine how much more work is required for things like video rendering or gaming.

    With this in mind, the compilers that build the software from the code entered by a programmer need to be "smarter", so that they know how to break complex pieces of code and complex commands down into the simpler instructions available on a RISC architecture.

    Hope this makes sense. For further reading, it may be worth looking at: http://www.engineersgarage.com/articles/risc-and-cisc-architecture?page=5


  • Related Question

    cpu - What are "Instructions per Cycle"?
  • Matt Simmons

    I've been learning a little bit more about how processors work, but I haven't been able to find a straight answer about instructions per cycle.

    For instance, I was under the impression that a four-core CPU could execute four instructions per cycle, so a four-core CPU running at 2 GHz would execute 8 billion operations per second. Is this the case?

    I'm sure it's oversimplifying things, but if there's a guide or something else I can use to set myself straight, I'm definitely open to ideas.


  • Related Answers
  • Janus Troelsen

    The keywords you should probably look up are CISC, RISC and superscalar architecture.

    CISC

    In a CISC architecture (x86, 68000, VAX) one instruction is powerful, but it takes multiple cycles to process. In older architectures the number of cycles was fixed; nowadays the number of cycles per instruction usually depends on various factors (cache hit/miss, branch prediction, etc.). There are tables in which to look that stuff up. Often there are also facilities to actually measure how many cycles a certain instruction takes under certain circumstances (see performance counters).

    If you are interested in the details for Intel, the Intel 64 and IA-32 Optimization Reference Manual is a very good read.

    RISC

    In a RISC architecture (ARM, PowerPC, SPARC), one very simple instruction usually takes only a few cycles (often only one).

    Superscalar

    But regardless of CISC or RISC there is the superscalar architecture. The CPU is not processing one instruction after another but is working on many instructions simultaneously, very much like an assembly line.

    The consequence is: if you simply look up the cycles for every instruction of your program and then add them all up, you will end up with a number way too high. Suppose you have a single-core RISC CPU. The time to process a single instruction can never be less than the time of one cycle, but the overall throughput may well be several instructions per cycle.
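    The latency-versus-throughput point can be put in numbers (the stage count and instruction count are arbitrary example values): in an ideal k-stage pipeline the first instruction still takes k cycles, but one instruction completes every cycle after that.

```python
# Ideal pipeline timing: n instructions through a k-stage pipeline
# finish in k + (n - 1) cycles, not n * k cycles.

def pipelined_cycles(n_instructions, stages):
    return stages + (n_instructions - 1)

n, k = 1000, 5
total = pipelined_cycles(n, k)
print(total)                  # 1004 cycles, not 5000
print(round(n / total, 3))    # throughput approaches 1 instr/cycle
```

    This is why summing per-instruction cycle counts overestimates the real execution time so badly: the per-instruction latencies overlap.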

  • Kevin Panko

    The way I like to think of it is with a laundry analogy. CPU instructions are like loads of laundry. You need to use both the washer and the dryer for each load. Let's say that each takes 30 minutes to run. That is the clock cycle. Old CPUs would run the washer, then run the dryer, taking 60 minutes (2 cycles) to finish each load of laundry, every time.

    Pipelining: A pipeline is when you use both at the same time -- you wash a load, then while it is drying, you wash the next load. The first load takes 2 cycles to finish, but the second load is finished after 1 more cycle. So, most loads only need 1 cycle, except the first load.

    Superscalar: Take all the laundry to the laundromat. Get 2 washers and load them both. When they are done, find 2 dryers and use them both. Now you can wash and dry 2 loads in 60 minutes. That is 2 loads in 2 cycles. Each load still takes 2 cycles, but you can do more of them now. Average time is now 1 load per cycle.

    Superscalar with Pipelining: Wash the first 2 loads, then while these are drying, load up the washers with the next 2 loads. Now, the first 2 loads still take 2 cycles, and then the next 2 are finished after 1 more cycle. So, most of the time, you finish 2 loads in each cycle.

    Multiple cores: Give half of your laundry to your mother, who also has 2 washers and 2 dryers. With both of you working together, you can get twice as much done. This is similar to superscalar, but slightly different. Instead of you having to move all laundry to and from each machine yourself, she can do that at the same time as you.

    This is great, we can do eight times more laundry than before in the same amount of time, without having to create faster machines. (Double the clock speed: Washing machines that only need 15 minutes to run.)
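    The laundry arithmetic above can be sketched as a tiny formula (the batch model and numbers are just the analogy's, idealized with no bubbles or stalls): with w washer/dryer pairs and pipelining, the first batch of w loads takes 2 cycles, and each further batch finishes 1 cycle later.

```python
# Putting numbers on the laundry analogy: cycles to finish a given
# number of loads with w washer/dryer pairs, washing and drying
# overlapped (pipelined), assuming no stalls.

import math

def cycles_for_loads(loads, washers):
    batches = math.ceil(loads / washers)
    return 2 + (batches - 1)   # first batch: wash + dry; rest overlap

print(cycles_for_loads(8, 1))  # pipelined, one machine: 9 cycles
print(cycles_for_loads(8, 2))  # superscalar + pipelined: 5 cycles
```

    Compare with the old one-load-at-a-time CPU, which would need 2 cycles per load: 16 cycles for the same 8 loads.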

    Now, let's talk about how things go wrong:

    Pipeline bubble: You have a stain that did not come out in the wash, so you decide to wash it again. Now the dryer is just sitting there, waiting for something to do.

    Cache Miss: The truck that delivers the dirty laundry is stuck in traffic. Now you have 2 washers and 2 dryers, but you are getting no work done because you have to wait.

    Depending on how often things go wrong, we will not be able to always get 4 loads done every cycle, so the actual amount of work done may vary.

    Branch Prediction: Well, you start doing laundry on your clean clothes in case you stain them later so they will be clean already ... okay, this is where the analogy breaks down ...

  • hyperslug

    Not exactly. The cycle you're referring to is the clock cycle, and since most modern processors pipeline, it takes several clock cycles for one instruction to execute. (This is a good thing because it allows other instructions to begin execution even before the 1st instruction finishes.) Assuming the most ideal circumstances, it would probably be around 8 billion instructions per second, but all sorts of things happen like dependencies, bubbles in the pipeline, branches, etc., so it doesn't always work out.

    Sorry, it's way too complicated for a straight answer. Jon Stokes does a good job of explaining it with this article.

  • dmckee

    The days when one could look up (or even memorize) the cycle time for each instruction and know how many clocks it would take for a certain bit of code to finish are long past for high-end chips (but are still with us in some micro-controllers). A modern, general purpose CPU core may have multiple copies of several different execution units in multiple pipelines, accessing a multi-stage memory cache with its own logic, plus branch prediction and speculative execution capability. Having multiple cores on a single die drags in cache coherence logic, and other complexities.

    So the short answer is: more cores means more capacity to get things done, but not in a nice, predictable way.

  • Synetech

    Ludwig explained the difference between CISC and RISC, but forgot to mention that while RISC instructions are simple and quick, they do little individually, and so you must string several together to do the same thing as a single instruction in a CISC processor. As a result, some RISC instructions will be faster, others will not.

  • Joakim Elofsson

    Cycles are more of a per-core concept. Each core does its own cycles in parallel.