How do fabrication process and compiler quality affect CPU performance?

2014-07
  • Computernerd

    I was reading through my lecture notes, and one of the slides listed the factors affecting CPU performance. I can't understand how fabrication process and compiler quality affect CPU performance.

  • Answers
  • Fazer87

    Compiler quality is the easier one...

    Good compilers know how to translate code into CPU instructions efficiently.

    Imagine you have a piece of software which does a simple maths operation - say 1+1. A smartly compiled application will tell the CPU to add the numbers, store the answer, and the job is done. This can be represented as:

    • Set memory 0 to 1
    • Set memory 1 to 1
    • Add memory 0 to memory 1
    • Store the result in memory 0 ...simple!

    Bad compilers (and I've seen a few!) will do the same thing, but will issue loads of additional instructions to do it, which reduces performance and slows the application down. The same example:

    • Set memory 0
    • Set memory 1
    • Set memory 0 to 0
    • Set memory 1 to 0
    • Set memory 0 to 1
    • Set memory 1 to 1
    • Recall the values from memory 0 and 1
    • Add them together
    • Store the result in memory 0

    Now bear in mind that a complex application like a video editor, graphics application, game, or even a word processor may need to do hundreds of thousands (if not tens of millions) of operations just to launch! That's the impact of a good compiler! (A rough C sketch of the same idea follows below.)
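    As a hedged illustration of the idea above (the function name is my own, and the exact machine code depends on the compiler and flags): the tiny C function below does the same 1+1 work. An optimizing build can usually fold it down to the equivalent of "return 2", while an unoptimized build tends to emit a store and reload for each step, much like the "bad compiler" list.

    ```c
    #include <stdio.h>

    /* The same 1 + 1 example from the lists above.
     * With optimization (e.g. gcc -O2) this typically compiles to the
     * equivalent of "return 2": the addition is done at compile time.
     * Without optimization (-O0) the compiler typically stores a and b
     * on the stack, reloads them, adds, and stores the result again. */
    static int add_one_plus_one(void)
    {
        int a = 1;        /* set "memory 0" to 1 */
        int b = 1;        /* set "memory 1" to 1 */
        return a + b;     /* add and hand back the result */
    }

    int main(void)
    {
        printf("%d\n", add_one_plus_one());
        return 0;
    }
    ```

    Comparing the assembly from `gcc -O0 -S` against `gcc -O2 -S` on this file makes the difference in instruction count visible.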

    Fabrication process is an extension of this, in that fabrication is the "gluing" of multiple applications together through shared functions. If this is done well, less computing power is needed to accomplish the same end result.

  • (second answerer)

    The quality (optimization ability) of the compiler determines how well the machine code maps to the hardware resources. Compiler optimizations can:

    • reduce the amount of work performed (e.g., unrolling a loop can reduce the number of branches, register allocation can reduce the number of memory accesses, inlining can remove call overhead and remove code that is unused by the specific caller);
    • schedule the work to avoid waiting (e.g., scheduling loads earlier so that dependent instructions do not have to wait);
    • exploit specialized instructions that do the work more efficiently (e.g., vectorization can use SIMD instructions);
    • organize memory accesses to exploit cache behavior (e.g., transforming an array of structures into a structure of arrays when inner loops only touch a few members of the structure).
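    As a small sketch of that last point (the type and function names here are my own invention, not from the answer): when a hot loop only reads one field, a structure-of-arrays layout keeps the data the loop touches contiguous, so more of each fetched cache line is actually used.

    ```c
    #include <stddef.h>

    /* Array-of-structures: each particle's fields are interleaved.
     * A loop that only sums x still drags y, z and mass through the cache. */
    struct particle_aos { float x, y, z, mass; };

    float sum_x_aos(const struct particle_aos *p, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++)
            sum += p[i].x;      /* uses 4 of every 16 bytes fetched */
        return sum;
    }

    /* Structure-of-arrays: each field lives in its own contiguous array,
     * so the same loop touches only the bytes it needs and is also easier
     * for the compiler to vectorize with SIMD loads. */
    struct particles_soa { float *x, *y, *z, *mass; };

    float sum_x_soa(const struct particles_soa *p, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++)
            sum += p->x[i];     /* every fetched byte is useful */
        return sum;
    }
    ```

    Whether the compiler actually vectorizes either loop depends on the target and flags (e.g., -O2/-O3), so this illustrates the data-layout idea rather than a guaranteed speedup.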

    (Some compiler optimizations apply to all or most hardware; others are more specific to particular hardware implementations. Also, even though hardware support for out-of-order execution improves the execution of less well scheduled code, good instruction scheduling can still provide a measurable, if small, benefit.)

    Fabrication process determines the energy use, switching speed, and area used by transistors (and similar characteristics of other components). Obviously, transistors that switch faster allow for higher performance.

    Reducing the area per transistor allows more transistors to be used in an economically manufacturable chip (which can be translated into more performance) and can reduce communication time between components (e.g., the latency of cache access is constrained by distance and not just transistor switching speed).

    Energy use constrains performance: to some degree, the more power that must be delivered, the more "pins" [solder balls] must be used to deliver that power, reducing the number potentially available for communication off chip to memory, I/O, or other processors; extracting the waste heat also presents an economic limit. Lower switching energy means that more work can be done within a given power budget; lower idle ("leakage") power means that more transistors can be kept powered and ready to do work (this is perhaps particularly important for SRAM, which must be always powered to retain state).
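    For a rough sense of why switching energy matters (this is the standard first-order CMOS approximation, not something stated in the answer above), dynamic power scales roughly as

    P_dynamic ≈ α · C · V² · f

    where α is the fraction of transistors switching each cycle, C is the switched capacitance, V is the supply voltage, and f is the clock frequency. Because voltage enters quadratically, a process that can run at a lower voltage for the same switching speed frees up a disproportionate amount of the power budget for more transistors or higher clocks.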


  • Related Question

    memory - How does the CPU write information to RAM?
  • Questioner

    My question is, how does the CPU write data to ram?

    From what I understand, modern CPUs use different levels of cache to speed up RAM access. The RAM gets a command for information and then sends a burst of data to the CPU, which stores the required data (and a bunch of extra data that was close to the address the CPU wanted) in the highest-level cache; the CPU then progressively asks the different caches to send smaller and smaller chunks of data down the levels of caches until it is in the level 1 cache, which then gets read directly into a CPU register.

    How does this process work when the CPU writes to memory? Does the computer go backwards down the levels of cache (in reverse order as compared to read)? If so, what about synchronizing the information in the different caches with the main memory? Also, how is the speed of a write operation compared to a read one? What happens if I'm continuously writing to RAM, such as in the case of a bucket sort?

    Thanks in advance,

    -Faken

    Edit: I still haven't really gotten an answer which I can fully accept. I especially want to know about the synchronization part of the RAM write. I know that we write to the L1 cache directly from the CPU and that data gets pushed down the cache levels as the different levels of caches are synchronized, and eventually the main RAM gets synchronized with the highest-tier cache. However, what I would like to know is WHEN the caches synchronize with each other and with main RAM, and how fast writes are in relation to read commands.


  • Related Answers
  • Skizz

    Ah, this is one of those simple questions that have really complex answers. The simple answer is, well, it depends on how the write was done and what sort of caching there is. Here's a useful primer on how caches work.

    CPUs can write data in various ways. Without any caching, the data is stored in memory straight away and the CPU waits for the write to complete. With caching, the CPU usually stores data in program order, i.e. if the program writes to address A and then address B, then memory A will be written before memory B, regardless of the caching. The caching only affects when the physical memory is updated, and this depends on the type of caching used (see the above link). Some CPUs can also store data non-temporally, that is, the writes can be re-ordered to make the most of memory bandwidth. So, writing to A, then B, then (A+1) could be reordered to writing to A then A+1 in a single burst, then B.
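    As a small illustration of non-temporal ("streaming") stores on x86 (this example is mine, not Skizz's, and it assumes SSE2 and a compiler that ships the Intel intrinsics headers): `_mm_stream_si128` writes 16 bytes while bypassing the cache, and `_mm_sfence` makes the buffered streaming stores visible before any later stores.

    ```c
    #include <emmintrin.h>   /* SSE2 intrinsics: _mm_stream_si128, _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>

    /* Fill a 16-byte-aligned buffer using non-temporal stores.
     * The data goes (via write-combining buffers) to memory without
     * displacing useful lines from the cache - handy when the buffer
     * will not be read again soon. */
    void fill_streaming(int32_t *dst, size_t count, int32_t value)
    {
        __m128i v = _mm_set1_epi32(value);      /* four copies of value */
        for (size_t i = 0; i + 4 <= count; i += 4)
            _mm_stream_si128((__m128i *)(dst + i), v);
        _mm_sfence();   /* ensure the streaming stores are globally visible
                           before any subsequent ordinary stores */
    }
    ```

    Whether this is actually faster than ordinary cached stores depends heavily on the access pattern; it mainly helps for large buffers that are written once and not read back soon.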

    Another complication is when more than one CPU is present. Depending on the way the system is designed, writes by one CPU won't be seen by other CPUs because the data is still in the first CPU's cache (the cache is dirty). In multiple-CPU systems, making each CPU's cache match what is in physical memory is termed cache consistency (more commonly called cache coherence). There are various ways this can be achieved.
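    A hedged sketch of how software usually deals with this from the programmer's side (the names below are my own, and it relies on C11 <stdatomic.h>/<threads.h> support): a release store paired with an acquire load tells the compiler and CPU that the data written before the flag must be visible to the other core once the flag is seen, rather than relying on any particular cache timing.

    ```c
    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    static int payload;              /* ordinary data written by one core */
    static atomic_int ready = 0;     /* flag published with release semantics */

    static int producer(void *arg)
    {
        (void)arg;
        payload = 42;                                            /* write the data first */
        atomic_store_explicit(&ready, 1, memory_order_release);  /* then publish it */
        return 0;
    }

    static int consumer(void *arg)
    {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                   /* spin until published */
        printf("payload = %d\n", payload);      /* guaranteed to see 42 */
        return 0;
    }

    int main(void)
    {
        thrd_t p, c;
        thrd_create(&c, consumer, NULL);
        thrd_create(&p, producer, NULL);
        thrd_join(p, NULL);
        thrd_join(c, NULL);
        return 0;
    }
    ```

    The cache-coherence hardware is what actually moves the dirty line between the cores; the acquire/release pair only constrains the order in which the two writes may become visible.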

    Of course, the above is geared towards Pentium processors. Other processors can do things in other ways. Take, for example, the PS3's Cell processor. The basic architecture of a Cell CPU is one PowerPC core with several Cell cores (on the PS3 there are eight cells, one of which is always disabled to improve yields). Each cell has its own local memory, sort of an L1 cache, which is never written to system RAM. Data can be transferred between this local RAM and system RAM using DMA (Direct Memory Access) transfers. The cell can access system RAM and the RAM of other cells using what appear to be normal reads and writes, but this just triggers a DMA transfer (so it's slow and really should be avoided). The idea behind this system is that the game is not just one program, but many smaller programs that combine to do the same thing (if you know *nix then it's like piping command-line programs to achieve more complex tasks).

    To sum up, writing to RAM used to be really simple in the days when CPU speed matched RAM speed, but as CPU speed increased and caches were introduced, the process became more complex with many different methods.

    Skizz

  • Am1rr3zA

    Yes, it goes backwards down the levels of cache and is saved to memory. The important note is that in a multiprocessing system the caches are shared between two or more processors (cores) and the data must be kept consistent. This is done either by having a shared cache for all the processors, or by separate caches that are kept consistent (if data in one cache is changed, it is forced to be written to memory and the other caches are updated).