VLIW DSPs can substantially increase instruction-level parallelism, providing the capacity to meet the performance and energy-efficiency requirements of sensor-based systems. In this paper, we present our methods and experience in developing a system toolkit flow for a VLIW DSP designed specifically for sensor-based systems. The toolkit includes a compiler, assembler, linker, debugger, and simulator. We present experimental results for the compiler framework, which incorporates several state-of-the-art optimization techniques for this VLIW DSP. The results indicate that the framework substantially improves performance and energy consumption compared with code generated without it.
|Published (Last):||16 August 2008|
Intel's vision for the evolution of architectural innovation, and of the core competencies enabling that evolution, is to achieve maximum parallelism and deliver the highest level of performance. This paper presents a comparative study of the advances in Intel's processor microarchitecture that implement instruction-level parallelism. Parallel processing has emerged as a key enabling technology in modern computers, driven by the need to handle concurrent events.
Parallel processing requires concurrent execution of many events in the computer. These concurrent events are attainable at various processing levels, such as job, module and instruction. Instruction-level parallelism (ILP) is realized by processor architectures that speed up execution by causing individual machine operations to execute in parallel.
Decisions must be taken about how multiple operations are executed, and these decisions can be handled by the processor hardware. Instruction-level parallelism architectures in which the hardware takes these decisions are called superscalar.
Superscalar architectures use special hardware to analyze the instruction stream at execution time and to determine which operations in the instruction stream are independent of one another.
These operations can then be issued and executed concurrently. This paper focuses on the evolution of Intel processors from the Intel Pentium to the Intel Sandy Bridge superscalar micro-architecture.
The central point of this paper is the development of superscalar micro-architecture design across the generations of Intel processors that implement instruction-level parallelism.
In their textbook on computer architecture, Blaauw and Brooks define the distinction between architecture and micro-architecture. Architecture defines the functional behavior of the processor. It specifies an instruction set that characterizes the functional behavior of an instruction set processor.
All software must be mapped to or encoded in this instruction set in order to be executed by the processor. Every program is compiled into a sequence of instructions in this instruction set. An implementation is a specific design of an architecture and is referred to as a microarchitecture. An instruction set architecture (ISA) can have many implementations over its lifetime. Every microarchitecture of an architecture can execute any program encoded in that ISA.
Attributes associated with a micro-architecture include pipeline design, cache memories and branch predictors. Microarchitecture features are generally implemented in hardware and hidden from software. The Pentium processor was Intel's first superscalar micro-architecture design, following the popular i486 CPU family. The design started with the primary goal of maximizing performance while preserving compatibility with existing x86 software.
Intel's P6 micro-architecture was designed to outperform all other x86 CPUs by a significant margin. Intel later launched the NetBurst microarchitecture in its new flagship Pentium 4 processor, the basis of a new family of Intel processors starting with the Pentium 4. It delivered significantly higher clock rates and targeted internet audio and streaming video, image processing, speech recognition, 3D applications and games, and multimedia and multitasking environments.
It includes streaming SIMD instructions that improve performance for multi-media, content creation, scientific and engineering applications. Gone are the complex instruction reorder buffers and register alias tables found in modern superscalar processors. In their place are more registers, more function units, and more branch predictors.
This design was a big step forward for Intel in the workstation and server markets. Intel then moved from desktop to mobile with the Pentium M, its first micro-architecture designed specifically for mobility.
It provides outstanding mobile performance, and its dynamic power management saves energy for longer battery life. Intel first introduced the Intel Core microarchitecture on its 65nm silicon process technology.
The first generation of this multi-core optimized microarchitecture extended the energy-efficient philosophy first delivered in the mobile microarchitecture of the Intel Pentium M processor and enhanced it with many new, leading-edge microarchitecture innovations for industry-leading performance, greater energy efficiency and more responsive multitasking.
These innovations include Intel Advanced Digital Media Boost. Processors based on Intel Core microarchitecture have delivered record-setting performance on leading industry benchmarks for desktop, mobile and mainstream server platforms.
A new microarchitecture codenamed Nehalem then launched, rewriting the book on processor energy efficiency, performance and scalability. This next-generation Intel microarchitecture is both dynamically scalable and design-scalable. At runtime, it dynamically manages cores, threads, cache, interfaces and power to deliver outstanding energy efficiency and performance on demand.
At design time, it scales, enabling Intel to easily provide versions that are optimized for the server, desktop and notebook markets.
Intel delivered versions differing in the number of cores, caches, interconnect capability and memory controller capability. Intel's processor clock has since tocked, delivering a next-generation architecture for PCs and servers. The new CPU, Sandy Bridge, is an evolutionary improvement over its predecessor, Nehalem, tweaking the branch predictor, register renaming, and instruction decoding.
The big changes in Sandy Bridge target multimedia applications such as 3D graphics, image processing, and video processing. The chip is Intel's first to integrate the graphics processing unit (GPU) on the processor itself. This integration not only eliminates an external chip but also improves graphics performance by coupling the GPU and the CPU more closely.
AVX accelerates many 3D-graphics and imaging applications. The new processor also adds hard-wired video encoding. Sandy Bridge first appeared in desktop and notebook processors branded as 2nd generation Intel Core processors. Instruction-level parallelism is discussed in the next section. Superscalar ILP architectures are described in section 3. The study of micro-architecture innovations is presented in section 4, and the conclusion appears in section 5.
Instruction-level parallelism is the lowest level of parallelism. At the instruction or statement level, a typical grain contains fewer than twenty instructions and is called fine grain. Depending on the individual program, fine-grain parallelism at this level ranges from two to thousands. The advantage of fine-grain computation lies in the abundance of parallelism.
ILP can be defined in various ways, two of which follow. Instruction-level parallelism can be defined as the amount of parallelism, measured in number of instructions, that can be achieved by issuing and executing multiple instructions concurrently. It may also be defined as the capability to examine a sequential instruction stream, identify independent instructions, issue multiple instructions per cycle, and send them to several execution units in parallel so as to fully utilize the available resources.
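The first definition can be made concrete with a small calculation. The following is a minimal sketch, not taken from the source: it assumes unit-latency instructions, unlimited execution units, and that only true (read-after-write) dependences limit parallelism. The `Instr` type, the register names and the `available_ilp` function are all hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction


def available_ilp(program):
    """Instruction count divided by the longest chain of true dependences.

    Assumes unit latency and unlimited execution units (illustrative only).
    """
    depth = []         # depth[i] = earliest cycle (1-based) in which instruction i can run
    last_writer = {}   # register -> index of the latest instruction that writes it
    for i, ins in enumerate(program):
        d = 1
        for reg in ins.srcs:
            if reg in last_writer:
                # Must wait one cycle after the producer of this operand.
                d = max(d, depth[last_writer[reg]] + 1)
        depth.append(d)
        last_writer[ins.dest] = i
    return len(program) / max(depth) if program else 0.0


program = [
    Instr("r1", ("r0",)),        # independent
    Instr("r2", ("r0",)),        # independent
    Instr("r3", ("r1", "r2")),   # needs r1 and r2
    Instr("r4", ("r0",)),        # independent
]
print(available_ilp(program))    # 4 instructions over a 2-step critical path
```

Under these assumptions the four instructions need only two dependent steps, so the sequence offers an ILP of 2; a fully serial chain would give 1.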
The outcome of instruction-level parallel execution is that multiple operations are simultaneously in execution. It is necessary to decide when and whether each operation should be executed. Superscalar ILP architectures schedule instructions at run time and execute multiple instructions on multiple execution units simultaneously.
Superscalar processors are based on sequential architecture. Superscalar machines incorporate multiple functional units to achieve greater concurrent processing of multiple instructions and higher execution throughput.
A superscalar processor makes a great effort to issue an instruction every cycle so as to execute many instructions in parallel, even though the program is handed to the hardware sequentially. With every instruction that a superscalar processor issues, it must check whether the instruction's operands interfere with the operands of any other instruction in flight.
Once an instruction is found to be independent of all other instructions in flight, the hardware must also decide exactly when and on which available functional unit to execute it. Superscalar processors rely on hardware to schedule instructions, which is called dynamic instruction scheduling. Figure 1 shows superscalar execution: the structure of a superscalar processor of degree 3 with multiple execution units.
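The operand check and unit selection described above can be sketched in a few lines. This is an illustrative model only, not the logic of any particular Intel processor: instructions are represented as dictionaries with hypothetical `dest`, `srcs` and `kind` fields, and the functions `hazards` and `try_issue` are names invented for this example. Real designs typically remove the WAR and WAW cases through register renaming, which this sketch does not model.

```python
def hazards(new, in_flight):
    """Return the kinds of operand interference between `new` and older instructions."""
    found = set()
    for old in in_flight:
        if old["dest"] in new["srcs"]:
            found.add("RAW")   # new reads a value that old has not yet produced
        if new["dest"] in old["srcs"]:
            found.add("WAR")   # new would overwrite a register old still has to read
        if new["dest"] == old["dest"]:
            found.add("WAW")   # both write the same register
    return found


def try_issue(new, in_flight, free_units):
    """Return the name of a free unit that can run `new`, or None if it must wait."""
    if hazards(new, in_flight):
        return None
    for name, kind in free_units.items():
        if kind == new["kind"]:
            return name
    return None


in_flight = [{"dest": "r1", "srcs": ("r2", "r3"), "kind": "alu"}]
free_units = {"alu1": "alu", "fpu0": "fpu"}

dependent = {"dest": "r4", "srcs": ("r1",), "kind": "alu"}
independent = {"dest": "r5", "srcs": ("r6",), "kind": "alu"}

print(try_issue(dependent, in_flight, free_units))     # waits on r1
print(try_issue(independent, in_flight, free_units))   # can use a free ALU
```

The dependent instruction is held back because it reads `r1`, while the independent one is assigned to the free ALU in the same cycle.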
To be fully utilized, a superscalar processor of degree m must issue m instructions per cycle to execute in parallel at all times. If ILP of m is not available, stalls and dead time result while instructions wait for the results of previous instructions.
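The cost of insufficient ILP on a degree-m machine can be illustrated with a toy cycle count. The sketch below rests on stated assumptions rather than on the source: instructions issue in order, each takes one cycle, a result becomes usable in the cycle after its producer issues, and there are always enough functional units. The function name `cycles_needed` and the register names are hypothetical.

```python
def cycles_needed(program, m):
    """Cycles to issue `program` in order on a machine that issues up to m per cycle.

    Each instruction is a (dest, srcs) pair; unit latency is assumed.
    """
    ready_at = {}   # register -> first cycle in which its value can be read
    cycle = 0
    i = 0
    while i < len(program):
        cycle += 1
        issued = 0
        while i < len(program) and issued < m:
            dest, srcs = program[i]
            if any(ready_at.get(reg, 0) > cycle for reg in srcs):
                break   # in-order issue: a stalled instruction blocks those behind it
            ready_at[dest] = cycle + 1
            issued += 1
            i += 1
    return cycle


independent = [("r1", ("a",)), ("r2", ("b",)), ("r3", ("c",))]
chain = [("r1", ("a",)), ("r2", ("r1",)), ("r3", ("r2",))]

print(cycles_needed(independent, 3))   # all three issue together
print(cycles_needed(chain, 3))         # each instruction waits for the previous one
```

With three independent instructions the degree-3 machine finishes issuing in one cycle, whereas the dependent chain takes three cycles on the same machine, so two of the three issue slots sit idle in every cycle.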
Throughout history, new and improved technologies have transformed the human experience. In the 20th century, the pace of change sped up radically as we entered the computing age. For nearly 40 years Intel innovations have continuously created new possibilities in the lives of people around the world.
Intel co-founder Gordon Moore predicted that the number of transistors on a chip would double about every two years. Since then, Moore's Law has fueled a technology revolution as Intel has exponentially increased the number of transistors integrated into its processors for greater performance and energy efficiency. Figure 2 shows the evolution of Intel's micro-architecture from Pentium to Sandy Bridge. Intel started implementing instruction-level parallelism with its first superscalar design, the Pentium. The most important enhancements over the i486 are the separate instruction and data caches, the dual integer pipelines (the U-pipeline and the V-pipeline, as Intel calls them), branch prediction using the branch target buffer (BTB), the pipelined floating-point unit, and the 64-bit external data bus.
Even-parity checking is implemented for the data bus and the internal RAM arrays (caches and TLBs). Pentium was the first high-performance microprocessor to include a system management mode (SMM) like those found on power-miserly processors for notebooks and other battery-based applications; Intel was holding to its promise to include SMM on all new CPUs.
The integer data path is in the middle, while the floating-point data path is on the side opposite the data cache. In contrast to other superscalar designs, Pentium's integer data path is actually bigger than its FP data path. This is an indication of the extra logic associated with complex instruction support.
Intel followed with the P6 micro-architecture, whose deep pipeline eliminates the cache-access bottlenecks that restrict its competitors to lower clock speeds. In addition, the Intel design uses a closely coupled secondary cache to speed memory accesses, a critical issue for high-frequency CPUs.
Intel will combine the P6 CPU and a cache chip into a single PGA package, reducing the time needed for data to move from the cache to the processor. Like some of its competitors, the P6 translates x86 instructions into simple, fixed-length instructions that Intel calls micro-operations, or uops (pronounced "you-ops"). These uops are then executed in a decoupled superscalar core.
Intel has given the name dynamic execution to this particular combination of features, which is neither new nor unique, but highly effective in increasing x86 performance. The P6 also implements a new system bus with increased bandwidth compared to the Pentium bus.
The new bus is capable of supporting up to four P6 processors with no glue logic, reducing the cost of developing and building multiprocessor systems. This feature set makes the new processor particularly attractive for servers; it will also be used in high-end desktop PCs and, eventually, in mainstream PC products. The P6 team threw out most of the design techniques used by the i486 and Pentium and started from a blank piece of paper to build a high-performance x86-compatible processor.
A New Direction for Computer Architecture Research
In order to establish what was its first new ISA in 20 years and bring an entirely new product line to market, Intel made a massive investment in product definition, design, software development tools, OS, software industry partnerships, and marketing. To support this effort, Intel created the largest design team in its history and a new marketing and industry-enabling team completely separate from x86. The first Itanium processor was codenamed Merced. The Itanium architecture is based on explicit instruction-level parallelism, in which the compiler decides which instructions to execute in parallel. This contrasts with superscalar architectures, which depend on the processor to manage instruction dependencies at runtime. In all Itanium models, up to and including Tukwila, cores execute up to six instructions per clock cycle. HP began to be concerned that reduced instruction set computing (RISC) architectures were approaching a processing limit at one instruction per cycle.
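The contrast between compiler-decided and hardware-decided parallelism can be illustrated with a toy compile-time grouping pass. This is a simplified sketch and does not model the real Itanium bundle or template encoding: it assumes a greedy, in-order pass that closes a group whenever the next instruction reads or writes a register already written in the current group, or when the group reaches the issue width of six mentioned above. The function name `form_groups` and the register names are hypothetical.

```python
def form_groups(program, width=6):
    """Greedily split an instruction list into groups that may execute in parallel.

    Each instruction is a (dest, srcs) pair. A group never contains an
    instruction that depends on a register written earlier in the same group.
    """
    groups = []
    current = []
    written = set()   # registers written by instructions in the current group
    for dest, srcs in program:
        depends = dest in written or any(reg in written for reg in srcs)
        if current and (depends or len(current) == width):
            groups.append(current)   # close the group before the dependent instruction
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


program = [
    ("r1", ("a",)),
    ("r2", ("b",)),
    ("r3", ("r1", "r2")),   # depends on both earlier results
    ("r4", ("c",)),
]
for group in form_groups(program):
    print([dest for dest, _ in group])
```

Here the grouping is fixed before the program runs, so the hardware needs no dependence-checking logic between instructions of the same group; a superscalar processor would instead discover the same independence at run time.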
Micro-Architectures Evolution in Generations of Intel’s Processors
Kozyrakis David A. We assume a billion-transistor implementation for the Trace and IA architecture. Table 1 summarizes the basic features of the billion-transistor implementations for the proposed architectures as presented in the corresponding references. For the Trace Processor and IA, descriptions of billion-transistor implementations have not been presented, hence certain features are speculated. The first two architectures (Advanced Superscalar and Superspeculative Architecture) have very similar characteristics.
Computer Architecture: A Quantitative Approach International Stud