ABSTRACT
Customized instructions (CIs) implemented using custom functional units (CFUs) have been proposed as a way of improving the performance and energy efficiency of software while minimizing the cost of designing and verifying accelerators from scratch. However, previous work allows CIs to communicate with the processor only through registers, or with limited memory operations. In this work we propose an architecture that allows CIs to seamlessly execute memory operations without any special synchronization operations to guarantee program order of instructions. Our results show that our architecture can provide 24% energy savings with 14% performance improvement for 2-issue and 4-issue superscalar processor cores.

Categories and Subject Descriptors: C.1.0 [Processor Architectures]: General

Keywords: ASIP, Custom instructions, Speculation.

[Figure 1: an example CI containing a load (ld [A], R1) followed by a dependent store (st R2, [R1])]
1. INTRODUCTION
Instruction-set customizations have been proposed in [1, 3, 4, 6, 9, 10] which allow certain patterns of operations to be executed efficiently on CFUs added to the processor pipeline. Integration of CFUs with a superscalar pipeline provides additional opportunities: typical superscalar processors have hardware for speculatively executing instructions and for rolling back and recovering to a correct state when there is mis-speculation. In our work we propose an architecture for integrating CFUs with the superscalar pipeline such that the CFUs can perform memory operations without depending on the compiler to synchronize accesses with the rest of the program.
Because previous approaches restrict the way CFUs can access memory, the size of the computation pattern that can be implemented on a CFU is constrained. The primary problem with allowing CFUs to launch memory operations is to ensure that memory is updated and read in a consistent fashion with respect to other in-flight memory instructions in the superscalar pipeline. In the GARP [13], VEAL [8] and OneChip [7] systems, the compiler needs to insert special instructions to ensure that all preceding memory operations have completed before launching the custom instruction. Systems with architecturally visible storage (AVS) [5, 14] also depend on the compiler for synchronization. In this paper, our goal is to design an architecture such that custom instructions (CIs) with memory operations can execute seamlessly along with other instructions in the processor pipeline without any special synchronization operations. More precisely, we present an architecture for integrating CFUs with the processor pipeline with the following properties: (1) CFUs can launch memory operations to directly access the L1 D-cache of the processor. (2) No synchronization instructions need to be inserted before/after the CI; this greatly reduces the burden on the compiler, especially for applications with operations whose memory addresses cannot be determined beforehand. Our proposed microarchitecture ensures the correct ordering among the different memory operations.
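To make the ordering requirement above concrete, the hazard can be sketched in a few lines of Python. This is a purely illustrative toy model (the class, the sequence-id scheme and all names are ours, not the paper's hardware): a CI memory operation that ignores program order reads stale memory, while one that forwards from older in-flight stores observes the correct value.

```python
# Toy model of the program-order hazard for CI memory operations.
# All names here are illustrative, not the paper's microarchitecture.
class ToyMemorySystem:
    def __init__(self):
        self.memory = {}           # committed memory state
        self.inflight_stores = []  # (sid, addr, value) tuples, not yet committed

    def store(self, sid, addr, value):
        """An ordinary store enters the pipeline but has not committed yet."""
        self.inflight_stores.append((sid, addr, value))

    def load(self, sid, addr, respect_order=True):
        """A load issued by a CI. With program order enforced, it forwards
        from the youngest older in-flight store to the same address."""
        if respect_order:
            older = [s for s in self.inflight_stores
                     if s[0] < sid and s[1] == addr]
            if older:
                return max(older)[2]  # youngest matching older store wins
        return self.memory.get(addr, 0)  # otherwise: stale committed state

mem = ToyMemorySystem()
mem.store(sid=1, addr=0xA0, value=42)                       # in-flight store
stale = mem.load(sid=2, addr=0xA0, respect_order=False)     # unsynchronized CI
correct = mem.load(sid=2, addr=0xA0)                        # order enforced
print(stale, correct)  # 0 42
```

Prior systems avoid the stale read by making the compiler drain the pipeline before the CI; the architecture proposed here enforces the ordering in hardware instead.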
3. CHALLENGES AND OUR PROPOSED SOLUTION FOR SUPPORTING MEMORY OPERATIONS IN CFUS
Note: For a more detailed explanation of our architecture, additional results and analysis, we refer the reader to our technical report [11]. In this section we explain the issues that arise when CFUs connected to an out-of-order (OoO) core are allowed to launch memory operations directly, and we propose modifications to the compilation flow and the underlying microarchitecture of the core to support such CFUs.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. FPGA'13, February 11-13, 2013, Monterey, California, USA. Copyright 2013 ACM 978-1-4503-1887-7/13/02 ...$15.00.
Lifetime of a CI: A CI is essentially a set of operations grouped together by the compiler to be executed as a single instruction. The primary inputs of a CI always come from the registers of the processor; all input registers must be ready before a CI is ready to execute. Once a CI starts executing, it can issue a series of memory operations into the processor's pipeline. Outputs of the CI can be written to the processor's registers or to memory.
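The lifetime just described can be sketched as follows. This is a hedged toy model: the register names, the CI body and the helper functions are invented for illustration and do not correspond to the paper's hardware.

```python
# Toy sketch of a CI's lifetime: issue only when all register inputs are
# ready, then emit memory operations, and finally produce register outputs.
def ci_ready(input_regs, ready_regs):
    """A CI may issue only when every one of its register inputs is ready."""
    return all(r in ready_regs for r in input_regs)

def execute_ci(input_regs, regfile):
    """Illustrative CI body: one load-like mem-op plus one ALU-like op."""
    a, b = (regfile[r] for r in input_regs)
    mem_ops = [("load", a)]       # memory ops issued into the pipeline
    outputs = {"r7": a + b}       # register outputs written on completion
    return mem_ops, outputs

regfile = {"r1": 0x100, "r2": 8}
assert ci_ready(["r1", "r2"], ready_regs={"r1", "r2"})
mem_ops, outputs = execute_ci(["r1", "r2"], regfile)
print(mem_ops, outputs)  # [('load', 256)] {'r7': 264}
```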
3.3

For the CI in Figure 1, assume that the first write operation, to address M1, commits and updates memory. However, after this commit, it is determined that the write operation to M3 fails because of a TLB translation fault. This would leave the memory in an inconsistent state, since the write to address M1 was already committed. We overcome this problem by delaying the commit of all write memory operations launched by the CI until the successful completion of the CI (i.e., no TLB faults). The compiler inserts additional instructions that are executed in case a CI causes a TLB fault.

3.4

Since CIs can span multiple basic blocks in the program, the number of memory operations launched by a CI can vary across executions and need not be deterministic before the CI starts executing. We solve this issue by allocating, at the decode stage, the maximum number of memory operations that the CI can execute. When a particular memory operation is not executed, the CI supplies a dummy address for that operation, effectively turning it into a nop.

4.

Figure 2 shows the basic layout of how our reconfigurable CFU units interact with the rest of the processor pipeline. We briefly explain each component of this interaction below.

4.1 Decode stage

Since the number of ports of the register file (and of other components in the microarchitecture of a superscalar processor, such as the RAT, free list and reservation station) is limited, we split a complex CI into multiple simple operations which we call rops (similar to the micro-ops in x86 processors). Sreg-rops only read from registers, dreg-rops only write to registers, and mem-rops read/write from/to memory. The opcode of the CI is used to determine the number and type (read/write) of mem-rops that the CI will launch. We introduce an SRAM table, called the CFU memop table, to store the mapping between the CI opcode and the number of memory operations launched by the CI. This is a programmable table which is filled when a new application is launched.
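The decode-stage behaviour (splitting a CI into rops via the CFU memop table, and padding unissued memory operations with a dummy address) can be sketched as below. The table layout, rop encoding and sentinel value are invented for illustration; only the idea comes from the text.

```python
# Hedged sketch of CI decode: opcode -> worst-case mem-rop list, with
# unused mem-op slots turned into nops via an (assumed) dummy address.
DUMMY_ADDR = 0xFFFFFFFF  # illustrative sentinel that makes a mem-rop a nop

# "CFU memop table": programmed when an application is launched; maps a
# CI opcode to the number and kind of mem-rops the CI may launch.
cfu_memop_table = {
    0x81: ["load", "store", "store"],  # this CI may issue up to 3 mem-ops
}

def decode_ci(opcode, src_regs, dst_regs):
    """Split a CI into sreg-rops, mem-rops and dreg-rops at decode."""
    rops = [("sreg", r) for r in src_regs]               # read register inputs
    rops += [("mem", kind) for kind in cfu_memop_table[opcode]]
    rops += [("dreg", r) for r in dst_regs]              # write register outputs
    return rops

def pad_mem_ops(actual_addrs, max_ops):
    """At execute time, mem-op slots the CI did not use get the dummy
    address, so they behave as nops in the pipeline."""
    return actual_addrs + [DUMMY_ADDR] * (max_ops - len(actual_addrs))

rops = decode_ci(0x81, src_regs=["r1", "r2"], dst_regs=["r7"])
addrs = pad_mem_ops([0x1000], max_ops=len(cfu_memop_table[0x81]))
print(len(rops), addrs)
```

The padding step is what lets the number of mem-rops be fixed at decode even when the CI's control flow makes the actual count data-dependent.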
4.2 Dispatch stage
The dispatch stage is the last stage in the pipeline where instructions are processed in order (apart from the commit stage). Each instruction is assigned a sequence id, or SID, which represents the program order. The main job of the dispatch stage is to perform resource allocation, i.e., to create entries in the ROB (and assign SIDs) for each instruction/rop, in the reservation station and, for memory instructions/rops, in the LSQ. Since mem-rops get their address (and data) operands directly from the CIs (and not from the register file), each entry in the reservation station needs to be expanded to accommodate this information. We estimate that for a modern superscalar processor with 256 physical registers and a 128-entry ROB, where CIs can have at most 16 mem-rops, the reservation station in the baseline processor would occupy 0.17 mm², while the modified reservation station would occupy 0.21 mm² at the 45 nm node (from McPAT [15]).
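Dispatch-stage resource allocation can be modelled in a few lines. The class and field names below are assumed for illustration; the point is simply that every rop gets a SID and a ROB entry in program order, while mem-rops additionally get LSQ entries whose address/data fields are filled in later by the CFU rather than read from the register file.

```python
# Illustrative model of dispatch: SIDs assigned in program order, ROB
# entries for every rop, LSQ entries only for mem-rops.
from itertools import count

class Dispatcher:
    def __init__(self):
        self.sid = count()
        self.rob = []  # reorder-buffer entries, in program order
        self.lsq = []  # load/store-queue entries, mem-rops only

    def dispatch(self, rop_kind):
        sid = next(self.sid)
        self.rob.append((sid, rop_kind))
        if rop_kind == "mem":
            # Address and data arrive later from the CFU itself, hence the
            # expanded reservation-station/LSQ entry described above.
            self.lsq.append({"sid": sid, "addr": None, "data": None})
        return sid

d = Dispatcher()
for kind in ["sreg", "mem", "mem", "dreg"]:  # the rops of one decoded CI
    d.dispatch(kind)
print(len(d.rob), len(d.lsq))  # 4 2
```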
5. RESULTS

We use the LLVM compiler framework [2] to analyze and extract CIs from the SPEC integer 2006 benchmark suite. Our baseline core is a 2-issue superscalar core with tightly integrated CFUs. Our chip is a tiled CMP system where each tile can be a core or a reconfigurable fabric tile with 2000 slices and 150 DSP blocks (density based on Virtex-6 numbers). We assume a 5-cycle, pipelined link between the FPGA fabric and the core pipeline. We use the AutoESL HLS tool for synthesizing our CFUs and Xilinx XPower for energy numbers for the FPGA. We use Wisconsin's GEMS simulation infrastructure [16] to model the performance of our system and McPAT [15] to estimate the energy of the processor core. Our cores and the CFUs run at different frequencies: the core runs at 2 GHz, while the frequency of the CFU is provided by Xilinx ISE. To keep the interface logic as simple as possible, the CFU is operated at a clock period which is an integer multiple of the CPU clock period. For five of our benchmarks, the CFUs selected by our compiler pass could operate at 200 MHz (1 FPGA cycle = 10 core cycles), while for the other two benchmarks the CFUs operated at 125 MHz (1 FPGA cycle = 16 core cycles).

Table 1 shows the performance when the CFUs are pipelined. The initiation interval of the pipelining varies between 1 and 3 FPGA cycles (as determined by AutoPilot). With pipelining, we begin to see significant performance improvements, an average of around 14%. The key point in our approach is to compare the performance of a 2-issue core augmented with CFUs (column 2) with a 4-issue core (column 5): our architecture, using a 2-issue core with CFUs, can beat the performance of a 4-issue core. For benchmarks with significant ILP (libquantum, hmmer, sjeng, h264), the speed-up is reasonable. Benchmarks such as mcf, which have a large working set, see very little improvement, mainly because they are limited by cache misses.

Table 1: Normalized performance (#cycles elapsed) with pipelined CFUs on FPGAs

                 2-issue/128 entries   2-issue/256 entries   4-issue/128 entries   4-issue/256 entries
                 baseline   CFU        baseline   CFU        baseline   CFU        baseline   CFU
bzip2            1.000      0.924      0.999      0.924      0.958      0.885      0.956      0.884
libquantum       1.000      0.703      1.000      0.703      0.653      0.459      0.653      0.459
hmmer            1.000      0.781      1.000      0.781      0.775      0.605      0.775      0.605
mcf              1.000      0.938      1.000      0.938      0.973      0.913      0.973      0.913
gobmk            1.000      0.940      0.999      0.940      0.982      0.924      0.981      0.923
h264             1.000      0.802      0.998      0.801      0.765      0.613      0.763      0.612
sjeng            1.000      0.899      0.999      0.898      0.895      0.805      0.894      0.804
Average          1.000      0.855      0.999      0.855      0.857      0.743      0.857      0.743
Improvement(%)              14.460                14.461                13.277                13.278

Table 2 shows the energy consumption (normalized to the 2-issue core). Having CFUs provides significant energy savings: on average, we see a 24% energy reduction. Of the total energy savings, 41% comes from the reduced number of accesses to the I-cache, instruction buffer and decode logic, and 32% comes from the reduced energy consumption of the ALUs (since many arithmetic operations are now performed in the FPGA) and register files. The remaining 27% is distributed among the reservation station, rename logic and ROB.

Table 2: Normalized total energy consumption with pipelined CFUs on FPGAs

                 2-issue/128 entries   2-issue/256 entries   4-issue/128 entries   4-issue/256 entries
                 baseline   CFU        baseline   CFU        baseline   CFU        baseline   CFU
bzip2            1.000      0.736      1.046      0.806      1.367      1.027      1.452      1.098
libquantum       1.000      0.746      1.058      0.809      1.044      0.787      1.141      0.825
hmmer            1.000      0.726      1.094      0.816      1.079      0.798      1.291      1.007
mcf              1.000      0.748      1.060      0.836      1.035      0.800      1.123      0.876
gobmk            1.000      0.734      1.011      0.788      1.391      1.060      1.412      1.102
h264             1.000      0.782      1.051      0.757      1.137      0.881      1.224      0.923
sjeng            1.000      0.731      1.029      0.769      1.290      0.992      1.343      1.001
Average          1.000      0.743      1.050      0.797      1.192      0.907      1.284      0.976
Improvement(%)   0.000      25.672     0.000      24.075     0.000      23.942     0.000      23.948

6. CONCLUSIONS

In this paper we present an architecture by which CIs can launch memory operations and execute speculatively when integrated with a superscalar processor pipeline. Our approach does not need any synchronization or detailed memory analysis by the compiler. Our experiments show that, even for pointer-heavy benchmarks, our approach can provide an average of 24% energy savings and 14% performance improvement over software-only implementations.

7. ACKNOWLEDGMENT

The authors acknowledge the support of the GSRC, one of six research centers funded under the FCRP, an SRC entity, and of Xilinx.

8. REFERENCES

[1] Altera NIOS-II processor. http://www.altera.com/devices/processor/nios2/ni2-index.html.
[2] The LLVM compilation infrastructure. http://llvm.org.
[3] Xtensa customizable processor. http://www.tensilica.com/products/xtensa-customizable.
[4] K. Atasu, C. Ozturan, G. Dundar, O. Mencer, and W. Luk. CHIPS: Custom hardware instruction processor synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(3):528-541, March 2008.
[5] P. Biswas, N. D. Dutt, L. Pozzi, and P. Ienne. Introduction of architecturally visible storage in instruction set extensions. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(3):435-446, March 2007.
[6] P. Brisk, A. Kaplan, and M. Sarrafzadeh. Area-efficient instruction set synthesis for reconfigurable system-on-chip designs. In Proceedings of the 41st Design Automation Conference (DAC '04), pages 395-400, July 2004.
[7] J. E. Carrillo and P. Chow. The effect of reconfigurable units in superscalar processors. In Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays (FPGA '01), pages 141-150, New York, NY, USA, 2001. ACM.
[8] N. Clark, A. Hormati, and S. Mahlke. VEAL: Virtualized execution accelerator for loops. In Proceedings of the 35th International Symposium on Computer Architecture (ISCA '08), pages 389-400, June 2008.
[9] N. Clark, M. Kudlur, H. Park, S. Mahlke, and K. Flautner. Application-specific processing on a general-purpose core via transparent instruction set customization. In Proceedings of the 37th International Symposium on Microarchitecture (MICRO-37), pages 30-40, December 2004.
[10] J. Cong, Y. Fan, G. Han, and Z. Zhang. Application-specific instruction generation for configurable processor architectures. In Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays (FPGA '04), pages 183-189, New York, NY, USA, 2004. ACM.
[11] J. Cong and K. Gururaj. Architecture support for custom instructions with memory operations. Technical Report 120019, University of California, Los Angeles, November 2012.
[12] Q. Dinh, D. Chen, and M. D. F. Wong. Efficient ASIP design for configurable processors with fine-grained resource sharing. In Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays (FPGA '08), pages 99-106, New York, NY, USA, 2008. ACM.
[13] J. Hauser and J. Wawrzynek. Garp: a MIPS processor with a reconfigurable coprocessor. In Proceedings of the 5th Annual IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '97), pages 12-21, April 1997.
[14] T. Kluter, S. Burri, P. Brisk, E. Charbon, and P. Ienne. Virtual ways: Efficient coherence for architecturally visible storage in automatic instruction set extensions. In HiPEAC, volume 5952 of Lecture Notes in Computer Science, pages 126-140. Springer, 2010.
[15] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42), pages 469-480, New York, NY, USA, 2009. ACM.
[16] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Computer Architecture News, 33:92-99, November 2005.
[17] L. Pozzi and P. Ienne. Exploiting pipelining to relax register-file port constraints of instruction-set extensions. In Proceedings of the 2005 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES '05), pages 2-10, New York, NY, USA, 2005. ACM.
[18] Z. Ye, A. Moshovos, S. Hauck, and P. Banerjee. CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit. In Proceedings of the 27th International Symposium on Computer Architecture (ISCA '00), pages 225-235, June 2000.