
Architecture Support for Custom Instructions with Memory Operations

Jason Cong cong@cs.ucla.edu Karthik Gururaj karthikg@cs.ucla.edu

Department of Computer Science, University of California Los Angeles

ABSTRACT
Customized instructions (CIs) implemented using custom functional units (CFUs) have been proposed as a way of improving the performance and energy efficiency of software while minimizing the cost of designing and verifying accelerators from scratch. However, previous work allows CIs to communicate with the processor only through registers, or with limited memory operations. In this work we propose an architecture that allows CIs to seamlessly execute memory operations without any special synchronization operations to guarantee the program order of instructions. Our results show that our architecture can provide 24% energy savings with a 14% performance improvement for 2-issue and 4-issue superscalar processor cores.

Categories and Subject Descriptors: C.1.0 [Processor Architectures]: General

Keywords: ASIP, Custom instructions, Speculation.

1  ld [A], R1
2  st R2, [R1]
3  CI

Figure 1: Memory ordering example

1. INTRODUCTION
Instruction-set customizations have been proposed in [1, 3, 6, 4, 10, 9] that allow certain patterns of operations to be executed efficiently on CFUs added to the processor pipeline. Integration of CFUs with a superscalar pipeline provides additional opportunities: typical superscalar processors have hardware for speculatively executing instructions and for rolling back and recovering to a correct state when there is mis-speculation. In our work we propose an architecture for integrating CFUs with the superscalar pipeline such that the CFUs can perform memory operations without depending on the compiler to synchronize accesses with the rest of the program.

2. RELATED WORK AND OUR CONTRIBUTIONS


In [18, 4, 10, 9, 17, 12], the CFUs read (write) their inputs (outputs) directly from (to) the register file of the processor and cannot access memory. However, since the CFU cannot access memory, the size of the computation pattern that can be implemented on a CFU is constrained. The primary problem with allowing CFUs to launch memory operations is to ensure that memory is updated and read in a consistent fashion with respect to other in-flight memory instructions in the superscalar pipeline. In the GARP [13], VEAL [8] and OneChip [7] systems, the compiler needs to insert special instructions to ensure that all preceding memory operations have completed before launching the custom instruction. Systems with architecturally visible storage (AVS) [5, 14] also depend on the compiler for synchronization. In this paper, our goal is to design an architecture such that custom instructions (CIs) with memory operations can execute seamlessly along with other instructions in the processor pipeline without any special synchronization operations. More precisely, we present an architecture for integrating CFUs with the processor pipeline with the following properties: (1) CFUs can launch memory operations to directly access the L1 D-cache of the processor. (2) No synchronization instructions need to be inserted before/after the CI; this greatly reduces the burden on the compiler, especially for applications with operations whose memory address cannot be determined beforehand. Our proposed microarchitecture ensures the correct ordering among the different memory operations.

3. CHALLENGES AND OUR PROPOSED SOLUTION FOR SUPPORTING MEMORY OPERATIONS IN CFUS
Note: For a more detailed explanation of our architecture, additional results and analysis, we refer the reader to our technical report [11]. In this section we explain the issues that arise when CFUs connected to an out-of-order (OoO) core are allowed to launch memory operations directly, and propose modifications in the compilation flow and the underlying microarchitecture of the core to support such CFUs.



Lifetime of a CI: A CI is essentially a set of operations grouped together by the compiler to be executed as a single instruction. The primary inputs for a CI always come from the registers of the processor; all input registers must be ready before a CI is ready to execute. Once a CI starts executing, it can issue a series of memory operations into the processor's pipeline. Outputs of the CI can write to the processor's registers or memory.

3.1 Issue 1: Maintaining program order for memory operations


Consider the simple example in Figure 1. The CI to be executed launches three memory operations: a read from location M2 and writes to M1 and M3. In previous work [7, 14], the CI must at least wait until the addresses of all preceding memory operations are computed before proceeding; this is enforced by inserting synchronization instructions before the CI. We overcome this limitation by modifying the core's microarchitecture. The key difference of our approach is that we make the memory operations launched by a CI visible to the OoO processor's pipeline in program order. In the decode stage of the pipeline, in which program order is still maintained, the CI in Figure 1 launches three memory operations (which we call mem-rops) into the dispatch stage. In the dispatch stage, each operation is assigned a tag representing its program order. The OoO pipeline assigns one LSQ entry for the store and three LSQ entries for the CI. Even if the store instruction's address is computed after the CI begins execution, the LSQ will check whether any successive operations have an address conflict. In the case of a conflict, the OoO pipeline ensures that the CI is squashed and a pipeline flush occurs.
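To make the ordering mechanism concrete, here is a minimal sketch (not the authors' hardware, and in software rather than RTL) of an LSQ model in which CI mem-rops receive sequence IDs (SIDs) at in-order dispatch alongside ordinary loads and stores, and a store whose address resolves late checks all younger, already-executed entries for an address conflict. All class and field names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class LSQEntry:
    sid: int                 # sequence id = program order, assigned at dispatch
    kind: str                # "load" or "store"
    owner: str               # "core" for normal ops, "CI" for CI mem-rops
    addr: int | None = None  # None until the address is computed
    done: bool = False       # has executed speculatively

class LSQ:
    def __init__(self):
        self.entries: list[LSQEntry] = []

    def dispatch(self, entry: LSQEntry):
        # Dispatch is in order, so entries are naturally sorted by SID.
        self.entries.append(entry)

    def resolve_store_addr(self, sid: int, addr: int) -> list[int]:
        """Late address resolution for a store: return the SIDs of younger
        already-executed operations on the same address (they, and hence the
        owning CI, must be squashed and the pipeline flushed)."""
        store = next(e for e in self.entries if e.sid == sid)
        store.addr = addr
        return [e.sid for e in self.entries
                if e.sid > sid and e.done and e.addr == addr]

# The CI of Figure 1 occupies three LSQ slots, dispatched in program order.
lsq = LSQ()
lsq.dispatch(LSQEntry(sid=1, kind="load",  owner="core", addr=0xA, done=True))
lsq.dispatch(LSQEntry(sid=2, kind="store", owner="core"))          # st R2, [R1]
lsq.dispatch(LSQEntry(sid=3, kind="store", owner="CI", addr=0x10, done=True))
lsq.dispatch(LSQEntry(sid=4, kind="load",  owner="CI", addr=0x20, done=True))
lsq.dispatch(LSQEntry(sid=5, kind="store", owner="CI", addr=0x30, done=True))

# The store's address arrives after the CI began executing, and hits M1:
victims = lsq.resolve_store_addr(sid=2, addr=0x10)
print("squash CI mem-rops with SIDs:", victims)   # -> [3]
```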


3.2 Issue 2: Ordering of memory operations within a CI


For the CI in Figure 1, assume that the first write operation (to address M1) and the read operation (from M2) overlap/conflict. For normal memory instructions, if the read executes before the write, the read will be squashed and re-executed in the OoO pipeline. However, the instruction stream of the program contains only the CI, not the individual memory operations; re-execution would need to begin from the CI, and the same conflict would occur again. To overcome this problem, we place a constraint during the CI compilation phase: the compiler cannot cluster a memory operation into a CI if there is a preceding memory write operation within the CI which may cause a conflict. The compiler uses alias analysis (possibly in a conservative way) to satisfy this constraint. Memory dependences between different CIs are handled as described in Section 3.1.
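A sketch of this clustering constraint follows, assuming a conservative may_alias() stand-in; a real implementation would query the compiler's alias analysis (the authors use LLVM [2]). The function and the symbolic memory references are illustrative, not taken from the paper.

```python
def may_alias(ref_a: str, ref_b: str) -> bool:
    # Conservative stand-in: two distinct stack locals are known not to
    # alias; anything else is assumed to possibly conflict.
    return not (ref_a.startswith("local_") and ref_b.startswith("local_")
                and ref_a != ref_b)

def cluster_ci(ops):
    """ops: list of (opcode, mem_ref or None, is_write), in program order.
    Returns the prefix admitted into the CI; stops clustering at a memory
    operation that may conflict with an earlier write already in the CI,
    since replaying the CI would re-trigger the same conflict."""
    ci, pending_writes = [], []
    for opcode, ref, is_write in ops:
        if ref is not None and any(may_alias(ref, w) for w in pending_writes):
            break
        ci.append((opcode, ref, is_write))
        if ref is not None and is_write:
            pending_writes.append(ref)
    return ci

ops = [("st", "local_a", True),    # write M1
       ("ld", "ptr_q", False),     # read M2: may alias the earlier write
       ("st", "local_c", True)]    # write M3
print(cluster_ci(ops))   # only the first store is clustered into the CI
```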
3.3 Issue 3: Possible partial commit to memory

For the CI in Figure 1, assume that the first write operation, to address M1, commits and updates memory, but that the later write operation to M3 then fails because of a TLB translation fault. This would leave memory in an inconsistent state, since the write to address M1 has already been committed. We overcome this problem by delaying the commit of all write memory operations launched by the CI until the successful completion of the CI (no TLB faults). The compiler inserts additional instructions that are executed in case a CI causes a TLB fault.
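The delayed-commit rule can be sketched as a small buffer that holds a CI's writes until the whole CI retires cleanly, and discards them on a fault. This is a minimal software model under assumed names, not the authors' design.

```python
class CIWriteBuffer:
    def __init__(self):
        self.pending = {}                 # ci_id -> list of (addr, value)

    def stage_write(self, ci_id, addr, value):
        # A completed write mem-rop is staged, not yet visible in memory.
        self.pending.setdefault(ci_id, []).append((addr, value))

    def retire_ci(self, ci_id, memory, faulted: bool):
        writes = self.pending.pop(ci_id, [])
        if faulted:
            return []                     # nothing committed: memory stays consistent
        for addr, value in writes:        # drain only at successful CI retirement
            memory[addr] = value
        return writes

memory = {}
buf = CIWriteBuffer()
buf.stage_write(ci_id=7, addr=0x10, value=1)   # write to M1
buf.stage_write(ci_id=7, addr=0x30, value=2)   # write to M3 (faults later)
buf.retire_ci(7, memory, faulted=True)
print(memory)   # {} -- the early write to M1 never reached memory
```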
3.4 Issue 4: Handling a variable number of memory operations

Since CIs can span multiple basic blocks in the program, the number of memory operations launched by a CI can vary across executions and need not be known before the CI starts executing. We solve this issue by launching, during the decode stage, the maximum number of memory operations that the CI can execute. When a particular memory operation is not executed, the CI supplies a dummy address for it, effectively turning it into a nop.
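A short sketch of the padding rule, where DUMMY_ADDR, the opcode name and the per-opcode maximum are illustrative assumptions:

```python
DUMMY_ADDR = 0xFFFF_FFFF           # reserved address: the mem-rop becomes a nop
MAX_MEMROPS = {"ci_histogram": 4}  # per-opcode maximum, known at decode time

def launch_memrops(opcode, actual_ops):
    """actual_ops: (kind, addr) pairs the CI really performs on this run.
    Decode always allocates the maximum; unused slots are nop-ed."""
    slots = list(actual_ops)
    while len(slots) < MAX_MEMROPS[opcode]:
        slots.append(("nop", DUMMY_ADDR))
    return slots

# This execution takes a branch that skips two of the four possible accesses:
print(launch_memrops("ci_histogram", [("ld", 0x100), ("st", 0x104)]))
# [('ld', 256), ('st', 260), ('nop', 4294967295), ('nop', 4294967295)]
```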

4. DETAILS OF PROPOSED ARCHITECTURE

Figure 2: Layout of the processor pipeline with tightly integrated CFU

Figure 2 shows the basic layout of how our reconfigurable CFU units interact with the rest of the processor pipeline. We briefly explain each component of this interaction below.

4.1 Decode stage


Since the number of ports of the register file (and of other components in the microarchitecture of a superscalar processor, such as the RAT, free list and reservation station) is limited, we split a complex CI into multiple simple operations which we call rops (similar to the micro-ops in x86 processors). Sreg-rops only read from registers, dreg-rops only write to registers, and mem-rops read/write from/to memory. The opcode of the CI determines the number and type (read/write) of mem-rops that the CI will launch. We introduce an SRAM table, called the CFU memop table, to store the mapping between the CI opcode and the number of memory operations launched by the CI. This is a programmable table which is filled when a new application is launched.
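As a concrete illustration, here is a sketch of the decode-stage lookup into the CFU memop table. The entry layout (rop counts and mem-rop kinds) and the opcode value are assumptions for illustration; only the table's role, mapping a CI opcode to the rops the decoder emits, comes from the text.

```python
CFU_MEMOP_TABLE = {
    # opcode: (num sreg-rops, num dreg-rops, list of mem-rop kinds)
    # The entry below models the CI of Figure 1: write M1, read M2, write M3.
    0x40: (2, 1, ["store", "load", "store"]),
}

def decode_ci(opcode):
    """Expand a CI into its rops, emitted into dispatch in program order."""
    sregs, dregs, mem_kinds = CFU_MEMOP_TABLE[opcode]
    return ([("sreg-rop", i) for i in range(sregs)] +
            [("mem-rop", kind) for kind in mem_kinds] +
            [("dreg-rop", i) for i in range(dregs)])

for rop in decode_ci(0x40):
    print(rop)
```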

4.2 Dispatch stage

The dispatch stage is the last stage in the pipeline where instructions are processed in order (apart from the commit stage). Each instruction is assigned a sequence id, or SID, which represents the program order. The main job of the dispatch stage is to perform resource allocation, i.e., creating entries (and assigning SIDs) in the ROB for each instruction/rop, in the reservation station, and in the LSQ for memory instructions/rops. Since mem-rops get their address (and data) operands directly from the CIs (and not from the register file), each entry in the reservation station needs to be expanded to accommodate this information. We estimate that for a modern superscalar processor with 256 physical registers and a 128-entry ROB, where CIs can have at most 16 mem-rops, the area of the reservation station in the baseline processor would be 0.17 mm², while the modified reservation station would occupy 0.21 mm² at the 45 nm node (from McPAT [15]).
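A minimal model of this in-order allocation, with a widened reservation-station entry carrying a CFU tag so a mem-rop can later receive its address over the bypass path; the structures and field names are illustrative, not the authors' RTL.

```python
from itertools import count

class Dispatcher:
    def __init__(self):
        self._sid = count()                   # monotonically increasing SIDs
        self.rob, self.rs, self.lsq = [], [], []

    def dispatch(self, rop):
        sid = next(self._sid)                 # SID encodes program order
        self.rob.append(sid)                  # every rop gets an ROB entry
        self.rs.append({"sid": sid, "rop": rop,
                        # widened RS field: which CFU will forward the
                        # address/data instead of the register file
                        "cfu_tag": rop.get("cfu")})
        if rop["kind"] == "mem-rop":          # memory rops also occupy the LSQ
            self.lsq.append(sid)
        return sid

d = Dispatcher()
for r in [{"kind": "mem-rop", "cfu": 0}, {"kind": "dreg-rop", "cfu": 0}]:
    d.dispatch(r)
print(d.rob, d.lsq)   # [0, 1] [0]
```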

4.3 Scheduler and execute stage


A CI is ready to execute when all the sreg-rops issued by it are ready. The scheduler decides which CFU to assign to a particular CI. Once the CFU has obtained all its register operands, it begins execution. When a memory operation is encountered, the CFU computes the address (and possibly the data) for the operation and sends it over the bypass path to the mem-rop waiting in the RS. The mem-rop, using the address obtained from the CFU, proceeds with the memory operation in a manner identical to load/store instructions. Assuming there are no conflicts in the LSQ and no TLB misses, the memory operation completes and waits in the ROB for retirement. If the mem-rop is a memory read, the read value is forwarded back to the CFU over the bypass paths. The CFU completes execution and forwards its results to the dreg-rops waiting in the RS, after which the dreg-rops wait in the ROB for retirement.
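The CFU/mem-rop handshake can be sketched as a small bypass network keyed by CFU tag and mem-rop slot: the CFU forwards an address (and store data), and loads return their value over the same paths. This is a behavioral sketch under assumed names, not the actual bypass hardware.

```python
class Bypass:
    def __init__(self):
        self.wires = {}                        # (cfu_tag, slot) -> payload

    def forward(self, cfu_tag, slot, payload):
        self.wires[(cfu_tag, slot)] = payload

    def poll(self, cfu_tag, slot):
        return self.wires.pop((cfu_tag, slot), None)

def execute_memrop(bypass, cfu_tag, slot, dcache):
    op = bypass.poll(cfu_tag, slot)            # wait for the CFU's address
    if op is None:
        return None                            # still waiting in the RS
    if op["kind"] == "load":
        value = dcache.get(op["addr"], 0)
        bypass.forward(cfu_tag, ("ld", slot), value)   # value back to the CFU
        return value
    dcache[op["addr"]] = op["data"]            # store completes, awaits retirement
    return op["data"]

dcache, bp = {0x20: 42}, Bypass()
bp.forward(cfu_tag=0, slot=1, payload={"kind": "load", "addr": 0x20})
print(execute_memrop(bp, 0, 1, dcache))        # 42
```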

4.4 Retire stage

Unlike ordinary store instructions, the memory-write mem-rops launched by a CI are allowed to update memory only after all rops launched by that CI have retired, for the reasons explained in Section 3.3.

5. RESULTS

We use the LLVM compiler framework [2] to analyze and extract CIs from the SPEC integer 2006 benchmark suite. Our baseline core is a 2-issue superscalar core with tightly integrated CFUs. Our chip is a tiled CMP system where each tile can be a core or a reconfigurable fabric tile with 2000 slices and 150 DSP blocks (density based on Virtex-6 numbers). We assume a 5-cycle, pipelined link between the FPGA fabric and the core pipeline. We use the AutoESL HLS tool to synthesize our CFUs and Xilinx XPower for FPGA energy numbers. We use Wisconsin's GEMS simulation infrastructure [16] to model the performance of our system and McPAT [15] to estimate the energy of the processor core. The core and the CFUs run at different frequencies: the core runs at 2 GHz, while the frequency of each CFU is reported by Xilinx ISE. To keep the interface logic as simple as possible, the CFU is operated at a clock period that is an integer multiple of the CPU clock period. For five of our benchmarks, the CFUs selected by our compiler pass could operate at 200 MHz (1 FPGA cycle = 10 core cycles), while for the other two benchmarks the CFUs operated at 125 MHz (1 FPGA cycle = 16 core cycles).

Table 1 shows the performance when the CFUs are pipelined. The initiation interval of the pipeline varies between 1 and 3 FPGA cycles (as determined by AutoPilot). With pipelining, we begin to see significant performance improvements, averaging around 14%. The key comparison for our approach is between a 2-issue core augmented with CFUs (column 2) and a 4-issue core (column 5): with CFUs, a 2-issue core can beat the performance of a 4-issue core. For benchmarks with significant ILP (libquantum, hmmer, sjeng, h264), the speed-up is considerable. Benchmarks such as mcf, which have a large working set, see very little improvement, mainly because they are limited by cache misses.

Table 2 shows the energy consumption (normalized to the 2-issue core). Here we see that adding CFUs provides significant energy savings: on average, a 24% reduction. Of the total savings, 41% comes from the reduced number of accesses to the I-cache, instruction buffer and decode logic; 32% comes from the reduced energy consumption of the ALUs (since many arithmetic operations are now performed in the FPGA) and register files; and the remaining 27% is distributed among the reservation station, rename logic and ROB.





Table 1: Normalized performance (#cycles elapsed) with pipelined CFUs on FPGAs

                2-issue/128 entries   2-issue/256 entries   4-issue/128 entries   4-issue/256 entries
                baseline    CFU       baseline    CFU       baseline    CFU       baseline    CFU
bzip2           1.000       0.924     0.999       0.924     0.958       0.885     0.956       0.884
libquantum      1.000       0.703     1.000       0.703     0.653       0.459     0.653       0.459
hmmer           1.000       0.781     1.000       0.781     0.775       0.605     0.775       0.605
mcf             1.000       0.938     1.000       0.938     0.973       0.913     0.973       0.913
gobmk           1.000       0.940     0.999       0.940     0.982       0.924     0.981       0.923
h264            1.000       0.802     0.998       0.801     0.765       0.613     0.763       0.612
sjeng           1.000       0.899     0.999       0.898     0.895       0.805     0.894       0.804
Average         1.000       0.855     0.999       0.855     0.857       0.743     0.857       0.743
Improvement(%)              14.460                14.461                13.277                13.278

Table 2: Normalized total energy consumption with pipelined CFUs on FPGAs

                2-issue/128 entries   2-issue/256 entries   4-issue/128 entries   4-issue/256 entries
                baseline    CFU       baseline    CFU       baseline    CFU       baseline    CFU
bzip2           1.000       0.736     1.046       0.806     1.367       1.027     1.452       1.098
libquantum      1.000       0.746     1.058       0.809     1.044       0.787     1.141       0.825
hmmer           1.000       0.726     1.094       0.816     1.079       0.798     1.291       1.007
mcf             1.000       0.748     1.060       0.836     1.035       0.800     1.123       0.876
gobmk           1.000       0.734     1.011       0.788     1.391       1.060     1.412       1.102
h264            1.000       0.782     1.051       0.757     1.137       0.881     1.224       0.923
sjeng           1.000       0.731     1.029       0.769     1.290       0.992     1.343       1.001
Average         1.000       0.743     1.050       0.797     1.192       0.907     1.284       0.976
Improvement(%)  0.000       25.672    0.000       24.075    0.000       23.942    0.000       23.948

6. CONCLUSIONS

In this paper we present an architecture by which CIs can launch memory operations and execute speculatively when integrated with a superscalar processor pipeline. Our approach does not need any synchronization or detailed memory analysis by the compiler. Our experiments show that even for pointer-heavy benchmarks, our approach can provide an average of 24% energy savings and 14% performance improvement over software-only implementations.

7. ACKNOWLEDGMENT

The authors acknowledge the support of the GSRC, one of six research centers funded under the FCRP, an SRC entity, and Xilinx.

8. REFERENCES

[1] Altera NIOS-II processor. http://www.altera.com/devices/processor/nios2/ni2-index.html.
[2] The LLVM compilation infrastructure. http://llvm.org.
[3] Xtensa customizable processor. http://www.tensilica.com/products/xtensa-customizable.
[4] K. Atasu, C. Ozturan, G. Dundar, O. Mencer, and W. Luk. CHIPS: Custom hardware instruction processor synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(3):528-541, March 2008.
[5] P. Biswas, N. D. Dutt, L. Pozzi, and P. Ienne. Introduction of architecturally visible storage in instruction set extensions. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(3):435-446, March 2007.
[6] P. Brisk, A. Kaplan, and M. Sarrafzadeh. Area-efficient instruction set synthesis for reconfigurable system-on-chip designs. In Proceedings of the 41st Design Automation Conference, pages 395-400, July 2004.
[7] J. E. Carrillo and P. Chow. The effect of reconfigurable units in superscalar processors. In Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays, FPGA '01, pages 141-150, New York, NY, USA, 2001. ACM.
[8] N. Clark, A. Hormati, and S. Mahlke. VEAL: Virtualized execution accelerator for loops. In Proceedings of the 35th International Symposium on Computer Architecture, ISCA '08, pages 389-400, June 2008.
[9] N. Clark, M. Kudlur, H. Park, S. Mahlke, and K. Flautner. Application-specific processing on a general-purpose core via transparent instruction set customization. In Proceedings of the 37th International Symposium on Microarchitecture, MICRO-37, pages 30-40, December 2004.
[10] J. Cong, Y. Fan, G. Han, and Z. Zhang. Application-specific instruction generation for configurable processor architectures. In Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, FPGA '04, pages 183-189, New York, NY, USA, 2004. ACM.
[11] J. Cong and K. Gururaj. Architecture support for custom instructions with memory operations. Technical Report 120019, University of California, Los Angeles, November 2012.
[12] Q. Dinh, D. Chen, and M. D. F. Wong. Efficient ASIP design for configurable processors with fine-grained resource sharing. In Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, FPGA '08, pages 99-106, New York, NY, USA, 2008. ACM.
[13] J. Hauser and J. Wawrzynek. Garp: A MIPS processor with a reconfigurable coprocessor. In Proceedings of the 5th Annual IEEE Symposium on FPGAs for Custom Computing Machines, pages 12-21, April 1997.
[14] T. Kluter, S. Burri, P. Brisk, E. Charbon, and P. Ienne. Virtual ways: Efficient coherence for architecturally visible storage in automatic instruction set extensions. In HiPEAC, volume 5952 of Lecture Notes in Computer Science, pages 126-140. Springer, 2010.
[15] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 469-480, New York, NY, USA, 2009. ACM.
[16] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Computer Architecture News, 33:92-99, November 2005.
[17] L. Pozzi and P. Ienne. Exploiting pipelining to relax register-file port constraints of instruction-set extensions. In Proceedings of the 2005 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES '05, pages 2-10, New York, NY, USA, 2005. ACM.
[18] Z. Ye, A. Moshovos, S. Hauck, and P. Banerjee. CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit. In Proceedings of the 27th International Symposium on Computer Architecture, pages 225-235, June 2000.

