
IEEE TRANSACTIONS ON COMPUTERS, VOL. 62, NO. 3, MARCH 2013

NoC-Based FPGA Acceleration for Monte Carlo Simulations with Applications to SPECT Imaging
Phillip J. Kinsman, Student Member, IEEE, and Nicola Nicolici, Senior Member, IEEE
Abstract: As the number of transistors that are integrated onto a silicon die continues to increase, compute power is becoming a commodity. This has enabled a whole host of new applications that rely on high-throughput computations. Recently, the need for faster and cost-effective applications in form-factor-constrained environments has driven an interest in on-chip acceleration of algorithms based on Monte Carlo simulations. Though Field-Programmable Gate Arrays (FPGAs), with hundreds of on-chip arithmetic units, show significant promise for accelerating these embarrassingly parallel simulations, a challenge exists in sharing access to simulation data among many concurrent experiments. This paper presents a compute architecture for accelerating Monte Carlo simulations based on the Network-on-Chip (NoC) paradigm for on-chip communication. We demonstrate, through the complete implementation of a Monte Carlo-based image reconstruction algorithm for Single-Photon Emission Computed Tomography (SPECT) imaging, that this complex problem can be accelerated by two orders of magnitude on even a modestly sized FPGA over a 2 GHz Intel Core 2 Duo processor. The architecture and the methodology that we present in this paper are modular and hence scalable to problem instances of different sizes, with application to other domains that rely on Monte Carlo simulations.

Index Terms: Network-on-chip (NoC), field-programmable gate array (FPGA), Monte Carlo (MC) simulation, nuclear medical imaging, single-photon emission computed tomography (SPECT)

The authors are with the Department of Electrical and Computer Engineering, McMaster University, 1280 Main Street West, Hamilton, ON L8S 4L8, Canada. E-mail: kinsmapj@grads.ece.mcmaster.ca, nicola@ece.mcmaster.ca. Manuscript received 25 May 2011; revised 1 Dec. 2011; accepted 6 Dec. 2011; published online 22 Dec. 2011. For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number TC-2011-05-0344. Digital Object Identifier no. 10.1109/TC.2011.250.

1 INTRODUCTION
SINGLE-PHOTON Emission Computed Tomography (SPECT) is a medical imaging modality used clinically in a number of diagnostic applications, including detection of cardiac pathology, various forms of cancer, and certain degenerative brain diseases. Consequently, timely and accurate reconstruction of SPECT images is of critical importance. Although computationally efficient analytical solutions to this extremely complex problem do exist, they are highly susceptible to noise [15], and since image quality has direct implications for patient care, statistical reconstruction methods are typically favored despite their relatively long runtimes [4], [12]. The last decade has seen the development of numerous variance reduction techniques (VRTs) [4], [7], [18], which accelerate these statistical reconstruction methods by optimizing the algorithms of the Monte Carlo (MC) simulations at their core. These have been quite successful in bringing image reconstruction time into a reasonable range for relatively small images. For example, the work from [4] demonstrates the reconstruction of a 64 × 64 × 64 image on a dual-core processor in approximately half an hour. However, because of the underlying structure of MC simulations, we propose that by investigating solutions in parallel computing, images of higher resolution can be reconstructed without paying a very costly time premium. Naturally, such a parallel solution could also exploit these VRTs.

That MC simulations are good candidates for parallelization is quite intuitive, as fundamentally they are comprised of huge numbers of independent experiments. This is, in fact, confirmed by the authors of [8], wherein they demonstrate that nearly linear speedup in the number of processors can be achieved for this application. Although they demonstrate a 32× speedup with a computing grid, this approach quickly becomes prohibitively expensive in terms of compute resources and energy consumption, and suffers from form-factor requirements that are undesirable for a clinical setting. This is indeed the motivation for pursuing acceleration on massively parallel, yet integrated, platforms in reduced form factors.

There exists already a large body of work on accelerating MC simulations on both field-programmable gate arrays (FPGAs) [9], [14], [21], [25], [30], [31], [33], [36] and graphics processing units (GPUs) [3], [11], [13], [32], [35], [37]. A graphical depiction of the relationship of these works is given in Fig. 1. Though other examples of accelerating MC simulations through cluster computing do exist, the work in [8] was chosen because it considers the same application discussed here. Interestingly, though the specific case studies are distinct, two of the GPU implementations [3], [32] are algorithmically similar to the application presented here. In both cases, the results presented are somewhat modest in comparison to the number of cores when contrasted with the results presented in Section 4 of this work. In order to understand the reason for this, it is necessary to investigate the application in further depth. This application falls into a class of Monte Carlo simulations in which all the experiments



Fig. 1. Current approaches to MC acceleration.

share data from a common data set which is sufficiently large to prohibit complete reproduction for each processing node. Furthermore, in such experiments, the data access patterns are not known a priori. Other examples from this class include weather, environmental, and risk evaluation simulations. GPUs are known to provide substantial acceleration benefits to arithmetically intense algorithms that have structured data accesses and limited branching [16], [26]. Therefore, as suggested by the authors of [3] and [32], GPU implementations of this kind of simulation suffer from the frequency and randomness with which individual experiments retrieve data from memory. In contrast, the increased development effort for FPGAs repays the designer with complete flexibility in how memory is distributed among the processing engines. This, in combination with the ability to craft application-specific on-chip communication architectures, suggests FPGAs as the best platform for developing a scalable framework for running hundreds of concurrent experiments without data starvation.

It is at this point, then, that we revisit the previously mentioned FPGA-based attempts at parallelizing Monte Carlo simulations. Of these, most do not fall into the category of shared data set Monte Carlo and can consequently be parallelized with no consideration given to inter-experiment communication [14], [30], [31], [33], [36]. However, the remaining works merit a more in-depth treatment, as they share fundamental algorithmic similarities with the application discussed here. One of the earliest works to consider the use of FPGAs to accelerate Monte Carlo simulations implemented a simplified version of the radiation transport problem [25]. In this case, the authors worked with a single point source in an infinite medium of aluminum. Naturally, this completely circumvented the issue of storing a density map and, consequently, this work should not be categorized with the large shared data set simulations considered here. The authors of [9] consider a Monte Carlo simulation for dose calculation in radiotherapy treatment. They address the issue of the large simulation data set by taking a pipelining rather than a multicore approach to acceleration. They implemented a single-core pipelined solution, but the technical details from the paper suggest that limited acceleration can be achieved. The work presented in [21] also approaches acceleration through pipelining instead of parallelization. That paper clearly details an insightful and well-executed implementation and the results are very promising. However, their design methodology is application specific. In contrast, our

work presents a modular approach that can be reused in other application domains to ease the design effort. Furthermore, the architecture presented in [21] is not designed with a focus on scalability, while our approach enables the parameterized adjustment of the design size, with results suggesting near-linear scaling of compute speed.

Having justified a multicore approach to accelerating the application, the key problem to be addressed is how to arbitrate access to the large shared data set by possibly hundreds of processing units (PUs). The random nature of the experiments suggests the need for a flexible architecture that allows runtime configuration of the data transfers. Furthermore, when considering the more general applicability of this work to many instances of Monte Carlo simulations, it is highly desirable to create a solution that will scale easily to adapt to different problem sizes. Consequently, we adopted the Network-on-Chip design paradigm [6] to fully implement and verify a 128-processing-unit network that accelerated the MC simulation central to SPECT imaging. We are not aware of any other works that have adopted an NoC approach to accelerating Monte Carlo simulations.

The remainder of this paper is structured as follows: Section 2 describes the image reconstruction algorithm. Section 3 outlines the NoC-based parallel architecture developed for accelerating the simulation, together with its implementation details. Section 4 details the results achieved on our implementation platform. Section 5 concludes the paper.

2 IMAGE RECONSTRUCTION ALGORITHM

Before introducing our new NoC-based accelerator architecture, this section presents the basic concepts used for image reconstruction in nuclear medical imaging and highlights the algorithmic patterns that have motivated our architectural decisions.

The physical basis for SPECT imaging is the detection of gamma rays emitted by decaying isotopes, which have been injected into a subject [19]. Prior to detection, these gamma rays may undergo attenuation and a series of scatterings on their course of exit from the patient. This makes determination of the variable of interest, namely the distribution of the radioactive source within the patient, nontrivial. As previously mentioned, the focus of this work is the accelerated simulation of this imaging process (see Fig. 2) using Monte Carlo methods, since this simulation is in the inner loop of a group of iterative reconstruction algorithms depicted in Fig. 3. It should be noted that the actual iterative refinement for image reconstruction is not addressed explicitly by this work.

Fig. 2 depicts the simulation of the imaging process. The patient density scan is obtained with a CT scanner (typically, this occurs at the same time as the SPECT images are captured, with a combined scanner) and is used in conjunction with the scatter model and the approximation of the source distribution to simulate the imaging process. These simulated images are compared to the measured images and used to iteratively refine the approximation of the source distribution, as shown in Fig. 3. As a model for this simulation, we adopted the SIMIND Monte Carlo application by Michael Ljungberg from Lunds Universitet [19]. This FORTRAN software model takes a description of


Fig. 3. Iterative reconstruction.

Fig. 2. Simulation of the imaging process.

the source distribution and a density map of the imaged subject (henceforth referred to as the phantom) and simulates the physical imaging process, from isotope decay to gamma ray detection in the crystal.

The following terminology is adopted for the remainder of this paper: the simulation refers to the entire imaging process, as executed by the processing platform, while the experiments are the algorithmic building blocks of the simulation, given in Fig. 4. In this application, it is typical for many millions of experiments to constitute the simulation, and it is critical to recognize that all experiments are independent, sharing only information which remains constant for the duration of the simulation. Finally, the data structure for an experiment, representing the physical ray which travels through the phantom, is referred to as the photon and contains the position, trajectory, scatter count, energy, state, and weight (discussed in the next paragraph) of the ray. The representation of this information is discussed in further depth in Section 3.3.

The experiment begins with simulation of isotope decay, immediately followed by a random sampling of the number of photon scatters that will be simulated over the course of the experiment. If the sampled value is 0, i.e., the photon exits the phantom without scattering, then it is called a primary photon; it is probabilistically forced into the solid angle of the detector and attenuation to the edge of the phantom is

Fig. 4. Workflow for one experiment.

calculated. In order to understand the notion of a photon being probabilistically forced, it is necessary to describe the concept of Photon History Weight (PHW). This is a variance reduction technique which lightens computational load by assigning a weight to each photon as it is emitted. This weight is reduced in proportion to the probability of each photon interaction over the course of the experiment. Consequently, some information is contributed to the final

KINSMAN AND NICOLICI: NOC-BASED FPGA ACCELERATION FOR MONTE CARLO SIMULATIONS WITH APPLICATIONS TO SPECT...

527

image by every experiment. This is in contrast to the traditional approach, where photons are probabilistically eliminated at each interaction, with only the surviving photons contributing to the resultant image. It is worth noting that the initial value of the weight is the same for each experiment and represents the average decay activity per photon, as calculated by (1), where n is the number of photons emitted per decay, A is the activity of the source in Bq, and N is the total number of photon histories to be simulated:

$$W_0 = \frac{n A}{N}. \qquad (1)$$
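As a concrete illustration of the Photon History Weight bookkeeping described above, the following C sketch computes the initial weight from (1) and scales it by an interaction survival probability. The function names and the numerical values are illustrative assumptions, not taken from SIMIND.

#include <stdio.h>

/* Initial photon history weight, eq. (1): W0 = n*A / N, where n is the
 * number of photons emitted per decay, A the source activity (Bq) and
 * N the total number of photon histories to be simulated. */
double initial_weight(double n, double activity_bq, double n_histories)
{
    return n * activity_bq / n_histories;
}

int main(void)
{
    double w = initial_weight(1.0, 3.7e8, 1.0e7);  /* example numbers only */

    /* At each simulated interaction the weight is scaled by the probability
     * of the sampled outcome instead of killing the photon outright. */
    double p_survive = 0.82;                       /* illustrative value   */
    w *= p_survive;

    printf("weight after one interaction: %g\n", w);
    return 0;
}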

Though several variance reduction techniques are commonly employed to decrease runtime [4], [7], [18], above we have summarized one representative example.

If the photon is not a primary photon, then an emission trajectory is randomly sampled and its corresponding direction cosines are computed. Since it is common for a photon to move to multiple locations at different distances along a single trajectory, the compute complexity of updating the photon position is reduced by storing the direction of travel within the photon data structure as a set of cosines (U, V, and W). In this way, the trigonometric functions are executed once per direction change and position updating becomes a simple matter of adding a factor of the cosines to the current position. After the direction cosines are computed, the maximum distance to the next interaction site (Dmax) is computed and then a photon path length is sampled (less than Dmax). If the scatter order of the photon is less than the previously sampled maximum, then a scatter type is sampled, the interaction is simulated, and the loop repeats. Otherwise, a Compton scattering is forced in order to direct the photon into the solid angle of the detector, and the photon attenuation and detection are simulated as with primary photons. The reader can obtain more detailed information regarding the specific implementation details from [19] or on the SIMIND website [20].

The focus of this work is primarily on the development of a parallel on-chip architecture that enables efficient use of all of the processing units. Therefore, specific algorithmic details will be discussed only when they provide insight into design considerations for the on-chip communication networks, which enable the massive parallelism facilitated by the large number of processing units. The following paragraphs detail the parts of the algorithm which guided the major architectural decisions.

Scatter simulation. At the site of an interaction in the phantom, one of two types of scatter can occur: Compton1 or Rayleigh2 (coherent). Random sampling dictates which of the two occurs, but both have important implications for on-chip memory usage. Both types of scattering use a sample-reject method that requires up to hundreds of accesses to the scatter model. Despite the fact that it is a substantial amount of data (nearly 2 kB), the high frequency with which processing elements access this data requires it to be replicated so that each PU has unarbitrated access to a copy.

Photon path length sampling. As mentioned above, when a photon is not primary, it is necessary to sample the
1. Inelastic photon-matter interaction decreasing ray energy and changing its direction. 2. Elastic photon-matter interaction changing ray direction.

location of the next photon interaction site. The site is determined by a combination of random sampling and comparison with density values along the photon's trajectory. In the vast majority of cases, this can be accomplished by sampling only 1-3 values from the density map, keeping in mind that these values often lie far from the photon's current position in the phantom. The implications of this become clear in Section 3.1, when traffic patterns are discussed.

Photon attenuation calculation. As a photon exits the phantom, it experiences attenuation proportional to the integral of the density values along the exit path. This process is necessarily discretized, such that the photon takes small steps along the exit trajectory and reads the density at each point, and in so doing incrementally calculates the total attenuation. In the majority of cases, this process requires 20-100 samples from the density map, though each sample exhibits high spatial locality to its preceding and following samples. The relevance of this matter is, again, further elaborated in the traffic patterns portion of Section 3.1.

The following section describes how the memory access requirements detailed above led to the development of a general framework for accelerating shared data set Monte Carlo simulations.
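To make the access patterns above concrete, the following C sketch shows a minimal photon record with direction cosines, the position update, and the discretized attenuation walk through a density map. The structure layout, map dimensions, step size, and bounds handling are assumptions for illustration only; they are not the bit-accurate photon format described in Section 3.3.

#define NX 64
#define NY 64
#define NZ 64

/* Minimal photon record: position, direction cosines (U, V, W),
 * scatter count, energy and history weight. */
typedef struct {
    double x, y, z;    /* position in the phantom (voxel units) */
    double u, v, w;    /* direction cosines of the trajectory   */
    int    scatters;
    double energy;
    double weight;
} photon_t;

/* Advancing the photon is just an add of the cosines scaled by the
 * travelled distance; trigonometry is only needed on direction changes. */
void advance(photon_t *p, double dist)
{
    p->x += dist * p->u;
    p->y += dist * p->v;
    p->z += dist * p->w;
}

/* Discretized attenuation along the exit path: take small steps and
 * accumulate the density read at each point (typically 20-100 samples,
 * each spatially close to the previous one). */
double attenuation(const photon_t *p, double density[NX][NY][NZ],
                   double path_len, double step)
{
    photon_t q = *p;
    double integral = 0.0;
    for (double s = 0.0; s < path_len; s += step) {
        advance(&q, step);
        int i = (int)q.x, j = (int)q.y, k = (int)q.z;
        if (i < 0 || i >= NX || j < 0 || j >= NY || k < 0 || k >= NZ)
            break;                        /* photon has left the phantom */
        integral += density[i][j][k] * step;
    }
    return integral;
}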

3 NOC-BASED PARALLEL ARCHITECTURE

In this section, we detail our new architecture for accelerating Monte Carlo simulations. First, we describe the architecture and then we provide implementation details for the on-chip network and processing units illustrated in Fig. 5.

3.1 Architecture Description
The most significant innovation presented in this work is the investigation into a general, scalable architecture for Monte Carlo simulations in which all the experiments share a data set which is too large to replicate for each processing unit, in this case the density map. The natural problem which arises in this situation is how to arbitrate access between the hundreds of PUs and the single copy of the data set so that each experiment can continue without being data starved. Although it is possible to position the data set centrally and arbitrate access to it, this comes at the price of higher data latency. This can be hidden to some extent by keeping many concurrent experiments at each PU, but in such a data-centric simulation, the hardware and time cost of such frequent context switching could become prohibitive. Consequently, we investigated an NoC-based architecture that enables distributing the data set among the processing units and implementing on-chip communication resources sufficient for each experiment to relocate itself to the PU containing the data it needs.

This section describes the design process that was followed, while giving concrete details of the final implementation. In many cases, experimental profiling is used to justify a design decision. This profiling was performed on a network simulator developed in house specifically for this project in order to enable the rapid prototyping of various network configurations. The network simulator was developed in C using the high-level synthesis library libHLS [22] developed at Northeastern University. It represents the network as a graph and performs cycle-accurate simulation of all network transfers,


with user-defined functions generating and processing the network traffic. Though there are a number of simulators available for exploration of the NoC design space, we chose to design our own because it allowed us to easily incorporate our specific application into the network simulation. This was important, first, because the network traffic is difficult to model, since it is heavily instance dependent, and hence a much more accurate analysis of the network performance was possible by simulating the network in the context of the application than would have been possible using a generic traffic model. Furthermore, because the simulator integrated with our application, it was useful not only as a design exploration tool but also as an effective debugging tool. The size of this design meant that it was infeasible to perform complete functional hardware simulation for any more than a few milliseconds. However, our network simulator was designed to interface seamlessly with the hardware simulation through the Direct Programming Interface (DPI) offered by SystemVerilog [28]. Consequently, after the hardware design of the network was functionally verified, while debugging the interaction between processing units the network transfers could be offloaded to our network simulator to be executed at a higher level and hence much more quickly.

After investigating the computational patterns that underlie the SIMIND Monte Carlo software, we reached
3. In this context, arithmetic operation refers to one of position update, trajectory rotation, sampling from a random distribution, scatter calculation as described in Section 2, or attenuation calculation.

Fig. 5. Top layer of the network.

a few insights that apply to a broad spectrum of problems which fall into the shared data set MC family:

- the ratio of shared data set accesses to arithmetic operations3 is quite high (in this case, approximately 1:1),
- an experiment may access data from spatially distant points of the density map over its entire duration; however, the majority of data accesses in close temporal proximity also exhibit close spatial proximity, and
- the majority of simulation time is spent on random sampling, with a comparatively small amount of time spent on complex arithmetic and trigonometric processing.

The architectural decisions detailed below are made in a way that directly exploits the above computational patterns. Naturally, applications that deviate heavily from these patterns will fail to realize significant benefit from the architecture proposed here. Though the implementation work necessary to map a set of distinct applications to custom hardware (in order to empirically substantiate the applicability of this approach to other applications) is outside the scope of this work, a brief survey of applications of Monte Carlo simulations reveals a broad category of algorithms that bear strong similarities to the candidate for this case study. The reader is directed to the literature from weather simulation [10], computational biology [27], and particle physics [29].

With the above information at hand, it is suggested that relatively few computational resources are needed for the processing units and that the top design priority should be minimizing data access latency. In order to achieve this goal, each processing unit is allocated an equal portion of the density map. The on-chip communication can be implemented such that any experiment could access the necessary data with minimal latency. This, however, creates many more degrees of freedom than when the density map is preallocated. The following paragraphs address the questions which arose out of this decision. It should be noted that the results reached in the forthcoming discussion apply specifically to the SPECT simulation problem. However, the design patterns revealed by these results can be applied more generally to parallel Monte Carlo simulations where a common data set is shared among the many concurrent experiments.

1. Into what topology should the processing units be organized? Since the data accesses of the experiments are not known a priori, there can be no guarantee that any one experiment will not access a certain location in memory. That is, each experiment must be able to have access to every element of the shared data set. Therefore, every PU must be connected to every other PU. This motivated the decision to pair every processing unit with a switch that would connect it to the underlying communication fabric. Because of the massive parallelism that was targeted, the complexity of the switching fabric must be kept as low as possible. Therefore, a regular topology was selected to reduce the addressing logic in the switches. All network


Fig. 6. Experiment orientation.

TABLE 1 Average Network Transactions per Photon

Fig. 7. Different types of network transfers.

dimensions are powers of 2 and the switches are organized into a torus with nearest neighbor connections to facilitate the even splitting of the density map. This matches well with intuition, since the photons take linear paths through the phantom. However, this does present a slight complication, since phantoms are rarely rectangular. Consider a human subject: if we attempt to divide the subject evenly into rectangular segments, we are bound to have some segments be emptier than others (for example, the segments containing the areas above the shoulders). In this application, performance can be limited since, in general, a processing unit is only as busy as the amount of data it is able to provide to experiments. This issue is treated later in Section 3.3, when detection simulation is revisited.

Having determined the general topology of the network, it was still unclear precisely how the processing elements should be distributed (e.g., for 128 PUs: 8 × 4 × 4, 2 × 2 × 32, 2 × 4 × 16, etc.). After extensive profiling of the model, patterns began to emerge from the data transfers. It was apparent that most of the photons showed a preference for traveling along the Z direction. Fig. 6 lends some insight to this by showing the simulation orientation. Since the detector is along the positive Z axis, photons which are traveling in any direction other than along this axis have a lower chance of being detected. Consequently, they are statistically eliminated earlier in their simulation and so traffic in the Z direction tends to be the most prominent. Intuitively, then, this would suggest that making the network shorter along the Z direction, and consequently elongating each PU's segment of the density map along the Z axis, could reduce the number of costly network transfers. Indeed, experimental profiling determined 8 × 8 × 2 as the optimized network configuration for the problem at hand, as demonstrated in Table 1. An argument could be made from these results for eliminating the Z dimension of the network, since the increase in network traffic (e.g., for 8 × 16 × 1 and 16 × 8 × 1 configurations)

could be compensated by the significant savings in hardware resources. Through investigating this tradeoff, it became clear that, in the context of the overall simulation, marginally more acceleration was possible with a network configuration of 8 × 8 × 2, and so this was the configuration used in the final implementation. An added benefit of this decision was that it demonstrated the implementation of a third network dimension, which could be used to realize more benefit in other applications in which the network traffic does not favor one dimension so heavily. Finally, it should be clarified that this particular configuration is by no means suggested as a general result; instead, attention is called to the fact that a careful analysis of data transfer patterns can motivate an alternative configuration more suited to the application at hand and therefore yield an improvement in network throughput.

2. What data should travel through the network? If an experiment needs a set of data, there is a natural choice to be made of whether to fetch the data across the network to the experiment or to move the experiment across the network to the PU with the data it needs. Concisely, should the network transport density data or photons? This decision has serious implications for system performance, since it will dictate in large part the latency of accesses to the density map. As discussed in Section 2, a photon can make hundreds of accesses to the density map over its lifetime, so reducing this data access latency is key. Conceptually, the tradeoff is easily understood: moving photons across the network is more costly in terms of latency and network resources than moving density data, but the spatial locality that consecutive fetches from the density map typically exhibit makes it possible to amortize this extra cost over a number of fetches (see Fig. 7). After extensive profiling, it was not obvious how to make a favorable decision for one model or the other. This is motivated by the fact that different parts of the photon processing have different data access patterns, e.g., scatter path sampling samples few elements over a large distance, while attenuation calculation samples many elements adjacent to each other. Out of this discussion, a key architectural insight was developed: the network can be built in a generic way such that it could carry either photons


TABLE 2 Performance of Different Transfer Models

or data samples, and the decision of which to move is made based on the data access pattern, which can be obtained by profiling the application. Table 2 shows the superior latency performance of this hybrid transfer model for the simulation of a 64 × 64 image using the Zubal phantom density map [38]. As expected, moving the density map for each access has very poor average latency performance, since a network transfer must be issued for every single fetch, but the significantly smaller packet sizes help to reduce peak latency and network loading. Moving the photons results in much better average latency performance but presents a challenge with peak access latency. Since the network is more heavily loaded due to the larger packet size, packets are far more likely to be delayed. In contrast, the hybrid transport model gives the best average latency performance and maintains a much better network load. It is worth noting as well that the latency characteristics of the hybrid model are very amenable to latency hiding. Data accesses in the hybrid model, as well as in the photon transport model to some extent, are characterized by a single long latency followed by many low-latency accesses. This greatly simplifies the scheduling for latency hiding as opposed to, for example, the density map transport model, which has a medium-sized latency with every data access.

3. How should the communication resources be allocated? The allocation of network resources in this context refers primarily to bus width. The size of the network for this implementation and our choice of platform essentially led to an immediate decision for bus width. Since there are 128 network nodes, a bus width of at least 7 bits is necessary to ensure the entire destination address can be stored within one network transfer, or flow unit (flit); this is covered in more depth in the upcoming discussion of routing and switching. To simplify the hardware for building and decoding packets, flits have a width that is a power of 2. However, even with careful attention to reducing the hardware complexity of the switches, a bus size of 16 would not leave sufficient on-chip resources for the processing units, resulting in a choice of 8 for the bus width. Interestingly, it is again possible to take advantage of the application data flow patterns (discussed in the context of topology, recall Fig. 6) when allocating network resources. Since the primary flow of photons is in the positive Z direction toward the detector, it is worthwhile to ask whether it is necessary to allocate resources for bidirectional travel in the Z direction. Travel in the XY plane is essentially isotropic, so loss of bidirectional travel in either X or Y severely restricts network flow, but simulation indicates that more than 80 percent of transfers along Z are in the positive Z direction. For our choice of topology, 8 × 8 × 2, the discussion is essentially irrelevant, since wraparound is implemented in all directions, but for a larger network, some resources could be saved by eliminating one direction of travel along the Z axis.
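The C sketch below illustrates how a packet might be framed into 8-bit flits with a 7-bit destination address in the header, consistent with the 128-node addressing and the flit width chosen above. The exact header, payload, and footer encoding shown here is a hypothetical layout for illustration, not the implemented wire format.

#include <stdint.h>
#include <string.h>

#define FLIT_BITS 8     /* chosen bus width                          */
#define MAX_FLITS 64    /* illustrative upper bound on packet length */

/* Hypothetical framing: the header flit carries the 7-bit destination
 * node id, payload flits carry the photon or density samples, and a
 * footer flit terminates the packet. */
typedef struct {
    uint8_t flits[MAX_FLITS];
    int     len;
} packet_t;

int build_packet(packet_t *pkt, uint8_t dest_node /* 0..127 */,
                 const uint8_t *payload, int payload_len)
{
    if (payload_len + 2 > MAX_FLITS)
        return -1;                              /* packet would not fit   */
    pkt->flits[0] = dest_node & 0x7F;           /* header: 7-bit address  */
    memcpy(&pkt->flits[1], payload, payload_len);
    pkt->flits[1 + payload_len] = 0xFF;         /* footer marker (illustrative) */
    pkt->len = payload_len + 2;
    return 0;
}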

4. What switching and routing policies should be adopted? The size of the network again had significant influence on the decision for the routing and switching strategies that were selected. The following were the criteria, in order of importance:

1. no data may be lost, i.e., no packets should be dropped,
2. the network should employ a deadlock-free routing strategy,
3. the switching strategy should be simple, to allow switches to be built with minimal resources, and
4. the average latency per transfer should be as small as possible.

Conditions 1, 2, and 3 can be guaranteed by picking an appropriate turn-model routing strategy. We route the Z direction first and then employ a single-right-turn routing strategy in the XY plane. One of the best strategies for guaranteeing no data loss in the network with relatively few resources is the Wormhole (WH) switching technique [17]. We have implemented a slightly modified version of WH that trades some network resources for overall network throughput, while leaving latency unaffected. The precise details of this are given in the following section.
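As an illustration of the Z-first policy just described, the C sketch below first locates the PU that owns a voxel of the 64 × 64 × 64 density map in the 8 × 8 × 2 torus (assuming the even split of 8 × 8 × 32 voxels per PU) and then picks the next routing dimension, correcting Z before moving in the XY plane. The ordering of the X and Y legs is simplified here to take the Y leg first, matching the example given in Section 3.2; the full single-right-turn rule and torus wraparound selection are omitted, so this is a sketch rather than the deadlock-free router itself.

/* Torus dimensions and the block of the density map owned by each PU,
 * assuming an even split of a 64x64x64 map over an 8x8x2 network. */
#define PX 8
#define PY 8
#define PZ 2
#define BX (64 / PX)   /* 8  voxels per PU in X */
#define BY (64 / PY)   /* 8  voxels per PU in Y */
#define BZ (64 / PZ)   /* 32 voxels per PU in Z */

typedef struct { int x, y, z; } coord_t;

/* Which PU stores a given voxel of the density map. */
coord_t owner_pu(int vx, int vy, int vz)
{
    coord_t pu = { vx / BX, vy / BY, vz / BZ };
    return pu;
}

/* Next hop at a switch: Z is routed first, then the XY plane.
 * For illustration the Y leg is taken before the X leg; wraparound
 * (shorter-way) selection is not shown. */
typedef enum { GO_XPOS, GO_XNEG, GO_YPOS, GO_YNEG,
               GO_ZPOS, GO_ZNEG, GO_LOCAL } dir_t;

dir_t next_hop(coord_t here, coord_t dest)
{
    if (dest.z != here.z) return (dest.z > here.z) ? GO_ZPOS : GO_ZNEG;
    if (dest.y != here.y) return (dest.y > here.y) ? GO_YPOS : GO_YNEG;
    if (dest.x != here.x) return (dest.x > here.x) ? GO_XPOS : GO_XNEG;
    return GO_LOCAL;   /* eject to the attached processing unit */
}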

3.2 Switch Structure
The packet switches are a registered set of input ports with a number of arbitrated paths to the output ports. Ports are 10 bits wide: 8 data bits, 1 bit to indicate whether a flit is valid or junk, and 1 bit for flow control. In addition, there is 1 line from each input port to its feeding output port to indicate packet arrival (discussed at the end of this section). The purpose of the arbitration logic is to assign one, and only one, output port to as many input ports as possible on each clock cycle. In order to minimize the hardware cost of the arbitration unit and output ports, while still implementing the desired routing strategy, limited-path switches were constructed. Fig. 10 shows the possible paths a packet can take through a switch. Packets from the PU can enter the network in any direction and, naturally, a packet arriving on any input port from the network can be directed out to the PU. As indicated above, the Z direction is routed first and, as a result, only packets coming directly from the PU can exit the Z output port of a switch. Once the photon has reached its correct Z address, it must enter the XY plane in the correct direction. A packet may only make one right turn after entering the XY plane. For example, if a packet must move in the positive X direction and the negative Y direction, it must select the negative Y direction first. This facilitates deadlock-free routing with an extremely simple arbitration unit. Once an input port has been directed to an output port, it locks that port until the entire packet is transmitted, indicated by the arrival of the footer. To demonstrate how this process occurs, part of a switch is shown in Fig. 9, with the full implementation given in Fig. 10. If an output port is unlocked and two or more input ports compete for simultaneous access to it, the assignment is made with the following priority (a minimal sketch of this arbitration follows the list):
1. Straight-through traffic (S),
2. Other in-plane traffic (T),
3. Out-of-plane traffic (Z),
4. Processing unit (P).
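The following C sketch models this fixed-priority arbitration: an unlocked output port is granted to the highest-priority competing input class. The enum encoding and the function interface are assumptions for illustration, not the RTL of the switch.

#include <stdbool.h>

/* Input classes in priority order: straight-through, other in-plane,
 * out-of-plane, processing unit. */
typedef enum { IN_S = 0, IN_T, IN_Z, IN_P, IN_CLASSES } in_class_t;

/* requesting[c] is true when an input of class c wants this output port.
 * Returns the winning class, or -1 if the port stays idle or is already
 * locked by a packet in flight. */
int arbitrate(const bool requesting[IN_CLASSES], bool port_locked)
{
    if (port_locked)
        return -1;
    for (int c = IN_S; c < IN_CLASSES; ++c)
        if (requesting[c])
            return c;   /* winner locks the port until its footer passes */
    return -1;
}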


Fig. 8. Wormhole switching.

Input ports which are not assigned are stalled until the desired output port opens. The mechanism for this stalling is very similar to the strategy employed by wormhole routing, in which each input port of a switch has two or more flits of storage. If there is contention for an output port, the lower-priority packet buffers into the storage on the input port and, when that buffer has only one space remaining, a stall signal is propagated back to the switch that is feeding that input port, as in Fig. 8. Obviously, with a deeper buffer, the stalled packet will occupy less space in the actual network. This is an extremely effective low-resource switching method; however, the buffer must be at least two levels deep to give the stall signal a clock cycle to propagate back to the previous switch. Unfortunately, there were insufficient resources on chip to provide two buffers for every input port, so a slightly different approach was taken (see Fig. 9). Only one level of input buffering is allocated and, if a header arrives in that buffer that cannot be routed, the input port is disabled. This causes dropping of payload flits to occur. To

Fig. 9. Modified wormhole switching.

resolve this, the packets are transmitted cyclically from the source PU. Once a packet has locked a path to the destination node, it propagates an arrival signal back to the source, which then transmits the remainder of the packet with a marker in the footer to allow the receiver to align the packet. Naturally, this cyclic transmission results in extra network transfers. Although this can waste some cycles in the source PU's network controller, network throughput is not negatively impacted, since these redundant flits are in a portion of the network that is locked to any other traffic; in fact, the blocking of the network controller actually turns out to be a very effective and simple way to limit network congestion.
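For reference, the C sketch below is a behavioral model of the standard two-flit wormhole input buffer described above: it raises its stall signal as soon as only one free slot remains, so the upstream switch has a cycle to stop sending. It is an illustrative model only; the modified single-buffer scheme with cyclic retransmission used in our switches is not reproduced here.

#include <stdbool.h>
#include <stdint.h>

#define BUF_DEPTH 2   /* classic wormhole input buffering */

typedef struct {
    uint8_t slot[BUF_DEPTH];
    int     count;
    bool    stall;    /* driven back to the upstream switch */
} in_port_t;

/* One cycle of the input port: optionally forward a buffered flit toward
 * the crossbar, optionally accept a new flit, then update the stall signal. */
void port_cycle(in_port_t *p, bool flit_valid, uint8_t flit, bool can_drain)
{
    if (can_drain && p->count > 0) {          /* forward the oldest flit */
        for (int i = 1; i < p->count; ++i)
            p->slot[i - 1] = p->slot[i];
        p->count--;
    }
    if (flit_valid && p->count < BUF_DEPTH)   /* accept the incoming flit */
        p->slot[p->count++] = flit;
    /* Stall once only one free slot is left, giving the signal a cycle
     * to reach the upstream switch before the buffer could overflow. */
    p->stall = (p->count >= BUF_DEPTH - 1);
}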

3.3 Processing Unit Structure
Based on our choice of implementation platform, more than 40 percent of the device's logic resources were consumed

Fig. 10. Routing configuration in a switch.


Fig. 11. Newton-Raphson calculator for processing units.

by the on-chip network. This motivated a PU design that is cost effective without impacting the accuracy of the results. Each PU has allocated to it, on average, two and a half 18-kbit BlockRAMs (as detailed in the experimental section, we have used a Xilinx FPGA). Of these, one is allocated for density map storage, one is allocated for a photon buffer, and one is shared with the neighboring PU, storing the scatter models and some simulation-constant data.

Based heavily on the work from [2], a single Newton-Raphson calculator, shown in Fig. 11, forms the core of the processing unit. It is capable of computing $\sqrt{D}$, $1/D$, $1/\sqrt{D}$, and $N/D$ with extremely few hardware resources. This, in combination with the comparator and a second dedicated multiplier, provides the computational framework necessary to perform scatter and attenuation calculations. However, there is still a need to perform trigonometric calculations for various other aspects of the simulation, most notably photon trajectory calculation. The ratio of trigonometric calculations to those detailed above is relatively low, yet a trig unit is relatively expensive in terms of hardware resources. Consequently, the decision was made to share a single trigonometry unit among an entire column of PUs (1 trig processor per 8 PUs). The trig unit is based on the well-documented CORDIC algorithm and the implementation was based heavily on the work from [1]. Arbitration is performed on a first-come-first-served basis and each CORDIC unit has one level of input buffering. Our profiling reveals that only 0.0027 percent of CORDIC commands arrive at a full CORDIC unit and need to be backed up into the network, and hence we conclude that these units are not bottlenecks to computation. Each processing engine is also equipped with a linear-feedback shift register (LFSR) for pseudorandom number generation, where each unit is selectively seeded for diversity. We readily acknowledge that other methods for random number generation with better statistical properties do exist; however, our choice was motivated by the very tight resource constraints in this application and by the understanding that the focus of this work is on the communication infrastructure.

The allocation of the resources for the arithmetic units required significant experimental profiling. It is critical in this application to preserve the accuracy of the reconstructed image, yet this must be balanced with the tight resource constraints. The resources required to implement a custom floating-point data path at each PU would drastically limit the number of PUs and hence the possible parallelism. Therefore, the application was mapped into fixed-point arithmetic, but the evaluation of the impact of finite precision on the quality of the reconstructed image is complicated by

Fig. 12. Image SNR versus simulation size.

the random nature of the simulation. To overcome this, a twofold approach was taken to establish the appropriate data widths: in the first phase, the experiments were evaluated individually to understand the relative impact on accuracy of each variable; in the second phase, the experiments were evaluated in the context of the simulation. The signal-to-noise ratio (SNR) of the fixed-point images, taking the floating-point images as reference, is insufficient to draw strong conclusions about the accuracy, because of the simulation randomness and because the images are so heavily dependent on the simulation data set. Consequently, the convergence of the images with increasing simulation size was used to evaluate the simulation accuracy, as shown in Fig. 12. Table 3 gives the bit widths that were empirically found to provide a reasonable balance between resource usage and reconstruction accuracy.
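For clarity, the C sketch below shows the Newton-Raphson iterations around which a unified divide / square-root / inverse-square-root unit of the kind in Fig. 11 is built. It is written in floating point for readability, whereas the actual PUs operate on the fixed-point formats of Table 3, and the seed generation (e.g., from a small lookup table) is not shown.

/* Newton-Raphson refinement: each iteration improves an estimate x of 1/d
 * (x = x*(2 - d*x)) or of 1/sqrt(d) (x = 0.5*x*(3 - d*x*x)). */
double nr_recip(double d, double x, int iters)
{
    while (iters--)
        x = x * (2.0 - d * x);
    return x;               /* approximates 1/d       */
}

double nr_rsqrt(double d, double x, int iters)
{
    while (iters--)
        x = 0.5 * x * (3.0 - d * x * x);
    return x;               /* approximates 1/sqrt(d) */
}

/* The remaining operations reuse these results with the multipliers:
 *   n / d   = n * nr_recip(d, x0, k)
 *   sqrt(d) = d * nr_rsqrt(d, x0, k)
 * where x0 is a coarse initial seed. */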

4 RESULTS

This paper has detailed the implementation of custom hardware for the simulation of SPECT imaging based on the SIMIND [20] Monte Carlo software. This section details the acceleration possible with our new architecture and is broken into three parts. First, the experimental setup is described, next the quality of the simulated images is discussed, and finally the acceleration results are given.

4.1 Experimental Setup
The first step in developing a custom hardware architecture for accelerating this application was reimplementing in C the 14,000 lines of FORTRAN source we obtained for SIMIND. This allowed the program to be restructured for easy extraction of profiling and debug data, such that a thorough understanding of the computational patterns could be achieved. Furthermore, the redevelopment was done with a heavy focus on performance and compute efficiency, so that we would have a reasonable reference against which to compare our hardware design. This exercise resulted in an implementation that was already several
TABLE 3 Bit Allocation for Photon


TABLE 4 Device Utilization

TABLE 5 Quality of Simulated Images

times faster than the original FORTRAN, and it should be noted that it is this C implementation against which all results in this section are reported. The NoC architecture which emerged from this process was modeled and verified in software before being implemented in hardware. The target platform was the Berkeley Emulation Engine 2 (BEE2) [5], on a single Xilinx Virtex-II Pro FPGA [34] with 74,448 logic cells and 738 kB of available on-chip memory. The utilization for an 8 × 8 × 2 network with a density map of size 64 × 64 × 64 is given in Table 4. The long-term goals of the project motivated the choice of the BEE2 platform, despite the use of only one out of the five available FPGAs. Ideally, the work will eventually be improved to utilize multiple FPGAs, and the high-speed LVCMOS connections between FPGAs on the BEE2 provide the infrastructure necessary for this extension.

4.2 Image Quality
Both signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR) were used to judge the quality of the images reconstructed by the FPGA implementation. SNR is a commonly accepted metric in science for comparing the level of the true image to the level of background noise present, while PSNR is often used in multimedia applications to judge the human perception of image reproduction errors, most commonly in connection with the performance of a compression codec. The SNR is defined as

$$\mathrm{SNR_{dB}} = 10 \log_{10} \frac{\sum A_{ref}^2}{\sum (A - A_{ref})^2} \qquad (2)$$

and the PSNR is defined as

$$\mathrm{PSNR_{dB}} = 10 \log_{10} \frac{1}{\mathrm{MSE}} \qquad (3)$$

for a normalized image, where MSE is the mean-squared error and is calculated as the average of the squares of the pixel errors.

Results of the FPGA implementation's reproduction of a suite of test images (such as Fig. 13) with respect to the reference implementation in double-precision floating-point arithmetic are given in Table 5. Table 5 also demonstrates an important trend that is expanded in Fig. 12, which shows the way in which the image generated by the software model tends toward converging with the image generated by the hardware design as the simulation size grows. This suggests that there is nothing inherent to the hardware simulation that causes a systematic inaccuracy in the generated image. That is, the random nature of the simulation appears to cancel errors which may arise from the implementation (i.e., through finite precision) as the simulation grows sufficiently large.
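A direct C implementation of (2) and (3) is sketched below for normalized images stored as flat arrays; the array layout and the assumption that pixel values lie in [0, 1] are choices made for this sketch.

#include <math.h>
#include <stddef.h>

/* SNR per (2): 10*log10( sum(ref^2) / sum((img - ref)^2) ). */
double snr_db(const double *img, const double *ref, size_t n)
{
    double sig = 0.0, err = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = img[i] - ref[i];
        sig += ref[i] * ref[i];
        err += d * d;
    }
    return 10.0 * log10(sig / err);
}

/* PSNR per (3) for images normalized to [0, 1]: 10*log10(1 / MSE). */
double psnr_db(const double *img, const double *ref, size_t n)
{
    double mse = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = img[i] - ref[i];
        mse += d * d;
    }
    mse /= (double)n;
    return 10.0 * log10(1.0 / mse);
}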

4.3 Acceleration Results
Table 6 gives the acceleration results measured over the computational kernel for various sizes of networks, with both the network and processing elements clocked at 200 MHz. These results are relative to the reference software implementation on a system running Mac OS X version 10.5.2 with a 2 GHz Intel Core 2 Duo processor with 3 MB cache (CPU) and 2 GB of DDR3 RAM at 1,066 MHz. The most interesting column of this table is the trend in the speedup per core. In order to get an idea of how the processing unit matches up to the CPU used in the test, we compared a network of only one node to the CPU implementation and found that the PUs we designed process photons roughly 10 percent more slowly than the CPU. When the on-chip network size is expanded to 4 × 4 × 1, we see a drop in the speedup per core. This drop accounts for the latency overhead of the network. Again, when a third dimension is added to the network and the size is increased to 4 × 4 × 2, there is another, smaller drop in performance. However, from this point, the network can continue to expand with only a very small decrease in acceleration per core. This means that this design can continue to scale onto larger FPGAs with nearly linear speedup by only increasing the number of cores.

The emergence of new platform options for parallel development, such as General-Purpose GPU-based computing, has made it possible to accelerate certain kinds of
TABLE 6 Acceleration Results

Fig. 13. Sample reconstructed image.


TABLE 7 Idling Time for Photons and PUs

arithmetically intense applications with less development effort than creating a custom hardware architecture. As was discussed in the introduction, the fixed memory infrastructure in these platforms can limit the possible acceleration for applications such as SIMIND. Nonetheless, we found it profitable to explore the potential of other platforms and consequently we have implemented and optimized SIMIND using the Nvidia CUDA platform [24]. We had at our disposal a GeForce GTX 470 GPU card [23] that we chose as the target for our implementation. By crafting warp-parallelized implementations of the major algorithmic events (scatter computation, attenuation calculation, etc.) and exploiting the texture memory for distributing access to the density map, we were able to achieve an average acceleration of 19.28× over the CPU implementation. This result emphasizes the merit of pursuing a custom hardware implementation, as only a fifth of the acceleration was achieved on a state-of-the-art GPU device in comparison to our FPGA implementation, which was deployed on a Xilinx device family that is not as recent as the GPU device family.

In order to get a more detailed picture of how well the on-chip network resources are keeping up with the PUs' need for data, Table 7 shows how much time both the PUs and the photons are, on average, spending idle. Note that a timewise division of the simulation is made into the first 10 percent, the middle 80 percent, and the last 10 percent with respect to the entire simulation duration. Note also that these numbers are not perfectly representative of the performance of the design, since there are a number of PUs on the periphery which have a lot of empty space in their density map and consequently have significantly higher idle time as a result of their limited role in the simulation, not as a result of data starvation. Finally, it should be mentioned that the PU idle time is given as a percentage of the simulation time, whereas a photon's idle/network times are given as a percentage of its corresponding experiment's lifetime.

These numbers again match intuition very well. At the beginning of the simulation, many photons are being created and a critical mass of in-simulation photons is building. While that is happening, the full effect of most latencies will be borne by the PU. As the simulation progresses, PUs move to having 3-7 active experiments at one time, meaning that the effect of the latencies from data accesses where the density map is fetched (refer to Section 3.1, Point 3) moves to the experiments instead, so that the PUs can keep active. This explains the notable increase in experiment idle time; however, it must be remembered that experiment idle time does not necessarily cost simulation time. Rather, it costs extra space in the processing buffer. Finally, toward the end of the simulation, the last photons are being cleared out and this causes an exaggeration of the same effect that was discussed for the beginning of the simulation. Also in

accordance with intuition, the proportion of an experiment's duration that is spent with the photon in the network does not change substantially over the course of the simulation.

As a final note, if we were to employ a larger FPGA, the design could scale through parameterization to improve speedup and resolution. If the memory capacity of the device is increased, the resolution of the image can be increased in direct proportion; although this would slow the simulation, the relative effect on runtime would be the same for the CPU and FPGA implementations. Similarly, the results suggest near-linear scaling of acceleration with an increase in logic resources. Therefore, given an FPGA with 8-10× the logic and memory resources of the Virtex-II Pro, which is not uncommon for the large state-of-the-art devices, we would expect to achieve an 8× increase in resolution and approximately an additional 8× speedup over the CPU implementation.

5 CONCLUSIONS AND FUTURE WORK

In this paper, we have presented an NoC-based parallel architecture, and its FPGA implementation, for accelerating Monte Carlo simulations that share common data sets. The case study on SPECT imaging has shown that significant speedups can be achieved over single-core implementations, without compromising the image reconstruction accuracy. The proposed architecture, because it is based on an NoC approach, is modular and hence can easily be expanded to support FPGAs of larger capacity. Our ongoing work is focused on developing the communication interfaces that facilitate inter-chip communication for multi-FPGA platforms, as well as memory controllers that hide external data access latency when Monte Carlo simulations rely on very large data sets that need to be stored in onboard memories, and when the large number of distributed embedded memories in FPGAs needs to be used only for caching purposes.

ACKNOWLEDGMENTS
It is with sincere gratitude that the authors acknowledge Troy Farncombe of the McMaster Institute of Applied Radiation Sciences and Aleksandar Jeremic from the Electrical and Computer Engineering Department at McMaster for introducing them to this problem. Their guidance at the inception of the project was invaluable to its success. They also extend their thanks to Jason Thong, Adam Kinsman, and Nathan Cox of McMaster's Computer-Aided Design and Test Laboratory for their involvement at various stages of the project. The work represented here has benefited greatly from their insights and contributions. Finally, they extend their appreciation to NSERC, CMC Microsystems, and McMaster University for providing the funds, equipment, and infrastructure that enabled this research.

REFERENCES
[1] F. Angarita, A. Perez-Pascual, T. Sansaloni, and J. Valls, "Efficient FPGA Implementation of CORDIC Algorithm for Circular and Linear Coordinates," Proc. Int'l Conf. Field Programmable Logic and Applications, pp. 535-538, Aug. 2005.
[2] S. Aslan, E. Oruklu, and J. Saniie, "Realization of Area Efficient QR Factorization Using Unified Division, Square Root, and Inverse Square Root Hardware," Proc. IEEE Int'l Conf. Electro/Information Technology (EIT '09), pp. 245-250, June 2009.
[3] A. Badal and A. Badano, "Monte Carlo Simulation of X-Ray Imaging Using a Graphics Processing Unit," Proc. IEEE Nuclear Science Symp. Conf. Record (NSS/MIC), pp. 4081-4084, Oct. 2009.
[4] F.J. Beekman, H.W.A.M. de Jong, and S. van Geloven, "Efficient Fully 3D Iterative SPECT Reconstruction with Monte Carlo-Based Scatter Compensation," IEEE Trans. Medical Imaging, vol. 21, no. 8, pp. 867-877, Aug. 2002.
[5] C. Chang, J. Wawrzynek, and R.W. Brodersen, "BEE2: A High-End Reconfigurable Computing System," IEEE Design & Test of Computers, vol. 22, no. 2, pp. 114-125, Mar. 2005.
[6] G. De Micheli and L. Benini, Networks on Chips: Technology and Tools. Morgan Kaufmann, 2006.
[7] T.C. de Wit, J. Xiao, and F.J. Beekman, "Monte Carlo-Based Statistical SPECT Reconstruction: Influence of Number of Photon Tracks," IEEE Trans. Nuclear Science, vol. 52, no. 5, pp. 1365-1369, Oct. 2005.
[8] Y.K. Dewaraja, M. Ljungberg, A. Majumdar, A. Bose, and K.F. Koral, "A Parallel Monte Carlo Code for Planar and SPECT Imaging: Implementation, Verification and Applications in 131-I SPECT," Proc. IEEE Nuclear Science Symp. Conf. Record, vol. 3, pp. 20/30-20/34, 2000.
[9] V. Fanti, R. Marzeddu, C. Pili, P. Randaccio, S. Siddhanta, J. Spiga, and A. Szostak, "Dose Calculation for Radiotherapy Treatment Planning Using Monte Carlo Methods on FPGA Based Hardware," Proc. 16th IEEE-NPSS Real Time Conf. (RT '09), pp. 415-419, May 2009.
[10] M. Gebremichael, W.F. Krajewski, M. Morrissey, D. Langerud, G.J. Huffman, and R. Adler, "Error Uncertainty Analysis of GPCP Monthly Rainfall Products: A Data-Based Simulation Study," J. Applied Meteorology, vol. 12, pp. 1837-1848, 2003.
[11] K. Gulati and S.P. Khatri, "Accelerating Statistical Static Timing Analysis Using Graphics Processing Units," Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC '09), pp. 260-265, Jan. 2009.
[12] B.F. Hutton, H.M. Hudson, and F.J. Beekman, "A Clinical Perspective of Accelerated Statistical Reconstruction," European J. Nuclear Medicine, vol. 24, pp. 797-808, 1997.
[13] C. Jiang, P. Li, and Q. Luo, "High Speed Parallel Processing of Biomedical Optics Data with PC Graphic Hardware," Proc. Asia Comm. and Photonics Conf. and Exhibition (ACP), vol. 2009-Supplement, pp. 1-7, Nov. 2009.
[14] A. Kaganov, P. Chow, and A. Lakhany, "FPGA Acceleration of Monte-Carlo Based Credit Derivative Pricing," Proc. Int'l Conf. Field Programmable Logic and Applications (FPL '08), pp. 329-334, Sept. 2008.
[15] C.-M. Kao and X. Pan, "Evaluation of Analytical Methods for Fast and Accurate Image Reconstruction in 3D SPECT," Proc. IEEE Nuclear Science Symp. Conf. Record, vol. 3, pp. 1599-1603, 1998.
[16] D.B. Kirk and W.M.W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.
[17] A. Leroy, J. Picalausa, and D. Milojevic, "Quantitative Comparison of Switching Strategies for Networks on Chip," Proc. Third Southern Conf. Programmable Logic (SPL '07), pp. 57-62, Feb. 2007.
[18] S. Liu, M.A. King, A.B. Brill, M.G. Stabin, and T.H. Farncombe, "Accelerated SPECT Monte Carlo Simulation Using Multiple Projection Sampling and Convolution-Based Forced Detection," IEEE Trans. Nuclear Science, vol. 55, no. 1, pp. 560-567, Feb. 2008.
[19] M. Ljungberg, Monte Carlo Calculations in Nuclear Medicine. Inst. of Physics Publishing, 1998.
[20] M. Ljungberg, "The SIMIND Monte Carlo Program," http://www.radfys.lu.se/simind/, Sept. 2010, Feb. 2010.
[21] J. Luu, K. Redmond, W.C.Y. Lo, P. Chow, L. Lilge, and J. Rose, "FPGA-Based Monte Carlo Computation of Light Absorption for Photodynamic Cancer Therapy," Proc. 17th IEEE Symp. Field Programmable Custom Computing Machines (FCCM '09), pp. 157-164, Apr. 2009.
[22] Northeastern Univ., "LibHLS Home Page," http://www.ece.neu.edu/groups/rcl/libHLS/index.html, Sept. 2010, Apr. 2002.
[23] NVIDIA Corporation, "NVIDIA GeForce GTX 480/470/465 GPU Datasheet," technical report, 2010.
[24] NVIDIA Corporation, "NVIDIA CUDA C Programming Guide Version 4.0," technical report, Mar. 2011.
[25] A.S. Pasciak and J.R. Ford, "A New High Speed Solution for the Evaluation of Monte Carlo Radiation Transport Computations," IEEE Trans. Nuclear Science, vol. 53, no. 2, pp. 491-499, Apr. 2006.
[26] GPU Gems 2: Programming Techniques for High-Performance Graphics and General Purpose Computation, M. Pharr, ed., vol. 2. Addison-Wesley, 2005.
[27] N. Salamin, T. Hodkinson, and V. Savolainen, "Towards Building the Tree of Life: A Simulation Study for All Angiosperm Genera," Systematic Biology, vol. 54, pp. 183-196, 2005.
[28] C. Spear, SystemVerilog for Verification: A Guide to Learning the Testbench Language Features, second ed. Springer, June 2007.
[29] X-5 Monte Carlo Team, "MCNP: A General Monte Carlo N-Particle Transport Code," technical report, Los Alamos Nat'l Laboratory, 2003.
[30] X. Tian and K. Benkrid, "Design and Implementation of a High Performance Financial Monte-Carlo Simulation Engine on an FPGA Supercomputer," Proc. Int'l Conf. ICECE Technology (FPT '08), pp. 81-88, Dec. 2008.
[31] X. Tian and K. Benkrid, "American Option Pricing on Reconfigurable Hardware Using Least-Squares Monte Carlo Method," Proc. Int'l Conf. Field-Programmable Technology (FPT '09), pp. 263-270, Dec. 2009.
[32] A. Wirth, A. Cserkaszky, B. Kari, D. Legrady, S. Feher, S. Czifrus, and B. Domonkos, "Implementation of 3D Monte Carlo PET Reconstruction Algorithm on GPU," Proc. IEEE Nuclear Science Symp. Conf. Record (NSS/MIC), pp. 4106-4109, Oct. 2009.
[33] N.A. Woods and T. VanCourt, "FPGA Acceleration of Quasi-Monte Carlo in Finance," Proc. Int'l Conf. Field Programmable Logic and Applications (FPL '08), pp. 335-340, Sept. 2008.
[34] Xilinx Inc., "Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet," Technical Report DS083, June 2011.
[35] L. Xu, M. Taufer, S. Collins, and D.G. Vlachos, "Parallelization of Tau-Leap Coarse-Grained Monte Carlo Simulations on GPUs," Proc. IEEE Int'l Symp. Parallel Distributed Processing (IPDPS), pp. 1-9, Apr. 2010.
[36] Y. Yamaguchi, R. Azuma, A. Konagaya, and T. Yamamoto, "An Approach for the High Speed Monte Carlo Simulation with FPGA: Toward a Whole Cell Simulation," Proc. IEEE Int'l Symp. Micro-NanoMechatronics and Human Science, vol. 1, pp. 364-367, Dec. 2003.
[37] S. Zhao and C. Zhou, "Accelerating Spatial Clustering Detection of Epidemic Disease with Graphics Processing Unit," Proc. 18th Int'l Conf. Geoinformatics, pp. 1-6, June 2010.
[38] I. George Zubal, C.R. Harrell, E.O. Smith, and A.L. Smith, "Two Dedicated Software, Voxel-Based Anthropomorphic (Torso and Head) Phantoms," Proc. Int'l Workshop Voxel Phantom Development, vol. 6/7, July 1995.

Phillip J. Kinsman received the BEng degree in electrical and biomedical engineering and the MASc degree in computer engineering from McMaster University, Canada, in 2010 and 2011, respectively. His research interests are in the field of hardware acceleration for scientific computing. He is a student member of the IEEE.

Nicola Nicolici (S'99-M'00-SM'11) received the Dipl. Ing. degree in computer engineering from the University of Timisoara, Romania, in 1997, and the PhD degree in electronics and computer science from the University of Southampton, United Kingdom, in 2000. He is an associate professor in the Department of Electrical and Computer Engineering at McMaster University, Canada. His research interests are in the area of computer-aided design and test. He has authored a number of papers in this area and received the IEEE TTTC Beausang Award for the Best Student Paper at the International Test Conference (ITC 2000) and the Best Paper Award at the IEEE/ACM Design Automation and Test in Europe Conference (DATE '04). He is a senior member of the IEEE.

