Field-programmable gate array accelerates FDTD calculations
Eric J. Kelmelis, James P. Durbano, Petersen F. Curt and Jiazong Zhang
Advanced scientific computing applications, such as physics-based simulation design tools, require substantially more computational power (and memory) than is available from standard desktop computers. This has led to the advent of clustered computing, a technique in which a single problem is distributed across a group of commodity PCs that perform the computations in parallel. This “more is better” approach is epitomized in modern supercomputer design, which typically consists of many standard microprocessors linked by high-speed interconnects. The approach has several drawbacks, however, including high setup costs, ongoing maintenance requirements, and diminishing returns as more processing nodes are added.
Alternative platforms are emerging that are better suited to computationally intense algorithms. One of the most promising devices is the field-programmable gate array (FPGA), a reconfigurable chip that can be modified at the gate level so that the hardware architecture is tailored to a given algorithm. Because FPGAs hold several advantages over microprocessors, researchers are looking to this technology to meet the demanding computational needs of scientific applications, including the compute- and memory-intensive finite-difference time-domain (FDTD) method.
FDTD is a widely used computational technique for providing accurate full-wave analysis of electromagnetic phenomena. However, the space-domain discretization and time-domain iterations are bottlenecks that limit the performance of FDTD simulations. One innovative solution developed by EM Photonics (Newark, DE) is the integration of FPGA technology into a single workstation for memory-intensive FDTD simulations.
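For reference, the kernel that FDTD iterates is the leapfrog update of the Yee scheme. In a lossless medium, the update for one electric-field component takes the standard textbook form (the notation here is the conventional discretization, not a formula quoted from the EM Photonics implementation):

$$
E_x^{\,n+1}(i,j,k) = E_x^{\,n}(i,j,k) + \frac{\Delta t}{\varepsilon}\left(\frac{H_z^{\,n+1/2}(i,j,k) - H_z^{\,n+1/2}(i,j-1,k)}{\Delta y} - \frac{H_y^{\,n+1/2}(i,j,k) - H_y^{\,n+1/2}(i,j,k-1)}{\Delta z}\right)
$$

Analogous updates apply to the other five field components, and every cell of the grid is updated this way at every time step, which is what makes the method both compute- and memory-intensive.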
Using FPGAs for scientific computation is a three-step process. First, the algorithm is analyzed to locate its computational bottleneck, typically the kernel in which the majority of the computations are performed. Next, this kernel is mapped into the FPGA; doing so efficiently requires knowledge of the internal FPGA architecture, a hardware-description language, and the physics of the underlying algorithm. Code targeted at a microprocessor-based platform will rarely run well if translated directly into the FPGA, so the algorithm must be well understood before it is mapped into hardware. Finally, the algorithm is modified to take advantage of a fully customizable computational pipeline, which generally requires manipulating the core equations. Doing this in the most efficient manner requires intimate knowledge of the algorithm itself, not simply the equations that describe it.
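As a concrete illustration of what such a kernel looks like before it is mapped into hardware, here is a minimal C sketch of the update loop for one field component. The array names, grid sizes, and coefficient arrays are illustrative assumptions, not EM Photonics' actual code:

```c
/* Minimal FDTD kernel sketch: updates the Ex field component over a
 * 3-D Yee grid. This loop nest (repeated for all six field components)
 * is the computational bottleneck that gets mapped into the FPGA.
 * Grid sizes and coefficient arrays are illustrative. */
#define NX 100
#define NY 80
#define NZ 150
#define IDX(i, j, k) (((i) * NY + (j)) * NZ + (k))

void update_ex(float *ex, const float *hy, const float *hz,
               const float *ca, const float *cb)  /* per-cell material coefficients */
{
    for (int i = 0; i < NX; i++)
        for (int j = 1; j < NY; j++)
            for (int k = 1; k < NZ; k++) {
                int n = IDX(i, j, k);
                /* discrete curl of H drives the E-field update */
                float curl_h = (hz[n] - hz[IDX(i, j - 1, k)])
                             - (hy[n] - hy[IDX(i, j, k - 1)]);
                ex[n] = ca[n] * ex[n] + cb[n] * curl_h;
            }
}
```

The coefficient arrays `ca` and `cb` fold the material parameters, time step, and grid spacings into per-cell constants, a common way of writing the update so that the inner loop contains only multiplies and adds.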
The FPGA-based implementation of FDTD made use of two key hardware optimizations: parallelism and caching.1 As the FPGA is configurable, we developed a massively parallel architecture to update multiple electromagnetic fields simultaneously. This is a clear advantage over standard microprocessor architectures, which primarily support a single computational pipeline. It is also preferable to multiprocessor systems, since all computational pipelines can be housed in a single chip. Furthermore, the parallel computational pipelines within the FPGA were designed specifically to perform the FDTD kernel operations. Thus, an ideal computational pipeline was developed with the exact logic blocks necessary to update the fields.
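A rough software analogue of these replicated pipelines is loop unrolling: several adjacent cells are updated per "step," just as several hardware pipelines update cells in lockstep. The sketch below, reusing the grid definitions from the kernel sketch above, unrolls by a purely illustrative factor of four; on the FPGA the parallelism is physical hardware, not a compiler transformation:

```c
/* Software analogue of replicated FPGA pipelines: the inner loop is
 * unrolled so four adjacent cells are updated per step, mirroring four
 * hardware pipelines working in parallel. UNROLL = 4 is illustrative;
 * real pipeline counts depend on available FPGA resources. */
#define UNROLL 4

static inline float ex_update(float ca, float cb, float ex,
                              float hz, float hz_jm1,
                              float hy, float hy_km1)
{
    return ca * ex + cb * ((hz - hz_jm1) - (hy - hy_km1));
}

void update_ex_unrolled(float *ex, const float *hy, const float *hz,
                        const float *ca, const float *cb)
{
    for (int i = 0; i < NX; i++)
        for (int j = 1; j < NY; j++)
            /* remainder cells at the end of each row omitted for brevity */
            for (int k = 1; k + UNROLL <= NZ; k += UNROLL)
                for (int u = 0; u < UNROLL; u++) {   /* the four "pipelines" */
                    int n = IDX(i, j, k + u);
                    ex[n] = ex_update(ca[n], cb[n], ex[n],
                                      hz[n], hz[IDX(i, j - 1, k + u)],
                                      hy[n], hy[IDX(i, j, k + u - 1)]);
                }
}
```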
To keep the parallel computation paths from stalling while waiting for data, it was necessary to develop a novel caching mechanism. This was accomplished by coupling a data-prefetching technique with the FPGA’s internal memory. By caching data as it enters the FPGA and again before it returns to DRAM, it is possible to hide the DRAM latency and maintain high memory throughput. Currently, the FPGA system processes more than 42 million nodes per second (Mnps) on simulations upward of 270 million nodes in size.
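The idea behind the caching scheme can be sketched in software as a sliding-window buffer: while cells stream in from DRAM, the most recently loaded planes of the grid are kept in fast local memory, so neighbor accesses hit the local buffer rather than DRAM. The plane-buffer organization and sizes below are assumptions for illustration; the published architecture (see reference 1) describes the actual mechanism:

```c
/* Sliding-window cache sketch: stream the grid plane by plane from
 * "DRAM" (the large array) into a small local buffer that models the
 * FPGA's on-chip memory. The update kernel then reads only the local
 * buffer. In real hardware the fetch of plane j overlaps the update of
 * plane j-1; this sketch performs them sequentially for clarity. */
#include <string.h>

#define NXC 100            /* illustrative grid dimensions */
#define NZC 150
#define PLANE (NXC * NZC)  /* cells per constant-j plane */

void stream_planes(const float *dram_hz, int ny,
                   void (*update_plane)(const float *cur, const float *prev))
{
    static float window[2][PLANE];   /* models on-chip memory */
    memcpy(window[0], dram_hz, sizeof window[0]);   /* prefetch plane j = 0 */

    for (int j = 1; j < ny; j++) {
        /* fetch plane j into the buffer not currently in use */
        memcpy(window[j & 1], dram_hz + (size_t)j * PLANE, sizeof window[0]);
        /* the update reads only local buffers, never DRAM */
        update_plane(window[j & 1], window[(j - 1) & 1]);
    }
}
```

Because each cell's update needs only values from the neighboring plane, two resident planes suffice in this sketch; the real design couples deeper buffering with prefetching to sustain throughput.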
All the necessary hardware components are housed on a single PCI card (see figure), which integrates with a standard PC in the same way as a video card or network adapter. The board becomes an “FDTD coprocessor” that accelerates FDTD calculations completely transparently to users of the CAD software. From a design engineer’s perspective, a workstation that includes the FDTD accelerator card is used exactly as before: the engineer interacts with the CAD software as he normally would, but simulations run significantly faster and can be performed on much larger models.
Mapping the core FDTD equations into the FPGA yields significant computational performance, owing to the ability to create ideal computational paths, run numerous computations in parallel, and access a large memory space efficiently. By housing the FPGA on a PCI card with massive amounts of onboard memory, a single workstation can run simulations at the speed of PC clusters. An accelerated FDTD workstation based on this technology lets users interact with their CAD software as normal; the only change is the ability to run larger simulations in much less time.
With the accelerator paired to Optiwave’s OptiFDTD simulation design suite, hardware FDTD benchmarks were compared with single-PC FDTD simulations (see table). The benchmark PC was a 1.5-GHz Pentium 4 with 2 Gbytes of RAM. The comparison demonstrated a 25-times speed improvement over optimized FDTD running on the standard PC. In addition, the FPGA-based solution solved problems more than five times larger than is possible on a desktop PC, without loss of performance. FPGA-based simulation can thus be of real value to researchers who need more computing power to meet their design needs.
| Grid size (cells) | PML layers | Iterations | FDTD-card computation time (s) |
| --- | --- | --- | --- |
| 100 × 80 × 150 | 15 | 10,000 | 794 |
| 250 × 150 × 250 | 10 | 3,000 | 1,121 |
| 200 × 200 × 459 | 10 | 9,000 | 6,135 |
REFERENCES
1. J.P. Durbano et al., “Hardware Acceleration of the 3D Finite-Difference Time-Domain Method,” presented at the IEEE AP-S International Symposium on Antennas and Propagation (2004).
ERIC J. KELMELIS, JAMES P. DURBANO, and PETERSEN F. CURT are members of the Accelerated Computing Technologies division at EM Photonics, 51 East Main St., Suite 203, Newark, DE 19711. JIAZONG ZHANG is a Research Scientist at Optiwave Systems, 7 Capella Court, Ottawa, Ontario K2E 7X1 Canada; e-mail: [email protected].