The scientific computing community has long been closely tied to high-performance computing (HPC), which was once the privilege of a limited group of scientists. Recently, with the rapid development of graphics processing units (GPUs), the parallel processing power of high-performance computers has been extended to ordinary desktop computers, reducing the cost of scientific calculations. In this paper, we develop a general-purpose lattice Boltzmann code that runs on commodity computers with multiple heterogeneous devices supporting the OpenCL specification. Several approaches to implementing lattice Boltzmann code on commodity computers with multiple devices have been explored. Simulation results for the different code implementations on multiple devices were compared with each other, with the results obtained for the single-device implementation, and with results from the literature. On basic computer hardware platforms, the multiple-device implementations showed a significant speedup compared to the simulation running on a single device.

The computer processor industry reached an inflection point a few years ago, when improvements in CPU performance hit a serious frequency wall. Major processor vendors started producing multi-core CPUs, and all major GPU vendors turned to designing many-core GPUs. With the development of multi-core and many-core hardware architectures, numerical computer simulations have increased in almost every area of science and engineering. Recently, the lattice Boltzmann method (LBM) has become an alternative method for computational fluid dynamics (CFD) and has demonstrated its ability to simulate a wide variety of fluid flows. LBM is computationally expensive and memory-intensive, but since it is explicit and the governing equations are local (only nearest-neighbor information is required), the method is very well suited to parallel computing on multi-core and many-core hardware. The graphics processing unit (GPU) is a massively multi-threaded architecture and is therefore widely used for graphical, and now also non-graphical, computations. The main advantage of GPUs is their ability to perform significantly more floating-point operations (FLOPs) per unit of time than CPUs. To unify software development for different hardware devices (mainly GPUs), an effort has been made to establish OpenCL, a standard for heterogeneous platform programming. There is a considerable cost associated with utilizing the full potential of today's multi-core CPUs and many-core GPUs: sequential code must be (re)written to explicitly expose the parallelism of the algorithm. Various, often vendor-specific, programming models have been established for this purpose.

The main goal of the present work is to implement the lattice Boltzmann method according to the OpenCL specification, with the most computationally intensive parts of the algorithm executed on multiple heterogeneous devices, resulting in a simulation speedup compared to the single-device implementation. A further goal is to demonstrate that, using the Java programming language and OpenCL, all available devices on commodity computer hardware can be leveraged to accelerate scientific simulations. Additionally, two different implementations for commodity computers with multiple heterogeneous devices are created and their performances are compared.
The implementations are developed using the Java programming language for the host (control) program and the OpenCL specification for the kernels, which parallelize parts of the algorithm across two or more heterogeneous devices. The link between the host (Java) and kernel (OpenCL) programs is provided by the JOCL Java library; a minimal sketch of this host-side setup is given at the end of this section. The simulations were run on three different commodity hardware platforms. The performances of the implementations are compared, and it is concluded that the implementations running on two or more OpenCL devices perform better than the presented implementation running on only one device.

Multi-GPU implementations of LBM using CUDA have been widely discussed in the literature. In one study, a cavity-flow implementation using the D3Q19 lattice model, the multiple-relaxation-time (MRT) approximation, and CUDA is presented; the simulation was tested on a node consisting of six Tesla C1060s, and POSIX threads were used to implement the parallelism. Another work described cavity flow for various depth-to-width aspect ratios using the D3Q19 model and the MRT approximation; that simulation was parallelized using OpenMP and tested on a single-node multi-GPU system consisting of three NVIDIA M2070 devices or three NVIDIA GTX 560 devices. A further study presented an LBM implementation for fluid flow through porous media on multiple GPUs, also using CUDA and MPI, and proposed optimization strategies based on the structure and layout of the data; that implementation was tested on a single-node cluster equipped with four Tesla C1060s. The authors of another reference adopted the Message Passing Interface (MPI) for GPU management on GPU clusters and explored accelerating a cavity-flow implementation by overlapping communication and computation; the D3Q19 model and the MRT approximation are also used in that reference. Xian described a CUDA implementation of flow around a sphere using the D3Q19 model and the MRT approximation, with code parallelism based on the MPI library; communication time is reduced by partitioning the solution domain or by using multi-stream computation and communication, and a supercomputer equipped with 170 Tesla S1070 nodes (680 GPUs) was used for the calculations. Yet another study implemented single-phase, multi-phase, and multi-component LBM on multi-GPU clusters using CUDA and OpenMP. Very few OpenCL implementations of LB codes have been described in the literature so far. One comparison of LBM implementations in CUDA and OpenCL on a single compute device shows that properly structured OpenCL code achieves performance levels close to those of the CUDA architecture. To the best of the authors' knowledge, no articles have been published on implementing LBM using Java and OpenCL on multiple commodity computer devices.
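To make the host-side structure described above concrete, the following is a minimal, illustrative sketch (not the actual code of this work) of how a Java program can use the JOCL bindings to enumerate all OpenCL devices available on a commodity computer. The class name and output format are assumptions; only standard JOCL calls that mirror the OpenCL C API are used.

    import org.jocl.*;
    import static org.jocl.CL.*;

    import java.util.ArrayList;
    import java.util.List;

    /** Minimal JOCL sketch: list all OpenCL devices across all platforms. */
    public class DeviceEnumeration {
        public static void main(String[] args) {
            CL.setExceptionsEnabled(true); // throw CLException instead of returning error codes

            // Query the number of available OpenCL platforms, then fetch them
            int[] numPlatforms = new int[1];
            clGetPlatformIDs(0, null, numPlatforms);
            cl_platform_id[] platforms = new cl_platform_id[numPlatforms[0]];
            clGetPlatformIDs(platforms.length, platforms, null);

            List<cl_device_id> allDevices = new ArrayList<>();
            for (cl_platform_id platform : platforms) {
                // CL_DEVICE_TYPE_ALL picks up CPUs and GPUs alike (heterogeneous devices)
                int[] numDevices = new int[1];
                clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, null, numDevices);
                cl_device_id[] devices = new cl_device_id[numDevices[0]];
                clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, devices.length, devices, null);
                for (cl_device_id device : devices) {
                    allDevices.add(device);
                }
            }
            System.out.println("Found " + allDevices.size() + " OpenCL device(s)");
        }
    }

A host program structured this way can hand each discovered device its own command queue and subdomain, which is the pattern the implementations below rely on.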
A. Lattice Boltzmann equation

In the lattice Boltzmann method, the fluid motion is simulated by particle motion and collision on a uniform lattice, and the fluid is modeled by a single-particle distribution function. The evolution of the distribution function is governed by the lattice Boltzmann equation

\[ f_i(\mathbf{x} + \mathbf{e}_i \Delta t, \, t + \Delta t) = f_i(\mathbf{x}, t) + \Omega_i(f(\mathbf{x}, t)), \tag{1} \]

where $f_i$ is the distribution function for particles with velocity $\mathbf{e}_i$ at position $\mathbf{x}$ and time $t$, $\Delta t$ is the time increment, and $\Omega_i$ is the collision operator. The equation states that the particle distribution function streaming into the neighboring node at the next time step equals the current particle distribution plus the effect of the collision operator. The streaming of a particle distribution function occurs over the time step $\Delta t$ across the distance $\mathbf{e}_i \Delta t$ between lattice sites. The collision operator models the rate of change of the distribution function due to molecular collisions. A collision model was proposed by Bhatnagar, Gross, and Krook (BGK) to simplify the analysis of the lattice Boltzmann equation. Using the LB-BGK approximation, equation (1) can be written as

\[ f_i(\mathbf{x} + \mathbf{e}_i \Delta t, \, t + \Delta t) = f_i(\mathbf{x}, t) - \frac{1}{\tau} \left[ f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t) \right]. \tag{2} \]

The above equation is the well-known LBGK model, which is consistent with the Navier–Stokes equations for fluid flow in the limit of small Mach number and incompressible flow. In equation (2), $f_i^{\mathrm{eq}}$ is the local equilibrium distribution and $\tau$ is the single relaxation parameter associated with the relaxation towards local equilibrium.

For an application, a lattice Boltzmann model must be chosen. Most research articles use the D2Q9 model, which is also used in this work. The name implies that the model is two-dimensional and that at each lattice point there are nine discrete velocities ($N = 9$) at which particles can travel. The equilibrium particle distribution function for the D2Q9 model is given by

\[ f_i^{\mathrm{eq}} = w_i \rho \left[ 1 + \frac{3 (\mathbf{e}_i \cdot \mathbf{u})}{c^2} + \frac{9 (\mathbf{e}_i \cdot \mathbf{u})^2}{2 c^4} - \frac{3 \mathbf{u}^2}{2 c^2} \right], \tag{3} \]

where $w_i$ are the lattice weights ($w_0 = 4/9$, $w_{1\dots4} = 1/9$, $w_{5\dots8} = 1/36$) and $c = \Delta x / \Delta t$ is the lattice speed. The macroscopic quantities $\rho$ and $\mathbf{u}$ can be evaluated as

\[ \rho = \sum_{i} f_i, \qquad \rho \mathbf{u} = \sum_{i} f_i \mathbf{e}_i, \tag{4} \]

and the macroscopic kinematic viscosity is given by $\nu = (2\tau - 1)\,\Delta x^2 / (6\,\Delta t)$. Equation (2) is usually solved in the following two steps, collision and streaming:

\[ \tilde{f}_i(\mathbf{x}, t) = f_i(\mathbf{x}, t) - \frac{1}{\tau} \left[ f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t) \right], \tag{5} \]
\[ f_i(\mathbf{x} + \mathbf{e}_i \Delta t, \, t + \Delta t) = \tilde{f}_i(\mathbf{x}, t), \tag{6} \]

where $\tilde{f}_i$ denotes the distribution function after the collision and $f_i(\mathbf{x} + \mathbf{e}_i \Delta t, t + \Delta t)$ is its value at the end of the streaming and collision operations. The third step in implementing LBM is applying the boundary conditions. In the present work, the bounce-back boundary condition was applied at the walls because it is easy to implement and gives reasonable results in a simple, bounded domain. The equilibrium scheme was used for the moving lid.

B. Implementation for multiple heterogeneous devices

In this section, the implementations of the lattice Boltzmann method for multiple heterogeneous devices are presented. The main difference between these implementations is the data transfer to and from the heterogeneous OpenCL devices; both implementations use the same OpenCL kernels. The D2Q9 model is used for the data representation, so the particle distribution functions are stored in nine arrays. Since OpenCL does not support two-dimensional arrays, the data are mapped from two-dimensional arrays to one-dimensional arrays (see the layout sketch below). The two-lattice algorithm is used in both implementations of the lattice Boltzmann method. Since this algorithm handles the data dependency by storing distribution values in duplicate lattices during the streaming phase, a phantom set of arrays is created for the particle distribution functions. The created arrays are divided along the X direction into subdomains, one for each (multi-core/many-core) device; the size of each subdomain depends on the characteristics of that device. Since edge information must be exchanged between solver iterations after the streaming phase, an additional ghost layer is created for each subdomain. This layer is used to exchange particle-distribution-function data between devices and contains only the boundary information that needs to be exchanged, which minimizes the amount of data copied from a device to the host and from the host to the next device. Arrays containing input parameters (such as the domain size along the x-axis, the domain size along the y-axis, the number of devices, u0, alpha, ...) are used by all devices; these data are not divided into subdomains, as they must be sent to every device.
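As referenced above, a brief sketch of the data layout and of the decomposition along the X direction follows. This is a hypothetical illustration, not the paper's code: it assumes the lattice is stored column by column (so that each fixed-x column is contiguous in memory, which makes exchanging boundary columns a single contiguous copy), and the names nx, ny, and weights are illustrative.

    /**
     * Illustrative data layout: each of the nine D2Q9 distribution functions
     * is stored in a flat 1D array, because OpenCL kernels cannot receive
     * two-dimensional Java arrays.
     */
    public class DomainLayout {

        final int nx; // lattice size along x
        final int ny; // lattice size along y

        DomainLayout(int nx, int ny) {
            this.nx = nx;
            this.ny = ny;
        }

        // Map 2D lattice coordinates (x, y) to an index in the flat 1D array.
        // Column-major storage keeps each fixed-x column contiguous in memory.
        int index(int x, int y) {
            return x * ny + y;
        }

        // Split the domain along the X direction into one subdomain per device.
        // 'weights' expresses the relative performance of each device, so faster
        // devices receive proportionally wider subdomains; returns the number of
        // lattice columns assigned to each device.
        int[] splitAlongX(double[] weights) {
            double total = 0;
            for (double w : weights) {
                total += w;
            }
            int[] widths = new int[weights.length];
            int assigned = 0;
            for (int d = 0; d < weights.length; d++) {
                widths[d] = (int) (nx * weights[d] / total);
                assigned += widths[d];
            }
            widths[weights.length - 1] += nx - assigned; // remainder to the last device
            return widths;
        }
    }

Each subdomain would additionally carry one ghost column on every interior edge, so that the streaming step can read neighbor values that actually reside on the adjacent device.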
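Finally, the ghost-layer exchange between solver iterations could be sketched as follows: after the streaming kernels finish, the boundary column of one subdomain is read back to the host and written into the ghost column of the neighboring subdomain. The method name, the queue and buffer arrays, and the offset parameters are assumptions made for illustration; only JOCL calls that mirror the standard OpenCL API are used.

    import org.jocl.*;
    import static org.jocl.CL.*;

    /** Hypothetical ghost-layer exchange between two neighboring subdomains. */
    public class GhostExchange {

        // Copy the rightmost real column of device d into the left ghost column
        // of device d + 1 for one distribution function. 'columnBytes' is
        // ny * Sizeof.cl_float; the offsets locate the columns inside each buffer.
        static void exchangeRightToLeft(cl_command_queue[] queues, cl_mem[] buffers,
                                        float[] hostColumn, long rightColumnOffset,
                                        long leftGhostOffset, long columnBytes, int d) {
            // Device d -> host: blocking read of the boundary column
            clEnqueueReadBuffer(queues[d], buffers[d], CL_TRUE,
                    rightColumnOffset, columnBytes, Pointer.to(hostColumn),
                    0, null, null);

            // Host -> device d + 1: blocking write into the ghost column
            clEnqueueWriteBuffer(queues[d + 1], buffers[d + 1], CL_TRUE,
                    leftGhostOffset, columnBytes, Pointer.to(hostColumn),
                    0, null, null);
        }
    }

A symmetric left-to-right exchange completes the halo update. Because only one column per distribution function crosses the host per iteration, the host-device traffic remains small, which is precisely the purpose of the ghost layer described above.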