Parallel Processing
[Figure 31.1.1: Serial FLUENT Architecture. Cortex connects to a single solver process, which performs file input/output to disk and holds the cell, face, and node data.]
[Figure 31.1.2: Parallel FLUENT Architecture. Cortex connects to the host process via a socket; the host communicates with compute-node-0, and each compute node runs FLUENT on its own cell, face, and node data, exchanging messages with the other compute nodes through MPI.]
31.1 Introduction to Parallel Processing
The host distributes the commands it receives from Cortex (the FLUENT process that manages the user interface and graphics) over a socket connection to a single designated compute node called compute-node-0. This specialized compute node distributes the host commands to the other compute nodes. Each compute node simultaneously executes the same program on its own data set. Communication from the compute nodes to the host is possible only through compute-node-0, and only when all compute nodes have synchronized with each other.
Each compute node is virtually connected to every other compute node, and relies on
inter-process communication to perform such functions as sending and receiving arrays,
synchronizing, and performing global operations (such as summations over all cells).
Inter-process communication is managed by a message-passing library. For example,
the message-passing library could be a vendor implementation of the Message Passing
Interface (MPI) standard, as depicted in Figure 31.1.2.
Each of the parallel FLUENT processes (as well as the serial process) is identified by a unique integer ID. The host collects messages from compute-node-0 and performs operations (such as printing, displaying messages, and writing to a file) on all of the data, in the same way as the serial solver.
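FLUENT's message-passing layer is internal to the solver, but the flavor of these operations can be illustrated with a minimal stand-alone MPI program (a sketch only, using standard MPI calls rather than anything FLUENT-specific): each process learns its unique integer ID, computes a partial value over its own data, and takes part in a global reduction.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* unique integer ID of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of processes */

    /* Each process holds a partial sum over its own data set
       (faked here as rank + 1). */
    double local_sum = (double)(rank + 1);
    double global_sum = 0.0;

    /* Global reduction: every process receives the sum over all processes,
       analogous to a summation over all cells of a partitioned grid. */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes = %g\n", nprocs, global_sum);

    MPI_Finalize();
    return 0;
}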
1. Start up the parallel solver. See Section 31.2: Starting Parallel FLUENT on a
Windows System and Section 31.3: Starting Parallel FLUENT on a Linux/UNIX
System for details.
2. Read your case file and have FLUENT partition the grid automatically upon loading
it. It is best to partition after the problem is set up, since partitioning has some
model dependencies (e.g., adaption on non-conformal interfaces, sliding-mesh and
shell-conduction encapsulation).
Note that there are other approaches for partitioning, including manual partitioning
in either the serial or the parallel solver. See Section 31.5: Partitioning the Grid
for details.
4. Calculate a solution. See Section 31.6: Checking and Improving Parallel Performance
for information on checking and improving the parallel performance.
i See the separate installation instructions for more information about installing parallel FLUENT for Windows. The startup instructions below assume that you have properly set up the necessary software, based on the appropriate installation instructions.
Additional information about installation issues can also be found in the Frequently
Asked Questions section of the Fluent Inc. User Services Center (www.fluentusers.com).
• version must be replaced by the version of FLUENT you want to run (2d, 3d, 2ddp,
or 3ddp).
• -path\\computer name\share name specifies the computer name and the shared
network name for the Fluent.Inc directory in UNC form.
For example, if FLUENT has been installed on computer1 and shared as fluent.inc,
then you should replace share name by the UNC name for the shared directory,
\\computer1\fluent.inc.
31.2 Starting Parallel FLUENT on a Windows System
• -mpi=mpi type (optional) specifies the type of MPI. If the option is not specified,
the default MPI for the given interconnect will be used (the use of the default MPI
is recommended). The available MPIs for Windows are shown in Table 31.2.2.
• -cnf=hosts file specifies the hosts file, which contains a list of the computers on
which you want to run the parallel job. If the hosts file is not located in the
directory where you are typing the startup command, you will need to supply the
full pathname to the file.
You can use a plain text editor such as Notepad to create the hosts file. The only
restriction on the filename is that there should be no spaces in it. For example,
hosts.txt is an acceptable hosts file name, but my hosts.txt is not.
Your hosts file (e.g., hosts.txt) might contain the following entries:
computer1
computer2
If a computer in the network is a multiprocessor, you can list it more than once.
For example, if computer1 has 2 CPUs, then, to take advantage of both CPUs, the
hosts.txt file should list computer1 twice:
computer1
computer1
computer2
• -tnprocs specifies the number of processes to use. When the -cnf option is present,
the hosts file argument is used to determine which computers to use for the parallel
job. For example, if there are 8 computers listed in the hosts file and you want to
run a job with 4 processes, set nprocs to 4 (i.e., -t4) and FLUENT will use the first
4 machines listed in the hosts file.
For example, the full command line to start a 3d parallel job on the first 4 computers
listed in a hosts file called hosts.txt is as follows:
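fluent 3d -t4 -cnf=hosts.txt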
The default interconnect (ethernet) and the default communication library (mpich2)
will be used since these options are not specified.
i The first time that you try to run FLUENT in parallel, a separate Command Prompt will open, prompting you to verify the current Windows account that you are logged into. Press the <Enter> key if the account is correct. If you have a new account password, enter your password and press the <Enter> key, then verify your password and press the <Enter> key. Once the username and password have been verified and encrypted into the Windows Registry, FLUENT parallel will launch.
The supported interconnects for dedicated parallel ntx86 and win64 Windows machines, their associated MPIs, and the corresponding syntax are listed in Tables 31.2.1-31.2.3.
(1) Used with a Shared Memory Machine (SMM), where the memory is shared between the processors on a single machine.
(2) Used with a Distributed Memory Machine (DMM), where each processor has its own memory associated with it.
2. Under Options, select the interconnect or system in the Interconnect drop-down list.
The Default setting is recommended, because it selects the interconnect that should
provide the best overall parallel performance for your dedicated parallel machine.
For a symmetric multi-processor (SMP) system, the Default setting uses shared
memory for communication.
If you prefer to select a specific interconnect, you can choose Ethernet/Shared Memory MPI, Myrinet, Infiniband, or Ethernet via sockets. For more information about these interconnects, see Table 31.2.1, Table 31.2.2, and Table 31.2.3.
4. (optional) Specify the name of a file containing a list of machines, one per line, in
the Hosts File field.
5. Click the Run button to start the parallel version. No additional setup is required
once the solver starts.
i The first time that you try to run FLUENT in parallel, a separate Command Prompt will open, prompting you to verify the current Windows account that you are logged into. Press the <Enter> key if the account is correct. If you have a new account password, enter your password and press the <Enter> key, then verify your password and press the <Enter> key. Once the username and password have been verified and encrypted into the Windows Registry, FLUENT parallel will launch.
FLUENT_INC\fluent6.x\launcher\launcher.exe
where FLUENT_INC is the root path where FLUENT is installed (i.e., usually the value of the FLUENT_INC environment variable) and x indicates the release version of FLUENT.
• Set options for your FLUENT executable, such as indicating a specific release or a
version number.
• Set parallel options, such as indicating the number of parallel processes (or if you
want to run a serial process), and an MPI type to use for parallel computations.
• Set additional options such as specifying the name and location of the current
working folder or a journal file.
When you are ready to launch your serial or parallel application, you can check the validity of the settings using the Check button (messages are displayed in the Log Information window). When you are satisfied with the settings, click the Launch button to start the parallel processes.
To return to your default settings for the Fluent Launcher, based on your current FLUENT
installation, click the Default button. The fields in the Fluent Launcher panel will return
to their original settings.
When you are finished using the Fluent Launcher, click the Close button. Any settings
that you have made in the panel are preserved when you re-open the Fluent Launcher.
Depending on what FLUENT releases are available in the Fluent.Inc Path, you can specify the number associated with a given release in the Release list. The list is populated with the FLUENT release numbers that are available in the Fluent.Inc Path field.
You can specify the dimensionality and the precision of the FLUENT product using the
Version list. There are four possible choices: 2d, 2ddp, 3d, or 3ddp. The 2d and 3d
options provide single-precision results for two-dimensional or three-dimensional prob-
lems, respectively. The 2ddp and 3ddp options provide double-precision results for two-
dimensional or three-dimensional problems, respectively.
You can specify the number of FLUENT processes in the Number of Processes field. You
can use the drop-down list to select from pre-set values of serial, 1, 2, 4, 8, 16,
32, or 64, or you can manually enter the number into the field yourself (e.g., 3, 10, etc.).
The number of parallel processes can range from 1 to 1024. If Number of Processes is equal to 1, you might want to consider running the FLUENT job using the serial setting.
You can specify the computer architecture using the Architecture drop-down list. Depending on the selected release, the available options are ntx86 and win64.
You can specify the MPI to use for the parallel computations using the MPI Types field. The list of MPI types varies depending on the selected release and the selected architecture. There are several options, based on the operating system of the parallel cluster. For more information about the available MPI types, see Tables 31.2.1-31.2.2.
Specify the hosts file using the Machine List or File field. You can use the ... button to
browse for a hosts file, or you can enter the machine names directly into the text field.
Machine names can be separated either by a comma or a space.
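For example (with hypothetical machine names), typing computer1,computer2 or computer1 computer2 in the field specifies both machines.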
You can specify the path of your current working directory using the Working Folder field
or click ... to browse through your directory structure. Note that a UNC path cannot be
set as a working folder.
You can specify the path and name of a journal file using the Journal File field or click ... to browse through your directory structure to locate the file. Using the journal file, you can automatically load the case, compile any user-defined functions, iterate until the solution converges, and write results to an output file.
Specifying Whether or Not to Use the Microsoft Job Scheduler (win64 MS MPI Only)
For the Windows 64-bit MS MPI only, you can specify that you want to use the Microsoft Job Scheduler (see Section 31.2.4: Starting Parallel FLUENT with the Microsoft Job Scheduler (win64 Only)) by selecting the Use Microsoft Scheduler check box. Once selected, you can then enter a machine name in the with Head Node text field. If you are running FLUENT on the head node, then you can keep the field empty. This option translates into the proper parallel command line syntax for using the Microsoft Job Scheduler.
If you are creating benchmark cases using parallel FLUENT, you can enable the Benchmark
check box. This option involves having several benchmarking-related files available on
your machine. If you are missing any of the files, the Fluent Launcher informs you of
which files you need and how to locate them.
31.2.4 Starting Parallel FLUENT with the Microsoft Job Scheduler (win64 Only)
The Microsoft Job Scheduler allows you to manage multiple jobs and tasks, allocate
computer resources, send tasks to compute nodes, and monitor jobs, tasks, and compute
nodes.
FLUENT currently supports Windows XP as well as the Windows Server operating system (win64 only). The Windows Server operating system includes a "compute cluster package" (CCP) that combines the Microsoft MPI type (msmpi) and the Microsoft Job Scheduler. FLUENT provides a means of using the Microsoft Job Scheduler using the following flag in the parallel command:
-ccp head-node-name
where -ccp indicates the use of the compute cluster package, and head-node-name indicates the name of the head node of the computer cluster.
For example, if you want to use the Job Scheduler, the corresponding command syntax
would be:
fluent 3d -t2 -ccp head-node-name
Likewise, if you do not want to use the Job Scheduler, the following command syntax
can be used with msmpi:
fluent 3d -t2 -pmsmpi -cnf=host
i The first time that you try to run FLUENT in parallel, a separate Command Prompt will open, prompting you to verify the current Windows account that you are logged into. If you have a new account password, enter your password and press the <Enter> key. If you want FLUENT to remember your password on this machine, press the Y key and press the <Enter> key. Once the username and password have been verified and encrypted into the Windows Registry, FLUENT parallel will launch.
i If you do not want to use the Microsoft Job Scheduler, but you still want to use msmpi, you will need to stop the Microsoft Compute Cluster MPI Service through the Control Panel and start your own instance of SMPD (the process manager for msmpi on Windows) using the following command on each host on which you want to run FLUENT:
start smpd -d 0
31.3 Starting Parallel FLUENT on a Linux/UNIX System
• Section 31.3.3: Setting Up Your Remote Shell and Secure Shell Clients
• version must be replaced by the version of FLUENT you want to run (2d, 3d, 2ddp,
or 3ddp).
• -mpi=mpi type (optional) specifies the type of MPI. If the option is not specified,
the default MPI for the given interconnect will be used (the use of the default MPI
is recommended). The available MPIs for Linux/UNIX are shown in Table 31.3.2.
• -cnf=hosts file specifies the hosts file, which contains a list of the computers on
which you want to run the parallel job. If the hosts file is not located in the
directory where you are typing the startup command, you will need to supply the
full pathname to the file.
You can use a plain text editor to create the hosts file. The only restriction on
the filename is that there should be no spaces in it. For example, hosts.txt is an
acceptable hosts file name, but my hosts.txt is not.
Your hosts file (e.g., hosts.txt) might contain the following entries:
computer1
computer2
If a computer in the network is a multiprocessor, you can list it more than once.
For example, if computer1 has 2 CPUs, then, to take advantage of both CPUs, the
hosts.txt file should list computer1 twice:
computer1
computer1
computer2
• -tnprocs specifies the number of processes to use. When the -cnf option is present,
the hosts file argument is used to determine which computers to use for the parallel
job. For example, if there are 10 computers listed in the hosts file and you want
to run a job with 5 processes, set nprocs to 5 (i.e., -t5) and FLUENT will use the
first 5 machines listed in the hosts file.
For example, to use the Myrinet interconnect, and to start the 3D solver with 4 compute
nodes on the machines defined in the text file called fluent.hosts, you can enter the
following in the command prompt:
fluent 3d -t4 -pmyrinet -cnf=fluent.hosts
Note that if the optional -cnf=hosts file argument is specified, a compute node will be spawned on each machine listed in the hosts file. (If you enter this optional argument, do not include the square brackets.)
The supported interconnects for parallel Linux/UNIX machines, along with their associated communication libraries, the corresponding syntax, and the supported architectures, are listed in Table 31.3.1, Table 31.3.2, and Table 31.3.3.
2. Under Options, select the interconnect or system in the Interconnect drop-down list.
The Default setting is recommended, because it selects the interconnect that should
provide the best overall parallel performance for your dedicated parallel machine.
For a symmetric multi-processor (SMP) system, the Default setting uses shared
memory for communication.
If you prefer to select a specific interconnect, you can choose Ethernet/Shared Memory MPI, Myrinet, Infiniband, Altix, Cray, or Ethernet via sockets. For more information about these interconnects, see Table 31.3.1, Table 31.3.2, and Table 31.3.3.
4. (optional) Specify the name of a file containing a list of machines, one per line, in
the Hosts File field.
5. Click the Run button to start the parallel version. No additional setup is required
once the solver starts.
1. Generate a public-private key pair using ssh-keygen (or using a graphical user interface client). For example:
% ssh-keygen -t dsa
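A typical way to complete the setup (an illustrative sketch assuming OpenSSH defaults; remotehost is a placeholder for each machine you want to log into without a password) is to copy the public key to the remote machine and append it to that machine's authorized_keys file:
% scp ~/.ssh/id_dsa.pub remotehost:~/
% ssh remotehost "cat ~/id_dsa.pub >> ~/.ssh/authorized_keys"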
The client machine is now added to the access list and the user is no longer required to type in a password each time. For additional information, consult your system administrator or refer to your system documentation.
31.4 Checking Network Connectivity
Indicate the compute node ID for which connectivity information is desired in the Compute Node field, and then click the Print button. Sample output for compute node 0 is shown below:
------------------------------------------------------------------------------
ID Comm. Hostname O.S. PID Mach ID HW ID Name
------------------------------------------------------------------------------
host net balin Linux-32 17272 0 7 Fluent Host
n3 hp balin Linux-32 17307 1 10 Fluent Node
n2 hp filio Linux-32 17306 0 -1 Fluent Node
n1 hp bofur Linux-32 17305 0 1 Fluent Node
n0* hp balin Linux-32 17273 2 11 Fluent Node
O.S. is the architecture, Comm. is the communication library (i.e., MPI type), PID is the process ID number, Mach ID is the compute node ID, and HW ID is an identifier specific to the interconnect used.
i If your case file contains a mesh generated by the GAMBIT Hex Core meshing scheme or the TGrid Mesh/Hexcore menu option (hexcore mesh), you must filter the mesh using the tpoly utility or TGrid prior to partitioning the grid. See Section 31.5.2: Preparing Hexcore Meshes for Partitioning for more information.
Note that the relative distribution of cells among compute nodes will be maintained during grid adaption (except if non-conformal interfaces are present), so repartitioning after adaption is not required. See Section 31.5.7: Load Distribution for more information.
31.5 Partitioning the Grid
If you use the serial solver to set up the problem before partitioning, the machine on
which you perform this task must have enough memory to read in the grid. If your
grid is too large to be read into the serial solver, you can read the unpartitioned grid
directly into the parallel solver (using the memory available in all the defined hosts)
and have it automatically partitioned. In this case you will set up the problem after an
initial partition has been made. You will then be able to manually repartition the case
if necessary. See Sections 31.5.3 and 31.5.4 for additional details and limitations, and
Section 31.5.6: Checking the Partitions for details about checking the partitions.
[Figure 31.5.1: Partitioning the grid: the domain before partitioning, and the partition interface boundary created by partitioning]
i The output case file resulting from a tpoly conversion only contains mesh
information. None of the solver-related data of the input file is retained.
To convert a file using the tpoly filter, before starting FLUENT, type the following:
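utility tpoly input_filename output_filename
where input_filename is the case file containing the hexcore mesh, and output_filename is the filename for the converted mesh. (This follows the same utility command pattern as the partition filter described in Section 31.5: Partitioning the Grid; the argument names here are placeholders.)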
You can also use TGrid to convert the transitional cells to polyhedra. You must either
read in or create the hexcore mesh in TGrid, and then save the mesh as a case file with
polyhedra. To do this, use the File/Write/Case... menu option, being sure to enable the
Write As Polyhedra option in the Select File dialog box.
Limitations
Converted hexcore meshes have the following limitations:
• The following grid manipulation tools are not available on polyhedral meshes:
– extrude-face-zone under the modify-zone option
– fuse
– skewness smoothing
– swapping (will not affect polyhedral cells)
• The polyhedral cells that result from the conversion are not eligible for adaption.
For more information about adaption, see Chapter 26: Adapting the Grid.
1. (optional) Set the partitioning parameters in the Auto Partition Grid panel (Figure 31.5.2).
Parallel −→Auto Partition...
If you are reading in a mesh file or a case file for which no partition information is
available, and you keep the Case File option turned on, FLUENT will partition the
grid using the method displayed in the Method drop-down list.
If you want to specify the partitioning method and associated options yourself, the
procedure is as follows:
(a) Turn off the Case File option. The other options in the panel will become
available.
(b) Select the bisection method in the Method drop-down list. The choices are
the techniques described in Section 31.5.5: Bisection Methods.
(c) You can choose to independently apply partitioning to each cell zone, or you can allow partitions to cross zone boundaries using the Across Zones check button. It is recommended that you not partition cell zones independently
(by turning off the Across Zones check button) unless cells in different zones
will require significantly different amounts of computation during the solution
phase (e.g., if the domain contains both solid and fluid zones).
(d) If you have chosen the Principal Axes or Cartesian Axes method, you can improve
the partitioning by enabling the automatic testing of the different bisection
directions before the actual partitioning occurs. To use pretesting, turn on
the Pre-Test option. Pretesting is described in Section 31.5.5: Pretesting.
(e) Click OK.
If you have a case file where you have already partitioned the grid, and the number
of partitions divides evenly into the number of compute nodes, you can keep the
default selection of Case File in the Auto Partition Grid panel. This instructs FLUENT
to use the partitions in the case file.
1. Partition the grid using the default bisection method (Principal Axes) and optimiza-
tion (Smooth).
2. Examine the partition statistics, which are described in Section 31.5.6: Interpreting Partition Statistics. Your aim is to achieve small values of Interface ratio variation and Global interface ratio while maintaining a balanced load (Cell variation). If the statistics are not acceptable, try one of the other bisection methods.
3. Once you determine the best bisection method for your problem, you can turn on
Pre-Test (see Section 31.5.5: Pretesting) to improve it further, if desired.
4. You can also improve the partitioning using the Merge optimization, if desired.
1. Select the bisection method in the Method drop-down list. The choices are the
techniques described in Section 31.5.5: Bisection Methods.
2. Set the desired number of grid partitions in the Number integer number field. You
can use the counter arrows to increase or decrease the value, instead of typing in
the box. The number of grid partitions must be an integral multiple of the number
of processors available for parallel computing.
3. You can choose to independently apply partitioning to each cell zone, or you can allow partitions to cross zone boundaries using the Across Zones check button. It is recommended that you not partition cell zones independently (by turning off the Across Zones check button) unless cells in different zones will require significantly different amounts of computation during the solution phase (e.g., if the domain contains both solid and fluid zones).
4. You can select Encapsulate Grid Interfaces if you would like the cells surrounding
all non-conformal grid interfaces in your mesh to reside in a single partition at all
times during the calculation. If your case file contains non-conformal interfaces
on which you plan to perform adaption during the calculation, you will have to
partition it in the serial solver, with the Encapsulate Grid Interfaces and Encapsulate
for Adaption options turned on.
5. If you have enabled the Encapsulate Grid Interfaces option in the serial solver, the Encapsulate for Adaption option will also be available. When you select this option, additional layers of cells are encapsulated such that transfer of cells will be unnecessary during parallel adaption.
6. You can activate and control the desired optimization methods (described in Section 31.5.5: Optimizations) using the items under Optimizations. You can activate the Merge and Smooth schemes by turning on the Do check button next to each one. For each scheme, you can also set the number of Iterations. Each optimization scheme will be applied until appropriate criteria are met, or the maximum number of iterations has been executed. If the Iterations counter is set to 0, the optimization scheme will be applied until completion, without limit on the maximum number of iterations.
7. If you have chosen the Principal Axes or Cartesian Axes method, you can improve the
partitioning by enabling the automatic testing of the different bisection directions
before the actual partitioning occurs. To use pretesting, turn on the Pre-Test option.
Pretesting is described in Section 31.5.5: Pretesting.
8. In the Zones and/or Registers lists, select the zone(s) and/or register(s) that you want to partition. For most cases, you will select all Zones (the default) to partition the entire domain. See below for details.
9. You can assign selected Zones and/or Registers to a specific partition ID by entering
a value for the Set Selected Zones and Registers to Partition ID. For example, if the
Number of partitions for your grid is 2, then you can only use IDs of 0 or 1. If
you have three partitions, then you can enter IDs of 0, 1, or 2. This can be useful
in situations where the gradient at a region is known to be high. In such cases,
you can mark the region or zone and set the marked cells to one of the partition
IDs, thus preventing the partition from going through that region. This in turn
will facilitate convergence. This is also useful in cases where mesh manipulation
tools are not available in parallel. In this case, you can assign the related cells to
a particular ID so that the grid manipulation tools are now functional.
If you are running the parallel solver, and you have marked your region and assigned
an ID to the selected Zones and/or Registers, click the Use Stored Partitions button
to make the new partitions valid.
Refer to the example described later in this section for a demonstration of how
selected registers are assigned to a partition.
11. If you decide that the new partitions are better than the previous ones (if the grid
was already partitioned), click the Use Stored Partitions button to make the newly
stored cell partitions the active cell partitions. The active cell partition is used for
the current calculation, while the stored cell partition (the last partition performed)
is used when you save a case file.
12. When using the dynamic mesh model in your parallel simulations, the Partition panel includes an Auto Repartition option and a Repartition Interval setting. These parallel partitioning options are provided because FLUENT migrates cells when local remeshing and smoothing is performed, so the partition interface can become very wrinkled and the load balance may deteriorate. By default, the Auto Repartition option is selected, where a percentage of interface faces and loads are automatically traced. When this option is selected, FLUENT automatically determines the most appropriate repartition interval based on various simulation parameters. If the Auto Repartition option yields insufficient results, the Repartition Interval setting can be used instead. The Repartition Interval setting lets you specify the interval (in time steps or iterations, respectively) at which a repartition is enforced. When repartitioning is not desired, you can set the Repartition Interval to zero.
i Note that when dynamic meshes and local remeshing are utilized, updated meshes may be slightly different in parallel FLUENT (when compared to serial FLUENT or to a parallel solution created with a different number of compute nodes), resulting in very small differences in the solutions.
3. Display the grid with the Partitions option enabled in the Display Grid panel (Figure 31.5.5).
4. Adapt your region and mark your cells (see Section 26.7.3: Performing Region
Adaption). This creates a register.
6. Keep the Set Selected Zones and Registers to Partition ID set to 0 and click the corresponding button. This prints a report to the FLUENT console window.
7. Click the Use Stored Partitions button to make the new partitions valid. This
migrates the partitions to the compute-nodes. The following output is then printed
to the FLUENT console window:
----------------------------------------------------------------------
Collective Partition Statistics: Minimum Maximum Total
----------------------------------------------------------------------
Cell count 246 672 918
Mean cell count deviation -46.4% 46.4%
Partition boundary cell count 24 24 48
Partition boundary cell count ratio 3.6% 9.8% 5.2%
9. This time, set the Set Selected Zones and Registers to Partition ID to 1 and click the
corresponding button. This prints a report to the FLUENT console.
10. Click the Use Stored Partitions button to make the new partitions valid and to
migrate the partitions to the compute-nodes.
11. Display the grid (Figure 31.5.7). Notice now that the partition appears in a different
location as specified by your partition ID.
The ability to restrict partitioning to cell zones or registers gives you the flexibility to
apply different partitioning strategies to subregions of a domain. For example, if your
geometry consists of a cylindrical plenum connected to a rectangular duct, you may
want to partition the plenum using the Cylindrical Axes method, and the duct using the
Cartesian Axes method.
If the plenum and the duct are contained in two different cell zones, you can select one
at a time and perform the desired partitioning, as described in Section 31.5.4: Using
the Partition Grid Panel. If they are not in two different cell zones, you can create a
cell register (basically a list of cells) for each region using the functions that are used
to mark cells for adaption. These functions allow you to mark cells based on physical
location, cell volume, gradient or isovalue of a particular variable, and other parameters.
See Chapter 26: Adapting the Grid for information about marking cells for adaption. Section 26.11.1: Manipulating Adaption Registers provides information about manipulating different registers to create new ones. Once you have created a register, you can partition within it as described above.
i Note that partitioning within zones or registers is not available when Metis
is selected as the partition Method.
For dynamic mesh applications (see item 11 above), FLUENT stores the partition method
used to partition the respective zone. Therefore, if repartitioning is done, FLUENT uses
the same method that was used to partition the mesh.
As the grid is partitioned, information about the partitioning process will be printed in the text (console) window. By default, the solver will print the number of partitions created, the number of bisections performed, the time required for the partitioning, and the minimum and maximum cell, face, interface, and face-ratio variations. (See Section 31.5.6: Interpreting Partition Statistics for details.) If you increase the Verbosity to 2 from the default value of 1, the partition method used, the partition ID, number of cells, faces, and interfaces, and the ratio of interfaces to faces for each partition will also be printed in the console window. If you decrease the Verbosity to 0, only the number of partitions created and the time required for the partitioning will be reported.
You can request a portion of this report to be printed again after the partitioning is completed. When you click the Print Active Partitions or Print Stored Partitions button in the parallel solver, FLUENT will print the partition ID, number of cells, faces, and interfaces, and the ratio of interfaces to faces for each active or stored partition in the console window. In addition, it will print the minimum and maximum cell, face, interface, and face-ratio variations. In the serial solver, you will obtain the same information about the stored partition when you click Print Partitions. See Section 31.5.6: Interpreting Partition Statistics for details.
i Recall that to make the stored cell partitions the active cell partitions you
must click the Use Stored Partitions button. The active cell partition is
used for the current calculation, while the stored cell partition (the last
partition performed) is used when you save a case file.
If you change your mind about your partition parameter settings, you can easily return
to the default settings assigned by FLUENT by clicking on the Default button. When you
click the Default button, it will become the Reset button. The Reset button allows you
to return to the most recently saved settings (i.e., the values that were set before you
clicked on Default). After execution, the Reset button will become the Default button
again.
Balancing the partitions (equalizing the number of cells) ensures that each processor
has an equal load and that the partitions will be ready to communicate at about the
same time. Since communication between partitions can be a relatively time-consuming
process, minimizing the number of interfaces can reduce the time associated with this
data interchange. Minimizing the number of partition neighbors reduces the chances
for network and routing contentions. In addition, minimizing partition neighbors is
important on machines where the cost of initiating message passing is expensive compared
to the cost of sending longer messages. This is especially true for workstations connected
in a network.
The partitioning schemes in FLUENT use bisection algorithms to create the partitions but, unlike schemes that require the number of partitions to be a power of two, these schemes have no limitations on the number of partitions. For each available processor you will create the same number of partitions (i.e., the total number of partitions will be an integral multiple of the number of processors).
Bisection Methods
The grid is partitioned using a bisection algorithm. The selected algorithm is applied to
the parent domain, and then recursively applied to the child subdomains. For example,
to divide the grid into four partitions, the solver will bisect the entire (parent) domain
into two child domains, and then repeat the bisection for each of the child domains,
yielding four partitions in total. To divide the grid into three partitions, the solver will
“bisect” the parent domain to create two partitions—one approximately twice as large
as the other—and then bisect the larger child domain again to create three partitions in
total.
The grid can be partitioned using one of the algorithms listed below. The most efficient
choice is problem-dependent, so you can try different methods until you find the one that
is best for your problem. See Section 31.5.4: Guidelines for Partitioning the Grid for
recommended partitioning strategies.
Cartesian Axes bisects the domain based on the Cartesian coordinates of the cells (see
Figure 31.5.8). It bisects the parent domain and all subsequent child subdomains
perpendicular to the coordinate direction with the longest extent of the active
domain. It is often referred to as coordinate bisection.
Cartesian Strip uses coordinate bisection but restricts all bisections to the Cartesian
direction of longest extent of the parent domain (see Figure 31.5.9). You can often
minimize the number of partition neighbors using this approach.
Cartesian X-, Y-, Z-Coordinate bisects the domain based on the selected Cartesian
coordinate. It bisects the parent domain and all subsequent child subdomains
perpendicular to the specified coordinate direction. (See Figure 31.5.9.)
Cartesian R Axes bisects the domain based on the shortest radial distance from the
cell centers to that Cartesian axis (x, y, or z) which produces the smallest interface
size. This method is available only in 3D.
Cartesian RX-, RY-, RZ-Coordinate bisects the domain based on the shortest radial distance from the cell centers to the selected Cartesian axis (x, y, or z). These methods are available only in 3D.
Cylindrical Axes bisects the domain based on the cylindrical coordinates of the cells.
This method is available only in 3D.
Cylindrical R-, Theta-, Z-Coordinate bisects the domain based on the selected cylin-
drical coordinate. These methods are available only in 3D.
Metis uses the METIS software package for partitioning irregular graphs, developed by
Karypis and Kumar at the University of Minnesota and the Army HPC Research
Center. It uses a multilevel approach in which the vertices and edges on the fine
graph are coalesced to form a coarse graph. The coarse graph is partitioned, and then uncoarsened back to the original graph. During coarsening and uncoarsening, algorithms are applied to permit high-quality partitions. Detailed information about METIS can be found in its manual [172].
i Note that when using the socket version (-pnet), the METIS partitioner
is not available. In this case, METIS partitioning can be obtained using
the partition filter, as described below.
Polar Axes bisects the domain based on the polar coordinates of the cells (see Figure 31.5.12). This method is available only in 2D.
Polar R-Coordinate, Polar Theta-Coordinate bisects the domain based on the selected polar coordinate (see Figure 31.5.12). These methods are available only in 2D.
Principal Axes bisects the domain based on a coordinate frame aligned with the principal axes of the domain (see Figure 31.5.10). This reduces to Cartesian bisection when the principal axes are aligned with the Cartesian axes. The algorithm is also referred to as moment, inertial, or moment-of-inertia partitioning.
This is the default bisection method in FLUENT.
Principal Strip uses moment bisection but restricts all bisections to the principal axis
of longest extent of the parent domain (see Figure 31.5.11). You can often minimize
the number of partition neighbors using this approach.
Principal X-, Y-, Z-Coordinate bisects the domain based on the selected principal
coordinate (see Figure 31.5.11).
Spherical Axes bisects the domain based on the spherical coordinates of the cells. This
method is available only in 3D.
Spherical Rho-, Theta-, Phi-Coordinate bisects the domain based on the selected
spherical coordinate. These methods are available only in 3D.
[Figures 31.5.8-31.5.11: Partitions created with the Cartesian Axes (31.5.8), Cartesian Strip or Cartesian Coordinate (31.5.9), Principal Axes (31.5.10), and Principal Strip or Principal Coordinate (31.5.11) methods]
[Figure 31.5.12: Partitions Created with the Polar Axes or Polar Theta-Coordinate Method]
Optimizations
Additional optimizations can be applied to improve the quality of the grid partitions.
The heuristic of bisecting perpendicular to the direction of longest domain extent is
not always the best choice for creating the smallest interface boundary. A “pre-testing”
operation (see Section 31.5.5: Pretesting) can be applied to automatically choose the best
direction before partitioning. In addition, the following iterative optimization schemes
exist:
Merge attempts to eliminate orphan clusters from each partition. An orphan cluster is
a group of cells with the common feature that each cell within the group has at least
one face which coincides with an interface boundary. (See Figure 31.5.14.) Orphan
clusters can degrade multigrid performance and lead to large communication costs.
In general, the Smooth and Merge schemes are relatively inexpensive optimization tools.
Pretesting
If you choose the Principal Axes or Cartesian Axes method, you can improve the bisection
by testing different directions before performing the actual bisection. If you choose not
to use pretesting (the default), FLUENT will perform the bisection perpendicular to the
direction of longest domain extent.
If pretesting is enabled, it will occur automatically when you click the Partition button
in the Partition Grid panel, or when you read in the grid if you are using automatic
partitioning. The bisection algorithm will test all coordinate directions and choose the
one which yields the fewest partition interfaces for the final bisection.
Note that using pretesting will increase the time required for partitioning. For 2D problems, partitioning will take 3 times as long as without pretesting, and for 3D problems it will take 4 times as long.
i Direct import to the parallel solver through the partition filter requires
that the host machine has enough memory to run the filter for the specified
grid. If not, you will need to run the filter on a machine that does have
enough memory. You can either start the parallel solver on the machine
with enough memory and repeat the process described above, or run the
filter manually on the new machine and then read the partitioned grid into
the parallel solver on the host machine.
To manually partition a grid using the partition filter, enter the following command:
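utility partition input_filename partition_count output_filename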
where input_filename is the filename for the grid to be partitioned, partition_count is the number of partitions desired, and output_filename is the filename for the partitioned grid. You can then read the partitioned grid into the solver (using the standard File/Read/Case... menu item) and proceed with the model definition and solution.
In the example output below, partitions 0 and 3 have the minimum number of interface faces (10), and partitions 1 and 2 have the maximum number of interface faces (19); hence the variation is 10-19.
Your aim is to achieve small values of Interface ratio variation and Global interface
ratio while maintaining a balanced load (Cell variation).
>> Partitions:
P Cells I-Cells Cell Ratio Faces I-Faces Face Ratio Neighbors
0 134 10 0.075 217 10 0.046 1
1 137 19 0.139 222 19 0.086 2
2 134 19 0.142 218 19 0.087 2
3 137 10 0.073 223 10 0.045 1
------
Partition count = 4
Cell variation = (134 - 137)
Mean cell variation = ( -1.1% - 1.1%)
Intercell variation = (10 - 19)
Intercell ratio variation = ( 7.3% - 14.2%)
Global intercell ratio = 10.7%
Face variation = (217 - 223)
Interface variation = (10 - 19)
Interface ratio variation = ( 4.5% - 8.7%)
Global interface ratio = 3.4%
Neighbor variation = (1 - 2)
Note that partition IDs correspond directly to compute node IDs when a case file is read
into the parallel solver. When the number of partitions in a case file is larger than the
number of compute nodes, but is evenly divisible by the number of compute nodes, then
the distribution is such that partitions with IDs 0 to (M − 1) are mapped onto compute
node 0, partitions with IDs M to (2M − 1) onto compute node 1, etc., where M is equal
to the ratio of the number of partitions to the number of compute nodes.
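For example, if a case file containing 8 partitions is read into a parallel session with 2 compute nodes, then M = 4, and partitions 0-3 are mapped onto compute node 0 and partitions 4-7 onto compute node 1.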
i If you have not already done so in the setup of your problem, you will need
to perform a solution initialization in order to use the Contours panel.
i If you adapt a grid that contains non-conformal interfaces, and you want
to rebalance the load on the compute nodes, you will have to save your case
and data files after adaption, read the case and data files into the serial
solver, repartition using the Encapsulate Grid Interfaces and Encapsulate for
Adaption options in the Partition Grid panel, and save case and data files
again. You will then be able to read the manually repartitioned case and
data files into the parallel solver, and continue the solution from where you
left it.
31.6 Checking and Improving Parallel Performance
The parallel statistics displayed in the console window report the following quantities:
• Average wall-clock time per iteration describes the average real (wall clock)
time per iteration.
• Global reductions per iteration describes the number of global reduction operations (such as variable summations over all processes). This requires communication among all processes.
A global reduction is a collective operation over all processes for the given job that reduces a vector quantity (the length given by the number of processes or nodes) to a scalar quantity (e.g., taking the sum or maximum of a particular quantity). The number of global reductions cannot be calculated from any other readily known quantities. The number is generally dependent on the algorithm being used and the problem being solved.
• Global reductions time per iteration describes the time per iteration for the
global reduction operations.
• Message count per iteration describes the number of messages sent between all processes per iteration. This is important with regard to communication latency, especially on high-latency interconnects.
A message is defined as a single point-to-point, send-and-receive operation between any two processes. (This excludes global, collective operations such as global reductions.) In terms of domain decomposition, a message is passed from the process responsible for one subdomain to the process responsible for a neighboring subdomain.
• Data transfer per iteration describes the amount of data communicated between processors per iteration. This is important with respect to interconnect bandwidth.
Data transfer per iteration is usually dependent on the algorithm being used and the problem being solved. This number generally increases with increases in problem size, number of partitions, and physics complexity.
The data transfer per iteration may provide some insight into the impact of communication bandwidth (speed) on parallel performance. The precise impact is often difficult to quantify because it is dependent on many things, including the ratio of data transfer to calculations and the ratio of communication bandwidth to CPU speed. The unit of data transfer is a byte.
• LE solves per iteration describes the number of linear systems being solved per iteration. This number is dependent on the physics (non-reacting versus reacting flow) and the algorithms (pressure-based versus density-based solver), but is independent of mesh size. For the pressure-based solver, this is usually the number of transport equations being solved (mass, momentum, energy, etc.).
• LE wall-clock time per iteration describes the time (wall clock) spent in the linear equation solvers (i.e., multigrid).
• LE global solves per iteration describes the number of solutions on the coarse level of the AMG solver where the entire linear system has been pushed to a single processor (n0). The system is pushed to a single processor to reduce the computation time during the solution on that level. Scaling generally is not adversely affected because the number of unknowns is small on the coarser levels.
• LE global wall-clock time per iteration describes the time (wall-clock) per
iteration for the linear equation global solutions (see above).
• AMG cycles per iteration describes the average number of multigrid cycles (V,
W, flexible, etc.) per iteration.
• Relaxation sweeps per iteration describes the number of relaxation sweeps (or iterative solutions) on all levels for all equations per iteration. A relaxation sweep is usually one iteration of Gauss-Seidel or ILU.
• Time-step wall-clock time per iteration describes the time per sub-iteration.
• Total CPU time describes the total CPU time used by all processes. This does
not include any wait time for load imbalances or for communications (other than
packing and unpacking local buffers).
The most relevant quantity is the Total wall clock time. This quantity can be used
to gauge the parallel performance (speedup and efficiency) by comparing this quantity to
that from the serial analysis (the command line should contain -t1 in order to obtain the
statistics from a serial analysis). In lieu of a serial analysis, an approximation of parallel
speedup may be found in the ratio of Total CPU time to Total wall clock time.
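For example (with hypothetical numbers), if a 4-process run reports a Total CPU time of 400 seconds and a Total wall clock time of 110 seconds, the estimated speedup is 400/110, or about 3.6, which corresponds to a parallel efficiency of roughly 3.6/4, or 91%.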
i Note that you will be unable to interrupt iterations until the end of each
report interval.
Load Balancing
A dynamic load balancing capability is available in FLUENT. The principal reason for
using parallel processing is to reduce the turnaround time of your simulation, ideally
by a factor proportional to the collective speed of the computing resources used. If, for
example, you were using four CPUs to solve your problem, then you would expect to
reduce the turnaround time by a factor of four. This is of course the ideal situation, and
assumes that there is very little communication needed among the CPUs, that the CPUs
are all of equal speed, and that the CPUs are dedicated to your job. In practice, this is
often not the case. For example, CPU speeds can vary if you are solving in parallel on
a cluster that includes nodes with different clock speeds, other jobs may be competing
for use of one or more of the CPUs, and network traffic either from within the parallel
solver or generated from external sources may delay some of the necessary communication
among the CPUs.
If you enable dynamic load balancing in FLUENT, the load across the computational and
networking resources will be monitored periodically. If the load balancer determines that
performance can be improved by redistributing the cells among the compute nodes, it
will automatically do so. There is a time penalty associated with load balancing itself,
and so it is disabled by default. If you will be using a dedicated homogeneous resource,
or if you are using a heterogeneous resource but have accounted for differences in CPU
speeds during partitioning by specifying a load distribution (see Section 31.5.7: Load
Distribution), then you may not need to use load balancing.
i Note that when the shell conduction model is used, you will not be able to
turn on load balancing.
To enable and control FLUENT’s automatic load balancing feature, use the Load Balance
panel (Figure 31.6.1). Load balancing will automatically detect and analyze parallel
performance, and redistribute cells between the existing compute nodes to optimize it.
Parallel −→Load Balance...
2. Select the bisection method to create new grid partitions in the Partition Method
drop-down list. The choices are the techniques described in Section 31.5.5: Bisection
Methods. As part of the automatic load balancing procedure, the grid will be
repartitioned into several small partitions using the specified method. The resulting
partitions will then be distributed among the compute nodes to achieve a more
balanced load.
3. Specify the desired Balance Interval. When a value of 0 is specified, FLUENT will
internally determine the best value to use, initially using an interval of 25 iterations.
You can override this behavior by specifying a non-zero value. FLUENT will then
attempt to perform load balancing after every N iterations, where N is the specified
Balance Interval. You should be careful to select an interval that is large enough to
outweigh the cost of performing the load balancing operations.
Note that you can interrupt the calculation at any time, turn the load balancing feature
off (or on), and then continue the calculation.
i If problems arise in your computations due to adaption, you can turn off
the automatic load balancing, which occurs any time that mesh adaption
is performed in parallel.
To instruct the solver to skip the load balancing step, issue the following Scheme command:
(disable-load-balance-after-adaption)
To turn automatic load balancing after adaption back on, issue:
(enable-load-balance-after-adaption)