<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
2
<HTML>
3
<HEAD>
4
<!-- This HTML file has been created by texi2html 1.52
5
     from fftw.texi on 18 May 1999 -->
6
 
7
<TITLE>FFTW - Parallel FFTW</TITLE>
8
</HEAD>
9
<BODY TEXT="#000000" BGCOLOR="#FFFFFF">
10
Go to the <A HREF="fftw_1.html">first</A>, <A HREF="fftw_3.html">previous</A>, <A HREF="fftw_5.html">next</A>, <A HREF="fftw_10.html">last</A> section, <A HREF="fftw_toc.html">table of contents</A>.
11
<P><HR><P>
12
 
13
 
14
<H1><A NAME="SEC47">Parallel FFTW</A></H1>
15
 
16
<P>
17
<A NAME="IDX201"></A>
18
In this chapter we discuss the use of FFTW in a parallel environment,
19
documenting the different parallel libraries that we have provided.
20
(Users calling FFTW from a multi-threaded program should also consult
21
Section <A HREF="fftw_3.html#SEC46">Thread safety</A>.)  The FFTW package currently contains three parallel
22
transform implementations that leverage the uniprocessor FFTW code:
23
 
24
 
25
 
26
<UL>
27
 
28
<LI>
29
 
30
<A NAME="IDX202"></A>
31
The first set of routines utilizes shared-memory threads for parallel
32
one- and multi-dimensional transforms of both real and complex data.
33
Any program using FFTW can be trivially modified to use the
34
multi-threaded routines.  This code can use any common threads
35
implementation, including POSIX threads.  (POSIX threads are available
36
on most Unix variants, including Linux.)  These routines are located in
37
the <CODE>threads</CODE> directory, and are documented in Section <A HREF="fftw_4.html#SEC48">Multi-threaded FFTW</A>.
38
 
39
<LI>
40
 
41
<A NAME="IDX203"></A>
42
<A NAME="IDX204"></A>
43
The <CODE>mpi</CODE> directory contains multi-dimensional transforms
44
of real and complex data for parallel machines supporting MPI.  It also
45
includes parallel one-dimensional transforms for complex data.  The main
46
feature of this code is that it supports distributed-memory transforms,
47
so it runs on everything from workstation clusters to massively-parallel
48
supercomputers.  More information on MPI can be found at the
49
<A HREF="http://www.mcs.anl.gov/mpi">MPI home page</A>.  The FFTW MPI routines
50
are documented in Section <A HREF="fftw_4.html#SEC55">MPI FFTW</A>.
51
 
52
<LI>
53
 
54
<A NAME="IDX205"></A>
55
We also have an experimental parallel implementation written in Cilk, a
56
C-like parallel language developed at MIT and currently available for
57
several SMP platforms.  For more information on Cilk see
58
<A HREF="http://supertech.lcs.mit.edu/cilk">the Cilk home page</A>.  The FFTW
59
Cilk code can be found in the <CODE>cilk</CODE> directory, with parallelized
60
one- and multi-dimensional transforms of complex data.  The Cilk FFTW
61
routines are documented in <CODE>cilk/README</CODE>.
62
 
63
</UL>
64
 
65
 
66
 
67
<H2><A NAME="SEC48">Multi-threaded FFTW</A></H2>
68
 
69
<P>
70
<A NAME="IDX206"></A>
71
In this section we document the parallel FFTW routines for shared-memory
72
threads on SMP hardware.  These routines, which support parallel one-
73
and multi-dimensional transforms of both real and complex data, are the
74
easiest way to take advantage of multiple processors with FFTW.  They
75
work just like the corresponding uniprocessor transform routines, except
76
that they take the number of parallel threads to use as an extra
77
parameter.  Any program that uses the uniprocessor FFTW can be trivially
78
modified to use the multi-threaded FFTW.
79
 
80
 
81
 
82
 
83
<H3><A NAME="SEC49">Installation and Supported Hardware/Software</A></H3>
84
 
85
<P>
86
All of the FFTW threads code is located in the <CODE>threads</CODE>
87
subdirectory of the FFTW package.  On Unix systems, the FFTW threads
88
libraries and header files can be automatically configured, compiled,
89
and installed along with the uniprocessor FFTW libraries simply by
90
including <CODE>--enable-threads</CODE> in the flags to the <CODE>configure</CODE>
91
script (see Section <A HREF="fftw_6.html#SEC67">Installation on Unix</A>).  (Note also that the threads
92
routines, when enabled, are automatically tested by the <SAMP>`<CODE>make
check</CODE>'</SAMP> self-tests.)
94
<A NAME="IDX207"></A>
95
 
96
 
97
<P>
98
The threads routines require your operating system to have some sort of
99
shared-memory threads support.  Specifically, the FFTW threads package
100
works with POSIX threads (available on most Unix variants, including
101
Linux), Solaris threads, <A HREF="http://www.be.com">BeOS</A> threads (tested
102
on BeOS DR8.2), Mach C threads (reported to work by users), and Win32
103
threads (reported to work by users).  (There is also untested code to
104
use MacOS MP threads.)  If you have a shared-memory machine that uses a
105
different threads API, it should be a simple matter of programming to
106
include support for it; see the file <CODE>fftw_threads-int.h</CODE> for more
107
detail.
108
 
109
 
110
<P>
111
SMP hardware is not required, although of course you need multiple
112
processors to get any benefit from the multithreaded transforms.
113
 
114
 
115
 
116
 
117
<H3><A NAME="SEC50">Usage of Multi-threaded FFTW</A></H3>
118
 
119
<P>
120
Here, it is assumed that the reader is already familiar with the usage
121
of the uniprocessor FFTW routines, described elsewhere in this manual.
122
We only describe what one has to change in order to use the
123
multi-threaded routines.
124
 
125
 
126
<P>
127
First, instead of including <CODE>&#60;fftw.h&#62;</CODE> or <CODE>&#60;rfftw.h&#62;</CODE>, you
128
should include the files <CODE>&#60;fftw_threads.h&#62;</CODE> or
129
<CODE>&#60;rfftw_threads.h&#62;</CODE>, respectively.
130
 
131
 
132
<P>
133
Second, before calling any FFTW routines, you should call the function:
134
 
135
 
136
 
137
<PRE>
138
int fftw_threads_init(void);
139
</PRE>
140
 
141
<P>
142
<A NAME="IDX208"></A>
143
 
144
 
145
<P>
146
This function, which should only be called once (probably in your
147
<CODE>main()</CODE> function), performs any one-time initialization required
148
to use threads on your system.  It returns zero if successful, and a
149
non-zero value if there was an error (in which case, something is
150
seriously wrong and you should probably exit the program).
151
 
152
 
153
<P>
154
Third, when you want to actually compute the transform, you should use
155
one of the following transform routines instead of the ordinary FFTW
156
functions:
157
 
158
 
159
 
160
<PRE>
161
fftw_threads(nthreads, plan, howmany, in, istride,
162
             idist, out, ostride, odist);
163
<A NAME="IDX209"></A>
164
fftw_threads_one(nthreads, plan, in, out);
165
<A NAME="IDX210"></A>
166
fftwnd_threads(nthreads, plan, howmany, in, istride,
167
               idist, out, ostride, odist);
168
<A NAME="IDX211"></A>
169
fftwnd_threads_one(nthreads, plan, in, out);
170
<A NAME="IDX212"></A>
171
rfftw_threads(nthreads, plan, howmany, in, istride,
172
              idist, out, ostride, odist);
173
<A NAME="IDX213"></A>
174
rfftw_threads_one(nthreads, plan, in, out);
175
<A NAME="IDX214"></A>
176
rfftwnd_threads_real_to_complex(nthreads, plan, howmany, in,
177
                                istride, idist, out, ostride, odist);
178
<A NAME="IDX215"></A>
179
rfftwnd_threads_one_real_to_complex(nthreads, plan, in, out);
180
<A NAME="IDX216"></A>
181
rfftwnd_threads_complex_to_real(nthreads, plan, howmany, in,
182
                                istride, idist, out, ostride, odist);
183
<A NAME="IDX217"></A>
184
rfftwnd_threads_one_real_to_complex(nthreads, plan, in, out);
185
<A NAME="IDX218"></A>
186
rfftwnd_threads_one_complex_to_real(nthreads, plan, in, out);
187
<A NAME="IDX219"></A></PRE>
188
 
189
<P>
190
All of these routines take exactly the same arguments and have exactly
191
the same effects as their uniprocessor counterparts (i.e. without the
192
<SAMP>`<CODE>_threads</CODE>'</SAMP>) <EM>except</EM> that they take one extra
193
parameter, <CODE>nthreads</CODE> (of type <CODE>int</CODE>), before the normal
194
parameters.<A NAME="DOCF5" HREF="fftw_foot.html#FOOT5">(5)</A>  The <CODE>nthreads</CODE>
195
parameter specifies the number of threads of execution to use when
196
performing the transform (actually, the maximum number of threads).
197
<A NAME="IDX220"></A>
198
 
199
 
200
<P>
201
For example, to parallelize a single one-dimensional transform of
202
complex data, instead of calling the uniprocessor <CODE>fftw_one(plan,
203
in, out)</CODE>, you would call <CODE>fftw_threads_one(nthreads, plan, in,
204
out)</CODE>.  Passing an <CODE>nthreads</CODE> of <CODE>1</CODE> means to use only one
205
thread (the main thread), and is equivalent to calling the uniprocessor
206
routine.  Passing an <CODE>nthreads</CODE> of <CODE>2</CODE> means that the
207
transform is potentially parallelized over two threads (and two
208
processors, if you have them), and so on.
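<P>
For illustration, a complete (if minimal) program along these lines might
look like the following sketch.  It is not taken from the FFTW
distribution; the transform size, thread count, and input data are
arbitrary choices for the example.

<PRE>
#include &#60;stdlib.h&#62;
#include &#60;fftw_threads.h&#62;

/* link with: -lfftw_threads -lfftw -lm, plus your system's threads
   library (e.g. -lpthread on Linux) */

int main(void)
{
     const int N = 1024, nthreads = 2;   /* illustrative values */
     fftw_plan plan;
     fftw_complex *in, *out;
     int i;

     if (fftw_threads_init())
          return 1;   /* one-time threads initialization failed */

     /* plan creation is unchanged from the uniprocessor FFTW: */
     plan = fftw_create_plan(N, FFTW_FORWARD, FFTW_ESTIMATE);

     in  = (fftw_complex *) malloc(N * sizeof(fftw_complex));
     out = (fftw_complex *) malloc(N * sizeof(fftw_complex));
     for (i = 0; i &#60; N; ++i) {
          in[i].re = (double) i;   /* arbitrary input data */
          in[i].im = 0.0;
     }

     fftw_threads_one(nthreads, plan, in, out);   /* uses up to 2 threads */

     fftw_destroy_plan(plan);
     free(in); free(out);
     return 0;
}
</PRE>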
209
 
210
 
211
<P>
212
These are the only changes you need to make to your source code.  Calls
213
to all other FFTW routines (plan creation, destruction, wisdom,
214
etcetera) are not parallelized and remain the same.  (The same plans and
215
wisdom are used by both uniprocessor and multi-threaded transforms.)
216
Your arrays are allocated and formatted in the same way, and so on.
217
 
218
 
219
<P>
220
Programs using the parallel complex transforms should be linked with
221
<CODE>-lfftw_threads -lfftw -lm</CODE> on Unix.  Programs using the parallel
222
real transforms should be linked with <CODE>-lrfftw_threads
223
-lfftw_threads -lrfftw -lfftw -lm</CODE>.  You will also need to link with
224
whatever library is responsible for threads on your system
225
(e.g. <CODE>-lpthread</CODE> on Linux).
226
<A NAME="IDX221"></A>
227
 
228
 
229
 
230
 
231
<H3><A NAME="SEC51">How Many Threads to Use?</A></H3>
232
 
233
<P>
234
<A NAME="IDX222"></A>
235
There is a fair amount of overhead involved in spawning and synchronizing
236
threads, so the optimal number of threads to use depends upon the size
237
of the transform as well as on the number of processors you have.
238
 
239
 
240
<P>
241
As a general rule, you don't want to use more threads than you have
242
processors.  (Using more threads will work, but there will be extra
243
overhead with no benefit.)  In fact, if the problem size is too small,
244
you may want to use fewer threads than you have processors.
245
 
246
 
247
<P>
248
You will have to experiment with your system to see what level of
249
parallelization is best for your problem size.  Useful tools to help you
250
do this are the test programs that are automatically compiled along with
251
the threads libraries, <CODE>fftw_threads_test</CODE> and
252
<CODE>rfftw_threads_test</CODE> (in the <CODE>threads</CODE> subdirectory).  These
253
<A NAME="IDX223"></A>
254
<A NAME="IDX224"></A>
255
take the same arguments as the other FFTW test programs (see
256
<CODE>tests/README</CODE>), except that they also take the number of threads
257
to use as a first argument, and report the parallel speedup in speed
258
tests.  For example,
259
 
260
 
261
 
262
<PRE>
263
fftw_threads_test 2 -s 128x128
264
</PRE>
265
 
266
<P>
267
will benchmark complex 128x128 transforms using two threads and report
268
the speedup relative to the uniprocessor transform.
269
<A NAME="IDX225"></A>
270
 
271
 
272
<P>
273
For instance, on a 4-processor 200MHz Pentium Pro system running Linux
274
2.2.0, we found that the "crossover" point at which 2 threads became
275
beneficial for complex transforms was about 4k points, while 4 threads
276
became beneficial at 8k points.
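<P>
A hedged sketch of how one might act on such measurements at run time is
shown below; the thresholds are simply the crossover sizes quoted above for
that particular machine and will certainly differ on other systems, so
treat this as illustrative only.

<PRE>
/* Illustrative heuristic: pick a thread count from measured crossover
   points.  n is the transform size, nproc the number of processors. */
int choose_nthreads(int n, int nproc)
{
     int nthreads = 1;
     if (n &#62;= 8192)
          nthreads = 4;      /* 4 threads became worthwhile near 8k points */
     else if (n &#62;= 4096)
          nthreads = 2;      /* 2 threads became worthwhile near 4k points */
     return (nthreads &#62; nproc) ? nproc : nthreads;
}
</PRE>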
277
 
278
 
279
 
280
 
281
<H3><A NAME="SEC52">Using Multi-threaded FFTW in a Multi-threaded Program</A></H3>
282
 
283
<P>
284
<A NAME="IDX226"></A>
285
It is perfectly possible to use the multi-threaded FFTW routines from a
286
multi-threaded program (e.g. have multiple threads computing
287
multi-threaded transforms simultaneously).  If you have the processors,
288
more power to you!  However, the same restrictions apply as for the
289
uniprocessor FFTW routines (see Section <A HREF="fftw_3.html#SEC46">Thread safety</A>).  In particular, you
290
should recall that you may not create or destroy plans in parallel.
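<P>
One common way to respect this restriction, sketched below under the
assumption that your program uses POSIX threads, is to create (and later
destroy) all plans in the main thread, before any worker threads are
started and after they have all finished; the workers themselves then only
compute transforms.

<PRE>
#include &#60;stdlib.h&#62;
#include &#60;pthread.h&#62;
#include &#60;fftw_threads.h&#62;

static fftw_plan plan;   /* shared plan, created once in main() */

static void *worker(void *arg)
{
     /* per-thread arrays; calloc gives zero-initialized input for this sketch */
     fftw_complex *in  = (fftw_complex *) calloc(1024, sizeof(fftw_complex));
     fftw_complex *out = (fftw_complex *) calloc(1024, sizeof(fftw_complex));

     fftw_threads_one(1, plan, in, out);   /* transforms may run concurrently */

     free(in); free(out);
     return NULL;
}

int main(void)
{
     pthread_t tid;

     fftw_threads_init();
     plan = fftw_create_plan(1024, FFTW_FORWARD, FFTW_ESTIMATE);  /* serial */

     pthread_create(&#38;tid, NULL, worker, NULL);
     /* ...start more workers here if desired... */
     pthread_join(tid, NULL);

     fftw_destroy_plan(plan);   /* only after all workers have finished */
     return 0;
}
</PRE>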
291
 
292
 
293
 
294
 
295
<H3><A NAME="SEC53">Tips for Optimal Threading</A></H3>
296
 
297
<P>
298
Not all transforms are equally well-parallelized by the multi-threaded
299
FFTW routines.  (This is merely a consequence of laziness on the part of
300
the implementors, and is not inherent to the algorithms employed.)
301
Mainly, the limitations are in the parallel one-dimensional transforms.
302
The things to avoid if you want optimal parallelization are as follows:
303
 
304
 
305
 
306
 
307
<H3><A NAME="SEC54">Parallelization deficiencies in one-dimensional transforms</A></H3>
308
 
309
 
310
<UL>
311
 
312
<LI>
313
 
314
Large prime factors can sometimes parallelize poorly.  Of course, you
315
should avoid these anyway if you want high performance.
316
 
317
<LI>
318
 
319
<A NAME="IDX227"></A>
320
Single in-place transforms don't parallelize completely.  (Multiple
321
in-place transforms, i.e. <CODE>howmany &#62; 1</CODE>, are fine.)  Again, you
322
should avoid these in any case if you want high performance, as they
323
require transforming to a scratch array and copying back.
324
 
325
<LI>
326
 
327
Single real-complex (<CODE>rfftw</CODE>) transforms don't parallelize
328
completely.  This is unfortunate, but parallelizing this correctly would
329
have involved a lot of extra code (and a much larger library).  You
330
still get some benefit from additional processors, but if you have a
331
very large number of processors you will probably be better off using
332
the parallel complex (<CODE>fftw</CODE>) transforms.  Note that
333
multi-dimensional real transforms or multiple one-dimensional real
334
transforms are fine.
335
 
336
</UL>
337
 
338
 
339
 
340
<H2><A NAME="SEC55">MPI FFTW</A></H2>
341
 
342
<P>
343
<A NAME="IDX228"></A>
344
This section describes the MPI FFTW routines for distributed-memory (and
345
shared-memory) machines supporting MPI (Message Passing Interface).  The
346
MPI routines are significantly different from the ordinary FFTW because
347
the transform data here are <EM>distributed</EM> over multiple processes,
348
so that each process gets only a portion of the array.
349
<A NAME="IDX229"></A>
350
Currently, multi-dimensional transforms of both real and complex data,
351
as well as one-dimensional transforms of complex data, are supported.
352
 
353
 
354
 
355
 
356
<H3><A NAME="SEC56">MPI FFTW Installation</A></H3>
357
 
358
<P>
359
The FFTW MPI library code is all located in the <CODE>mpi</CODE> subdirectory
360
of the FFTW package (along with source code for test programs).  On Unix
361
systems, the FFTW MPI libraries and header files can be automatically
362
configured, compiled, and installed along with the uniprocessor FFTW
363
libraries simply by including <CODE>--enable-mpi</CODE> in the flags to the
364
<CODE>configure</CODE> script (see Section <A HREF="fftw_6.html#SEC67">Installation on Unix</A>).
365
<A NAME="IDX230"></A>
366
 
367
 
368
<P>
369
The only requirement of the FFTW MPI code is that you have the standard
370
MPI 1.1 (or later) libraries and header files installed on your system.
371
A free implementation of MPI is available from
372
<A HREF="http://www-unix.mcs.anl.gov/mpi/mpich/">the MPICH home page</A>.
373
 
374
 
375
<P>
376
Previous versions of the FFTW MPI routines have had an unfortunate
377
tendency to expose bugs in MPI implementations.  The current version has
378
been largely rewritten, and hopefully avoids some of the problems.  If
379
you run into difficulties, try passing the optional workspace to
380
<CODE>(r)fftwnd_mpi</CODE> (see below), as this allows us to use the standard
381
(and hopefully well-tested) <CODE>MPI_Alltoall</CODE> primitive for
382
<A NAME="IDX231"></A>
383
communications.  Please let us know (<A HREF="mailto:fftw@theory.lcs.mit.edu">fftw@theory.lcs.mit.edu</A>)
384
how things work out.
385
 
386
 
387
<P>
388
<A NAME="IDX232"></A>
389
<A NAME="IDX233"></A>
390
Several test programs are included in the <CODE>mpi</CODE> directory.  The
391
ones most useful to you are probably the <CODE>fftw_mpi_test</CODE> and
392
<CODE>rfftw_mpi_test</CODE> programs, which are run just like an ordinary MPI
393
program and accept the same parameters as the other FFTW test programs
394
(c.f. <CODE>tests/README</CODE>).  For example, <CODE>mpirun <I>...params...</I>
395
fftw_mpi_test -r 0</CODE> will run non-terminating complex-transform
396
correctness tests of random dimensions.  They can also do performance
397
benchmarks.
398
 
399
 
400
 
401
 
402
<H3><A NAME="SEC57">Usage of MPI FFTW for Complex Multi-dimensional Transforms</A></H3>
403
 
404
<P>
405
Usage of the MPI FFTW routines is similar to that of the uniprocessor
406
FFTW.  We assume that the reader already understands the usage of the
407
uniprocessor FFTW routines, described elsewhere in this manual.  Some
408
familiarity with MPI is also helpful.
409
 
410
 
411
<P>
412
A typical program performing a complex two-dimensional MPI transform
413
might look something like:
414
 
415
 
416
 
417
<PRE>
418
#include &#60;fftw_mpi.h&#62;
419
 
420
int main(int argc, char **argv)
421
{
422
      const int NX = ..., NY = ...;
423
      fftwnd_mpi_plan plan;
424
      fftw_complex *data;
425
 
426
      MPI_Init(&#38;argc,&#38;argv);
427
 
428
      plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD,
429
                                    NX, NY,
430
                                    FFTW_FORWARD, FFTW_ESTIMATE);
431
 
432
      ...allocate and initialize data...
433
 
434
      fftwnd_mpi(plan, 1, data, NULL, FFTW_NORMAL_ORDER);
435
 
436
      ...
437
 
438
      fftwnd_mpi_destroy_plan(plan);
439
      MPI_Finalize();
440
}
441
</PRE>
442
 
443
<P>
444
The calls to <CODE>MPI_Init</CODE> and <CODE>MPI_Finalize</CODE> are required in all
445
<A NAME="IDX234"></A>
446
<A NAME="IDX235"></A>
447
MPI programs; see the <A HREF="http://www.mcs.anl.gov/mpi/">MPI home page</A>
448
for more information.  Note that all of your processes run the program
449
in parallel, as a group; there is no explicit launching of
450
threads/processes in an MPI program.
451
 
452
 
453
<P>
454
<A NAME="IDX236"></A>
455
As in the ordinary FFTW, the first thing we do is to create a plan (of
456
type <CODE>fftwnd_mpi_plan</CODE>), using:
457
 
458
 
459
 
460
<PRE>
461
fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm,
462
                                       int nx, int ny,
463
                                       fftw_direction dir, int flags);
464
</PRE>
465
 
466
<P>
467
<A NAME="IDX237"></A>
468
<A NAME="IDX238"></A>
469
 
470
 
471
<P>
472
Except for the first argument, the parameters are identical to those of
473
<CODE>fftw2d_create_plan</CODE>.  (There are also analogous
474
<CODE>fftwnd_mpi_create_plan</CODE> and <CODE>fftw3d_mpi_create_plan</CODE>
475
functions.  Transforms of any rank greater than one are supported.)
476
<A NAME="IDX239"></A>
477
<A NAME="IDX240"></A>
478
The first argument is an MPI <EM>communicator</EM>, which specifies the
479
group of processes that are to be involved in the transform; the
480
standard constant <CODE>MPI_COMM_WORLD</CODE> indicates all available
481
processes.
482
<A NAME="IDX241"></A>
483
 
484
 
485
<P>
486
Next, one has to allocate and initialize the data.  This is somewhat
487
tricky, because the transform data is distributed across the processes
488
involved in the transform.  It is discussed in detail by the next
489
section (see Section <A HREF="fftw_4.html#SEC58">MPI Data Layout</A>).
490
 
491
 
492
<P>
493
The actual computation of the transform is performed by the function
494
<CODE>fftwnd_mpi</CODE>, which differs somewhat from its uniprocessor
495
equivalent and is described by:
496
 
497
 
498
 
499
<PRE>
500
void fftwnd_mpi(fftwnd_mpi_plan p,
501
                int n_fields,
502
                fftw_complex *local_data, fftw_complex *work,
503
                fftwnd_mpi_output_order output_order);
504
</PRE>
505
 
506
<P>
507
<A NAME="IDX242"></A>
508
 
509
 
510
<P>
511
There are several things to notice here:
512
 
513
 
514
 
515
<UL>
516
 
517
<LI>
518
 
519
<A NAME="IDX243"></A>
520
First of all, all <CODE>fftw_mpi</CODE> transforms are in-place: the output is
521
in the <CODE>local_data</CODE> parameter, and there is no need to specify
522
<CODE>FFTW_IN_PLACE</CODE> in the plan flags.
523
 
524
<LI>
525
 
526
<A NAME="IDX244"></A>
527
<A NAME="IDX245"></A>
528
The MPI transforms also only support a limited subset of the
529
<CODE>howmany</CODE>/<CODE>stride</CODE>/<CODE>dist</CODE> functionality of the
530
uniprocessor routines: the <CODE>n_fields</CODE> parameter is equivalent to
531
<CODE>howmany=n_fields</CODE>, <CODE>stride=n_fields</CODE>, and <CODE>dist=1</CODE>.
532
(Conceptually, the <CODE>n_fields</CODE> parameter allows you to transform an
533
array of contiguous vectors, each with length <CODE>n_fields</CODE>.)
534
<CODE>n_fields</CODE> is <CODE>1</CODE> if you are only transforming a single,
535
ordinary array.
536
 
537
<LI>
538
 
539
The <CODE>work</CODE> parameter is an optional workspace.  If it is not
540
<CODE>NULL</CODE>, it should be exactly the same size as the <CODE>local_data</CODE>
541
array.  If it is provided, FFTW is able to use the built-in
542
<CODE>MPI_Alltoall</CODE> primitive for (often) greater efficiency at the
543
<A NAME="IDX246"></A>
544
expense of extra storage space.
545
 
546
<LI>
547
 
548
Finally, the last parameter specifies whether the output data has the
549
same ordering as the input data (<CODE>FFTW_NORMAL_ORDER</CODE>), or if it is
550
transposed (<CODE>FFTW_TRANSPOSED_ORDER</CODE>).  Leaving the data transposed
551
<A NAME="IDX247"></A>
552
<A NAME="IDX248"></A>
553
results in significant performance improvements due to a saved
554
communication step (needed to un-transpose the data).  Specifically, the
555
first two dimensions of the array are transposed, as is described in
556
more detail by the next section.
557
 
558
</UL>
559
 
560
<P>
561
<A NAME="IDX249"></A>
562
The output of <CODE>fftwnd_mpi</CODE> is identical to that of the
563
corresponding uniprocessor transform.  In particular, you should recall
564
our conventions for normalization and the sign of the transform
565
exponent.
566
 
567
 
568
<P>
569
The same plan can be used to compute many transforms of the same size.
570
After you are done with it, you should deallocate it by calling
571
<CODE>fftwnd_mpi_destroy_plan</CODE>.
572
<A NAME="IDX250"></A>
573
 
574
 
575
<P>
576
<A NAME="IDX251"></A>
577
<A NAME="IDX252"></A>
578
<B>Important:</B> The FFTW MPI routines must be called in the same order by
579
all processes involved in the transform.  You should assume that they
580
all are blocking, as if each contained a call to <CODE>MPI_Barrier</CODE>.
581
 
582
 
583
<P>
584
Programs using the FFTW MPI routines should be linked with
585
<CODE>-lfftw_mpi -lfftw -lm</CODE> on Unix, in addition to whatever libraries
586
are required for MPI.
587
<A NAME="IDX253"></A>
588
 
589
 
590
 
591
 
592
<H3><A NAME="SEC58">MPI Data Layout</A></H3>
593
 
594
<P>
595
<A NAME="IDX254"></A>
596
<A NAME="IDX255"></A>
597
The transform data used by the MPI FFTW routines is <EM>distributed</EM>: a
598
distinct portion of it resides with each process involved in the
599
transform.  This allows the transform to be parallelized, for example,
600
over a cluster of workstations, each with its own separate memory, so
601
that you can take advantage of the total memory of all the processors
602
you are parallelizing over.
603
 
604
 
605
<P>
606
In particular, the array is divided according to the rows (first
607
dimension) of the data: each process gets a subset of the rows of the
608
<A NAME="IDX256"></A>
609
data.  (This is sometimes called a "slab decomposition.")  One
610
consequence of this is that you can't take advantage of more processors
611
than you have rows (e.g. a <CODE>64x64x64</CODE> matrix can use at most 64
612
processors).  This isn't usually much of a limitation, however, as each
613
processor needs a fair amount of data in order for the
614
parallel-computation benefits to outweigh the communications costs.
615
 
616
 
617
<P>
618
Below, the first dimension of the data will be referred to as
619
<SAMP>`<CODE>x</CODE>'</SAMP> and the second dimension as <SAMP>`<CODE>y</CODE>'</SAMP>.
620
 
621
 
622
<P>
623
FFTW supplies a routine to tell you exactly how much data resides on the
624
current process:
625
 
626
 
627
 
628
<PRE>
629
void fftwnd_mpi_local_sizes(fftwnd_mpi_plan p,
630
                            int *local_nx,
631
                            int *local_x_start,
632
                            int *local_ny_after_transpose,
633
                            int *local_y_start_after_transpose,
634
                            int *total_local_size);
635
</PRE>
636
 
637
<P>
638
<A NAME="IDX257"></A>
639
 
640
 
641
<P>
642
Given a plan <CODE>p</CODE>, the other parameters of this routine are set to
643
values describing the required data layout, described below.
644
 
645
 
646
<P>
647
<CODE>total_local_size</CODE> is the number of <CODE>fftw_complex</CODE> elements
648
that you must allocate for your local data (and workspace, if you
649
choose).  (This value should, of course, be multiplied by
650
<CODE>n_fields</CODE> if that parameter to <CODE>fftwnd_mpi</CODE> is not <CODE>1</CODE>.)
651
 
652
 
653
<P>
654
The data on the current process has <CODE>local_nx</CODE> rows, starting at
655
row <CODE>local_x_start</CODE>.  If <CODE>fftwnd_mpi</CODE> is called with
656
<CODE>FFTW_TRANSPOSED_ORDER</CODE> output, then <CODE>y</CODE> will be the first
657
dimension of the output, and the local <CODE>y</CODE> extent will be given by
658
<CODE>local_ny_after_transpose</CODE> and
659
<CODE>local_y_start_after_transpose</CODE>.  Otherwise, the output has the
660
same dimensions and layout as the input.
661
 
662
 
663
<P>
664
For instance, suppose you want to transform three-dimensional data of
665
size <CODE>nx x ny x nz</CODE>.  Then, the current process will store a subset
666
of this data, of size <CODE>local_nx x ny x nz</CODE>, where the <CODE>x</CODE>
667
indices correspond to the range <CODE>local_x_start</CODE> to
668
<CODE>local_x_start+local_nx-1</CODE> in the "real" (i.e. logical) array.
669
If <CODE>fftwnd_mpi</CODE> is called with <CODE>FFTW_TRANSPOSED_ORDER</CODE> output,
670
<A NAME="IDX258"></A>
671
then the result will be a <CODE>ny x nx x nz</CODE> array, of which a
672
<CODE>local_ny_after_transpose x nx x nz</CODE> subset is stored on the
673
current process (corresponding to <CODE>y</CODE> values starting at
674
<CODE>local_y_start_after_transpose</CODE>).
675
 
676
 
677
<P>
678
The following is an example of allocating such a three-dimensional array
(<CODE>local_data</CODE>) before the transform and initializing it to
680
some function <CODE>f(x,y,z)</CODE>:
681
 
682
 
683
 
684
<PRE>
685
        fftwnd_mpi_local_sizes(plan, &#38;local_nx, &#38;local_x_start,
686
                               &#38;local_ny_after_transpose,
687
                               &#38;local_y_start_after_transpose,
688
                               &#38;total_local_size);
689
 
690
        local_data = (fftw_complex*) malloc(sizeof(fftw_complex) *
691
                                            total_local_size);
692
 
693
        for (x = 0; x &#60; local_nx; ++x)
694
                for (y = 0; y &#60; ny; ++y)
695
                        for (z = 0; z &#60; nz; ++z)
696
                                local_data[(x*ny + y)*nz + z]
697
                                        = f(x + local_x_start, y, z);
698
</PRE>
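<P>
If the transform is then computed with <CODE>FFTW_TRANSPOSED_ORDER</CODE>
output, the results are laid out as the
<CODE>local_ny_after_transpose x nx x nz</CODE> block described above.  The
following sketch (our own illustration; <CODE>y_local</CODE> is just a local
index variable introduced here) shows how an output element could be read
back:

<PRE>
        /* after: fftwnd_mpi(plan, 1, local_data, NULL, FFTW_TRANSPOSED_ORDER); */
        for (y_local = 0; y_local &#60; local_ny_after_transpose; ++y_local)
                for (x = 0; x &#60; nx; ++x)
                        for (z = 0; z &#60; nz; ++z) {
                                fftw_complex v =
                                        local_data[(y_local*nx + x)*nz + z];
                                /* v is the transform value at global indices
                                   (y_local + local_y_start_after_transpose, x, z) */
                        }
</PRE>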
699
 
700
<P>
701
Some important things to remember:
702
 
703
 
704
 
705
<UL>
706
 
707
<LI>
708
 
709
Although the local data is of dimensions <CODE>local_nx x ny x nz</CODE> in
710
the above example, do <EM>not</EM> allocate the array to be of size
711
<CODE>local_nx*ny*nz</CODE>.  Use <CODE>total_local_size</CODE> instead.
712
 
713
<LI>
714
 
715
The amount of data on each process will not necessarily be the same; in
716
fact, <CODE>local_nx</CODE> may even be zero for some processes.  (For
717
example, suppose you are doing a <CODE>6x6</CODE> transform on four
718
processors.  There is no way to effectively use the fourth processor in
719
a slab decomposition, so we leave it empty.  Proof left as an exercise
720
for the reader.)
721
 
722
<LI>
723
 
724
<A NAME="IDX259"></A>
725
All arrays are, of course, in row-major order (see Section <A HREF="fftw_2.html#SEC7">Multi-dimensional Array Format</A>).
726
 
727
<LI>
728
 
729
If you want to compute the inverse transform of the output of
730
<CODE>fftwnd_mpi</CODE>, the dimensions of the inverse transform are given by
731
the dimensions of the output of the forward transform.  For example, if
732
you are using <CODE>FFTW_TRANSPOSED_ORDER</CODE> output in the above example,
733
then the inverse plan should be created with dimensions <CODE>ny x nx x
734
nz</CODE>.
735
 
736
<LI>
737
 
738
The data layout only depends upon the dimensions of the array, not on
739
the plan, so you are guaranteed that different plans for the same size
740
(or inverse plans) will use the same (consistent) data layouts.
741
 
742
</UL>
743
 
744
 
745
 
746
<H3><A NAME="SEC59">Usage of MPI FFTW for Real Multi-dimensional Transforms</A></H3>
747
 
748
<P>
749
MPI transforms specialized for real data are also available, similar to
750
the uniprocessor <CODE>rfftwnd</CODE> transforms.  Just as in the uniprocessor
751
case, the real-data MPI functions gain roughly a factor of two in speed
752
(and save a factor of two in space) at the expense of more complicated
753
data formats in the calling program.  Before reading this section, you
754
should definitely understand how to call the uniprocessor <CODE>rfftwnd</CODE>
755
functions and also the complex MPI FFTW functions.
756
 
757
 
758
<P>
759
The following is an example of a program using <CODE>rfftwnd_mpi</CODE>.  It
760
computes the size <CODE>nx x ny x nz</CODE> transform of a real function
761
<CODE>f(x,y,z)</CODE>, multiplies the imaginary part by <CODE>2</CODE> for fun, then
762
computes the inverse transform.  (We'll also use
763
<CODE>FFTW_TRANSPOSED_ORDER</CODE> output for the transform, and additionally
764
supply the optional workspace parameter to <CODE>rfftwnd_mpi</CODE>, just to
765
add a little spice.)
766
 
767
 
768
 
769
<PRE>
770
#include &#60;rfftw_mpi.h&#62;
771
 
772
int main(int argc, char **argv)
773
{
774
     const int nx = ..., ny = ..., nz = ...;
775
     int local_nx, local_x_start, local_ny_after_transpose,
776
         local_y_start_after_transpose, total_local_size;
777
     int x, y, z;
778
     rfftwnd_mpi_plan plan, iplan;
779
     fftw_real *data, *work;
780
     fftw_complex *cdata;
781
 
782
     MPI_Init(&#38;argc,&#38;argv);
783
 
784
     /* create the forward and backward plans: */
785
     plan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
786
                                    nx, ny, nz,
787
                                    FFTW_REAL_TO_COMPLEX,
788
                                    FFTW_ESTIMATE);
789
<A NAME="IDX260"></A>     iplan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
790
      /* dim.'s of REAL data --&#62; */  nx, ny, nz,
791
                                     FFTW_COMPLEX_TO_REAL,
792
                                     FFTW_ESTIMATE);
793
 
794
     rfftwnd_mpi_local_sizes(plan, &#38;local_nx, &#38;local_x_start,
795
                            &#38;local_ny_after_transpose,
796
                            &#38;local_y_start_after_transpose,
797
                            &#38;total_local_size);
798
<A NAME="IDX261"></A>
799
     data = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);
800
 
801
     /* workspace is the same size as the data: */
802
     work = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);
803
 
804
     /* initialize data to f(x,y,z): */
805
     for (x = 0; x &#60; local_nx; ++x)
806
             for (y = 0; y &#60; ny; ++y)
807
                     for (z = 0; z &#60; nz; ++z)
808
                             data[(x*ny + y) * (2*(nz/2+1)) + z]
809
                                     = f(x + local_x_start, y, z);
810
 
811
     /* Now, compute the forward transform: */
812
     rfftwnd_mpi(plan, 1, data, work, FFTW_TRANSPOSED_ORDER);
813
<A NAME="IDX262"></A>
814
     /* the data is now complex, so typecast a pointer: */
815
     cdata = (fftw_complex*) data;
816
 
817
     /* multiply imaginary part by 2, for fun:
818
        (note that the data is transposed) */
819
     for (y = 0; y &#60; local_ny_after_transpose; ++y)
820
             for (x = 0; x &#60; nx; ++x)
821
                     for (z = 0; z &#60; (nz/2+1); ++z)
822
                             cdata[(y*nx + x) * (nz/2+1) + z].im
823
                                     *= 2.0;
824
 
825
     /* Finally, compute the inverse transform; the result
826
        is transposed back to the original data layout: */
827
     rfftwnd_mpi(iplan, 1, data, work, FFTW_TRANSPOSED_ORDER);
828
 
829
     free(data);
830
     free(work);
831
     rfftwnd_mpi_destroy_plan(plan);
832
<A NAME="IDX263"></A>     rfftwnd_mpi_destroy_plan(iplan);
833
     MPI_Finalize();
834
}
835
</PRE>
836
 
837
<P>
838
There's a lot of stuff in this example, but it's all just what you would
839
have guessed, right?  We replaced all the <CODE>fftwnd_mpi*</CODE> functions
840
by <CODE>rfftwnd_mpi*</CODE>, but otherwise the parameters were pretty much
841
the same.  The data is distributed among the processes just as it is for
842
the complex transforms (see Section <A HREF="fftw_4.html#SEC58">MPI Data Layout</A>), but in addition the
843
final dimension is padded just like it is for the uniprocessor in-place
844
real transforms (see Section <A HREF="fftw_3.html#SEC37">Array Dimensions for Real Multi-dimensional Transforms</A>).
845
<A NAME="IDX264"></A>
846
In particular, the <CODE>z</CODE> dimension of the real input data is padded
847
to a size <CODE>2*(nz/2+1)</CODE>, and after the transform it contains
848
<CODE>nz/2+1</CODE> complex values.
849
<A NAME="IDX265"></A>
850
<A NAME="IDX266"></A>
851
 
852
 
853
<P>
854
Some other important things to know about the real MPI transforms:
855
 
856
 
857
 
858
<UL>
859
 
860
<LI>
861
 
862
As for the uniprocessor <CODE>rfftwnd_create_plan</CODE>, the dimensions
863
passed for the <CODE>FFTW_COMPLEX_TO_REAL</CODE> plan are those of the
864
<EM>real</EM> data.  In particular, even when <CODE>FFTW_TRANSPOSED_ORDER</CODE>
865
<A NAME="IDX267"></A>
866
<A NAME="IDX268"></A>
867
is used as in this case, the dimensions are those of the (untransposed)
868
real output, not the (transposed) complex input.  (For the complex MPI
869
transforms, on the other hand, the dimensions are always those of the
870
input array.)
871
 
872
<LI>
873
 
874
The output ordering of the transform (<CODE>FFTW_NORMAL_ORDER</CODE> or
<CODE>FFTW_TRANSPOSED_ORDER</CODE>) <EM>must</EM> be the same for both forward
876
and backward transforms.  (This is not required in the complex case.)
877
 
878
<LI>
879
 
880
<CODE>total_local_size</CODE> is the required size in <CODE>fftw_real</CODE> values,
881
not <CODE>fftw_complex</CODE> values as it is for the complex transforms.
882
 
883
<LI>
884
 
885
<CODE>local_ny_after_transpose</CODE> and <CODE>local_y_start_after_transpose</CODE>
886
describe the portion of the array after the transform; that is, they are
887
indices in the complex array for an <CODE>FFTW_REAL_TO_COMPLEX</CODE> transform
888
and in the real array for an <CODE>FFTW_COMPLEX_TO_REAL</CODE> transform.
889
 
890
<LI>
891
 
892
<CODE>rfftwnd_mpi</CODE> always expects <CODE>fftw_real*</CODE> array arguments, but
893
of course these pointers can refer to either real or complex arrays,
894
depending upon which side of the transform you are on.  Just as for
895
in-place uniprocessor real transforms (and also in the example above),
896
this is most easily handled by typecasting to a complex pointer when
897
handling the complex data.
898
 
899
<LI>
900
 
901
As with the complex transforms, there are also
902
<CODE>rfftwnd_mpi_create_plan</CODE> and <CODE>rfftw2d_mpi_create_plan</CODE> functions, and
903
any rank greater than one is supported.
904
<A NAME="IDX269"></A>
905
<A NAME="IDX270"></A>
906
 
907
</UL>
908
 
909
<P>
910
Programs using the MPI FFTW real transforms should link with
911
<CODE>-lrfftw_mpi -lfftw_mpi -lrfftw -lfftw -lm</CODE> on Unix.
912
<A NAME="IDX271"></A>
913
 
914
 
915
 
916
 
917
<H3><A NAME="SEC60">Usage of MPI FFTW for Complex One-dimensional Transforms</A></H3>
918
 
919
<P>
920
The MPI FFTW also includes routines for parallel one-dimensional
921
transforms of complex data (only).  Although the speedup is generally
922
worse than it is for the multi-dimensional routines,<A NAME="DOCF6" HREF="fftw_foot.html#FOOT6">(6)</A> these distributed-memory one-dimensional transforms are
923
especially useful for performing one-dimensional transforms that don't
924
fit into the memory of a single machine.
925
 
926
 
927
<P>
928
The usage of these routines is straightforward, and is similar to that
929
of the multi-dimensional MPI transform functions.  You first include the
930
header <CODE>&#60;fftw_mpi.h&#62;</CODE> and then create a plan by calling:
931
 
932
 
933
 
934
<PRE>
935
fftw_mpi_plan fftw_mpi_create_plan(MPI_Comm comm, int n,
936
                                   fftw_direction dir, int flags);
937
</PRE>
938
 
939
<P>
940
<A NAME="IDX272"></A>
941
<A NAME="IDX273"></A>
942
 
943
 
944
<P>
945
The last three arguments are the same as for <CODE>fftw_create_plan</CODE>
946
(except that all MPI transforms are automatically <CODE>FFTW_IN_PLACE</CODE>).
947
The first argument specifies the group of processes you are using, and
948
is usually <CODE>MPI_COMM_WORLD</CODE> (all processes).
949
<A NAME="IDX274"></A>
950
A plan can be used for many transforms of the same size, and is
951
destroyed when you are done with it by calling
952
<CODE>fftw_mpi_destroy_plan(plan)</CODE>.
953
<A NAME="IDX275"></A>
954
 
955
 
956
<P>
957
If you don't care about the ordering of the input or output data of the
958
transform, you can include <CODE>FFTW_SCRAMBLED_INPUT</CODE> and/or
959
<CODE>FFTW_SCRAMBLED_OUTPUT</CODE> in the <CODE>flags</CODE>.
960
<A NAME="IDX276"></A>
961
<A NAME="IDX277"></A>
962
<A NAME="IDX278"></A>
963
These save some communications at the expense of having the input and/or
964
output reordered in an undocumented way.  For example, if you are
965
performing an FFT-based convolution, you might use
966
<CODE>FFTW_SCRAMBLED_OUTPUT</CODE> for the forward transform and
967
<CODE>FFTW_SCRAMBLED_INPUT</CODE> for the inverse transform.
968
 
969
 
970
<P>
971
The transform itself is computed by:
972
 
973
 
974
 
975
<PRE>
976
void fftw_mpi(fftw_mpi_plan p, int n_fields,
977
              fftw_complex *local_data, fftw_complex *work);
978
</PRE>
979
 
980
<P>
981
<A NAME="IDX279"></A>
982
 
983
 
984
<P>
985
<A NAME="IDX280"></A>
986
<CODE>n_fields</CODE>, as in <CODE>fftwnd_mpi</CODE>, is equivalent to
987
<CODE>howmany=n_fields</CODE>, <CODE>stride=n_fields</CODE>, and <CODE>dist=1</CODE>, and
988
should be <CODE>1</CODE> when you are computing the transform of a single
989
array.  <CODE>local_data</CODE> contains the portion of the array local to the
990
current process, described below.  <CODE>work</CODE> is either <CODE>NULL</CODE> or
991
an array exactly the same size as <CODE>local_data</CODE>; in the latter case,
992
FFTW can use the <CODE>MPI_Alltoall</CODE> communications primitive which is
993
(usually) faster at the expense of extra storage.  Upon return,
994
<CODE>local_data</CODE> contains the portion of the output local to the
995
current process (see below).
996
<A NAME="IDX281"></A>
997
 
998
 
999
<P>
1000
<A NAME="IDX282"></A>
1001
To find out what portion of the array is stored local to the current
1002
process, you call the following routine:
1003
 
1004
 
1005
 
1006
<PRE>
1007
void fftw_mpi_local_sizes(fftw_mpi_plan p,
1008
                          int *local_n, int *local_start,
1009
                          int *local_n_after_transform,
1010
                          int *local_start_after_transform,
1011
                          int *total_local_size);
1012
</PRE>
1013
 
1014
<P>
1015
<A NAME="IDX283"></A>
1016
 
1017
 
1018
<P>
1019
<CODE>total_local_size</CODE> is the number of <CODE>fftw_complex</CODE> elements
1020
you should actually allocate for <CODE>local_data</CODE> (and <CODE>work</CODE>).
1021
<CODE>local_n</CODE> and <CODE>local_start</CODE> indicate that the current process
1022
stores <CODE>local_n</CODE> elements corresponding to the indices
1023
<CODE>local_start</CODE> to <CODE>local_start+local_n-1</CODE> in the "real"
1024
array.  <EM>After the transform, the process may store a different
1025
portion of the array.</EM>  The portion of the data stored on the process
1026
after the transform is given by <CODE>local_n_after_transform</CODE> and
1027
<CODE>local_start_after_transform</CODE>.  This data is exactly the same as a
1028
contiguous segment of the corresponding uniprocessor transform output
1029
(i.e. an in-order sequence of sequential frequency bins).
1030
 
1031
 
1032
<P>
1033
Note that, if you compute both a forward and a backward transform of the
1034
same size, the local sizes are guaranteed to be consistent.  That is,
1035
the local size after the forward transform will be the same as the local
1036
size before the backward transform, and vice versa.
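<P>
Putting the pieces above together, a minimal sketch of a complete
one-dimensional MPI transform (forward and then inverse, supplying the
optional workspace) might look like the following; the transform size and
the input data are arbitrary choices for the example.

<PRE>
#include &#60;stdlib.h&#62;
#include &#60;fftw_mpi.h&#62;

int main(int argc, char **argv)
{
     const int N = 1048576;   /* illustrative size */
     fftw_mpi_plan plan, iplan;
     fftw_complex *data, *work;
     int local_n, local_start, local_n_after_transform,
         local_start_after_transform, total_local_size;
     int i;

     MPI_Init(&#38;argc, &#38;argv);

     plan  = fftw_mpi_create_plan(MPI_COMM_WORLD, N,
                                  FFTW_FORWARD, FFTW_ESTIMATE);
     iplan = fftw_mpi_create_plan(MPI_COMM_WORLD, N,
                                  FFTW_BACKWARD, FFTW_ESTIMATE);

     fftw_mpi_local_sizes(plan, &#38;local_n, &#38;local_start,
                          &#38;local_n_after_transform,
                          &#38;local_start_after_transform,
                          &#38;total_local_size);

     data = (fftw_complex *) malloc(total_local_size * sizeof(fftw_complex));
     work = (fftw_complex *) malloc(total_local_size * sizeof(fftw_complex));

     /* initialize the local portion, which holds "real" (logical) indices
        local_start to local_start+local_n-1: */
     for (i = 0; i &#60; local_n; ++i) {
          data[i].re = (double) (local_start + i);
          data[i].im = 0.0;
     }

     fftw_mpi(plan, 1, data, work);    /* forward transform, in-place */

     /* ...here the process holds local_n_after_transform outputs,
        starting at frequency index local_start_after_transform... */

     fftw_mpi(iplan, 1, data, work);   /* inverse; result is scaled by N */

     free(data); free(work);
     fftw_mpi_destroy_plan(plan);
     fftw_mpi_destroy_plan(iplan);
     MPI_Finalize();
     return 0;
}
</PRE>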
1037
 
1038
 
1039
<P>
1040
Programs using the FFTW MPI routines should be linked with
1041
<CODE>-lfftw_mpi -lfftw -lm</CODE> on Unix, in addition to whatever libraries
1042
are required for MPI.
1043
<A NAME="IDX284"></A>
1044
 
1045
 
1046
 
1047
 
1048
<H3><A NAME="SEC61">MPI Tips</A></H3>
1049
 
1050
<P>
1051
There are several things you should consider in order to get the best
1052
performance out of the MPI FFTW routines.
1053
 
1054
 
1055
<P>
1056
<A NAME="IDX285"></A>
1057
First, if possible, the first and second dimensions of your data should
1058
be divisible by the number of processes you are using.  (If only one can
1059
be divisible, then you should choose the first dimension.)  This allows
1060
the computational load to be spread evenly among the processes, and also
1061
reduces the communications complexity and overhead.  In the
1062
one-dimensional transform case, the size of the transform should ideally
1063
be divisible by the <EM>square</EM> of the number of processors.
1064
 
1065
 
1066
<P>
1067
<A NAME="IDX286"></A>
1068
Second, you should consider using the <CODE>FFTW_TRANSPOSED_ORDER</CODE>
1069
output format if it is not too burdensome.  The speed gains from
1070
communications savings are usually substantial.
1071
 
1072
 
1073
<P>
1074
Third, you should consider allocating a workspace for
1075
<CODE>(r)fftw(nd)_mpi</CODE>, as this can often
1076
(but not always) improve performance (at the cost of extra storage).
1077
 
1078
 
1079
<P>
1080
Fourth, you should experiment with the best number of processors to use
1081
for your problem.  (There comes a point of diminishing returns, when the
1082
communications costs outweigh the computational benefits.<A NAME="DOCF7" HREF="fftw_foot.html#FOOT7">(7)</A>)  The <CODE>fftw_mpi_test</CODE> program can output helpful performance
1083
benchmarks.
1084
<A NAME="IDX287"></A>
1085
<A NAME="IDX288"></A>
1086
It accepts the same parameters as the uniprocessor test programs
1087
(c.f. <CODE>tests/README</CODE>) and is run like an ordinary MPI program.  For
1088
example, <CODE>mpirun -np 4 fftw_mpi_test -s 128x128x128</CODE> will benchmark
1089
a <CODE>128x128x128</CODE> transform on four processors, reporting timings and
1090
parallel speedups for all variants of <CODE>fftwnd_mpi</CODE> (transposed,
1091
with workspace, etcetera).  (Note also that there is the
1092
<CODE>rfftw_mpi_test</CODE> program for the real transforms.)
1093
<A NAME="IDX289"></A>
1094
 
1095
 
1096
<P><HR><P>
1097
Go to the <A HREF="fftw_1.html">first</A>, <A HREF="fftw_3.html">previous</A>, <A HREF="fftw_5.html">next</A>, <A HREF="fftw_10.html">last</A> section, <A HREF="fftw_toc.html">table of contents</A>.
1098
</BODY>
1099
</HTML>