<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<!-- This HTML file has been created by texi2html 1.52
     from fftw.texi on 18 May 1999 -->

<TITLE>FFTW - Parallel FFTW</TITLE>
</HEAD>
<BODY TEXT="#000000" BGCOLOR="#FFFFFF">
Go to the <A HREF="fftw_1.html">first</A>, <A HREF="fftw_3.html">previous</A>, <A HREF="fftw_5.html">next</A>, <A HREF="fftw_10.html">last</A> section, <A HREF="fftw_toc.html">table of contents</A>.
<P><HR><P>


<H1><A NAME="SEC47">Parallel FFTW</A></H1>

<P>
<A NAME="IDX201"></A>
In this chapter we discuss the use of FFTW in a parallel environment,
documenting the different parallel libraries that we have provided.
(Users calling FFTW from a multi-threaded program should also consult
Section <A HREF="fftw_3.html#SEC46">Thread safety</A>.) The FFTW package currently contains three parallel
transform implementations that leverage the uniprocessor FFTW code:



<UL>

<LI>

<A NAME="IDX202"></A>
The first set of routines utilizes shared-memory threads for parallel
one- and multi-dimensional transforms of both real and complex data.
Any program using FFTW can be trivially modified to use the
multi-threaded routines. This code can use any common threads
implementation, including POSIX threads. (POSIX threads are available
on most Unix variants, including Linux.) These routines are located in
the <CODE>threads</CODE> directory, and are documented in Section <A HREF="fftw_4.html#SEC48">Multi-threaded FFTW</A>.

<LI>

<A NAME="IDX203"></A>
<A NAME="IDX204"></A>
The <CODE>mpi</CODE> directory contains multi-dimensional transforms
of real and complex data for parallel machines supporting MPI. It also
includes parallel one-dimensional transforms for complex data. The main
feature of this code is that it supports distributed-memory transforms,
so it runs on everything from workstation clusters to massively-parallel
supercomputers. More information on MPI can be found at the
<A HREF="http://www.mcs.anl.gov/mpi">MPI home page</A>. The FFTW MPI routines
are documented in Section <A HREF="fftw_4.html#SEC55">MPI FFTW</A>.

<LI>

<A NAME="IDX205"></A>
We also have an experimental parallel implementation written in Cilk, a
C-like parallel language developed at MIT and currently available for
several SMP platforms. For more information on Cilk see
<A HREF="http://supertech.lcs.mit.edu/cilk">the Cilk home page</A>. The FFTW
Cilk code can be found in the <CODE>cilk</CODE> directory, with parallelized
one- and multi-dimensional transforms of complex data. The Cilk FFTW
routines are documented in <CODE>cilk/README</CODE>.

</UL>



<H2><A NAME="SEC48">Multi-threaded FFTW</A></H2>

<P>
<A NAME="IDX206"></A>
In this section we document the parallel FFTW routines for shared-memory
threads on SMP hardware. These routines, which support parallel one-
and multi-dimensional transforms of both real and complex data, are the
easiest way to take advantage of multiple processors with FFTW. They
work just like the corresponding uniprocessor transform routines, except
that they take the number of parallel threads to use as an extra
parameter. Any program that uses the uniprocessor FFTW can be trivially
modified to use the multi-threaded FFTW.




<H3><A NAME="SEC49">Installation and Supported Hardware/Software</A></H3>

<P>
All of the FFTW threads code is located in the <CODE>threads</CODE>
subdirectory of the FFTW package. On Unix systems, the FFTW threads
libraries and header files can be automatically configured, compiled,
and installed along with the uniprocessor FFTW libraries simply by
including <CODE>--enable-threads</CODE> in the flags to the <CODE>configure</CODE>
script (see Section <A HREF="fftw_6.html#SEC67">Installation on Unix</A>). (Note also that the threads
routines, when enabled, are automatically tested by the <SAMP>`<CODE>make
check</CODE>'</SAMP> self-tests.)
<A NAME="IDX207"></A>


<P>
The threads routines require your operating system to have some sort of
shared-memory threads support. Specifically, the FFTW threads package
works with POSIX threads (available on most Unix variants, including
Linux), Solaris threads, <A HREF="http://www.be.com">BeOS</A> threads (tested
on BeOS DR8.2), Mach C threads (reported to work by users), and Win32
threads (reported to work by users). (There is also untested code to
use MacOS MP threads.) If you have a shared-memory machine that uses a
different threads API, it should be a simple matter of programming to
include support for it; see the file <CODE>fftw_threads-int.h</CODE> for more
detail.


<P>
SMP hardware is not required, although of course you need multiple
processors to get any benefit from the multithreaded transforms.




<H3><A NAME="SEC50">Usage of Multi-threaded FFTW</A></H3>

<P>
Here, it is assumed that the reader is already familiar with the usage
of the uniprocessor FFTW routines, described elsewhere in this manual.
We only describe what one has to change in order to use the
multi-threaded routines.


<P>
First, instead of including <CODE><fftw.h></CODE> or <CODE><rfftw.h></CODE>, you
should include the files <CODE><fftw_threads.h></CODE> or
<CODE><rfftw_threads.h></CODE>, respectively.


<P>
Second, before calling any FFTW routines, you should call the function:



<PRE>
int fftw_threads_init(void);
</PRE>

<P>
<A NAME="IDX208"></A>


<P>
This function, which should only be called once (probably in your
<CODE>main()</CODE> function), performs any one-time initialization required
to use threads on your system. It returns zero if successful, and a
non-zero value if there was an error (in which case, something is
seriously wrong and you should probably exit the program).
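
<P>
For example, the check might look like this (a sketch, not from the
original manual; the message text is merely illustrative):


<PRE>
if (fftw_threads_init()) {
     fprintf(stderr, "fftw_threads_init failed, exiting\n");
     exit(EXIT_FAILURE);
}
</PRE>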


<P>
Third, when you want to actually compute the transform, you should use
one of the following transform routines instead of the ordinary FFTW
functions:



<PRE>
fftw_threads(nthreads, plan, howmany, in, istride,
             idist, out, ostride, odist);
<A NAME="IDX209"></A>
fftw_threads_one(nthreads, plan, in, out);
<A NAME="IDX210"></A>
fftwnd_threads(nthreads, plan, howmany, in, istride,
               idist, out, ostride, odist);
<A NAME="IDX211"></A>
fftwnd_threads_one(nthreads, plan, in, out);
<A NAME="IDX212"></A>
rfftw_threads(nthreads, plan, howmany, in, istride,
              idist, out, ostride, odist);
<A NAME="IDX213"></A>
rfftw_threads_one(nthreads, plan, in, out);
<A NAME="IDX214"></A>
rfftwnd_threads_real_to_complex(nthreads, plan, howmany, in,
                                istride, idist, out, ostride, odist);
<A NAME="IDX215"></A>
rfftwnd_threads_one_real_to_complex(nthreads, plan, in, out);
<A NAME="IDX216"></A>
rfftwnd_threads_complex_to_real(nthreads, plan, howmany, in,
                                istride, idist, out, ostride, odist);
<A NAME="IDX217"></A>
rfftwnd_threads_one_complex_to_real(nthreads, plan, in, out);
<A NAME="IDX218"></A>
<A NAME="IDX219"></A></PRE>

<P>
All of these routines take exactly the same arguments and have exactly
the same effects as their uniprocessor counterparts (i.e. without the
<SAMP>`<CODE>_threads</CODE>'</SAMP>) <EM>except</EM> that they take one extra
parameter, <CODE>nthreads</CODE> (of type <CODE>int</CODE>), before the normal
parameters.<A NAME="DOCF5" HREF="fftw_foot.html#FOOT5">(5)</A> The <CODE>nthreads</CODE>
parameter specifies the number of threads of execution to use when
performing the transform (actually, the maximum number of threads).
<A NAME="IDX220"></A>


<P>
For example, to parallelize a single one-dimensional transform of
complex data, instead of calling the uniprocessor <CODE>fftw_one(plan,
in, out)</CODE>, you would call <CODE>fftw_threads_one(nthreads, plan, in,
out)</CODE>. Passing an <CODE>nthreads</CODE> of <CODE>1</CODE> means to use only one
thread (the main thread), and is equivalent to calling the uniprocessor
routine. Passing an <CODE>nthreads</CODE> of <CODE>2</CODE> means that the
transform is potentially parallelized over two threads (and two
processors, if you have them), and so on.
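
<P>
Putting these steps together, a complete (if minimal) program might
look like the following sketch. (This example is our own illustration,
not from the original manual; <CODE>N</CODE>, <CODE>NTHREADS</CODE>, and the
error handling are arbitrary choices.)


<PRE>
#include <stdio.h>
#include <stdlib.h>
#include <fftw_threads.h>  /* instead of <fftw.h> */

#define N 1024
#define NTHREADS 4

int main(void)
{
     fftw_complex in[N], out[N];
     fftw_plan plan;

     if (fftw_threads_init())  /* see the check shown earlier */
          return EXIT_FAILURE;

     /* plan creation is unchanged from the uniprocessor FFTW: */
     plan = fftw_create_plan(N, FFTW_FORWARD, FFTW_ESTIMATE);

     /* ... initialize in[] ... */

     /* uniprocessor version: fftw_one(plan, in, out); */
     fftw_threads_one(NTHREADS, plan, in, out);

     fftw_destroy_plan(plan);
     return 0;
}
</PRE>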


<P>
These are the only changes you need to make to your source code. Calls
to all other FFTW routines (plan creation, destruction, wisdom,
etcetera) are not parallelized and remain the same. (The same plans and
wisdom are used by both uniprocessor and multi-threaded transforms.)
Your arrays are allocated and formatted in the same way, and so on.


<P>
Programs using the parallel complex transforms should be linked with
<CODE>-lfftw_threads -lfftw -lm</CODE> on Unix. Programs using the parallel
real transforms should be linked with <CODE>-lrfftw_threads
-lfftw_threads -lrfftw -lfftw -lm</CODE>. You will also need to link with
whatever library is responsible for threads on your system
(e.g. <CODE>-lpthread</CODE> on Linux).
<A NAME="IDX221"></A>




<H3><A NAME="SEC51">How Many Threads to Use?</A></H3>

<P>
<A NAME="IDX222"></A>
There is a fair amount of overhead involved in spawning and synchronizing
threads, so the optimal number of threads to use depends upon the size
of the transform as well as on the number of processors you have.


<P>
As a general rule, you don't want to use more threads than you have
processors. (Using more threads will work, but there will be extra
overhead with no benefit.) In fact, if the problem size is too small,
you may want to use fewer threads than you have processors.


<P>
You will have to experiment with your system to see what level of
parallelization is best for your problem size. Useful tools to help you
do this are the test programs that are automatically compiled along with
the threads libraries, <CODE>fftw_threads_test</CODE> and
<CODE>rfftw_threads_test</CODE> (in the <CODE>threads</CODE> subdirectory). These
<A NAME="IDX223"></A>
<A NAME="IDX224"></A>
take the same arguments as the other FFTW test programs (see
<CODE>tests/README</CODE>), except that they also take the number of threads
to use as a first argument, and report the parallel speedup in speed
tests. For example,



<PRE>
fftw_threads_test 2 -s 128x128
</PRE>

<P>
will benchmark complex 128x128 transforms using two threads and report
the speedup relative to the uniprocessor transform.
<A NAME="IDX225"></A>


<P>
For instance, on a 4-processor 200MHz Pentium Pro system running Linux
2.2.0, we found that the "crossover" point at which 2 threads became
beneficial for complex transforms was about 4k points, while 4 threads
became beneficial at 8k points.




<H3><A NAME="SEC52">Using Multi-threaded FFTW in a Multi-threaded Program</A></H3>

<P>
<A NAME="IDX226"></A>
It is perfectly possible to use the multi-threaded FFTW routines from a
multi-threaded program (e.g. have multiple threads computing
multi-threaded transforms simultaneously). If you have the processors,
more power to you! However, the same restrictions apply as for the
uniprocessor FFTW routines (see Section <A HREF="fftw_3.html#SEC46">Thread safety</A>). In particular, you
should recall that you may not create or destroy plans in parallel.




<H3><A NAME="SEC53">Tips for Optimal Threading</A></H3>

<P>
Not all transforms are equally well-parallelized by the multi-threaded
FFTW routines. (This is merely a consequence of laziness on the part of
the implementors, and is not inherent to the algorithms employed.)
Mainly, the limitations are in the parallel one-dimensional transforms.
The things to avoid if you want optimal parallelization are as follows:




<H3><A NAME="SEC54">Parallelization deficiencies in one-dimensional transforms</A></H3>


<UL>

<LI>

Large prime factors can sometimes parallelize poorly. Of course, you
should avoid these anyway if you want high performance.

<LI>

<A NAME="IDX227"></A>
Single in-place transforms don't parallelize completely. (Multiple
in-place transforms, i.e. <CODE>howmany > 1</CODE>, are fine; see the
sketch after this list.) Again, you should avoid single in-place
transforms in any case if you want high performance, as they require
transforming to a scratch array and copying back.

<LI>

Single real-complex (<CODE>rfftw</CODE>) transforms don't parallelize
completely. This is unfortunate, but parallelizing this correctly would
have involved a lot of extra code (and a much larger library). You
still get some benefit from additional processors, but if you have a
very large number of processors you will probably be better off using
the parallel complex (<CODE>fftw</CODE>) transforms. Note that
multi-dimensional real transforms or multiple one-dimensional real
transforms are fine.

</UL>
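
<P>
For instance, a batch of independent one-dimensional in-place
transforms can be computed in a single call (a sketch of our own;
<CODE>N</CODE>, <CODE>HOWMANY</CODE>, <CODE>nthreads</CODE>, and <CODE>in</CODE> are
illustrative, and we assume the usual in-place convention that the
output appears in <CODE>in</CODE> and the output arguments are ignored):


<PRE>
fftw_plan plan = fftw_create_plan(N, FFTW_FORWARD,
                                  FFTW_ESTIMATE | FFTW_IN_PLACE);

/* in holds HOWMANY contiguous length-N arrays (stride 1,
   distance N between successive transforms): */
fftw_threads(nthreads, plan, HOWMANY, in, 1, N, NULL, 0, 0);
</PRE>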



<H2><A NAME="SEC55">MPI FFTW</A></H2>

<P>
<A NAME="IDX228"></A>
This section describes the MPI FFTW routines for distributed-memory (and
shared-memory) machines supporting MPI (Message Passing Interface). The
MPI routines are significantly different from the ordinary FFTW because
the transform data here are <EM>distributed</EM> over multiple processes,
so that each process gets only a portion of the array.
<A NAME="IDX229"></A>
Currently, multi-dimensional transforms of both real and complex data,
as well as one-dimensional transforms of complex data, are supported.




<H3><A NAME="SEC56">MPI FFTW Installation</A></H3>

<P>
The FFTW MPI library code is all located in the <CODE>mpi</CODE> subdirectory
of the FFTW package (along with source code for test programs). On Unix
systems, the FFTW MPI libraries and header files can be automatically
configured, compiled, and installed along with the uniprocessor FFTW
libraries simply by including <CODE>--enable-mpi</CODE> in the flags to the
<CODE>configure</CODE> script (see Section <A HREF="fftw_6.html#SEC67">Installation on Unix</A>).
<A NAME="IDX230"></A>


<P>
The only requirement of the FFTW MPI code is that you have the standard
MPI 1.1 (or later) libraries and header files installed on your system.
A free implementation of MPI is available from
<A HREF="http://www-unix.mcs.anl.gov/mpi/mpich/">the MPICH home page</A>.


<P>
Previous versions of the FFTW MPI routines have had an unfortunate
tendency to expose bugs in MPI implementations. The current version has
been largely rewritten, and hopefully avoids some of the problems. If
you run into difficulties, try passing the optional workspace to
<CODE>(r)fftwnd_mpi</CODE> (see below), as this allows us to use the standard
(and hopefully well-tested) <CODE>MPI_Alltoall</CODE> primitive for
<A NAME="IDX231"></A>
communications. Please let us know (<A HREF="mailto:fftw@theory.lcs.mit.edu">fftw@theory.lcs.mit.edu</A>)
how things work out.


<P>
<A NAME="IDX232"></A>
<A NAME="IDX233"></A>
Several test programs are included in the <CODE>mpi</CODE> directory. The
ones most useful to you are probably the <CODE>fftw_mpi_test</CODE> and
<CODE>rfftw_mpi_test</CODE> programs, which are run just like an ordinary MPI
program and accept the same parameters as the other FFTW test programs
(c.f. <CODE>tests/README</CODE>). For example, <CODE>mpirun <I>...params...</I>
fftw_mpi_test -r 0</CODE> will run non-terminating complex-transform
correctness tests of random dimensions. They can also do performance
benchmarks.




<H3><A NAME="SEC57">Usage of MPI FFTW for Complex Multi-dimensional Transforms</A></H3>

<P>
Usage of the MPI FFTW routines is similar to that of the uniprocessor
FFTW. We assume that the reader already understands the usage of the
uniprocessor FFTW routines, described elsewhere in this manual. Some
familiarity with MPI is also helpful.


<P>
A typical program performing a complex two-dimensional MPI transform
might look something like:



<PRE>
#include <fftw_mpi.h>

int main(int argc, char **argv)
{
     const int NX = ..., NY = ...;
     fftwnd_mpi_plan plan;
     fftw_complex *data;

     MPI_Init(&argc,&argv);

     plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD,
                                   NX, NY,
                                   FFTW_FORWARD, FFTW_ESTIMATE);

     ...allocate and initialize data...

     fftwnd_mpi(plan, 1, data, NULL, FFTW_NORMAL_ORDER);

     ...

     fftwnd_mpi_destroy_plan(plan);
     MPI_Finalize();
}
</PRE>

<P>
The calls to <CODE>MPI_Init</CODE> and <CODE>MPI_Finalize</CODE> are required in all
<A NAME="IDX234"></A>
<A NAME="IDX235"></A>
MPI programs; see the <A HREF="http://www.mcs.anl.gov/mpi/">MPI home page</A>
for more information. Note that all of your processes run the program
in parallel, as a group; there is no explicit launching of
threads/processes in an MPI program.


<P>
<A NAME="IDX236"></A>
As in the ordinary FFTW, the first thing we do is to create a plan (of
type <CODE>fftwnd_mpi_plan</CODE>), using:



<PRE>
fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm,
                                       int nx, int ny,
                                       fftw_direction dir, int flags);
</PRE>

<P>
<A NAME="IDX237"></A>
<A NAME="IDX238"></A>


<P>
Except for the first argument, the parameters are identical to those of
<CODE>fftw2d_create_plan</CODE>. (There are also analogous
<CODE>fftwnd_mpi_create_plan</CODE> and <CODE>fftw3d_mpi_create_plan</CODE>
functions. Transforms of any rank greater than one are supported.)
<A NAME="IDX239"></A>
<A NAME="IDX240"></A>
The first argument is an MPI <EM>communicator</EM>, which specifies the
group of processes that are to be involved in the transform; the
standard constant <CODE>MPI_COMM_WORLD</CODE> indicates all available
processes.
<A NAME="IDX241"></A>
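
<P>
If you want only a subset of the processes to participate in a
transform, you can create a smaller communicator with the standard MPI
calls. A sketch of the idea (our own, using the standard
<CODE>MPI_Comm_split</CODE>; the even/odd split is merely illustrative):


<PRE>
MPI_Comm subcomm;
int rank;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* processes with the same color (here, rank parity) form a group: */
MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &subcomm);
plan = fftw2d_mpi_create_plan(subcomm, NX, NY,
                              FFTW_FORWARD, FFTW_ESTIMATE);
</PRE>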


<P>
Next, one has to allocate and initialize the data. This is somewhat
tricky, because the transform data is distributed across the processes
involved in the transform. It is discussed in detail by the next
section (see Section <A HREF="fftw_4.html#SEC58">MPI Data Layout</A>).


<P>
The actual computation of the transform is performed by the function
<CODE>fftwnd_mpi</CODE>, which differs somewhat from its uniprocessor
equivalent and is described by:



<PRE>
void fftwnd_mpi(fftwnd_mpi_plan p,
                int n_fields,
                fftw_complex *local_data, fftw_complex *work,
                fftwnd_mpi_output_order output_order);
</PRE>

<P>
<A NAME="IDX242"></A>


<P>
There are several things to notice here:



<UL>

<LI>

<A NAME="IDX243"></A>
First of all, all <CODE>fftw_mpi</CODE> transforms are in-place: the output is
in the <CODE>local_data</CODE> parameter, and there is no need to specify
<CODE>FFTW_IN_PLACE</CODE> in the plan flags.

<LI>

<A NAME="IDX244"></A>
<A NAME="IDX245"></A>
The MPI transforms also only support a limited subset of the
<CODE>howmany</CODE>/<CODE>stride</CODE>/<CODE>dist</CODE> functionality of the
uniprocessor routines: the <CODE>n_fields</CODE> parameter is equivalent to
<CODE>howmany=n_fields</CODE>, <CODE>stride=n_fields</CODE>, and <CODE>dist=1</CODE>.
(Conceptually, the <CODE>n_fields</CODE> parameter allows you to transform an
array of contiguous vectors, each with length <CODE>n_fields</CODE>.)
<CODE>n_fields</CODE> is <CODE>1</CODE> if you are only transforming a single,
ordinary array.

<LI>

The <CODE>work</CODE> parameter is an optional workspace. If it is not
<CODE>NULL</CODE>, it should be exactly the same size as the <CODE>local_data</CODE>
array. If it is provided, FFTW is able to use the built-in
<CODE>MPI_Alltoall</CODE> primitive for (often) greater efficiency at the
<A NAME="IDX246"></A>
expense of extra storage space (see the example after this list).

<LI>

Finally, the last parameter specifies whether the output data has the
same ordering as the input data (<CODE>FFTW_NORMAL_ORDER</CODE>), or if it is
transposed (<CODE>FFTW_TRANSPOSED_ORDER</CODE>). Leaving the data transposed
<A NAME="IDX247"></A>
<A NAME="IDX248"></A>
results in significant performance improvements due to a saved
communication step (needed to un-transpose the data). Specifically, the
first two dimensions of the array are transposed, as is described in
more detail by the next section.

</UL>
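
<P>
For example, the transform call in the program above could be modified
to supply both a workspace and transposed output (a sketch; the
<CODE>total_local_size</CODE> value comes from
<CODE>fftwnd_mpi_local_sizes</CODE>, described in the next section):


<PRE>
fftw_complex *work = (fftw_complex*)
     malloc(sizeof(fftw_complex) * total_local_size);

fftwnd_mpi(plan, 1, data, work, FFTW_TRANSPOSED_ORDER);

free(work);
</PRE>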

<P>
<A NAME="IDX249"></A>
The output of <CODE>fftwnd_mpi</CODE> is identical to that of the
corresponding uniprocessor transform. In particular, you should recall
our conventions for normalization and the sign of the transform
exponent.


<P>
The same plan can be used to compute many transforms of the same size.
After you are done with it, you should deallocate it by calling
<CODE>fftwnd_mpi_destroy_plan</CODE>.
<A NAME="IDX250"></A>


<P>
<A NAME="IDX251"></A>
<A NAME="IDX252"></A>
<B>Important:</B> The FFTW MPI routines must be called in the same order by
all processes involved in the transform. You should assume that they
all are blocking, as if each contained a call to <CODE>MPI_Barrier</CODE>.


<P>
Programs using the FFTW MPI routines should be linked with
<CODE>-lfftw_mpi -lfftw -lm</CODE> on Unix, in addition to whatever libraries
are required for MPI.
<A NAME="IDX253"></A>




<H3><A NAME="SEC58">MPI Data Layout</A></H3>

<P>
<A NAME="IDX254"></A>
<A NAME="IDX255"></A>
The transform data used by the MPI FFTW routines is <EM>distributed</EM>: a
distinct portion of it resides with each process involved in the
transform. This allows the transform to be parallelized, for example,
over a cluster of workstations, each with its own separate memory, so
that you can take advantage of the total memory of all the processors
you are parallelizing over.


<P>
In particular, the array is divided according to the rows (first
dimension) of the data: each process gets a subset of the rows of the
<A NAME="IDX256"></A>
data. (This is sometimes called a "slab decomposition.") One
consequence of this is that you can't take advantage of more processors
than you have rows (e.g. a <CODE>64x64x64</CODE> matrix can use at most 64
processors). This isn't usually much of a limitation, however, as each
processor needs a fair amount of data in order for the
parallel-computation benefits to outweigh the communications costs.


<P>
Below, the first dimension of the data will be referred to as
<SAMP>`<CODE>x</CODE>'</SAMP> and the second dimension as <SAMP>`<CODE>y</CODE>'</SAMP>.


<P>
FFTW supplies a routine to tell you exactly how much data resides on the
current process:



<PRE>
void fftwnd_mpi_local_sizes(fftwnd_mpi_plan p,
                            int *local_nx,
                            int *local_x_start,
                            int *local_ny_after_transpose,
                            int *local_y_start_after_transpose,
                            int *total_local_size);
</PRE>

<P>
<A NAME="IDX257"></A>


<P>
Given a plan <CODE>p</CODE>, the other parameters of this routine are set to
values describing the required data layout, described below.


<P>
<CODE>total_local_size</CODE> is the number of <CODE>fftw_complex</CODE> elements
that you must allocate for your local data (and workspace, if you
choose). (This value should, of course, be multiplied by
<CODE>n_fields</CODE> if that parameter to <CODE>fftwnd_mpi</CODE> is not <CODE>1</CODE>.)


<P>
The data on the current process has <CODE>local_nx</CODE> rows, starting at
row <CODE>local_x_start</CODE>. If <CODE>fftwnd_mpi</CODE> is called with
<CODE>FFTW_TRANSPOSED_ORDER</CODE> output, then <CODE>y</CODE> will be the first
dimension of the output, and the local <CODE>y</CODE> extent will be given by
<CODE>local_ny_after_transpose</CODE> and
<CODE>local_y_start_after_transpose</CODE>. Otherwise, the output has the
same dimensions and layout as the input.


<P>
For instance, suppose you want to transform three-dimensional data of
size <CODE>nx x ny x nz</CODE>. Then, the current process will store a subset
of this data, of size <CODE>local_nx x ny x nz</CODE>, where the <CODE>x</CODE>
indices correspond to the range <CODE>local_x_start</CODE> to
<CODE>local_x_start+local_nx-1</CODE> in the "real" (i.e. logical) array.
If <CODE>fftwnd_mpi</CODE> is called with <CODE>FFTW_TRANSPOSED_ORDER</CODE> output,
<A NAME="IDX258"></A>
then the result will be a <CODE>ny x nx x nz</CODE> array, of which a
<CODE>local_ny_after_transpose x nx x nz</CODE> subset is stored on the
current process (corresponding to <CODE>y</CODE> values starting at
<CODE>local_y_start_after_transpose</CODE>).


<P>
The following is an example of allocating such a three-dimensional
array (<CODE>local_data</CODE>) before the transform and initializing it to
some function <CODE>f(x,y,z)</CODE>:



<PRE>
fftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start,
                       &local_ny_after_transpose,
                       &local_y_start_after_transpose,
                       &total_local_size);

local_data = (fftw_complex*) malloc(sizeof(fftw_complex) *
                                    total_local_size);

for (x = 0; x < local_nx; ++x)
     for (y = 0; y < ny; ++y)
          for (z = 0; z < nz; ++z)
               local_data[(x*ny + y)*nz + z]
                    = f(x + local_x_start, y, z);
</PRE>

<P>
Some important things to remember:



<UL>

<LI>

Although the local data is of dimensions <CODE>local_nx x ny x nz</CODE> in
the above example, do <EM>not</EM> allocate the array to be of size
<CODE>local_nx*ny*nz</CODE>. Use <CODE>total_local_size</CODE> instead.

<LI>

The amount of data on each process will not necessarily be the same; in
fact, <CODE>local_nx</CODE> may even be zero for some processes. (For
example, suppose you are doing a <CODE>6x6</CODE> transform on four
processors. There is no way to effectively use the fourth processor in
a slab decomposition, so we leave it empty. Proof left as an exercise
for the reader.)

<LI>

<A NAME="IDX259"></A>
All arrays are, of course, in row-major order (see Section <A HREF="fftw_2.html#SEC7">Multi-dimensional Array Format</A>).

<LI>

If you want to compute the inverse transform of the output of
<CODE>fftwnd_mpi</CODE>, the dimensions of the inverse transform are given by
the dimensions of the output of the forward transform. For example, if
you are using <CODE>FFTW_TRANSPOSED_ORDER</CODE> output in the above example,
then the inverse plan should be created with dimensions <CODE>ny x nx x
nz</CODE> (see the sketch after this list).

<LI>

The data layout only depends upon the dimensions of the array, not on
the plan, so you are guaranteed that different plans for the same size
(or inverse plans) will use the same (consistent) data layouts.

</UL>
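
<P>
For instance, continuing the <CODE>nx x ny x nz</CODE> example with
<CODE>FFTW_TRANSPOSED_ORDER</CODE> output, the inverse plan would be created
with the first two dimensions swapped (a sketch of the idea, not code
from the original manual; <CODE>plan</CODE> and <CODE>iplan</CODE> are
illustrative names):


<PRE>
/* forward plan: logical dimensions nx x ny x nz */
plan  = fftw3d_mpi_create_plan(MPI_COMM_WORLD, nx, ny, nz,
                               FFTW_FORWARD, FFTW_ESTIMATE);
/* inverse plan: the forward output was transposed to ny x nx x nz */
iplan = fftw3d_mpi_create_plan(MPI_COMM_WORLD, ny, nx, nz,
                               FFTW_BACKWARD, FFTW_ESTIMATE);
</PRE>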



<H3><A NAME="SEC59">Usage of MPI FFTW for Real Multi-dimensional Transforms</A></H3>

<P>
MPI transforms specialized for real data are also available, similar to
the uniprocessor <CODE>rfftwnd</CODE> transforms. Just as in the uniprocessor
case, the real-data MPI functions gain roughly a factor of two in speed
(and save a factor of two in space) at the expense of more complicated
data formats in the calling program. Before reading this section, you
should definitely understand how to call the uniprocessor <CODE>rfftwnd</CODE>
functions and also the complex MPI FFTW functions.


<P>
The following is an example of a program using <CODE>rfftwnd_mpi</CODE>. It
computes the size <CODE>nx x ny x nz</CODE> transform of a real function
<CODE>f(x,y,z)</CODE>, multiplies the imaginary part by <CODE>2</CODE> for fun, then
computes the inverse transform. (We'll also use
<CODE>FFTW_TRANSPOSED_ORDER</CODE> output for the transform, and additionally
supply the optional workspace parameter to <CODE>rfftwnd_mpi</CODE>, just to
add a little spice.)



<PRE>
#include <rfftw_mpi.h>

int main(int argc, char **argv)
{
     const int nx = ..., ny = ..., nz = ...;
     int local_nx, local_x_start, local_ny_after_transpose,
         local_y_start_after_transpose, total_local_size;
     int x, y, z;
     rfftwnd_mpi_plan plan, iplan;
     fftw_real *data, *work;
     fftw_complex *cdata;

     MPI_Init(&argc,&argv);

     /* create the forward and backward plans: */
     plan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
                                    nx, ny, nz,
                                    FFTW_REAL_TO_COMPLEX,
                                    FFTW_ESTIMATE);
<A NAME="IDX260"></A>     iplan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
             /* dim.'s of REAL data --> */ nx, ny, nz,
                                    FFTW_COMPLEX_TO_REAL,
                                    FFTW_ESTIMATE);

     rfftwnd_mpi_local_sizes(plan, &local_nx, &local_x_start,
                             &local_ny_after_transpose,
                             &local_y_start_after_transpose,
                             &total_local_size);
<A NAME="IDX261"></A>
     data = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);

     /* workspace is the same size as the data: */
     work = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);

     /* initialize data to f(x,y,z): */
     for (x = 0; x < local_nx; ++x)
          for (y = 0; y < ny; ++y)
               for (z = 0; z < nz; ++z)
                    data[(x*ny + y) * (2*(nz/2+1)) + z]
                         = f(x + local_x_start, y, z);

     /* Now, compute the forward transform: */
     rfftwnd_mpi(plan, 1, data, work, FFTW_TRANSPOSED_ORDER);
<A NAME="IDX262"></A>
     /* the data is now complex, so typecast a pointer: */
     cdata = (fftw_complex*) data;

     /* multiply imaginary part by 2, for fun:
        (note that the data is transposed) */
     for (y = 0; y < local_ny_after_transpose; ++y)
          for (x = 0; x < nx; ++x)
               for (z = 0; z < (nz/2+1); ++z)
                    cdata[(y*nx + x) * (nz/2+1) + z].im
                         *= 2.0;

     /* Finally, compute the inverse transform; the result
        is transposed back to the original data layout: */
     rfftwnd_mpi(iplan, 1, data, work, FFTW_TRANSPOSED_ORDER);

     free(data);
     free(work);
     rfftwnd_mpi_destroy_plan(plan);
<A NAME="IDX263"></A>     rfftwnd_mpi_destroy_plan(iplan);
     MPI_Finalize();
}
</PRE>

<P>
There's a lot of stuff in this example, but it's all just what you would
have guessed, right? We replaced all the <CODE>fftwnd_mpi*</CODE> functions
by <CODE>rfftwnd_mpi*</CODE>, but otherwise the parameters were pretty much
the same. The data layout is distributed among the processes just like for
the complex transforms (see Section <A HREF="fftw_4.html#SEC58">MPI Data Layout</A>), but in addition the
final dimension is padded just like it is for the uniprocessor in-place
real transforms (see Section <A HREF="fftw_3.html#SEC37">Array Dimensions for Real Multi-dimensional Transforms</A>).
<A NAME="IDX264"></A>
In particular, the <CODE>z</CODE> dimension of the real input data is padded
to a size <CODE>2*(nz/2+1)</CODE>, and after the transform it contains
<CODE>nz/2+1</CODE> complex values.
<A NAME="IDX265"></A>
<A NAME="IDX266"></A>


<P>
Some other important things to know about the real MPI transforms:



<UL>

<LI>

As for the uniprocessor <CODE>rfftwnd_create_plan</CODE>, the dimensions
passed for the <CODE>FFTW_COMPLEX_TO_REAL</CODE> plan are those of the
<EM>real</EM> data. In particular, even when <CODE>FFTW_TRANSPOSED_ORDER</CODE>
<A NAME="IDX267"></A>
<A NAME="IDX268"></A>
is used as in this case, the dimensions are those of the (untransposed)
real output, not the (transposed) complex input. (For the complex MPI
transforms, on the other hand, the dimensions are always those of the
input array.)

<LI>

||
875 | <CODE>FFTW_TRANSPOSED_ORDER</CODE>) <EM>must</EM> be the same for both forward |
||
876 | and backward transforms. (This is not required in the complex case.) |
||
877 | |||
878 | <LI> |
||
879 | |||
880 | <CODE>total_local_size</CODE> is the required size in <CODE>fftw_real</CODE> values, |
||
881 | not <CODE>fftw_complex</CODE> values as it is for the complex transforms. |
||
882 | |||
883 | <LI> |
||
884 | |||
885 | <CODE>local_ny_after_transpose</CODE> and <CODE>local_y_start_after_transpose</CODE> |
||
886 | describe the portion of the array after the transform; that is, they are |
||
887 | indices in the complex array for an <CODE>FFTW_REAL_TO_COMPLEX</CODE> transform |
||
888 | and in the real array for an <CODE>FFTW_COMPLEX_TO_REAL</CODE> transform. |
||
889 | |||
890 | <LI> |
||
891 | |||
892 | <CODE>rfftwnd_mpi</CODE> always expects <CODE>fftw_real*</CODE> array arguments, but |
||
893 | of course these pointers can refer to either real or complex arrays, |
||
894 | depending upon which side of the transform you are on. Just as for |
||
895 | in-place uniprocessor real transforms (and also in the example above), |
||
896 | this is most easily handled by typecasting to a complex pointer when |
||
897 | handling the complex data. |
||
898 | |||
899 | <LI> |
||
900 | |||
901 | As with the complex transforms, there are also |
||
902 | <CODE>rfftwnd_create_plan</CODE> and <CODE>rfftw2d_create_plan</CODE> functions, and |
||
903 | any rank greater than one is supported. |
||
904 | <A NAME="IDX269"></A> |
||
905 | <A NAME="IDX270"></A> |
||
906 | |||
907 | </UL> |
||
908 | |||
909 | <P> |
||
910 | Programs using the MPI FFTW real transforms should link with |
||
911 | <CODE>-lrfftw_mpi -lfftw_mpi -lrfftw -lfftw -lm</CODE> on Unix. |
||
912 | <A NAME="IDX271"></A> |
||
913 | |||
914 | |||
915 | |||
916 | |||
917 | <H3><A NAME="SEC60">Usage of MPI FFTW for Complex One-dimensional Transforms</A></H3> |
||
918 | |||
919 | <P> |
||
920 | The MPI FFTW also includes routines for parallel one-dimensional |
||
921 | transforms of complex data (only). Although the speedup is generally |
||
922 | worse than it is for the multi-dimensional routines,<A NAME="DOCF6" HREF="fftw_foot.html#FOOT6">(6)</A> these distributed-memory one-dimensional transforms are |
||
923 | especially useful for performing one-dimensional transforms that don't |
||
924 | fit into the memory of a single machine. |
||
925 | |||
926 | |||
927 | <P> |
||
928 | The usage of these routines is straightforward, and is similar to that |
||
929 | of the multi-dimensional MPI transform functions. You first include the |
||
930 | header <CODE><fftw_mpi.h></CODE> and then create a plan by calling: |
||
931 | |||
932 | |||
933 | |||
934 | <PRE> |
||
935 | fftw_mpi_plan fftw_mpi_create_plan(MPI_Comm comm, int n, |
||
936 | fftw_direction dir, int flags); |
||
937 | </PRE> |
||
938 | |||
939 | <P> |
||
940 | <A NAME="IDX272"></A> |
||
941 | <A NAME="IDX273"></A> |
||
942 | |||
943 | |||
944 | <P> |
||
945 | The last three arguments are the same as for <CODE>fftw_create_plan</CODE> |
||
946 | (except that all MPI transforms are automatically <CODE>FFTW_IN_PLACE</CODE>). |
||
947 | The first argument specifies the group of processes you are using, and |
||
948 | is usually <CODE>MPI_COMM_WORLD</CODE> (all processes). |
||
949 | <A NAME="IDX274"></A> |
||
950 | A plan can be used for many transforms of the same size, and is |
||
951 | destroyed when you are done with it by calling |
||
952 | <CODE>fftw_mpi_destroy_plan(plan)</CODE>. |
||
953 | <A NAME="IDX275"></A> |
||
954 | |||
955 | |||
956 | <P> |
||
957 | If you don't care about the ordering of the input or output data of the |
||
958 | transform, you can include <CODE>FFTW_SCRAMBLED_INPUT</CODE> and/or |
||
959 | <CODE>FFTW_SCRAMBLED_OUTPUT</CODE> in the <CODE>flags</CODE>. |
||
960 | <A NAME="IDX276"></A> |
||
961 | <A NAME="IDX277"></A> |
||
962 | <A NAME="IDX278"></A> |
||
963 | These save some communications at the expense of having the input and/or |
||
964 | output reordered in an undocumented way. For example, if you are |
||
965 | performing an FFT-based convolution, you might use |
||
966 | <CODE>FFTW_SCRAMBLED_OUTPUT</CODE> for the forward transform and |
||
967 | <CODE>FFTW_SCRAMBLED_INPUT</CODE> for the inverse transform. |
||
968 | |||
969 | |||
970 | <P> |
||
971 | The transform itself is computed by: |
||
972 | |||
973 | |||
974 | |||
975 | <PRE> |
||
976 | void fftw_mpi(fftw_mpi_plan p, int n_fields, |
||
977 | fftw_complex *local_data, fftw_complex *work); |
||
978 | </PRE> |
||
979 | |||
980 | <P> |
||
981 | <A NAME="IDX279"></A> |
||
982 | |||
983 | |||
984 | <P> |
||
985 | <A NAME="IDX280"></A> |
||
986 | <CODE>n_fields</CODE>, as in <CODE>fftwnd_mpi</CODE>, is equivalent to |
||
987 | <CODE>howmany=n_fields</CODE>, <CODE>stride=n_fields</CODE>, and <CODE>dist=1</CODE>, and |
||
988 | should be <CODE>1</CODE> when you are computing the transform of a single |
||
989 | array. <CODE>local_data</CODE> contains the portion of the array local to the |
||
990 | current process, described below. <CODE>work</CODE> is either <CODE>NULL</CODE> or |
||
991 | an array exactly the same size as <CODE>local_data</CODE>; in the latter case, |
||
992 | FFTW can use the <CODE>MPI_Alltoall</CODE> communications primitive which is |
||
993 | (usually) faster at the expense of extra storage. Upon return, |
||
994 | <CODE>local_data</CODE> contains the portion of the output local to the |
||
995 | current process (see below). |
||
996 | <A NAME="IDX281"></A> |
||
997 | |||
998 | |||
999 | <P> |
||
1000 | <A NAME="IDX282"></A> |
||
1001 | To find out what portion of the array is stored local to the current |
||
1002 | process, you call the following routine: |
||
1003 | |||
1004 | |||
1005 | |||
1006 | <PRE> |
||
1007 | void fftw_mpi_local_sizes(fftw_mpi_plan p, |
||
1008 | int *local_n, int *local_start, |
||
1009 | int *local_n_after_transform, |
||
1010 | int *local_start_after_transform, |
||
1011 | int *total_local_size); |
||
1012 | </PRE> |
||
1013 | |||
1014 | <P> |
||
1015 | <A NAME="IDX283"></A> |
||
1016 | |||
1017 | |||
1018 | <P> |
||
1019 | <CODE>total_local_size</CODE> is the number of <CODE>fftw_complex</CODE> elements |
||
1020 | you should actually allocate for <CODE>local_data</CODE> (and <CODE>work</CODE>). |
||
1021 | <CODE>local_n</CODE> and <CODE>local_start</CODE> indicate that the current process |
||
1022 | stores <CODE>local_n</CODE> elements corresponding to the indices |
||
1023 | <CODE>local_start</CODE> to <CODE>local_start+local_n-1</CODE> in the "real" |
||
1024 | array. <EM>After the transform, the process may store a different |
||
1025 | portion of the array.</EM> The portion of the data stored on the process |
||
1026 | after the transform is given by <CODE>local_n_after_transform</CODE> and |
||
1027 | <CODE>local_start_after_transform</CODE>. This data is exactly the same as a |
||
1028 | contiguous segment of the corresponding uniprocessor transform output |
||
1029 | (i.e. an in-order sequence of sequential frequency bins). |
||
1030 | |||
1031 | |||
1032 | <P> |
||
1033 | Note that, if you compute both a forward and a backward transform of the |
||
1034 | same size, the local sizes are guaranteed to be consistent. That is, |
||
1035 | the local size after the forward transform will be the same as the local |
||
1036 | size before the backward transform, and vice versa. |
||
1037 | |||
1038 | |||
1039 | <P> |
||
1040 | Programs using the FFTW MPI routines should be linked with |
||
1041 | <CODE>-lfftw_mpi -lfftw -lm</CODE> on Unix, in addition to whatever libraries |
||
1042 | are required for MPI. |
||
1043 | <A NAME="IDX284"></A> |
||
1044 | |||
1045 | |||
1046 | |||
1047 | |||
1048 | <H3><A NAME="SEC61">MPI Tips</A></H3> |
||
1049 | |||
1050 | <P> |
||
1051 | There are several things you should consider in order to get the best |
||
1052 | performance out of the MPI FFTW routines. |
||
1053 | |||
1054 | |||
1055 | <P> |
||
1056 | <A NAME="IDX285"></A> |
||
1057 | First, if possible, the first and second dimensions of your data should |
||
1058 | be divisible by the number of processes you are using. (If only one can |
||
1059 | be divisible, then you should choose the first dimension.) This allows |
||
1060 | the computational load to be spread evenly among the processes, and also |
||
1061 | reduces the communications complexity and overhead. In the |
||
1062 | one-dimensional transform case, the size of the transform should ideally |
||
1063 | be divisible by the <EM>square</EM> of the number of processors. |
||
1064 | |||
1065 | |||
1066 | <P> |
||
1067 | <A NAME="IDX286"></A> |
||
1068 | Second, you should consider using the <CODE>FFTW_TRANSPOSED_ORDER</CODE> |
||
1069 | output format if it is not too burdensome. The speed gains from |
||
1070 | communications savings are usually substantial. |
||
1071 | |||
1072 | |||
1073 | <P> |
||
1074 | Third, you should consider allocating a workspace for |
||
1075 | <CODE>(r)fftw(nd)_mpi</CODE>, as this can often |
||
1076 | (but not always) improve performance (at the cost of extra storage). |
||
1077 | |||
1078 | |||
1079 | <P> |
||
1080 | Fourth, you should experiment with the best number of processors to use |
||
1081 | for your problem. (There comes a point of diminishing returns, when the |
||
1082 | communications costs outweigh the computational benefits.<A NAME="DOCF7" HREF="fftw_foot.html#FOOT7">(7)</A>) The <CODE>fftw_mpi_test</CODE> program can output helpful performance |
||
1083 | benchmarks. |
||
1084 | <A NAME="IDX287"></A> |
||
1085 | <A NAME="IDX288"></A> |
||
1086 | It accepts the same parameters as the uniprocessor test programs |
||
1087 | (c.f. <CODE>tests/README</CODE>) and is run like an ordinary MPI program. For |
||
1088 | example, <CODE>mpirun -np 4 fftw_mpi_test -s 128x128x128</CODE> will benchmark |
||
1089 | a <CODE>128x128x128</CODE> transform on four processors, reporting timings and |
||
1090 | parallel speedups for all variants of <CODE>fftwnd_mpi</CODE> (transposed, |
||
1091 | with workspace, etcetera). (Note also that there is the |
||
1092 | <CODE>rfftw_mpi_test</CODE> program for the real transforms.) |
||
1093 | <A NAME="IDX289"></A> |
||
1094 | |||
1095 | |||
1096 | <P><HR><P> |
||
1097 | Go to the <A HREF="fftw_1.html">first</A>, <A HREF="fftw_3.html">previous</A>, <A HREF="fftw_5.html">next</A>, <A HREF="fftw_10.html">last</A> section, <A HREF="fftw_toc.html">table of contents</A>. |
||
1098 | </BODY> |
||
1099 | </HTML> |