<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
2
<HTML>
3
<HEAD>
4
<!-- This HTML file has been created by texi2html 1.52
5
     from fftw.texi on 18 May 1999 -->
6
 
7
<TITLE>FFTW - Parallel FFTW</TITLE>
8
</HEAD>
9
<BODY TEXT="#000000" BGCOLOR="#FFFFFF">
10
Go to the <A HREF="fftw_1.html">first</A>, <A HREF="fftw_3.html">previous</A>, <A HREF="fftw_5.html">next</A>, <A HREF="fftw_10.html">last</A> section, <A HREF="fftw_toc.html">table of contents</A>.
11
<P><HR><P>
12
 
13
 
14
<H1><A NAME="SEC47">Parallel FFTW</A></H1>
15
 
16
<P>
17
<A NAME="IDX201"></A>
18
In this chapter we discuss the use of FFTW in a parallel environment,
19
documenting the different parallel libraries that we have provided.
20
(Users calling FFTW from a multi-threaded program should also consult
21
Section <A HREF="fftw_3.html#SEC46">Thread safety</A>.)  The FFTW package currently contains three parallel
22
transform implementations that leverage the uniprocessor FFTW code:
23
 
24
 
25
 
26
<UL>
27
 
28
<LI>
29
 
30
<A NAME="IDX202"></A>
31
The first set of routines utilizes shared-memory threads for parallel
32
one- and multi-dimensional transforms of both real and complex data.
33
Any program using FFTW can be trivially modified to use the
34
multi-threaded routines.  This code can use any common threads
35
implementation, including POSIX threads.  (POSIX threads are available
36
on most Unix variants, including Linux.)  These routines are located in
37
the <CODE>threads</CODE> directory, and are documented in Section <A HREF="fftw_4.html#SEC48">Multi-threaded FFTW</A>.
38
 
39
<LI>
40
 
41
<A NAME="IDX203"></A>
42
<A NAME="IDX204"></A>
43
The <CODE>mpi</CODE> directory contains multi-dimensional transforms
44
of real and complex data for parallel machines supporting MPI.  It also
45
includes parallel one-dimensional transforms for complex data.  The main
46
feature of this code is that it supports distributed-memory transforms,
47
so it runs on everything from workstation clusters to massively-parallel
48
supercomputers.  More information on MPI can be found at the
49
<A HREF="http://www.mcs.anl.gov/mpi">MPI home page</A>.  The FFTW MPI routines
50
are documented in Section <A HREF="fftw_4.html#SEC55">MPI FFTW</A>.
51
 
52
<LI>
53
 
54
<A NAME="IDX205"></A>
55
We also have an experimental parallel implementation written in Cilk, a
56
C-like parallel language developed at MIT and currently available for
57
several SMP platforms.  For more information on Cilk see
58
<A HREF="http://supertech.lcs.mit.edu/cilk">the Cilk home page</A>.  The FFTW
59
Cilk code can be found in the <CODE>cilk</CODE> directory, with parallelized
60
one- and multi-dimensional transforms of complex data.  The Cilk FFTW
61
routines are documented in <CODE>cilk/README</CODE>.
62
 
63
</UL>
64
 
65
 
66
 
67
<H2><A NAME="SEC48">Multi-threaded FFTW</A></H2>
68
 
69
<P>
70
<A NAME="IDX206"></A>
71
In this section we document the parallel FFTW routines for shared-memory
threads on SMP hardware.  These routines, which support parallel one-
73
and multi-dimensional transforms of both real and complex data, are the
74
easiest way to take advantage of multiple processors with FFTW.  They
75
work just like the corresponding uniprocessor transform routines, except
76
that they take the number of parallel threads to use as an extra
77
parameter.  Any program that uses the uniprocessor FFTW can be trivially
78
modified to use the multi-threaded FFTW.
79
 
80
 
81
 
82
 
83
<H3><A NAME="SEC49">Installation and Supported Hardware/Software</A></H3>
84
 
85
<P>
86
All of the FFTW threads code is located in the <CODE>threads</CODE>
87
subdirectory of the FFTW package.  On Unix systems, the FFTW threads
88
libraries and header files can be automatically configured, compiled,
89
and installed along with the uniprocessor FFTW libraries simply by
90
including <CODE>--enable-threads</CODE> in the flags to the <CODE>configure</CODE>
91
script (see Section <A HREF="fftw_6.html#SEC67">Installation on Unix</A>).  (Note also that the threads
routines, when enabled, are automatically tested by the <SAMP>`<CODE>make
check</CODE>'</SAMP> self-tests.)
94
<A NAME="IDX207"></A>
95
 
96
 
97
<P>
98
The threads routines require your operating system to have some sort of
99
shared-memory threads support.  Specifically, the FFTW threads package
100
works with POSIX threads (available on most Unix variants, including
101
Linux), Solaris threads, <A HREF="http://www.be.com">BeOS</A> threads (tested
102
on BeOS DR8.2), Mach C threads (reported to work by users), and Win32
103
threads (reported to work by users).  (There is also untested code to
104
use MacOS MP threads.)  If you have a shared-memory machine that uses a
105
different threads API, it should be a simple matter of programming to
106
include support for it; see the file <CODE>fftw_threads-int.h</CODE> for more
107
detail.
108
 
109
 
110
<P>
111
SMP hardware is not required, although of course you need multiple
112
processors to get any benefit from the multithreaded transforms.
113
 
114
 
115
 
116
 
117
<H3><A NAME="SEC50">Usage of Multi-threaded FFTW</A></H3>
118
 
119
<P>
120
Here, it is assumed that the reader is already familiar with the usage
121
of the uniprocessor FFTW routines, described elsewhere in this manual.
122
We only describe what one has to change in order to use the
123
multi-threaded routines.
124
 
125
 
126
<P>
127
First, instead of including <CODE>&#60;fftw.h&#62;</CODE> or <CODE>&#60;rfftw.h&#62;</CODE>, you
128
should include the files <CODE>&#60;fftw_threads.h&#62;</CODE> or
129
<CODE>&#60;rfftw_threads.h&#62;</CODE>, respectively.
130
 
131
 
132
<P>
133
Second, before calling any FFTW routines, you should call the function:
134
 
135
 
136
 
137
<PRE>
138
int fftw_threads_init(void);
139
</PRE>
140
 
141
<P>
142
<A NAME="IDX208"></A>
143
 
144
 
145
<P>
146
This function, which should only be called once (probably in your
147
<CODE>main()</CODE> function), performs any one-time initialization required
148
to use threads on your system.  It returns zero if successful, and a
149
non-zero value if there was an error (in which case, something is
150
seriously wrong and you should probably exit the program).
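 
<P>
For example, near the top of your <CODE>main()</CODE> function you might
write something like the following (illustrative only; it assumes
<CODE>&#60;stdio.h&#62;</CODE> and <CODE>&#60;stdlib.h&#62;</CODE> have been
included):
 
 
<PRE>
if (fftw_threads_init()) {
     fprintf(stderr, "fftw_threads_init failed!\n");
     exit(EXIT_FAILURE);
}
</PRE>
 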
151
 
152
 
153
<P>
154
Third, when you want to actually compute the transform, you should use
155
one of the following transform routines instead of the ordinary FFTW
156
functions:
157
 
158
 
159
 
160
<PRE>
161
fftw_threads(nthreads, plan, howmany, in, istride,
162
             idist, out, ostride, odist);
163
<A NAME="IDX209"></A>
164
fftw_threads_one(nthreads, plan, in, out);
165
<A NAME="IDX210"></A>
166
fftwnd_threads(nthreads, plan, howmany, in, istride,
167
               idist, out, ostride, odist);
168
<A NAME="IDX211"></A>
169
fftwnd_threads_one(nthreads, plan, in, out);
170
<A NAME="IDX212"></A>
171
rfftw_threads(nthreads, plan, howmany, in, istride,
172
              idist, out, ostride, odist);
173
<A NAME="IDX213"></A>
174
rfftw_threads_one(nthreads, plan, in, out);
175
<A NAME="IDX214"></A>
176
rfftwnd_threads_real_to_complex(nthreads, plan, howmany, in,
177
                                istride, idist, out, ostride, odist);
178
<A NAME="IDX215"></A>
179
rfftwnd_threads_one_real_to_complex(nthreads, plan, in, out);
180
<A NAME="IDX216"></A>
181
rfftwnd_threads_complex_to_real(nthreads, plan, howmany, in,
182
                                istride, idist, out, ostride, odist);
183
<A NAME="IDX217"></A>
185
<A NAME="IDX218"></A>
186
rfftwnd_threads_one_complex_to_real(nthreads, plan, in, out);
187
<A NAME="IDX219"></A></PRE>
188
 
189
<P>
190
All of these routines take exactly the same arguments and have exactly
191
the same effects as their uniprocessor counterparts (i.e. without the
192
<SAMP>`<CODE>_threads</CODE>'</SAMP>) <EM>except</EM> that they take one extra
193
parameter, <CODE>nthreads</CODE> (of type <CODE>int</CODE>), before the normal
194
parameters.<A NAME="DOCF5" HREF="fftw_foot.html#FOOT5">(5)</A>  The <CODE>nthreads</CODE>
195
parameter specifies the number of threads of execution to use when
196
performing the transform (actually, the maximum number of threads).
197
<A NAME="IDX220"></A>
198
 
199
 
200
<P>
201
For example, to parallelize a single one-dimensional transform of
202
complex data, instead of calling the uniprocessor <CODE>fftw_one(plan,
203
in, out)</CODE>, you would call <CODE>fftw_threads_one(nthreads, plan, in,
204
out)</CODE>.  Passing an <CODE>nthreads</CODE> of <CODE>1</CODE> means to use only one
205
thread (the main thread), and is equivalent to calling the uniprocessor
206
routine.  Passing an <CODE>nthreads</CODE> of <CODE>2</CODE> means that the
207
transform is potentially parallelized over two threads (and two
208
processors, if you have them), and so on.
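 
<P>
Putting these steps together, a minimal program that transforms a single
one-dimensional complex array with two threads might look something like
the following sketch (data initialization and error checking omitted; the
uniprocessor plan routines are described in earlier chapters):
 
 
<PRE>
#include &#60;fftw_threads.h&#62;
 
#define N 1024
#define NTHREADS 2
 
int main(void)
{
     fftw_complex in[N], out[N];
     fftw_plan plan;
 
     if (fftw_threads_init())
          return 1;   /* something is seriously wrong */
 
     plan = fftw_create_plan(N, FFTW_FORWARD, FFTW_ESTIMATE);
 
     /* ...initialize in[0..N-1]... */
 
     fftw_threads_one(NTHREADS, plan, in, out);
 
     fftw_destroy_plan(plan);
     return 0;
}
</PRE>
 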
209
 
210
 
211
<P>
212
These are the only changes you need to make to your source code.  Calls
213
to all other FFTW routines (plan creation, destruction, wisdom,
214
etcetera) are not parallelized and remain the same.  (The same plans and
215
wisdom are used by both uniprocessor and multi-threaded transforms.)
216
Your arrays are allocated and formatted in the same way, and so on.
217
 
218
 
219
<P>
220
Programs using the parallel complex transforms should be linked with
221
<CODE>-lfftw_threads -lfftw -lm</CODE> on Unix.  Programs using the parallel
222
real transforms should be linked with <CODE>-lrfftw_threads
223
-lfftw_threads -lrfftw -lfftw -lm</CODE>.  You will also need to link with
224
whatever library is responsible for threads on your system
225
(e.g. <CODE>-lpthread</CODE> on Linux).
226
<A NAME="IDX221"></A>
227
 
228
 
229
 
230
 
231
<H3><A NAME="SEC51">How Many Threads to Use?</A></H3>
232
 
233
<P>
234
<A NAME="IDX222"></A>
235
There is a fair amount of overhead involved in spawning and synchronizing
236
threads, so the optimal number of threads to use depends upon the size
237
of the transform as well as on the number of processors you have.
238
 
239
 
240
<P>
241
As a general rule, you don't want to use more threads than you have
242
processors.  (Using more threads will work, but there will be extra
243
overhead with no benefit.)  In fact, if the problem size is too small,
244
you may want to use fewer threads than you have processors.
245
 
246
 
247
<P>
248
You will have to experiment with your system to see what level of
249
parallelization is best for your problem size.  Useful tools to help you
250
do this are the test programs that are automatically compiled along with
251
the threads libraries, <CODE>fftw_threads_test</CODE> and
252
<CODE>rfftw_threads_test</CODE> (in the <CODE>threads</CODE> subdirectory).  These
253
<A NAME="IDX223"></A>
254
<A NAME="IDX224"></A>
255
take the same arguments as the other FFTW test programs (see
256
<CODE>tests/README</CODE>), except that they also take the number of threads
257
to use as a first argument, and report the parallel speedup in speed
258
tests.  For example,
259
 
260
 
261
 
262
<PRE>
263
fftw_threads_test 2 -s 128x128
264
</PRE>
265
 
266
<P>
267
will benchmark complex 128x128 transforms using two threads and report
268
the speedup relative to the uniprocessor transform.
269
<A NAME="IDX225"></A>
270
 
271
 
272
<P>
273
For instance, on a 4-processor 200MHz Pentium Pro system running Linux
274
2.2.0, we found that the "crossover" point at which 2 threads became
275
beneficial for complex transforms was about 4k points, while 4 threads
276
became beneficial at 8k points.
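 
<P>
As a rough illustration of this sort of tuning, a program might select the
number of threads with a small helper function like the one below.  (The
crossover sizes are placeholders taken from the measurements quoted above;
you should measure your own, e.g. with <CODE>fftw_threads_test</CODE>.)
 
 
<PRE>
/* illustrative only: crossover sizes are machine-dependent placeholders */
int choose_nthreads(int n, int nprocs)
{
     int nthreads = 1;
     if (n &#62;= 4096)
          nthreads = 2;
     if (n &#62;= 8192)
          nthreads = 4;
     return (nthreads &#60; nprocs) ? nthreads : nprocs;
}
</PRE>
 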
277
 
278
 
279
 
280
 
281
<H3><A NAME="SEC52">Using Multi-threaded FFTW in a Multi-threaded Program</A></H3>
282
 
283
<P>
284
<A NAME="IDX226"></A>
285
It is perfectly possible to use the multi-threaded FFTW routines from a
286
multi-threaded program (e.g. have multiple threads computing
287
multi-threaded transforms simultaneously).  If you have the processors,
288
more power to you!  However, the same restrictions apply as for the
289
uniprocessor FFTW routines (see Section <A HREF="fftw_3.html#SEC46">Thread safety</A>).  In particular, you
290
should recall that you may not create or destroy plans in parallel.
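 
<P>
As an illustration (this sketch is not part of the FFTW distribution, and
the function and variable names are arbitrary), the following POSIX-threads
program creates the plan once in the main thread, before any workers are
spawned, and then lets each worker run its own multi-threaded transform;
error checking is omitted:
 
 
<PRE>
#include &#60;pthread.h&#62;
#include &#60;stdlib.h&#62;
#include &#60;fftw_threads.h&#62;
 
#define N 256
 
fftw_plan plan;   /* created and destroyed only in the main thread */
 
void *worker(void *arg)
{
     fftw_complex *in = (fftw_complex *) arg;
     fftw_complex *out = (fftw_complex *) malloc(N * sizeof(fftw_complex));
 
     /* executing transforms in parallel is allowed; creating or
        destroying plans in parallel is not */
     fftw_threads_one(2, plan, in, out);
 
     return out;
}
 
int main(void)
{
     pthread_t t[2];
     fftw_complex *a, *b;
     void *out_a, *out_b;
 
     fftw_threads_init();
     plan = fftw_create_plan(N, FFTW_FORWARD, FFTW_ESTIMATE);
 
     a = (fftw_complex *) malloc(N * sizeof(fftw_complex));
     b = (fftw_complex *) malloc(N * sizeof(fftw_complex));
     /* ...initialize a and b... */
 
     pthread_create(&#38;t[0], NULL, worker, a);
     pthread_create(&#38;t[1], NULL, worker, b);
     pthread_join(t[0], &#38;out_a);
     pthread_join(t[1], &#38;out_b);
 
     fftw_destroy_plan(plan);
     return 0;
}
</PRE>
 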
291
 
292
 
293
 
294
 
295
<H3><A NAME="SEC53">Tips for Optimal Threading</A></H3>
296
 
297
<P>
298
Not all transforms are equally well-parallelized by the multi-threaded
299
FFTW routines.  (This is merely a consequence of laziness on the part of
300
the implementors, and is not inherent to the algorithms employed.)
301
Mainly, the limitations are in the parallel one-dimensional transforms.
302
The things to avoid if you want optimal parallelization are as follows:
303
 
304
 
305
 
306
 
307
<H3><A NAME="SEC54">Parallelization deficiencies in one-dimensional transforms</A></H3>
308
 
309
 
310
<UL>
311
 
312
<LI>
313
 
314
Large prime factors can sometimes parallelize poorly.  Of course, you
315
should avoid these anyway if you want high performance.
316
 
317
<LI>
318
 
319
<A NAME="IDX227"></A>
320
Single in-place transforms don't parallelize completely.  (Multiple
321
in-place transforms, i.e. <CODE>howmany &#62; 1</CODE>, are fine.)  Again, you
322
should avoid these in any case if you want high performance, as they
require transforming to a scratch array and copying back.  (See the
short example following this list.)
324
 
325
<LI>
326
 
327
Single real-complex (<CODE>rfftw</CODE>) transforms don't parallelize
328
completely.  This is unfortunate, but parallelizing this correctly would
329
have involved a lot of extra code (and a much larger library).  You
330
still get some benefit from additional processors, but if you have a
331
very large number of processors you will probably be better off using
332
the parallel complex (<CODE>fftw</CODE>) transforms.  Note that
333
multi-dimensional real transforms or multiple one-dimensional real
334
transforms are fine.
335
 
336
</UL>
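 
<P>
For example, rather than looping over <CODE>M</CODE> separate in-place
transforms of length <CODE>N</CODE>, it is usually better to store the
arrays contiguously and transform them all in one call, along these lines
(sketch; variable names are illustrative):
 
 
<PRE>
/* data[i*N ... i*N+N-1] holds the i-th of M contiguous arrays.
   The plan specifies FFTW_IN_PLACE, so the transform is done in
   data and NULL/0 are passed for the unused output arguments. */
fftw_plan plan = fftw_create_plan(N, FFTW_FORWARD,
                                  FFTW_ESTIMATE | FFTW_IN_PLACE);
 
fftw_threads(nthreads, plan, M, data, 1, N, NULL, 0, 0);
</PRE>
 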
337
 
338
 
339
 
340
<H2><A NAME="SEC55">MPI FFTW</A></H2>
341
 
342
<P>
343
<A NAME="IDX228"></A>
344
This section describes the MPI FFTW routines for distributed-memory (and
345
shared-memory) machines supporting MPI (Message Passing Interface).  The
346
MPI routines are significantly different from the ordinary FFTW because
347
the transform data here are <EM>distributed</EM> over multiple processes,
348
so that each process gets only a portion of the array.
349
<A NAME="IDX229"></A>
350
Currently, multi-dimensional transforms of both real and complex data,
351
as well as one-dimensional transforms of complex data, are supported.
352
 
353
 
354
 
355
 
356
<H3><A NAME="SEC56">MPI FFTW Installation</A></H3>
357
 
358
<P>
The FFTW MPI library code is all located in the <CODE>mpi</CODE> subdirectory
360
of the FFTW package (along with source code for test programs).  On Unix
361
systems, the FFTW MPI libraries and header files can be automatically
362
configured, compiled, and installed along with the uniprocessor FFTW
363
libraries simply by including <CODE>--enable-mpi</CODE> in the flags to the
364
<CODE>configure</CODE> script (see Section <A HREF="fftw_6.html#SEC67">Installation on Unix</A>).
365
<A NAME="IDX230"></A>
366
 
367
 
368
<P>
369
The only requirement of the FFTW MPI code is that you have the standard
370
MPI 1.1 (or later) libraries and header files installed on your system.
371
A free implementation of MPI is available from
372
<A HREF="http://www-unix.mcs.anl.gov/mpi/mpich/">the MPICH home page</A>.
373
 
374
 
375
<P>
376
Previous versions of the FFTW MPI routines have had an unfortunate
377
tendency to expose bugs in MPI implementations.  The current version has
378
been largely rewritten, and hopefully avoids some of the problems.  If
379
you run into difficulties, try passing the optional workspace to
380
<CODE>(r)fftwnd_mpi</CODE> (see below), as this allows us to use the standard
381
(and hopefully well-tested) <CODE>MPI_Alltoall</CODE> primitive for
382
<A NAME="IDX231"></A>
383
communications.  Please let us know (<A HREF="mailto:fftw@theory.lcs.mit.edu">fftw@theory.lcs.mit.edu</A>)
384
how things work out.
385
 
386
 
387
<P>
388
<A NAME="IDX232"></A>
389
<A NAME="IDX233"></A>
390
Several test programs are included in the <CODE>mpi</CODE> directory.  The
391
ones most useful to you are probably the <CODE>fftw_mpi_test</CODE> and
392
<CODE>rfftw_mpi_test</CODE> programs, which are run just like an ordinary MPI
393
program and accept the same parameters as the other FFTW test programs
394
(c.f. <CODE>tests/README</CODE>).  For example, <CODE>mpirun <I>...params...</I>
395
fftw_mpi_test -r 0</CODE> will run non-terminating complex-transform
396
correctness tests of random dimensions.  They can also do performance
397
benchmarks.
398
 
399
 
400
 
401
 
402
<H3><A NAME="SEC57">Usage of MPI FFTW for Complex Multi-dimensional Transforms</A></H3>
403
 
404
<P>
405
Usage of the MPI FFTW routines is similar to that of the uniprocessor
406
FFTW.  We assume that the reader already understands the usage of the
407
uniprocessor FFTW routines, described elsewhere in this manual.  Some
408
familiarity with MPI is also helpful.
409
 
410
 
411
<P>
412
A typical program performing a complex two-dimensional MPI transform
413
might look something like:
414
 
415
 
416
 
417
<PRE>
418
#include &#60;fftw_mpi.h&#62;
419
 
420
int main(int argc, char **argv)
421
{
422
      const int NX = ..., NY = ...;
423
      fftwnd_mpi_plan plan;
424
      fftw_complex *data;
425
 
426
      MPI_Init(&#38;argc,&#38;argv);
427
 
428
      plan = fftw2d_mpi_create_plan(MPI_COMM_WORLD,
429
                                    NX, NY,
430
                                    FFTW_FORWARD, FFTW_ESTIMATE);
431
 
432
      ...allocate and initialize data...
433
 
      fftwnd_mpi(plan, 1, data, NULL, FFTW_NORMAL_ORDER);
435
 
436
      ...
437
 
438
      fftwnd_mpi_destroy_plan(plan);
439
      MPI_Finalize();
440
}
441
</PRE>
442
 
443
<P>
444
The calls to <CODE>MPI_Init</CODE> and <CODE>MPI_Finalize</CODE> are required in all
445
<A NAME="IDX234"></A>
446
<A NAME="IDX235"></A>
447
MPI programs; see the <A HREF="http://www.mcs.anl.gov/mpi/">MPI home page</A>
448
for more information.  Note that all of your processes run the program
449
in parallel, as a group; there is no explicit launching of
450
threads/processes in an MPI program.
451
 
452
 
453
<P>
454
<A NAME="IDX236"></A>
455
As in the ordinary FFTW, the first thing we do is to create a plan (of
456
type <CODE>fftwnd_mpi_plan</CODE>), using:
457
 
458
 
459
 
460
<PRE>
461
fftwnd_mpi_plan fftw2d_mpi_create_plan(MPI_Comm comm,
462
                                       int nx, int ny,
463
                                       fftw_direction dir, int flags);
464
</PRE>
465
 
466
<P>
467
<A NAME="IDX237"></A>
468
<A NAME="IDX238"></A>
469
 
470
 
471
<P>
472
Except for the first argument, the parameters are identical to those of
473
<CODE>fftw2d_create_plan</CODE>.  (There are also analogous
474
<CODE>fftwnd_mpi_create_plan</CODE> and <CODE>fftw3d_mpi_create_plan</CODE>
475
functions.  Transforms of any rank greater than one are supported.)
476
<A NAME="IDX239"></A>
477
<A NAME="IDX240"></A>
478
The first argument is an MPI <EM>communicator</EM>, which specifies the
479
group of processes that are to be involved in the transform; the
480
standard constant <CODE>MPI_COMM_WORLD</CODE> indicates all available
481
processes.
482
<A NAME="IDX241"></A>
483
 
484
 
485
<P>
486
Next, one has to allocate and initialize the data.  This is somewhat
487
tricky, because the transform data is distributed across the processes
488
involved in the transform.  It is discussed in detail by the next
489
section (see Section <A HREF="fftw_4.html#SEC58">MPI Data Layout</A>).
490
 
491
 
492
<P>
493
The actual computation of the transform is performed by the function
494
<CODE>fftwnd_mpi</CODE>, which differs somewhat from its uniprocessor
495
equivalent and is described by:
496
 
497
 
498
 
499
<PRE>
500
void fftwnd_mpi(fftwnd_mpi_plan p,
501
                int n_fields,
502
                fftw_complex *local_data, fftw_complex *work,
503
                fftwnd_mpi_output_order output_order);
504
</PRE>
505
 
506
<P>
507
<A NAME="IDX242"></A>
508
 
509
 
510
<P>
511
There are several things to notice here:
512
 
513
 
514
 
515
<UL>
516
 
517
<LI>
518
 
519
<A NAME="IDX243"></A>
520
First of all, all <CODE>fftw_mpi</CODE> transforms are in-place: the output is
521
in the <CODE>local_data</CODE> parameter, and there is no need to specify
522
<CODE>FFTW_IN_PLACE</CODE> in the plan flags.
523
 
524
<LI>
525
 
526
<A NAME="IDX244"></A>
527
<A NAME="IDX245"></A>
528
The MPI transforms also only support a limited subset of the
529
<CODE>howmany</CODE>/<CODE>stride</CODE>/<CODE>dist</CODE> functionality of the
530
uniprocessor routines: the <CODE>n_fields</CODE> parameter is equivalent to
531
<CODE>howmany=n_fields</CODE>, <CODE>stride=n_fields</CODE>, and <CODE>dist=1</CODE>.
532
(Conceptually, the <CODE>n_fields</CODE> parameter allows you to transform an
533
array of contiguous vectors, each with length <CODE>n_fields</CODE>.)
534
<CODE>n_fields</CODE> is <CODE>1</CODE> if you are only transforming a single,
535
ordinary array.
536
 
537
<LI>
538
 
539
The <CODE>work</CODE> parameter is an optional workspace.  If it is not
540
<CODE>NULL</CODE>, it should be exactly the same size as the <CODE>local_data</CODE>
541
array.  If it is provided, FFTW is able to use the built-in
542
<CODE>MPI_Alltoall</CODE> primitive for (often) greater efficiency at the
543
<A NAME="IDX246"></A>
expense of extra storage space.  (See the short example following this
list.)
545
 
546
<LI>
547
 
548
Finally, the last parameter specifies whether the output data has the
549
same ordering as the input data (<CODE>FFTW_NORMAL_ORDER</CODE>), or if it is
550
transposed (<CODE>FFTW_TRANSPOSED_ORDER</CODE>).  Leaving the data transposed
551
<A NAME="IDX247"></A>
552
<A NAME="IDX248"></A>
553
results in significant performance improvements due to a saved
554
communication step (needed to un-transpose the data).  Specifically, the
555
first two dimensions of the array are transposed, as is described in
556
more detail by the next section.
557
 
558
</UL>
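 
<P>
For example, to supply a workspace and request transposed output for the
plan created above, one might write the following sketch (here
<CODE>total_local_size</CODE> comes from <CODE>fftwnd_mpi_local_sizes</CODE>,
described in the next section):
 
 
<PRE>
work = (fftw_complex*) malloc(sizeof(fftw_complex) * total_local_size);
 
/* the workspace lets FFTW use MPI_Alltoall, and the transposed
   output saves a communication step */
fftwnd_mpi(plan, 1, data, work, FFTW_TRANSPOSED_ORDER);
</PRE>
 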
559
 
560
<P>
561
<A NAME="IDX249"></A>
562
The output of <CODE>fftwnd_mpi</CODE> is identical to that of the
563
corresponding uniprocessor transform.  In particular, you should recall
564
our conventions for normalization and the sign of the transform
565
exponent.
566
 
567
 
568
<P>
569
The same plan can be used to compute many transforms of the same size.
570
After you are done with it, you should deallocate it by calling
571
<CODE>fftwnd_mpi_destroy_plan</CODE>.
572
<A NAME="IDX250"></A>
573
 
574
 
575
<P>
576
<A NAME="IDX251"></A>
577
<A NAME="IDX252"></A>
578
<B>Important:</B> The FFTW MPI routines must be called in the same order by
579
all processes involved in the transform.  You should assume that they
580
all are blocking, as if each contained a call to <CODE>MPI_Barrier</CODE>.
581
 
582
 
583
<P>
584
Programs using the FFTW MPI routines should be linked with
585
<CODE>-lfftw_mpi -lfftw -lm</CODE> on Unix, in addition to whatever libraries
586
are required for MPI.
587
<A NAME="IDX253"></A>
588
 
589
 
590
 
591
 
592
<H3><A NAME="SEC58">MPI Data Layout</A></H3>
593
 
594
<P>
595
<A NAME="IDX254"></A>
596
<A NAME="IDX255"></A>
597
The transform data used by the MPI FFTW routines is <EM>distributed</EM>: a
598
distinct portion of it resides with each process involved in the
599
transform.  This allows the transform to be parallelized, for example,
600
over a cluster of workstations, each with its own separate memory, so
601
that you can take advantage of the total memory of all the processors
602
you are parallelizing over.
603
 
604
 
605
<P>
606
In particular, the array is divided according to the rows (first
607
dimension) of the data: each process gets a subset of the rows of the
608
<A NAME="IDX256"></A>
609
data.  (This is sometimes called a "slab decomposition.")  One
610
consequence of this is that you can't take advantage of more processors
than you have rows (e.g. a <CODE>64x64x64</CODE> matrix can use at most 64
612
processors).  This isn't usually much of a limitation, however, as each
613
processor needs a fair amount of data in order for the
parallel-computation benefits to outweigh the communications costs.
615
 
616
 
617
<P>
618
Below, the first dimension of the data will be referred to as
619
<SAMP>`<CODE>x</CODE>'</SAMP> and the second dimension as <SAMP>`<CODE>y</CODE>'</SAMP>.
620
 
621
 
622
<P>
623
FFTW supplies a routine to tell you exactly how much data resides on the
624
current process:
625
 
626
 
627
 
628
<PRE>
629
void fftwnd_mpi_local_sizes(fftwnd_mpi_plan p,
630
                            int *local_nx,
631
                            int *local_x_start,
632
                            int *local_ny_after_transpose,
633
                            int *local_y_start_after_transpose,
634
                            int *total_local_size);
635
</PRE>
636
 
637
<P>
638
<A NAME="IDX257"></A>
639
 
640
 
641
<P>
642
Given a plan <CODE>p</CODE>, the other parameters of this routine are set to
643
values describing the required data layout, described below.
644
 
645
 
646
<P>
647
<CODE>total_local_size</CODE> is the number of <CODE>fftw_complex</CODE> elements
648
that you must allocate for your local data (and workspace, if you
649
choose).  (This value should, of course, be multiplied by
650
<CODE>n_fields</CODE> if that parameter to <CODE>fftwnd_mpi</CODE> is not <CODE>1</CODE>.)
651
 
652
 
653
<P>
654
The data on the current process has <CODE>local_nx</CODE> rows, starting at
655
row <CODE>local_x_start</CODE>.  If <CODE>fftwnd_mpi</CODE> is called with
656
<CODE>FFTW_TRANSPOSED_ORDER</CODE> output, then <CODE>y</CODE> will be the first
657
dimension of the output, and the local <CODE>y</CODE> extent will be given by
658
<CODE>local_ny_after_transpose</CODE> and
659
<CODE>local_y_start_after_transpose</CODE>.  Otherwise, the output has the
660
same dimensions and layout as the input.
661
 
662
 
663
<P>
664
For instance, suppose you want to transform three-dimensional data of
665
size <CODE>nx x ny x nz</CODE>.  Then, the current process will store a subset
666
of this data, of size <CODE>local_nx x ny x nz</CODE>, where the <CODE>x</CODE>
667
indices correspond to the range <CODE>local_x_start</CODE> to
668
<CODE>local_x_start+local_nx-1</CODE> in the "real" (i.e. logical) array.
669
If <CODE>fftwnd_mpi</CODE> is called with <CODE>FFTW_TRANSPOSED_ORDER</CODE> output,
670
<A NAME="IDX258"></A>
671
then the result will be a <CODE>ny x nx x nz</CODE> array, of which a
672
<CODE>local_ny_after_transpose x nx x nz</CODE> subset is stored on the
673
current process (corresponding to <CODE>y</CODE> values starting at
674
<CODE>local_y_start_after_transpose</CODE>).
675
 
676
 
677
<P>
678
The following is an example of allocating such a three-dimensional
array (<CODE>local_data</CODE>) before the transform and initializing it to
680
some function <CODE>f(x,y,z)</CODE>:
681
 
682
 
683
 
684
<PRE>
685
        fftwnd_mpi_local_sizes(plan, &#38;local_nx, &#38;local_x_start,
686
                               &#38;local_ny_after_transpose,
687
                               &#38;local_y_start_after_transpose,
688
                               &#38;total_local_size);
689
 
690
        local_data = (fftw_complex*) malloc(sizeof(fftw_complex) *
691
                                            total_local_size);
692
 
693
        for (x = 0; x &#60; local_nx; ++x)
694
                for (y = 0; y &#60; ny; ++y)
695
                        for (z = 0; z &#60; nz; ++z)
696
                                local_data[(x*ny + y)*nz + z]
697
                                        = f(x + local_x_start, y, z);
698
</PRE>
699
 
700
<P>
701
Some important things to remember:
702
 
703
 
704
 
705
<UL>
706
 
707
<LI>
708
 
709
Although the local data is of dimensions <CODE>local_nx x ny x nz</CODE> in
710
the above example, do <EM>not</EM> allocate the array to be of size
711
<CODE>local_nx*ny*nz</CODE>.  Use <CODE>total_local_size</CODE> instead.
712
 
713
<LI>
714
 
715
The amount of data on each process will not necessarily be the same; in
716
fact, <CODE>local_nx</CODE> may even be zero for some processes.  (For
717
example, suppose you are doing a <CODE>6x6</CODE> transform on four
718
processors.  There is no way to effectively use the fourth processor in
719
a slab decomposition, so we leave it empty.  Proof left as an exercise
720
for the reader.)
721
 
722
<LI>
723
 
724
<A NAME="IDX259"></A>
725
All arrays are, of course, in row-major order (see Section <A HREF="fftw_2.html#SEC7">Multi-dimensional Array Format</A>).
726
 
727
<LI>
728
 
729
If you want to compute the inverse transform of the output of
730
<CODE>fftwnd_mpi</CODE>, the dimensions of the inverse transform are given by
731
the dimensions of the output of the forward transform.  For example, if
732
you are using <CODE>FFTW_TRANSPOSED_ORDER</CODE> output in the above example,
733
then the inverse plan should be created with dimensions <CODE>ny x nx x
nz</CODE>.  (See the sketch following this list.)
735
 
736
<LI>
737
 
738
The data layout only depends upon the dimensions of the array, not on
739
the plan, so you are guaranteed that different plans for the same size
740
(or inverse plans) will use the same (consistent) data layouts.
741
 
742
</UL>
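 
<P>
For instance, continuing the <CODE>nx x ny x nz</CODE> example with
<CODE>FFTW_TRANSPOSED_ORDER</CODE> output, the inverse plan could be created
along these lines (sketch):
 
 
<PRE>
/* the forward output is an ny x nx x nz (transposed) array, so the
   inverse plan is created with the transposed dimensions: */
iplan = fftw3d_mpi_create_plan(MPI_COMM_WORLD,
                               ny, nx, nz,
                               FFTW_BACKWARD, FFTW_ESTIMATE);
</PRE>
 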
743
 
744
 
745
 
746
<H3><A NAME="SEC59">Usage of MPI FFTW for Real Multi-dimensional Transforms</A></H3>
747
 
748
<P>
MPI transforms specialized for real data are also available, similar to
750
the uniprocessor <CODE>rfftwnd</CODE> transforms.  Just as in the uniprocessor
751
case, the real-data MPI functions gain roughly a factor of two in speed
752
(and save a factor of two in space) at the expense of more complicated
753
data formats in the calling program.  Before reading this section, you
754
should definitely understand how to call the uniprocessor <CODE>rfftwnd</CODE>
755
functions and also the complex MPI FFTW functions.
756
 
757
 
758
<P>
759
The following is an example of a program using <CODE>rfftwnd_mpi</CODE>.  It
760
computes the size <CODE>nx x ny x nz</CODE> transform of a real function
761
<CODE>f(x,y,z)</CODE>, multiplies the imaginary part by <CODE>2</CODE> for fun, then
762
computes the inverse transform.  (We'll also use
763
<CODE>FFTW_TRANSPOSED_ORDER</CODE> output for the transform, and additionally
764
supply the optional workspace parameter to <CODE>rfftwnd_mpi</CODE>, just to
765
add a little spice.)
766
 
767
 
768
 
769
<PRE>
770
#include &#60;rfftw_mpi.h&#62;
771
 
772
int main(int argc, char **argv)
773
{
774
     const int nx = ..., ny = ..., nz = ...;
775
     int local_nx, local_x_start, local_ny_after_transpose,
776
         local_y_start_after_transpose, total_local_size;
777
     int x, y, z;
778
     rfftwnd_mpi_plan plan, iplan;
779
     fftw_real *data, *work;
780
     fftw_complex *cdata;
781
 
782
     MPI_Init(&#38;argc,&#38;argv);
783
 
784
     /* create the forward and backward plans: */
785
     plan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
786
                                    nx, ny, nz,
787
                                    FFTW_REAL_TO_COMPLEX,
788
                                    FFTW_ESTIMATE);
789
<A NAME="IDX260"></A>     iplan = rfftw3d_mpi_create_plan(MPI_COMM_WORLD,
790
      /* dim.'s of REAL data --&#62; */  nx, ny, nz,
791
                                     FFTW_COMPLEX_TO_REAL,
792
                                     FFTW_ESTIMATE);
793
 
794
     rfftwnd_mpi_local_sizes(plan, &#38;local_nx, &#38;local_x_start,
795
                            &#38;local_ny_after_transpose,
796
                            &#38;local_y_start_after_transpose,
797
                            &#38;total_local_size);
798
<A NAME="IDX261"></A>
799
     data = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);
800
 
801
     /* workspace is the same size as the data: */
802
     work = (fftw_real*) malloc(sizeof(fftw_real) * total_local_size);
803
 
804
     /* initialize data to f(x,y,z): */
805
     for (x = 0; x &#60; local_nx; ++x)
806
             for (y = 0; y &#60; ny; ++y)
807
                     for (z = 0; z &#60; nz; ++z)
808
                             data[(x*ny + y) * (2*(nz/2+1)) + z]
809
                                     = f(x + local_x_start, y, z);
810
 
811
     /* Now, compute the forward transform: */
812
     rfftwnd_mpi(plan, 1, data, work, FFTW_TRANSPOSED_ORDER);
813
<A NAME="IDX262"></A>
814
     /* the data is now complex, so typecast a pointer: */
815
     cdata = (fftw_complex*) data;
816
 
817
     /* multiply imaginary part by 2, for fun:
818
        (note that the data is transposed) */
819
     for (y = 0; y &#60; local_ny_after_transpose; ++y)
820
             for (x = 0; x &#60; nx; ++x)
821
                     for (z = 0; z &#60; (nz/2+1); ++z)
822
                             cdata[(y*nx + x) * (nz/2+1) + z].im
823
                                     *= 2.0;
824
 
825
     /* Finally, compute the inverse transform; the result
826
        is transposed back to the original data layout: */
827
     rfftwnd_mpi(iplan, 1, data, work, FFTW_TRANSPOSED_ORDER);
828
 
829
     free(data);
830
     free(work);
831
     rfftwnd_mpi_destroy_plan(plan);
832
<A NAME="IDX263"></A>     rfftwnd_mpi_destroy_plan(iplan);
833
     MPI_Finalize();
834
}
835
</PRE>
836
 
837
<P>
838
There's a lot of stuff in this example, but it's all just what you would
839
have guessed, right?  We replaced all the <CODE>fftwnd_mpi*</CODE> functions
840
by <CODE>rfftwnd_mpi*</CODE>, but otherwise the parameters were pretty much
the same.  The data is distributed among the processes just like for
842
the complex transforms (see Section <A HREF="fftw_4.html#SEC58">MPI Data Layout</A>), but in addition the
843
final dimension is padded just like it is for the uniprocessor in-place
844
real transforms (see Section <A HREF="fftw_3.html#SEC37">Array Dimensions for Real Multi-dimensional Transforms</A>).
845
<A NAME="IDX264"></A>
846
In particular, the <CODE>z</CODE> dimension of the real input data is padded
847
to a size <CODE>2*(nz/2+1)</CODE>, and after the transform it contains
848
<CODE>nz/2+1</CODE> complex values.
849
<A NAME="IDX265"></A>
850
<A NAME="IDX266"></A>
851
 
852
 
853
<P>
854
Some other important things to know about the real MPI transforms:
855
 
856
 
857
 
858
<UL>
859
 
860
<LI>
861
 
862
As for the uniprocessor <CODE>rfftwnd_create_plan</CODE>, the dimensions
863
passed for the <CODE>FFTW_COMPLEX_TO_REAL</CODE> plan are those of the
864
<EM>real</EM> data.  In particular, even when <CODE>FFTW_TRANSPOSED_ORDER</CODE>
865
<A NAME="IDX267"></A>
866
<A NAME="IDX268"></A>
867
is used as in this case, the dimensions are those of the (untransposed)
868
real output, not the (transposed) complex input.  (For the complex MPI
869
transforms, on the other hand, the dimensions are always those of the
870
input array.)
871
 
872
<LI>
873
 
The output ordering of the transform (<CODE>FFTW_NORMAL_ORDER</CODE> or
<CODE>FFTW_TRANSPOSED_ORDER</CODE>) <EM>must</EM> be the same for both forward
876
and backward transforms.  (This is not required in the complex case.)
877
 
878
<LI>
879
 
880
<CODE>total_local_size</CODE> is the required size in <CODE>fftw_real</CODE> values,
881
not <CODE>fftw_complex</CODE> values as it is for the complex transforms.
882
 
883
<LI>
884
 
885
<CODE>local_ny_after_transpose</CODE> and <CODE>local_y_start_after_transpose</CODE>
886
describe the portion of the array after the transform; that is, they are
887
indices in the complex array for an <CODE>FFTW_REAL_TO_COMPLEX</CODE> transform
888
and in the real array for an <CODE>FFTW_COMPLEX_TO_REAL</CODE> transform.
889
 
890
<LI>
891
 
892
<CODE>rfftwnd_mpi</CODE> always expects <CODE>fftw_real*</CODE> array arguments, but
893
of course these pointers can refer to either real or complex arrays,
894
depending upon which side of the transform you are on.  Just as for
895
in-place uniprocessor real transforms (and also in the example above),
896
this is most easily handled by typecasting to a complex pointer when
897
handling the complex data.
898
 
899
<LI>
900
 
901
As with the complex transforms, there are also
<CODE>rfftwnd_mpi_create_plan</CODE> and <CODE>rfftw2d_mpi_create_plan</CODE> functions, and
903
any rank greater than one is supported.
904
<A NAME="IDX269"></A>
905
<A NAME="IDX270"></A>
906
 
907
</UL>
908
 
909
<P>
910
Programs using the MPI FFTW real transforms should link with
911
<CODE>-lrfftw_mpi -lfftw_mpi -lrfftw -lfftw -lm</CODE> on Unix.
912
<A NAME="IDX271"></A>
913
 
914
 
915
 
916
 
917
<H3><A NAME="SEC60">Usage of MPI FFTW for Complex One-dimensional Transforms</A></H3>
918
 
919
<P>
920
The MPI FFTW also includes routines for parallel one-dimensional
921
transforms of complex data (only).  Although the speedup is generally
922
worse than it is for the multi-dimensional routines,<A NAME="DOCF6" HREF="fftw_foot.html#FOOT6">(6)</A> these distributed-memory one-dimensional transforms are
923
especially useful for performing one-dimensional transforms that don't
924
fit into the memory of a single machine.
925
 
926
 
927
<P>
928
The usage of these routines is straightforward, and is similar to that
929
of the multi-dimensional MPI transform functions.  You first include the
930
header <CODE>&#60;fftw_mpi.h&#62;</CODE> and then create a plan by calling:
931
 
932
 
933
 
934
<PRE>
935
fftw_mpi_plan fftw_mpi_create_plan(MPI_Comm comm, int n,
936
                                   fftw_direction dir, int flags);
937
</PRE>
938
 
939
<P>
940
<A NAME="IDX272"></A>
941
<A NAME="IDX273"></A>
942
 
943
 
944
<P>
945
The last three arguments are the same as for <CODE>fftw_create_plan</CODE>
946
(except that all MPI transforms are automatically <CODE>FFTW_IN_PLACE</CODE>).
947
The first argument specifies the group of processes you are using, and
948
is usually <CODE>MPI_COMM_WORLD</CODE> (all processes).
949
<A NAME="IDX274"></A>
950
A plan can be used for many transforms of the same size, and is
951
destroyed when you are done with it by calling
952
<CODE>fftw_mpi_destroy_plan(plan)</CODE>.
953
<A NAME="IDX275"></A>
954
 
955
 
956
<P>
957
If you don't care about the ordering of the input or output data of the
958
transform, you can include <CODE>FFTW_SCRAMBLED_INPUT</CODE> and/or
959
<CODE>FFTW_SCRAMBLED_OUTPUT</CODE> in the <CODE>flags</CODE>.
960
<A NAME="IDX276"></A>
961
<A NAME="IDX277"></A>
962
<A NAME="IDX278"></A>
963
These save some communications at the expense of having the input and/or
964
output reordered in an undocumented way.  For example, if you are
965
performing an FFT-based convolution, you might use
966
<CODE>FFTW_SCRAMBLED_OUTPUT</CODE> for the forward transform and
967
<CODE>FFTW_SCRAMBLED_INPUT</CODE> for the inverse transform.
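 
<P>
For instance, the two plans for such a convolution might be created along
these lines (sketch; the flags are simply ORed into the <CODE>flags</CODE>
argument):
 
 
<PRE>
/* forward transform: don't bother unscrambling the output */
p  = fftw_mpi_create_plan(MPI_COMM_WORLD, n, FFTW_FORWARD,
                          FFTW_ESTIMATE | FFTW_SCRAMBLED_OUTPUT);
 
/* inverse transform: accept the (identically) scrambled data as input */
ip = fftw_mpi_create_plan(MPI_COMM_WORLD, n, FFTW_BACKWARD,
                          FFTW_ESTIMATE | FFTW_SCRAMBLED_INPUT);
</PRE>
 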
968
 
969
 
970
<P>
971
The transform itself is computed by:
972
 
973
 
974
 
975
<PRE>
976
void fftw_mpi(fftw_mpi_plan p, int n_fields,
977
              fftw_complex *local_data, fftw_complex *work);
978
</PRE>
979
 
980
<P>
981
<A NAME="IDX279"></A>
982
 
983
 
984
<P>
985
<A NAME="IDX280"></A>
986
<CODE>n_fields</CODE>, as in <CODE>fftwnd_mpi</CODE>, is equivalent to
987
<CODE>howmany=n_fields</CODE>, <CODE>stride=n_fields</CODE>, and <CODE>dist=1</CODE>, and
988
should be <CODE>1</CODE> when you are computing the transform of a single
989
array.  <CODE>local_data</CODE> contains the portion of the array local to the
990
current process, described below.  <CODE>work</CODE> is either <CODE>NULL</CODE> or
991
an array exactly the same size as <CODE>local_data</CODE>; in the latter case,
992
FFTW can use the <CODE>MPI_Alltoall</CODE> communications primitive which is
993
(usually) faster at the expense of extra storage.  Upon return,
994
<CODE>local_data</CODE> contains the portion of the output local to the
995
current process (see below).
996
<A NAME="IDX281"></A>
997
 
998
 
999
<P>
1000
<A NAME="IDX282"></A>
1001
To find out what portion of the array is stored local to the current
1002
process, you call the following routine:
1003
 
1004
 
1005
 
1006
<PRE>
1007
void fftw_mpi_local_sizes(fftw_mpi_plan p,
1008
                          int *local_n, int *local_start,
1009
                          int *local_n_after_transform,
1010
                          int *local_start_after_transform,
1011
                          int *total_local_size);
1012
</PRE>
1013
 
1014
<P>
1015
<A NAME="IDX283"></A>
1016
 
1017
 
1018
<P>
1019
<CODE>total_local_size</CODE> is the number of <CODE>fftw_complex</CODE> elements
1020
you should actually allocate for <CODE>local_data</CODE> (and <CODE>work</CODE>).
1021
<CODE>local_n</CODE> and <CODE>local_start</CODE> indicate that the current process
1022
stores <CODE>local_n</CODE> elements corresponding to the indices
1023
<CODE>local_start</CODE> to <CODE>local_start+local_n-1</CODE> in the "real"
1024
array.  <EM>After the transform, the process may store a different
1025
portion of the array.</EM>  The portion of the data stored on the process
1026
after the transform is given by <CODE>local_n_after_transform</CODE> and
1027
<CODE>local_start_after_transform</CODE>.  This data is exactly the same as a
1028
contiguous segment of the corresponding uniprocessor transform output
1029
(i.e. an in-order sequence of sequential frequency bins).
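 
<P>
Putting the pieces together, a minimal one-dimensional MPI transform might
look something like the following sketch (data initialization and error
checking omitted):
 
 
<PRE>
#include &#60;fftw_mpi.h&#62;
 
int main(int argc, char **argv)
{
     const int n = ...;
     int local_n, local_start, local_n_after_transform,
         local_start_after_transform, total_local_size;
     fftw_mpi_plan plan;
     fftw_complex *data, *work;
 
     MPI_Init(&#38;argc, &#38;argv);
 
     plan = fftw_mpi_create_plan(MPI_COMM_WORLD, n,
                                 FFTW_FORWARD, FFTW_ESTIMATE);
 
     fftw_mpi_local_sizes(plan, &#38;local_n, &#38;local_start,
                          &#38;local_n_after_transform,
                          &#38;local_start_after_transform,
                          &#38;total_local_size);
 
     data = (fftw_complex*) malloc(sizeof(fftw_complex) * total_local_size);
     work = (fftw_complex*) malloc(sizeof(fftw_complex) * total_local_size);
 
     /* ...initialize data[0..local_n-1], corresponding to indices
        local_start..local_start+local_n-1 of the "real" array... */
 
     fftw_mpi(plan, 1, data, work);  /* work is optional; NULL is fine too */
 
     /* data now holds local_n_after_transform output elements,
        starting at index local_start_after_transform */
 
     free(work);
     free(data);
     fftw_mpi_destroy_plan(plan);
     MPI_Finalize();
     return 0;
}
</PRE>
 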
1030
 
1031
 
1032
<P>
1033
Note that, if you compute both a forward and a backward transform of the
1034
same size, the local sizes are guaranteed to be consistent.  That is,
1035
the local size after the forward transform will be the same as the local
1036
size before the backward transform, and vice versa.
1037
 
1038
 
1039
<P>
1040
Programs using the FFTW MPI routines should be linked with
1041
<CODE>-lfftw_mpi -lfftw -lm</CODE> on Unix, in addition to whatever libraries
1042
are required for MPI.
1043
<A NAME="IDX284"></A>
1044
 
1045
 
1046
 
1047
 
1048
<H3><A NAME="SEC61">MPI Tips</A></H3>
1049
 
1050
<P>
1051
There are several things you should consider in order to get the best
1052
performance out of the MPI FFTW routines.
1053
 
1054
 
1055
<P>
1056
<A NAME="IDX285"></A>
1057
First, if possible, the first and second dimensions of your data should
1058
be divisible by the number of processes you are using.  (If only one can
1059
be divisible, then you should choose the first dimension.)  This allows
1060
the computational load to be spread evenly among the processes, and also
1061
reduces the communications complexity and overhead.  In the
1062
one-dimensional transform case, the size of the transform should ideally
1063
be divisible by the <EM>square</EM> of the number of processors.
1064
 
1065
 
1066
<P>
1067
<A NAME="IDX286"></A>
1068
Second, you should consider using the <CODE>FFTW_TRANSPOSED_ORDER</CODE>
1069
output format if it is not too burdensome.  The speed gains from
1070
communications savings are usually substantial.
1071
 
1072
 
1073
<P>
1074
Third, you should consider allocating a workspace for
1075
<CODE>(r)fftw(nd)_mpi</CODE>, as this can often
1076
(but not always) improve performance (at the cost of extra storage).
1077
 
1078
 
1079
<P>
1080
Fourth, you should experiment with the best number of processors to use
1081
for your problem.  (There comes a point of diminishing returns, when the
1082
communications costs outweigh the computational benefits.<A NAME="DOCF7" HREF="fftw_foot.html#FOOT7">(7)</A>)  The <CODE>fftw_mpi_test</CODE> program can output helpful performance
1083
benchmarks.
1084
<A NAME="IDX287"></A>
1085
<A NAME="IDX288"></A>
1086
It accepts the same parameters as the uniprocessor test programs
1087
(c.f. <CODE>tests/README</CODE>) and is run like an ordinary MPI program.  For
1088
example, <CODE>mpirun -np 4 fftw_mpi_test -s 128x128x128</CODE> will benchmark
1089
a <CODE>128x128x128</CODE> transform on four processors, reporting timings and
1090
parallel speedups for all variants of <CODE>fftwnd_mpi</CODE> (transposed,
1091
with workspace, etcetera).  (Note also that there is the
1092
<CODE>rfftw_mpi_test</CODE> program for the real transforms.)
1093
<A NAME="IDX289"></A>
1094
 
1095
 
1096
<P><HR><P>
1097
Go to the <A HREF="fftw_1.html">first</A>, <A HREF="fftw_3.html">previous</A>, <A HREF="fftw_5.html">next</A>, <A HREF="fftw_10.html">last</A> section, <A HREF="fftw_toc.html">table of contents</A>.
1098
</BODY>
1099
</HTML>