Rev 2 | Details | Compare with Previous | Last modification | View Log | RSS feed
Rev | Author | Line No. | Line |
---|---|---|---|
2 | pj | 1 | This file contains a random collection of the various hacks we did (or |
2 | did not dare to do) in order to get performance. It is not intended |
||
3 | to be complete nor up to date, but it may be instructive for people to |
||
4 | read (see also my (Matteo Frigo's) slides on the web page). |
||
5 | |||
6 | 1) Negative constants: |
||
7 | |||
8 | a = 0.5 * b; a = 0.5 * b; |
||
9 | c = 0.5 * d; c = -0.5 * d; |
||
10 | e = 1.0 + a; vs. e = 1.0 + a; |
||
11 | f = 1.0 - c; f = 1.0 + c; |
||
12 | |||
13 | The code on the left is faster. Guess why? The constants |
||
14 | 0.5 and -0.5 are stored in different memory locations and loaded |
||
15 | separately. |
||
16 | |||
17 | 2) Twiddle factors: |
||
18 | |||
19 | For some reason that escapes me, every package I have seen |
||
20 | (including FFTW 1.0) stores the twiddle factors in such |
||
21 | a way that the codelets need to access memory in a random |
||
22 | fashion. It is much faster to just store the factors in the |
||
23 | order they will be needed. This increased spatial locality |
||
24 | exploits caches and improves performance by about 30% for big |
||
25 | arrays. |
||
26 | |||
27 | 3) Loops: |
||
28 | |||
29 | You may notice that the `twiddle' codelets contain a loop. |
||
30 | This is the most logical choice, as opposed to moving the loop |
||
31 | outside the codelet. For example, one would expect that |
||
32 | |||
33 | for (i = 0; i < n; ++i) |
||
34 | foo(); |
||
35 | |||
36 | be slower than the case where the loop is inside foo(). This is |
||
37 | usually the case, *except* for the ultrasparc. FFTW would be 10% |
||
38 | faster (more or less) if the loop were outside the codelet. |
||
39 | Unfortunately, it would be slower everywhere else. If you want to |
||
40 | play with this, the codelet generator can be easily hacked... |
||
41 | |||
42 | 4) Array padding: |
||
43 | |||
44 | (This is a disgusting hack that my programmer's ethic |
||
45 | pervents me from releasing to the public). |
||
46 | |||
47 | On the IBM RS/6000, for big n, the following **** is *way* faster: |
||
48 | |||
49 | - Take the input array |
||
50 | - `inflate' it, that is, produce a bigger array in which every |
||
51 | k-th element is unused (say k=512). In other words, insert |
||
52 | holes in the input. |
||
53 | - execute fftw normally (properly hacked to keep the holes into |
||
54 | account) |
||
55 | - compact the array. |
||
56 | |||
57 | With this hack, FFTW is sensibly faster than IBM's ESSL library. |
||
58 | Sorry guys, I don't want to be responsible for releasing this |
||
59 | monster (this works only on the RS/6000, fortunately). |
||
60 | |||
61 | 5) Phase of moon: |
||
62 | |||
63 | Don't take the numbers on the web page too seriously. The |
||
64 | architectural details of modern machines make performance |
||
65 | *extremely* sensitive to little details such as where your code and |
||
66 | data are placed in memory. The ultimate example of brain damage is |
||
67 | the Pentium Pro (what a surprise...), where we had a case in which |
||
68 | adding a printf() statement to FFTW slowed down another completely |
||
69 | unrelated fft program by a factor of 20. |
||
70 | |||
71 | Our strategy to generate huge pieces of straight-line code should |
||
72 | immunize FFTW sufficiently well against these problems, but you are |
||
73 | still likely to observe weird things like: FFTW_ESTIMATE can be |
||
74 | faster than FFTW_MEASURE. |
||
75 | |||
76 | FFTW-17.0 will compute the phase of the moon in the planner and take |
||
77 | it into account. |
||
78 | |||
79 | 6) Estimator: |
||
80 | |||
81 | The estimator tries to come up with a `reasonable' plan. |
||
82 | Unfortunately, this concept is very machine-dependent. The |
||
83 | estimator in the release is tuned for the UltraSPARC. |
||
84 | |||
85 | The following two parameters can be found in fftw/planner.c : |
||
86 | |||
87 | #define NOTW_OPTIMAL_SIZE 32 |
||
88 | #define TWIDDLE_OPTIMAL_SIZE 12 |
||
89 | |||
90 | They say: use non-twiddle codelets of size close to 32, and twiddle |
||
91 | codelets of size close to 12. More or less, these are the most |
||
92 | efficient pieces of code on the UltraSPARC. Your mileage *will* |
||
93 | vary. Here are some suggestions for other machines. |
||
94 | |||
95 | Pentium |
||
96 | |||
97 | #define NOTW_OPTIMAL_SIZE 8 (or 16) |
||
98 | #define TWIDDLE_OPTIMAL_SIZE 8 |
||
99 | |||
100 | Pentium Pro: |
||
101 | |||
102 | #define NOTW_OPTIMAL_SIZE 32 (or 16) |
||
103 | #define TWIDDLE_OPTIMAL_SIZE 2 |
||
104 | |||
105 | |||
106 | RS/6000: |
||
107 | |||
108 | #define NOTW_OPTIMAL_SIZE 32 |
||
109 | #define TWIDDLE_OPTIMAL_SIZE 32 |
||
110 | |||
111 | The basic idea is: compute some plans for sizes you most care |
||
112 | about, print the plan and plug in the numbers that appear more |
||
113 | often (or some kind of average). Note the dramatic difference |
||
114 | between Pentium an Pentium Pro. (NO LONGER TRUE WITH --enable-i386-hacks) |
||
115 | |||
116 | 7) Stack alignment: |
||
117 | |||
118 | Pentium-type processors impose a huge performance penalty if double- |
||
119 | precision values are not aligned to 8-byte boundaries in memory. |
||
120 | (We have seen factors of 3 or more in tight loops.) Unfortunately, |
||
121 | the Intel ABI specifies that variables on the stack need only be aligned to |
||
122 | 4-byte boundaries. Even more unfortunately, this convention is followed |
||
123 | by Linux/x86 and gcc. |
||
124 | |||
125 | To get around this, we wrote special macros (HACK_ALIGN_STACK_EVEN |
||
126 | and HACK_ALIGN_STACK_ODD) that take effect when FFTW is compiled |
||
127 | with gcc/egcs and 'configure --enable-i386-hacks' is used. These |
||
128 | macros are called before the computation-intensive "codelets" |
||
129 | of FFTW. They use the gcc __builtin_alloca function to examine the |
||
130 | stack pointer and align the stack to an even or odd 4-byte boundary, |
||
131 | depending upon how many values will be pushed on the stack to call |
||
132 | the codelet. |