Subversion Repositories shark

Rev

Details | Last modification | View Log | RSS feed

Rev Author Line No. Line
2 pj 1
This file contains a random collection of the various hacks we did (or
2
did not dare to do) in order to get performance.  It is not intended
3
to be complete nor up to date, but it may be instructive for people to
4
read (see also my (Matteo Frigo's) slides on the web page).
5
 
6
1) Negative constants:
7
 
8
        a = 0.5 * b;               a = 0.5 * b;
9
        c = 0.5 * d;	           c = -0.5 * d;
10
        e = 1.0 + a;	 vs.       e = 1.0 + a;
11
        f = 1.0 - c;	           f = 1.0 + c;
12
 
13
   The code on the left is faster.  Guess why? The constants
14
   0.5 and -0.5 are stored in different memory locations and loaded
15
   separately.
16
 
17
2) Twiddle factors:
18
 
19
   For some reason that escapes me, every package I have seen
20
   (including FFTW 1.0) stores the twiddle factors in such
21
   a way that the codelets need to access memory in a random
22
   fashion.  It is much faster to just store the factors in the
23
   order they will be needed.  This increased spatial locality
24
   exploits caches and improves performance by about 30% for big
25
   arrays.
26
 
27
3) Loops:
28
 
29
   You may notice that the `twiddle' codelets contain a loop.
30
   This is the most logical choice, as opposed to moving the loop
31
   outside the codelet.  For example, one would expect that
32
 
33
      for (i = 0; i < n; ++i)
34
          foo();
35
 
36
   be slower than the case where the loop is inside foo().  This is
37
   usually the case, *except* for the ultrasparc.  FFTW would be 10%
38
   faster (more or less) if the loop were outside the codelet.
39
   Unfortunately, it would be slower everywhere else. If you want to
40
   play with this, the codelet generator can be easily hacked...
41
 
42
4) Array padding:
43
 
44
   (This is a disgusting hack that my programmer's ethic
45
   pervents me from releasing to the public).
46
 
47
   On the IBM RS/6000, for big n, the following **** is *way* faster:
48
 
49
   - Take the input array
50
   - `inflate' it, that is, produce a bigger array in which every
51
     k-th element is unused (say k=512).  In other words, insert
52
     holes in the input.
53
   - execute fftw normally (properly hacked to keep the holes into
54
     account)
55
   - compact the array.
56
 
57
   With this hack, FFTW is sensibly faster than IBM's ESSL library.
58
   Sorry guys, I don't want to be responsible for releasing this
59
   monster (this works only on the RS/6000, fortunately).
60
 
61
5) Phase of moon:
62
 
63
   Don't take the numbers on the web page too seriously.  The
64
   architectural details of modern machines make performance
65
   *extremely* sensitive to little details such as where your code and
66
   data are placed in memory.  The ultimate example of brain damage is
67
   the Pentium Pro (what a surprise...), where we had a case in which
68
   adding a printf() statement to FFTW slowed down another completely
69
   unrelated fft program by a factor of 20.
70
 
71
   Our strategy to generate huge pieces of straight-line code should
72
   immunize FFTW sufficiently well against these problems, but you are
73
   still likely to observe weird things like: FFTW_ESTIMATE can be
74
   faster than FFTW_MEASURE.
75
 
76
   FFTW-17.0 will compute the phase of the moon in the planner and take
77
   it into account.
78
 
79
6) Estimator:
80
 
81
   The estimator tries to come up with a `reasonable' plan.
82
   Unfortunately, this concept is very machine-dependent.  The
83
   estimator in the release is tuned for the UltraSPARC.
84
 
85
   The following two parameters can be found in fftw/planner.c :
86
 
87
   #define NOTW_OPTIMAL_SIZE 32
88
   #define TWIDDLE_OPTIMAL_SIZE 12
89
 
90
   They say: use non-twiddle codelets of size close to 32, and twiddle
91
   codelets of size close to 12.  More or less, these are the most
92
   efficient pieces of code on the UltraSPARC.  Your mileage *will*
93
   vary.  Here are some suggestions for other machines.
94
 
95
   Pentium
96
 
97
   #define NOTW_OPTIMAL_SIZE 8   (or 16)
98
   #define TWIDDLE_OPTIMAL_SIZE 8
99
 
100
   Pentium Pro:
101
 
102
   #define NOTW_OPTIMAL_SIZE 32   (or 16)
103
   #define TWIDDLE_OPTIMAL_SIZE 2
104
 
105
 
106
   RS/6000:
107
 
108
   #define NOTW_OPTIMAL_SIZE 32
109
   #define TWIDDLE_OPTIMAL_SIZE 32
110
 
111
   The basic idea is: compute some plans for sizes you most care
112
   about, print the plan and plug in the numbers that appear more
113
   often (or some kind of average).  Note the dramatic difference
114
   between Pentium an Pentium Pro. (NO LONGER TRUE WITH --enable-i386-hacks)
115
 
116
7) Stack alignment:
117
 
118
   Pentium-type processors impose a huge performance penalty if double-
119
   precision values are not aligned to 8-byte boundaries in memory.
120
   (We have seen factors of 3 or more in tight loops.)  Unfortunately,
121
   the Intel ABI specifies that variables on the stack need only be aligned to
122
   4-byte boundaries.  Even more unfortunately, this convention is followed
123
   by Linux/x86 and gcc.
124
 
125
   To get around this, we wrote special macros (HACK_ALIGN_STACK_EVEN
126
   and HACK_ALIGN_STACK_ODD) that take effect when FFTW is compiled
127
   with gcc/egcs and 'configure --enable-i386-hacks' is used.  These
128
   macros are called before the computation-intensive "codelets"
129
   of FFTW.  They use the gcc __builtin_alloca function to examine the
130
   stack pointer and align the stack to an even or odd 4-byte boundary,
131
   depending upon how many values will be pushed on the stack to call
132
   the codelet.