Rev 2 | Details | Compare with Previous | Last modification | View Log | RSS feed

Rev | Author | Line No. | Line |
---|---|---|---|

2 | pj | 1 | This file contains a random collection of the various hacks we did (or |

2 | did not dare to do) in order to get performance. It is not intended |
||

3 | to be complete nor up to date, but it may be instructive for people to |
||

4 | read (see also my (Matteo Frigo's) slides on the web page). |
||

5 | |||

6 | 1) Negative constants: |
||

7 | |||

8 | a = 0.5 * b; a = 0.5 * b; |
||

9 | c = 0.5 * d; c = -0.5 * d; |
||

10 | e = 1.0 + a; vs. e = 1.0 + a; |
||

11 | f = 1.0 - c; f = 1.0 + c; |
||

12 | |||

13 | The code on the left is faster. Guess why? The constants |
||

14 | 0.5 and -0.5 are stored in different memory locations and loaded |
||

15 | separately. |
||

16 | |||

17 | 2) Twiddle factors: |
||

18 | |||

19 | For some reason that escapes me, every package I have seen |
||

20 | (including FFTW 1.0) stores the twiddle factors in such |
||

21 | a way that the codelets need to access memory in a random |
||

22 | fashion. It is much faster to just store the factors in the |
||

23 | order they will be needed. This increased spatial locality |
||

24 | exploits caches and improves performance by about 30% for big |
||

25 | arrays. |
||

26 | |||

27 | 3) Loops: |
||

28 | |||

29 | You may notice that the `twiddle' codelets contain a loop. |
||

30 | This is the most logical choice, as opposed to moving the loop |
||

31 | outside the codelet. For example, one would expect that |
||

32 | |||

33 | for (i = 0; i < n; ++i) |
||

34 | foo(); |
||

35 | |||

36 | be slower than the case where the loop is inside foo(). This is |
||

37 | usually the case, *except* for the ultrasparc. FFTW would be 10% |
||

38 | faster (more or less) if the loop were outside the codelet. |
||

39 | Unfortunately, it would be slower everywhere else. If you want to |
||

40 | play with this, the codelet generator can be easily hacked... |
||

41 | |||

42 | 4) Array padding: |
||

43 | |||

44 | (This is a disgusting hack that my programmer's ethic |
||

45 | pervents me from releasing to the public). |
||

46 | |||

47 | On the IBM RS/6000, for big n, the following **** is *way* faster: |
||

48 | |||

49 | - Take the input array |
||

50 | - `inflate' it, that is, produce a bigger array in which every |
||

51 | k-th element is unused (say k=512). In other words, insert |
||

52 | holes in the input. |
||

53 | - execute fftw normally (properly hacked to keep the holes into |
||

54 | account) |
||

55 | - compact the array. |
||

56 | |||

57 | With this hack, FFTW is sensibly faster than IBM's ESSL library. |
||

58 | Sorry guys, I don't want to be responsible for releasing this |
||

59 | monster (this works only on the RS/6000, fortunately). |
||

60 | |||

61 | 5) Phase of moon: |
||

62 | |||

63 | Don't take the numbers on the web page too seriously. The |
||

64 | architectural details of modern machines make performance |
||

65 | *extremely* sensitive to little details such as where your code and |
||

66 | data are placed in memory. The ultimate example of brain damage is |
||

67 | the Pentium Pro (what a surprise...), where we had a case in which |
||

68 | adding a printf() statement to FFTW slowed down another completely |
||

69 | unrelated fft program by a factor of 20. |
||

70 | |||

71 | Our strategy to generate huge pieces of straight-line code should |
||

72 | immunize FFTW sufficiently well against these problems, but you are |
||

73 | still likely to observe weird things like: FFTW_ESTIMATE can be |
||

74 | faster than FFTW_MEASURE. |
||

75 | |||

76 | FFTW-17.0 will compute the phase of the moon in the planner and take |
||

77 | it into account. |
||

78 | |||

79 | 6) Estimator: |
||

80 | |||

81 | The estimator tries to come up with a `reasonable' plan. |
||

82 | Unfortunately, this concept is very machine-dependent. The |
||

83 | estimator in the release is tuned for the UltraSPARC. |
||

84 | |||

85 | The following two parameters can be found in fftw/planner.c : |
||

86 | |||

87 | #define NOTW_OPTIMAL_SIZE 32 |
||

88 | #define TWIDDLE_OPTIMAL_SIZE 12 |
||

89 | |||

90 | They say: use non-twiddle codelets of size close to 32, and twiddle |
||

91 | codelets of size close to 12. More or less, these are the most |
||

92 | efficient pieces of code on the UltraSPARC. Your mileage *will* |
||

93 | vary. Here are some suggestions for other machines. |
||

94 | |||

95 | Pentium |
||

96 | |||

97 | #define NOTW_OPTIMAL_SIZE 8 (or 16) |
||

98 | #define TWIDDLE_OPTIMAL_SIZE 8 |
||

99 | |||

100 | Pentium Pro: |
||

101 | |||

102 | #define NOTW_OPTIMAL_SIZE 32 (or 16) |
||

103 | #define TWIDDLE_OPTIMAL_SIZE 2 |
||

104 | |||

105 | |||

106 | RS/6000: |
||

107 | |||

108 | #define NOTW_OPTIMAL_SIZE 32 |
||

109 | #define TWIDDLE_OPTIMAL_SIZE 32 |
||

110 | |||

111 | The basic idea is: compute some plans for sizes you most care |
||

112 | about, print the plan and plug in the numbers that appear more |
||

113 | often (or some kind of average). Note the dramatic difference |
||

114 | between Pentium an Pentium Pro. (NO LONGER TRUE WITH --enable-i386-hacks) |
||

115 | |||

116 | 7) Stack alignment: |
||

117 | |||

118 | Pentium-type processors impose a huge performance penalty if double- |
||

119 | precision values are not aligned to 8-byte boundaries in memory. |
||

120 | (We have seen factors of 3 or more in tight loops.) Unfortunately, |
||

121 | the Intel ABI specifies that variables on the stack need only be aligned to |
||

122 | 4-byte boundaries. Even more unfortunately, this convention is followed |
||

123 | by Linux/x86 and gcc. |
||

124 | |||

125 | To get around this, we wrote special macros (HACK_ALIGN_STACK_EVEN |
||

126 | and HACK_ALIGN_STACK_ODD) that take effect when FFTW is compiled |
||

127 | with gcc/egcs and 'configure --enable-i386-hacks' is used. These |
||

128 | macros are called before the computation-intensive "codelets" |
||

129 | of FFTW. They use the gcc __builtin_alloca function to examine the |
||

130 | stack pointer and align the stack to an even or odd 4-byte boundary, |
||

131 | depending upon how many values will be pushed on the stack to call |
||

132 | the codelet. |