Rev 2 | Details | Compare with Previous | Last modification | View Log | RSS feed

Rev | Author | Line No. | Line |
---|---|---|---|

2 | pj | 1 | The following are a number of ideas for future work that we have |

2 | thought of, or which have been suggested to us. Let us know |
||

3 | (fftw@theory.lcs.mit.edu) if you have other proposals, or if there is |
||

4 | something that you want to work on. |
||

5 | |||

6 | * Implement some sort of Prime Factor algorithm (Temperton's?) (PFA is |
||

7 | now used in the codelets.) |
||

8 | |||

9 | * Try the Winograd blocks for the base cases. (We now use Rader's |
||

10 | algorithm for prime size codelets.) |
||

11 | |||

12 | * Try on-the-fly generation of twiddle factors, to save space and |
||

13 | cache. (Done. However, not yet enabled in the standard distribution. |
||

14 | The codelet generator is capable of generating code that either loads |
||

15 | or computes the twiddle factors, and the FFTW C code supports both |
||

16 | ways. We do not have enough experimental numbers to determine which |
||

17 | way is faster, however) |
||

18 | |||

19 | * Since we now have "strided wisdom," it would be nice to keep the |
||

20 | stride into account when planning 1D transform recursively. We should |
||

21 | eliminate the planner table altogether, and just use the wisdom table |
||

22 | for planning. |
||

23 | |||

24 | * Implement fast DCT and DST codes (cosine and sine transforms); |
||

25 | equivalently, implement fast algorithms for transforms of real/even |
||

26 | and real/odd data. There are two parts to this: (i) modify the |
||

27 | codelet generator to output hard-coded transforms of small sizes [this |
||

28 | is done], and (ii) figure out & implement a recursive framework for |
||

29 | combining these codelets to achieve transforms of general lengths. |
||

30 | (Once this is done, implement multi-dimensional transforms, etcetera.) |
||

31 | |||

32 | * Implement a library of convolution routines, windowing, filters, |
||

33 | etcetera based on FFTW. As DSP isn't our field (we are interested in |
||

34 | FFTs for other reasons), this sort of thing is probably best left to |
||

35 | others. Let us know if you're interested in writing such a thing, |
||

36 | though, and we'll be happy to link to your site and give you feedback. |
||

37 | |||

38 | * Generate multi-dimensional codelets for use in two/three-dimensional |
||

39 | transforms. (i.e. implement what is sometimes called a "vector-radix" |
||

40 | algorithm.) There are potential cache benefits to this. |
||

41 | |||

42 | * Take advantage of the vector instructions on the Pentium-III and |
||

43 | forthcoming PowerPC architectures. (Coming from the old Cray vector |
||

44 | supercomputers and the horrible coding they encouraged, this seems |
||

45 | suspiciously like a giant step backwards in computer architectures...) |
||

46 | We'd like to see better gcc support before we do anything along these |
||

47 | lines, though. |
||

48 | |||

49 | * In rfftw, implement a fast O(n lg n) algorithm for prime sizes and |
||

50 | large prime factors (currently, only the complex FFTW has fast |
||

51 | algorithms for prime sizes). The basic problem is that we don't know |
||

52 | of any such algorithm specialized for real data; suggestions and/or |
||

53 | references are welcome. |
||

54 | |||

55 | * In the MPI transforms, implement a parallel 1D transform for real |
||

56 | data (i.e. rfftw_mpi). (Currently, there are only parallel 1D |
||

57 | transforms for complex data in the MPI code.) |
||

58 | |||

59 | * In the MPI transforms, implement more sophisticated (i.e. faster) |
||

60 | in-place and out-of-place transpose routines for the in-process |
||

61 | transposes (used as subroutines by the distributed transpose). The |
||

62 | current routines are quite simplistic, although it is not clear how |
||

63 | much they hurt performance. |