Rev 2 | Details | Compare with Previous | Last modification | View Log | RSS feed
Rev | Author | Line No. | Line |
---|---|---|---|
2 | pj | 1 | The following are a number of ideas for future work that we have |
2 | thought of, or which have been suggested to us. Let us know |
||
3 | (fftw@theory.lcs.mit.edu) if you have other proposals, or if there is |
||
4 | something that you want to work on. |
||
5 | |||
6 | * Implement some sort of Prime Factor algorithm (Temperton's?) (PFA is |
||
7 | now used in the codelets.) |
||
8 | |||
9 | * Try the Winograd blocks for the base cases. (We now use Rader's |
||
10 | algorithm for prime size codelets.) |
||
11 | |||
12 | * Try on-the-fly generation of twiddle factors, to save space and |
||
13 | cache. (Done. However, not yet enabled in the standard distribution. |
||
14 | The codelet generator is capable of generating code that either loads |
||
15 | or computes the twiddle factors, and the FFTW C code supports both |
||
16 | ways. We do not have enough experimental numbers to determine which |
||
17 | way is faster, however) |
||
18 | |||
19 | * Since we now have "strided wisdom," it would be nice to keep the |
||
20 | stride into account when planning 1D transform recursively. We should |
||
21 | eliminate the planner table altogether, and just use the wisdom table |
||
22 | for planning. |
||
23 | |||
24 | * Implement fast DCT and DST codes (cosine and sine transforms); |
||
25 | equivalently, implement fast algorithms for transforms of real/even |
||
26 | and real/odd data. There are two parts to this: (i) modify the |
||
27 | codelet generator to output hard-coded transforms of small sizes [this |
||
28 | is done], and (ii) figure out & implement a recursive framework for |
||
29 | combining these codelets to achieve transforms of general lengths. |
||
30 | (Once this is done, implement multi-dimensional transforms, etcetera.) |
||
31 | |||
32 | * Implement a library of convolution routines, windowing, filters, |
||
33 | etcetera based on FFTW. As DSP isn't our field (we are interested in |
||
34 | FFTs for other reasons), this sort of thing is probably best left to |
||
35 | others. Let us know if you're interested in writing such a thing, |
||
36 | though, and we'll be happy to link to your site and give you feedback. |
||
37 | |||
38 | * Generate multi-dimensional codelets for use in two/three-dimensional |
||
39 | transforms. (i.e. implement what is sometimes called a "vector-radix" |
||
40 | algorithm.) There are potential cache benefits to this. |
||
41 | |||
42 | * Take advantage of the vector instructions on the Pentium-III and |
||
43 | forthcoming PowerPC architectures. (Coming from the old Cray vector |
||
44 | supercomputers and the horrible coding they encouraged, this seems |
||
45 | suspiciously like a giant step backwards in computer architectures...) |
||
46 | We'd like to see better gcc support before we do anything along these |
||
47 | lines, though. |
||
48 | |||
49 | * In rfftw, implement a fast O(n lg n) algorithm for prime sizes and |
||
50 | large prime factors (currently, only the complex FFTW has fast |
||
51 | algorithms for prime sizes). The basic problem is that we don't know |
||
52 | of any such algorithm specialized for real data; suggestions and/or |
||
53 | references are welcome. |
||
54 | |||
55 | * In the MPI transforms, implement a parallel 1D transform for real |
||
56 | data (i.e. rfftw_mpi). (Currently, there are only parallel 1D |
||
57 | transforms for complex data in the MPI code.) |
||
58 | |||
59 | * In the MPI transforms, implement more sophisticated (i.e. faster) |
||
60 | in-place and out-of-place transpose routines for the in-process |
||
61 | transposes (used as subroutines by the distributed transpose). The |
||
62 | current routines are quite simplistic, although it is not clear how |
||
63 | much they hurt performance. |