/***************************************************************************
  **************************************************************************
  
                Spherical Harmonic Transform Kit 2.7
  
  
   Contact: Peter Kostelec
            geelong@cs.dartmouth.edu
  
  
   Copyright 1997-2003  Sean Moore, Dennis Healy,
                        Dan Rockmore, Peter Kostelec
  
  
   Copyright 2004  Peter Kostelec, Dan Rockmore


     SpharmonicKit is free software; you can redistribute it and/or modify
     it under the terms of the GNU General Public License as published by
     the Free Software Foundation; either version 2 of the License, or
     (at your option) any later version.
  
     SpharmonicKit is distributed in the hope that it will be useful,
     but WITHOUT ANY WARRANTY; without even the implied warranty of
     MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
     GNU General Public License for more details.
  
     You should have received a copy of the GNU General Public License
     along with this program; if not, write to the Free Software
     Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
  
  
   Commercial use is absolutely prohibited.
  
   See the accompanying LICENSE file for details.
  
  ************************************************************************
  ************************************************************************/


BACKGROUND
----------

First we state that all algorithms expect the number of samples
and coefficients to be a power of 2, and that sampling is done
at the Chebyshev nodes.

The NAIVE algorithm is just the projection of the function
onto the Legendre functions, sampled at the Chebyshev
nodes (zeroes of T_2n).

The SEMINAIVE algorithm is an O(N^2) algorithm that performs
projections in frequency space rather than time space, i.e.,
it projects the function to be transformed onto cosine series 
representations of the associated Legendre functions.  This is
a fast algorithm in practice, provided that the cosine series
representations of the Legendre functions have been precomputed.

Note: in performing a forward spherical transform via the
seminaive algorithm, the total size of the cosine series
representations (the "precomputed data") grows rather quickly
as the bandwidth of the problem increases:

	bw = 64   -> about 0.34 megabytes of precomputed data
	bw = 128  -> about 2.69 megabytes of precomputed data
	bw = 256  -> about 21.5 megabytes of precomputed data
	bw = 512  -> about  171 megabytes of precomputed data
	bw = 1024 -> about 1367 megabytes of precomputed data

Depending on memory available and disk storage capabilities,
you may want to compute the cosine series as needed (``on the fly")
or read the precomputed data off disk.


In its original form, the basic DRISCOLL-HEALY (DH) algorithm
computes a spherical transform of a function with harmonics
of at most order N in O(N^2 (log N)^2) time. This is an asymptotic
result, exact in exact arithmetic. The key component of this
algorithm is a fast Legendre transform which, assuming a
precomputed data structure of size O(N log N), can be performed
in O(N (log N)^2) operations.  The fast Legendre transform algorithm
works as follows. The original problem of size N is reduced (by
smoothing and subsampling) to two smaller problems, each of size
N/2. Then those two smaller problems are themselves smoothed
and subsampled. So now there are four subproblems, each of
size N/4. One continues to divide-and-conquer "all the way down."
In terms of implementation, however, one would not divide-and-
conquer "all the way down." Overhead costs would render the
program computationally inefficient. This is one reason why more
computer-friendly variations of the fast Legendre transform
algorithm were developed.

Another reason for developing variations of the above Legendre
transform was the unstable nature of the original algorithm.
An integral part of the algorithm is its use of shifted Legendre
polynomials. As the order m of the transform increases, these
polynomials become numerically suspect. Therefore, one needs
to take care on how they are used. One can do this with the HYBRID
ALGORITHM. Our experience indicates that the HYBRID ALGORITHM
is the fastest of the variant algorithms.

One can make the HYBRID LEGENDRE TRANSFORM algorithm numerically
stable by varying a number of settings, but which setting
is optimal (in terms of runtime and accuracy) depends on
the platform the code is run on. The settings provided here
can be thought of as starting points. You must tailor them
to your computer.

Note: in performing a forward spherical transform via the
HYBRID ALGORITHM, the total size of the precomputed data
grows quickly, but not as quickly as the seminaive transform's
needs. The exact growth is difficult to predict since it
depends on how much (and how) the hybrid algorithm is used
in a spherical transform. But to at least have some figures
to compare, here are the sizes of precomputed data as we
have used the hybrid algorithm:

	bw = 64   -> about 0.32 megabytes of precomputed data
	bw = 128  -> about 2.31 megabytes of precomputed data
	bw = 256  -> about 17.6 megabytes of precomputed data
	bw = 512  -> about  136 megabytes of precomputed data
	bw = 1024 -> about  869 megabytes of precomputed data

Note that at bw = 1024 the hybrid-based spherical transform uses
significantly less precomputed data than the seminaive spherical
transform. But this is for us - your mileage may vary !!! As we
said above, the hybrid algorithm's performance appears to be
rather architecture dependent. For example, let bw = 1024, and
order m = 0. The hybrid Legendre transform was faster than the
seminaive Legendre transform on both a DEC Alpha and HP Exemplar.
However, at order m = 512 (same bandwidth) the hybrid algorithm
was faster than the seminaive algorithm on the HP, but the reverse
was true on the DEC! So again, performance depends on how the hybrid
algorithm is used. But in terms of precomputed data size, the
seminaive sizes may be taken as a sort of "maximum" for the
hybrid algorithm (if the hybrid algorithm was used in the "worst"
possible way). One final set of numbers: on the DEC and HP,
the seminaive forward spherical transform and hybrid forward
spherical transform were roughly competitive at bw = 512. On
the HP, at bw = 1024 we found the hybrid algorithm to be roughly
30% faster, in terms of cpu and walltime. The code was not
tested at bw = 1024 on the DEC because of hardware limitations
(i.e. not enough available disk space).

