The MIT Media Lab Phase Vocoder

This phase vocoder is split into an analysis and a synthesis
part. The analysis part is done by the program pvanal. It
produces a phase vocoder analysis file with the special
file header (see below), including information about the source
sound, analysis framesize and the overlap factor.

Following this header, the analysis data is stored as float, with
magnitudes and frequencies in turn for the first N/2+1 Fourier
bins of each frame.  We wrote a few programs to investigate the
phase vocoder algorithm on its analysis side.
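
The layout described above fixes the size of the data section
that follows the header. The sketch below works this out; the
function name pv_data_bytes is ours and not part of pvanal, and
4-byte floats are assumed.

```c
#include <assert.h>

/* Hypothetical helper: size in bytes of the data section, assuming
   4-byte floats and N/2+1 (magnitude, frequency) pairs per frame. */
long pv_data_bytes(long frames, long frame_size)
{
    assert(frames >= 0 && frame_size > 0 && frame_size % 2 == 0);
    long bins = frame_size / 2 + 1;      /* N/2+1 Fourier bins    */
    long floats_per_frame = 2 * bins;    /* magnitude + frequency */
    return frames * floats_per_frame * 4L;
}
```

For 859 frames of size 1024 this gives 3525336 bytes, and for 274
frames of size 32 it gives 37264 bytes. Both agree with the
dataBsize values printed in the sample sessions of this section.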

The source code of these programs is not included here, but
it is shipped with the Internet version of the catalogue.
Written in C and not depending on audio hardware, the programs
work on any Csound platform. Here is a short summary of their
function.

  -channels [analysis file] 
     reads file, data displayed per channel         
  -frames   [analysis file]
     reads file, data displayed per frame
  -magnitud [analysis file]
     reads file, data displayed per frame,
     above threshold magnitude only
  -wrapped  [analysis file]
     reads file, displays the value of phase
     and intermediate variables on the way to the
     approximated frequency.   

Sample runs of two of these programs are shown on the following
pages. The data flood is large and illustrates the need for
specific display programs, adapted to purpose and nature of the
sound material at hand. The ability to skip a number of analysis
frames can further reduce the stream of data.

The program flow of pvanal gives a crude picture of this
FFT-based phase vocoder, avoiding the necessity to go too deeply
into the intricate network of source files. 

The meaningful part is the FFT loop of pvanal.c, where
amplitude/time values are transformed into amplitude/frequency
values. The loop functions are found in the source dsputil.c and
the buffers below are central to this Fourier transform
procedure. We have looked into the phase/frequency conversion in
greater detail on page 172.  

The interpolation mechanism used by the PVOC unit generator
during re-synthesis has not been covered in this catalogue. It is
paramount to a fruitful understanding of this synthesis
technique.

------------------------------------------------------------ next

Testing Procedure
 
In order to get a feel for how pvanal operates, we first create a
simple audio signal with Csound. For example, the soundfile
60_01_1.SF holds a complex signal with partials at 1000, 2000,
3000 and 4000 Hz. Calling the program pvanal with different
windowing factors creates several analysis files. Now the data
can be studied by using display programs like 'magnitud'. 



A Sample Session at the Terminal

The user commands are displayed in text boxes. First we show a
call of pvanal analysing a speech file of 5 seconds, sampled at
44.1 kHz (2.2 MB). The FFT window has a size of 1024 points and
the overlap factor is 4 by default. The resulting analysis file
speech1.pv1 already occupies 3.5 MB of storage space.
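
The window size and overlap factor determine both the frame
increment and the spacing of the analysis channels. The following
is a minimal sketch of that arithmetic; the helper names are
ours, and it is assumed that the overlap factor divides the
window size.

```c
#include <assert.h>

/* Samples between successive analysis frames (hypothetical helper). */
long frame_incr(long frame_size, long overlap)
{
    assert(frame_size > 0 && overlap > 0 && frame_size % overlap == 0);
    return frame_size / overlap;
}

/* Centre frequency in Hz of channel k, for k = 0 .. frame_size/2. */
double bin_centre_hz(long k, double sr, long frame_size)
{
    assert(frame_size > 0 && k >= 0 && k <= frame_size / 2);
    return (double)k * sr / (double)frame_size;
}
```

With a 1024-point window and overlap 4 the increment is 256
samples, as pvanal reports. At sr = 44100, channel 8 is centred
at about 344.53 Hz and channel 512 at 22050 Hz, the values that
appear in the 'magnitud' listings.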

The second and third commands illustrate the operation of
'magnitud'. First the program displays the information found in
the file header. After specifying number of frames and threshold
magnitude, analysis data fills the screen. The data resulting
from the first 'magnitud' call is reproduced on pages 168 and
169. 

It is the _first_ frame of the analyzed speech file. A hard copy
of the complete analysis file occupies 1718 pages! 

The output of the second 'magnitud' call (pages 170 and 171) is
restricted to channels with magnitudes above 3. This greatly
reduces the data flood.
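
The thresholding can be pictured with the sketch below. It is not
the source of 'magnitud'; the function name, the use of a strict
"greater than" comparison and the print format are assumptions.
Only the storage order (magnitude, then frequency, per channel)
is taken from the file description.

```c
#include <assert.h>
#include <stdio.h>

/* Print the (magnitude, frequency) pairs of one frame whose magnitude
   exceeds thresh and return how many were printed. The frame is
   assumed to be stored as mag0, freq0, mag1, freq1, ... */
int print_above(const float *frame, int bins, float thresh)
{
    assert(frame != NULL && bins >= 0);
    int shown = 0;
    for (int k = 0; k < bins; ++k) {
        float mag = frame[2 * k];
        float frq = frame[2 * k + 1];
        if (mag > thresh) {
            printf("%8.2f %8.2f\n", mag, frq);
            ++shown;
        }
    }
    return shown;
}
```

For a 1024-point frame, bins would be 513. A frame in which no
channel passes the test prints nothing, which corresponds to the
empty "fr n:" lines in the second listing.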


------------------------------------------------------------ next

speech1.SF: AIFF,
            220500 samples,
            baseFrq 261.6 (midi 60),
            sustnLp: mode 0,
            relesLp: mode 0
            audio sr = 44100, monaural
            analysing 220500 sample frames (5.0 secs)
            1024 infrsize, 256 infrInc
            859 output frames estimated

frame: 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300
320 340 360 380 400 420
440 460 480 500 520 540 560 580 600 620 640 660 680 700 720 740
760 780 800 820 840 858
859 output frames written
speech1.pv1:dataBsize:            data occupies 3525336 bytes
            dataFormat:                     36
            minFreq:                        0.00
            maxFreq:                        22050.00
            freqFormat, log or lin:         1
            frameFormat:                    7
            mono, stereo, quad:             1
            samplingRate of original audio: 44100
            frameSize:                      1024 samps per frame
            frameIncr:                      256 samps per frame
            frameBsize:                     total of 859 frames

How many frames do you want me to display?         1
What threshold magnitude should I take?            0  <-- ie none
speech1.pv1:dataBsize:               data occupies 3525336 bytes
            dataFormat:                     36
            minFreq:                        0.00
            maxFreq:                        22050.00
            freqFormat, log or lin:         1
            frameFormat:                    7
            mono, stereo, quad:             1
            samplingRate of original audio: 44100
            frameSize:                      1024 samps per frame
            frameIncr:                      256 samps per frame
            frameBsize:                     total of 859 frames

How many frames do you want me to display?       400
What threshold magnitude should I take?            3

------------------------------------------------------------ next

Output of 'magnitud',
1st call.

         mag    freq

fr 1:    1.48    0.00           
         1.06   86.13
         0.56   129.20
         0.46   172.27
         0.93   215.33
         1.23   258.40
         1.71   301.46
         1.95   344.53
         1.50   387.60
         0.73   430.66
         0.37   473.73
         0.28   516.80
         0.28   559.86
         0.29   602.93
         0.27   646.00
         0.32   689.06
         0.32   732.13
         0.28   775.20

         etc.   one channel every 43.07 Hz up to sr/2

------ snip ----- snip ---- snip
        

         0.00   21705.47
         0.01   21748.54
         0.01   21791.60
         0.00   21834.67
         0.00   21877.73
         0.01   21920.80
         0.00   21963.87
         0.00   22006.93
         0.00   22050.00

Done: 
1 frame of speech1.pv1


------------------------------------------------------------ next

Output of 'magnitud'
2nd call.

         mag    freq

fr 1:
fr 2:    3.52   344.53
fr 3:
fr 4:
fr 5:
fr 6:
fr 7:    3.04   344.53
fr 8:
fr 9:
fr 10:

fr 11:   3.75   344.53
fr 12:   3.75   344.53
fr 13:
fr 14:
fr 15:
fr 16:   3.55   344.53
fr 17:   3.62   344.53
fr 18:
fr 19:
fr 20:

fr 21:   3.66   344.53
fr 22:   3.19   344.53
fr 23:
fr 24:   3.06   215.33
fr 25:   3.43   215.33
         3.60   344.53
fr 26:   3.78   344.53
fr 27:
fr 28:
fr 29:
fr 30:   3.72   344.53

fr 31:   3.13   344.53
fr 32:
fr 33:
fr 34:   3.16   43.07
fr 35:   4.56    0.00
         3.29   43.07
         4.02   344.53
fr 36:   3.08   43.07
         3.57   344.53
fr 37:   3.45    0.00
         3.40   43.07
fr 38:   3.74   43.07
fr 39:   3.71    0.00
         4.54   43.07
fr 40:   4.68    0.00
         4.15   43.07
         3.79   344.53

fr 41:
fr 42:   3.14    0.00
fr 43:
fr 44:   3.11    0.00
         3.61   43.07
fr 45:   4.60    0.00
         3.86   43.07
         3.09   344.53
fr 46:   3.54   43.07
fr 47:   4.31    0.00
         3.62   43.07
fr 48:   3.56   43.07
fr 49:   4.15    0.00
         3.95   43.07
         3.06   344.53
fr 50:   4.36    0.00
         3.71   43.07
         3.02   344.53

fr 51:   3.35   43.07
fr 52:   3.81    0.00
         3.36   43.07
fr 53:
fr 54:   4.29    0.00
         3.42   43.07
         3.89   344.53
fr 55:   4.27    0.00
         3.72   43.07
         3.41   344.53
fr 56:   3.20   43.07
fr 57:   4.16    0.00
         3.37   43.07
fr 58:   3.34   43.07
         3.29   344.53
fr 59:   4.18    0.00
         3.65   43.07
         3.82   344.53
fr 60:   4.60    0.00
         4.18   43.07

fr 61:   3.98   43.07
fr 62:   4.19    0.00
         3.71   43.07
fr 63:   3.04   43.07
fr 64:   4.19    0.00
         3.16   43.07
         3.24   344.53
fr 65:   3.84    0.00
         3.71   43.07
fr 66:   3.76   43.07
fr 67:   4.12    0.00
         3.82   43.07
fr 68:   3.51   43.07
         4.03   344.53
fr 69:   4.69    0.00
         3.62   43.07
         3.60   344.53
fr 70:   3.45    0.00

fr 71:   3.17   43.07
fr 72:   4.00    0.00
         4.00   43.07
fr 73:   3.82   43.07
         4.02   344.53
fr 74:   4.91    0.00
         3.85   43.07
         3.33   344.53
fr 75:   3.24   43.07
fr 76:
fr 77:   3.24    0.00
fr 78:   3.31   344.53
fr 79:   4.51    0.00
         3.80   43.07
fr 80:   3.53   43.07

fr 81:
fr 82:   3.11   344.53
fr 83:   3.27   344.53
fr 84:   3.75    0.00
fr 85:
fr 86:
fr 87:   3.74   344.53
fr 88:
fr 89:   3.39    0.00
         3.08   215.33
fr 90:

fr 91:
fr 92:
fr 93:   3.18   301.46
         3.40   344.53
fr 94:

=============================
etc.  all frames till the end
=============================

------------ snip -----------------------

fr 362:  3.54   344.53
fr 363:  3.18   344.53
fr 364:
fr 365:
fr 366:  3.06   344.53
fr 367:  4.13   344.53
fr 368:  3.27   344.53
fr 369:
fr 370:
fr 371:  3.25   344.53
fr 372:  3.13   344.53
fr 373:
fr 374:
fr 375:
fr 376:  3.81   344.53
fr 377:  3.74   344.53
fr 378:
fr 379:
fr 380:
fr 381:  4.11   344.53
fr 382:  3.57   344.53
fr 383:
fr 384:
fr 385:
fr 386:  3.67   344.53
fr 387:  3.01   344.53
fr 388:
fr 389:
fr 390:
fr 391:  3.02   344.53
fr 392:
fr 393:
fr 394:
fr 395:  3.28   344.53
fr 396:  3.10   344.53
fr 397:
fr 398:
fr 399:
fr 400:  3.62   344.53

Done: 
400 frames of speech1.pv1

------------------------------------------------------------ next


From Phase to Frequency

The transformation occurs in the functions UnwrapPhase and
PhaseToFrq of the main loop of pvanal. We analyze these two
functions step by step.

The program 'wrapped' displays the successive values of certain
expressions in the source code per channel. In the discussion
below, these expressions will be identified by boldfaced text. 

UnwrapPhase

The function loops through one frame of the analysis file. During
a pass, the phase difference since the last frame is evaluated
and unwrapped. The old phase value is saved in buffer OldPh, then
the unwrapped phase is saved in the main buffer tmpbuf. Here is
one pass:

First the phase value resulting from the rectangular to polar
conversion is assigned to float variable p:

        p = pha[2L*i]                            see phase

Then the phase change since the last frame is computed:

        p -= oldPh[i]                           see diff-p

followed by a call of the macro MMmaskPhs, where p is the phase:
        
        MMmaskPhs(p,z,pi,oneOnPi)

The preprocessor has replaced this macro call by:

        z = (int)(p*oneOnPi);
        p -= pi*(float)((int)((z+((z>=0)?(z&1):-(z&1) ))));

In the first statement, the integer z is assigned the value of
float p, scaled by 1/pi. By casting float to integer, all decimals
are truncated. The second statement is far more complex and can
best be approached in a number of small steps. First we look at
the expression condit

        z + ( (z>=0) ? (z&1) : -(z&1) )         see condit

which consists of a test, an evaluation and an addition. The
expression

        z&1

discriminates between odd and even numbers. In binary
representation, all odd integers have their bit 0 set. Therefore,
if z is odd, then z&1 = 1. For even integers z, z&1
evaluates to 0 and condit = z.
So the conditional test
                   
        z>=0  ?   (z&1): -(z&1)

will only have consequences for odd integers. These can be
simplified to the following if-else statement:

        If z >= 0, condit = z+1
           else condit = z-1                  

The net effect of condit is to make all z even: z = n*2. 
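
The behaviour of condit is easy to check in isolation. The sketch
below wraps the expression in a function of our own naming; it
assumes two's-complement integers, so that z&1 is 1 for odd
negative z as well.

```c
#include <assert.h>

/* Round z away from zero to the nearest even integer, as the
   expression z + ((z>=0) ? (z&1) : -(z&1)) does. */
int condit(int z)
{
    int r = z + ((z >= 0) ? (z & 1) : -(z & 1));
    assert(r % 2 == 0);   /* the result is always even */
    return r;
}
```

The truncated values 219, -438 and -3 map to 220, -438 and -4.
These are the condit values that 'wrapped' lists for the diff-p
values 219.34, -438.67 and -3.55.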

Then the masking/unwrapping is completed by

        p = p - pi * condit                   see masked

The value of condit is scaled back by pi and added to or
subtracted from the phase p. All p are unwrapped into the
principal branch of the inverse tangent function: -pi < p < pi.
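
Put together, the two statements of the expanded macro amount to
the small function sketched here. The name mask_phase and the
constant PI_F are ours; the input range check is an added
safeguard so that the cast to int stays in range.

```c
#include <assert.h>

#define PI_F 3.14159265f

/* Bring a phase difference p (in radians) back into the range
   -pi..pi, following the two statements of the expanded macro. */
float mask_phase(float p)
{
    assert(p > -1.0e6f && p < 1.0e6f);
    int z = (int)(p * (1.0f / PI_F));             /* truncate p/pi */
    int c = z + ((z >= 0) ? (z & 1) : -(z & 1));  /* condit, even  */
    return p - PI_F * (float)c;
}
```

A value such as 4.0 radians comes back as roughly -2.28, and -4.0
as roughly 2.28, while values already inside the range are left
unchanged.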

In the last two statements of UnwrapPhase, the old polar phase
and the unwrapped  phase p are saved:

        oldPh[i] = pha[2L*i];
        pha[2L*i] = p;

In this way, the loop works through all the phase values of the
FFT frame, and then proceeds to PhaseToFrq.
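
The steps described for one pass can be collected into a loop. The
following is a reconstruction from the description above, not a
verbatim copy of dsputil.c; the function name and the temporary
variable raw are ours. It assumes that pha points at the first
phase slot and that phases occupy every second float.

```c
#include <assert.h>

#define PI_F 3.14159265f

/* One pass over a frame: replace each raw phase by its masked
   difference to the previous frame and remember the raw phase. */
void unwrap_phase_sketch(float *pha, long size, float *oldPh)
{
    assert(pha != 0 && oldPh != 0 && size >= 0);
    for (long i = 0; i < size; ++i) {
        float raw = pha[2L * i];         /* phase from polar conversion */
        float p = raw - oldPh[i];        /* change since last frame     */
        int z = (int)(p * (1.0f / PI_F));
        p -= PI_F * (float)(z + ((z >= 0) ? (z & 1) : -(z & 1)));
        oldPh[i] = raw;                  /* keep raw phase for next frame */
        pha[2L * i] = p;                 /* store masked difference       */
    }
}
```

The values between the phase slots (the magnitudes) are not
touched by this loop.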

Output from 'wrapped'

We show only the first 6 channels out of 17: each channel with
successive expression values for 5 frames. The example serves to
cross-check one's understanding of the variables in the analyzed
source code.

=== program output ====

The program displays analysis data of pvoc.file
Successful read of: 60_01_2.pv1

Displaying Header Data

dataBsize:                       data occupies 37264 bytes
dataFormat:                      36
samplingRate of original audio:  22050
mono, stereo, quad:              1
frameSize:                       32 audio samples per frame
frameIncr:                       increasing 16 audio samples per frame
frameBsize:                      274 frames in this file
frameFormat:                     7
minFreq:                         0.00
maxFreq:                         11025.00
freqFormat, log or lin:          1

How many frames do you want me to display?             5

**************** Settings **************************
All phase related expressions are scaled by factor pi.
srOn2pi = 219.34      eDphIncr =  3.14      frqPerBin =  689.06

channel 1:                              -689.06 <  0.00 < 689.06

 phase  diff-p  condit masked expDpha Diff masked local-F glob-F
   0.00    0.00   0    0.00  0.00    0.00   0.00    0.00    0.00
 219.34  219.34  220  -0.66  0.00   -0.66  -0.66 -457.95 -457.95
-219.34 -438.67 -438  -0.67  0.00   -0.67  -0.67 -462.23 -462.23
   0.00  219.34  220  -0.66  0.00   -0.66  -0.66 -457.95 -457.95
 219.34  219.34  220  -0.66  0.00   -0.66  -0.66 -457.95 -457.95

channel 2:                               0.00 < 689.06 < 1378.12

 phase  diff-p  condit masked expDpha Diff masked local-F glob-F
 161.35  161.35  162  -0.65 -1.00    0.35   0.35  244.36  933.42
 266.06  104.71  104   0.71 -1.00    1.71  -0.29 -202.56  486.50
 319.40   53.34   54  -0.66 -1.00    0.34   0.34  234.98  924.04
 315.85   -3.55   -4   0.45 -1.00    1.45  -0.55 -381.35  307.71
 322.95    7.11    8  -0.89 -1.00    0.11   0.11   73.05  762.12

channel 3:                            689.06 < 1378.12 < 2067.19

 phase  diff-p  condit masked expDpha Diff masked local-F glob-F
 315.20  315.20  316  -0.80  0.00   -0.80  -0.80 -554.07  824.06
 339.37   24.17   24   0.17  0.00    0.17   0.17  120.58 1498.71
 288.42  -50.95  -50  -0.95  0.00   -0.95  -0.95 -657.27  720.85
 340.08   51.66   52  -0.34  0.00   -0.34  -0.34 -234.41 1143.71
 341.29    1.21    2  -0.79  0.00   -0.79  -0.79 -541.72  836.40

channel 4:                           1378.12 < 2067.19 < 2756.25

 phase  diff-p  condit masked expDpha Diff masked local-F glob-F
 716.45  716.45  716   0.45  1.00    1.45  -0.55 -375.74 1691.45
 600.35 -116.10 -116  -0.10 -1.00    0.90   0.90  617.56 2684.75 
 680.25   79.90   80  -0.10 -1.00    0.90   0.90  621.59 2688.77
 616.58  -63.68  -64   0.32 -1.00    1.32  -0.68 -466.72 1600.47
 600.84  -15.74  -16   0.26 -1.00    1.26  -0.74 -508.18 1559.01

channel 5:                           2067.19 < 2756.25 < 3445.31

 phase  diff-p  condit masked expDpha Diff masked local-F glob-F
 878.03  878.03  878   0.03  0.00    0.03   0.03   19.44 2775.69
1000.71  122.68  122   0.68  0.00    0.68   0.68  470.24 3226.49
 855.26 -145.45 -146   0.55  0.00    0.55   0.55  377.39 3133.64
 991.26  136.00  136   0.00  0.00    0.00   0.00    0.27 2756.52
1004.12   12.86   12   0.86  0.00    0.86   0.86  592.27 3348.52

channel 6:                           2756.25 < 3445.31 < 4134.37

 phase  diff-p  condit masked expDpha Diff masked local-F glob-F
1060.97 1060.97 1060   0.97 -1.00    1.97  -0.03  -18.81 3426.50
 944.04 -116.93 -116  -0.93 -1.00    0.07   0.07   45.07 3490.39
1044.30  100.27  100   0.27 -1.00    1.27  -0.73 -505.84 2939.47
 935.53 -108.77 -108  -0.77 -1.00    0.23   0.23  155.56 3600.87
 905.72  -29.81  -30   0.19 -1.00    1.19  -0.81 -560.77 2884.55

Done: 6 channels and 5 frames of file 60_01_2.pv1

------------------------------------------------------------ next

PhaseToFrq

Like the previous function, PhaseToFrq loops through the number
of independent values in one FFT frame.  

Since the phase difference is measured in regular frame increase
intervals, its value depends on the window overlap factor.
Increases of a whole frame result in phase differences of 2 pi,
increases of 1/2 frame result in phase differences of pi, and so
on.

The corresponding constant is called eDphIncr. In the expression
below, an expected phase difference (see expDpha) is subtracted
from the unwrapped phase difference.  

        p = pha[2L*i]-expectedDphas;                    see Diff

The emerging difference is masked (as in UnwrapPhase) and  saved: 
                 
        pha[2L*i] = p;                                 see masked

Next, the difference is converted to frequency by

        pha[2L*i] = pha[2L*i] * srOn2pi;              see local-F
        pha[2L*i] = pha[2L*i] + binMidFrq;             see glob-F

In the latter statement the channel's center frequency is added
to the local frequency value. 
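
The two conversion statements can be tried out with the sketch
below. The function and parameter names are ours, and srOn2pi is
computed as sampRate/(2*pi*incr), which is the reading that
reproduces the value 219.34 printed by 'wrapped'.

```c
#include <assert.h>

#define PI_D 3.14159265358979323846

/* Convert a masked phase deviation (radians per frame increment)
   into Hz and add the channel's centre frequency. */
double phase_dev_to_hz(double dphi, double sr, long incr, double bin_mid_hz)
{
    assert(sr > 0.0 && incr > 0);
    double sr_on_2pi = sr / (2.0 * PI_D * (double)incr);  /* srOn2pi */
    double local_f = dphi * sr_on_2pi;                    /* local-F */
    return local_f + bin_mid_hz;                          /* glob-F  */
}
```

With sr = 22050 and incr = 16, a deviation of pi corresponds to
689.06 Hz, which is the frqPerBin value of that example. A
deviation of pi/2 in the channel centred at 689.06 Hz yields
about 1033.59 Hz.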

In the last three statements the values of expectedDphas and
binMidFrq are updated for the next pass. The values for the
expected phase difference depend on eDphIncr, but must lie
within the range -pi < expectedDphas < pi. Successive values are
monitored by 'wrapped' (see expDpha).
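
The update of the expected phase difference can be sketched as
follows. This is an illustration under our own naming, assuming
that the value starts at 0 for the first channel, grows by
eDphIncr per channel and is masked after each step in the same
way as in UnwrapPhase.

```c
#include <assert.h>

#define PI_F 3.14159265f

/* Expected phase difference (radians) for a 0-based channel index. */
float expected_dphas(int channel, float e_dph_incr)
{
    assert(channel >= 0);
    float expected = 0.0f;
    for (int i = 0; i < channel; ++i) {
        expected += e_dph_incr;                     /* next channel */
        int z = (int)(expected * (1.0f / PI_F));
        expected -= PI_F * (float)(z + ((z >= 0) ? (z & 1) : -(z & 1)));
    }
    return expected;
}
```

With an increment of pi the result alternates between 0 and plus
or minus pi. Expressed in units of pi, this is the 0.00, -1.00,
0.00, ... pattern of the expDpha column. Whether a given channel
lands on +pi or -pi depends on floating-point rounding, which may
explain the single 1.00 in the listing for channel 4.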

In every pass, the variable binMidFrq takes on the next channel's
center frequency. The list below facilitates the study of the
source code.

Variables, names, meanings, values                     EXAMPLE    
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++    
                                                        
incr = frameIncr = samples between frames                   16   
sampRate = sampling rate of audio file: sr               22050
srOn2pi = sampRate/(2*pi*incr)                      219.34 Hz
binMidFrq = channel center frequency                  variable
frqPerBin = maximum deviation from center in Hz     689.06 Hz
size = indepVals = independent values in a frame            17
macro: actual(size); replaced by: ((size-1L)*2L),                 
((17-1)*2) thus returns the actual framesize.               32
eDphIncr = 2*pi*incr/((float)actual(size))        2*pi*16/32

expectedDphas = expected delta phase between channels variable
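
The constants in the list can be recomputed from size, incr and
the sampling rate. In the sketch below the struct and function
names are ours. The formula for frqPerBin (sampling rate divided
by the actual frame size) is inferred from the example values
and is not quoted from the source code.

```c
#include <assert.h>

#define PI_D 3.14159265358979323846

struct pv_consts {
    long   actual_size;  /* (size-1)*2: frame size in samples  */
    double sr_on_2pi;    /* sampRate / (2*pi*incr)             */
    double frq_per_bin;  /* sampRate / actual_size (inferred)  */
    double e_dph_incr;   /* 2*pi*incr / actual_size            */
};

struct pv_consts pv_make_consts(long size, long incr, double samp_rate)
{
    assert(size > 1 && incr > 0 && samp_rate > 0.0);
    struct pv_consts c;
    c.actual_size = (size - 1L) * 2L;
    c.sr_on_2pi   = samp_rate / (2.0 * PI_D * (double)incr);
    c.frq_per_bin = samp_rate / (double)c.actual_size;
    c.e_dph_incr  = 2.0 * PI_D * (double)incr / (double)c.actual_size;
    return c;
}
```

For size = 17, incr = 16 and a sampling rate of 22050 this gives
an actual frame size of 32, srOn2pi = 219.34, frqPerBin = 689.06
and eDphIncr = 3.14, matching the settings line printed by
'wrapped'.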

------------------------------------------------------------ next

60_22_1
additional parameters: none

The instruments of subgroup 22 explore the manipulation of PVOC's
ktimpnt input. The variable ktimpnt signifies a point in time of
the analysis file. 

Instrument 1 shows how to resynthesize the original soundfile (a
santur). For idur = 5 seconds, LINE produces a linear set of
values that will let PVOC progress through the analysis file at
the original speed. 

In instrument 2, LINE will have PVOC resynthesize the analysis
file backwards. Choosing durations that differ from the original
soundfile duration results in time-stretching or time-compression
of the analysis file, without altering the pitch. 
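
The relation between note duration and pointer movement can be
written out as below. This C sketch only mimics what a straight
line between two time points does; it is not the orchestra code,
and the function names are ours.

```c
#include <assert.h>

/* Analysis-file time (seconds) reached t seconds into a note of
   length note_dur, for a straight line from t_start to t_end. */
double time_pointer(double t, double note_dur, double t_start, double t_end)
{
    assert(note_dur > 0.0 && t >= 0.0 && t <= note_dur);
    return t_start + (t_end - t_start) * (t / note_dur);
}

/* Stretch factor: values above 1 slow the sound down, values
   below 1 speed it up. */
double stretch_factor(double note_dur, double t_start, double t_end)
{
    double span = (t_end > t_start) ? (t_end - t_start) : (t_start - t_end);
    assert(note_dur > 0.0 && span > 0.0);
    return note_dur / span;
}
```

A line from 0 to 5 over a 5 second note reads the file at the
original speed (factor 1). Swapping the end points reads it
backwards, and a 10 second note over the same span stretches the
sound by a factor of 2.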

(flowchart)
(.orc and .sco files)

------------------------------------------------------------ next

60_22_1B
additional parameters:


The pointer into this speech analysis file follows the values
produced by LINSEG.  Resynthesis proceeds in forward motion.  The
various slopes determine the amount of time-stretching or
compression. With regard to these effects, the slope pattern
results in a subtle modulation of the original speech
inflections.

(flowchart)
(.orc and .sco files)

------------------------------------------------------------ next

60_22_1C
additional parameters:


This variation shows us a giant magnification of a time fragment
of 40 ms: an audio microscope! 

The original soundfile featured a finger snap happening at time
0.71 seconds. LINE frames beginning and ending time points for
this computer instrument such that it captures the original
'snap'. Since the note duration is 10 seconds in the example, the
snap has been blown up by a factor of 250.

(flowchart)
(.orc and .sco files)

------------------------------------------------------------ next

60_22_2
additional parameters:


Here EXPON directs the pace of resynthesis of the santur analysis
file. 

Instrument 1 moves forward and at growing speed through
santur1.pv1, while instrument 2 resynthesizes backwards and
slowing down.

The variable ifildur is set to the duration of the original audio
file. 

(flowchart)
(.orc and .sco files)

------------------------------------------------------------ next

60_22_3
additional parameters:


Here we find an experimental design where the pointer is made to
oscillate through the santur1.pv1 analysis file. 

The oscillator controlling the pointer is set to 1/4 Hz and has
a phase offset of 3 pi/2. By adding the constant 2.5, this signal
slowly oscillates between times 2.25 and 2.75, and in total
completes a bit more than one cycle during the note duration of 5
seconds.
 
(flowchart)
(.orc and .sco files)

------------------------------------------------------------ next

60_23_1
additional parameters:


In this subgroup, we focus on the second krate variable of PVOC:
kfmod. The pointer ktimpnt is neutralized by a simple linear
resynthesis control.

The first example, again using the santur analysis file,
demonstrates the effect of an EXPON control signal, whose target
value is variable.

(flowchart)
(.orc and .sco files)
------------------------------------------------------------ next
60_23_2
additional parameters:


As in the previous example, some care needs to be taken in order
to convert the pitch of the note into a value suitable for
transposition manipulations. Values in the neighbourhood of unity
are required. Specific values will vary with the approximate
fundamental frequency of the sound(s) in the analysis file. Then,
multiplication of EXPON with ifsc will produce the desired pitch
modifications. 
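
One plausible way to obtain such a value near unity is to divide
the desired pitch by the approximate fundamental of the analyzed
sound. The sketch below illustrates only this ratio. How ifsc is
actually computed in the orchestra is not shown here, so the
function and its parameters are hypothetical.

```c
#include <assert.h>

/* Hypothetical scale factor: desired pitch in Hz divided by the
   approximate fundamental of the analyzed sound in Hz. */
double fmod_scale(double note_hz, double source_hz)
{
    assert(note_hz > 0.0 && source_hz > 0.0);
    return note_hz / source_hz;
}
```

A note at the source's own fundamental gives 1 (no
transposition), and a note an octave higher gives 2.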

(flowchart)
(.orc and .sco files)
