
For this project we were required to design a method for representing 16 kHz speech waveforms at a rate of 1800 parameters per second. A number of possible methods were considered. An obvious, simple solution would be to lowpass filter the speech signal so that the downsampled signal meets the 1800-parameters-per-second requirement. This would discard the high-frequency content of the speech but retain frequencies below 900 Hz, which would still provide intelligible speech. While this would satisfy the requirement, it seemed like a cheap way out.

As a result, we also considered a number of other possibilities. These included adaptive predictive coding, adaptive transform coding, sub-band coding using adaptive bit allocation, sub-band adaptive predictive coding, and vector quantization. It was at this point that we realized that we needed to set a design objective in conjunction with picking a compression approach. Motivated by a generally warm, fuzzy feeling, we settled on the following design objective:

"Develop a speech compression technique that produces reasonably intelligible male speech with as few parameters per second as possible." (We limited ourselves to male speech since all of our training/testing speech was spoken by male speakers.)

Throughout this section we use the "sun" sound bite from the first project to help illustrate our motivation for various design decisions. We resampled the speech signal to 16 kHz in order to ensure an optimal match with the LPC codebook, which we assume was trained on 16 kHz speech data. Figure 1 shows the original "sun" signal.

Our first design decision (other than choosing our design goal) was reached early and unanimously: we settled on using LPC to model the vocal tract. Furthermore, we restricted our LPC model to a twenty-pole filter characterizing 30 msec speech frames. This restriction allowed us to take advantage of the previously trained LPC codebook.

The remainder of the design process involved modeling the error signal.

We model the error signal generated by the LPC vocal tract analysis as the excitation component of the speech waveform. We will use "excitation signal" and "error signal" interchangeably. A wide variety of excitation models exist in the literature. In this section we will describe a number of approaches that we considered. We will also describe some of the results for the ones we actually implemented.

On the extreme ends lie two options. One option is to ignore the excitation
and just use the vocal tract information to reconstruct the signal. We call
this approach

Since the

In both the

Our pitch detection alternatives included cepstral analysis; autocorrelation methods, both center clipping prior to the autocorrelation calculation (CLIP) and autocorrelation performed on the LPC error signal (SIFT); a slightly modified autocorrelation method, the Average Magnitude Difference Function (AMDF), which subtracts instead of multiplying in the autocorrelation summation; and a parallel processing method based on an elaborate voting scheme. We immediately dismissed the parallel processing method due to its complexity and little promise of significantly superior performance. Based on our design objective, we proposed to use the pitch detection algorithm that produced the most perceptually pleasing results. McGonegal (McGonegal 1977) reported that, of these methods, AMDF offered the best results. At this point it is necessary for us to write a "weaselly" sentence or two to explain why we didn't actually use it. The bottom line is that a different group implemented AMDF, and when we listened to their results we found them not much different from ours using the cepstral analysis method.
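For concreteness, the AMDF idea can be sketched as follows. This is illustrative only (we did not implement AMDF); the function name and the lag-range interface are our own:

```c
#include <math.h>

/* Sketch of AMDF pitch detection: sum absolute differences of lagged
   samples instead of multiplying them as autocorrelation does. A deep
   valley in the difference function marks the pitch period. */
int amdf_pitch(const float *x, int n, int min_lag, int max_lag)
{
    int k, i, best_lag = min_lag;
    float best = 1e30f;
    for (k = min_lag; k <= max_lag; k++) {
        float d = 0.0f;
        for (i = 0; i + k < n; i++)
            d += fabsf(x[i] - x[i + k]);
        d /= (float)(n - k);          /* normalize by number of terms */
        if (d < best) { best = d; best_lag = k; }
    }
    return best_lag;                  /* lag (in samples) of deepest valley */
}
```

Because a matching lag drives the differences toward zero, the minimum is selected rather than the maximum that an autocorrelation peak-picker would use.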

While a number of methods exist for performing pitch detection, we chose to limit our implementation to cepstral techniques because of their ease of implementation and intuitive appeal. We implemented the cepstral analysis as outlined in our second project. The cepstral coefficients are then used to determine whether the frame contains voiced or unvoiced speech; if the speech is determined to be voiced, an estimate of the pitch period is also obtained. By default our algorithm focuses on the cepstral coefficients representing the frequency range from 100 to 270 Hz. (Due to the speaker-dependent nature of the cepstral approach to pitch detection, we have included an input parameter to adjust this range as needed.) Our algorithm calculates the mean value of the nonnegative coefficients in this range. If the peak value is greater than 1.5 times the mean, the speech segment is classified as voiced, and the pitch period, taken from the location of the maximum-valued coefficient, is stored as the first excitation modeling parameter. If the peak value is less than 1.5 times the mean, the segment is classified as unvoiced, and the first excitation modeling parameter is set to zero. In either case, the standard deviation of the excitation signal is calculated and stored as the second excitation modeling parameter.
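The voiced/unvoiced rule just described can be sketched as a stand-alone function. This is a simplified version; the function name and array-of-coefficients interface are ours (the attached program works on liftered FFT output), but the 1.5 threshold matches MOD_FACTOR in the source:

```c
/* Decide voiced/unvoiced from cepstral coefficients c[lo..hi]:
   voiced if the peak exceeds 1.5 times the mean of the nonnegative
   values in the range. Returns the peak index (pitch period estimate)
   when voiced, -1 when unvoiced, mirroring the attached code. */
int cepstral_pitch(const float *c, int lo, int hi)
{
    int i, max_i = lo, count = 0;
    float max_v = c[lo], mean = 0.0f;
    for (i = lo; i <= hi; i++) {
        if (c[i] > max_v) { max_v = c[i]; max_i = i; }
        if (c[i] >= 0.0f) { mean += c[i]; count++; }
    }
    if (count > 0) mean /= (float)count;
    return (max_v > 1.5f * mean) ? max_i : -1;
}
```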

This processing results in two model parameters for each frame. While it would be possible to arbitrarily choose the frame size for the excitation modeling, for simplicity we chose to remain consistent with the frame length used in the vocal tract modeling, i.e., 30 msec. As a result, we have three parameters for every 30 msec frame, or just under 100 parameters per second.

We reconstruct the excitation signal as follows. For an unvoiced frame
the excitation signal is white noise with standard deviation equal to the
second excitation parameter. For a voiced frame we generate a periodic
signal using the function

e_{n} = r_{n} + (α m)/(1 + α m^{2}),   m = n mod γ

where r_{n} is a white noise sequence with the same standard deviation
as the excitation signal, α determines the steepness of the slope,
and γ is the pitch period. This function provides a periodic excitation
signal that retains a white noise component approximating that of the
original excitation signal.
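A generator in this spirit can be sketched as follows. The modular indexing, the choice of α, and the portable rand()-based noise source are our assumptions here; the attached program uses a random number library instead:

```c
#include <stdlib.h>

/* Sketch of a periodic-plus-noise excitation generator: a pulse shape
   alpha*m/(1 + alpha*m^2), m = i mod gamma, repeated every pitch period,
   plus white noise scaled to the desired standard deviation.
   The crude sum-of-uniforms (CLT) noise source is a stand-in. */
void voiced_excitation(float *e, int n, float stdev, int gamma, float alpha)
{
    int i;
    for (i = 0; i < n; i++) {
        int m = i % gamma;              /* position within the pitch period */
        float u = 0.0f;
        int k;
        for (k = 0; k < 12; k++)        /* roughly N(0,1) via CLT */
            u += (float)rand() / (float)RAND_MAX;
        u -= 6.0f;
        e[i] = stdev * u
             + alpha * (float)m / (1.0f + alpha * (float)m * (float)m);
    }
}
```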
The vocal tract and excitation information are combined via:

s_{n} = e_{n} - Σ_{k=1}^{20} b_{k}s_{n-k}

where e_{n} is the excitation signal and b_{k} are the LPC
codebook coefficients.
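The synthesis recursion can be written directly in C. This is a simplified sketch with zero initial conditions (the function name is ours):

```c
/* All-pole synthesis: s[i] = e[i] - sum_{k=1..p} b[k]*s[i-k].
   The excitation e drives a filter built from LPC coefficients b[1..p];
   samples before the start of the signal are taken as zero. */
void lpc_synth(const float *e, float *s, int n, const float *b, int p)
{
    int i, k;
    for (i = 0; i < n; i++) {
        s[i] = e[i];
        for (k = 1; k <= p; k++)
            if (i - k >= 0)
                s[i] -= b[k] * s[i - k];
    }
}
```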

We performed cepstral analysis on the original signal (henceforth referred
to as

While the plots thus far are instructive, plots of the excitation signal
by itself provide a clearer view of the excitation signal modeling. These
plots are included in Figures 5--7 for
the original excitation signal, the excitation modeled by

There exist a large number of reasonable approaches for reaching our design goal. We considered a number of them and implemented a subset. Since our design goal was founded on intelligibility, we concluded that a quantitative evaluation would be of little use in assessing our ability to achieve our objective. Instead we relied on subjective assessments. Our assessments are rather imprecise and aim to provide a feel for our experiences rather than a definitive argument for a particular approach. Table 1 contains our estimates of the percentage of intelligible speech present in each speech signal for the two methods included in our final program.

There are five approaches that we evaluated ---

| Sentence | Speaker 1 | Speaker 2 | Speaker 3 | Speaker 1 | Speaker 2 | Speaker 3 |
|----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 1 | 80% | 60% | 50% | 70% | 20% | 20% |
| 2 | 60% | 70% | 70% | 30% | 50% | 30% |
| 3 | 70% | 40% | 100% | 20% | 20% | 30% |
| 4 | 70% | 60% | 90% | 40% | 20% | 20% |
| 5 | 80% | 80% | 90% | 40% | 10% | 20% |

Our project guidelines made it clear that we were not to concern ourselves
with the number of bits required to represent the speech; however, it may
be of interest to note that our approach can easily be modified to squeeze
as much information out of each bit as possible. We chose to use a 10-bit
codebook for the LPC coefficients, but we certainly could have reduced this
without much loss of intelligibility. A 6-bit codebook should suffice.
As we saw in the comparison between the

| Compression technique | Parameters per second | Bits per second |
|-----------------------|-----------------------|-----------------|
| | 33.3 | 200 |
| | 33.3 + 1 | 200 + 4 |
| | 66.6 | 667 |
| | 99.9 | 1400 |
| | 99.9 | 1400 |

All of these bit rates could be reduced further by additional coding
techniques. For example, the

This was completely falsified so that students could have a template table for their own submissions. If the project was really a two-person project, the activity log should have times for each student.

| Activity | Time (in minutes) |
|----------|-------------------|
| Designing | 90 |
| Coding | 120 |
| Debugging | 15 |
| Testing | 60 |
| Writing Report | 120 |
| Other (installing SOX) | 15 |
| Total | 420 |

The entire project was programmed in 'C' and the source code is attached
at the end of this report. Also, the last page of the report (after the
source code) is the "Project 4S Information Sheet." Our executable
code allows two modes of operation. The default mode processes using the

- L.R. Rabiner, M.J. Cheng, A.E. Rosenberg, and C.A. McGonegal, "A Comparative Performance Study of Several Pitch Detection Algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-24, no. 5, pp. 399-418, 1976.
- C.A. McGonegal, "A Subjective Evaluation of Pitch Detection Methods Using LPC Synthesized Speech," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-25, no. 6, 1977.

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
```
#include "/home/offset/a/taylor/Src/Recipes/recipes/nrutil.h"
#include "/home/offset/a/taylor/Src/Recipes/recipes/nr.h"
#include "/home/offset/a/taylor/Src/Recipes/Vrecipes/randlib.h"
#include "hw4.h"
#define MOD_FACTOR 1.5
#define OTHER 0
#define MALE 1
#define FEMALE 2
#define CHILD 3
int main(int argc, char *argv[])
{
int i;
int j;
int k;
int N_flag;
int pole;
int itemp;
int num;
int seg_len;
int seg_num;
int filter_order;
int* data;
int pad_location;
int ID;
int sampling_rate;
int lifter_from_this_sample;
int lifter_till_this_sample;
float ftemp;
float rmse;
float errn;
float* filter_coeffs;
float* ceps_coeffs;
float e;
float* gen_e;
float err_stdev;
float err_mean;
float* segment;
float* windowed_segment;
int non_zero_count;
int max_index;
int pitch_period;
int num_codes;
int category_is;
/* long_segment is of length 1024 samples. It comprises the windowed segment
in the centre padded left and right by an appropriate number*/
double* long_segment;
double* fft_segment;
double non_zero_sum;
double max_samp;
FILE* infile;
FILE* errfile;
FILE* gen_errfile;
FILE* cepsfile;
FILE* lpcfile;
float* gen_err;
float** real_cep;
float** code_cep;
float** code_lpc;
float** codeword;
float* error_signal;
float* output_signal;
char fname[55];
char out_fname[55];
char temp_str[90];
char num_codes_string[8];
char code_fname[15];
char group_name[5];
char CODEBOOKS_EXIST;
if (( argc > 1 ) && ( !strcmp (argv [1], "-help" ))) {
print_directions();
}
/*the default values are assigned here*/
strcpy(fname, IN_DEF_FILE);
strcpy(out_fname, OUT_DEF_FILE);
strcpy(code_fname, CODE_DEF_DIR);
N_flag=1;
pole=0;
num_codes=DEF_CODEBK_SIZE;
strcpy(num_codes_string, "2");
num=DEF_DAT;
filter_order= 20;
seg_len=SEGMENT_LENGTH;
category_is=OTHER;
ID=0;
CODEBOOKS_EXIST=1;
sampling_rate=16000;
/*The for loop below works in the command line arguments into the program */
for(i=1;i=0; j--) {
if(((k-1)*seg_len+j-pad_location)>=0) {
long_segment[2*j]=0.0/*(double) data[(k-1)*seg_len+j-pad_location]*/;
} else {
long_segment[2*j]=0.0;
}
long_segment[2*j+1]=0.0;
}
/* Right pad*/
for(j=(pad_location+seg_len+1); j<1024; j++) {
if(((k-1)*seg_len+j)lifter_till_this_sample)||(jmax_samp) {
max_samp=long_segment[2*j];
max_index=j;
}
if((long_segment[2*j]>=0.0)&&(j<=lifter_till_this_sample)&&
(j>=lifter_from_this_sample)) {
non_zero_count++;
non_zero_sum+=fabs(long_segment[2*j]);
}
}
non_zero_sum/=non_zero_count;
/* Pitch detection is done here : If the max value is greater than the
average non-negative signal over the liftered signal, we claim a
pitch to have been detected*/
if((max_samp>(MOD_FACTOR*non_zero_sum))&&(N_flag!=0)) {
pitch_period=max_index;
} else {
pitch_period=-1;
}
lpc(windowed_segment, seg_len, filter_order, filter_coeffs, &rmse, &errn);
/* Calculate error--->Initialization*/
for(j=0;j=0) {
e+=filter_coeffs[i]*segment[j-i];
}
} else {
e+=filter_coeffs[i]*(float)data[(k-1)*seg_len+j-i];
}
}
if(!CODEBOOKS_EXIST) {
fprintf(errfile, "%f\n", e);
}
err_mean+=e;
err_stdev+=e*e;
}
err_mean/=(float)(seg_len);
err_stdev/=(float)(seg_len);
err_stdev-=(err_mean*err_mean);
if(err_stdev>0.0) {
err_stdev=sqrt(err_stdev);
} else {
err_stdev=0.0;
}
/* At this stage... use the voiced unvoiced decision
plus standard deviation of the error signal to generate
an 'error' signal.
To recap - Parameters used are :
a. (optional) Voiced/unvoiced flag : 0 if unvoiced, 1 if otherwise;
b. standard deviation of the error for the frame;
c. pitch period : -1 if unvoiced, something +ve if voiced; */
/* An excitation signal is generated as and how we have classified the frame */
if(pitch_period>0) {
voiced_error_gen(gen_e, seg_len, err_stdev, pitch_period);
} else {
unvoiced_error_gen(gen_e, seg_len, err_stdev);
}
for(j=0; j new segment begins */
if(CODEBOOKS_EXIST) {
codeword=(float **)matrix(1, seg_num, 1, filter_order);
code_lpc=(float **)matrix(1, num_codes, 1, filter_order); /*read codebook LPC*/
code_cep=(float **)matrix(1, num_codes, 1, filter_order); /*read codebook CEPS*/
}
/* Freeing memory */
free_ivector(data, 0, num-1);
free_vector(gen_e, 0, seg_len-1);
free_vector(windowed_segment, 0, seg_len-1);
free_dvector(long_segment, 0, (2*1024)-1);
free_dvector(fft_segment, 0, 1024-1);
free_vector(segment, 0, seg_len-1);
free_vector(filter_coeffs, 0, filter_order);
free_vector(ceps_coeffs, 1, filter_order);
if(CODEBOOKS_EXIST) {
for(i=1; i<=num_codes; i++) {
for(j=1; j<=filter_order; j++) {
fscanf(cepsfile,"%f", &ftemp);
code_cep[i][j]=ftemp;
fscanf(lpcfile,"%f", &ftemp);
code_lpc[i][j]=ftemp;
}
}
/* At this stage... have frame by frame data on cepstral coefficients
have codebooks on lpc and cepstral coeffs.
Proceed with the association
Output is stored in codeword */
code_select(real_cep, code_cep, code_lpc, codeword, seg_num, num_codes, filter_order);
free_matrix(code_cep, 1, num_codes, 1, filter_order);
free_matrix(code_lpc, 1, num_codes, 1, filter_order);
/* Incorporate inverse filtering process */
output_signal=(float *)vector(1, num);
for(k=1;k<=seg_num;k++) {
for(i=1;i<=seg_len;i++) {
output_signal[(k-1)*seg_len+i] = error_signal[(k-1)*seg_len+i];
for(j=1;j<=filter_order;j++) {
/* Generating output using excitation signal
and LPC coefficients from the codebook */
if(((k-1)*seg_len+i-j)>=1) {
output_signal[(k-1)*seg_len+i] -= codeword[k][j]*output_signal[(k-1)*seg_len+i-j];
}
}
printf("%d\n", (int)output_signal[(k-1)*seg_len+i]);
}
}
free_vector(output_signal, 1, num);
free_matrix(codeword, 1, seg_num, 1, filter_order);
fclose(lpcfile);
fclose(cepsfile);
}
free_matrix(real_cep, 1, seg_num, 1, filter_order);
free_vector(error_signal, 1, num);
if(CODEBOOKS_EXIST==0) {
fclose(errfile);
}
if(CODEBOOKS_EXIST==0) {
fclose(gen_errfile);
}
writeseed();
return 0;
}

```
void code_select(float **real_cep, float **code_cep, float **code_lpc, float **codeword,
int seg_num, int num_codes, int filter_order)
{
int i;
int k;
int j;
float err;
float emin;
for(k=1;k<=seg_num;k++) {
emin = 9999999.9;
for(i=1;i<=num_codes;i++) {
err = 0.0;
/* Measuring difference between the generated codeword and one from the
cepstral codebook*/
for(j=1;j<=filter_order;j++) {
err += (double)fabs((float)real_cep[k][j] - (float)code_cep[i][j]);
}
if(err<emin) {
emin=err;
/* Nearest codebook entry (in cepstral distance) supplies the LPC codeword */
for(j=1;j<=filter_order;j++) {
codeword[k][j]=code_lpc[i][j];
}
}
}
}
}
```

```
male (default)\n");
printf(" female\n");
printf(" all_males\n");
printf(" all_females\n");
printf(" -segl n segment length\n");
printf(" -group *char group name to decide cepstrum liftering.\n");
printf(" Valid options are -> O or o (default);\n");
printf(" M or m male;\n");
printf(" F or f female;\n");
printf(" J or j child.\n");
printf(" +P use popen\n");
printf(" +N dont classify voiced/unvoiced\n");
printf("\nDESCRIPTION\n");
printf("Default input file : %s\n", IN_DEF_FILE);
printf("Default codebook dir : %s\n", CODE_DEF_DIR);
printf("Default codebook size : %d\n", DEF_CODEBK_SIZE);
printf("Default number of records : %d\n", DEF_DAT);
printf("Default segment length : %d\n", SEGMENT_LENGTH);
printf("Default sampling rate : 16000 Hz\n");
printf("Default filter order : 20\n");
exit(0);
}
```

```
#include
```
#include "hw4.h"
#include "/home/offset/a/taylor/Src/Recipes/Vrecipes/randlib.h"
void unvoiced_error_gen(float *segment, int seg_len, float err_stdev)
{
int i;
/* The unvoiced excitation signal is just white noise with the
desired variance */
for (i=0; i<seg_len; i++) {
/* gennor(mean, sd) is assumed from the random number library included above */
segment[i]=gennor(0.0, err_stdev);
}
}

```
#include <math.h>
```
#include "hw4.h"
#include "/home/offset/a/taylor/Src/Recipes/Vrecipes/randlib.h"
void voiced_error_gen(float *segment, int seg_len, float err_stdev, int pitch_period)
{
float var;
float mult_factor;
float ftemp;
float const_factor;
int i;
int j;
int num_peaks;
var=err_stdev*err_stdev*(float)seg_len;
num_peaks=(int)((float)seg_len/(float)pitch_period);
mult_factor = 0.95*sqrt(var/(float) num_peaks);
const_factor=10.0;
j=0;
for(i=0; i

```
#define PI 3.14159265
void hamm(float s[], float hs[], int n)
{
double omega;
double w;
int k;
omega=2*PI/(n-1);
for(k=0; k<n; k++) {
w=0.54-0.46*cos(omega*k); /* Hamming window weight */
hs[k]=s[k]*w;
}
}
```

```
#define PI 3.14159265
void dhamm(double s[], double hs[], int n)
{
double omega;
double w;
int k;
omega=2*PI/(n-1);
for(k=0; k<n; k++) {
w=0.54-0.46*cos(omega*k); /* Hamming window weight */
hs[k]=s[k]*w;
}
}
```

```
#include <math.h>
```
#define PI 3.14159265
#define c_mag(c1) sqrt((c1.r)*(c1.r) + (c1.i)*(c1.i))
/* A structure to hold a complex number */
typedef struct {
double r;
double i;
} COMPLEX;
/* Authors: Varun Madhok and Chris Taylor
Date: December 6, 1996
Purpose: Returns the product of two complex numbers c1 and c2 */
COMPLEX c_mult(COMPLEX c1, COMPLEX c2)
{
COMPLEX c3;
c3.r=c1.r*c2.r - c1.i*c2.i;
c3.i=c1.i*c2.r + c1.r*c2.i;
return c3;
}
/* Authors: Varun Madhok and Chris Taylor
Date: December 6, 1996
Purpose: Returns the sum of two complex numbers c1 and c2 */
COMPLEX c_add(COMPLEX c1, COMPLEX c2)
{
COMPLEX c3;
c3.r=c1.r + c2.r;
c3.i=c1.i + c2.i;
return c3;
}
/* Authors: Varun Madhok and Chris Taylor
Date: December 6, 1996
Purpose: Returns the difference of two complex numbers c1 and c2 */
COMPLEX c_sub(COMPLEX c1, COMPLEX c2)
{
COMPLEX c3;
c3.r=c1.r - c2.r;
c3.i=c1.i - c2.i;
return c3;
}
/* Authors: Varun Madhok and Chris Taylor
Date: December 6, 1996
Reference: Steiglitz, Introduction to Discrete Systems */
int fftmag(double s[], double mag[], int n)
{
int i;
int j;
int m;
int l;
int length;
int loc1;
int loc2;
double arg;
double w;
COMPLEX c;
COMPLEX z;
COMPLEX f[1024];
for(i=0; i= m)
j += n/(m+m);
}
f[i].r=s[j];
f[i].i=0;
}
for(length=2; length <= n; length += length) {
w = -2.0*PI/(double)length;
for(j=0; j

```
#include
```
#define MAX_LPC_ORDER 40
#define EVEN(x) !(x%2)
int lpc(float x[], int n, int p, float b[], float* rmse, float* errn)
{
int i;
int k;
float reflect_coef[MAX_LPC_ORDER+1];
float auto_coef[MAX_LPC_ORDER+1];
float sum;
float temp1,temp2;
float current_reflect_coef;
float pred_error;
for(i=0; i<=p; i++) {
sum = 0.0;
for(k=0; k< n-i; k++) {
sum += (x[k] * x[k+i]);
}
auto_coef[i] = sum;
}
*rmse = auto_coef[0];
if(*rmse == 0.0) {
return 1; /* Zero power. */
}
pred_error = auto_coef[0];
b[0] = 1.0;
for (k=1; k<=p; k++) {
sum = 0.0;
for(i=0; i