Scalability and efficiency are desired in neural speech codecs, which must support a wide range of bitrates for applications
on various devices. We propose a collaborative quantization (CQ) scheme to jointly learn the codebook of LPC
coefficients and the corresponding residuals. CQ does not simply shoehorn LPC into a neural network, but bridges
the computational capacity of advanced neural network models and traditional, yet efficient and domain-specific
digital signal processing methods in an integrated manner. We demonstrate that CQ achieves much higher quality
than its predecessor at 9 kbps with even lower model complexity. We also show that CQ can scale up to 24 kbps where it
outperforms AMR-WB and Opus. As a neural waveform codec, CQ has fewer than 1 million parameters,
significantly fewer than many other generative models.
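To make the LPC side of the description concrete, here is a minimal, standard-library-only sketch of classical linear predictive analysis: it estimates predictor coefficients with the Levinson-Durbin recursion and computes the prediction residual that a residual coder would then compress. It is not the code in this repository and it does not include any quantization or neural network; the function names (`autocorrelation`, `levinson_durbin`, `lpc_residual`) and the toy signal are assumptions made for illustration only.

```python
import random


def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of a frame x."""
    n = len(x)
    return [sum(x[i] * x[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (a, err) with x[n] ~ sum_k a[k-1] * x[n-k], k = 1..order.

    err is the final prediction error energy.
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def lpc_residual(x, a):
    """Prediction residual e[n] = x[n] - sum_k a[k-1] * x[n-k]."""
    order = len(a)
    residual = []
    for n in range(len(x)):
        pred = sum(a[k] * x[n - 1 - k] for k in range(order) if n - 1 - k >= 0)
        residual.append(x[n] - pred)
    return residual


# Toy signal (hypothetical): a stable second-order autoregressive process.
rng = random.Random(0)
signal = [0.0, 0.0]
for _ in range(2000):
    signal.append(1.3 * signal[-1] - 0.4 * signal[-2] + rng.gauss(0.0, 1.0))

coeffs, _ = levinson_durbin(autocorrelation(signal, 2), 2)
residual = lpc_residual(signal, coeffs)
print("estimated coefficients:", coeffs)
print("signal energy:", sum(v * v for v in signal))
print("residual energy:", sum(v * v for v in residual))
```

In a conventional codec the coefficients and the residual are quantized separately with fixed rules; the abstract above describes CQ as learning both quantizers jointly instead.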
Please consider citing our papers if this helps.
@inproceedings{zhen2020cq,
author={Kai Zhen and Mi Suk Lee and Jongmo Sung and Seungkwon Beack and Minje Kim},
title={{Efficient And Scalable Neural Residual Waveform Coding with Collaborative Quantization}},
year=2020,
booktitle={Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2020},
doi={10.1109/ICASSP40776.2020.9054347},
url={https://ieeexplore.ieee.org/document/9054347}
}
@inproceedings{Zhen2019,
author={Kai Zhen and Jongmo Sung and Mi Suk Lee and Seungkwon Beack and Minje Kim},
title={{Cascaded Cross-Module Residual Learning Towards Lightweight End-to-End Speech Coding}},
year=2019,
booktitle={Proc. Interspeech 2019},
pages={3396--3400},
doi={10.21437/Interspeech.2019-1816},
url={http://dx.doi.org/10.21437/Interspeech.2019-1816}
}

The experiments are conducted on the TIMIT corpus: https://catalog.ldc.upenn.edu/LDC93S1
python main.py --learning_rate_tanh 0.0002 # the learning rate for the 1st codec
--learning_rate_greedy_followers '0.00002 0.000002' # the learning rate for the added codecs and finetuning
--epoch_tanh 200 # the number of epochs for the 1st codec
--epoch_greedy_followers '50 50' # the number of epochs for the added codecs and finetuning
--batch_size 128
--num_resnets 2 # number of neural codecs involved
--training_mode 4 # see main.py for specifications
--base_model_id '1993783' # used for finetuning and evaluation
--from_where_step 2 # used for finetuning and evaluation
--suffix '_greedy_all_' # the suffix of the name of the model to be saved
--bottleneck_kernel_and_dilation '9 9 100 20 1 2' # configuration of the ResNet block
--save_unique_mark 'follower_all' # the name of the model to be saved
--the_strides '2' # the stride value for the down sampling CNN layer
--coeff_term '60 10 10 0' # coefficients for the loss terms
--res_scalar 1.0
--pretrain_step 2 # number of pretraining steps without quantization
--target_entropy 2.2 # target entropy
--num_bins_for_follower '32 32' # number of quantization bins
--is_cq 1 # whether collaborative quantization is enabled
python main.py --training_mode 0 # base_model_id must be set correctly; the other settings do not need to be changed
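As a rough aid for reading --target_entropy and --num_bins_for_follower: with 32 quantization bins a symbol can carry at most log2(32) = 5 bits, so a target of 2.2 bits per symbol asks for a noticeably more skewed bin distribution. The sketch below shows how the empirical entropy of a sequence of bin indices, and a bitrate derived from it, could be computed. How main.py actually measures entropy is not shown here, so this is an assumption; `empirical_entropy`, `estimated_bitrate`, and the `symbols_per_second` parameter are hypothetical names.

```python
import math
from collections import Counter


def empirical_entropy(symbols):
    """Shannon entropy in bits per symbol of a sequence of bin indices."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def estimated_bitrate(symbols, symbols_per_second):
    """Bits per second if every symbol is coded at the empirical entropy.

    symbols_per_second is a hypothetical parameter: the number of quantized
    code values the encoder emits per second of audio.
    """
    return empirical_entropy(symbols) * symbols_per_second


# A uniform distribution over 32 bins reaches the 5-bit upper bound;
# a skewed distribution over the same bins needs fewer bits per symbol.
uniform = list(range(32)) * 10
skewed = [0] * 200 + [1] * 60 + list(range(2, 32)) * 2
print("uniform entropy:", empirical_entropy(uniform))
print("skewed entropy:", empirical_entropy(skewed))
```

The resulting bitrate depends on how many code values are produced per second, which in turn depends on the downsampling configured by --the_strides and on the number of cascaded codecs.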
Our work builds upon several recent publications on end-to-end speech coding, trainable quantizers, and LPCNet.
Some of the code is borrowed from https://github.com/sri-kankanahalli/autoencoder-speech-compression