# \\NoIze// Sound Classification Tool

Last updated: October 9th, 2019

Welcome to a NoIze interactive notebook on sound classification. Here you can access the project's documentation or code repository.

To follow along with this demo, headphones are recommended so you can hear the sound examples. (Don't forget to turn down the volume first; you can always turn it back up.)

If you just want to read along and hear some audio, ignore the snippets of code, like the one below. However, I encourage you to fork this notebook so that you can experiment with the examples. You don't have to download or install anything onto your computer. If you don't have an account with 'notebooks.ai', you can create a free one here.

# Outline

### I. Hear the training data

Two small sound datasets can be found in this Jupyter Lab.

### II. Quick overview: MFCC vs FBANK features

I offer a bit of background to provide basic context for when you might extract MFCC or FBANK features from acoustic data.

### III. Build sound / noise classifier

I show how one can build a sound classifier, using the sound dataset and extracting MFCC features.

I also offer suggestions for what you can do next, for example, building a speech recognition model using the reduced speech commands dataset provided.

# Let's get started!

In [ ]:
# install what is required to use NoIze:
!pip install -r requirements.txt
import noize

# what is necessary to play audio files in this notebook:
import IPython.display as ipd

# what is necessary for visualizing MFCC and FBANK features:
from scipy.io import wavfile
from python_speech_features import mfcc, logfbank
import matplotlib.pyplot as plt
import numpy as np


### Set directories for training data

In [2]:
path2audiodata = './audiodata/'
path2_speechcommands_data = '{}speech_commands_sample/'.format(path2audiodata)
path2_backgroundnoise_data = '{}background_noise/'.format(path2audiodata)


# Hear the training data

Two small sound datasets can be found in this Jupyter Lab.

• background noise
• speech commands

One dataset contains recordings of background noise. Some of these were collected from freesound.org and others I recorded with my own smartphone.

The other dataset is a reduced collection from Google's speech commands dataset from 2017.

I offer these two datasets so you can explore how different acoustic data (noise vs. speech) as well features extracted from these data (MFCC vs FBANK) influence the training of acoustic models.

#### Background Noise: buzzing

In [3]:
buzzing = '{}buzzing/118340__julien-matthey__jm-noiz-buzz-01-neon-light21.wav'.format(
    path2_backgroundnoise_data)
ipd.Audio(buzzing)

Out[3]:

#### Background Noise: street

In [4]:
street = '{}street/2019-08-19 10.10.433.wav'.format(
    path2_backgroundnoise_data)
ipd.Audio(street)

Out[4]:

#### Background Noise: train

In [5]:
train = '{}train/331877.wav'.format(
    path2_backgroundnoise_data)
ipd.Audio(train)

Out[5]:

#### Speech Commands: nine

In [6]:
nine = '{}nine/e269bac0_nohash_0.wav'.format(
    path2_speechcommands_data)
ipd.Audio(nine)

Out[6]:

#### Speech Commands: right

In [7]:
right = '{}right/d0ce2418_nohash_1.wav'.format(
    path2_speechcommands_data)
ipd.Audio(right)

Out[7]:

#### Speech Commands: zero

In [8]:
zero = '{}zero/b3bb4dd6_nohash_0.wav'.format(
    path2_speechcommands_data)
ipd.Audio(zero)

Out[8]:

# Quick Overview: MFCC vs FBANK features

MFCC: Mel Frequency Cepstral Coefficients

FBANK: Log-Mel Filterbank Energies

All you really need to know is that both MFCC and FBANK features are derived from a more detailed representation, the Short-Time Fourier Transform (STFT). Note: the picture at the top of this page is a 3D representation of what STFT features convey: the frequencies and their energy levels in a sound recording.

Here's more on spectrograms.

STFT features contain detailed information about the frequencies in an acoustic signal over time. Great acoustic models can be trained on these features. However, to reduce computational complexity, additional calculations can be applied to reduce the number of features.

FBANK features are essentially STFT features reduced to the frequency bands most relevant for human hearing.

MFCC features are FBANK features further reduced in complexity, with redundant or collinear features removed.

I think this post does a good job walking through the steps. In reference to the 6 steps listed in that post:

• Steps 1 - 2: STFT
• Steps 3 - 4: FBANK
• Step 5 (optionally 6): MFCC
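To make steps 1 - 2 concrete, here is a small numpy sketch of the STFT. The 440 Hz tone, 16 kHz sampling rate, and 20 ms / 10 ms window settings are my own assumptions for illustration:

```python
import numpy as np

# Toy signal: 1 second of a 440 Hz tone sampled at 16 kHz (assumed values).
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

# Steps 1 - 2 (STFT): slice the signal into overlapping windows
# and apply the FFT to each window to get its power spectrum.
win, shift = 320, 160  # 20 ms windows shifted every 10 ms at 16 kHz
frames = np.stack([signal[i:i + win]
                   for i in range(0, len(signal) - win + 1, shift)])
power = np.abs(np.fft.rfft(frames * np.hamming(win), axis=1)) ** 2

print(power.shape)  # one row per 10 ms shift: (99, 161)
# The strongest bin should sit near 440 Hz (bin 440 * 320 / 16000 ≈ 9).
print(np.argmax(power.mean(axis=0)))
```

FBANK features would then apply mel filters to each row of `power` (steps 3 - 4), and MFCCs a discrete cosine transform on top of that (step 5).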

## Visualizing MFCCs and FBANK energies

Below I've put together a function that lets you explore how the feature extraction process is influenced by various parameters:

• soundfile: wavfile to be loaded (must be compatible with scipy.io.wavfile)
• features: either 'mfcc' or 'fbank'
• win_size_ms: window size in milliseconds over which to apply the FFT
• win_shift_ms: time in milliseconds between the starts of successive windows (the hop)
• num_filters: number of mel filters to apply; the default is 40, but try out more and fewer
• num_mfcc: (only considered when extracting 'mfcc' features) number of mel frequency cepstral coefficients to keep. For speech, 13-40 is typical; for scene analysis, 40 is typical.

Feel free to try different wavfiles, mess with the window sizes, etc. Note: if you are confused by the labels of the graphs, I apologize. I am still working on scaling the frequency and time axes to the features, something the librosa library did for me. I cannot import librosa in this JupyterLab, so I have to do it myself. Which is good, I guess.

In [1]:
def visualize_feats(soundfile, features='fbank', win_size_ms=20,
                    win_shift_ms=10, num_filters=40, num_mfcc=40):
    # load the wavfile so the sampling rate matches the file being visualized
    sr, data = wavfile.read(soundfile)
    win_samples = int(win_size_ms * sr // 1000)
    if 'fbank' in features:
        feats = logfbank(data,
                         samplerate=sr,
                         winlen=win_size_ms * 0.001,
                         winstep=win_shift_ms * 0.001,
                         nfilt=num_filters,
                         nfft=win_samples)
        axis_feature_label = 'Mel Filters'
    elif 'mfcc' in features:
        feats = mfcc(data,
                     samplerate=sr,
                     winlen=win_size_ms * 0.001,
                     winstep=win_shift_ms * 0.001,
                     nfilt=num_filters,
                     numcep=num_mfcc,
                     nfft=win_samples)
        axis_feature_label = 'Mel Freq Cepstral Coefficients'
    feats = feats.T
    plt.clf()
    plt.pcolormesh(feats)
    plt.ylabel('Num {}'.format(axis_feature_label))
    plt.xlabel('Frames (shifted every {} ms)'.format(win_shift_ms))
    plt.title('{}s Visualized: {}'.format(features.upper(), soundfile))
    plt.show()


### Background noise

Loaded files you can try out:

• buzzing
• street
• train

#### MFCC

In [93]:
visualize_feats(train, features='mfcc', num_mfcc=40, win_size_ms=20, win_shift_ms=10)


#### FBANK

In [92]:
visualize_feats(train, features='fbank', num_filters=40, win_size_ms=20, win_shift_ms=10)


### Speech Commands

Loaded files you can try out:

• nine
• right
• zero

#### MFCC

In [102]:
visualize_feats(zero, features='mfcc', num_mfcc=40, win_size_ms=20, win_shift_ms=10)


#### FBANK

In [109]:
visualize_feats(zero, features='fbank', num_filters=40, win_size_ms=20, win_shift_ms=10)


## Build a Sound Classifier!

In [9]:
from noize.templates import noizeclassifier


### Set directory for saving newly created files

In [10]:
path2_features_models = './feats_models/'


#### Name Project

Tip: include something about the data used to train the classifier.

In [11]:
project_backgroundnoise = 'background_noise'


Running the following code will extract 'mfcc' features from the audio data provided. These features will then be used to train a convolutional neural network to classify such data as most similar to 'buzzing', 'street', or 'train' noise.

In [12]:
noizeclassifier(classifer_project_name = project_backgroundnoise,
                audiodir = path2_backgroundnoise_data,
                feature_type = 'mfcc')

multiple models found. chose this model:
feats_models/background_noise/models/mfcc_40_1.0/background_noise_model/bestmodel_background_noise_model.h5

Features have been extracted.


Using TensorFlow backend.

Loading previously trained classifier.


### Use the classifier to classify new data!

In [13]:
cafe_noise = '{}cafe18.wav'.format(path2audiodata)
ipd.Audio(cafe_noise)

Out[13]:
In [14]:
noizeclassifier(classifer_project_name = project_backgroundnoise,
                audiodir = path2_backgroundnoise_data,
                target_wavfile = cafe_noise,  # the sound we want to classify
                feature_type = 'mfcc')

multiple models found. chose this model:
feats_models/background_noise/models/mfcc_40_1.0/background_noise_model/bestmodel_background_noise_model.h5

Features have been extracted.

Label classified:  train


## Challenges

1)

Try training the background noise classifier with the feature_type 'fbank' instead of 'mfcc'. Do you notice a difference? Does the cafe noise still get labeled as 'train' noise?

2)

Collect a sound or two you would like to classify with this classifier, for example from freesound.org. You will need to create a free account in order to download sounds, which I highly encourage. Note: as of now, NoIze can only process monochannel, 16-bit wavfiles. The link offered should be set to only show sounds that adhere to those requirements.
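To check whether a downloaded file meets those requirements before feeding it to NoIze, a small helper (my own sketch, not part of NoIze) using only Python's standard library `wave` module could look like this:

```python
import wave

def is_mono_16bit(path):
    """Return True if the wav file is monochannel with 16-bit samples."""
    with wave.open(path, 'rb') as wf:
        # 1 channel == mono; a sample width of 2 bytes == 16-bit audio
        return wf.getnchannels() == 1 and wf.getsampwidth() == 2
```

For example, `is_mono_16bit(cafe_noise)` should return True for the files used in this notebook.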

3)

Build a speech commands classifier using the data provided in the speech_commands_sample folder. Try adjusting the arguments for noizeclassifier, such as the features extracted ('mfcc' vs 'fbank').

How do you think the classifier will classify the following words: 'cat', 'marvin', and 'wow'?

• cat
In [15]:
cat = '{}cat.wav'.format(path2audiodata)
ipd.Audio(cat)

Out[15]:
• marvin
In [16]:
marvin = '{}marvin.wav'.format(path2audiodata)
ipd.Audio(marvin)

Out[16]:
• wow
In [17]:
wow = '{}wow.wav'.format(path2audiodata)
ipd.Audio(wow)

Out[17]:

And how does the classifier actually classify them? Are the classifications the same for both 'mfcc' and 'fbank' features? Which better matches your expectations?

4)

Adjust the model architecture in the file 'cnn.py', located in the directory './noize/models/'. You can try implementing another convolutional neural network (CNN) architecture, or even try adding a long short-term memory network (LSTM). The latter option would require a bit of fiddling with data input sizes.
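As a starting point for challenge 4, here is a rough sketch of what a CNN followed by an LSTM could look like in Keras. This is not NoIze's actual cnn.py; the input shape, layer sizes, and number of labels are assumptions you would need to match to your extracted features:

```python
from tensorflow.keras import layers, models

def build_cnn_lstm(num_frames=99, num_feats=40, num_labels=3):
    # Input: (time frames, MFCC/FBANK coefficients, 1 channel) - assumed shape.
    model = models.Sequential([
        layers.Input(shape=(num_frames, num_feats, 1)),
        layers.Conv2D(16, (3, 3), padding='same', activation='relu'),
        layers.MaxPooling2D(pool_size=(1, 2)),  # pool the feature axis, keep all time steps
        layers.Reshape((num_frames, -1)),       # flatten to (time, features) for the LSTM
        layers.LSTM(32),                        # summarize the sequence into one vector
        layers.Dense(num_labels, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```

The pooling step only shrinks the feature axis so that every time frame survives for the LSTM to read in order; this is the "fiddling with data input sizes" the text mentions.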

### A little prompt to get you started

In [18]:
project_speechcommands = 'speech_commands'

In [19]:
noizeclassifier(classifer_project_name = project_speechcommands,
                audiodir = path2_speechcommands_data,
                target_wavfile = cat,  # file for classification - test out the other words as well
                feature_type = 'mfcc'  # try 'fbank' features and see if the validation score increases or decreases
                )

multiple models found. chose this model:
feats_models/speech_commands/models/mfcc_40_1.0/speech_commands_model/bestmodel_speech_commands_model.h5

Features have been extracted.

Label classified:  zero

In [20]:
noizeclassifier(classifer_project_name = project_speechcommands,
                audiodir = path2_speechcommands_data,
                target_wavfile = cat,
                feature_type = 'fbank'
                )

multiple models found. chose this model:
feats_models/speech_commands/models/fbank_40_1.0/speech_commands_model/bestmodel_speech_commands_model.h5

Features have been extracted.

Label classified:  right

In [21]:
noizeclassifier(classifer_project_name = project_speechcommands,
                audiodir = path2_speechcommands_data,
                target_wavfile = zero,
                feature_type = 'mfcc'
                )

multiple models found. chose this model:
feats_models/speech_commands/models/mfcc_40_1.0/speech_commands_model/bestmodel_speech_commands_model.h5

Features have been extracted.

Label classified:  zero

In [22]:
noizeclassifier(classifer_project_name = project_speechcommands,
                audiodir = path2_speechcommands_data,
                target_wavfile = zero,
                feature_type = 'fbank'
                )

multiple models found. chose this model:
feats_models/speech_commands/models/fbank_40_1.0/speech_commands_model/bestmodel_speech_commands_model.h5

Features have been extracted.