Transcription

Training a neural network to transcribe chords

Published June 8, 2024 | By Steve

The problem with training neural networks is always finding the training data. Samples of individual notes are available on the internet; samples of chords are not.

There are 16 chord types that are recognized by the Audiophile’s Analyzer. There are 96 notes in 8 octaves. So, it is possible to play 16 × 96 = 1536 different chords on a piano keyboard. It is not practical to play, record and label so many samples.
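The count above can be checked with a short sketch. The note naming (octave digit followed by the note name, sharps only) follows the 0C to 7B labels used for the network outputs; the 16 chord types are not listed in this post, so they are represented here only by an index.

```python
from itertools import product

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
OCTAVES = range(8)       # octaves 0..7
N_CHORD_TYPES = 16       # count taken from the text; the types themselves are not listed

# Every possible root note, named "0C" .. "7B".
roots = [f"{octave}{name}" for octave, name in product(OCTAVES, NOTE_NAMES)]

# Every (root, chord type) combination.
chords = [(root, chord_type) for root in roots for chord_type in range(N_CHORD_TYPES)]

print(len(roots))    # 96
print(len(chords))   # 1536
```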

To solve this, a feature has been added to the CNN tab of the Audiophile’s Analyzer that produces MIDI files for every possible chord for a selected instrument.
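For readers who want to experiment outside the application, here is a minimal sketch of writing a one-chord MIDI file using only the Python standard library. It is not the code used by the Audiophile’s Analyzer. The function names, the velocity, the note length and the choice of MIDI note 60 for the root are all assumptions made for illustration.

```python
import struct

def vlq(value: int) -> bytes:
    """Encode a non-negative integer as a MIDI variable-length quantity."""
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(out))

def chord_midi(notes, program=0, ticks=960, division=480) -> bytes:
    """Return the bytes of a format-0 MIDI file that plays `notes` together."""
    track = bytes([0x00, 0xC0, program])             # delta 0, program change (instrument)
    for n in notes:
        track += bytes([0x00, 0x90, n, 100])         # delta 0, note on, velocity 100 (assumed)
    for i, n in enumerate(notes):
        delta = vlq(ticks) if i == 0 else b"\x00"    # hold the chord, then release all notes
        track += delta + bytes([0x80, n, 0])         # note off
    track += b"\x00\xFF\x2F\x00"                     # end-of-track meta event
    header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, division)
    return header + b"MTrk" + struct.pack(">I", len(track)) + track

# A major triad (root, +4 and +7 semitones) on MIDI note 60.
data = chord_midi([60, 64, 67])
```

The bytes can be written to disk with `open("chord.mid", "wb").write(data)` and then converted to a .wav file as described in the next step.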

These MIDI files are then converted to .wav audio files using a third-party application. When the Audiophile’s Analyzer reads a .wav file, it produces a spectrogram.

The Training set creation utility is then used to slice the spectrogram to create the training images.

Monochrome images are used to train the CNN; color is for humans.

This utility will recognize the filename format and produce the labels file. Multiple labels will be applied to each file.

The Audiophile’s Analyzer is used to build and train a CNN using the labeled images.

The output layer of the neural network will have 96 outputs, labeled from 0C to 7B. During training the required outputs will be set for each note in the file label.
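The target for one training image can be pictured as a vector of 96 values with a 1 for each note in the label. The sketch below is illustrative only: the ordering of the outputs within an octave (C up to B, sharps only) and the use of 0.0 and 1.0 as target values are assumptions, and `target_vector` is a hypothetical helper, not part of the application.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Output labels "0C" .. "7B", one per output of the network.
LABELS = [f"{octave}{name}" for octave in range(8) for name in NOTE_NAMES]
INDEX = {label: i for i, label in enumerate(LABELS)}

def target_vector(notes):
    """Build a 96-element target with 1.0 at each note named in the label."""
    target = [0.0] * len(LABELS)
    for note in notes:
        target[INDEX[note]] = 1.0
    return target

# A C major triad in octave 4 sets three of the 96 outputs.
target = target_vector(["4C", "4E", "4G"])
print([LABELS[i] for i, value in enumerate(target) if value == 1.0])
```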

Training results are recorded:

Using the Audiophile’s Analyzer to transcribe a .wav file:

No profile, no model, no problem

Published March 30, 2024 | By Steve

It can take a day or so to train a CNN model, or several minutes to create a profile, assuming that you have enough suitable samples. Fortunately, it is possible to transcribe without prior training for a particular instrument. To illustrate this, I have a single note played on a double bass (A octave 2).

To transcribe, I set the options to No Profile and Constant Q Mapping.

Selecting Use FFT Profile Result only

The result is

Selecting Use Spectrogram Analysis Result only I get

Audiophile’s Analyzer – Note sequence transcription.

Published March 29, 2024 | By Steve

Here I use a sequence of notes played on a clarinet.

The first note is A (octave 4), followed by A#; the final note is E (octave 4).

Using the built-in profile, the correlation algorithm gives:

Pretty good.

To improve this, I used the built-in CNN model in combination with the algorithm’s results.

Audiophile’s Analyzer – Simple use case 2

Published March 28, 2024 | By Steve

Here I perform the same task, i.e. transcribing a single note (A2), this time from a cello. This is a little more difficult, as the cello is a polyphonic instrument with very strong harmonics.

From the signal and FFT result we can see that the first overtone has more energy than the fundamental frequency.
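For reference, the expected positions of the fundamental and the first overtone can be worked out from the note name. This sketch assumes equal temperament with A4 tuned to 440 Hz, and assumes that the octave numbers used here match scientific pitch notation; `note_frequency` is a hypothetical helper.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_frequency(name, octave, a4=440.0):
    """Equal-temperament frequency in Hz, with octaves running from C to B."""
    semitones_from_a4 = (octave - 4) * 12 + NOTE_NAMES.index(name) - NOTE_NAMES.index("A")
    return a4 * 2 ** (semitones_from_a4 / 12)

fundamental = note_frequency("A", 2)   # A2
first_overtone = 2 * fundamental       # one octave above the fundamental

print(fundamental, first_overtone)     # 110.0 220.0
```

Under these assumptions the stronger peak in the FFT would sit near 220 Hz and the weaker fundamental near 110 Hz.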

This problem is reduced in the Multi-rate FFT spectrogram, because the higher-frequency FFT uses a shorter sample window, so the overtone is sampled over less time.

We use the built-in profile for the cello.

Here we can see that, as the note fades on the last beat, the overtone is also transcribed.

Since we know that this was a single note, the monophonic check box should be checked.

Now the result is as expected.

Using the built-in Convolutional Neural Network for the cello, we see leading and trailing silence; only the red line from the spectrogram has been transcribed.

The threshold for silence is calculated differently for the algorithm and the CNN. The algorithm sets the threshold on the fly during transcription based on a running average of the energy in the audio. For the CNN the threshold for silence is implemented when the spectrogram slices are prepared for training; the CNN was never trained on the parts of the note which were too quiet.
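The exact rule used by the algorithm is not given here, so the following is only one possible reading of "a running average of the energy": a frame is treated as silent when its energy is at or below a fixed fraction of the mean energy seen so far. The function name `silent_frames` and the `ratio` value of 0.1 are assumptions.

```python
def silent_frames(energies, ratio=0.1):
    """Flag frames whose energy is at or below `ratio` times the running mean energy."""
    flags = []
    total = 0.0
    for count, energy in enumerate(energies, start=1):
        total += energy
        running_mean = total / count          # mean of all frames seen so far
        flags.append(energy <= ratio * running_mean)
    return flags

# Leading silence, a note that fades, then trailing silence.
energies = [0.0, 0.0, 1.0, 0.9, 0.5, 0.03, 0.0]
flags = silent_frames(energies)
print(flags)   # [True, True, False, False, False, True, True]
```

Because the threshold follows the audio, a quiet recording and a loud one are treated alike. A fixed threshold applied when the training slices are prepared, as described for the CNN, does not adapt in this way.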

Audiophile’s Analyzer – Simple use case

Published March 23, 2024 | By Steve

The simplest music file to transcribe is a single note. Here I use a bassoon playing a single note A (octave 2) to walk through the simplest transcription.

From the signal and FFT result we can see that this is indeed a single note with a single dominant frequency.

The spectrogram confirms the simplicity of this example.

To transcribe this note, we will use the built-in bassoon profile and the default options (i.e. correlation).

The result is as expected.

Alternatively, we could have elected to use the built-in Convolutional Neural Network for the bassoon.

The result is a little different. The note ends just after the fourth beat. The CNN transcribes this as a 3-beat note, not a 4-beat note.