The problem with training neural networks is always finding the training data. Samples of individual notes are available on the internet. Samples of chords are not available.

There are 16 chord types that are recognized by the Audiophile’s Analyzer. There are 96 notes in 8 octaves. So, it is possible to play 16 × 96 = 1536 different chords on a piano keyboard. It is not practical to play, record and label so many samples.

To solve this, a feature has been added to the CNN tab of the Audiophile’s Analyzer which produces MIDI files for every possible chord for a selected instrument.
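As a rough illustration of what one generated chord file contains, here is a minimal sketch that builds a single-track Standard MIDI File in memory. It is not the Analyzer's own code. The function names, the 4-beat hold, the velocity, and the mapping of A octave 2 to MIDI note 45 are all assumptions made for the example.

```python
import struct

def vlq(value):
    """Encode a non-negative integer as a MIDI variable-length quantity."""
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(out))

def chord_midi_bytes(notes, program=0, ticks_per_beat=480, beats=4, velocity=96):
    """Build a format-0 MIDI file that sounds all `notes` together (hypothetical helper)."""
    track = bytearray()
    track += vlq(0) + bytes([0xC0, program])           # program change selects the instrument
    for n in notes:
        track += vlq(0) + bytes([0x90, n, velocity])   # note on, all at tick 0
    for i, n in enumerate(notes):
        delta = ticks_per_beat * beats if i == 0 else 0
        track += vlq(delta) + bytes([0x80, n, 0])      # note off once the hold has elapsed
    track += vlq(0) + bytes([0xFF, 0x2F, 0x00])        # end-of-track meta event
    header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, ticks_per_beat)
    return header + b"MTrk" + struct.pack(">I", len(track)) + bytes(track)

# A major triad (root, +4, +7 semitones) on an assumed A2 = MIDI note 45.
data = chord_midi_bytes([45, 49, 52])
```

Writing `data` to a file opened in binary mode gives a file that a MIDI-to-.wav converter should accept. Looping over every root and every chord type would produce the full set.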

These MIDI files are then converted to .wav audio files using a third-party application. When the Audiophile’s Analyzer reads one of these .wav files, it produces a spectrogram.

The Training set creation utility is then used to slice the spectrogram to create the training images.
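The slicing step can be sketched as cutting fixed-width windows along the time axis. This is only an illustration: the slice width and step that the utility actually uses are not stated here, so `width` and `step` are hypothetical parameters, and the spectrogram is modeled as a plain list of rows (one row per frequency bin).

```python
def slice_spectrogram(spectrogram, width, step):
    """Cut a spectrogram into fixed-width windows along the time axis.

    `spectrogram` is a list of rows, one per frequency bin; every row has
    one value per time frame. Each returned slice keeps all rows.
    """
    n_frames = len(spectrogram[0])
    slices = []
    for start in range(0, n_frames - width + 1, step):
        slices.append([row[start:start + width] for row in spectrogram])
    return slices

# Toy spectrogram: 3 frequency bins by 10 time frames.
toy = [[bin_ * 10 + t for t in range(10)] for bin_ in range(3)]
windows = slice_spectrogram(toy, width=4, step=2)
```

A step smaller than the width gives overlapping slices, which yields more training images from the same recording.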

Monochrome images are used to train the CNN; color is for humans.

This utility will recognize the filename format and produce the labels file. Multiple labels will be applied to each file.

The Audiophile’s Analyzer is used to build and train a CNN using the labeled images.

The output layer of the neural network will have 96 outputs, labeled from 0C to 7B. During training the required outputs will be set for each note in the file label.
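The 96 output labels and the per-file target can be sketched as follows. The label order (octave first, then the 12 note names, with sharps) follows the 0C to 7B naming above. The use of 1.0 and 0.0 as target values and the helper name `target_vector` are assumptions.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# 8 octaves x 12 notes = 96 labels, from "0C" to "7B".
LABELS = [f"{octave}{name}" for octave in range(8) for name in NOTE_NAMES]

def target_vector(notes):
    """Return a 96-element target with 1.0 at each note named in the file label."""
    index = {label: i for i, label in enumerate(LABELS)}
    target = [0.0] * len(LABELS)
    for note in notes:
        target[index[note]] = 1.0
    return target

# An A major chord rooted on A, octave 2: three outputs are set at once.
target = target_vector(["2A", "3C#", "3E"])
```

Because several outputs are set for one image, this is a multi-label target rather than a one-hot target.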

Training results are recorded:

Using the Audiophile’s Analyzer to transcribe a .wav file:

It can take a day or so to train a CNN model, or several minutes to create a profile, assuming that you have enough suitable samples. Fortunately, it is possible to transcribe without prior training for a particular instrument. To illustrate this, I use a single note played on a double bass (A, octave 2).

To transcribe I set the options to No Profile and Constant Q Mapping.

Selecting Use FFT Profile Result only

The result is

Selecting Use Spectrogram Analysis Result only I get

Here I use a sequence of notes played on a clarinet.

The first note is A (octave 4) followed by A#; the final note is E (octave 4).

Using the built-in profile, the correlation algorithm gives:

Pretty good.

To improve this, I used the built-in CNN model in combination with the algorithm’s results.

Here I perform the same task, i.e. single note (A2), this time from a cello. This is a little more difficult as the cello is a polyphonic instrument with very strong harmonics.

From the signal and FFT result we can see that the first overtone has more energy than the fundamental frequency.

This problem is attenuated in the Multi-rate FFT spectrogram as the overtone is sampled over less time in the shorter sampled higher frequency FFT.

We use the built-in profile for the cello.

Here we can see that as the note fades on the last beat the overtone is also transcribed.

Since we know that this was a single note, the monophonic check box should be checked.
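How the monophonic option is implemented is not described here. One simple way to get the same effect, shown purely as an assumption, is to keep only the strongest candidate note in each time step. The note strengths below are made-up values.

```python
def keep_strongest(frames):
    """For each time step, keep only the candidate note with the highest strength.

    `frames` is a list of dicts mapping a note label to its strength;
    an empty dict (silence) yields None.
    """
    return [max(frame, key=frame.get) if frame else None for frame in frames]

# Hypothetical candidates per beat: the overtone (3A) appears as the note fades.
frames = [
    {"2A": 0.9},
    {"2A": 0.8},
    {"2A": 0.6},
    {"2A": 0.3, "3A": 0.25},
    {},
]
melody = keep_strongest(frames)
```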

Now the result is as expected.

Using the built-in Convolutional Neural Network for the cello, we see leading and trailing silence; only the red line from the spectrogram has been transcribed.

The threshold for silence is calculated differently for the algorithm and the CNN. The algorithm sets the threshold on the fly during transcription based on a running average of the energy in the audio. For the CNN the threshold for silence is implemented when the spectrogram slices are prepared for training; the CNN was never trained on the parts of the note which were too quiet.
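The on-the-fly threshold used by the algorithm can be pictured with a small sketch. The exact calculation is not given here, so this assumes the simplest form: a frame counts as silent when its energy is at or below a fixed fraction of the running mean of all energies seen so far. The `fraction` value of 0.25 is hypothetical.

```python
def silent_frames(energies, fraction=0.25):
    """Flag frames whose energy is at or below a fraction of the running mean."""
    flags = []
    total = 0.0
    for count, energy in enumerate(energies, start=1):
        total += energy
        running_mean = total / count          # mean of every frame seen so far
        flags.append(energy <= fraction * running_mean)
    return flags

# Leading silence, three loud frames, then a quiet tail.
flags = silent_frames([0.0, 8.0, 8.0, 8.0, 0.5])
```

Because the threshold follows the audio, a quiet recording and a loud recording are treated alike. The CNN has no such adjustment at transcription time, since its threshold was applied when the training slices were prepared.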

The simplest music file to transcribe is a single note. Here I use a bassoon playing a single note A (octave 2) to walk through the simplest transcription.

From the signal and FFT result we can see that this is indeed a single note with a single dominant frequency.

The spectrogram confirms the simplicity of this example.

To transcribe this note, we will use the built-in bassoon profile and the default options (i.e. correlation).

The result is as expected.

Alternatively, we could have elected to use the built-in Convolutional Neural Network for the bassoon.

The result is a little different. The note ends just after the fourth beat. The CNN transcribes this as a 3-beat note, not 4 beats.

A spectrogram tab has been added to the Audiophile’s Analyzer to give a visual representation of the whole audio file spectrum. https://en.wikipedia.org/wiki/Spectrogram

Spectrogram – 3 octaves x 37 seconds

The spectrogram has time on the horizontal axis, and frequency on the vertical axis. Two options are available for the frequency axis, raw FFT results, and mapped notes (a simple approximation to a constant Q transform). Here the mapped notes are shown.
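The mapped-notes option can be approximated by adding each FFT bin to the nearest semitone. The sketch below assumes equal temperament with A4 = 440 Hz and MIDI-style note numbers, and it assumes that the 96 notes run from MIDI note 12 to 107. None of these details are confirmed for the Analyzer.

```python
import math

def note_index(freq_hz, a4_hz=440.0):
    """Nearest semitone as a MIDI-style note number (A4 = 69)."""
    return round(69 + 12 * math.log2(freq_hz / a4_hz))

def map_to_notes(magnitudes, sample_rate, fft_size, low=12, high=107):
    """Accumulate FFT bin magnitudes into one value per semitone."""
    notes = [0.0] * (high - low + 1)
    for k, mag in enumerate(magnitudes):
        if k == 0:
            continue                          # skip the DC bin, which has no pitch
        n = note_index(k * sample_rate / fft_size)
        if low <= n <= high:
            notes[n - low] += mag
    return notes

# Toy spectrum with 10 Hz bins: energy at 110 Hz (A2) and 440 Hz (A4).
magnitudes = [0.0] * 401
magnitudes[11] = 1.0
magnitudes[44] = 2.0
mapped = map_to_notes(magnitudes, sample_rate=8000, fft_size=800)
```

FFT bins are evenly spaced in hertz while notes are evenly spaced in log frequency, so many bins fall into each high note and few into each low note. That is why this is only an approximation to a constant Q transform.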

This functionality is a precursor to applying a convolutional neural network (CNN) to the transcription process. This technique is currently available commercially (lunaverus.com) for piano only. The results are on a par with the algorithmic methods so far implemented. Working in two dimensions has the advantage that attack and decay patterns can be trained for. By allowing the user to use algorithmic transcription, or AI, or both in concert, hopefully better accuracy can be achieved. By training models for various instruments and providing the user with the means to train their own models, this can be applied more broadly.

Version 1.2, now released, supports both MIDI file output from a score transcribed from audio, and MIDI file input for analysis and transcription.

To illustrate the MIDI functionality simply I will use a single note played on a bassoon.

Using the metronome this transcribes as:

Saving to MIDI we get these events:

Opening the saved MIDI file we see:

Scoring produces:

Note the missing rest at the end. The MIDI standard has no way of specifically defining a rest. The Audiophile’s Analyzer will infer rests that occur between notes, but this final rest cannot be inferred.
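The reason the final rest is lost can be seen in a small sketch of rest inference. A MIDI track only records note-on and note-off events, so a rest can only be deduced from a gap before the next note. The event representation and the function name below are assumptions, not the Analyzer's code.

```python
def infer_rests(notes):
    """Return the gaps between notes as (start_tick, duration_ticks) rests.

    `notes` is a list of (start_tick, duration_ticks) pairs. A gap is only
    visible when a later note follows it.
    """
    rests = []
    cursor = 0                                 # end of the latest note seen so far
    for start, duration in sorted(notes):
        if start > cursor:
            rests.append((cursor, start - cursor))
        cursor = max(cursor, start + duration)
    return rests

TICKS = 480                                    # assumed ticks per beat

# Two notes with a one-beat gap: the rest between them is recovered.
between = infer_rests([(0, TICKS), (2 * TICKS, TICKS)])

# A single 3-beat note in a 4-beat bar: nothing follows, so no rest is found.
trailing = infer_rests([(0, 3 * TICKS)])
```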

Now that the Audiophile’s Analyzer is available in the Microsoft App Store, I need to profile as many instruments as possible to increase the accuracy and utility of the application.

While the profiling tool is provided with the Audiophile’s Analyzer, allowing users to create their own profiles, it makes more sense for me to create them and incorporate them in future releases.

The tuba cannot be transcribed without a profile as the overtones dominate its spectrum.

The profile created from this and other samples:

Without applying the profile, the highest point in the spectrum is selected (we already know that the tuba is monophonic). This is C#, which is incorrect.

Applying the profile using correlation we correctly identify the low A.
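The difference between picking the peak and correlating with a profile can be shown with a toy example. Here a profile is modeled as a set of relative strengths at semitone offsets above the fundamental (12 = second harmonic, 19 = third, 24 = fourth, 28 = fifth). The numbers are invented for illustration and are not the measured tuba profile, and the plain dot product stands in for whatever correlation the Analyzer actually computes.

```python
def best_fundamental(spectrum, profile):
    """Return the note index whose shifted profile best matches the spectrum."""
    best_note, best_score = None, float("-inf")
    for note in range(len(spectrum)):
        score = sum(weight * spectrum[note + offset]
                    for offset, weight in profile.items()
                    if note + offset < len(spectrum))
        if score > best_score:
            best_note, best_score = note, score
    return best_note

# Hypothetical profile: the fifth harmonic is stronger than the fundamental.
profile = {0: 0.3, 12: 0.6, 19: 0.5, 24: 0.4, 28: 1.0}

# Note-mapped spectrum of a low A (index 9, counting from C = 0).
spectrum = [0.0] * 48
for offset, weight in profile.items():
    spectrum[9 + offset] = weight

peak = max(range(len(spectrum)), key=lambda i: spectrum[i])   # strongest single note
fundamental = best_fundamental(spectrum, profile)             # best profile match
```

In this toy case the peak lands on index 37, which is a C#, while the profile match lands on index 9, the A. The fifth harmonic of A sits about 28 semitones above it, on C#, which is consistent with the wrong note reported above.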

The Imaging Whiteboard is done. I’ve added every feature I can think of, fixed every bug I know about. 3.0.5 is the final version.

So, what is the next project?

Obviously, I asked ChatGPT.

Me:

“My name is Steve Weavers; my personal web site is http://sound-analysis.com/ . I am responsible for The Musicians Workbench and The Imaging Whiteboard. I have just completed the Imaging Whiteboard. I am looking for ideas for my next project. What do you think I should tackle?”

Here is what it said:

“Hi Steve! It’s great to hear that you’ve completed the Imaging Whiteboard. As for your next project, there are many directions you could go in. Here are a few ideas to consider:

Music notation software: If you enjoy working on music-related projects, creating a software tool that makes it easier for musicians to notate their compositions could be a valuable resource for the community.

Audio editing software: Building an audio editing software could be a complex but rewarding project. You could focus on a specific niche, such as podcast editing, or create a comprehensive tool that covers all aspects of audio editing.

Virtual instrument plugin: If you have experience with music production software, developing a virtual instrument plugin could be a fun challenge. This could involve creating realistic-sounding emulations of real instruments, or coming up with entirely new and unique sounds.

Music recommendation engine: With so much music available online, it can be overwhelming for listeners to discover new artists and albums. Creating a recommendation engine that uses machine learning algorithms to suggest music based on a user’s listening history could be a valuable tool for music lovers.

Audio analysis tool: Given your background in sound analysis, creating a tool that helps users analyze audio files could be a natural fit. This could involve developing algorithms to identify key features of the audio, such as tempo, key, and tonality.

I hope these ideas give you some inspiration for your next project!”

Me:

“Those were excellent suggestions. Do you have any ideas for imaging projects?”

“Certainly, here are some ideas for imaging projects you could consider:

Image recognition software: Developing software that uses machine learning algorithms to recognize and identify objects within images could be a valuable tool for a range of industries, from healthcare to security.

Image processing software: Creating software that enhances or alters images could be a fun and creative project. You could focus on specific effects, such as color correction or noise reduction, or create a comprehensive tool that covers a range of editing capabilities.

Virtual reality experiences: With the increasing popularity of virtual reality technology, creating immersive visual experiences could be a rewarding project. This could involve creating 360-degree images or videos, or developing interactive VR games.

Image compression algorithms: Developing algorithms to compress image files without compromising their quality could be a valuable resource for a range of industries, from web development to photography.

Augmented reality applications: Building applications that overlay virtual objects or information onto real-world images could be a fun and innovative project. This could involve creating educational tools or marketing applications for businesses.

I hope these ideas provide some inspiration for your next imaging project!”

That’s brilliant! The Audio analysis tool might be a good challenge. What do you think?

Using the MNIST database of handwritten digits (see “MNIST database” on Wikipedia), a convolutional neural network was trained to an accuracy of 90%. This took 50 epochs.

The trained model was loaded into the Image Classifier control and used to identify handwritten digits.

The files required to reproduce this demo are available here https://drive.google.com/file/d/1XKSYvJfAW1maNsaiV0iaWZXor0Tbtuat/view?usp=share_link