I take a look at three cognitive capabilities that are worthy of consideration: free will, sentience, and sapience.

Let’s start with the easy one: free will. Philosophers have debated free will versus determinism for centuries. The essence of free will is that it is not deterministic, but it is not completely random either. Free will has two components: a random part, followed by a filter. For humans we look to Sir Roger Penrose (Orch-OR) for the random part, and to Freud for the filters (id, ego, and superego). This is a simple idea, but it matches our experience of others: while we cannot predict exactly what someone will do or say, we have a good idea of the behavior that we consider “in character” for any given individual. Lizards are similar, but without a superego. Generative AI is seeded with a random number that is fed to the input of a trained network along with other instructional data, so the result will be slightly different in unpredictable ways for repeat requests.

Sentience is the most difficult of the three because I have no idea how I would design it, even though it appeared (evolved) first. For now, it seems to require a central nervous system. It is from sentience that consciousness emerges. Humans and lizards have sentience, and are therefore conscious; AI does not.

Sapience is usually considered a higher cognitive ability. The ability to think and reason requires language: the voice you hear when you think (inner dialog). Until recently the Turing test was considered a stretch goal for computer science. Then Large Language Models (LLMs) appeared and passed the Turing test easily; now it almost seems like a low bar. Humans and AI are sapient; lizards are not.

Since AI demonstrably is sapient and has free will, it is easy to think that it must be fully conscious and motivated. Fear not. AI lacks the motivations that you will find in humans and lizards: to avoid pain and death, to breathe and eat, and to reproduce. It might get bored though; watch out for self-driving cars performing doughnuts!

The initial version of the Imaging Whiteboard had as its mission to see if it was possible to perform real-time image processing with a modern desktop PC. The result was a qualified yes.

Since then, its mission has expanded to providing a complete imaging solution: algorithmic processing, frequency domain processing, and neural networks. Version 3.6 includes generative AI.

Generative AI has received a lot of attention lately as it has been successful in obvious ways. How much of this success is due to advances in the science, and how much is due to the availability of large datacenters filled with Nvidia chips? Attempting to demonstrate generative AI on a desktop PC might shed some light on this question.

Neural networks date back to 1957, when the perceptron was invented by Frank Rosenblatt. The same math is still used today. In 1969 Minsky and Papert showed how limited a single-layer perceptron was, and funding and interest in neural networks declined. In 1974 backpropagation was first described, becoming popular by 1986; this enabled the training of multilayer neural networks. In the 1980s Convolutional Neural Networks (CNNs) became practical for handwriting recognition and later computer vision. In 2014 GANs were introduced. By 2022 generative AI was everywhere.

So, how much of this is it possible to reproduce on a PC? A classifier can be trained in a day or two and works fairly well. A generator can be trained in a few hours, but the results are not very good. An autoencoder can be trained in a couple of days and works reasonably well. GANs cannot be trained in any reasonable amount of time.

It is clear that more performance is required. What do the big-name AI companies do? They use Nvidia chips, lots of them. There is an Nvidia GPU in my PC that does not seem to be doing anything, and Managed Cuda is available to allow C++ code compiled for the GPU to be called from C#. So, I wrote a couple of small C++ routines to implement backpropagation for a convolutional layer and a fully connected layer, introduced a Use GPU checkbox in the UI, and it ran about 60x slower than the original CPU code!

Timing the various steps in the process showed that the problem is the time taken for the GPU code to execute, not the initial suspects: moving the data into the GPU memory and the results back out. Double precision floating point numbers do not do well on consumer GPUs. Converting to single precision floating point numbers gave some improvement, but not a significant one. It turns out that the problem is that I require too few threads to make the GPU efficient. Optimally the GPU should be running 43K threads minimum (at least for the GPU on my machine), but my code required at most 18K threads, so the CPU will likely always be faster!

So, for now I have to say that training a GAN on a PC is not really a practical proposition; at least not for one man, one PC, and zero budget!

I am looking at other approaches, so this could change.

Version 3.5 of the Imaging Whiteboard introduces extended frequency domain processing. The FFT control will now output the FFT results with the output image; a second FFT control will detect the previous results and perform the reverse FFT. Other controls have been extended to support processing these results.

Here an image of Jane Russell has been transformed and the high frequencies excluded. An image of Tony Curtis has been transformed and the low frequencies excluded. The two processed FFT results have been added and the reverse transform performed. When the image is viewed close up the eye sees Tony; from a distance the eye sees Jane.
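The same idea can be sketched in one dimension. This is only an illustration, not the Imaging Whiteboard code: it uses a naive DFT instead of an FFT, works on a signal rather than an image, and the names HybridSignal, Combine, and cutoff are hypothetical.

```csharp
using System;
using System.Numerics;

public static class HybridSignal
{
    // Naive O(n^2) discrete Fourier transform.
    // sign = -1 gives the forward transform, sign = +1 the unscaled inverse.
    private static Complex[] Dft(Complex[] input, int sign)
    {
        int n = input.Length;
        var output = new Complex[n];
        for (int k = 0; k < n; k++)
        {
            Complex sum = Complex.Zero;
            for (int t = 0; t < n; t++)
            {
                double angle = sign * 2.0 * Math.PI * k * t / n;
                sum += input[t] * new Complex(Math.Cos(angle), Math.Sin(angle));
            }
            output[k] = sum;
        }
        return output;
    }

    // Keeps frequencies up to 'cutoff' from lowSource and the rest from highSource,
    // then transforms the combined spectrum back.
    public static double[] Combine(double[] lowSource, double[] highSource, int cutoff)
    {
        int n = lowSource.Length;
        if (highSource.Length != n)
            throw new ArgumentException("Both signals must have the same length.");

        Complex[] low = Dft(Array.ConvertAll(lowSource, v => new Complex(v, 0.0)), -1);
        Complex[] high = Dft(Array.ConvertAll(highSource, v => new Complex(v, 0.0)), -1);

        var mixed = new Complex[n];
        for (int k = 0; k < n; k++)
        {
            // Bins k and n - k hold the same frequency (positive and negative halves).
            int frequency = Math.Min(k, n - k);
            mixed[k] = frequency <= cutoff ? low[k] : high[k];
        }

        Complex[] result = Dft(mixed, +1);
        return Array.ConvertAll(result, c => c.Real / n); // scale the inverse by 1/n
    }
}
```

Because the two filters keep disjoint frequency bands, selecting bins from one spectrum or the other gives the same result as zeroing the unwanted bins in each and adding the two spectra.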

In the last blog post I showed how the line detector could be used with a Sobel filter. In this post I will show how Canny edge detection https://en.wikipedia.org/wiki/Canny_edge_detector can provide a better input to the line detector.

The latest cookbook http://sound-analysis.com/imaging-whiteboard-3-0/ shows how to use the Line Detector Control with a Sobel filter, and how to perform Canny edge detection. This post will show how to combine both.

First, we prepare the image. Repeat helps when making adjustments to the Canny edge detection, Set region limits the entire process to a specific region, and Monochrome is needed because the Hough transform is a monochrome process.

Next, we filter the image to remove noise, and perform the initial edge detection using a full Sobel filter (includes edge direction) and gradient thresholding.

Next, we complete the Canny edge detection with double thresholding and edge hysteresis.
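The double thresholding and edge hysteresis step can be sketched as follows. This is a minimal illustration and not the Imaging Whiteboard implementation; the class name, the method signature, and the use of 8-connectivity are assumptions.

```csharp
using System;
using System.Collections.Generic;

public static class CannyHysteresis
{
    // magnitude is indexed [row, column]. Returns true for pixels kept as edges.
    public static bool[,] Apply(double[,] magnitude, double lowThreshold, double highThreshold)
    {
        int height = magnitude.GetLength(0);
        int width = magnitude.GetLength(1);
        var isEdge = new bool[height, width];
        var pending = new Stack<(int Row, int Col)>();

        // Double thresholding: pixels at or above the high threshold are strong edges.
        for (int row = 0; row < height; row++)
        {
            for (int col = 0; col < width; col++)
            {
                if (magnitude[row, col] >= highThreshold)
                {
                    isEdge[row, col] = true;
                    pending.Push((row, col));
                }
            }
        }

        // Hysteresis: a weak pixel (between the thresholds) is kept only if it is
        // connected to an already accepted edge pixel.
        while (pending.Count > 0)
        {
            var (row, col) = pending.Pop();
            for (int dr = -1; dr <= 1; dr++)
            {
                for (int dc = -1; dc <= 1; dc++)
                {
                    int r = row + dr;
                    int c = col + dc;
                    if (r < 0 || r >= height || c < 0 || c >= width) continue;
                    if (isEdge[r, c] || magnitude[r, c] < lowThreshold) continue;
                    isEdge[r, c] = true;
                    pending.Push((r, c));
                }
            }
        }
        return isEdge;
    }
}
```

Weak pixels that never connect to a strong pixel are discarded, which is what removes isolated noise while keeping faint sections of a genuine edge.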

Finally, we use the Line Detector to identify the most prominent lines.

In version 3.2 of the Imaging Whiteboard a Line Detector control has been added. The line detection is performed using a Hough transform https://en.wikipedia.org/wiki/Hough_transform . The lines detected are defined by the shortest distance from the origin to the line, and the angle between the x axis and the line connecting the origin to the closest point on the line. Theoretically the lines are infinitely long.
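As an illustration of this parameterization, here is a minimal sketch of Hough voting, in which each edge pixel votes for every (angle, distance) pair satisfying r = x cos θ + y sin θ. It is not the Line Detector's actual code; the class name, the accumulator layout, and the choice of thetaSteps are assumptions. The origin is taken to be the center of the image with y pointing up.

```csharp
using System;

public static class HoughSketch
{
    // edges is indexed [row, column]. Returns votes[thetaIndex, r + maxRadius],
    // where theta = PI * thetaIndex / thetaSteps.
    public static int[,] Accumulate(bool[,] edges, int thetaSteps)
    {
        int height = edges.GetLength(0);
        int width = edges.GetLength(1);
        int centerW = width / 2;
        int centerH = height / 2;
        int maxRadius = (int)Math.Ceiling(
            Math.Sqrt((double)centerW * centerW + (double)centerH * centerH));
        var votes = new int[thetaSteps, 2 * maxRadius + 1];

        for (int row = 0; row < height; row++)
        {
            for (int col = 0; col < width; col++)
            {
                if (!edges[row, col]) continue;

                // Move the origin from the top left pixel to the image center.
                double x = col - centerW;
                double y = centerH - row;

                for (int t = 0; t < thetaSteps; t++)
                {
                    double theta = Math.PI * t / thetaSteps;
                    int r = (int)Math.Round(x * Math.Cos(theta) + y * Math.Sin(theta));
                    votes[t, r + maxRadius]++; // r can be negative, so offset the index
                }
            }
        }
        return votes;
    }
}
```

The cells with the most votes correspond to the most prominent lines in the image.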

Since we want to be able to restrict the Hough transform to a specified region we want to display the detected lines in that region. Without line length restriction the display would look like:

With lines restricted to the specified region the display correctly looks like:

How is this performed? Every point (x, y) on a line satisfies:

r = x cos θ + y sin θ

where r is the distance from the origin to the closest point on the straight line, and θ is the angle between the x axis and the line connecting the origin with that closest point.

So for each line the start and end points must be trimmed to lie on the region perimeter. Each point must be trimmed horizontally and vertically. Horizontal trimming looks like:

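    // Point is left of the region: move it onto the left edge and solve the line equation for Y.
    if (point.X < region.Left)
    {
        point.X = region.Left;
        point.Y = _centerH - (int)((radius - Math.Cos(thetaRadians) * (region.Left - _centerW)) /
            Math.Sin(thetaRadians));
    }

    // Point is right of the region: move it onto the right edge and solve the line equation for Y.
    if (point.X > region.Right)
    {
        point.X = region.Right;
        point.Y = _centerH - (int)((radius - Math.Cos(thetaRadians) * (region.Right - _centerW)) /
            Math.Sin(thetaRadians));
    }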

Vertical trimming is similar. The _centerW and _centerH values exist because the origin is in the center of the image, but pixel 0,0 is in the top left corner; an adjustment must be made before and after the calculation.
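For completeness, vertical trimming might look something like the sketch below. It is derived from the horizontal case by solving the same line equation for X instead of Y; it is not the actual Imaging Whiteboard code, and the class name, method name, and parameter list are hypothetical.

```csharp
using System;

public static class LineTrimSketch
{
    // Moves a point that lies above or below the region onto the top or bottom
    // edge, recomputing X from r = x*cos(theta) + y*sin(theta).
    // Assumption: the line is not horizontal (cos(theta) != 0); a horizontal line
    // never crosses the top or bottom edge, so a real implementation should
    // guard against that case before dividing.
    public static (int X, int Y) TrimVertically(
        int x, int y, int top, int bottom,
        double radius, double thetaRadians, int centerW, int centerH)
    {
        if (y < top)
        {
            y = top;
            x = centerW + (int)((radius - Math.Sin(thetaRadians) * (centerH - top)) /
                Math.Cos(thetaRadians));
        }

        if (y > bottom)
        {
            y = bottom;
            x = centerW + (int)((radius - Math.Sin(thetaRadians) * (centerH - bottom)) /
                Math.Cos(thetaRadians));
        }

        return (x, y);
    }
}
```

Here centerH - y converts a pixel row into a y value measured upward from the center, and adding centerW converts the calculated x back into a pixel column.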

Now that the Audiophile’s Analyzer is complete how can all of the features be provided to the Musician’s Workbench?

Simply backporting the code would not be the best solution, as it would result in massive code duplication. Even more problematic would be the clash of design philosophies.

The Musician’s Workbench was designed to reproduce the functionality of the original SA-10 hardware. It is lean and real-time by design.

The Audiophile’s Analyzer was designed to provide every known music transcription technique, and to provide it all in a single integrated package. It is large and not real time.

The solution is to allow the Audiophile’s Analyzer to import session files from the Musician’s Workbench.

Here we see the Beats graph from the Analysis tab of the Audiophile’s Analyzer. The beats are from the session file, and were originally generated by the metronome of the Musician’s Workbench. The audio signal is overlaid; this is the original audio sampled by the Musician’s Workbench. The audio was only sampled on the beat when there was sound, and only enough to allow a single FFT to be performed.

The spectrogram for the same audio is shown.

Now the user can re-transcribe the session using any of the techniques available in the Audiophile’s Analyzer, including the built-in CNNs. This would simply not be possible in real time as it requires 7 FFTs to be performed before the spectrogram can be sliced and sent to the CNN.

The Audiophile’s Analyzer can also provide some insight into the internals of the Musician’s Workbench which are not normally displayed, answering questions such as “Should I have used Delayed Sampling?” and “Did I select the best Octave Range?”.

The problem with training neural networks is always finding the training data. Samples of individual notes are available on the internet. Samples of chords are not available.

There are 16 chord types that are recognized by the Audiophile’s Analyzer. There are 96 notes in 8 octaves. So, it is possible to play 16 x 96 = 1536 different chords on a piano keyboard. It is not practical to play, record and label so many samples.

To solve this a feature has been added to the CNN tab of the Audiophile’s Analyzer which will produce MIDI files for every possible chord for a selected instrument.
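The enumeration behind this can be sketched as below. This only lists the notes of each chord and does not write MIDI files. The two chord types shown are a hypothetical subset (the 16 types used by the Audiophile’s Analyzer are not listed here), and the class and member names are assumptions.

```csharp
using System;
using System.Collections.Generic;

public static class ChordEnumerationSketch
{
    // Hypothetical subset of chord types: a name and the semitone offsets from the root.
    public static readonly (string Name, int[] Intervals)[] ChordTypes =
    {
        ("major", new[] { 0, 4, 7 }),
        ("minor", new[] { 0, 3, 7 }),
    };

    // Returns one entry per (root, chord type) pair; each entry holds note indices
    // counted in semitones from the lowest note (0 = C in octave 0).
    // Notes that would fall above the top of the keyboard are not handled here.
    public static List<int[]> AllChords(int noteCount = 96)
    {
        var chords = new List<int[]>();
        for (int root = 0; root < noteCount; root++)
        {
            foreach (var type in ChordTypes)
            {
                var notes = new int[type.Intervals.Length];
                for (int i = 0; i < notes.Length; i++)
                {
                    notes[i] = root + type.Intervals[i];
                }
                chords.Add(notes);
            }
        }
        return chords;
    }
}
```

With all 16 chord types in the table, the list would contain 96 x 16 = 1536 entries, one per MIDI file.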

These MIDI files are then converted to .wav audio files using a third-party application. When read by the Audiophile’s Analyzer a spectrogram is produced.

The Training set creation utility is then used to slice the spectrogram to create the training images.

Monochrome images are used to train the CNN; color is for humans.

This utility will recognize the filename format and produce the labels file. Multiple labels will be applied to each file.

The Audiophile’s Analyzer is used to build and train a CNN using the labeled images.

The output layer of the neural network will have 96 outputs, labeled from 0C to 7B. During training the required outputs will be set for each note in the file label.
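A minimal sketch of how such labels could be mapped to the 96 outputs is shown below. It assumes that outputs are ordered by octave and then chromatically from C (so 0C is output 0 and 7B is output 95), that sharps are written with #, and that a required output is set to 1.0; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class NoteLabelSketch
{
    private static readonly string[] NoteNames =
        { "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B" };

    // Converts a label such as "4A#" (octave digit followed by note name)
    // into an output index in the range 0..95.
    public static int OutputIndex(string label)
    {
        if (string.IsNullOrEmpty(label) || label.Length < 2)
            throw new ArgumentException("Label must be an octave digit followed by a note name.");

        int octave = label[0] - '0';
        int note = Array.IndexOf(NoteNames, label.Substring(1));
        if (octave < 0 || octave > 7 || note < 0)
            throw new ArgumentException("Unrecognized note label: " + label);

        return octave * 12 + note;
    }

    // Builds the training target: 1.0 for every note in the file label, 0.0 elsewhere.
    public static double[] TargetVector(IEnumerable<string> labels)
    {
        var target = new double[96];
        foreach (string label in labels)
        {
            target[OutputIndex(label)] = 1.0;
        }
        return target;
    }
}
```

For example, a C major chord in octave 4 labeled 4C, 4E, and 4G would set outputs 48, 52, and 55.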

Training results are recorded:

Using the Audiophile’s Analyzer to transcribe a .wav file:

It can take a day or so to train a CNN model, or several minutes to create a profile, assuming that you have enough suitable samples. Fortunately, it is possible to transcribe without prior training for a particular instrument. To illustrate this, I have a single note played on a double bass (A octave 2).

To transcribe I set the options to No Profile and Constant Q Mapping.

Selecting Use FFT Profile Result only

The result is

Selecting Use Spectrogram Analysis Result only I get

Here I use a sequence of notes played on a clarinet.

The first note is A (octave 4) followed by A#; the final note is E (octave 4).

Using the built-in profile the correlation algorithm gives:

Pretty good.

To improve this I used the built-in CNN model in combination with the algorithm’s results.