A while back, a client asked my company if we could help them add a feature to a web application they were using. They needed to make audio recordings online. Basically, the application shows off users' portfolios online and offers the ability to add comments about their work.
So far, all comments and feedback had been made using text. But now they wanted to move to a somewhat more interactive solution, making it possible to record audio comments and thoughts about a user's portfolio.
And obviously, being a Microsoft-focused company, we came to the conclusion that this would be an easy thing to do in Silverlight. Especially since we all know that Silverlight, in later releases, gives us access to the user's microphone and webcam. So this would be a piece of cake…or would it…?
Well, yes…Silverlight offers us access to the user's mic and cam, as long as we ask for permission before we start using them for the first time.
But before we start asking for access to the devices, we might want to have a look around to verify that there are actually devices available. This is very easy to do. All we need to do is look at one of the static methods available on the CaptureDeviceConfiguration class. There are two methods that are of particular interest when working with audio: GetDefaultAudioCaptureDevice() and GetAvailableAudioCaptureDevices(). The first one will give you the default one (doh!) and the second one will give back a ReadOnlyCollection<AudioCaptureDevice>.
AudioCaptureDevice device = CaptureDeviceConfiguration.GetDefaultAudioCaptureDevice();
if (device == null)
    throw new Exception("Cannot go on without a mic...");
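If you want to let the user pick a specific mic instead of just grabbing the default, the second method gives you the full list. A quick sketch:
// List all available mics, for example to populate a ComboBox
ReadOnlyCollection<AudioCaptureDevice> mics =
    CaptureDeviceConfiguration.GetAvailableAudioCaptureDevices();
foreach (AudioCaptureDevice mic in mics)
{
    // FriendlyName is the human-readable device name
    System.Diagnostics.Debug.WriteLine(mic.FriendlyName);
}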
Once we have checked that there is in fact at least one audio capture device available, we need to ask for permission to use it. That part is really simple; just use a piece of code similar to this
private bool EnsureDeviceAccess()
{
    if (CaptureDeviceConfiguration.AllowedDeviceAccess)
        return true;
    return CaptureDeviceConfiguration.RequestDeviceAccess();
}
It first verifies whether we have already obtained access. If that is the case, we are good to go. If, on the other hand, we do not have access, we need to ask permission to use the devices on the computer.
If our request for access returns false, we might as well give up…or at least try again and this time inform the user that we REALLY need access to the devices… Without a successful call to RequestDeviceAccess(), we will not be able to continue… One thing worth knowing is that RequestDeviceAccess() has to be called from user-initiated code, such as a button click handler, or it will return false straight away.
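In practice, that means wiring the request up to something like a button click. A sketch, where RecordButton and StartRecording() are hypothetical names rather than anything from the actual project:
private void RecordButton_Click(object sender, RoutedEventArgs e)
{
    if (!EnsureDeviceAccess())
    {
        // The user said no - explain why we need the mic and let them retry
        MessageBox.Show("We really need access to your microphone to record feedback...");
        return;
    }
    StartRecording(); // hypothetical method that configures and starts the CaptureSource
}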
But let’s assume that the user agrees and lets us record audio through one of the mics attached to the computer…
The next step is to figure out what AudioFormat we want to use. The AudioFormat class represents a specific audio format that we want to use…hmm…pretty obvious… That format specification includes things like the number of channels to record, bits per sample and sample frequency. We can get all the available formats by looking at the SupportedFormats property on the AudioCaptureDevice. This will return a ReadOnlyCollection<AudioFormat>.
To get hold of a specific format, we can use a simple LINQ query
AudioFormat format = (from af in device.SupportedFormats
                      where af.Channels == 1 && af.SamplesPerSecond == 11025 && af.BitsPerSample == 16
                      select af).First();
In the above case, I select a format that uses only one channel and samples 11025 times per second, using 16 bits per sample.
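One thing to be aware of is that First() throws if no matching format exists, so you might want to use FirstOrDefault() and fall back to whatever the device offers:
AudioFormat format = device.SupportedFormats
    .FirstOrDefault(af => af.Channels == 1 && af.SamplesPerSecond == 11025 && af.BitsPerSample == 16)
    ?? device.SupportedFormats.First(); // fall back to the first supported format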
Next, that format is used by setting the DesiredFormat property to the returned value.
device.DesiredFormat = format;
device.AudioFrameSize = 100;
The code above also sets the AudioFrameSize. This is the number of milliseconds the device should “run” before telling anybody listening that there are new samples available. It can be set to anything between 10 and 2000 ms, and the default is 1000 ms…
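To get a feel for the numbers, you can calculate how much data each OnSamples() callback will hand you. A rough sketch, using the format and frame size from above:
// bytes per callback = samples/sec * bytes/sample * channels * frame length in seconds
int bytesPerFrame = format.SamplesPerSecond * (format.BitsPerSample / 8)
                    * format.Channels * device.AudioFrameSize / 1000;
// 11025 * 2 * 1 * 100 / 1000 = 2205 bytes per 100 ms frame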
Finally we create a new CaptureSource, which is responsible for actually capturing the information. It has two important properties, AudioCaptureDevice and VideoCaptureDevice. But in this case, only AudioCaptureDevice is really interesting. So let’s set that to the device we have been configuring…
CaptureSource captureSource = new CaptureSource();
captureSource.AudioCaptureDevice = device;
The last thing to do is to call the Start() method on the CaptureSource…and of course call Stop() when we are done recording…
But where do we get the actual sound? Well, that’s where it gets a little complicated… Calling Stop() will not just return a nice Stream containing a wav file or anything… The way we capture the audio is by using an AudioSink, or rather by creating a class that inherits from AudioSink.
The AudioSink class is an abstract class with four abstract methods: OnCaptureStarted(), OnCaptureStopped(), OnFormatChange() and OnSamples(). It also contains a single property of type CaptureSource.
You might already see where this is headed. We will create a class that gets notified when the CaptureSource starts and stops, whenever there are new samples, and also when the format changes if we care about that…
This is where the AudioFrameSize comes into play. The device will sample the data X times per second, and get Y bits of data for each sample. These samples will then be collected over AudioFrameSize milliseconds before being passed on to any AudioSink that is listening.
Ok…so all we need is a simple AudioSink that collects those samples and stores them and we are done! Well…sort of… The only issue is that the data we are getting is quite big.
I am neither a math geek nor a media person, but as far as I understand it, the currently selected audio format will end up using a bit over 21.5 kB per second. I come to that conclusion by looking at the format. It samples the data 11025 times per second and collects 16 bits every time. That is 176400 bits per second, or 22050 bytes per second…or about 21.5 kB/sec…
That very quickly becomes a large amount of data, and this is where an encoder comes into play. An encoder takes that big pile of data and compresses it using some smart algorithm. This will obviously cause a loss of quality, but that is ok…at least in this case… Size and quality will always work against each other, so it is a matter of finding a ratio that works.
In my case, I needed to record a single voice from someone giving verbal feedback. It doesn’t need to be perfect like for example a singer adding vocals to a song. So I am ok with compressing it quite a lot as a small size is much more important to me…
So how do we compress/encode audio in Silverlight? Well, we don’t…not by default at least. Silverlight has no built-in encoding support, at least not that I know of, but feel free to correct me if I am wrong. So we need to look externally. The problem is that anything we find needs to be managed code, written using the APIs and .NET features available in Silverlight.
The only managed code encoder I could find was CSpeex. CSpeex is a C# port of JSpeex, a Java implementation of Speex. Speex is an Open Source audio compression format designed for speech (I proudly stole that information from the project introduction on CodePlex). It is available for download at http://cspeex.codeplex.com/.
It works fine in Silverlight. The only issue is the fact that it is a Java port. That means it uses a lot of Java-based idioms. Instead of enums it uses classes with static members, and instead of properties it uses get and set methods, and so on. So it feels a little clunky to work with. I would have loved to see it ported to proper C# with all of the C# language features…but hey, right now I am just happy that it exists…
Having gotten my hands on this project, it is time to look at how to use it with an AudioSink.
First we need a class that inherits from AudioSink
public class StreamAudioSink : AudioSink
{
    protected override void OnCaptureStarted()
    {
    }

    protected override void OnCaptureStopped()
    {
    }

    protected override void OnFormatChange(AudioFormat audioFormat)
    {
    }

    protected override void OnSamples(long sampleTime, long sampleDuration, byte[] sampleData)
    {
    }
}
That’s it. A class that inherits from AudioSink and implements the abstract methods…or at least declares them…
The first step is to make sure that the AudioCaptureDevice has a DesiredFormat that works with CSpeex. CSpeex expects PCM format. So in OnCaptureStarted() we check this
AudioFormat audioFormat = CaptureSource.AudioCaptureDevice.DesiredFormat;
if (audioFormat.WaveFormat != WaveFormatType.Pcm)
    throw new Exception("Codec not supported");
Next, we initialize a SpeexEncoder, which is available in the CSpeex project. This encoder is initialized with some awesome integer values that say very little…another reason to use enums, guys!
speexEncoder = new org.xiph.speex.SpeexEncoder();
speexEncoder.init(2, 8, audioFormat.SamplesPerSecond, audioFormat.Channels);
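For what it is worth, my reading of the JSpeex source is that the signature is init(mode, quality, sampleRate, channels), so treat the comments below as my interpretation rather than official documentation:
// mode:    0 = narrowband, 1 = wideband, 2 = ultra-wideband
// quality: 0-10, where higher means better quality but a bigger output
speexEncoder.init(2, 8, audioFormat.SamplesPerSecond, audioFormat.Channels);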
The initialized encoder gives us access to some data that makes it possible to calculate the packet size. This packet size is then used to initialize a byte array that is used as a buffer for the read data… The 2 in the calculation is there because each 16-bit sample takes up two bytes.
int pcmPacketSize = 2 * speexEncoder.getChannels() * speexEncoder.getFrameSize();
temp = new byte[pcmPacketSize];
tempOffset = 0;
The Speex implementation uses a RandomOutputStream to write its encoded Speex data to. The RandomOutputStream basically wraps a MemoryStream that we can later use when decoding the Speex format back to wave format.
The RandomOutputStream is passed to an OggSpeexWriter class, which is responsible for writing Speex packets to the RandomOutputStream
memFile = new RandomOutputStream(new MemoryStream(2 * 1024 * pcmPacketSize));
writer = new org.xiph.speex.OggSpeexWriter(2, audioFormat.SamplesPerSecond, audioFormat.Channels, 1, true);
writer.open(memFile);
writer.writeHeader("Encoded with Speex");
When the capture stops, we make sure to flush the data in the writer to the stream
protected override void OnCaptureStopped()
{
    ((OggSpeexWriter)writer).flush(true);
}
On format change we do nothing, as I don’t expect the format to change in the middle of everything… But when samples arrive in the OnSamples() method, we need to work a little…
I won’t go into detail about what is happening, but you will probably be able to understand it from the code. It basically goes through the sampled data and pulls out pieces of the right size to be written as “packets”. It figures out the packet size by looking at the length of the byte array called temp that was created in the OnCaptureStarted() method.
Whenever a full packet has been received, it is processed by the SpeexEncoder and then written to the RandomOutputStream using the OggSpeexWriter.
protected override void OnSamples(long sampleTime, long sampleDuration, byte[] sampleData)
{
    for (int i = 0; i < sampleData.Length; )
    {
        // Copy as much of the incoming data as will fit in the packet buffer
        int len = Math.Min(sampleData.Length - i, temp.Length - tempOffset);
        Buffer.BlockCopy(sampleData, i, temp, tempOffset, len);
        if (len < temp.Length - tempOffset)
        {
            // The buffer is not full yet, so just keep collecting
            tempOffset += len;
        }
        else
        {
            // We have a full packet - encode it and write it to the stream
            tempOffset = 0;
            speexEncoder.processData(temp, 0, temp.Length);
            int encsize = speexEncoder.getProcessedData(temp, 0);
            if (encsize > 0 && (memFile.InnerStream.Position + encsize < ((MemoryStream)memFile.InnerStream).Capacity))
            {
                writer.writePacket(temp, 0, encsize);
            }
        }
        i += len;
    }
}
Unfortunately, as this blog post comes out of a project for a client, I will not be providing a full code download. But to make it a little easier for you, the code for the whole AudioSink looks like this
using System;
using System.Net;
using System.Windows.Media;
using org.xiph.speex;
using java.io;
using System.IO;
using cspeex;

namespace Curtin.AudioFeedback.AudioRecorder
{
    public class StreamAudioSink : AudioSink
    {
        private SpeexEncoder speexEncoder;
        private byte[] temp;
        private int tempOffset;
        private RandomOutputStream memFile;
        private AudioFileWriter writer;

        public RandomOutputStream MemFile { get { return memFile; } }

        protected override void OnCaptureStarted()
        {
            AudioFormat audioFormat = CaptureSource.AudioCaptureDevice.DesiredFormat;
            if (audioFormat.WaveFormat == WaveFormatType.Pcm)
            {
                speexEncoder = new org.xiph.speex.SpeexEncoder();
                speexEncoder.init(2, 8, audioFormat.SamplesPerSecond, audioFormat.Channels);
                int pcmPacketSize = 2 * speexEncoder.getChannels() * speexEncoder.getFrameSize();
                temp = new byte[pcmPacketSize];
                tempOffset = 0;
                if (writer != null)
                    writer.close();
                memFile = new RandomOutputStream(new MemoryStream(2 * 1024 * pcmPacketSize));
                writer = new org.xiph.speex.OggSpeexWriter(2, audioFormat.SamplesPerSecond, audioFormat.Channels, 1, true);
                writer.open(memFile);
                writer.writeHeader("Encoded with Speex");
            }
            else
            {
                throw new Exception("Codec not supported");
            }
        }

        protected override void OnCaptureStopped()
        {
            ((OggSpeexWriter)writer).flush(true);
        }

        protected override void OnFormatChange(AudioFormat audioFormat)
        {
        }

        protected override void OnSamples(long sampleTime, long sampleDuration, byte[] sampleData)
        {
            for (int i = 0; i < sampleData.Length; )
            {
                int len = Math.Min(sampleData.Length - i, temp.Length - tempOffset);
                Buffer.BlockCopy(sampleData, i, temp, tempOffset, len);
                if (len < temp.Length - tempOffset)
                {
                    tempOffset += len;
                }
                else
                {
                    tempOffset = 0;
                    speexEncoder.processData(temp, 0, temp.Length);
                    int encsize = speexEncoder.getProcessedData(temp, 0);
                    if (encsize > 0 && (memFile.InnerStream.Position + encsize < ((MemoryStream)memFile.InnerStream).Capacity))
                    {
                        writer.writePacket(temp, 0, encsize);
                    }
                }
                i += len;
            }
        }
    }
}
Now that we have an AudioSink, we can go back to the actual recording code and start using it. Right before you call the Start() method on the CaptureSource, you should create a new StreamAudioSink and set its CaptureSource property to the CaptureSource.
audioSink = new StreamAudioSink() { CaptureSource = _captureSource };
After this is done, we can call Start() and begin recording. The sink will continuously process the data as it passes through and encode it into Speex format. When we stop recording by calling the Stop() method on the CaptureSource, we need to decode the Speex format to wave format.
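Putting it all together, the recording flow ends up looking roughly like this. StartRecording() and StopRecording() are just names I made up for this sketch:
private CaptureSource _captureSource;
private StreamAudioSink _audioSink;

private void StartRecording()
{
    AudioCaptureDevice device = CaptureDeviceConfiguration.GetDefaultAudioCaptureDevice();
    device.DesiredFormat = device.SupportedFormats
        .First(af => af.Channels == 1 && af.SamplesPerSecond == 11025 && af.BitsPerSample == 16);
    device.AudioFrameSize = 100;
    _captureSource = new CaptureSource { AudioCaptureDevice = device };
    _audioSink = new StreamAudioSink { CaptureSource = _captureSource };
    _captureSource.Start();
}

private void StopRecording()
{
    // Triggers OnCaptureStopped(), which flushes the Ogg/Speex writer
    _captureSource.Stop();
}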
Decoding the Speex format to wave format is not hard. Just create a new JSpeexDec instance, tell it what destination format it should produce and whether it should be stereo, and then call the decode() method, passing in a RandomInputStream and a RandomOutputStream.
public void Save(Stream stream)
{
    JSpeexDec decoder = new JSpeexDec();
    decoder.setDestFormat(JSpeexDec.FILE_FORMAT_WAVE);
    decoder.setStereo(false);
    Stream memStream = _audioSink.MemFile.InnerStream;
    memStream.Position = 0;
    decoder.decode(new RandomInputStream(memStream), new RandomOutputStream(stream));
}
The stream that is passed into the Save() method in this example will contain a valid wave-formatted audio stream. This can then be written to disk, passed to a web service, or whatever else it might be used for.
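For example, in Silverlight you could hand the wave data straight to the user through a SaveFileDialog (which, like RequestDeviceAccess(), has to be triggered from user-initiated code):
SaveFileDialog dialog = new SaveFileDialog { Filter = "Wave files (*.wav)|*.wav" };
if (dialog.ShowDialog() == true)
{
    using (Stream fileStream = dialog.OpenFile())
    {
        Save(fileStream); // the Save() method shown above
    }
}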
I hope that this has given you a little insight into how to record audio in Silverlight in a “real” way. It seems as if most examples just show how to capture the audio and say very little about what to do with it. So hopefully this will get you a bit further.
And as usual, feel free to send me an e-mail or make a comment if there is something that is unclear or needs more explaining…