DarksideCookie

Come to the dark side...we have cookies!

Audio recording and encoding in Silverlight

A while back, a client asked my company if we could help them with a feature for a web application they were using. They needed to make audio recordings online. Basically, the application shows off users' portfolios online and offers the ability to add comments about their work.

So far, all comments and feedback have been made using text. But now they wanted to move to a somewhat more interactive solution, making it possible to record audio comments and thoughts about a user's portfolio.

And obviously, being a Microsoft-focused company, we came to the conclusion that this would be an easy thing to do in Silverlight. Especially since we all know that Silverlight in later releases gives us access to the user's microphone and webcam. So this would be a piece of cake…or would it…?

Well, yes…Silverlight offers us access to the user's mic and cam, as long as we ask for permission before we start using them for the first time.

But before we start asking for access to the devices, we might want to have a look around to verify that there are actually devices available. This is very easy to do. All we need to do is look at one of the static methods available on the CaptureDeviceConfiguration class. There are two methods that are of particular interest when working with audio: GetDefaultAudioCaptureDevice() and GetAvailableAudioCaptureDevices(). The first one will give you the default one (doh!) and the second one will give back a ReadOnlyCollection&lt;AudioCaptureDevice&gt;.

AudioCaptureDevice device = CaptureDeviceConfiguration.GetDefaultAudioCaptureDevice();
if (device == null)
    throw new Exception("Cannot go on without a mic...");
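If the default device isn't what you want, you can also enumerate all of the available devices instead. A minimal sketch, using the FriendlyName and IsDefaultDevice properties that Silverlight exposes on capture devices:

```csharp
// List every available audio capture device instead of just taking the default
ReadOnlyCollection<AudioCaptureDevice> devices =
    CaptureDeviceConfiguration.GetAvailableAudioCaptureDevices();

foreach (AudioCaptureDevice d in devices)
{
    // FriendlyName is the human-readable device name,
    // IsDefaultDevice tells us which one the OS prefers
    System.Diagnostics.Debug.WriteLine(
        string.Format("{0} (default: {1})", d.FriendlyName, d.IsDefaultDevice));
}
```

This is handy if you want to let the user pick a mic in a ComboBox rather than silently using the default.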

Once we have checked to see that there is in fact at least one audio capture device available, we need to ask for permission to use it. That part is really simple; just use a piece of code similar to this:

private bool EnsureDeviceAccess()
{
    if (CaptureDeviceConfiguration.AllowedDeviceAccess)
        return true;

    return CaptureDeviceConfiguration.RequestDeviceAccess();
}


It first verifies whether we have already obtained access. If that is the case, we are all good to go. If, on the other hand, we do not have access, we need to ask for permission to access the devices on the computer.

If our request to get access returns false, we might as well give up…or at least try again and this time inform the user that we REALLY need access to the devices… Without having successfully called RequestDeviceAccess(), we will not be able to continue…

But let’s assume that the user agrees and lets us record audio through one of the mics attached to the computer…

The next step is to figure out what AudioFormat we want to use. The AudioFormat class represents a specific audio format that we want to use…hmm…pretty obvious… That format specification includes things like the number of channels to record, bits per sample and sample frequency. We can get all available formats by looking at the SupportedFormats property on the AudioCaptureDevice. This will return a ReadOnlyCollection&lt;AudioFormat&gt;.
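To see what you actually have to work with, you can simply dump the supported formats to the debug output. A small sketch, assuming the device variable from before:

```csharp
// Print every format the device supports so we can pick a suitable one
foreach (AudioFormat af in device.SupportedFormats)
{
    System.Diagnostics.Debug.WriteLine(
        string.Format("{0} Hz, {1} bits, {2} channel(s), {3}",
            af.SamplesPerSecond, af.BitsPerSample, af.Channels, af.WaveFormat));
}
```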

To get hold of a specific format, we can use a simple LINQ query:

AudioFormat format = (from af in device.SupportedFormats
                      where af.Channels == 1
                         && af.SamplesPerSecond == 11025
                         && af.BitsPerSample == 16
                      select af).First();

In the above case, I select a format that uses only 1 channel and samples 11025 times per second using 16 bits per sample.

Next, that format is put to use by setting the DesiredFormat property to the returned value.

device.DesiredFormat = format;
device.AudioFrameSize = 100;

The code above also sets the AudioFrameSize. This is the number of milliseconds the device should “run” before telling anybody listening that there are new samples available. It can be set to between 10 and 2000 ms. The default is 1000 ms…

Finally we create a new CaptureSource, which is responsible for actually capturing the information. It has two important properties, AudioCaptureDevice and VideoCaptureDevice. But in this case, only AudioCaptureDevice is really interesting. So let’s set that to the device we have been configuring…

CaptureSource captureSource = new CaptureSource();
captureSource.AudioCaptureDevice = device;


The last thing to do is to call the Start() method on the CaptureSource…and of course call Stop() when we are done recording…
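Wired up, the whole recording flow might look something like this. A sketch only, with hypothetical button handlers and the EnsureDeviceAccess() helper from earlier:

```csharp
// Hypothetical click handlers - the names are made up for illustration
private void RecordButton_Click(object sender, RoutedEventArgs e)
{
    if (!EnsureDeviceAccess())
        return; // the user said no, nothing more we can do

    captureSource.Start();
}

private void StopButton_Click(object sender, RoutedEventArgs e)
{
    // only stop if we are actually capturing
    if (captureSource.State == CaptureState.Started)
        captureSource.Stop();
}
```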

But where do we get the actual sound? Well, that’s where it gets a little complicated… Calling Stop() will not just return a nice Stream containing a wav file or anything… The way that we capture the audio is by using an AudioSink, or rather by creating a class that inherits from AudioSink.

The AudioSink class is an abstract class with 4 abstract methods, OnCaptureStarted(), OnCaptureStopped(), OnFormatChange() and OnSamples(). It also contains a single property of type CaptureSource.

You might already see where this is headed. We will get ourselves a class that gets notified when the CaptureSource starts and stops, whenever there are new samples, and, if we want, whenever the format changes.

This is where the AudioFrameSize comes into play. The device will sample the data X times per second, and get Y bits of data for each sample. These samples will then be collected over AudioFrameSize milliseconds before being passed on to any AudioSink that is listening.

Ok…so all we need is a simple AudioSink that collects those samples and stores them and we are done! Well…sort of… The only issue is that the data we are getting is quite big.

I am neither a math geek nor a media person, but as far as I get it, the currently selected audio format will end up using a bit over 21.5 KB per second. I come to that conclusion by looking at the format. It samples the data 11025 times per second, and collects 16 bits every time. This is 176400 bits per second, or 22050 bytes per second…or about 21.53 KB/sec…
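The arithmetic above can be sketched in a few lines of plain C#, with nothing Silverlight-specific about it:

```csharp
int samplesPerSecond = 11025; // sample frequency from the chosen AudioFormat
int bitsPerSample = 16;
int channels = 1;

int bitsPerSecond = samplesPerSecond * bitsPerSample * channels; // 176400
int bytesPerSecond = bitsPerSecond / 8;                          // 22050
double kbPerSecond = bytesPerSecond / 1024.0;                    // ~21.53

// a one minute recording of raw PCM at this format:
int bytesPerMinute = bytesPerSecond * 60;                        // 1323000, roughly 1.26 MB
```

So a few minutes of uncompressed audio adds up quickly, which is why the encoder matters.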

So that will very soon become a very large amount of data, and this is where an encoder comes into play. An encoder takes that big amount of data and compresses it using some smart algorithm. This will obviously cause a loss of quality, but that is ok…at least in this case… Size and quality will always work against each other, so it is a matter of finding a ratio that works.

In my case, I needed to record a single voice giving verbal feedback. It doesn’t need to be perfect like, for example, a singer adding vocals to a song. So I am ok with compressing it quite a lot, as a small size is much more important to me…

So how do we compress/encode audio in Silverlight? Well, we don’t…not by default at least. Silverlight has no built-in encoding support, at least not that I know of, but feel free to correct me if I am wrong. So we need to look externally. The problem is that anything we find needs to be in managed code and written using APIs and .NET features available in Silverlight.

The only managed code encoder I could find was CSpeex. CSpeex is a C# port of JSpeex, a Java implementation of Speex. Speex is an open source audio compression format designed for speech (I proudly stole that information from the project introduction on CodePlex). It is available for download at http://cspeex.codeplex.com/.

It works fine in Silverlight. The only issue is the fact that it is a Java port. That means it uses a lot of Java-based syntax. Instead of enums it uses classes with static members, and instead of properties it uses get and set methods, and so on. So it feels a little clunky to work with. I would have loved to see it ported to proper C# with all of the C# language features…but hey, right now I am just happy that it exists…

Having gotten my hands on this project, it is time to look at how to use it with an AudioSink.

First we need a class that inherits from AudioSink:

public class StreamAudioSink : AudioSink
{
    protected override void OnCaptureStarted()
    {
    }

    protected override void OnCaptureStopped()
    {
    }

    protected override void OnFormatChange(AudioFormat audioFormat)
    {
    }

    protected override void OnSamples(long sampleTime, long sampleDuration, byte[] sampleData)
    {
    }
}

That’s it. A class that inherits from AudioSink and implements the abstract methods…or at least declares them…

The first step is to make sure that the AudioCaptureDevice has a DesiredFormat that works with CSpeex. CSpeex expects PCM format. So in OnCaptureStarted() we check this:

AudioFormat audioFormat = CaptureSource.AudioCaptureDevice.DesiredFormat;
if (audioFormat.WaveFormat != WaveFormatType.Pcm)
    throw new Exception("Codec not supported");

Next, we initialize a SpeexEncoder, which is available in the CSpeex project. This encoder is initialized with some awesome integer values that say very little…another reason to use enums, guys!

speexEncoder = new org.xiph.speex.SpeexEncoder();
// init(mode, quality, sampleRate, channels) - the 2 is the Speex mode and the 8 the quality setting
speexEncoder.init(2, 8, audioFormat.SamplesPerSecond, audioFormat.Channels);


The initialized encoder gives us access to some data that makes it possible to calculate the packet size. This packet size is then used to initialize a byte array that is used as a buffer for the read data…

// 2 bytes per 16-bit sample, times the number of channels, times the encoder's frame size
int pcmPacketSize = 2 * speexEncoder.getChannels() * speexEncoder.getFrameSize();
temp = new byte[pcmPacketSize];
tempOffset = 0;


The Speex implementation uses a RandomOutputStream to write its encoded Speex format data to. The RandomOutputStream basically wraps a MemoryStream that we can later use to decode the Speex format to Wave format.

The RandomOutputStream is passed to an OggSpeexWriter class, which is responsible for writing Speex packets to the RandomOutputStream:

memFile = new RandomOutputStream(new MemoryStream(2 * 1024 * pcmPacketSize));

writer = new org.xiph.speex.OggSpeexWriter(2, audioFormat.SamplesPerSecond, audioFormat.Channels, 1, true);
writer.open(memFile);
writer.writeHeader("Encoded with Speex");

When the capture stops, we make sure to flush the data in the writer to the stream:

protected override void OnCaptureStopped()
{
    ((OggSpeexWriter)writer).flush(true);
}


On format change we do nothing, as I don’t expect the format to change in the middle of everything… But when samples arrive in the OnSamples() method, we need to do a little work…

I won’t go into detail about what is happening, but you will probably be able to understand it from the code. It basically goes through the sampled data and pulls out pieces of the right size to be written as “packets”. It figures out the packet size by looking at the length of the byte array called temp that was created in the OnCaptureStarted() method.

Whenever a full packet has been received, it is processed by the SpeexEncoder and then written to the RandomOutputStream using the OggSpeexWriter.

protected override void OnSamples(long sampleTime, long sampleDuration, byte[] sampleData)
{
    for (int i = 0; i < sampleData.Length; )
    {
        int len = Math.Min(sampleData.Length - i, temp.Length - tempOffset);
        Buffer.BlockCopy(sampleData, i, temp, tempOffset, len);
        if (len < temp.Length - tempOffset)
        {
            tempOffset += len;
        }
        else
        {
            tempOffset = 0;
            speexEncoder.processData(temp, 0, temp.Length);
            int encsize = speexEncoder.getProcessedData(temp, 0);
            if (encsize > 0 && (memFile.InnerStream.Position + encsize < ((MemoryStream)memFile.InnerStream).Capacity))
            {
                writer.writePacket(temp, 0, encsize);
            }
        }
        i += len;
    }
}


Unfortunately, as this blog post comes out of a project for a client, I will not be providing a full code download. But to make it a little easier for you, the code for the whole AudioSink looks like this:

using System;
using System.Net;
using System.Windows.Media;
using org.xiph.speex;
using java.io;
using System.IO;
using cspeex;

namespace Curtin.AudioFeedback.AudioRecorder
{
    public class StreamAudioSink : AudioSink
    {
        private SpeexEncoder speexEncoder;
        private byte[] temp;
        private int tempOffset;
        private RandomOutputStream memFile;
        private AudioFileWriter writer;

        public RandomOutputStream MemFile { get { return memFile; } }

        protected override void OnCaptureStarted()
        {
            AudioFormat audioFormat = CaptureSource.AudioCaptureDevice.DesiredFormat;
            if (audioFormat.WaveFormat == WaveFormatType.Pcm)
            {
                speexEncoder = new org.xiph.speex.SpeexEncoder();
                speexEncoder.init(2, 8, audioFormat.SamplesPerSecond, audioFormat.Channels);
                int pcmPacketSize = 2 * speexEncoder.getChannels() * speexEncoder.getFrameSize();
                temp = new byte[pcmPacketSize];
                tempOffset = 0;

                if (writer != null)
                    writer.close();

                memFile = new RandomOutputStream(new MemoryStream(2 * 1024 * pcmPacketSize));

                writer = new org.xiph.speex.OggSpeexWriter(2, audioFormat.SamplesPerSecond, audioFormat.Channels, 1, true);
                writer.open(memFile);
                writer.writeHeader("Encoded with Speex");
            }
            else
            {
                throw new Exception("Codec not supported");
            }
        }

        protected override void OnCaptureStopped()
        {
            ((OggSpeexWriter)writer).flush(true);
        }

        protected override void OnFormatChange(AudioFormat audioFormat)
        {
        }

        protected override void OnSamples(long sampleTime, long sampleDuration, byte[] sampleData)
        {
            for (int i = 0; i < sampleData.Length; )
            {
                int len = Math.Min(sampleData.Length - i, temp.Length - tempOffset);
                Buffer.BlockCopy(sampleData, i, temp, tempOffset, len);
                if (len < temp.Length - tempOffset)
                {
                    tempOffset += len;
                }
                else
                {
                    tempOffset = 0;
                    speexEncoder.processData(temp, 0, temp.Length);
                    int encsize = speexEncoder.getProcessedData(temp, 0);
                    if (encsize > 0 && (memFile.InnerStream.Position + encsize < ((MemoryStream)memFile.InnerStream).Capacity))
                    {
                        writer.writePacket(temp, 0, encsize);
                    }
                }
                i += len;
            }
        }
    }
}

So now that we have an AudioSink, we can go back to the actual recording code and start using it. Right before you call the Start() method on the CaptureSource, you should create a new StreamAudioSink and set its CaptureSource property to the CaptureSource.

audioSink = new StreamAudioSink() { CaptureSource = _captureSource };


After this is done, we can call Start() and start the recording. The sink will continuously process the data as it passes through and encode it into Speex format. When we stop recording by calling the Stop() method on the CaptureSource, we need to decode the Speex format to Wave format.

Decoding the Speex format to Wave format is not hard. Just create a new JSpeexDec instance, tell it what destination format it should produce and whether it should be stereo, and then call the decode() method, passing in a RandomInputStream and a RandomOutputStream.

public void Save(Stream stream)
{
    JSpeexDec decoder = new JSpeexDec();
    decoder.setDestFormat(JSpeexDec.FILE_FORMAT_WAVE);
    decoder.setStereo(false); //true

    Stream memStream = _audioSink.MemFile.InnerStream;
    memStream.Position = 0;

    decoder.decode(new RandomInputStream(memStream), new RandomOutputStream(stream));
}


The stream that is passed into the Save() method in this example will contain a valid Wave formatted audio stream. This can then be written to disk, passed to a web service or whatever else it might be used for.
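As an example of the “written to disk” part, here is a minimal sketch using Silverlight’s SaveFileDialog together with the Save() method above. Note that the dialog must be opened from a user-initiated event, such as a button click, or Silverlight will refuse to show it:

```csharp
// Must run inside a user-initiated event handler (a Silverlight security requirement)
private void SaveButton_Click(object sender, RoutedEventArgs e)
{
    SaveFileDialog dialog = new SaveFileDialog
    {
        Filter = "Wave files (*.wav)|*.wav",
        DefaultExt = ".wav"
    };

    if (dialog.ShowDialog() == true)
    {
        using (Stream fileStream = dialog.OpenFile())
        {
            Save(fileStream); // decodes the captured Speex data to Wave into the file
        }
    }
}
```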

I hope that this has given you a little insight into how to record audio in Silverlight in a “real” way. It seems as if most examples just show how to capture the audio, and show very little about what to do with it. So hopefully this will get you a bit further.

And as usual, feel free to send me an e-mail or make a comment if there is something that is unclear or needs more explaining…

Posted: Oct 11 2010, 09:01 by ZeroKoll | Comments (11) |
Filed under: Silverlight

Comments (11) -

Hima United States said:

Chris,
Nice article. Can I get project source code of this article?
Thanks

# November 25 2010, 13:25

Hima India said:

Getting error here  
((OggSpeexWriter)writer).flush(true);

Saying OggSpeexWriter  does not contain definition for flush.

I do have the same req for my project where end user can record the medicines and labresults , need to record the spoken words and save them on the server. Please help me in this

# November 25 2010, 14:21

aoli United States said:

Great article, I am trying to do the exact same thing and having a heck of a time.  Could you post an example of how you pass the stream to a service?  From what I can tell, services can't take streams, just byte[].  But I may have no idea what I'm talking about Smile

# December 22 2010, 08:35

Hima United States said:

Yes. You need to encode the stream on the client side, convert the stream to bytes and then pass this to the service.

At the service side decode the stream of bytes and save as stream on the server.

# December 22 2010, 08:42

aoli United States said:

Thanks, Hima. My problem is when I encode the stream to bytes then decode back to a file, I can't seem to get a valid wav file.

Do you happen to have any sample code or know of examples?

# December 22 2010, 17:58

n.Gian Greece said:

Valuable post! Thank you!
- as Himma i can't access flush function at line
((OggSpeexWriter)writer).flush(true);

- When i try to use your format (with samplesPS=11025 etc) it throws
"A first chance exception of type 'java.io.EOFException' occurred in cspeex", and the size of the output is 44 bytes.

It works well for sample rate 44100 but for nothing else!

This is strange because the other formats are supported (I checked)!

Any suggestions?

thank you

# January 01 2011, 20:19

n.Gian Greece said:

I just noticed that the real problem is that the encoder crops the output to be 314 kb. If the actual size is < than 314 then he writes 44 bytes. Generally the output has size 314xN! Can i fix that?

# January 01 2011, 23:24

ZeroKoll New Zealand said:

Hi n.Gian!
The flush method is internal in the original build I think. As I recall it, I have made a tiny change to the source of the CSpeex stuff and made it public to be able to call the method manually...
The "cropping" problems seem to be connected to the CSpeex code, which I haven't written. I suggest looking around the web for answers to that...
Cheers!

# January 02 2011, 00:11

bfwu Taiwan said:

i set my format on the method.
But its cant compile, bcuz the error says that i dont have enough priority to access those resource(mean channels, sample rate, bits per sample...)...plz help me

--------------------------------------------------------------
protected override void OnFormatChange(AudioFormat audioFormat)
        {
            AudioFormat desiredAudioFormat= (from af in device.SupportedFormats
                                              where af.Channels == 1
                                                 && af.SamplesPerSecond == 8000
                                                 && af.BitsPerSample == 16 select af).First();
            device.DesiredFormat = desiredAudioFormat;
            device.AudioFrameSize = 100;

# January 28 2011, 06:56

ZeroKoll New Zealand said:

Hi bfwu!
I haven't seen this before. Unfortunately I can't come up with a solution with that little information. Do you have anything else that I can use? Any way that you can get the source to me so I can test it? That might help...
Cheers!

# February 02 2011, 06:11

Alan United States said:

How can I put output file into byte array or memory stream since I need save the output into database instead of local file.

Thanks

# October 11 2011, 20:34

