HTML5

Breaking Out – Web Audio and Perceptive Media

By Mark Boas

Sometimes when we take on projects we don’t really know what we’re letting ourselves in for. To fully know, we’d spec things out to the ninth degree and who wants to or has time for that? Well people like NASA do, but we don’t.

When we were given the opportunity to work on something called Perceptive Media all I saw was a colourful but amorphous form ahead and although it was explained to me (as well it could be) I still didn’t have a real clue of what it would end up being and crucially what was involved in making it a reality.

But hey it’s the BBC right? You don’t often get to work with the BBC – you know that BBC you spent so much time watching and listening to when you were growing up. The same BBC who make Doctor Who ? – That BBC!

Time-lords aside (or not actually as the case may be), being a bit of an audio-geek I’ve always been fascinated by the BBC’s audio output especially their Radiophonic Workshop especially Delia Derbyshire’s work and this was an audio gig coming from the R&D department of the same organisation. How could we turn it down?

Delia Derbyshire at the BBC's Radiophonic Workshop © BBC

Delia Derbyshire at the BBC's Radiophonic Workshop © BBC

So I jumped at the chance without really knowing what lay in store, which incidentally is very unfair on my colleague who ended up doing the lion’s share of the work.

The brief went something like this : We want to create a web-audio based demo that will adjust its content to the listener, based on information we can ascertain about them. Oh and we want it to sound as natural as possible so that the listener may not suspect that the content is being tailored. I soon found out that this involved generating audio on-the-fly and applying convolution reverb and other effects so that audio sounded natural in various environments. What this all meant is that we needed to use an advanced audio API. Now I’ve had fun investigating advanced audio APIs before, they differ from the standard audio APIs by being designed to allow you not only to play audio files, but to generate and alter audio.

This is exciting because we are finally getting around to doing stuff with audio that we have been doing with text for years. It also crosses over with something I’m looking at called Hyperaudio.


Experimentation

So with this cross-over in mind, I felt I could manage to find the time to experiment and create some proof of concept demos and this lead me to try out the following libraries:

Speak.js

The Speak.js demo pretty much demonstrates what it can do. You put text in and speech comes out. The library itself is ported from eSpeak using something called Emscripten and actually allows you to generate audio pretty much in real-time by constructing a data URI in WAV format.

AudioLib.js

We wanted it to work well on all browsers that support some advanced audio API. Firefox’s Audio Data API adopts a more low-level approach but in theory at least, you should be able to do anything that Webkit’s Web Audio API does, but with raw JavaScript. AudioLib.js provides the libraries to do this and also abstracts the differences between the two APIs so that you can write one set of code for both.

Unfortunately time and budget restrictions meant that we couldn’t create the version we wanted to make – a version that would work similarly in all browsers that provided an advanced audio API. As it seems that the future W3C standard for advanced audio will be based on the Web Audio API we decided to concentrate on that.


Personalisation

The data we use to personalise the broadcast comes from a number of sources. The main differentiator is the listener’s location which we use via the geo-location API to determine, local weather, radio streams and landmarks which are then subtly inserted into the audio stream. The only restriction is that you must be in the UK to really encounter the differences – this is partly due to the fact that we use some BBC resources that are only available for the UK but also so that we can keep the data manageable for what, after all, is just a demo.

The second source of information about the listener came from a slightly more sinister place. For fun I’d been working with a good friend of mine Matteo Spinelli on a project called Underpants and while looking at the issue of browser traceability across websites we figured out how to determine which social networks a browser is logged into. Cross-over struck once more and we used this technique to personalise the part where our outdoor-challenged hero is urged to log out of her favourite social network and leave the apartment.

Breaking Out Loading Screen

Breaking Out Loading Screen

Try the demo!


Advanced Web Audio

So what changes between the version of the broadcast that uses the Web Audio API and the fallback version that doesn’t? Well there are a number of factors, some of them quite subtle. Speak.js outputs a robot style voice which is fine for our use as the electronic voice of the lift. But we wanted to make sure it would fit in properly to the various environments in which it was set. To do this we created something called a convolution-reverb. In short, a this reverb allows us to apply the right sort of audio ambiance to a sound. So if a sound is coming from a lift we apply a lift type echo. We also apply the same ‘echo’ to the streamed radio broadcast that is played at a certain point in the broadcast.

The fact that we are using an advanced audio API also enables us to add various other effects to other pieces of audio. However we soon found that we needed to be sensible with our audio design, since convolutions with the Web Audio API does take a up a fair amount of CPU. During a development error, a unique convolution was used for each sound, this was found to start failing at around the 14th.

We also made use of audio filters, for example the radio podcast uses a high-pass filter applied to it which makes it sound ‘tinny’, another example is when Harriet opens the her apartment door at the start we apply both filters and faders. The Web Audio API uses a node based approach which means that you can feed the results of one effect into the next so we can apply filters, faders and convolutions to any audio source. To achieve all of this we made heavy use of Web Audio API’s AudioParam which allows nearly any attribute to be changed using handy linear transform effects – we used this to fade in and out, or cross-fade between filtered and unfiltered outputs.

So the Web Audio API version applies filters, faders and convolutions to the audio whereas the standard HTML5 audio versions do not. That’s not to say that given enough time we couldn’t have achieved the same effect using the Audio Data API included in Firefox. But since the new audio standard is slated to be largely based on the Web Audio API it was decided for the purposes of this demo to concentrate our energy in this area.

Once we’d got the core of the functionality working, we set to work on creating a control panel to allow us to tweak every single one of the volumes, filters, faders and convolutions.

We wanted to be able to demonstrate to editors of audio how we could tweak pace, reverb and sound-effects in real-time and although requiring a complete code re-factor we hope that the fact that the whole thing is pretty-much customisable makes this a powerful demo and will be useful to others as well as ourselves who are dabbling in this area.

The Breaking Out Control Panel

The Breaking Out Control Panel

So this is a good time to mention that all the source-code is open source and that you can grab it from the following GitHub location https://github.com/happyworm/PerceptiveMedia.

Conclusions

I think this was definitely an interesting and worthwhile experiment. However as its aim was to be subtle, it purposely does not make immediately clear the potential of the technology. Technically what we are able to achieve turned out to be a kind of audio framework to allow the ability to create and tweak audio as it’s being played. This is useful to producers of audio to see how effects and timings can alter the experience and is especially useful for applications such as games where perhaps you want to give your sound-effects context. I also feel that these techniques could be used for applications such as dynamic story-telling. My daughter — all too often — asks me to tell her a story featuring robots, dinosaurs and goblins and all too often I fall back on the same old principles and formulas of children’s story-telling, the rule of three and so on. Post happy-ever-after she often wants me to add to the story, getting me to fill-in or clarify some of the details. It’s not a huge stretch to imagine that we could create dynamic storytelling applications for kids. A pinch of AI here, some personalisation there and a heavy dose of randomness might just be enough to keep them happy for a bit.

So there we have it, one small step closer to the old Star Trek computer (Did Doctor Who’s have a voice interface? I forget). We’ve already seen the application of voice input with software such as Siri. It shouldn’t be too much longer until audio interfaces start to become common-place, with so much current emphasis on the visual I think this could be quite refreshing.


Epilogue

Question. What’s harder to debug than an intermittent bug? Answer. An intermittent bug that only manifests itself when you deploy to the server. Crazy I know and totally unexpected to us and for this reason you may see issues when running on Firefox (but not Opera). Being supporters of both Mozilla and Firefox we were much dismayed by this bug and spent a significant amount of time trying to get to the bottom of it. Unfortunately due to its nature we were only able to put in a loose bug-report If anybody wants to help us solve this issue please feel free to take a look, even if it’s just to download the application from GitHub and verify that it works locally for you.

Thanks then to Ian Forrester and Tony Churnside for the opportunity to work with them and their team at BBC R&D, also of course Sarah Glenister for the excellent script and Angie Chan for the great artwork. Jussi Kalliokoski for helping us work with AudioLib.js But most of all I want to thank my colleague Mark Panaghiston for working tirelessly behind the scenes not only on the significantly challenging audio aspects of the project but also go above and beyond in integrating the visual aspects and even sourcing and setting up the hosting.

Related Articles

Writing for Perceptive Media

Illustrations for BBC R&D’s Perceptive Media Demo: Breaking Out

What is Perceptive Media?

BBC demonstrates revolutionary ‘perceptive media’

Perceptive Media Launch at Social Media Cafe Manchester

The BBC unveils its first ‘Perceptive Media’ experiment – and you can try it now

What is Perceptive Media?

Tags: , , , , , ,

Thursday, July 12th, 2012 Audio, HTML5, Hyper Audio, javascript, jPlayer, Web Audio API Comments Off

HTML5 Audio APIs – How Low can we Go?

By Mark Boas

“O ye’ll tak’ the high road, and Ah’ll tak’ the low (road)
And Ah’ll be in Scotlan’ afore ye.”
(The Bonnie Banks o’ Loch Lomond)

The web audio community are a vibrant bunch. No sooner had the standard <audio> API been established, than developers were clamouring for more. Just playing audio wasn’t enough, we wanted to analyse, react to and manipulate our audio. Happily, the browser makers obliged with first Mozilla, then Google producing enhanced web audio APIs for their browsers – the only problem was, they were two very different implementations. The Audio Data API implemented in Firefox exposed the data at a fairly low level, while Webkit’s Web Audio API provided a higher level abstraction providing a number of predefined functions. Luckily, it didn’t take long for the JavaScript community to react and start bridging the gap between the two, by writing libraries that provided a common API, libraries such as sink.js which smooths over low level differences. In turn, sink.js was used by ‘higher level’ libraries like audiolib.js – (a general purpose audio toolkit) and Audiolet (which provides a more musically intuitive API, with similar objectives to Webkit’s in-browser solution). There are many others, such as XAudioJS which sports a Flash® and base64 data url wav generation fallback, older projects like dynamic.js that just provides a Flash® fallback for the Audio Data API and DSP.js a Digital Signal Processing Library.

People really love messing about with audio.

Notice that the process of creating all this cool functionality didn’t come about from a W3C spec. Similarly, the Advanced Audio APIs were not the result of a W3C think-tank, but from two competing visions of what an advanced audio API should look like. Now it looks like the Web Audio API will be implemented in Safari as well as Chrome.

Once you create compelling functionality, developers will immediately start to use it. It may be experimental but developers will start to rely on it to make cool stuff. Cutting edge technology is seductive like that. I’m surer than sure that the Web Audio API has been well researched and has taken much inspiration from tried and tested APIs that exist outwith of our lovely browser based world (Apple’s Core Audio Frameworks, I believe), but I’m not convinced that you can really tell what web developers need or want until you give them something to play with.

Mozilla’s approach was to expose a very comprehensive low level API, which potentially allows JavaScript developers to create all the functionality of Webkit’s Web Audio API and then some. As a result we get libraries like JSMad cropping up. What does JSMad do? Significantly, it allows you to play MP3s in Firefox*. Is JavaScript fast enough? Apparently so. This was a ‘this changes everything’ moment for me and since then a similar approach has been taken by pdf.js and more recently Broadway.js which decodes H.264 on the fly.

*Neither Firefox or Opera support MP3 natively due to patent concerns.

I’m not saying Mozilla’s Audio Data API is perfect, there are issues with audio using the same thread as the UI and synch issues with multiple streams. However this is being addressed in the MediaStreams Processing proposal and it’s worth taking a look at it, even if it’s just for an insight into what future implementations could look like.

I’m digressing. The point is, if browser makers expose the low level API, developers will quickly come in and start writing libraries on top of that API. As is often the case, the developer community will start making things that the browser makers had never even considered. It makes sense, there are many more web developers than browser developers. Sure, web developers will bridge the gaps and polyfill over the cracks, which let’s face it, has been the only reasonable way of going forward with HTML5, but crucially they will also make new libraries that other developers can use – and all of this at very high rates of turnaround. Of course, the common-or-garden JavaScript developer has a series of enormous advantages over the browser API developer or the standards bodies that seek to define these APIs. I’m gonna name three here:

  1. Strong community — Web developers have a huge active and open community to draw from.

  2. Lower barrier to entry — The barrier of participation once something is put on something like github is virtually zero.

  3. Room to manouevre — Nothing web developers write is ever set in stone, JavaScript represents a much more fluid abstraction than the less flexible native browser code.

Ok, so bear with me here, and this is more of a question than a proposal – What if we separate concerns between browser makers and web developers when it comes to creating standards? Browser makers could concentrate on security, privacy, performance and exposing low level API’s in such a way that web developers can start to build libraries and APIs in the fluid, dynamic, iterative and extremely reactive manner that the web as a media allows. Once these libraries reach an acceptable level of adoption, browser makers can get together and decide which of these features they want to adopt based on tried and tested use cases, and yes make it a standard and build it into the browser. Wouldn’t we move forward more quickly that way? And as a bonus, no browser would be left behind as we’d be building the polyfills along the way.

In short, what I’m saying is that if the standard bodies put their energy into defining low level APIs, the high level APIs will look after themselves, or rather the community will look after them. After all it seems that the W3C themselves want a more community based approach to standards and besides we all know that bottom-up trumps top-down, right?

Outside my flat is an open space that the local council didn’t quite know what to do with, I’m sure they considered adding basket-ball hoops, concrete tables, a kid’s playground and all kinds of things. As it turned out they created a decent flat surface and pretty much left it as that. The users of this space, mostly children, decided this was a perfect space for playing soccer and improvised the space to include a hand drawn goal and pitch markings. If the council really wanted to make something permanent, they could take inspiration from this and create real goals and solid pitch markings.

It’s probably too late to change the Webkit implementation of the Web Audio API significantly, but I would strongly urge the developers of it to include a more comprehensive low level API in future releases. What’s the worst that could happen?

Improvised Goal Posts

A big thanks to Jussi Kalliokoski and Dustin Blake for helping with this post and deep respect and gratitude to all those hacking on audio.

➸ Follow me on Twitter

Tags: , , , , ,

Tuesday, November 15th, 2011 Audio, javascript 17 Comments

Hyperaudio at the Mozilla Festival

This year has run contrary to previous years of my life in the sense that it seems to have lasted much longer than the year before. This may have something to do with a small addition to the family that arrived in March and the consequent ramping up of waking hours, but I prefer to think it’s because things are moving so quickly in that wonderful place we call the worldwide web, that it seems that too much has happened to compress into a year.

It was only a year ago for example, when I started tinkering with jPlayer to couple text and audio and was asked to show off a very small demo at the first Mozilla Festival in Barcelona last year.

A few months later I was introduced to and became involved with something known as Hyperaudio and was given the opportunity to create a couple of proof-of-concept demos : Denmark Radio’s Hyperdisken Demo and the Radiolab/Soundcloud collaboration ( a shout-out to Henrik Moltke for doing much of the groundwork for these demos and my colleagues at Happyworm without which they just wouldn’t have been possible).

Further on through the arc of this epic year, I took part in the Mozilla Knight Journalism challenge where I was encouraged to research and blog some ideas on a tool I called the Hyperaudio Pad. Happily I was flown over to Berlin with other like minded MoJo peeps, for a week of intense discussion, collaboration and hacking and managed to get the first semblance of an actual product out of the door.

Hyperaudio Pad

Which brings me in a round-about way to this year’s Mozilla Festival, where I will again be demoing and chatting about the coupling of text and media and the other things that Hyperaudio is (or could be) about. This all at the Science Fair on Friday, 4th of November and a Hyperaudio Workshop / Design Challenge on Saturday, 5th.

The Science Fair will be all about demos, chatting to people interested in the general Hyperaudio concept and making sure I don’t spill my drink over my laptop. Meanwhile in the workshop we hope to put the Hyperaudio Pad through it’s paces and actually make something with it, while gathering ideas and maybe hacking on some of them into the bargain. It should be a fun mix of hacks, hackers, designers, audio buffs and basically anybody curious enough to be involved. The session will be fairly free flow but the plan so far is to:

  • Briefly chat about and demo the Hyperaudio concept, what it can do and with whom we can collaborate.
  • Breakout into groups to come up with new ideas, applications and designs.
  • Reconvene and talk about the group’s ideas.
  • Breakout into groups to work on these ideas, whether it is creating a programme, UI mockups, story-boarding functionality or even developing small demos.
  • Creating something we can show other people – hopefully a program created with the Hyperaudio Pad but also some new ideas.

Hopefully we can cram this all into 3 hours. Folk from the BBC, Universal Subtitles, Sourcefabric and other cool people are planning to be in attendance, which should make for some interesting crossovers and fantastic opportunities to consult and maybe even influence organisations in a position to use the stuff we’re making. Additionally the BBC are providing us with some fantastic media to work with.

All will be revealed over the weekend, book your place and get your eyes and ears ready to be part of the Hyperaudio experience.

Tags: , , , ,

Tuesday, November 1st, 2011 Audio, HTML5, Hyper Audio, jPlayer, Popcorn.js, Uncategorized 2 Comments

Introducing the Hyperaudio Pad (working title)

Mark Boas

Last week as part of the Mozilla News Lab, I took part in webinars with Shazna Nessa – Director of Interactive at the Associated Press in New York, Mohamed Nanabhay – an internet entrepreneur and Head of Online at Al Jazeera English and Oliver Reichenstein – CEO of iA (Information Architects, Inc.).

I have a few ideas on how we can create tools to help journalists. I mean journalists in the broadest sense – casual bloggers as well as hacks working for large news organizations. In previous weeks I have been in deep absorption of all the fantastic and varied information coming my way. Last week things started to fall into place. A seed of an idea that I’ve had at the back of my mind for some time pushed its way to the front and started to evolve.

Something that cropped up time and again was that if you are going to create tools for journalists, you should try and make them as easy to use as possible. The idea I hope to run with is a simple tool to allow users to assemble audio or video programs from different sources by using a paradigm that most people are already familiar with. I hope to build on my work something I’ve called hypertranscripts which strongly couple text and the spoken word in a way that is easily navigable and shareable.

The Problem

Editing, compiling and assembling audio or video usually requires fairly complex tools, this is compounded by the fact that it’s very difficult to ascertain the content of the media without actually playing through it.

The Solution?

I propose that we step back and consider other ways of representing this media content. In the case of journalistic pieces, this content usually includes the spoken word which we can represent using text by transcribing it. My idea is to use the text to represent the content and allow that text to be copied, pasted, dragged and dropped from document to document with associated media intact. The documents will take the form of hypertranscripts and this assemblage will all work within the context of my proposed application, going under the working title of the Hyperaudio Pad. (Suggestions welcome!) Note that the pasting of any content into a standard editor will result in hypertranscripted content that could exist largely independently of the application itself.

Some examples of hypertranscripts can be found in a couple of demos I worked on earlier this year:

Danish Radio Demo
Radiolab Demo

As the interface is largely text based I’m taking a great deal of inspiration from the elegance and simplicity of Oliver’s iAWriter. Here are a couple of rough sketches :

Edit Mode

Sketch of the Hyperaudio Pad in Edit Mode

Sketch of the Hyperaudio Pad in Edit Mode

Playback Mode

Hyperaudio Pad in Play Mode

Hyperaudio Pad in Play Mode

Working Together

Last week I’m happy to say that I found myself collaborating with other members of the News Lab, namely Julien Dorra and Samuel Huron, both of whom are working on related projects. These guys have some excellent ideas that relate to meta-data and mixed media that tie in with my own and I look forward to working with them in the future. Exciting stuff!

Tags: , , , , , , , ,

Monday, August 1st, 2011 Audio, HTML5, Hyper Audio, inspiration, javascript 5 Comments

Further Experimentation with Hyper Audio

Mark Boas

Following the fun we had making the Hyperdisken demo, I was happy to be asked by Mozilla, in collaboration with Radiolab and SoundCloud to help create another demo to show off the possibilities of hyper audio. This time we had an excellent Radiolab program as audio material and we wanted to get a little more ‘involved’. What was required was an application that would consist of many of the features of the Hyperdisken demo but also integrate deeply with SoundCloud API, and on top of this something extra, something to catch the eye.

Here’s what we came up with.

symmetry

I was fortunate again to work with the ideas-forge known as Henrik Moltke, who collaborated early on with Paul Rouget to produce something he dubbed the ‘Word River’ – a CSS3 manipulated flowing river of words that dynamically picked up content from an HTML transcript. We were also keen to make a pure HTML5 based solution and Paul helped figure out the hooks into the SoundCloud API that would allow us to achieve that. We were also very lucky to be given a great design by the multi-talented Lee Martin, SoundCloud’s experimenter extraordinaire.

So with proof of concept and some visual bling firmly in hand I was tasked with making this baby fly. Luckily I had help. SoundCloud engineers were at hand to answer any questions and crucially we had great support and code contributions from the popcorn.js group. I also managed to talk jPlayer author and all round JS media guru and of course, colleague Mark Panaghiston into giving me a hand. So despite the tight deadlines we were pretty much set.

Henrik has already blogged about the ideas and functionality that make up the demo. I want to write a little about the technology used.

Although I found out in retrospect, not strictly essential, we once again used jPlayer as our audio base, we’re familiar with it and we can move fast using it. It also meant that we could take much of the functionality developed in previous demos and plug it right in. Again, the excellent Popcorn.js was the engine that drove all the time based display of text and images and dealt with the parsing of data. Steven Weerdenberg, active Popcornista from Seneca College, very kindly wrote a plugin that grabbed, parsed and presented comments (amongst other things) from the track we used hosted on SoundCloud. This is where the Popcorn framework comes into its own as a plugin oriented architecture, something we took advantage of when we converted both the transcript and the word river functionality into plug-ins.

So where did the data come from? Well again the transcript HTML doubled as the source for richer interaction when used by the word river plugin. I like this approach I have to say. It means you don’t have to be a programmer to come in and immediately understand the content and change it accordingly. I also like the fact that the transcript is a separate HTML file, it pleases the separatist in me and means that it works as a standalone resource. We also used the standard speaker notations as a type of meta-data, the word river plugin hiding these parts for the purpose of display but using them to colour code each speaker’s text.

This part of the transcript:

Transcript Markup
is used to create this ‘word river’:

Word River

and this interactive transcript :

Transcript Text

Data-wise, everything else came via the SoundCloud API, this included their trademark wave-form, both ogg and mp3 audio sources and all of the comments. We also hijacked the comments to make a crude content management system. The idea being that any comments posted by the Radiolab account with references to images in them, were picked up and displayed as images in the main content area, and did not show up as comments on the timeline. If two images were present it meant they were square, one and it was ‘widescreen’ a blank image was used to remove images when they were no longer needed.

The last pieces of the puzzle and one we’ve still some polish to apply to, (if polishing puzzles makes any sense to you) was getting it all working on the majority of tablets and mobile devices. Since this demo didn’t use Flash this was actually a possibility and we got our web designer Silvia Benvenuti to come out of maternity leave and sort this out for us at the 9th hour, leaving me quite literally holding the baby.

This was a tough gig but all in all I’m happy with what we achieved, everyone seemed to really enjoy taking part in the process, and I certainly enjoyed bringing it all together. Hopefully it will inspire both program makers, designers and developers to come together and explore the limits of what hyper-audio can do. As Inspiral Carpets would say, moo!

moo!

Source code for this project and other demos can be found on github

Follow me on Twitter if you want to hear more about this sort of thing.

Tags: , , , , , , , ,