Project in short: Voice as a multimodal input into 3D worlds

(I'm hoping to animate this face according to tone of voice)
Description
To build an environment, either 3D, web-based, or both, that is influenced in an original manner by voice commands; specifically, using a mobile phone to influence 2D or 3D action and models. I intend to build a new model in MAYA, but during the course of the thesis I will also use my prefabricated models from last semester, including a head, a body, a car, and a bar countertop.
Personal Statement
My work in Dynamic Web Development and Ubiquitous Computing for Mobile Devices led me to this attempt to combine VoiceXML/DTMF with Interactive 3D. Although voice-driven menus are dry as can be and irritate the hell out of customers, I'm inspired by using VoiceXML for creative purposes: specifically, navigating a historical map of Northern Africa, or telling a car to travel along a path. Unfortunately, VoiceXML is a hosted solution and the voice engine is proprietary, so I'm hoping to explore other ways of interfacing voice to 3D. Carnegie Mellon University created SPHINX, an open-source engine; time permitting, I'll explore this software, although I don't want to get bogged down in the technical details of creating a viable software build, which is beyond the scope of this thesis. Of note, VXML falls under the "multimodal" umbrella. This is significant because an XML web site can be parsed into VXML, meaning that all web sites could conceivably be "voice enabled", and that authoring for one input device can be inclusive of another.
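As a feasibility sketch, the outline below follows the pattern of Sphinx-4's published Java demos: a ConfigurationManager wires up a recognizer and a microphone, and recognized text comes back as a Result. The configuration file name here is my placeholder, and a real build would need the Sphinx-4 jars and an acoustic model on the classpath; this is just to gauge the scope, not a working build.

    import edu.cmu.sphinx.frontend.util.Microphone;
    import edu.cmu.sphinx.recognizer.Recognizer;
    import edu.cmu.sphinx.result.Result;
    import edu.cmu.sphinx.util.props.ConfigurationManager;

    public class VoiceSketch {
        public static void main(String[] args) {
            // "voice.config.xml" is a placeholder config naming the
            // recognizer, microphone, grammar, and acoustic model
            ConfigurationManager cm = new ConfigurationManager(
                    VoiceSketch.class.getResource("voice.config.xml"));
            Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
            recognizer.allocate();
            Microphone mic = (Microphone) cm.lookup("microphone");
            if (!mic.startRecording()) {
                System.err.println("Cannot start microphone.");
                recognizer.deallocate();
                return;
            }
            while (true) {
                Result result = recognizer.recognize(); // blocks until an utterance ends
                if (result != null) {
                    // e.g. "turn left", "drive forward"
                    System.out.println("Heard: " + result.getBestFinalResultNoFiller());
                }
            }
        }
    }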
Background
W3C specifications for VoiceXML and multimodal devices
CMU Sphinx software
VoiceXML references and articles
PlayStation 2 voice-activated commands
Readings in Interactive Telecommunications and Design for Voice Interface


SOCOM II for the PlayStation 2 features voice input with a USB headset: you can call in additional troops through a contextual menu. I believe the underlying technology is Dragon NaturallySpeaking for headsets; it works well, with a low margin of error.
Audience
The audience for this project falls into several categories: 2D, 3D for the web, and live interactive 3D. For the broadest reach, I'd like to publish a Virtools .cmo for the web; if the project goes as planned, experiencing the 3D would require just a telephone and a browser with the Virtools plug-in. The web audience could also access JavaScripted pages for examples of 2D interactivity.
The second audience is more specific: I envision a real-time demonstration on a PC with an Internet connection and Virtools.
User Scenario
-The user dials in to my VXML provider or turns on a piece of hardware for a live demonstration.
-The dial-in user accesses my account and listens to a voice menu of options regarding voice commands to the web.
-The user points the browser to either a Processing applet or a .cmo file on the web.
-The web applet/.cmo makes a connection to my database (a sketch of this polling follows the list).
-The database stores user "moves" and commands.
-The user speaks to an actor on the web page.
-The user navigates the aforementioned map.
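A rough sketch of the applet side of that loop, assuming the database sits behind a simple web script: the hostname and the /lastCommand endpoint are placeholders for whatever actually fronts the moves table. A Processing applet or the Virtools side would do the equivalent, mapping each new command onto the scene.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class CommandPoller {
        public static void main(String[] args) throws Exception {
            String lastSeen = "";
            // example.com/lastCommand is a placeholder endpoint that returns
            // the newest row of the moves database as one line of text
            URL endpoint = new URL("http://example.com/lastCommand");
            while (true) {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(endpoint.openStream()));
                String command = in.readLine(); // e.g. "drive", "stop", "turn left"
                in.close();
                if (command != null && !command.equals(lastSeen)) {
                    lastSeen = command;
                    // here the applet/.cmo would move the car, face, etc.
                    System.out.println("New voice command: " + command);
                }
                Thread.sleep(1000); // poll once a second
            }
        }
    }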
Implementation
This project is mostly software-based, using front-end and back-end web programming, Virtools scripting, and modeling in MAYA.
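On the back end, the glue could look something like the servlet below: a VoiceXML submit on the hosted platform posts the recognized field to a URL, and the server writes it into the moves table and replies with more VXML so the call can continue. The parameter name, table, and connection string are all assumptions for illustration.

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class CommandServlet extends HttpServlet {
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            // "command" is the hypothetical VXML field submitted by the gateway
            String command = req.getParameter("command");
            try {
                Class.forName("com.mysql.jdbc.Driver"); // JDBC driver on classpath
                // placeholder connection string and table
                Connection conn = DriverManager.getConnection(
                        "jdbc:mysql://localhost/thesis", "user", "pass");
                PreparedStatement stmt = conn.prepareStatement(
                        "INSERT INTO moves (command) VALUES (?)");
                stmt.setString(1, command);
                stmt.executeUpdate();
                conn.close();
            } catch (Exception e) {
                throw new ServletException(e);
            }
            // answer in VXML so the phone dialog keeps going
            resp.setContentType("text/xml");
            resp.getWriter().println(
                "<?xml version=\"1.0\"?><vxml version=\"2.0\">" +
                "<form><block>Got it.</block></form></vxml>");
        }
    }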
Preliminary Observations
Thus far, I've learned that VXML is powerful but proprietary. Voice engines are a science in and of themselves. VXML as a specification is multimodal, so if you want to access data from the web, it can be formatted for voice too. I know that voice commands can be captured in a web page through Processing or platform-specific JavaScripting, but I really don't know how it's going to work with exported Virtools .cmo's.
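To illustrate what "formatted for voice too" means in practice, here is a toy rendering of the same record for a browser and for a VXML gateway; the place name and the exact markup are illustrative only.

    public class MultimodalRender {
        // same data, two presentations: HTML for a browser...
        static String asHtml(String place) {
            return "<p>Destination: " + place + "</p>";
        }
        // ...and VXML for a voice gateway to speak aloud
        static String asVxml(String place) {
            return "<?xml version=\"1.0\"?><vxml version=\"2.0\"><form><block>"
                 + "<prompt>Destination " + place + "</prompt>"
                 + "</block></form></vxml>";
        }
        public static void main(String[] args) {
            System.out.println(asHtml("Tangier")); // hypothetical map waypoint
            System.out.println(asVxml("Tangier"));
        }
    }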
References
Here is a link to groundbreaking work in 3D by NYU's MRL lab, Ken Perlin's "Responsive Face":
http://mrl.nyu.edu/~perlin/experiments/head/Face.html
The expressiveness of this model is remarkable.
On the voice-recognition side, inspiration for the project's input came from Dennis Crowley's Ubiquitous Computing for Mobile Devices, where I learned how to interface a mobile phone to the web and a database.