I'm developing a new product called PitchCake that helps people improve their speaking skills.

One of the interactions I've been working on is collecting audio from the browser. I'm currently using the Flash-based WAMI, which kind of sucks. I'd like a hybrid solution that uses HTML5 and falls back on Flash, but I'm not there yet.
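To make the hybrid idea concrete, here is a rough sketch of how the recorder backend might be chosen. It is not what PitchCake currently does. The names `BrowserCapabilities`, `detectCapabilities`, and `chooseRecorderBackend` are hypothetical, and the HTML5 check assumes the standard `navigator.mediaDevices.getUserMedia` API (older browsers exposed vendor-prefixed variants instead).

```typescript
// Which recording implementation to use. All names here are illustrative.
type RecorderBackend = "html5" | "flash" | "none";

interface BrowserCapabilities {
  hasGetUserMedia: boolean; // HTML5 microphone capture is available
  hasFlash: boolean;        // Flash plugin is installed (needed for WAMI)
}

// Inspect the environment defensively so this also runs outside a browser.
function detectCapabilities(): BrowserCapabilities {
  const nav = (globalThis as any).navigator;
  const hasGetUserMedia = !!(
    nav &&
    nav.mediaDevices &&
    typeof nav.mediaDevices.getUserMedia === "function"
  );
  const hasFlash = !!(
    nav &&
    nav.mimeTypes &&
    nav.mimeTypes["application/x-shockwave-flash"]
  );
  return { hasGetUserMedia, hasFlash };
}

// Prefer HTML5, fall back on Flash, otherwise report that recording is unavailable.
function chooseRecorderBackend(caps: BrowserCapabilities): RecorderBackend {
  if (caps.hasGetUserMedia) return "html5";
  if (caps.hasFlash) return "flash";
  return "none";
}
```

Calling `chooseRecorderBackend(detectCapabilities())` on page load would then decide whether to show the HTML5 recorder, the WAMI one, or a "recording not supported" message.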

The current flow...
1) User presses record button and speaks into the mic.
2) Loading sign appears as audio is processing.
3) Audio player lights up and plays back the pitch.
4) User has option to try again and repeat steps 1-3 until they are satisfied.
5) User clicks on submit pitch button when finished.
You can test it by signing up and doing a practice pitch.
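
In case it helps to reason about the flow, the five steps can be written down as a small state machine. This is only a sketch with names I made up for illustration (`PitchState`, `PitchEvent`, `nextState`), and the `stopRecording` event is an assumption, since the steps above don't say how a recording ends.

```typescript
// States the recording widget can be in (illustrative names).
type PitchState = "idle" | "recording" | "processing" | "playback" | "submitted";

// Things the user or the server can do. "stopRecording" is an assumed event.
type PitchEvent = "pressRecord" | "stopRecording" | "audioReady" | "submit";

// Allowed transitions, following steps 1-5 of the flow.
const transitions: Record<PitchState, Partial<Record<PitchEvent, PitchState>>> = {
  idle: { pressRecord: "recording" },                            // step 1
  recording: { stopRecording: "processing" },                    // step 2 begins
  processing: { audioReady: "playback" },                        // step 3
  playback: { pressRecord: "recording", submit: "submitted" },   // steps 4 and 5
  submitted: {},
};

// Return the next state, or stay put if the event is not valid in this state.
function nextState(state: PitchState, event: PitchEvent): PitchState {
  return transitions[state][event] ?? state;
}
```

Events that are not valid in a given state (for example `submit` while the audio is still processing) simply leave the state unchanged, which corresponds to those buttons being disabled in the interface.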

[Screenshot: PitchCake audio collection interface]

So I have a few questions:
1) Does the flow make sense to you?
2) What could I do visually to make the interface more logical?
3) It currently takes up to a minute for our server to process the audio before the user can hear the playback. How can we make the waiting experience more engaging?