I'm building a voice-assisted navigation feature for my app that would allow users to:
- Navigate between screens/pages using voice commands
- Have an AI agent take actions on the current page (clicking buttons, filling forms, etc.)
Think of it as a "Computer Use"-style experience, but scoped entirely to my own application rather than being a cross-app or system-wide agent.
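To make the question concrete, here's roughly what I've been imagining: instead of the full computer-use tool, I'd expose a small set of app-specific actions as ordinary custom tool definitions. Everything below (navigate_to, click_element, fill_field, and their schemas) is a placeholder I made up, not anything from the API, so please read it as a sketch of the idea rather than a working design:

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Placeholder in-app actions exposed as ordinary custom tools.
// Tool names and fields are my own invention, not part of the API.
const tools: Anthropic.Tool[] = [
  {
    name: "navigate_to",
    description: "Navigate to a named screen in the app.",
    input_schema: {
      type: "object",
      properties: {
        screen: { type: "string", description: "Route name, e.g. 'settings'" },
      },
      required: ["screen"],
    },
  },
  {
    name: "click_element",
    description: "Click a button or link on the current screen by its element id.",
    input_schema: {
      type: "object",
      properties: {
        elementId: { type: "string" },
      },
      required: ["elementId"],
    },
  },
  {
    name: "fill_field",
    description: "Type a value into a form field on the current screen.",
    input_schema: {
      type: "object",
      properties: {
        elementId: { type: "string" },
        value: { type: "string" },
      },
      required: ["elementId", "value"],
    },
  },
];
```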
Questions:
1. What's the recommended approach for implementing this with the Computer Use API?
2. How should I expose my app's UI to the model so it can understand and interact with elements?
3. Are there best practices for handling the feedback loop between voice input → AI decision → UI action? (The loop I have in mind is sketched below.)
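For question 3, this is the rough feedback loop I'm picturing: transcribe the voice input, send it along with a description of the current screen, execute whatever tool calls come back against my own UI layer, and feed the results back until the model stops requesting actions. `describeCurrentScreen` and `runUiAction` are hypothetical helpers I'd implement myself, and the model id is just an example:

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Hypothetical helpers I'd implement against my own UI layer.
declare function describeCurrentScreen(): string;
declare function runUiAction(name: string, input: unknown): Promise<string>;
// Same placeholder custom tool definitions as in the sketch above.
declare const tools: Anthropic.Tool[];

const anthropic = new Anthropic();

// Rough loop: transcribed voice command in, tool calls executed against the
// app's UI, results fed back until the model stops requesting actions.
async function handleVoiceCommand(transcript: string): Promise<void> {
  const messages: Anthropic.MessageParam[] = [
    {
      role: "user",
      content: `Current screen: ${describeCurrentScreen()}\nUser said: "${transcript}"`,
    },
  ];

  while (true) {
    const response = await anthropic.messages.create({
      model: "claude-3-5-sonnet-latest", // example model id
      max_tokens: 1024,
      tools,
      messages,
    });

    const toolUses = response.content.filter(
      (block): block is Anthropic.ToolUseBlock => block.type === "tool_use"
    );
    if (toolUses.length === 0) break; // no more actions requested

    // Execute each requested action and report the outcome back to the model.
    const results: Anthropic.ToolResultBlockParam[] = [];
    for (const toolUse of toolUses) {
      const outcome = await runUiAction(toolUse.name, toolUse.input);
      results.push({
        type: "tool_result",
        tool_use_id: toolUse.id,
        content: outcome, // e.g. "navigated to /settings" or an error message
      });
    }

    messages.push({ role: "assistant", content: response.content });
    messages.push({ role: "user", content: results });
  }
}
```

Is this roughly the right pattern for an in-app agent, or does the Computer Use API expect something different (e.g. screenshots plus the computer-use tool rather than custom tools)?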