So with the code-context, the dictation in Listener is getting okay-ish. It's still pretty frustrating and error-prone, but I can use it maybe 1/4 of the time (mostly for docstrings). Part of the frustration is just that the language models are not yet well tuned for some commonly needed phrases that the tokeniser didn't generate for the code corpus, but a big part of it is that I don't yet have the correction/undo operations or the navigation bits, so any mistake means editing by keyboard. The lack of a good contextual awareness/biasing model is also a pretty big gap.
So, I've been working on getting the GUI built up for doing corrections. As of now, a DBus service runs in the background, driving the interpreter and IBus and delivering the partial and final transcriptions via signals to the front-end GUI. I've also got a "floating window" that shows the text as you speak, though currently you have to install some KWin config to get it to float over other windows (due to focus-stealing protections). I'm thinking a KDE plasmoid that runs in the panel might be a better approach.
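To make the partial/final split concrete, here is a minimal sketch of how a front-end might track what the floating window shows. It is not Listener's actual code: the `TranscriptView` class and its `on_partial`/`on_final` method names are hypothetical, and it assumes each partial result replaces the previous guess for the current utterance while a final result commits it. In the real GUI these methods would be wired to the DBus signals; here they are plain calls so the sketch stays self-contained.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TranscriptView:
    """Text a floating window could display (hypothetical helper)."""

    committed: List[str] = field(default_factory=list)  # finalised utterances
    partial: str = ""  # current in-progress guess, if any

    def on_partial(self, text: str) -> None:
        # Assumption: a partial result supersedes the previous partial.
        self.partial = text

    def on_final(self, text: str) -> None:
        # A final result commits the utterance and clears the guess.
        self.committed.append(text)
        self.partial = ""

    def display(self) -> str:
        # Committed utterances first, then the in-progress guess.
        parts = self.committed + ([self.partial] if self.partial else [])
        return " ".join(parts)


view = TranscriptView()
view.on_partial("def get")
view.on_partial("def get item")
view.on_final("def get_item")
print(view.display())
```

Keeping the display state separate from the transport means the same logic could sit behind a floating window or a panel plasmoid.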
The idea is that the DBus service will be reusable by other GUIs and that its DBus API will be the primary way to communicate with it, so e.g. a code editor plugin will pass context information through the API, while the GUI will use it to update rules, send correction updates, etc. Up until this point, I've been using ad-hoc socket communication... which honestly was a lot easier to develop and control, but if the project is to become a general service for the Linux desktop I suppose I need to embrace the native RPC mechanism.
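As a rough illustration of the kind of payloads such an API might carry, here is a sketch of two message types and a JSON round-trip. Everything in it is an assumption made for illustration: the `ContextUpdate` and `Correction` names, their fields, and the choice of JSON strings as the wire format are all hypothetical, and the real service may well use native DBus types instead.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ContextUpdate:
    """Hypothetical message an editor plugin might send."""

    source: str  # e.g. "editor-plugin"
    identifiers: List[str]  # names visible near the cursor


@dataclass
class Correction:
    """Hypothetical message the GUI might send after a fix."""

    utterance_id: int
    corrected_text: str


# Registry used to rebuild a message from its type tag.
MESSAGE_TYPES = {cls.__name__: cls for cls in (ContextUpdate, Correction)}


def encode(message) -> str:
    # Tag the payload with its type so the receiver can dispatch on it.
    payload = {"type": type(message).__name__, "body": asdict(message)}
    return json.dumps(payload, sort_keys=True)


def decode(raw: str):
    payload = json.loads(raw)
    return MESSAGE_TYPES[payload["type"]](**payload["body"])


update = ContextUpdate(source="editor-plugin", identifiers=["get_item", "self"])
assert decode(encode(update)) == update
```

Defining the message shapes independently of the transport would also make it easier to keep a socket-based path around for development while the DBus side settles.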
Side notes: I am finding pydantic useful, but I'm definitely frustrated with it not playing well with regular properties. DBus in PySide isn't thrilling me, but it works once you fire up qdbusviewer, realise that you're listening for the wrong thing, and debug the cryptic data-format issues. The "Qt tool window that floats over all other windows" thing took entirely too long to figure out: something (KWin) was blocking the effect, rather than me "spelling" it incorrectly. Facebook just announced a transformer-and-audio based model that sounds a lot closer to what I'd like to work with, but I'll focus on getting my current implementation useful first before experimenting with swapping out the back-end.