Control web apps via natural language by casting speech to commands with GPT-3


Here I’ll show you how to practically implement this scheme as a way to control a very specific app: JSmol, a JavaScript library for molecular visualization inside web pages. You can see the example running in this video and tweet, where typed or spoken requests made in natural language are converted into JSmol commands and applied to the visualization:

If you want to try out this example yourself right now, go here:

(You’ll need an OpenAI key for GPT-3, see here.)

What is this useful for, besides being cool?

I see at least three obvious benefits of this kind of technology:

  • Most importantly, it allows users to control the app as finely as JSmol’s scripting language allows, without needing to know that language. Indeed, I wrote up this example at the request of a colleague from my lab. He once pointed out (and I agree) that there are so many programs out there doing similar things, all with their own commands and scripting languages, that it is very hard to remember all of them.
  • Even if you kind of know the commands but don’t type them well, or if the speech recognition engine fails, GPT-3 corrects errors as it tries to produce JSmol-consistent commands.
  • And as I exemplify in the video, you don’t even need to type the natural language commands in English, because GPT-3 “internally translates” your input, casting it from any language directly into JSmol’s language.

Plus, since this example is web-based and, as I always stress, works entirely inside a webpage, you can run this same tool on your laptop, tablet, or computer simply by pointing your web browser to the page’s URL. (And by the way, again I recommend this free hosting service, which includes the JS and PHP support you need to use GPT-3 in web apps.)

Three key points from the developer’s point of view, and dive into the code

Before we see how this app works, there are three important points regarding JSmol as our example target app to control, that you’ll have to take into account for other applications:

  • JSmol supports command passing through a scripting language, which is key to simplifying the execution of the commands produced by GPT-3.
  • The internet is flooded with tutorials and examples about JSmol’s scripting language, so GPT-3 has very likely seen a lot of it during training.
  • Even considering the above, we must provide GPT-3 a clear prompt that includes few-shot learning so that it “understands” that we want JSmol-like commands in its output.

General workflow + converting natural language into commands

As summarized in the first figure, the core of the workflow is the input of the natural language text as typed or as recognized from speech, into a GPT-3 prompt that contains a series of natural language-command pairs tuned to the target app one wants to control. In more detail, the starting prompt includes pairs of “question-answer” sentences that reflect what command (answer) is required to achieve the task indicated in the natural-language request (question), and the actual request that the user asks for is appended to the prompt, in the form of a question.

What I called above “question” and “answer” can also be thought of as “comment” and “code”, in the spirit of automated code writers. GPT-3 doesn’t really care what we call them or even how we encode them, as long as we respect the format. For the JSmol example exactly as you see it working in the video, I used “Q:” for the natural language inputs and “A:” for the app command outputs.

Thus, given this short piece of prompt (the actual prompt is larger):

Q: show me all the hydrophobic residues please.
A: select ala,ile,leu,val,ile,phe,met; wireframe 50.
Q: show alanine 34 as spheres.
A: select ala 34; spacefill.
Q: show chain b in magenta.
A: select :b; color magenta.
Q: show chains A and B as green surface.
A: select :a,:b; isoSurface molecular; color green.
Q: hide all surfaces.
A: select all; isoSurface off.
Q: hide residues 33 to 99.
A: select 33-99; wireframe off; spacefill off; cartoon off.
Q: show residues 5 to 107 of chain P.
A: select 5-107:p; cartoon on; wireframe 50.
Q: please show me all the negative residues as sticks.
A: select asp,glu; wireframe 50.
Q: show me glutamine 154.
A: select gln and 154; wireframe 50.

When the user makes a request such as “please show all positive residues as blue sticks” this is added at the end of the prompt like this:

Q: please show all positive residues as blue sticks
A:

GPT-3 will then get that it must extend the text from that last “A: ” with a JSmol-consistent command. By removing the input up to the “A: ”, you are left with the commands that must be sent to JSmol.
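
The two steps just described (appending the request, then stripping the prompt from what comes back) can be sketched as follows. This is a minimal illustration, not the page's actual code: the function names `buildPrompt` and `extractCommand` are hypothetical, and `CORE_PROMPT` is a shortened stand-in for the real prompt.

```javascript
// Shortened stand-in for the page's few-shot prompt (the real one is longer).
const CORE_PROMPT = [
  "Q: show alanine 34 as spheres.",
  "A: select ala 34; spacefill.",
  "Q: show chain b in magenta.",
  "A: select :b; color magenta.",
].join("\n");

// Append the user's request as a new "Q:" line and leave an open "A: "
// for the model to complete.
function buildPrompt(corePrompt, request) {
  return corePrompt + "\nQ: " + request.trim() + "\nA: ";
}

// Keep only what the model wrote after the final "A: ".
// Handles both cases: the prompt echoed back, or only the continuation returned.
function extractCommand(prompt, returnedText) {
  let text = returnedText.startsWith(prompt)
    ? returnedText.slice(prompt.length)
    : returnedText;
  // Cut at the next "Q:" in case the model keeps inventing further pairs.
  const nextQuestion = text.indexOf("\nQ:");
  if (nextQuestion !== -1) text = text.slice(0, nextQuestion);
  return text.trim();
}
```

Cutting at the next “Q:” is a defensive assumption: a completion model may continue the pattern and write extra question-answer pairs after the command you asked for.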

A note that might seem obvious but is worth clarifying, given some comments I get. Of course GPT-3 doesn’t truly understand anything; it is just a statistical model trained on billions of tokens and steered by our prompt, which (hopefully) leads it to produce the right command. By “hopefully” I mean that there’s no guarantee the output will be exactly the command you need JSmol to execute. It may not even be a valid command, in which case JSmol will fail. That said, the quality and accuracy of the produced commands depend on GPT-3’s training, which we could possibly improve through fine-tuning, and especially on the few-shot examples provided in the prompt. You can see the full prompt by inspecting the source code of this very same example web page at
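
Since the output may not be a command at all, one option is a crude sanity check before passing it on. The sketch below is an assumption of mine, not part of the example page: the name `looksLikeJSmolScript` is hypothetical, and the allow-list contains only the command words that appear in the prompt excerpt above, whereas JSmol's scripting language has many more.

```javascript
// Command words taken from the prompt excerpt; extend this for your own prompt.
const KNOWN_COMMANDS = new Set([
  "select", "wireframe", "spacefill", "color", "isosurface", "cartoon",
]);

// Returns true only if every semicolon-separated statement starts with
// one of the known command words. This is a filter, not a parser.
function looksLikeJSmolScript(script) {
  const statements = script
    .split(";")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  if (statements.length === 0) return false;
  return statements.every((statement) => {
    const firstWord = statement.split(/\s+/)[0].toLowerCase();
    return KNOWN_COMMANDS.has(firstWord);
  });
}
```

A check like this rejects plain-language replies, but it cannot tell whether a well-formed command actually does what the user asked.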

Full source code, from speech recognition to GPT-3 processing and command execution

The source code of my example page (link above, you can see it with CTRL+U in most browsers) is unobfuscated and commented at the key lines, ready for you to inspect in full. But let me here add a few more explanations:

Source code of the app discussed here, available at

Libraries, basic HTML, and controls layout

The first set of lines loads three libraries: jQuery, used to simplify the asynchronous call to the GPT-3 API as I’ve shown in previous examples; Annyang, to simplify speech recognition as also shown previously; and JSmol, the JavaScript library for molecular visualization.

The script tag starting in line 14 only configures the JSmol visualization, which is then inserted into the HTML in line 35.

Lines 37 to 45 set up a div that contains a series of buttons, checkboxes, and paragraphs where the user interacts with the bot (I say “bot” because I like to think of this app as a “bot that operates JSmol”). Of these controls, the textbox allows the user to type text, the checkbox controls whether speech recognition is active or not (and when it is, the recognized text is sent to the textbox), and the paragraph on line 44 will display the commands as cast by GPT-3 and sent to the JSmol app.

JavaScript code for GPT-3

Starting at line 51 we define the function that processes an input text in natural language with GPT-3. After getting the API key that the user provided, the app appends its input to the core prompt (line 53, note we add “Q: ” and “A: ”). With an asynchronous call to GPT-3 (line 56) we end up getting back a chunk of data (line 58) that consists of our full prompt (i.e. core prompt plus new question) extended with the JSmol commands proposed by GPT-3.
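
For readers who want to see the shape of such a call, here is a minimal sketch. It is not the page's code: the page uses jQuery, while this version uses `fetch` against OpenAI's legacy completions endpoint, and it returns only the generated continuation rather than the full prompt. The function name `requestCompletion` and the parameter values are assumptions, and the model name is left to the caller.

```javascript
// Sends the prompt to a completion endpoint and resolves with the generated text.
// `fetchFn` is injectable so the function can be exercised without network access.
async function requestCompletion({ apiKey, model, prompt, fetchFn = fetch }) {
  const response = await fetchFn("https://api.openai.com/v1/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer " + apiKey,
    },
    body: JSON.stringify({
      model,           // caller-supplied; no particular model is assumed here
      prompt,
      max_tokens: 64,  // assumed limit: commands are short
      temperature: 0,  // assumed setting: favor the most likely command
      stop: ["\nQ:"],  // stop before the model starts a new question
    }),
  });
  if (!response.ok) {
    throw new Error("Completion request failed with status " + response.status);
  }
  const data = await response.json();
  return data.choices[0].text;
}
```

Setting a stop sequence at the next “Q:” keeps the model from continuing the question-answer pattern, which reduces the cleanup needed afterwards.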

Lines 59 to 63 clean up the text that was returned by GPT-3 inside data, to remove the URL, the Qs and As, and the starting prompt, so as to get only the produced commands in a clean form ready for JSmol, in the variable I called textupdate.

Line 64 sends the commands to JSmol, and line 66 displays them on screen.
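
These two steps can be sketched as below, assuming JSmol's `Jmol.script(applet, script)` call. The helper name `applyCommands` is hypothetical; the JSmol object and the output element are passed in as parameters so the function does not depend on specific page globals.

```javascript
// `jmolApi` is the global `Jmol` object provided by JSmol;
// `applet` is the applet object created when the page was set up.
function applyCommands(jmolApi, applet, commands, outputElement) {
  jmolApi.script(applet, commands);      // run the script in the viewer
  outputElement.textContent = commands;  // show the user what was executed
}
```

On a page this might be called as `applyCommands(Jmol, jmolApplet0, textupdate, document.getElementById("commands"))`, where `jmolApplet0` and the element id are placeholder names.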

Last inside this function, line 68 starts speech recognition by Annyang, which as we’ll see next is off while GPT-3 processes commands.

JavaScript code for speech recognition

Line 73 opens up a function that is called only once when the page is loaded. This function sets a callback function in Annyang (result) that tells it what to do when it has heard and recognized a phrase. After logging what it heard (lines 75 and 76), if the checkbox to apply spoken commands is on (line 77) this function displays the recognized text (line 78) and calls (line 80) the function that applies GPT-3 on the inputs, all explained above (and defined in line 51 of the code). Notice that we also switch off Annyang before calling the GPT function (line 79), so that it will stop recognizing speech until this is activated again (line 68).
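
The callback logic described above can be sketched as follows, using Annyang's `addCallback`, `abort`, and `start` methods. The wrapper name `setUpSpeech` and the three handler names are hypothetical; on a real page `recognizer` would be the global `annyang` object and the handlers would read the checkbox, fill the textbox, and call the GPT-3 function.

```javascript
// Wires speech recognition to the GPT-3 step.
// `recognizer` is expected to behave like the global `annyang` object.
function setUpSpeech(recognizer, { isEnabled, showText, processRequest }) {
  recognizer.addCallback("result", (phrases) => {
    const heard = phrases[0];  // candidate phrases, most likely first
    console.log("Heard:", heard);
    if (!isEnabled()) return;  // checkbox is off: ignore speech
    showText(heard);
    recognizer.abort();        // stop listening while the request is processed
    processRequest(heard);     // expected to call recognizer.start() when done
  });
  recognizer.start();          // assumed: begin listening right after setup
}
```

Pausing recognition during the request avoids sending a second spoken phrase to GPT-3 before the first command has been applied.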

