We’ve been developing our first natural language interface recently as part of our project looking at decision making. We want users to be able to simply ask questions such as, “When shall I go for a run?”, and get an answer that combines weather data with their preferences. The end goal is to let people interact with our data as naturally and unobtrusively as possible. You can find out more about how we actually make the decision here.

Specifically, we’ve been developing for Amazon’s new digital assistant Alexa. This is mainly because Amazon is one of the first of the major tech companies to open up its ecosystem to developers, via the Alexa Skills Kit. We’ve been running our new app (or skill, as they’re known in the context of Alexa) on an Amazon Echo, which is essentially a smart radio that speaks to you.

The Alexa Skills Kit has been available to the public for less than a year. It’s self-described as a “preview” and it seems like it’s still in the process of being developed.

Teaching Alexa to find out what she doesn’t know

It seems that a lot of the skills that have been developed for Alexa so far consist of “one-shot” questions. Such an interaction might go something like this.

You: Alexa, who is the UK Met Office?
Alexa: The Met Office is the United Kingdom’s national weather service.

It consists of one so-called utterance by the user and one response from Alexa. We really wanted to push what Alexa could do by developing dialogues with back-and-forth interactions. While some other projects have implemented this kind of functionality, we weren’t able to find any systems which we thought were flexible, extensible and easy to use.

So we made our own! Now we can have a conversation like this:

  1. You: “Hi Alexa, get the Met Office”
  2. Alexa: “Hi. Welcome to the Met Office Alexa app. How can I help you?”
  3. You: “I don’t understand, what can you do?”
  4. Alexa: “I can help you decide what activity you want to do based on the weather. For instance, you can ask me something like, ‘when should I go for a run?’”
  5. You: “When should I go for a run?”
  6. Alexa: “Where do you want to do that?”
  7. You: “I don’t really understand what you mean…”
  8. Alexa: “I need to know where you are to help you.”
  9. You: “In Exeter.”
  10. Alexa: “The best time for your run is 5PM on Saturday.”

We’ve done this by using a very object-oriented approach to our code. Our two main objects are the Session and the SlotInteraction.

The Session receives information from and sends information to the Alexa interface, and is generally responsible for keeping track of the interaction with the user. It asks questions until it knows what kind of functionality the user is trying to use (which happens at line 5 of the conversation above). Once it knows the user is asking about when to run, it finds out what information it needs for this calculation from a configuration file.
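As a rough illustration, that configuration lookup might look something like the sketch below. The file name, the keys and the Session constructor here are all illustrative rather than our actual code.

```python
# A sketch, assuming a JSON config file that maps each kind of question
# (intent) to the pieces of information it needs. Names are illustrative.
import json

with open("intents_config.json") as f:
    INTENT_CONFIG = json.load(f)
# e.g. {"WhenToRun": {"slots": ["activity", "location"]}}

class Session:
    def __init__(self, request, state):
        self.request = request             # the latest Alexa request
        self.state = state                 # what we know from previous turns
        self.intent = state.get("intent")  # set once we know what the user wants

    def required_slots(self):
        """Look up which pieces of information the current intent needs."""
        return INTENT_CONFIG[self.intent]["slots"]
```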

Each bit of information needed to answer the question is represented by a SlotInteraction. Each SlotInteraction is responsible for finding out its value by

  1. Finding out if the user has said it in the initial utterance
  2. Looking in the user’s profile to see if they have a default preference
  3. Asking the user

in order of precedence, as sketched below.
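Here’s a minimal sketch of that resolution order. The class layout and method names are illustrative, not lifted from our real code, which carries a lot more Alexa plumbing.

```python
# A minimal, illustrative SlotInteraction: try the utterance, then the user's
# stored preferences, then fall back to asking the user.

class SlotInteraction:
    def __init__(self, name, question):
        self.name = name          # e.g. "location"
        self.question = question  # what to ask if we still don't know the value

    def resolve(self, utterance_slots, user_profile):
        """Return (value, question_to_ask); one of the two is always None."""
        # 1. Did the user already say it in the initial utterance?
        if utterance_slots.get(self.name):
            return utterance_slots[self.name], None
        # 2. Does the user's profile hold a default preference?
        if user_profile.get(self.name):
            return user_profile[self.name], None
        # 3. Otherwise we need to ask the user.
        return None, self.question
```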

In the case above there are three SlotInteractions needed to make a recommendation:

We really like this model for a few reasons:

Who needs servers anyway?

We’ve ended up using quite an interesting architecture for this project. Every time Alexa hears something, it constructs a small data packet with the speech and some meta-data and sends it to our code. We then run this through our program and send back a similar packet containing a speech response and meta-data. The program has no real persistence between interactions with the user - for instance, the code is called five times in the example conversation.

We can persist what has happened on previous interactions (e.g. remember that the user is asking about a run) by including this information in the response data packet we pass to Alexa. She then passes it back to us with the next bit of user speech.
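In practice that just means echoing our conversation state in the sessionAttributes field of the Alexa response JSON. A hedged sketch of that idea (the helper name and the state keys are made up for illustration):

```python
# A sketch of how conversation state rides along in the response. The helper
# name build_response and the example state keys are illustrative.

def build_response(speech_text, state, end_session=False):
    """Wrap the speech and our conversation state in the Alexa response format."""
    return {
        "version": "1.0",
        # Whatever we put here comes back with the user's next utterance,
        # which is how we "remember" that they asked about a run.
        "sessionAttributes": state,
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": end_session,
        },
    }

# For turn 6 of the conversation above, we might return something like:
# build_response("Where do you want to do that?",
#                {"intent": "WhenToRun", "known_slots": {"activity": "run"}})
```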

Our code is run in an Amazon Web Services (AWS) Lambda Function. A Lambda Function is essentially an on-demand program that takes an input (i.e. the Alexa data packet with the user speech) and returns an output (i.e. the response data packet with the Alexa speech). An ad-hoc server, operating system, etc. are automatically set up for the Lambda Function every time it’s called, which makes our life much simpler. It also makes life much cheaper, as we only have to pay for what we use!
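The entry point that glues these pieces together is roughly the shape below, assuming the build_response helper from the previous sketch and the Session sketched earlier (with a hypothetical handle() method); again, this is illustrative rather than a copy of our skill.

```python
# A sketch of the Lambda entry point: Alexa's request arrives as `event`,
# and whatever we return is spoken back. Session/handle() are hypothetical.

def lambda_handler(event, context):
    # State from the previous turn (if any) comes back in session attributes.
    state = event.get("session", {}).get("attributes") or {}

    session = Session(event["request"], state)
    speech, new_state, finished = session.handle()

    return build_response(speech, new_state, end_session=finished)
```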

We also store all the user information, speech, and general config files in an AWS DynamoDB database, which, again, doesn’t require us to explicitly set up a server, OS, etc.
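Reading a user’s stored defaults back out is then only a couple of lines with boto3. The table name and attribute names below are placeholders, not our actual schema:

```python
# A sketch of looking up a user's stored default for a slot in DynamoDB.
import boto3

dynamodb = boto3.resource("dynamodb")
profiles = dynamodb.Table("alexa-user-profiles")  # placeholder table name

def get_default(user_id, slot_name):
    """Return the user's stored default for a slot, or None if there isn't one."""
    item = profiles.get_item(Key={"userId": user_id}).get("Item", {})
    return item.get("defaults", {}).get(slot_name)
```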

What next?

It looks like natural language is set to be a major disruption in user experience design. In our exercise, we’ve been focusing on how to deliver content to the public. However, it’ll be interesting to see how pervasive it becomes over the next few years. It might be that in five years’ time, it will combine with AI to help get rid of the professional “housekeeping” tasks which take up so much valuable time. As a scientist, I’m waiting for the day when I can spend more time doing science and less time getting ready to do science:

Me: “Igor, fetch me the data for rainfall in Exeter”
Igor the AI lab assistant: “Any particular time?”
Me: “Certainly Igor, I’d like the data for December 1997 please”
Igor: “Here you go Niall. /path/to/my/data. Will that be all?”
Me: “Yes, thank you Igor”
Igor: “…Actually, it’s Eye-gore…”