Today I will begin with something I have had pending for several months: writing a little about the things I discover while developing. New libraries, methods, design patterns, and in general things I find along the way. I have noticed that a lot of people write about this in English, but I don’t know how many write in Spanish. So I have written some posts about software development, Linux administration and similar topics in the Spanish version of my blog, partly to contribute a little to the documentation on various topics in Spanish, and partly so I don’t forget what I’m learning; after some months I have decided to translate what I have written so far into English. Here I will be writing about some topics that I find interesting. Keep in mind that this is a translation of several posts from Spanish, so some information might be out of date or incorrect; please contact me if you find anything that can be improved in this or other posts.
Throughout this set of posts, we will work to create one of the simplest Skills to date: an RSS feed reader. We will work with Python, Flask and the libraries provided by Amazon, so some knowledge of Python is more than enough. At the end of this series, the Skill will be deployed as a web service, although alternatively you can remove Flask as a dependency and run the skill in a Lambda function, which is provided free of charge when creating a Skill.
What is a Skill?
A Skill is, roughly speaking, an “app” for Alexa. Alexa users can start a Skill and interact with it. Most skills will remain open, listening for responses and interactions with the user, until the user explicitly ends the session by saying “stop” or similar. There are other methods of interacting with Skills, but this is the most common for most developments.
How do users interact with a skill?
Before coding anything, it is necessary to know roughly how our potential users will interact with the Skill. This process can be summarized as follows:
- The user decides to open the Skill using the invocation name: Each skill must have one, and it is the sentence that the user must say for Alexa to send the necessary data to our web service.
- Once our Skill is identified, Amazon sends a LaunchRequest to our web service. Amazon can send several types of requests for our web service to react to. Every time the user calls the skill by its invocation name, a LaunchRequest will be generated.
- Our service must receive the POST request, check that the call was made by one of Amazon’s servers, and send a response back to the user. This response can include text-to-speech, audio, a card (a widget for devices with a screen) and various other directives.
- The user will continue to interact with the Skill by answering the questions it asks, or by being guided by the skill to call other intents. In Alexa, an intent is a word or sentence that triggers an action in our Skill. Example intents might be PlaySong (which might be triggered when someone asks the Skill to play a song), Help (triggered when the user asks for help), and so on.
- Our Skill receives the intent generated by the user, processes it, and responds back; steps 4 and 5 are repeated until the user decides that’s enough of our Skill and asks Alexa to stop.
- Asking Alexa to stop, or deciding, in our Skill’s logic, that it is time to stop, will generate a SessionEndedRequest, which will cause the skill to close the session with the user. This is important because from this point on, the user will no longer be able to interact with the skill unless they call it again using its invocation name, and any persistent data stored in the session will be lost as well.
For more information on the different kinds of requests Alexa sends to Skills, the reference in the Alexa Skills Kit documentation is quite good.
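The request/response cycle described above can be sketched with plain Python. The JSON shapes below follow the Alexa request and response format; the welcome text and the dispatch logic are illustrative, not the final skill code.

```python
def build_response(text: str, end_session: bool = False) -> dict:
    """Build the minimal JSON body Alexa expects back from our web service."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }

def handle_alexa_request(event: dict) -> dict:
    """Dispatch an incoming Alexa request by its type (illustrative sketch)."""
    request_type = event["request"]["type"]
    if request_type == "LaunchRequest":
        # Sent when the user opens the skill by its invocation name.
        return build_response("Welcome to the RSS reader. Ask me for the latest news.")
    if request_type == "IntentRequest":
        # Sent when the user's utterance matched one of our intents.
        intent = event["request"]["intent"]["name"]
        return build_response(f"You triggered the {intent} intent.")
    if request_type == "SessionEndedRequest":
        # The session is over; Alexa ignores any speech in this response.
        return {"version": "1.0", "response": {}}
    return build_response("Sorry, I did not understand that.")
```

A `LaunchRequest` event like `{"request": {"type": "LaunchRequest"}}` would come back with the welcome speech and `shouldEndSession` set to false, keeping the session open for the next intent.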
For this example, we will create a skill that has a very simple purpose. It will be responsible for doing the following tasks:
- It can list the last x news items from our RSS site of choice. Naturally, the number will depend on the RSS feed.
- If we ask Alexa to read us item 5, it will read the whole text of that article.
- To prevent Alexa from reading too much text and putting the user to sleep, we will make our Skill ask every so often whether it should continue reading, to make the experience more comfortable.
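The two core tasks above can be sketched with the standard library alone: parsing item titles out of an RSS 2.0 feed, and splitting an article into chunks so the skill can pause and ask whether to continue. The sample feed and the chunk size are made up for illustration; the real skill would fetch the feed over HTTP.

```python
import xml.etree.ElementTree as ET

def list_items(rss_xml: str, limit: int = 5) -> list[str]:
    """Return the titles of the first `limit` items in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    titles = [item.findtext("title", default="") for item in root.iter("item")]
    return titles[:limit]

def chunk_text(text: str, words_per_chunk: int = 50) -> list[str]:
    """Split an article into chunks so the skill can pause between them
    and ask the user whether it should keep reading."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

# A tiny hand-written feed, just to exercise the parser.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example news</title>
  <item><title>First headline</title><description>Body one</description></item>
  <item><title>Second headline</title><description>Body two</description></item>
</channel></rss>"""
```

With this feed, `list_items(SAMPLE_FEED)` returns `["First headline", "Second headline"]`, and `chunk_text` on a long article gives the skill natural break points for its "should I continue?" question.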
The following are the requirements to make this project:
- An Amazon developer account: If you don’t have one yet, you can create one here.
- A VPS with a domain pointing to it and an SSL certificate: Technically the code shown here can run without problems in a Lambda function, but I have never tested it. For the SSL certificate you can use Let’s Encrypt.
Interestingly, for this kind of Skill you are not required to have an Alexa device. But of course, if you want to enjoy (and not just test) the Skill, one will be necessary.
Step 1: Creating the Skill
The first thing we are going to do, before we even touch any code, is create the skill in the Amazon Developers portal. This is essential, since it will be the connection point between our Skill and any Alexa device:
- Go to the Alexa console on Amazon’s developer site. Here you will see a table with all the Skills already created in the account. If we have not created any before, naturally nothing will be displayed. To create a new skill, look for a button labeled “Create skill”, just after the search field on the website.
- When creating a new skill, a web page appears that can be somewhat confusing. Many controls are not labeled correctly, although they can still be selected, even with the keyboard, so you have to pay attention when working with this page. The first thing we will probably want to fill in is the name of the skill, which at this point can contain no more than 50 characters. You can use the name of the RSS feed you want to read, for example. Naturally, if this Skill were to go into production someday, we would have to change the name so as not to violate Amazon’s skill name policy. It is important to note that this is the name of the skill, not the invocation name. The invocation name may be different from this one and must contain at least two words, not counting articles.
- In the “Default locale” section, the language of the developer’s Amazon account is normally preselected. In my case it is set to Spanish (MX). A screen reader announces this control as plain text, but it is not: when you press the space bar on the locale selector, the list of languages is displayed. You can select a different language by using the arrow keys and pressing space on the language you want. When you do this, the list collapses again and only the selected language is displayed. If you want to include support for more than one language or locale in a skill, you can add more languages later.
- In the “Choose a model” section, we have to choose the voice model our skill will use. A voice model maps the sentences and words of users to Alexa intents, so that the developer does not have to generate the voice model by hand. At this point it is easiest to choose the “Custom” voice model to start with. To choose it with the keyboard, simply locate the header with the “Custom” model and press space or enter.
- In the section “Choose a method to host your skill’s backend resources” we have to select how we want to host our Skill. Basically we have two options: “Alexa Hosted” (for Node and Python), which gives you a free amount of resources under a Lambda function and everything ready to start working, or “Provision your own”, where you provide the URL of a server that Alexa will call every time a user interacts with the Skill. In my case I will choose the “Provision your own” option, but in theory the code we are going to write should run perfectly well under Lambda. Again, to choose either of these options you just have to focus the corresponding header and press enter.
- Maybe this is the weirdest step for screen reader users, but here we go: to save the changes, you have to scroll back to the name of your application, and from there use the up arrow until you find the button called “create-skill-save-btn”. Alternatively, you can use the shortcut Shift+B to navigate to the previous button, which should be this one. Whichever method you use, this button must be pressed to save the data of our Skill.
- Once these preferences have been saved, we will be asked for our template. That is, we can import a skill from a git repository, or take a template from a Skill that someone has made and shared on Amazon Developers. For this example I’ll leave the default option, which is starting from scratch. To continue, just click on the “Continue with Template” button.
- We’ve done it! Once we click on that button, Amazon lets us know (if you use a screen reader, the information is at the bottom of the page) that it is creating our skill, and tells us that the process may take several minutes. In practice it takes less than a minute, so we will soon be able to continue with step 2.
Step 2: configuring our Skill
Once the Skill has been created in the Amazon portal, the first things to modify are the invocation name and the endpoint, so that we can focus on the code a bit later. The invocation name tells Alexa what your Skill will be called in front of the user, while the endpoint tells Alexa where to send the requests when the user asks your Skill for something.
If you use a screen reader, here is another weird tip for navigating the Amazon portal: There is a combo box from where you can access the Skill language, and in general, the language preferences (from where you can add more variants or new languages). Just below that combo box, there is the Skill navigation menu, from where all important actions are performed (edit the interaction model, change the invocation name, the endpoint, add or remove interfaces, etc). Whenever I refer to the skill menu I will refer to this menu specifically.
To change the endpoint, which is the address Alexa will call once the user requests something, do the following:
- In the skill menu, select “endpoints”. This will take you to a website where there are two radio buttons that ask you by which method you want your skill to communicate with the code.
- Make sure you select the second button, labeled “HTTPS”.
- Fill in at least the first text box with the address of your endpoint. According to Amazon’s requirements, it must be an SSL endpoint.
- Once completed, after that field there is another field asking for the type of your SSL certificate. You can usually select the option “My development endpoint has a certificate from a trusted certificate authority”.
- When you are done, click on the “Save endpoint” button to save these settings. If you are using a screen reader, you may want to use the shortcut Shift+B again to move to the previous buttons.
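The HTTPS requirement ties back to the verification step mentioned earlier: our service must check that each call really comes from Amazon. Part of that check, as documented for Alexa web services, is validating the format of the `SignatureCertChainUrl` header. The sketch below covers only that format check; a real skill must also download the certificate and verify the request signature, which the official SDKs and libraries handle for you.

```python
from posixpath import normpath
from urllib.parse import urlparse

def cert_chain_url_looks_valid(url: str) -> bool:
    """Check the format rules Amazon documents for the SignatureCertChainUrl
    header: HTTPS scheme, host s3.amazonaws.com, port 443 (or none), and a
    normalized path starting with /echo.api/. This is only the first step of
    request verification; the certificate and signature must also be checked."""
    parts = urlparse(url)
    return (
        parts.scheme.lower() == "https"
        and (parts.hostname or "").lower() == "s3.amazonaws.com"
        and parts.port in (None, 443)
        # Normalize the path so tricks like "/echo.api/../" are rejected.
        and normpath(parts.path).startswith("/echo.api/")
    )
```

For example, `cert_chain_url_looks_valid("https://s3.amazonaws.com/echo.api/echo-api-cert-4.pem")` passes, while a plain HTTP URL or a different host is rejected.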
Then, to change the skill’s invocation name, do the following steps:
- From the skill menu, select “Invocation”.
- On the following page, you will see only a text field where you must type the invocation sentence that you want the skill to accept. As a note, keep in mind that it should be a name that is not easy for Alexa to confuse, preferably in the language of the skill. If you want to use letters as part of acronyms, for example for “RSS”, use capital letters separated by spaces. My example is simply “R S S news”. Note that your invocation name cannot contain words like “skill”, “app”, “alexa” and so on.
- Again, if you use a screen reader, you should look for the previous buttons. You have to press two buttons: first the one called “Save model” and then “Build model”. After clicking “Build model”, you can search for a previous heading to find out whether the model has been saved successfully. Normally you will find “Skill Saved Successfully” if everything went well, or another message with details about the error otherwise. Similarly, when building the model you can find in a heading, always before the current focus, whether the skill has been built successfully or not.
As you have already seen, the voice interaction model is important; it is the most important piece we have in the skill at the moment. The interaction model, which must be saved and rebuilt after every few changes, maps intents to what users say, and generates data that helps Alexa identify what is being said. That is why you have to rebuild the model every time you change anything that has even the slightest relation to what the user says (like the invocation name, or pretty much anything else in the model).
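Under the hood, the model the console builds is a JSON document. A minimal sketch of its shape, written here as a Python dict, looks like the following; the `ReadItemIntent` name, its slot and its sample utterances are hypothetical examples for our RSS skill, not something the console generates for you.

```python
# A minimal interaction model, mirroring the JSON structure the console builds.
# The custom intent, slot and samples below are illustrative.
INTERACTION_MODEL = {
    "interactionModel": {
        "languageModel": {
            "invocationName": "R S S news",
            "intents": [
                {
                    "name": "ReadItemIntent",
                    "slots": [{"name": "itemNumber", "type": "AMAZON.NUMBER"}],
                    "samples": [
                        "read item {itemNumber}",
                        "read me the article {itemNumber}",
                    ],
                },
                # Built-in intents come with their own utterances.
                {"name": "AMAZON.StopIntent", "samples": []},
                {"name": "AMAZON.HelpIntent", "samples": []},
            ],
        }
    }
}

def sample_utterances(model: dict, intent_name: str) -> list[str]:
    """Return the sample utterances mapped to one intent in the model."""
    for intent in model["interactionModel"]["languageModel"]["intents"]:
        if intent["name"] == intent_name:
            return intent["samples"]
    return []
```

This is why rebuilding matters: the samples under each intent are exactly the sentences Alexa trains on, so any change to them, or to the invocation name, only takes effect after a build.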
That’s it for now. We have configured, from the Amazon developer portal, a skill that we are about to use. In the next article, we will explore a bit more of the interaction model we have built, what we can do with it, and something about the Alexa SDK.