Smarter Conversations. Part 1 - Sentiment

This post starts a series of short articles on building smarter conversations with Microsoft Bot Framework. I will explore detecting sentiment (part 1), keeping the dialog open-ended (part 2), using a simple history engine to help the bot be context-aware (part 3), and recording a full transcript of a conversation to intelligently hand it off to a human operator.

Affirmation

Imagine the following dialog:

User >> I’m looking for screws used for printer assembly
Bot >> Sure, I’m happy to help you. 
Bot >> Is the base material metal or plastic?
User >> metal
Bot >> [lists a few recommendations]
Bot >> [mentions screws that can form their own threads]
User >> Great! I think that's what I need
Bot >> [recommends more information and an installation video]

It’s not hard to train an NLU service like LUIS to see a product lookup intent in the first sentence. A screw would be an extracted entity. Following a database lookup, the bot then clarifies an important attribute to narrow the search down to either plastic or metal screws and presents the results.
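
Here’s a minimal sketch of what that first leg could look like with the Bot Framework Node SDK and a LUIS recognizer. The ProductLookup intent, the ProductType entity, and the catalog lookup are hypothetical names of my own, not anything the SDK or LUIS ships with:

const builder = require('botbuilder');

const connector = new builder.ChatConnector({
    appId: process.env.MICROSOFT_APP_ID,
    appPassword: process.env.MICROSOFT_APP_PASSWORD
});
const bot = new builder.UniversalBot(connector);

// a LUIS app trained with a (hypothetical) ProductLookup intent and ProductType entity
bot.recognizer(new builder.LuisRecognizer(process.env.LUIS_MODEL_URL));

bot.dialog('productLookup', [
    function (session, args) {
        // "screws" comes back as an extracted entity
        const product = builder.EntityRecognizer.findEntity(args.intent.entities, 'ProductType');
        session.dialogData.product = product ? product.entity : 'screws';

        builder.Prompts.choice(session, 'Is the base material metal or plastic?', ['metal', 'plastic']);
    },
    function (session, results) {
        // a catalog lookup narrowed down by the selected material would go here
        session.endDialog(`Here are a few ${results.response.entity} ${session.dialogData.product} I would recommend...`);
    }
]).triggerAction({
    matches: 'ProductLookup'
});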

The user’s "Great! I think that's what I need" is a positive affirmation. It is not an intent that needs to be fulfilled, nor an answer to a question asked by the bot. And yet it presents an opportunity for a smarter bot to be more helpful and act as an advisor.

Sentiment

The Text Analytics API is part of Microsoft’s Cognitive Services offering. Its /text/analytics/v2.0/sentiment endpoint takes a single HTTP request and scores a text fragment or a sentence on a scale from 0 (negative) to 1 (positive).

I decided to make the expressed sentiment look like an intent for my bot and so I built a custom recognizer:

const request = require('request-promise-native');

const url = process.env.SENTIMENT_ENDPOINT;
const apiKey = process.env.SENTIMENT_API_KEY;

module.exports = {
    recognize: function (context, callback) {
        request({
            method: 'POST',
            url: `${url}`,
            headers: {
                'Content-Type': 'application/json',
                'Ocp-Apim-Subscription-Key': `${apiKey}`
            },
            body: {
                "documents": [
                    {
                        "language": "en",
                        "id": "-",
                        "text": context.message.text
                    }
                ]
            },
            json: true
        }).then((result) => {
            if (result && result.documents) {
                const positive = result.documents[0].score >= 0.5;

                callback(null, {
                    intent: positive ? 'Affirmation' : 'Discouragement',
                    score: 0.11 // <-- just above the threshold
                });
            } else {
                callback();
            }
        }).catch((reason) => {
            console.log('Error detecting sentiment: %s', reason);
            callback();
        });
    }
};

Context

Now I can attach a dialog that will be triggered when the bot detects an affirmation and no other intent scores higher. The default intent threshold is 0.1, and that’s why a detected sentiment is given a score of 0.11.

const sentiment = require('./app/recognizer/sentiment');

bot.recognizer(sentiment);

bot.dialog('affirmation', [
    function (session, args, next) {
        // ...
    }
]).triggerAction({
    matches: 'Affirmation'
});

Unlike other intents, however, a detected sentiment is not something the bot can properly react to on its own.

The bot needs to understand the context to react properly to an affirmation or discouragement expressed by the user. The bot also needs to be able to handle an interrupted dialog if an affirmation (or an expression of frustration) comes in the middle of a waterfall, for example.
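
To make it concrete, here’s a rough sketch of the shape that handler might eventually take; lastProduct is a hypothetical value that the product lookup dialog would have stored in session.conversationData earlier:

bot.dialog('affirmation', [
    function (session) {
        // hypothetical state saved by the product lookup dialog earlier in the conversation
        const product = session.conversationData.lastProduct;

        if (product) {
            // the user just agreed with a recommendation - offer the next helpful step
            session.endDialog(`Great! Here is more information and an installation video for ${product}.`);
        } else {
            // nothing to anchor the affirmation to - acknowledge and keep listening
            session.endDialog('Glad to hear it!');
        }
    }
]).triggerAction({
    matches: 'Affirmation'
});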

I will get to it in part 2. Stay tuned.

Cognitive APIs. Vision. Part 3.

I have used Cognitive Services from Microsoft (part 1) and IBM Watson Services (part 2) to read my avatar image. There are two more APIs that I would like to put to the test - Google Cloud Vision API and Clarifai.

Google Cloud Vision

I already had a developer account. To use the Cloud Vision API I only had to enable it in the console and generate myself a browser key. When you sign up, Google asks for your credit card but they promise not to charge it without your permission. They also give you $300 in free trial credit and 60 days to use it.

The API itself is clearly designed for extensibility.

It’s a single endpoint that can do different things based on your request. An image can be sent either as binary data or as a URL to a Google Cloud Storage bucket. You can send multiple images at once, and every image request can ask for a different type of analysis. You can also ask for more than one type of analysis for a given image.

Google can easily add new features without adding new APIs or changing the endpoint’s semantics. Take a look:

const key = '<use-your-own-key>';
const url = `https://vision.googleapis.com/v1/images:annotate?key=${key}`;

fetch(url, {
    method: 'POST',
    headers: new Headers({
        'Content-Type': 'application/json'
    }),
    body: JSON.stringify({
        'requests': [{
            'image': {
                'source': {
                    'gcsImageUri': 'gs://pveller/pavelveller.jpeg'
                }
            },
            'features': [{
                'type': 'LABEL_DETECTION',
                'maxResults': 10
            }, {
                'type': 'FACE_DETECTION'
            }]
        }]
    })
}).then(function(response) {
    return response.json();
}).then(function({ responses }) {
    const labels = responses[0].labelAnnotations;

    console.log(labels.map((l) => `${l.description} - ${l.score.toPrecision(2)*100}%`));
});

Here’s what I got:

[
    "hair - 95%",
    "person - 94%",
    "athlete - 88%",
    "hairstyle - 84%",
    "male - 79%",
    "sports - 72%"
]

A man who definitely cares about his hair, right? :) I am not sure where the sports and athlete bits came from. I also wonder if I would get more tags (like a microphone, for example) if I could ask for features with lower scores. The API doesn’t seem to allow me to lower the threshold. I asked for ten results but got only six back.

The face detection sent down a very elaborate data structure with coordinates of all the little facial features. Things like left eye, right eye, eyebrows, nose tip, and a whole lot more. The only thing is … you can’t see the left side of my face on my avatar.

Google also tries to detect emotions. Of all that it can see - anger, joy, sorrow, surprise - none came back with anything but VERY_UNLIKELY. You can also test an image for explicit content. Same VERY_UNLIKELY for my avatar.
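
For reference, here’s roughly how those signals come back. The emotion likelihoods ride along with FACE_DETECTION, and the explicit content check needs SAFE_SEARCH_DETECTION added to the features list; the field names below are my reading of the v1 response format:

// a sketch: the same images:annotate call as above, asking for face and safe search annotations
fetch(url, {
    method: 'POST',
    headers: new Headers({ 'Content-Type': 'application/json' }),
    body: JSON.stringify({
        'requests': [{
            'image': { 'source': { 'gcsImageUri': 'gs://pveller/pavelveller.jpeg' } },
            'features': [{ 'type': 'FACE_DETECTION' }, { 'type': 'SAFE_SEARCH_DETECTION' }]
        }]
    })
}).then((response) => response.json()).then(({ responses }) => {
    // per-face emotion likelihoods (VERY_UNLIKELY ... VERY_LIKELY)
    const face = responses[0].faceAnnotations[0];
    console.log(`joy: ${face.joyLikelihood}, sorrow: ${face.sorrowLikelihood}`);
    console.log(`anger: ${face.angerLikelihood}, surprise: ${face.surpriseLikelihood}`);

    // explicit content likelihoods
    const safe = responses[0].safeSearchAnnotation;
    console.log(`adult: ${safe.adult}, violence: ${safe.violence}`);
});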

Very pleasant experience but I honestly expected a little more from Google’s Vision API.

I expected more because I know Google does all kinds of crazy things with deep learning in their labs. With images as of two years ago and very recently with video. Maybe as those models mature, the Cloud Vision API will support more features? Time will tell.

Clarifai

The easiest setup experience by far!

I was ready to go in just a few seconds, no kidding! And it also felt like the fastest response from all the APIs I tried. Very easy and intuitive to use as well:

const key = '<use-your-own-key>';
const url = 'https://api.clarifai.com/v1/tag';

const data = new FormData();
data.append('url', image);
data.append('access_token', key);

fetch(url, {
    method: 'POST',
    body: data
}).then(function(response) {
    return response.json();
}).then(function({ results }) {
    const tags = results[0].result.tag;
    const labels = [...tags.classes.keys()].map((i) => ({
        'class': tags.classes[i],
        'confidence': `${tags.probs[i].toPrecision(2)*100}%`
    }));

    console.log(labels.map((l) => `${l.class} - ${l.confidence}`));
});

Here’s what I got back:

[
    "music - 100%",
    "singer - 99%",
    "man - 99%",
    "people - 98%",
    "competition - 98%",
    "musician - 98%",
    "one - 98%",
    "concert - 98%",
    "pop - 97%",
    "microphone - 97%",
    "portrait - 97%",
    "journalist - 96%",
    "press conference - 96%",
    "wear - 95%",
    "administration - 95%",
    "television - 94%",
    "stage - 94%",
    "performance - 93%",
    "recreation - 92%",
    "festival - 92%"
]

This is actually very close! Good feature detection with various plausible scenarios spelled out based on that. I would only question the absolute confidence in music and singer :) What about a… conference and a speaker?

Clarifai has another very interesting endpoint - Feedback. I haven’t used it but it seems that you can submit your own labels back to Clarifai and help them train and fine-tune the model. It won’t be your own classifier the way Watson offers; Feedback seems to be a crowdsourcing mechanism to train their main shared model(s). I only wonder how it will work without you having to specify the area of the image that each new label is attached to. In the case of my avatar, conference and speaker would attach to the whole image. What about more involved images? Maybe I am missing something…


There are a lot more computer vision APIs out there. Some are more generic and some are geared towards more specialized tasks like visual product search or logo recognition. Go give them a try!

It’s fascinating what kinds of things are just one HTTP request away.

Cognitive APIs. Vision. Part 2.

In part 1 of this blog series I had Microsoft’s Computer Vision analyze my avatar. Today I would like to ask Mr. Watson from IBM to do the same.

Setup

Same as last time, modern JavaScript and a modern browser.

Getting started with Watson APIs takes a few more steps but it’s still very intuitive. Once you’re all set with a Bluemix account, you can provision the service you need and let it see your images.

API

IBM had two vision APIs. AlchemyVision has recently been merged into Visual Recognition. If you use the original Alchemy endpoint, you will receive the following notice in the JSON response: THIS API FUNCTIONALITY IS DEPRECATED AND HAS BEEN MIGRATED TO WATSON VISUAL RECOGNITION. THIS API WILL BE DISABLED ON MAY 19, 2017.

The new unified API is a little weird. Similar to the Computer Vision API from Microsoft, it can process binary images or go after an image by its URL. Both need to be submitted as multipart/form-data though. Here’s an example from the API reference:

IBM Watson Visual Recognition API

It’s the first HTTP API that I’ve seen where I would be asked to supply JSON parameters as a file upload. You guys? Anyway. Thanks to the Blob object I can emulate multipart file upload directly from JavaScript.

Another puzzling bit is the version parameter. There’s a v3 in the URL but you also need to supply a release date of the version of the API you want to use. Trying without it gives me 400 Bad Request. There’s no version on the service instance that I provisioned so I just rolled with what’s in the API reference. It worked.

I also couldn’t use fetch with this endpoint. This time it’s not on Watson though. My browser would happily send Accept-Encoding with gzip in it and IBM’s nginx would gladly zip up the response. Chrome dev tools can deal with it but fetch apparently can’t. I get SyntaxError: Unexpected end of input at SyntaxError (native) when calling .json() on the response object.

Not sending Accept-Encoding would help but it’s one of the headers you can’t set. I had to resort to good old XHR.

And Here We Go

// my avatar
const image = 'http://media.licdn.com/mpr/mpr/shrinknp_400_400/p/7/000/22b/22b/32f088c.jpg';

const key = '<use-your-own-key>';
const url = `http://gateway-a.watsonplatform.net/visual-recognition/api/v3/classify?api_key=${key}&version=2016-05-20`;

const parameters = JSON.stringify({
    'url': image,
    'classifier_ids': ['default'],
    'owners': ['IBM'],
    'threshold': 0.2
});

const data = new FormData();
data.append('parameters', new Blob([parameters], {'type': 'application/json'}));

const request = new XMLHttpRequest();

request.onload = function () {
    const data = JSON.parse(request.response);
    const tags = data.images[0].classifiers[0].classes;

    const describe = (tag) => `${tag.class}, ${Math.round(tag.score*100)}%`;

    console.log(tags.map(describe));
};

request.open('POST', url, true);
request.send(data);

The response from Watson?

Person, 100%

Yep. That’s it. That’s all the built-in classifier could tell me. You can train your own classifier(s) but they all appear to be basic. There is no feature detection that would allow it to describe images. I tried to see all the classes in the default classifier but the discovery endpoint returns 404 for default. I guess I will have to check back later ;)


I have more computer vision APIs to try. Stay tuned!

Cognitive APIs. Vision. Part 1.

I’ve been actively looking at machine learning lately. Fascinating applications in day-to-day life! Often unexpected. Always amazing. More and more accessible every day. Google’s Motion Stills blew me away the other day. Classification of motion vectors and biased estimation models (in a temporally consistent manner, no less) - a lot of science and novel ideas in a free consumer mobile app. Enabled and powered by machine learning.

You have probably noticed a new breed of APIs popping up all over the web. Some call them Machine Learning APIs, others call them Cognitive Services, and some simply call it Watson. Pre-trained models operationalized with an API layer, plus APIs to train your own.

This blog post series offers you a short tour of these new machine learning powered APIs. I am going to start with Vision, and today I am tasting Microsoft Cognitive Services (aka Project Oxford).

Setup

I am going to use JavaScript and my browser. These are all HTTP APIs so I should be able to just talk to them with very little overhead or ceremony. Besides, since the intent is to taste (and test) the APIs, I figured I would also take the latest JavaScript and browser APIs for a spin. No transpilation. No polyfills. No external dependencies. Hence a fair warning: the examples are likely to only work in the latest evergreen browsers. I am using connect with serve-static as my web server (here’s how).
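
For completeness, that web server is about as minimal as it gets - a quick sketch with connect and serve-static, assuming the examples sit in the folder being served:

const connect = require('connect');
const serveStatic = require('serve-static');

// serve the current folder on http://localhost:8080
connect()
    .use(serveStatic(__dirname))
    .listen(8080, () => console.log('Serving on http://localhost:8080'));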

Let’s see how much computer vision can see in my avatar.

Pavel Veller

Describe

According to the documentation, the Describe endpoint generates a description of an image in human readable language with complete sentences. The description is based on a collection of content tags, which are also returned by the operation.

Getting started is just a few clicks and the API reference is very transparent. Here we go:

const explain = function(what, confidence) {
    return `${what} with ${Math.round(confidence*100)}% confidence`;
};

// my avatar
const image = 'http://media.licdn.com/mpr/mpr/shrinknp_400_400/p/7/000/22b/22b/32f088c.jpg';

// Microsoft: Describe
const url = 'https://api.projectoxford.ai/vision/v1.0/describe?maxCandidates=3';
const key = '<use-your-own-key>';

fetch(url, {
    method: 'POST',
    headers: new Headers({
        'Content-Type': 'application/json',
        'Ocp-Apim-Subscription-Key': key
    }),
    body: JSON.stringify({
        'url': image
    })
}).then(function(response) {
    return response.json();
}).then(function({description: {captions, tags}}) {
    console.log(captions.map((c) => explain(c.text, c.confidence)));
    console.log(tags);
});

Very straightforward with no surprises. Everything just worked. Here’s what the Describe API sees in my avatar:

A man holding a cell phone (19% confidence)
A man holding a phone (17% confidence)
A young man holding a cell phone (12% confidence)

It definitely sees a man. It’s not sure whether a man is young. It probably doesn’t know what a microphone is but it vaguely remembers that phones used to look like this:

Old Phone

Here are all the tags:

["person", "man", "indoor", "holding", "looking", "cellphone", "hand", "phone", "young", "laptop", "sitting", "standing", "table", "boy", "computer", "shirt", "using", "brush", "red", "blue", "people"]

Tags

Another API endpoint can report tags and also provide the level of confidence in each. I sent the same request to /vision/v1.0/tag and here’s what I got back:

person (100%)
man (95%)
indoor (94%)
microphone (22%)
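
The request itself is the same fetch as in the Describe example, just pointed at the tag endpoint. Here’s a sketch; the response shape (a tags array with a name and a confidence per tag) is my reading of what came back:

// same headers and body as the Describe call, reusing key and image from above
const tagUrl = 'https://api.projectoxford.ai/vision/v1.0/tag';

fetch(tagUrl, {
    method: 'POST',
    headers: new Headers({
        'Content-Type': 'application/json',
        'Ocp-Apim-Subscription-Key': key
    }),
    body: JSON.stringify({ 'url': image })
}).then((response) => response.json()).then(({ tags }) => {
    console.log(tags.map((t) => `${t.name} (${Math.round(t.confidence * 100)}%)`));
});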

I wonder why microphone wasn’t detected by the Describe endpoint. I would expect that Describe gets the tags from Tag and then uses language generation algorithms to build the description. Apparently not.

Analyze

One more API endpoint can process an image from many angles at once. It will report the most likely description and will send down the tags. You can also ask it to detect faces and more. I asked for Description, Tags, and Categories. This one does feel like an aggregation: I got the same set of scored tags as from the Tag endpoint, the same most likely description (with a cell phone), and the longer list of tags that Describe returned. The category was identified as:

people_young with 81% confidence
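
The Analyze call is the same request once again, with the aspects you want passed in the visualFeatures query parameter. Here’s a sketch; the exact response fields are my reading of the v1.0 API reference:

// ask the Analyze endpoint for several aspects at once, reusing key and image from above
const analyzeUrl = 'https://api.projectoxford.ai/vision/v1.0/analyze?visualFeatures=Description,Tags,Categories';

fetch(analyzeUrl, {
    method: 'POST',
    headers: new Headers({
        'Content-Type': 'application/json',
        'Ocp-Apim-Subscription-Key': key
    }),
    body: JSON.stringify({ 'url': image })
}).then((response) => response.json()).then((analysis) => {
    console.log(analysis.categories);   // e.g. [{ name: 'people_young', score: ... }]
    console.log(analysis.description);  // captions plus the longer tag list
    console.log(analysis.tags);         // scored tags, same as the Tag endpoint
});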

Summary

The Microsoft Vision API analyzes one image at a time. You can either upload the binary or point it at a publicly accessible URL. Depending on what you’re after, you can get different results. I am still puzzled by the difference in reported tags. It’s capable of working with domain models to do more specialized detection but right now has only one trained - celebrities. I am sure Microsoft will deploy more and will likely let you train your own. I don’t know when, but I know that there are other vision APIs that already let you do that.
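
I haven’t tried the domain models myself, but as I read the API reference the celebrities model hangs off a models route. Treat this as an untested sketch:

// untested sketch: domain-specific analysis with the built-in 'celebrities' model,
// reusing key and image from the Describe example
const celebrityUrl = 'https://api.projectoxford.ai/vision/v1.0/models/celebrities/analyze';

fetch(celebrityUrl, {
    method: 'POST',
    headers: new Headers({
        'Content-Type': 'application/json',
        'Ocp-Apim-Subscription-Key': key
    }),
    body: JSON.stringify({ 'url': image })
}).then((response) => response.json()).then((result) => {
    // the result object carries the model's detections for the image
    console.log(result);
});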

Next time I will talk to Mr. Watson. Stay tuned!