Image Tagging Automation with Computer Vision

I have recently presented my explorations of computer vision APIs(part 1, part 2, and part 3) on the AI meetup in Alpharetta. This time I decided to do something useful with it.

Image Tagging

When you work with digital platforms (be that content management, e-commerce, or digital assets) you can’t go far without organizing your images. Tagging makes your assets library navigable and searchable. Descriptions are a great companion to the visual preview and can also serve as the alternate text. WCAG 2.0 requires non-text content to come with a text alternative for the very basic Level A compliance.

Computer Vision

When I played with the trained computer vision models from different vendors, I realized that I can get a good set of tags from either one of the APIs and some would even try to build a description for me. The digital assets management vendors started playing with this idea as well. Adobe, for example, has introduced smart tags in the latest release of AEM. Maybe I can do the same using Computer Vision APIs and integrate with a digital product that doesn’t have that capability built in yet? Let’s try with Sitecore.

Automation

I am going to use Computer Vision from Microsoft Cognitive Services and the Habitat demo site from Sitecore. I am also going to need Powershell Extensions to automate everything.

We will need the URL of the computer vision API, the binary array of the image, the Sitecore item representing the image to record the results on, and a little bit of Powershell magic to glue it all together.

Here’s the crux of the script where I call into the computer vision API:

1
2
3
4
5
6
7
8
9
10
11
$vision = 'https://api.projectoxford.ai/vision/v1.0/analyze'
$features = 'Categories,Tags,Description,Color'

$response = Invoke-WebRequest `
-Uri "$($vision)?visualFeatures=$($features)" `
-Body $bytes `
-ContentType "application/octet-stream" `
-Headers @{'Ocp-Apim-Subscription-Key' = '<use-your-key>'} `
-Method 'Post' `
-ErrorAction Stop `
-UseBasicParsing | ConvertFrom-Json

It’s that simple. The rest of it is using Sitecore APIs to read the image, update the item with tags and descriptions received from the cognitive services, and also a try/catch/retry loop to handle the API’s rate limit (in preview it’s limited to 5000/month and 20/minute). You can find the full script on github.

20/20

Some images were perfectly deciphered by the computer vision API as you can see in this example (the %% are the confidence level reported by the API):

Computer Vision can clearly see what's in the image

Legally Blind

But some others would puzzle the model quite a bit:

Computer Vision mistakes a person for a celebrity and the cell phone for a hot dog

Not only there’s no Shu Qi in the picture above, there’s definitely no hot dog and no other food items. Granted, the API did tell me that it was not really sure about what it could see. Probably a good idea to route images like that through a human workflow for tags and description validation and correction.

Domain Specific Models

The problem with seeing the wrong things or not seeing the right things in a perfectly focused and lit image is … lack of training. Think about it. There are millions and millions of things that your vision can recognize. But you have been training it all your life and the labeled examples keep coming in on a daily basis. It takes a whole lot of labeled images to train a generic computer vision model and it also takes time.

You can get better results with domain specific models like that offered by Clarifai, for example. As of the time of this writing you can subscribe to Wedding, Travel, and Food models.

Domain Specific Computer Vision model from Clarifai

I am sure you’ll get better classification results out of these models than out of a generic computer vision model if your business is in one of these industries.


Next time I will explore Text Analytics API and will show you how it can help tag and generate keywords for your content.

Cognitive APIs. Vision. Part 3.

I have used Cognitive Services from Microsoft (part 1) and IBM Watson Services (part 2) to read my avatar image. There are two more APIs that I would like to put to the test - Google Cloud Vision API and Clarifai.

Google Cloud Vision

I already had a developer account. To use the Cloud Vision API I only had to enable it in the console and generate myself a browser key. When you sign up, Google asks for your credit card but they promise not to charge it without your permission. They also give you $300 in free trial credit and 60 days to use it.

The API itself is clearly designed for extensibility.

It’s a single endpoint that can do different things based on your request. An image can either be sent as a binary data or as a URL to a Google Storage Bucket. You can send multiple images at once and every image request can ask for different type of analysis. You can also ask for more than one type of analysis for a given image.

Google can easily add new features without adding new APIs or changing the endpoint’s semantics. Take a look:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
const key = '<use-your-own-key>';
const url = `https://vision.googleapis.com/v1/images:annotate?key=${key}`;

fetch(url, {
method: 'POST',
headers: new Headers({
'Content-Type': 'application/json'
}),
body: JSON.stringify({
'requests': [{
'image': {
'source': {
'gcsImageUri': 'gs://pveller/pavelveller.jpeg'
}
},
'features': [{
'type': 'LABEL_DETECTION',
'maxResults': 10
}, {
'type': 'FACE_DETECTION'
}],
}]
})
}).then(function(response) {
return response.json();
}).then(function({ responses }) {
const labels = responses[0].labelAnnotations;

console.log(labels.map((l) => `${l.description} - ${l.score.toPrecision(2)*100}%`))
});

Here’s what I got:

1
2
3
4
5
6
7
8
[ 
"hair - 95%",
"person - 94%",
"athlete - 88%",
"hairstyle - 84%",
"male - 79%",
"sports - 72%"
]

A man who definitely cares about his hair, right? :) I am not sure where the sports and athlete bits came from. I also wonder if I would get more tags (like a microphone, for example) if I could ask for features with lower scores. The API doesn’t seem to allow me to lower the threshold. I asked for ten results but got only six back.

The face detection sent down a very elaborate data structure with coordinates of all the little facial features. Things like left eye, right eye, eyebrows, nose tip, and a whole lot more. The only thing is … you can’t see the left side of my face on my avatar.

Google also tries to detect emotions. Of all that it can see - anger, joy, sorrow, surprise - none came back with anything but VERY_UNLIKELY. You can also test an image for explicit content. Same VERY_UNLIKELY for my avatar.

Very pleasant experience but I honestly expected a little more from Google’s Vision API.

I expected more because I know Google does all kinds of crazy things with deep learning in their labs. With images as of two years go and very recently with video. Maybe as those models mature, the Cloud Vision will support more features? Time will tell.

Clarifai

The easiest setup experience by far!

I was ready to go in just a few seconds, no kidding! And it also felt like the fastest response from all the APIs I tried. Very easy and intuitive to use as well:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
const key = '<use-your-own-key>';
const url = 'https://api.clarifai.com/v1/tag';

const data = new FormData();
data.append('url', image);
data.append('access_token', key);

fetch(url, {
method: 'POST',
body: data
}).then(function(response) {
return response.json();
}).then(function({ results }) {
const tags = results[0].result.tag;
const labels = [...tags.classes.keys()].map((i) => ({
'class': tags.classes[i],
'confidence': `${tags.probs[i].toPrecision(2)*100}%`
}));

console.log(labels.map((l) => `${l.class} - ${l.confidence}`));
});

Here’s what I got back:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
[
"music - 100%",
"singer - 99%",
"man - 99%",
"people - 98%",
"competition - 98%",
"musician - 98%",
"one - 98%",
"concert - 98%",
"pop - 97%",
"microphone - 97%",
"portrait - 97%",
"journalist - 96%",
"press conference - 96%",
"wear - 95%",
"administration - 95%",
"television - 94%",
"stage - 94%",
"performance - 93%",
"recreation - 92%",
"festival - 92%"
]

This is actually very close! Good feature detection with various plausible scenarios spelled out based on that. I would only question the absolute confidence in music and singer :) What about a… conference and a speaker?

Clarifai has another very interesting endpoint - Feedback. I haven’t used it but it seems that you can submit your own labels back to Clarifai and help them train and fine-tune the model. It won’t be your own classifier like Watson does. Feedback seems to be a crowdsourcing mechanism to train their main shared model(s). I only wonder how it will work without you having to specify the area of the image that each new label is attached to. In case of my avatar, conference and speaker would attach to the whole image. What about more involved images? Maybe I am missing something…


There’s a lot more computer vision APIs out there. Some are more generic and some a geared towards more specialized tasks like visual product search or logo recognition. Go give it a try!

It’s fascinating what kinds of things are just one HTTP request away.

Cognitive APIs. Vision. Part 2.

In part 1 of this blog series I had Microsoft’s Computer Vision analyze my avatar. Today I would like to ask Mr. Watson from IBM to do the same.

Setup

Same as last time, modern JavaScript and a modern browser.

Getting started with Watson APIs takes a few more steps but it’s still very intuitive. Once you’re all set with Bluemix account, you can provision the service you need and let it see your images.

API

IBM had two vision APIs. AlchemyVision has been recently merged with the Visual Recognition. If you use the original Alchemy endpoint, you will receive the following notice in the JSON response: THIS API FUNCTIONALITY IS DEPRECATED AND HAS BEEN MIGRATED TO WATSON VISUAL RECOGNITION. THIS API WILL BE DISABLED ON MAY 19, 2017.

The new unified API is a little weird. Similar to the computer vision from Microsoft, it can process binary images or can go after an image by its URL. Both need to be submitted as multipart/form-data though. Here’s an exampel from the API reference:

IBM Watson Visual Recognition API

It’s the first HTTP API that I’ve seen where I would be asked to supply JSON parameters as a file upload. You guys? Anyway. Thanks to the Blob object I can emulate multipart file upload directly from JavaScript.

Another puzzling bit is the version parameter. There’s a v3 in the URL but you also need to supply a release date of the version of the API you want to use. Trying without it gives me 400 Bad Request. There’s no version on the service instance that I provisioned so I just rolled with what’s in the API reference. It worked.

I also couldn’t use fetch with this endpoint. This time it’s not on Watson though. My browser would happily send Accept-Encoding with gzip in it and IBM’s nginx would gladly zip up the response. Chrome dev tools can deal with it but fetch apparently can’t. I get SyntaxError: Unexpected end of input at SyntaxError (native) when calling .json() on the response object.

Not sending Accept-Encoding would help but it’s one of the headers you can’t set. I had to resort to good old XHR.

And Here We Go

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
// my avatar
const image = 'http://media.licdn.com/mpr/mpr/shrinknp_400_400/p/7/000/22b/22b/32f088c.jpg';

const key = '<use-your-own-key>';
const url = `http://gateway-a.watsonplatform.net/visual-recognition/api/v3/classify?api_key=${key}&version=2016-05-20`;

const parameters = JSON.stringify({
'url': image,
'classifier_ids': ['default'],
'owners': ['IBM'],
'threshold': 0.2
});

const data = new FormData();
data.append('parameters', new Blob([parameters], {'type': 'application/json'}));

const request = new XMLHttpRequest();

request.onload = function () {
const data = JSON.parse(request.response);
const tags = data.images[0].classifiers[0].classes;

const describe = (tag) => `${tag.class}, ${Math.round(tag.score*100)}%`;

console.log(tags.map(describe));
};

request.open('POST', url, true);
request.send(data);

The response from Watson?

Person, 100%

Yep. That’s it. That’s all the built-in classifier could tell me. You can train your own classifier(s) but they all appear to be basic. No feature detection that would allow to describe images. I tried to see all classes in the default classifier but the discovery endpoint returns 404 for default. I guess I will have to check back later ;)


I have more computer vision APIs to try. Stay tuned!

Cognitive APIs. Vision. Part 1.

I’ve been actively looking at machine learning lately. Fascinating applications in day to day live! Often unexpected. Always amazing. More and more accessible every day. Google’s Motion Stills blew me away the other day. Classification of motion vectors and biased estimation models (in a temporal consistent manner, no less) - a lot of science and novel ideas in a free consumer mobile app. Enabled and powered by machine learning.

You have probably noticed a new breed of APIs popping up all over the web. Some simply call it Machine Learning APIs, others call it Cognitive Services, some simply call it Watson. Pre-trained models operationalized with an API layer and APIs to train your own.

This blog post series offers you a short tour of these new machine learning powered APIs. I am going to start with Vision and today I am tasting Microsoft Cogntive Services (aka Project Oxford).

Setup

I am going to use JavaScript and my browser. These are all HTTP APIs so I should be able to just talk to them with very little overhead or ceremonies. Besides, since the intent is to taste (and test) the APIs, I figured I would also take the latest JavaScript and browser APIs for a spin. No transpilation. No polyfills. No external dependencies. Hence a fair warning: the examples are likely to only work in latest evergreen browsers. I am using connect serve-static as my web server (here’s how).

Let’s see how much a computer vision can see in my avatar.

Pavel Veller

Describe

According to the documentation, the Describe endpoint generates a description of an image in human readable language with complete sentences. The description is based on a collection of content tags, which are also returned by the operation.

Getting started is just a few clicks and API reference is very transparent. Here we go:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
const explain = function(what, confidence) {
return `${what} with ${Math.round(confidence*100)}% confidence`;
};

// my avatar
const image = 'http://media.licdn.com/mpr/mpr/shrinknp_400_400/p/7/000/22b/22b/32f088c.jpg';

// Microsoft: Describe
const url = 'https://api.projectoxford.ai/vision/v1.0/describe?maxCandidates=3';
const key = '<use-your-own-key> ';

fetch(url, {
method: 'POST',
headers: new Headers({
'Content-Type': 'application/json',
'Ocp-Apim-Subscription-Key': key
}),
body: JSON.stringify({
'url': image
})
}).then(function(response) {
return response.json();
}).then(function({description: {captions, tags}}) {
console.log(captions.map((c) => explain(c.text, c.confidence)));
console.log(tags);
});

Very straightforward with no surprises. Everything just worked. Here’s what the Describe API sees in my avatar:

A man holding a cell phone (19% confidence)
A man holding a phone (17% confidence)
A young man holding a cell phone (12% confidence)

It definitely sees a man. It’s not sure whether a man is young. It probably doesn’t know what a microphone is but it vaguely remembers that phones used to look like this:

Old Phone

Here are all the tags:

["person", "man", "indoor", "holding", "looking", "cellphone", "hand", "phone", "young", "laptop", "sitting", "standing", "table", "boy", "computer", "shirt", "using", "brush", "red", "blue", "people"]

Tags

Another API endpoint can report tags and also provide the level of confidence in each. I sent the same request to /vision/v1.0/tag and here’s what I got back:

person (100%)
man (95%)
indoor (94%)
microphone (22%)

I wonder why microphone wasn’t detected by the Describe endpoint. I would expect that Describe gets the tags from Tag and then uses language generation algorytms to build the description. Apparently not.

Analyze

One more API endpoint that can process an image from many angles at once. It will report the most likely description and will send down the tags. You can also ask it to detect faces and more. I asked for Description, Tags, and Categories. This one does feel like an aggregation. I got the same set of tags as I got from Tags, same most likely description (with a cell phone) and a longer list of tags as I got from Describe. The category was identified as:

people_young with 81% confidence

Summary

Microsoft Vision API allows you to see one image at a time. You can either upload the binary or point it at a publicly accessible URL. Depending on what you’re after you can get different results. I am still puzzled by the difference in reported tags. It’s capable of working with domain models to do more specialized detection but right now has only one trained - celebrities. I am sure Microsoft will deploy more and will likely let you train your own. I don’t know when but I know that there are other vision APIs that do so.

Next time I will talk to Mr. Watson. Stay tuned!