Content Work Automation with Text Analytics API

In my last post I used Computer Vision APIs to automate image tagging. Let’s see if machine learning APIs can help us automate tedious content work like SEO keyword generation and proofreading.

Microsoft Cognitive Services offers a Text Analytics API that can extract keywords from text and can also do sentiment analysis. I will again use Sitecore, its Habitat demo site, and PowerShell Extensions to automate everything, though the concepts should apply to any modern CMS.

Key Phrases

It’s probably not hard to come up with a decent list of keywords for a body of text that is a web page. As the size of your site grows, however, the task becomes very tedious very quickly if performed manually. Add to that the editorial calendar with frequent updates and you now run the risk of obsolete keywords adversely impacting your SEO. Add to that a component-based approach with proper content reuse and flexibility in the hands of your content teams and it’s even harder to track what exactly each page renders on the live site. Everything that can be automated should be automated.

Getting keywords for a given text fragment from Text Analytics API is very straightforward:

# Build the body with ConvertTo-Json so that quotes inside $text are escaped properly
$body = @{ documents = @(@{ language = 'en'; id = "$($page.ID)"; text = $text }) } | ConvertTo-Json -Depth 3

$keywords = Invoke-WebRequest `
    -Uri 'https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases' `
    -Body $body `
    -ContentType "application/json" `
    -Headers @{'Ocp-Apim-Subscription-Key' = '<use-your-own-key>'} `
    -Method 'Post' `
    -UseBasicParsing | ConvertFrom-Json

Here’s how I am going to aggregate the content for a given page:

function GetContent($item, $layout = $False)
{
# TBD
}

$content = GetContent $page $True `
| Where { $_ -match '\D+' } `
| %{ $_ -replace '\.$', ''} `
| Sort-Object `
| Get-Unique

$text = [String]::Join('. ', $content)

Basically, I will get various content fragments concatenated together into one big blob of text.
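To make the clean-up steps concrete, here is a small worked example of the same pipeline. The fragments are made up for illustration and simply stand in for whatever GetContent would return:

```powershell
# Hypothetical fragments standing in for the output of GetContent
$fragments = @(
    'Low Coupling.',
    '2016',
    'Download Habitat',
    'Low Coupling',
    'Simplicity.'
)

$content = $fragments |
    Where-Object { $_ -match '\D+' } |          # drop fragments made of digits only
    ForEach-Object { $_ -replace '\.$', '' } |  # trim a single trailing period
    Sort-Object |                               # Get-Unique only removes adjacent duplicates
    Get-Unique

$text = [String]::Join('. ', $content)
$text   # Download Habitat. Low Coupling. Simplicity
```

Stripping the trailing period before sorting is what lets 'Low Coupling.' and 'Low Coupling' collapse into a single entry.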

Aggregating Content

The GetContent function will get all content fields off of the item and then will recursively process all the datasources that the layout references. It’s actually smart enough to also resolve links to other items like you would find in the content fields on the carousel panels, for example. It will go as deep as needed, will strip out rich text markup, will skip system fields, and will even handle cyclic references.
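The idea of following references without looping forever can be sketched as follows. This is not the script from GitHub: the in-memory item store, the Get-ContentSketch name, and the naive markup stripping are all assumptions made to keep the example self-contained.

```powershell
# Hypothetical in-memory items: text fields plus links to other items
$store = @{
    'home'     = @{ Fields = @('Welcome to Habitat');         Links = @('carousel') }
    'carousel' = @{ Fields = @('Latest News');                Links = @('home', 'panel') }  # links back to home
    'panel'    = @{ Fields = @('<p>Low <b>Coupling</b></p>'); Links = @() }
}

function Get-ContentSketch {
    param(
        [string]$Id,
        [hashtable]$Store,
        [System.Collections.Generic.HashSet[string]]$Visited
    )

    # HashSet.Add returns $false for an id that was already seen, which breaks cycles
    if (-not $Visited.Add($Id)) { return }

    foreach ($field in $Store[$Id].Fields) {
        $field -replace '<[^>]+>', ''   # naive rich text markup stripping
    }
    foreach ($link in $Store[$Id].Links) {
        Get-ContentSketch -Id $link -Store $Store -Visited $Visited
    }
}

$visited = New-Object 'System.Collections.Generic.HashSet[string]'
$content = @(Get-ContentSketch -Id 'home' -Store $store -Visited $visited)
$content   # Welcome to Habitat, Latest News, Low Coupling
```

Passing the same visited set down the recursion means every item contributes its text at most once, no matter how many pages or panels link to it.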

Take a look on GitHub if you’re interested. I enjoyed writing this one.

Keywords That Matter

For my experiment I decided to limit the key phrases returned by the API to only those that have words capitalized. I figured it’s a good indication of a header or a subtitle plus it helps spot ALL CAPS text as you will see in a minute:

$keywords.documents[0].keyPhrases `
| Where { $_ -cmatch '^([A-Z]\w+\s?)*$' } `
| %{ Write-Host $_ }

Here are the results for the home page, for example. You would probably want to exclude things that you know are not your keywords (e.g. Search Results, Tweets):

The text is 100.0% positive

Sitecore Package
Sitecore MVP
Sitecore Powered
Download Habitat
Github Habitat Repository
Design Package Principles
Simplicity
High Cohesion Domain
Low Coupling
Pentia
Search Results
Anders Laub Christoffersen
Tweets
Extensibility
Flexibility
News List
Latest News
Click
Introduction
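Excluding known non-keywords could look something like this. The sample phrases come from the results above, while the stop list itself is an assumption that you would maintain for your own site:

```powershell
# A few of the phrases returned for the home page
$phrases = @('Sitecore MVP', 'Search Results', 'Low Coupling', 'Tweets', 'Click')

# Hypothetical stop list: UI labels that are not real keywords
$exclude = @('Search Results', 'Tweets', 'Click')

$filtered = $phrases |
    Where-Object { $_ -cmatch '^([A-Z]\w+\s?)*$' } |   # same capitalization filter as before
    Where-Object { $exclude -notcontains $_ }          # drop anything on the stop list

$filtered   # Sitecore MVP, Low Coupling
```

Note that -notcontains compares case-insensitively, so the stop list does not need to match the casing returned by the API.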

Proof Reading

Text Analytics can also tell you how positive your text sounds. The API returns a score between 0 and 1, which I report as a percentage from 0% to 100%. It’s also just one HTTP request away if you have your text readily available:

# Build the body with ConvertTo-Json so that quotes inside $text are escaped properly
$body = @{ documents = @(@{ language = 'en'; id = "$($page.ID)"; text = $text }) } | ConvertTo-Json -Depth 3

$sentiment = Invoke-WebRequest `
    -Uri 'https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment' `
    -Body $body `
    -ContentType "application/json" `
    -Headers @{'Ocp-Apim-Subscription-Key' = '<use-your-own-key>'} `
    -Method 'Post' `
    -UseBasicParsing | ConvertFrom-Json

Write-Host "The text is $($sentiment.documents[0].score*100)% positive"

Many pages in the Habitat demo site are close to 100% positive. That’s to be expected for the elevated marketing speak, I guess. A few, however, came back with just 16%. And it turns out that you don’t have to sound too negative to score that low. It’s enough to just be very dry and matter-of-fact. Like this:

The accounts module handles user accounts and user profiles including login, registration, forgot password and profile editing. 
A number of components are available to handle login, registration and password reset.
Links to specific pages showing these components are as follows.
Login, Register, Edit Profile (logged in users only), Forgotton Password

Imagine running a script like that for all the pages on your site and sending the results off to your content team? Maybe you will not be able to completely automate keywords generation but you will definitely help them spot content that needs improving.
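A report like that could be as simple as filtering the collected scores by a threshold. In this sketch the page paths, the scores, and the 0.5 cut-off are all made up for illustration; in a real run the scores would come from the sentiment call shown earlier:

```powershell
# Hypothetical results: page path and the raw score (0..1) from the sentiment endpoint
$results = @(
    [pscustomobject]@{ Page = '/home';             Score = 1.0 },
    [pscustomobject]@{ Page = '/modules/accounts'; Score = 0.16 },
    [pscustomobject]@{ Page = '/about';            Score = 0.93 }
)

$threshold = 0.5   # assumed cut-off, tune it for your own content

$needsReview = @($results |
    Where-Object { $_.Score -lt $threshold } |
    Sort-Object Score |
    Select-Object Page, @{ Name = 'PositivePercent'; Expression = { [math]::Round($_.Score * 100) } })

$needsReview   # only /modules/accounts, at 16 percent
```

From here the list could be piped to Export-Csv or dropped into an email for the content team.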

I have been working with cognitive APIs for a while now and I am still surprised how easy it is to get stuff done. I am even more excited about what’s coming in the near future! So much so that I will be speaking about cognitive APIs and the smart apps that one can build with them at the API Strategy conference this coming November. See you in Boston!

Image Tagging Automation with Computer Vision

I have recently presented my explorations of computer vision APIs (part 1, part 2, and part 3) at the AI meetup in Alpharetta. This time I decided to do something useful with it.

Image Tagging

When you work with digital platforms (be that content management, e-commerce, or digital assets) you can’t go far without organizing your images. Tagging makes your assets library navigable and searchable. Descriptions are a great companion to the visual preview and can also serve as the alternate text. WCAG 2.0 requires non-text content to come with a text alternative for the very basic Level A compliance.

Computer Vision

When I played with the trained computer vision models from different vendors, I realized that I can get a good set of tags from either one of the APIs and some would even try to build a description for me. The digital assets management vendors started playing with this idea as well. Adobe, for example, has introduced smart tags in the latest release of AEM. Maybe I can do the same using Computer Vision APIs and integrate with a digital product that doesn’t have that capability built in yet? Let’s try with Sitecore.

Automation

I am going to use Computer Vision from Microsoft Cognitive Services and the Habitat demo site from Sitecore. I am also going to need Powershell Extensions to automate everything.

We will need the URL of the computer vision API, the binary array of the image, the Sitecore item representing the image to record the results on, and a little bit of Powershell magic to glue it all together.

Here’s the crux of the script where I call into the computer vision API:

$vision = 'https://api.projectoxford.ai/vision/v1.0/analyze'
$features = 'Categories,Tags,Description,Color'

$response = Invoke-WebRequest `
-Uri "$($vision)?visualFeatures=$($features)" `
-Body $bytes `
-ContentType "application/octet-stream" `
-Headers @{'Ocp-Apim-Subscription-Key' = '<use-your-key>'} `
-Method 'Post' `
-ErrorAction Stop `
-UseBasicParsing | ConvertFrom-Json

It’s that simple. The rest of it is using Sitecore APIs to read the image, update the item with the tags and descriptions received from the cognitive services, and a try/catch/retry loop to handle the API’s rate limit (in preview it’s limited to 5,000 calls per month and 20 per minute). You can find the full script on GitHub.
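For reference, a try/catch/retry loop can be sketched roughly like this. It is not the code from the full script: Invoke-WithRetry, its parameters, and the default delay are assumptions, and for simplicity it retries on any error rather than only on a rate limit response.

```powershell
# Hypothetical helper: runs $Action and retries after a pause when it throws
function Invoke-WithRetry {
    param(
        [scriptblock]$Action,
        [int]$MaxAttempts = 3,
        [int]$DelaySeconds = 60
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        try {
            return & $Action
        }
        catch {
            # Out of attempts: rethrow the last error to the caller
            if ($attempt -eq $MaxAttempts) { throw }
            Write-Host "Attempt $attempt failed: $($_.Exception.Message). Retrying in $DelaySeconds s"
            Start-Sleep -Seconds $DelaySeconds
        }
    }
}

# Demo with a fake call that fails twice before succeeding
$script:calls = 0
$result = Invoke-WithRetry -DelaySeconds 0 -Action {
    $script:calls++
    if ($script:calls -lt 3) { throw 'Rate limit exceeded' }
    'ok'
}
```

In the tagging script the action would wrap the Invoke-WebRequest call above, which already uses -ErrorAction Stop so that a failed request ends up in the catch block.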

20/20

Some images were perfectly deciphered by the computer vision API as you can see in this example (the %% are the confidence level reported by the API):

Computer Vision can clearly see what's in the image

Legally Blind

But some others would puzzle the model quite a bit:

Computer Vision mistakes a person for a celebrity and the cell phone for a hot dog

Not only is there no Shu Qi in the picture above, there’s definitely no hot dog and no other food items. Granted, the API did tell me that it was not really sure about what it could see. It’s probably a good idea to route images like that through a human workflow for tag and description validation and correction.

Domain Specific Models

The problem with seeing the wrong things or not seeing the right things in a perfectly focused and lit image is … lack of training. Think about it. There are millions and millions of things that your vision can recognize. But you have been training it all your life and the labeled examples keep coming in on a daily basis. It takes a whole lot of labeled images to train a generic computer vision model and it also takes time.

You can get better results with domain specific models like those offered by Clarifai, for example. As of the time of this writing you can subscribe to Wedding, Travel, and Food models.

Domain Specific Computer Vision model from Clarifai

I am sure you’ll get better classification results out of these models than out of a generic computer vision model if your business is in one of these industries.


Next time I will explore Text Analytics API and will show you how it can help tag and generate keywords for your content.

Do Not Remove Unused Blobs On Save

I have not been actively hands-on with Sitecore lately. But once in a while I come across a question that sounds like a good puzzle to roll up my sleeves for, and then I just can’t help it.

Query

One of our engineers posted a question. Their client’s CM instance was running noticeably slow and the users were complaining. They quickly identified the bottleneck with the SQL profiler but the finding puzzled them:

IF EXISTS (SELECT NULL
           FROM [SharedFields] WITH (NOLOCK)
           WHERE [SharedFields].[Value] LIKE @blobId)
BEGIN
    SELECT 1
END
ELSE IF EXISTS (SELECT NULL
                FROM [VersionedFields] WITH (NOLOCK)
                WHERE [VersionedFields].[Value] LIKE @blobId)
BEGIN
    SELECT 1
END
ELSE IF EXISTS (SELECT NULL
                FROM [ArchivedFields] WITH (NOLOCK)
                WHERE [ArchivedFields].[Value] LIKE @blobId)
BEGIN
    SELECT 1
END
ELSE IF EXISTS (SELECT NULL
                FROM [UnversionedFields] WITH (NOLOCK)
                WHERE [UnversionedFields].[Value] LIKE @blobId)
BEGIN
    SELECT 1
END

Who Are You

I have once traversed basic item APIs all the way down to data providers and back so I just knew where to look. SqlServerDataProvider in Sitecore.Kernel has a method with a very telling name that runs this query.

The method is called GetCheckIfBlobShouldBeDeletedSql(). Walking up the usages chain I found who runs it:

public override bool SaveItem(...)
{
    // ...
    if (Settings.RemoveUnusedBlobsOnSave)
    {
        ManagedThreadPool.QueueUserWorkItem((state => this.RemoveOldBlobs(changes, context)));
    }
    // ...
}

If RemoveUnusedBlobsOnSave is set to true, every item save will call RemoveOldBlobs(), which ends up running the SQL query above.

The method runs asynchronously so it doesn’t directly impact the executing thread, but it does put pressure on the SQL server. Running LIKE logic looking for GUIDs (even without %) in a non-indexed nvarchar field across multiple tables will take some cycles.

Recommendation

It’s good that this logic is protected with a feature toggle.

I suggested that the team turn off Settings.RemoveUnusedBlobsOnSave and contact Sitecore Support.

This behavior was observed in 8.1 Update 2. I opened 8.0 Initial Release just out of curiosity and SaveItem() doesn’t go looking for old BLOBs. I didn’t go through more recent releases but it has got to be a relatively new addition. Probably added for a reason.


If we turn off running it on every item save, when should we run it? Maybe it’s missing the ID of the saved item in the WHERE to make it a lot more specific? Don’t know. I will update this post if/when we hear back from the support team.

Digital Experience Platforms. Where To?

I have been actively looking at a number of digital experience platforms and services lately. Some I have very recent encounters with and some are relatively new to me. As I put them all together on one plate and look at their roadmaps, certain trends become very apparent. In this blog post I would like to quickly share my thoughts about the state of the union in digital experience platforms and also make some projections as to where I believe we’re headed.

The Stage

Have you seen the most recent Forrester Wave for Digital Experience Platforms? Let’s put it side by side with Gartner’s latest Magic Quadrant for Web Content Management:

Gartner Magic Quadrant WCM 2015 and The Forrester Wave Digital Experience Platforms Q4 2015

Thanks to Adobe and Sitecore I learned to use Experience Management where I would previously say Content Management. The opposite is also true - I learned to think that Experience Management is primarily done by Content Management systems. I guess it’s time to re-learn what experience management means. While Sitecore and Adobe are clearly leading the pack on the Web Content Management field, SAP Hybris, Demandware, and Salesforce are breathing down their necks if not surpassing them in the Digital Experience Platforms race.

The proliferation of touchpoints and the emergence of new forms of digital interactions have blurred the lines between content, marketing, commerce, analytics, and mobile.

Acquia has an integrated content and commerce story. Sitecore is catching up with the acquisition of Commerce Server and a very recent strategic partnership with Dynamics AX. While Adobe AEM doesn’t have a pre-integrated transactional commerce backend yet, it has a very compelling integration-ready experience-driven commerce story. And now it also has a unique position in the world of enterprise mobile apps with the recent overhaul of the Digital Publishing Suite and AEM Apps (it is now AEM Mobile).

The appearance of Salesforce in the strong performers wave surprised me so I went to check their offering. They don’t have transactional commerce and neither do they have content management - just like I remember them - but they do have a very strong B2C and B2B digital marketing suite, analytics, unique apps platform, and have recently acquired prediction.io.

Expansion

Just a couple weeks ago SAP Hybris had their clients and partners summit in Munich, Germany. Take a closer look at their new products announcement - Hybris Marketing, Hybris Profile, Hybris CX. They are clearly expanding their portfolio to cover more ground in the big experience management space. No wonder Hybris is right in the middle of the strong performers wave and I’m sure will keep climbing.

Projection #1. I believe we are going to see fewer and fewer cross-vendor integrated-at-the-core solutions powering digital business transformations. We will surely see integrations, especially where companies have previous investments not yet capitalized on, but the majority of greenfield overhauls will be primarily single vendor led or at least centered around one suite of products. The shape of it will change as well.

Acceleration

Retail was traditionally one of the main focuses for the e-commerce vendors and lately for the larger ecosystem of digital experience management players. That said, retail is far from being the only industry that converts users, provides a shopping experience, or otherwise engages customers online. Entertainment, Travel, Utilities, Automotive - just to name a few - are next in line.

SAP Hybris, for example, is planning on releasing a number of new industry-focused accelerators to augment their traditional set of B2C, B2B, and Telco. Once one player starts targeting the industries with tailored solutions the others will likely follow.

Projection #2. It will be increasingly important to be able to accelerate the build. It will shorten the expected ROI cycle and help win the business over. Implementation partners were the ones doing it traditionally but I believe leading digital experience product vendors will be getting into the driver’s seat. It will naturally shift the cost from implementation services to licenses and as-a-service subscriptions.

Commoditization

Speaking of as-a-service. On-premise got old a long time ago. IaaS has been commoditized and those who run on-premise most of the time just run their own IaaS on in-house-virtualized hardware. All the vendors I mentioned have as-a-service offerings but they differ in flavor and in the direction they are going.

Sitecore is very naturally focusing on Azure PaaS but it’s not yet a managed services offering - it’s rather a technology enablement. Adobe goes a little further with the AEM managed services. So does Acquia. Oh, and all Adobe Marketing Cloud products are, of course, as-a-service. Salesforce pioneered the model, they now own heroku, and all their products are naturally as-a-service.

But wait, There’s more!

Have you looked at YaaS - Hybris as a service? Unlike other as-a-service offerings that I mentioned YaaS is actually a marketplace. Functionalities such as Loyalty, Coupons, Order Management, for example, packaged and priced as-a-service. It’s very young and we’re yet to see if it has legs but SAP makes big investments and bets its Hybris strategy on it.

What do you think of in the context of digital experience when you hear IBM? If you think WebSphere Commerce please think again. Look at IBM Bluemix and the application services catalog. My first impression was that it’s like heroku but then I realized that it goes beyond infrastructure and application development. Take a closer look at Mobile Application Content Manager, for example.

Last but not least, please take a closer look at Azure. Here’s one fresh off the press - the new Dynamics AX is Azure-first. And it’s also a marketplace where third parties can host industry-tailored services and solutions.

Projection #3. I believe we will see further commoditization of implementation and operation aspects of digital business platforms. It will happen via advanced and innovative as-a-service strategies well beyond technology enablement and managed services. Vendors will have to keep up before someone releases their current key differentiators as-a-service. Implementation partners will have to pivot their models towards service providers as well.

Intelligence

Machine Learning, Natural Language Understanding, Real-Time Analytics are no longer bleeding edge R&D. So much so that Amazon published Alexa Skills SDK and opened a marketplace and you can do Cognitive Commerce as-a-service. No wonder Salesforce rushed to acquire prediction.io and Sitecore is tinkering with Azure ML for segmentation and multivariate content testing.


Everything is elastic and will soon be as-a-service and soon after will become cognitive. 360 degree customer view will be “so yesterday”, flat, static. Businesses will be pursuing a predictive-real-time-3D customer view instead. That’s what the next breed of cloud-first machine-learning-powered digital experience platform with voice-first interface will promise you.

Projection #4. It’s only going to get better. And it’s going to happen fast.

Exciting times!

Digest of my Sitecore blogs

Before I launched this blog I was actively writing on www.jockstothecore.com and before that on pveller.blogspot.com

Here’s a collection of everything I have published on JocksToTheCore in the last two years:

My Favorites

Sitecore 8 Versioned Layouts

Dynamic Product Details Pages

Sitecore 8 Experience Editor

To The Controller And Back

Web Forms For Marketers

xDB

Custom Field Types

TDS

Assorted