Content Work Automation with Text Analytics API

In my last post I used Computer Vision APIs to automate image tagging. Let’s see if machine learning APIs can help us automate tedious content work like SEO keywords generation and text proof reading.

Microsoft Cognitive Services offers Text Analytics API that can extract keywords from text and can also do sentiment analysis. I will again use Sitecore, its Habitat demo site, and Powershell Extensions to automate everything though the concepts should apply to any modern CMS.

Key Phrases

It’s probably not hard to come up with a decent list of keywords for a body of text that is a web page. As the size of your site grows, however, the task becomes very tedious very quickly if performed manually. Add to that the editorial calendar with frequent updates and you now run a risk of having obsolete keywords adversely impacting your SEO. Add to that a component based approach with proper content reuse and flexibility in the hands of your content teams and it’s even harder to track what exactly each page renders on the live site. Everything that can be automated should be automated,

Getting keywords for a given text fragment from Text Analytics API is very straightforward:

1
2
3
4
5
6
7
$keywords = Invoke-WebRequest `
-Uri 'https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases' `
-Body "{'documents': [ { 'language': 'en', 'id': '$($page.ID)', 'text': '$text' } ]}" `
-ContentType "application/json" `
-Headers @{'Ocp-Apim-Subscription-Key' = '<use-your-own-key>'} `
-Method 'Post' `
-UseBasicParsing | ConvertFrom-Json

Here’s how I am going to aggregate the content for a given page:

1
2
3
4
5
6
7
8
9
10
11
12
function GetContent($item, $layout = $False)
{
# TBD
}

$content = GetContent $page $True `
| Where { $_ -match '\D+' } `
| %{ $_ -replace '\.$', ''} `
| Sort-Object `
| Get-Unique

$text = [String]::Join('. ', $content)

Basically, I will get various content fragments concatenated together into one big blob of text.

Aggregating Content

The GetContent function will get all content fields off of the item and then will recursively process all the datasources that the layout references. It’s actually smart enough to also resolve links to other items like you would find in the content fields on the carousel panels, for example. It will go as deep as needed, will strip out rich text markup, will skip system fields, and will even handle cyclic references.

Take a look on github if you’re interested, I enjoyed writing this one.

Keywords That Matter

For my experiment I decided to limit the key phrases returned by the API to only those that have words capitalized. I figured it’s a good indication of a header or a subtitle plus it helps spot ALL CAPS text as you will see in a minute:

1
2
3
$keywords.documents[0].keyPhrases `
| Where { $_ -cmatch '^([A-Z]\w+\s?)*$' } `
| %{ Write-Host $_ }

Here are the results for the home page, for example. You probably would want to exclude things that you know are not your keywords (e.g. Search Resutls, Tweets):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
The text is 100.0% positive

Sitecore Package
Sitecore MVP
Sitecore Powered
Download Habitat
Github Habitat Repository
Design Package Principles
Simplicity
High Cohesion Domain
Low Coupling
Pentia
Search Results
Anders Laub Christoffersen
Tweets
Extensibility
Flexibility
News List
Latest News
Click
Introduction

Proof Reading

Text Analytics can also tell you how positive your text sounds. positivity is measured in percentage points from 0% to 100%. It’s also just one HTTP request away if you have your text readily available:

1
2
3
4
5
6
7
8
9
$sentiment = Invoke-WebRequest `
-Uri 'https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment' `
-Body "{'documents': [ { 'language': 'en', 'id': '$($page.ID)', 'text': '$text' } ]}" `
-ContentType "application/json" `
-Headers @{'Ocp-Apim-Subscription-Key' = '<use-your-own-key>'} `
-Method 'Post' `
-UseBasicParsing | ConvertFrom-Json

Write-Host "The text is $($sentiment.documents[0].score*100)% positive"

Many pages in the Habitat demo site are close to 100% positive. That’s to be expected for the elevated marketing speak I guess. A few, however, came back with just 16%. And it turns out that you don’t have to sound too negative to score that low. It’s enough to just be very dry and matter-of-factly. Like this:

1
2
3
4
The accounts module handles user accounts and user profiles including login, registration, forgot password and profile editing. 
A number of components are available to handle login, registration and password reset.
Links to specific pages showing these components are as follows.
Login, Register, Edit Profile (logged in users only), Forgotton Password

Imagine running a script like that for all the pages on your site and sending the results off to your content team? Maybe you will not be able to completely automate keywords generation but you will definitely help them spot content that needs improving.

I have been working with cognitive APIs for a while now and I am still surprised how easy it is to get stuff done. I am even more excited about what’s coming in the near future! So much so that I will be speaking about cognitive APIs and smart apps that one can built with them on the API Strategy conference this coming November. See you in Boston!

Sitecore Catalog Export for Azure Recommendations API

Azure Recommendations API requires a product catalog snapshot and the transactions history to train a model. This blog post will show you how you can export a Sitecore Commerce reference storefront catalog using PowerShell Extensions.

Bare Minimum

Let’s start small. At a minimum, the Recommendations API needs your SKU #s, product name, and the category name:

1
2
3
AAA04294,Office Language Pack Online DwnLd,Office
AAA04303,Minecraft Download Game,Games
C9F00168,Kiruna Flip Cover,Accessories

The following script will give us the data we need:

1
2
3
4
5
6
7
8
9
10
$catalog = '/sitecore/Commerce/Catalog Management/Catalogs/Adventure Works Catalog'
$product = '{225F8638-2611-4841-9B89-19A5440A1DA1}' # Commerce Product Template

$products = Get-ChildItem -Path $catalog -Recurse `
| Where { $_.Template.InnerItem['__Base template'] -like $product }

$products | Select 'Name', `
'__Display Name', `
@{Name = 'Category'; Expression = {$_.Parent.Name}} `
| Sort 'Name' -Unique

The result looks like this:

1
2
3
4
5
6
7
Name        __Display name                   Category
---- -------------- --------
22565422120 Gift Card Departments
AW007-08 Black Diamond Quicksilver II Carabiners
AW009-08 Black Diamond Quicksilver II SaleItems
AW013-08 Petzl Spirit Adventure Works Catalog
...

Adding Features

Features need to be exported in a special format. Different products in a given catalog may have different features and even have different number of them. Azure solves this by requiring features as a comma separated list of name value pairs:

1
2
3
AAA04294,Office Language Pack Online DwnLd,Office,, softwaretype=productivity
BAB04303,Minecraft DwnLd,Games,, softwaretype=gaming, compatibility=iOS, agegroup=all
C9F00168,Kiruna Flip Cover,Accessories,, compatibility=lumia, hardwaretype=mobile

The following addition to the script will add features to the list:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
function ExtractFeatures($product)
{
$fields = $product.Template.OwnFields `
| Where { $_.Name -notlike 'images' -and $_.Name -notlike '*date'} `
| Where { $product[$_.Name] -ne '' }

$features = @()
foreach($field in $fields)
{
$features += @{Name = $field.Name; Value = $product[$field.Name]}
}

return $features
}

function ApplyFeatures($product)
{
foreach($feature in $product.Features)
{
$product | Add-Member -Name $feature.Name `
-MemberType NoteProperty `
-Value "$($feature.Name)=$($feature.Value)"
}

$product.PSObject.Properties.Remove('Features')

return $product
}

# ... (see above)

$products | Select 'Name', `
'__Display Name', `
@{Name = 'Category'; Expression = {$_.Parent.Name}}, `
'Description', `
@{Name = 'Features'; Expression = {ExtractFeatures($_)}} `
| Sort 'Name' -Unique `
| %{ ApplyFeatures $_ }

PSObject is a dynamic type that you can modify on the fly. First, I extracted a collection of features into a new Features property. Then I applied features to become new properties on the product object. CSV export will be able to pick it up transparently. I hope.

CSV

It should now be easy to export the list as CSV. There’s a caveat though.

Both ConvertTo-CSV and Export-CSV will happily export the list for you but will normalize every record to the common set of fields.

You won’t see the features in the list. Here’s a trick to get every product in the export have its own features:

1
2
3
4
5
6
7
8
9
10
# ... (see above)

$products | Select 'Name', `
'__Display Name', `
@{Name = 'Category'; Expression = {$_.Parent.Name}}, `
'Description', `
@{Name = 'Features'; Expression = {ExtractFeatures($_)}} `
| Sort 'Name' -Unique `
| %{ ApplyFeatures $_ } `
| %{ ConvertTo-CSV -InputObject $_ -NoTypeInformation | Select -Skip 1 }

Instead of piping the entire set to the ConvertTo-CSV, I basically processed the list one by one in the foreach loop. I also removed the type info and the CSV headers. Azure doesn’t need labels anyway. Works like a charm!

1
2
3
4
5
"AW007-08","Black Diamond Quicksilver II","Carabiners","Straight","BasePrice=10.0000"
"AW009-08","Black Diamond Quicksilver II","SaleItems","Straight"
"AW013-08","Petzl Spirit","Adventure Works Catalog","Straight"
"AW014-08","Petzl Spirit","Carabiners","Straight","BasePrice=14.0000"
"AW029-03","Women's woven tee","Shirts","Short-sleeve, breathable henley, 100% cotton knit","BasePrice=35.0000","Brand=Litware"

Commas and Quotes

There’s one more thing that I needed to do for Azure Recommendations API to absorb the catalog. As you could tell, the catalog format is not exactly CSV. Every line can have different number of fields basically. Neither does Azure backend use CSV parsing to read it.

The double quotes in the export above were taken literally. Azure would think that the SKU # is "AW007-08", for example. And then the commas in the descriptions where messing up the parsing as well. My next post will be about the Recommendations API itself and I will write more about it, but here’s the final version that produces a clean catalog export ready to go:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
function ExtractFeatures($product)
{
# ... (see above)
}

function ApplyFeatures($product)
{
# ... (see above)
}

function CleanUpCommas($product)
{
foreach ($prop in $product.PSObject.Properties)
{
$src = $product.PSObject.Members[$prop.Name].Value
$product.PSObject.Members[$prop.Name].Value = $src -replace ",", ";"
}

return $product
}

function CleanUpQuotes($line)
{
return $line -replace """", ""
}

# ... (see above)

$products | Select 'Name', `
'__Display Name', `
@{Name = 'Category'; Expression = {$_.Parent.Name}}, `
'Description', `
@{Name = 'Features'; Expression = {ExtractFeatures($_)}} `
| Sort 'Name' -Unique `
| %{ ApplyFeatures $_ } `
| %{ CleanUpCommas $_ } `
| %{ ConvertTo-CSV -InputObject $_ -NoTypeInformation | Select -Skip 1 } `
| %{ CleanUpQuotes $_ }

Got to love PowerShell. Cheers!

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×