Posts tagged ‘algorithm’

November 14, 2014

LOL targeted search

YouTube is something of a cesspool, with pockets of exceptional quality here and there. Even the higher-quality videos have an ephemeral aspect, mysteriously vanishing or being marked Private from one day to the next. Others succumb to the more prosaic fate of account suspension for multiple copyright violations. Illegal uploads of major record-label artists abound, or did. YouTube is also becoming a go-to destination for low-fidelity live concert recordings.

There’s no shortage of fee-based alternatives, so I’m not complaining.

YouTube LOL search algorithm

Google Research developed an aLOLgorithm, “Quantifying comedy on YouTube: why the number of o’s in your LOL matter”, to measure the hilarity of YouTube videos. Let’s just refer to it as the LOLgorithm, for ease of typing. Initially, I thought it was a prior year’s April Fools’ Day post. It isn’t!

I watched three of the five most LOL-inducing videos, as determined by the humor-seeking LOLgorithm. I was pleasantly surprised. The LOLgorithm selected videos with themes of universal appeal: a fisherman arguing with a grizzly bear, Annoying Orange, and a charming (well, sort of) video about an Italian man’s language misunderstandings while vacationing in Malta.

Discovery is challenging

Google began by identifying the humorous videos, which is easier said than done. YouTube’s search engine is not the greatest. I have two theories about why.

First: YouTube was an acquisition. Yes, I realize that many Google services are. There was, and still is, a Google Video media player, which offers a better user experience. YouTube just seems… unstable, kludgy. I think, but am not certain, that it crashes less often now with the HTML5 player than it did with Adobe Flash (SWF).

Second: The content bar is set low. That is, YouTube channel owners can enter any old thing they want as a title, complete with misspellings or contextual mismatches. My current favorite example of an appalling spelling error is a cover of AC/DC’s Thunderstruck, performed by The Vitamin String Quartet. The title is listed as TUNDERSTRUK. Looks like the LOLgorithm is working, because that’s what I’m doing now.

Another amusing example of contextual/semantic mismatch is a remixed melody from Brittany. The channel owner is from eastern Europe and thought the song’s origin was Scottish. To make matters worse, he labelled it as dubstep when it was actually hardstyle trance. The comments are full of good-natured corrections, in various languages and alphabets. I haven’t a clue how any algorithm, even the LOLgorithm, could parse that! Admittedly, it is an edge case.

Methodology

Google started with the semantic meaning of the title, as designated by the uploader, along with the video description and tags if provided. Next, they used viewer reactions, as indicated by comments, to categorize the humor videos into sub-genres.

Viewers emphasize their reaction to funny videos in several ways: capitalization (LOL), elongation (loooooool), repetition (lolololol), exclamation (lolllll!!!!!), and combinations thereof.
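Google doesn’t publish its comment-parsing code, but the general idea is easy to sketch. Here is my own back-of-the-envelope version (the regular expression and the weights are my assumptions, not Google’s), which turns a comment’s strongest “lol” into a rough intensity score:

    import re

    # My own sketch of a "LOL intensity" score, not Google's code.
    # The pattern and the weights below are assumptions for illustration only.
    LOL_PATTERN = re.compile(r"\b(l+o+l+(?:o+l+)*)\b(!*)", re.IGNORECASE)

    def lol_intensity(comment):
        """Score a comment by how emphatically it LOLs."""
        best = 0.0
        for match in LOL_PATTERN.finditer(comment):
            token, bangs = match.group(1), match.group(2)
            elongation = token.lower().count("o")           # loooooool -> more o's
            repetition = token.lower().count("l") - 1       # lolololol -> repeated letters
            capitalized = 1.0 if token.isupper() else 0.0   # LOL vs. lol
            exclamation = len(bangs)                        # lolllll!!!! -> trailing bangs
            best = max(best, elongation + repetition + capitalized + 0.5 * exclamation)
        return best

    print(lol_intensity("loool"))        # mildly amused
    print(lol_intensity("LOOOOOOL!!!"))  # considerably funnier, apparently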

A “loooooool” indicates greater viewer amusement than a “loool”. The final step was ranking the selected videos by relative funniness. Google described their approach as follows:

We then trained a passive-aggressive ranking algorithm using human-annotated pairwise ground truth and a combination of text and audiovisual features.
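That single sentence is the heart of it. “Passive-aggressive” here refers to the online learning algorithm of Crammer et al., not a personality type. A minimal sketch of how such a ranker could be trained from human-annotated pairs follows; this is my own textbook illustration with placeholder feature vectors, not Google’s implementation:

    import numpy as np

    def train_pairwise_pa(pairs, n_features, C=1.0, epochs=5):
        """Train a linear scorer from pairwise ground truth with PA-I updates.

        Each element of `pairs` is (x_funnier, x_less_funny): two feature
        vectors where human annotators judged the first video funnier.
        This is a generic sketch, not Google's implementation.
        """
        w = np.zeros(n_features)
        for _ in range(epochs):
            for x_pos, x_neg in pairs:
                diff = x_pos - x_neg
                loss = max(0.0, 1.0 - w.dot(diff))   # hinge loss on the pairwise margin
                norm2 = diff.dot(diff)
                if loss > 0.0 and norm2 > 0.0:
                    tau = min(C, loss / norm2)       # PA-I step size
                    w += tau * diff                  # update only when the pair is mis-ranked
        return w

    def rank_by_funniness(feature_vectors, w):
        """Order videos from funniest to least funny under the learned weights."""
        return sorted(feature_vectors, key=lambda x: w.dot(x), reverse=True)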

Raw view count is insufficient as a ranking metric, as it is biased by video age and possibly by prior viewer exposure on an external website.

LOLgorithm accuracy

The Google Research blog post is terse. The LOLgorithm seems accurate to me. There’s an alternative explanation, though: maybe I enjoy the same sort of videos as many other YouTube viewers, and we’re an easily amused, homogeneous lot? There’s plenty of pre-selection bias. In other words, most viewers of YouTube comedy videos have a not-too-subtle preference profile, myself included. For example, I’ve been an Annoying Orange channel subscriber on YouTube since 2010.

The video about the Italian tourist reminded me of a hilarious literary passage.

Have a look. Maybe it will elicit a LOL or two from you.

September 3, 2011

Prediction API Part 2

Motivation

In my initial coverage of the Google Prediction API, I was very curious why Google would be so magnanimous as to open up this API for public use. This is a plausible answer from Google:

We do not describe the actual logic of the Prediction API in these documents, because that system is constantly being changed and improved. Therefore we can’t provide optimization tips that depend on specific implementations of our matching logic, which can change without notice.

An older version of a prediction API

Based on some of the user comments in the Google Group for the Prediction API, I would guess that it is one of the more difficult Google APIs to understand and use. It will probably be similarly challenging to get meaningful results.

Requirements

Google advises that all of the following are prerequisites for using the Prediction API:

  • an active Google Storage account
  • an APIs Console project with both the Google Prediction API and the Google Storage for Developers API activated

And of course, a Google account! See getting started for further details.
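To give a flavor of the workflow, here is a rough sketch of the train-then-predict cycle against the v1.2 REST endpoints, using Python and the requests library. The URL paths, request bodies, and the placeholder token and bucket name are my approximations from the getting-started guide, so treat this as illustrative rather than authoritative:

    import requests
    from urllib.parse import quote

    # Placeholders: substitute a real OAuth 2.0 token and your own Google Storage object.
    OAUTH_TOKEN = "ya29.your-oauth-token"
    TRAINING_DATA = "mybucket/language_id.csv"   # CSV already uploaded to Google Storage
    DATA_ID = quote(TRAINING_DATA, safe="")      # bucket/object must be URL-escaped

    BASE = "https://www.googleapis.com/prediction/v1.2"
    HEADERS = {"Authorization": "Bearer " + OAUTH_TOKEN,
               "Content-Type": "application/json"}

    # 1. Kick off training against the uploaded CSV (path and body are approximate).
    requests.post(BASE + "/training", params={"data": TRAINING_DATA}, headers=HEADERS)

    # 2. Poll for training status until the model reports it has finished.
    status = requests.get(BASE + "/training/" + DATA_ID, headers=HEADERS).json()

    # 3. Request a prediction for a new instance.
    result = requests.post(BASE + "/training/" + DATA_ID + "/predict",
                           headers=HEADERS,
                           json={"input": {"csvInstance": ["Comment vas-tu?"]}})
    print(result.json())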

Free but not forever

The Prediction API is not free of charge indefinitely, either. According to the initial terms, usage is free for all users for the first six months, up to the following limits per project:

  • Predictions: 100 predictions/day
  • Hosted model predictions: Hosted models have a usage limit of 100 predictions/day/user across all models
  • Training: 5 MB trained/day
  • Streaming updates: 100 streaming updates/day
  • Lifetime cap: 20,000 predictions

This free quota expires at the end of the six-month introductory period. The introductory period begins the day that Google Prediction is activated for a project in the Google APIs Console. Remember that charges associated with Google Storage must be included when figuring total cost!
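For a rough sense of scale: at the maximum free rate of 100 predictions per day, a project would make on the order of 180 × 100 = 18,000 predictions over the six-month introductory period, just under the 20,000 lifetime cap. For most projects, then, the daily prediction quota and the 5 MB/day training limit are the constraints to watch.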

Presumably this is an API that Google won’t be deprecating without replacement any time soon. However, there is a separate Terms of Service for the Prediction API, which does give Google the right to do exactly that. I think that is standard language, though, as Google is not contractually bound to support a free, or even paid but unprofitable, service unless explicitly stated.

Conclusion about the Prediction API

A great deal more information is available from the Prediction API developer guide including an example application for movie recommendations.

The Google Prediction API is probably best used as a sandbox. It may be helpful for deciding whether one wants to use machine learning for predictive purposes. If one decides to go ahead with this approach, there are probably more suitable alternatives than the Google Prediction API for an application intended for production use.

July 10, 2011

Prediction API

The recent release of the Google Prediction API version 1.2 seemed oddly, well, magnanimous to me! Given the investment of intellectual capital and resources, I am surprised that Google would be so generous. Opening up the Prediction API means that Google is giving external users access to its in-house machine learning algorithms.

1939 Ford pick-up truck will not likely use the Google Prediction API though other Ford products will

The official Google Code blog post, Every app a smart app, dated 27 April 2011, suggested many possible uses for the Prediction API. Some of the more interesting included:

The last item on the list has the potential, but not certainty, of causing serious privacy concerns. I’m guessing that customer feedback based on structured data is another potential use for the API.

I noticed that Ford Motor Company has plans for the Prediction API, specifically for commuters driving electric vehicles (EVs). Apparently, there is a fair amount of “EV anxiety” due to limited driving range. The Prediction API could be used to mitigate those concerns. AutoBlog, an online publication for automobile enthusiasts, featured a great slide show demonstrating how Ford intends to make use of the Google Prediction API.

The Prediction API is available on Google Code. This is not the first release of the Prediction API. I’m uncertain whether versions before 1.2 were restricted in some way. (Google often grants API access to developers initially, and later, after ironing out any bugs or unexpected problems, opens the product to the public.)

Do be aware that a Google Storage account is required for access. Visit the Google API Console to get started.

June 27, 2011

Google Translation Story Continues

Last month, developers whose applications and websites depended on the Google Translate API and the underlying Google machine translation were shocked by an unexpected announcement.

Google Says Translate and other APIs WILL be deprecated

Google APIs are deprecated all the time. Usually they are replaced with comparable services or APIs.

But that morning was not like anything else. That morning became cruel and sad when the world heard the news. The linguists and webmasters were taken aback, shocked and stuttered in disbelief. The world learnt on May 26, 2011 that Google is no longer going to support its free machine translator also known as Google Translate

via Lackuna.com: Slaughtering Machine Translators – Who Is Going To Replace Google? 

The Translate API documentation on Google Code makes the situation very clear:

The Google Translate API has been officially deprecated as of May 26, 2011. Due to the substantial economic burden caused by extensive abuse, the number of requests you may make per day will be limited and the API will be shut off completely on December 1, 2011.
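For context, the calls now facing the December shut-off are simple HTTP requests against the v2 endpoint. A sketch of one such call follows; the API key is a placeholder, and the parameter and field names follow the v2 documentation as I recall it:

    import requests

    # A sketch of the kind of Translate API v2 request now slated to disappear.
    # API_KEY is a placeholder; names follow the v2 docs as I recall them.
    API_KEY = "your-api-key"
    resp = requests.get("https://www.googleapis.com/language/translate/v2",
                        params={"key": API_KEY, "q": "hello, world",
                                "source": "en", "target": "hi"})
    print(resp.json()["data"]["translations"][0]["translatedText"])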

Regional languages of India by geographical location on the map

Google suggests the Translate Element as an alternative to the API for website translation and similar needs.

Welcome to the Indic web

Deprecation of the Google Translate API does not mean an end to human usage of Google Translate.

This becomes very clear with the June 21 announcement on the official Google blog, Google Translate welcomes you to the Indic web. Google Translate announced support for five languages, in alpha* status: Bengali, Gujarati, Kannada, Tamil and Telugu. According to the post,

In India and Bangladesh alone, more than 500 million people speak these five languages.

Special fonts need to be downloaded to use Google Translate with these Indic languages. The post links to these fonts, which are available free of charge.

It is not clear whether these five alpha languages will be included in the deprecated Translate API before it is taken offline permanently on December 1, 2011.

* Google Translate has introduced nearly a dozen alpha languages since 2009. At present, Google Translate supports 63 languages.

March 12, 2011

War on Content Farms Now in Progress

Farmer's market, Jul 2009 - 01

Content fresh from the farm

Google Declares War on Content Farms:

Google has announced a major algorithmic change to its search engine. Impact on users will be subtle while dramatically improving the quality of Google’s search results…

Google is targeting content farms.

This update is designed to reduce rankings for low-quality sites — sites which copy content from other websites or sites that are just not very useful…. It will provide better rankings for sites with original content, such as research, in-depth reports, thoughtful analysis and so on.

The change should make it easier to find high quality sites.

Google did not give details of the change, which should impact 11.8% of Google’s queries (currently only in the U.S., with plans to roll it out elsewhere over time), but it does say that it will affect the ranking of many sites on the web.

The list of related articles I have hand-selected (just as I dredge through string beans to find the very best ones) may be of further interest to those with a sense of humor, or to those without a personal stake in content farming.

December 9, 2010

Source Meta Tags to Identify Original Publisher Content

In December 2009, the Official Google Webmaster Central Blog responded to publisher concerns about page rank penalties imposed by Google’s search algorithm due to legitimate cross-domain content duplication. Most websites would rarely (if ever) have valid reasons for displaying identical content on multiple and distinctly different domains.

Journalists in Radio-Canada newsroom, via Wikipedia

However, it is a common occurrence for news media sites with multiple syndication channels to legitimately publish duplicate cross-domain content.

Source Meta Tags

Google announced an extra feature for news publishers to differentiate between the first version of a “breaking story” and the redistribution by others that follows. Such redistribution is legitimate, but publishers wanted a way to give credit where credit is due: to the most enterprising journalist for a given news story. Google responded with this suggestion:

News publishers and readers both benefit when journalists get proper credit for their work. That can be difficult, with news spreading so quickly and many websites syndicating articles to others. That’s why we’re experimenting with two new meta tags for Google News: syndication-source and original-source. Each of these meta tags addresses a different scenario, but for both the aim is to allow publishers to take credit for their work and give credit to other journalists.
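As I read the announcement, each tag is an ordinary meta element placed in a page’s head section: a site republishing a wire story would use a meta tag named syndication-source with its content attribute set to the URL of the preferred version of the story, while a site crediting another outlet’s scoop would use original-source with the content attribute pointing at the URL of the article that first broke the news.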

Original versus Duplicate Website Content

Further details about Google’s introduction of “source” meta tags to help find original news were covered on the Google News Blog, and an even more in-depth description can be found in an excellent Search Engine Land article about the meta tags, including discussion of a recent algorithm patent granted to Google.

UPDATE

There is good reason for Google’s decision to implement these meta tags on a trial basis. Best practice, for bloggers and publishers alike, requires attribution when using another source’s original work. Until now, most reputable online content producers have credited their source with a link. However, there is some concern that they could stop doing that and instead merely use the meta tag. That would be a much worse outcome for the original writer, in terms of receiving much-deserved credit for their work.

The meta tags are useful to Google, as they give input to the page rank algorithm (which seeks to reward providers of original content). Yet I do believe this is a good-faith effort by Google. It would be unfortunate if these new meta tags had the opposite effect from what Google intended.
