Posts tagged ‘security’

May 12, 2011

Google users pressed into service in war against spam

Google (GOOG) recently made an official announcement offering a Personal Blocklist extension for Chrome browser users. I am weighed down with far too many Chrome browser extensions already, so I haven’t tested this one. Technology press coverage of the news slightly surprised me:

Google (GOOG) is concluding that if people are so up in arms about its declining search results, then it will let the masses get to work in helping refine its search technology…

Users to spot Spam Sites

Spam Protection Extension for Chrome browser

While amusing (I’ve supplemented my TechCrunch reading with GigaOM lately), it was more in line with what I expect from The Onion. Yet it is correct. The size and growth of the spam problem warrant this reaction from the press, as well as from the public and many businesses. All express frustration with spam and electronic detritus.

Google is addressing spam with what seems to me a two-pronged initiative. The Google War on Content Farms of a few weeks earlier was directed at particularly spammy e-commerce merchants and services. The Personal Blocklist browser extension is the second part, and is directed at e-commerce consumers and users in general.
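To make the idea concrete, here is a minimal sketch of what a personal blocklist does conceptually: it drops search results whose host is on a list the user maintains. I haven't looked at the extension's code, so this is not how Google implements it. The function name, the sample URLs and the subdomain-matching rule are all my own assumptions.

```python
from urllib.parse import urlparse

def filter_results(result_urls, blocked_domains):
    """Return the URLs whose host is not on the user's blocklist.

    Hypothetical helper: a host counts as blocked if it equals a blocked
    domain or is a subdomain of one (an assumption, not documented
    behavior of the Personal Blocklist extension).
    """
    blocked = {d.lower() for d in blocked_domains}
    kept = []
    for url in result_urls:
        host = (urlparse(url).hostname or "").lower()
        is_blocked = any(host == d or host.endswith("." + d) for d in blocked)
        if not is_blocked:
            kept.append(url)
    return kept
```

Calling filter_results with a list of result URLs and ["example.net"] would keep everything except pages on example.net and its subdomains.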

Basic search

Search!

In a worst-case scenario, this can be viewed as a sign that the internet will soon become almost unusable due to clutter from impenetrable volumes of advertisements and duplication of once original but now outdated content. That is the most generalized definition of spam. As a matter of quality control, Google DOES need to provide meaningful results, with a minimum of spam, to Google Search 2.0 users.

What can be done?

Is Google evil?

Is it Google’s fault? Is Google greedy and betraying the public’s best interests? No, not particularly.

Google is a publicly traded company, a business with stockholders. It is not a public utility. Google employees and Google operations are not funded by the taxpayers of any nation. It is very easy to forget that. The model of free online services is wonderful, and benefits everyone, everywhere, particularly in countries where what is considered a nominal cost in the U.S.A. would be prohibitively expensive. Much of the U.S. and global economy, as well as the public in general, are dependent upon free Google services to some degree. This is analogous to physical infrastructure. It is digital infrastructure.

Infrastructure is usually part of the public sector

In order to fund the model of free internet search, and free Google products, Google sells online advertising. And so the World Wide Web’s spam problem reduces in some part, though not entirely, to the principal-agent problem. Moral hazard. Conflict of interest.

Avoidance of moral hazard is a major benefit of having a public sector, and government. When the public sector functions as it should, it reduces biased behavior due to profit-seeking and other motives.

The dilemma for Google as a company

Google needs the advertising revenue provided by AdSense customers (some of whom are the Content Farmers). That is why Google must offer a quality product to the public. Not because the public are Google customers. Google search is free of charge. While it may be unethical to sell a poor-quality product, there is no law against offering crummy goods and services free of charge. That happens all the time. No one wants something that is useless or gives much less value than an alternative provider.

Good corporate citizenship is a consideration, but only a minor one. Google must provide a quality product because the public’s use of free Google products drives revenue from customers. Google is obligated to:

  • Customers. Primary customers are advertisers and revenue-generating businesses, for-profit and otherwise
  • Employees. The people whose paycheck it provides for going to work every day

Remember though that the motivation for these obligations is that they may in turn give value to shareholders in the company itself.

The war against the Content Farmers is dangerous for Google. The Google anti-spam efforts must be targeted enough to cut spam and increase search user satisfaction while not alienating the source of funding that sustains Google and allows the company to offer services at all.

April 3, 2011

reCAPTCHA definition and history

reCAPTCHA example

reCAPTCHA and OCR for digitization projects

What does a CAPTCHA do?

Humans can read the distorted text in CAPTCHA challenges* but current computer programs cannot.

A CAPTCHA is a program that protects websites against bots by generating and grading tests that humans can pass but current computer programs cannot.
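The "generating and grading" part can be sketched in a few lines. This is only a toy under my own assumptions: the function names, the six-character length and the alphabet are made up, and the step that matters most in practice, rendering the answer as a distorted image, is left out entirely.

```python
import hmac
import secrets
import string

# Assumed alphabet for illustration; real services choose their own.
ALPHABET = string.ascii_lowercase + string.digits

def generate_challenge(length=6):
    """Pick a random answer; a real CAPTCHA would show it as a distorted image."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def grade_response(expected, response):
    """Return True if the typed response matches, ignoring case and outer spaces."""
    cleaned = response.strip().lower()
    # compare_digest avoids leaking how many characters matched via timing.
    return hmac.compare_digest(expected.encode(), cleaned.encode())
```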

What does CAPTCHA mean?

CAPTCHA is an acronym for Completely Automated Public Turing Test To Tell Computers and Humans Apart. It was coined in 2000 by the Carnegie Mellon University computer science researchers who originally invented CAPTCHA.

What is the difference between CAPTCHA and reCAPTCHA?

This is how the reCAPTCHA Project explains the difference:

ReCAPTCHA helps prevent automated abuse of your site (such as comment spam or bogus registrations) by using a CAPTCHA to ensure that only humans perform certain actions.

Generally, a CAPTCHA is a single word, whereas a ReCAPTCHA is two words. The reCAPTCHA project page explains this in greater detail. Research papers in PDF format are available for download on the Google ReCAPTCHA website.

Google purchased reCAPTCHA in 2009 and describes usage and further background in the reCAPTCHA FAQs:

ReCAPTCHA is a free CAPTCHA service that helps to digitize books, newspapers and old-time radio shows.

ReCAPTCHA is free

ReCAPTCHA is free to use, including the API, but be aware that it is not open source software.

Other uses

ReCAPTCHA is best known for historic text digitization and spam filtering, which is an information security measure.

Answers to reCAPTCHA challenges are used to digitize textual documents… a combination of multiple OCR programs, probabilistic language models, and the answers from millions of humans on the internet, reCAPTCHA is able to achieve over 99.5% transcription accuracy at the word level….

OCR is an acronym. It means Optical Character Recognition. Compare the accuracy of standard OCR versus reCAPTCHA transcriptions of a medium quality scanned document on the reCAPTCHA digitization accuracy website. See some humorous reCAPTCHA examples from the official Google reCAPTCHA blog. Google announced an audio version of reCAPTCHA in 2009.
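Here is a rough sketch of how the two-word scheme turns human answers into transcriptions. As the project's published research describes it, one word is a control whose answer is already known, and the other is a word OCR could not read. Getting the control right lets the guess for the unknown word count as a vote. The class name, the vote threshold and the sample words are my own assumptions, not the service's real parameters.

```python
from collections import Counter

class WordTranscriber:
    """Toy model of a two-word challenge: one control word, one unknown word."""

    def __init__(self, votes_needed=3):
        self.votes_needed = votes_needed  # assumed threshold, illustration only
        self.votes = Counter()

    def submit(self, control_answer, control_guess, unknown_guess):
        """Grade the control word; count the unknown-word guess only if it passes."""
        passed = control_guess.strip().lower() == control_answer.lower()
        if passed:
            self.votes[unknown_guess.strip().lower()] += 1
        return passed

    def transcription(self):
        """Return the agreed reading of the unknown word, or None if undecided."""
        if not self.votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count >= self.votes_needed else None
```

A failed control word is simply a failed CAPTCHA, and the accompanying guess is thrown away, so bots cannot easily pollute the transcription.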

MailHide is another application, where potential for spam is reduced by requiring a reCAPTCHA challenge in order to disclose an otherwise partially obscured email address. More details are available in my post about MailHide from last month.

Recent developments

Recent research in the area of computer security led to some surprising discoveries about CAPTCHA and spam. Initially, it appeared that the CAPTCHA challenge had been defeated on a large scale, though only in a few specific regions. That was not true though. Human interaction of an unanticipated sort was still required to evade the CAPTCHA, on each and every spam comment and email that got through.

*Work continues on the original CAPTCHA project.

March 20, 2011

How to add a Google Gadget

Image by jblyberg via Flickr

Suite of Google Gadgets for Libraries

Google offers a service called home page gadgets, which are little pieces of content and web functionality that you can put on the main Google page (as it appears to you) or your own iGoogle page.

This is useful if you visit google.com often. If so, you can customize the page with Google Gadgets that give whatever information you find most personally relevant. Some examples are weather reports for your area, news headlines or the latest entries from your favorite blogs or websites.

I don’t have an iGoogle page

That’s OK. You don’t have to have an iGoogle page to use Gadgets. You can add Google Gadgets to the classic Google page.

How do I install a Google Gadget?

Go to the Google Gadgets catalog. To prevent any spyware or virus issues, only install gadgets actually created by Google.

If you are certain the gadget you want to install is from a trustworthy source, such as National Public Radio, consider installing it.

Before installing any gadget, remember this: The only gadgets that are guaranteed safe by Google are those that were made by Google!

March 13, 2011

Mailhide

If you’ve ever looked at an open-source development project hosted on Google servers, usually on http://code.google.com sites, Mailhide will be familiar. It is a less well-known application of the reCAPTCHA detection challenge.

reCAPTCHA now owned by Google

reCAPTCHA Turing test

Mailhide conceals part of an email address

This is how it prevents spammers from harvesting email addresses with automated programs. Typically, the first few letters or numbers of the username part of the email address are visible, followed by an ellipsis (i.e. three dots), and then the domain name.
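The visible part of that scheme is easy to sketch. This shows only the masking, not the reCAPTCHA challenge or the encrypted link behind it. The function name and the default of three visible characters are my own choices for illustration.

```python
def mask_email(address, visible=3):
    """Hide most of the username: 'johndoe@example.com' -> 'joh...@example.com'."""
    username, sep, domain = address.rpartition("@")
    if not sep or not username or not domain:
        raise ValueError("not an email address: %r" % address)
    # Keep the first few characters, replace the rest with three dots.
    return username[:visible] + "..." + "@" + domain
```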

Most Google employees* use Mailhide. Mailhide is offered as an option to developers using Google Code sites.

Mailhide-type functionality is also offered by Slashdot for user accounts. Slashdot is not necessarily using Google reCAPTCHA for this, however. There are other Turing tests besides reCAPTCHA.

reCAPTCHA is a Google product. It was not developed by Google, though. It originated at Carnegie Mellon University, and Google acquired it in 2009.

reCAPTCHA Mailhide API

Are you running a web application that lists users’ email addresses? Do your users a favor by shielding them from spam with reCAPTCHA Mailhide.

Google will give you an API (cryptographic) key. Use it to encrypt user email addresses. Google supplies full documentation for the Mailhide protocol. Everything is free of charge.

I am uncertain whether API restrictions on usage apply. That is a familiar restriction for application developers relying on the Twitter API. It should not be a binding constraint in this case, as Mailhide is far less transactional than Twitter. Unless one is very, very popular!

reCAPTCHA comes in many flavors!

Libraries are available for PHP, Perl, Ruby and Python programs.

*Google employee accounts in the U.S.A., and many but not all other countries, have the format userid@google.com. Non-employee Google mail accounts are userid@gmail.com.

 

March 7, 2011

Authentication and Authorization

Access control has two components, referred to collectively as auth.

Third-party applications often require limited access to a user’s Google Account… all requests for access must be approved by the account holder.

via Authentication and Authorization for Google APIs.

Authentication services

Authentication refers to the process of allowing users to sign in to websites. In the context of this blog, it also refers to signing in to applications using a Google Account, or an OpenID 2.0 based protocol. When Google authenticates a user’s account, it returns a user ID to the web application. This allows user information to be collected and stored. OpenID also allows access to certain user account information, with the user’s approval.

Authorization services

Authorization is often confused (by me, maybe others) with authentication. Authorization lets a user authorize access by applications to specific data associated with the user’s Google account.
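Since I mix the two up myself, here is a toy sketch that keeps them apart: authentication answers "who is this?", authorization answers "what may this application do on that user's behalf?". Everything in it is hypothetical, including the user, the app name and the scope strings. A real system would never store plain-text passwords as this one does.

```python
import hmac

# Hypothetical in-memory data, for illustration only.
USERS = {"alice": "correct horse"}  # username -> password (never do this for real)
GRANTS = {("alice", "calendar-app"): {"calendar.read"}}  # (user, app) -> approved scopes

def authenticate(username, password):
    """Authentication: confirm identity. Returns a user ID, or None on failure."""
    stored = USERS.get(username)
    if stored is not None and hmac.compare_digest(stored.encode(), password.encode()):
        return username
    return None

def is_authorized(user_id, app, scope):
    """Authorization: has this user approved this app for this kind of data?"""
    return scope in GRANTS.get((user_id, app), set())
```

Note that is_authorized never looks at a password. It assumes authentication has already happened and only checks what the account holder approved.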

OAuth 2.0 Protocol

The OAuth 2.0 open-standard protocol allows users to authorize access to their data, after successful authentication. Google supports the OAuth 2.0 protocol with bearer tokens for web (and installed) applications. Regular Google account data and Google Apps account data are accessible with OAuth 2.0. OAuth 2.0 relies on SSL for security instead of direct cryptographic signing that would otherwise be necessary for such access.
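As a sketch of what "bearer token over SSL" means in practice, the snippet below builds a request that carries a token in its Authorization header, and refuses to do so over plain http. The header form follows the bearer-token specification as later finalized (RFC 6750), and the preview-era draft may have differed. The URL and token are placeholders, and the function name is my own.

```python
from urllib.parse import urlparse
from urllib.request import Request

def build_bearer_request(url, access_token):
    """Build (but do not send) a GET request that carries a bearer token."""
    if urlparse(url).scheme != "https":
        # A bearer token is not signed, so it must only travel over SSL/TLS.
        raise ValueError("bearer tokens must be sent over https")
    return Request(url, headers={"Authorization": "Bearer " + access_token})
```

Anyone who holds the token can use it, which is why the transport has to be encrypted.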

Note that OAuth 2.0 has not been finalized; the IETF specification is still a draft (version 13). Google cautions that its OAuth 2.0 support is in an early preview and may change at any time, or as the final specifications evolve. Google considers OAuth 2.0 experimental. However, “experimental” does not have the same tentative connotation associated with Google Labs projects.

OAuth 1.0 Protocol

There is also an OAuth 1.0 for web applications. OAuth 1.0 can be used for authorization to user data by all Google APIs. Google continues to support OAuth 1.0.*

* OAuth 1.0 is sometimes referred to in documentation without version number, only as OAuth.

Other protocols

The OpenID-OAuth hybrid protocol provides authentication and authorization in a single-step process. OpenID provides authentication services, and OAuth provides authorization to Google APIs.

AuthSub API is Google’s proprietary protocol. It is mostly used for Google APIs. AuthSub is similar to OAuth. OAuth is more generally applicable and Google recommends that developers use OAuth instead of AuthSub API.

Registration

Registering a web application is optional. It is also free and straightforward. Web applications that are not registered with Google can still use OAuth 1.0 or AuthSub interfaces. However, registered web applications are recognized by Google and receive a correspondingly higher level of trust designation. This is communicated to users on the login screen.

Example of access request screen for OAuth or AuthSub web app

Sample Google access request screen for unregistered web application

Summary

These are the three levels of registration:

  1. Unregistered. These applications conduct transactions at a lower security level. Google flags the user login page with a precautionary message. See image above with yellow-shaded advisory.
  2. Registered and recognized, but not configured for secure requests.
  3. Registered with enhanced security. These applications have a security certificate and can use secure tokens.

January 21, 2011

Search engine spam

… a decade ago, the spam situation was so bad that search engines would regularly return off-topic webspam. For the most part, Google has successfully beaten that—even while some spammers resort to sneakier or even illegal tactics such as hacking websites. Today, English-language spam in Google’s results is less than half what it was five years ago, and even lower in other languages.

However, we have seen a slight uptick of spam in recent months

We recently launched a document-level classifier that makes it harder for spammy on-page content to rank highly. The new classifier is better at detecting spam on individual web pages, e.g., repeated spammy words—the sort of phrases you tend to see in junky, automated, self-promoting blog comments. We’ve also radically improved our ability to detect hacked sites.

We’ll explore …  new ways for users to give more explicit feedback about spammy and low-quality sites.

As “pure webspam” has decreased over time, attention has shifted instead to “content farms,” which are sites with shallow or low-quality content. In 2010, we launched two major algorithmic changes focused on low-quality sites. Nonetheless, people are asking for even stronger action [on such sites]. We can and should do better.

via Official Google Blog: Google search and search engine spam.

October 19, 2010

The Erosion of Online Anonymity
