Celebrity’s Anonymous Pen Name ‘Outed’ by Software

JGAAP (Java Graphical Authorship Attribution Program)

The role that software plays in stylistic analysis of text is perhaps less surprising to high school and college students than to the general public. Students routinely submit their essays to software that checks for plagiarism and sometimes also assesses quality.

In the recent outing of J.K. Rowling as the writer behind the pen name Robert Galbraith, it was mentioned that software had been used to analyze the text of the Galbraith novel. There exists a family of software used by academics for “authorship attribution,” for example to determine whether a newly discovered manuscript is a missing chapter of Don Quijote (a fabricated example). One of these applications is JGAAP, the Java Graphical Authorship Attribution Program. The JGAAP wiki page explains the project as

. . . Java-based, modular, program for textual analysis, text categorization, and authorship attribution i.e. stylometry / textometry. JGAAP is intended to tackle two different problems, firstly to allow people unfamiliar with machine learning and quantitative analysis the ability to use cutting edge techniques on their text based stylometry / textometry problems, and secondly to act as a framework for testing and comparing the effectiveness of different analytic techniques’ performance on text analysis quickly and easily. JGAAP is developed by the Evaluating Variation in Language Laboratory (EVL Lab) and released under the AGPLv3.

How this was accomplished was explained in the blog Language Log by Patrick Juola, one of two academic investigators credited with the analysis (which followed suspicions raised by reporters at the Sunday Times). Juola refers to this subdiscipline as “forensic stylography.”

A one-paragraph extract from Juola’s blog post follows. Note that, in the usual sense of the word, the analysis doesn’t look directly at “meaning.”

The heart of this analysis, of course, is in the details of the word “compared.” Compared what, specifically, and how, specifically. I actually ran four separate types of analyses focusing on four different linguistic variables. While anything can in theory be an informative variable, my work focuses on variables that are easy to compute and that generate a lot of data from a given passage of language. One variable that I used, for example, is the distribution of word lengths. Each novel has a lot of words, each word has a length, and so one can get a robust vector of <X>% of the words in this document have exactly <Y> letters. Using a distance formula (for the mathematically minded, I used the normalized cosine distance formula instead of the more traditional Euclidean distance you remember from high school), I was able to get a measurement of similarity, with 0.0 being identity and progressively higher numbers being greater dissimilarity.
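To make the word-length variable concrete, here is a minimal Python sketch of the idea Juola describes. It is not his code and not JGAAP's: the function names, the letters-only tokenizer, the cap of 20 letters, and the reading of "cosine distance" as one minus the cosine similarity are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_length_distribution(text, max_len=20):
    """Return a vector whose entry i is the fraction of words with i+1 letters.

    Assumption: a "word" is a run of ASCII letters, and words longer than
    max_len are counted in the last bucket.
    """
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return [0.0] * max_len
    counts = Counter(min(len(w), max_len) for w in words)
    total = len(words)
    return [counts.get(n, 0) / total for n in range(1, max_len + 1)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity: 0.0 for identical directions, larger for less similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical sample passages, standing in for whole novels.
disputed = "The detective walked slowly along the rainy street."
candidate = "A wizard hurried quickly through the castle corridor."

distance = cosine_distance(
    word_length_distribution(disputed),
    word_length_distribution(candidate),
)
print(f"word-length distance: {distance:.4f}")
```

With real novels rather than single sentences, each vector rests on tens of thousands of words, which is what makes the comparison robust. In an actual attribution study the disputed text would be compared against several candidate authors, with the smallest distance pointing to the most similar style on this one variable.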


Will $100M Trickle Watson Down to SMB Enterprises?

IBM Watson

Bloomberg News reported that IBM plans to invest an additional $100 million in its Watson technology. Earlier in 2011, Watson surpassed expectations for artificial intelligence by easily overwhelming two Jeopardy! champions on national TV. While Watson-like technologies could be used in a variety of settings (e.g., network management or health care), the steep investments IBM has already made suggest that the global services giant has its eye on a revenue stream whose major tributaries are large enterprises: Procter & Gamble, Pfizer, ExxonMobil, JPMorgan Chase.

ArnoldIT’s April Holmes put it this way:

IBM has a Tundra truck stuffed with business intelligence, statistics, and analytics tools [SPSS, InfoSphere Streams and Cognos come to mind – ed.] IBM has no product. IBM . . . has an opportunity to charge big bucks to assemble these components into a system that makes customers wheeze, “No one ever got fired for buying IBM.”

Promising but out of reach? Few have been fired for asking, “Can we afford IBM?” In a recent Technology Review interview, IBM Analytics head Chid Apte admitted that “This technology will form the basis of a new product we will in the future be able to offer all of IBM’s big customers.”

The reasons for the anticipated cost are readily apparent. It has been widely reported that Watson took four years to build, runs on around 2,800 Power7 processor cores, has 15 terabytes of main memory, can operate at 80 teraflops (80 trillion operations per second), and employs IBM’s SONAS file system with a capacity of 21 terabytes. Watson software components included some familiar open source technologies IBM had already adopted elsewhere, such as Eclipse and Apache Hadoop, but new ground was broken in creating a natural language understanding system tailored to perform in the Jeopardy! question and answer format. The cost for that capability alone was considerable.

IBM believes this revenue stream will be substantial. According to the Bloomberg article, IBM projects $16B from “business analytics and optimization.” This estimate is probably not unfounded. A 2011 IBM-sponsored study of 3,000 CIOs reportedly found that 4 out of 5 executives indicated that applying analytics to IT operations was part of their “strategic growth plans.”

But what are the prospects for small and medium-sized businesses (SMBs)? Large data warehouses are not only associated with large enterprises. Small firms, even a one-person consultancy, can easily amass huge quantities of data, and may be even more highly motivated to make sense of that data. However, they are unlikely to have Watson-scale budgets.

Still, there are a few possible scenarios in which Watson technology could reach SMBs:

  • Cloud-based Watson resources, with cost reductions made possible by scale (a la Google search), could become more widely available
  • “Watson Light”: restricted vocabularies and data sources, possibly sold through IBM partners
  • Bundling of certain Watson components with existing, more affordable IBM products
  • A la carte offerings, such as the CRM-integrated “Next Best Action” recommender systems envisioned by Forrester’s James Kobielus
  • Industry-specific offerings in which the raw Watson capabilities are harnessed behind the scenes by IBM specialists

Providing the robust hardware and software needed to collect, host and access large-scale data warehouses using Watson-like technologies is not a near-term possibility for smaller enterprises. It should be remembered that existing natural language technologies, such as the highly effective speech recognition Microsoft seamlessly integrated into Vista and Windows 7, have not been widely adopted, even though they are efficient and easy to use for many types of human-computer interaction. Other obstacles await early adopters: problems of data quality, provenance, standardization, consensus building for metadata, and special scalability problems such as disaster recovery (DR) and privacy concerns. Early adopters may rely on third-party specialists to pull many of the levers.

Nevertheless, SMBs can take some steps to lay a foundation for the Watson Era.

  • Identify the most high-payoff opportunities, then refine enterprise-specific use cases to match
  • Develop canonical, standardized systems for metadata and taxonomies
  • Leverage existing standards while monitoring current work on evolving standards
  • Develop small, prototype projects using current technologies to assess where payoffs are likely to be for your organization (e.g., low cost experiments with Hadoop or similar technologies)
  • Include nontraditional sources, such as email, web traffic, internal and external documents and project management artifacts
  • Begin to address data quality and provenance by improving internal processes and assigning metrics (even if initially manual)
  • Plan for scaling out warehouses several orders of magnitude beyond current forecasts
  • Collaborate with other groups, especially within industry-specific subcommunities
  • Be on the lookout for template-based “blueprints” that work for industry-specific needs (e.g., subscription-based businesses with periodic renewals, or importers whose margins depend greatly upon shipping costs)
  • Through internal education, networking, consultants and recruitment, improve staff capabilities and awareness

Watson technologies are a force to be reckoned with. Just when they will make themselves felt in the marketplace is still guesswork, but savvy early adopters will likely seize opportunities that won’t be so easy to pluck later in the adoption curve.