Celebrity’s Anonymous Pen Name ‘Outed’ by Software

JGAAP (Java Graphical Authorship Attribution Program)

The role that software plays in stylistic analysis of text is perhaps less surprising to high school and college students than to the general public. The former must submit essays they write to style analysis performed by software which looks for plagiarism and sometimes also makes quality assessments.

In the recent outing of J.K. Rowling as the writer behind the pen name Robert Galbraith, it was mentioned that software had been used to analyze the text of the Galbraith novel. There exists a family of software used by academics for “authorship attribution,” for example to determine whether a recently discovered manuscript is a missing chapter of Don Quijote (a fabricated example). One of these applications is JGAAP, for Java Graphical Authorship Attribution Program. The JGAAP wiki page explains the project as

. . . Java-based, modular, program for textual analysis, text categorization, and authorship attribution i.e. stylometry / textometry. JGAAP is intended to tackle two different problems, firstly to allow people unfamiliar with machine learning and quantitative analysis the ability to use cutting edge techniques on their text based stylometry / textometry problems, and secondly to act as a framework for testing and comparing the effectiveness of different analytic techniques’ performance on text analysis quickly and easily. JGAAP is developed by the Evaluating Variation in Language Laboratory (EVL Lab) and released under the AGPLv3.

How this was accomplished was explained in the blog Language Log by Patrick Juola, one of two academic investigators credited with the analysis (along with some suspicions by reporters at the Sunday Times). Juola refers to this subdiscipline as “forensic stylography.”

A one-paragraph extract from Juola’s blog post follows. Note that, in the usual sense of the word, the analysis doesn’t look directly at “meaning.”

The heart of this analysis, of course, is in the details of the word “compared.” Compared what, specifically, and how, specifically. I actually ran four separate types of analyses focusing on four different linguistic variables. While anything can in theory be an informative variable, my work focuses on variables that are easy to compute and that generate a lot of data from a given passage of language. One variable that I used, for example, is the distribution of word lengths. Each novel has a lot of words, each word has a length, and so one can get a robust vector of <X>% of the words in this document have exactly <Y> letters. Using a distance formula (for the mathematically minded, I used the normalized cosine distance formula instead of the more traditional Euclidean distance you remember from high school), I was able to get a measurement of similarity, with 0.0 being identity and progressively higher numbers being greater dissimilarity.
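To make the word-length idea concrete, here is a minimal sketch in Python. It is not JGAAP's code and not Juola's actual procedure. The function names, the cap of 20 letters, and the toy sentences are all assumptions made for illustration, and "cosine distance" is taken here to mean one minus the cosine similarity, which gives 0.0 for identical distributions.

```python
import math
import re
from collections import Counter


def word_length_distribution(text, max_len=20):
    """Return a vector whose i-th entry is the share of words with i+1 letters.

    Words longer than max_len are counted in the last bucket (an assumption
    made here to keep every vector the same size).
    """
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(min(len(w), max_len) for w in words)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * max_len
    return [counts.get(n, 0) / total for n in range(1, max_len + 1)]


def cosine_distance(u, v):
    """One minus cosine similarity: 0.0 means identical, higher means less alike."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cannot compare a document that contains no words")
    return 1.0 - dot / (norm_u * norm_v)


# Toy passages standing in for whole novels (purely illustrative).
known_author = "The owl flew over the castle and the boy waved his wand."
disputed = "A man sat in the car and lit one more thin cigarette."

distance = cosine_distance(
    word_length_distribution(known_author),
    word_length_distribution(disputed),
)
print(f"cosine distance: {distance:.3f}")
```

On passages this short the number means very little. As Juola notes, the approach depends on a whole novel supplying enough words to make the vector robust.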

 

Cool Socnet Visualization from MIT’s Immersion Project

A previous post considered some practical implications for privacy and government surveillance stemming from the Snowden revelations about the Prism program. The point was made that some people who think they have nothing to hide could easily become ensnared in webs not of their own making, and could find it difficult to untangle themselves.

Interest in metadata patterns in social networks is not limited to the NSA. Prism is one of a number of academic, Homeland Security and Department of Defense programs that have studied how to make sense of social communication patterns to identify and track suspects. One of these is MIT’s Immersion project.

Following a tip from Slashdot,  the Immersion project was given the keys to the author’s hyperactive Gmail account (~ inbox = 169,000, 120 filters, 250 labels).  Immersion analyzes a Gmail account without directly accessing one’s Gmail password.

The attached images were produced by Immersion after analyzing 277,843 emails.  As the MIT project team explains,

Once you log in, Immersion will use only the From, To, Cc and Timestamp fields of the emails in the account you are signing in with. It will not access the subject or the body content of any of your emails.

The point? As Slashdot’s “Judgecorp” points out, Immersion gives even a casual observer a sense for what the NSA Prism initiative could do with metadata.

Immersion can also objectively respond to your mother’s “Why don’t you ever write?” complaint. When used to analyze a single contact, Immersion produces a plot of interactions by year, as depicted in the screenshots.

Yes, writing my sister more often would be a good idea.
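For readers curious how much can be learned from headers alone, here is a minimal sketch using only Python's standard library. It is not Immersion's code. The function name, the sample messages, and the addresses are invented for illustration, and the sketch assumes the messages are available as raw text. Like Immersion, it reads only the From, To, Cc and Date headers and never looks at subject or body.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def interactions_by_year(raw_messages, me):
    """Count messages shared with each contact per year, from headers only.

    Every other address appearing on a message that includes `me` is treated
    as one interaction (a simplifying assumption of this sketch).
    """
    counts = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        if msg["Date"] is None:
            continue  # skip messages with no timestamp
        year = parsedate_to_datetime(msg["Date"]).year
        headers = (
            msg.get_all("From", []) + msg.get_all("To", []) + msg.get_all("Cc", [])
        )
        people = {addr.lower() for _, addr in getaddresses(headers) if addr}
        if me not in people:
            continue
        for contact in people - {me}:
            counts[(contact, year)] += 1
    return counts


# Hypothetical messages; the bodies are present but never read.
SAMPLE = [
    "From: me@example.com\nTo: Sis <sis@example.com>\n"
    "Date: Mon, 02 Jan 2012 10:00:00 +0000\nSubject: hi\n\nbody",
    "From: Sis <sis@example.com>\nTo: me@example.com\nCc: mom@example.com\n"
    "Date: Fri, 15 Mar 2013 09:30:00 +0000\nSubject: re: hi\n\nbody",
    "From: me@example.com\nTo: sis@example.com\n"
    "Date: Sat, 16 Mar 2013 08:00:00 +0000\nSubject: re: re: hi\n\nbody",
    "From: stranger@example.org\nTo: other@example.org\n"
    "Date: Sat, 16 Mar 2013 08:05:00 +0000\nSubject: unrelated\n\nbody",
]

for (contact, year), n in sorted(interactions_by_year(SAMPLE, "me@example.com").items()):
    print(contact, year, n)
```

The same counts, keyed by pairs of contacts instead of by year, would give the edge weights for a network diagram of the sort shown in the screenshots.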

As often highlighted at GlitchReporter.com, things in information technology can sometimes go wrong. Spam, misaddressed email, malware or sheer coincidence could put your name on the receiving end of an arrow in an Immersion diagram.

First posted at Port Wash Patch

 

 

Nothing to Hide? Or Afraid of a ‘Metadata Sweep’

FBI TSC Watch List flowchart

This post first appeared on the Port Washington Patch.

In a recent discussion of the Edward Snowden Affair with family members, three basic attitudes toward the government’s selective spying on U.S. citizens emerged:

The Innocence Argument: “I have nothing to hide, so I don’t care what the federal government wants to know about me.”

The Privacy Argument: “The government should keep out of my personal life.”

The Fallibility Argument: “The federal government’s systems can’t (yet) be trusted to avoid false positives and expeditiously remediate errors.”

Blogger Jeff Jonas noted that:

The underlying problem is that the information on these watch lists typically have low fidelity (i.e., limited data points like only name and date of birth).  If you want to see an example of a government watch list check out the Office of Foreign Asset Control’s Specially Designated Nationals Watch List.  You will find this frequently contains only a name, date of birth and place of birth.
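Jonas's point about low fidelity can be illustrated with a small sketch. Everything in it is hypothetical: the record layout, the names, the dates, the 0.85 similarity threshold, and the matching rule itself, which is not a description of how any government system works. It simply shows that when a record holds only a name and a date of birth, a different person with a similar name and the same birthday cannot be told apart from the listed one.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def possible_matches(traveler, watch_list, threshold=0.85):
    """Return watch-list entries whose name is similar and whose DOB is equal.

    With only two data points per record there is nothing else to compare,
    so every hit looks the same whether or not it is the same person.
    """
    hits = []
    for entry in watch_list:
        similarity = SequenceMatcher(
            None, normalize(traveler["name"]), normalize(entry["name"])
        ).ratio()
        if similarity >= threshold and traveler["dob"] == entry["dob"]:
            hits.append(entry)
    return hits


# Invented records for illustration only.
WATCH_LIST = [
    {"name": "Jon Q Public", "dob": "1970-01-01"},
    {"name": "Jane Roe", "dob": "1982-06-15"},
]
traveler = {"name": "John Q  Public", "dob": "1970-01-01"}

print(possible_matches(traveler, WATCH_LIST))
```

Adding fields such as address history or passport number to each record would let a matcher rule out such collisions, which is the practical meaning of "higher fidelity."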

Consider the case of Sean Kelly, who somehow found himself on the TSA watch list a few years ago. The TSA has since rolled out “Secure Flight,” but even a cursory glance at the system’s complexity and scale (2 million passengers daily moving through 450 ports across the U.S.) instills a healthy skepticism that false positives can be avoided.
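A little arithmetic shows why scale matters. The sketch below uses the 2 million daily passengers cited above. The error rates are purely hypothetical, chosen only to show the shape of the problem, and are not figures reported for Secure Flight or any other system.

```python
def expected_false_positives(passengers_per_day, one_in):
    """Expected wrongly flagged travelers per day if 1 in `one_in` is misflagged."""
    return passengers_per_day / one_in


DAILY_PASSENGERS = 2_000_000  # figure cited in the post

# Hypothetical error rates, from poor to very good.
for one_in in (1_000, 10_000, 100_000, 1_000_000):
    per_day = expected_false_positives(DAILY_PASSENGERS, one_in)
    print(f"1 in {one_in:>9,}: about {per_day:,.0f} per day, {per_day * 365:,.0f} per year")
```

Even under the most optimistic of these assumed rates, the yearly total does not reach zero, and each case is a person who needs a working recourse process.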

As the public debate over Snowden and PRISM rages on, consider the ways in which a citizen’s name could appear in a possible watch list data set:

  • A friend’s email list was corrupted by a spambot and you were sent an email from a person on the watch list
  • Your name was adjacent to a person on the watch list and a DHS analyst accidentally selected your record
  • Your name was misspelled in the government records
  • You used to live at an address once occupied by a person on the list
  • You have a common name
  • The software compiling the lists and/or extracting candidate metadata contains undetected bugs that have compromised data integrity (See GlitchReporter.com for examples)
  • A disgruntled insider within the government could scramble the underlying data, a problem which could remain undetected for months or even years
  • Recourse software, designed to give citizens an opportunity to appeal false positive classifications when disclosed, is inadequately tested
  • Across-the-board government cutbacks have left the program understaffed, and the software supporting citizen recourse systems is no longer well maintained

Recently NPR’s This American Life chronicled the sequence of bureaucratic bumbling, auto-responders and inadequate supervision and training that apparently led to the beheading of an Iraqi national who had worked for a U.S. contractor.

Imagine that your name or account number appeared on a search of the metadata collected as part of the Prism program. Assuming you had recourse, consider the sort of correspondence needed to extricate yourself from the web of trouble in which you find yourself entangled. It is all too easy to imagine receiving messages from government agencies worded thusly:

Kindly be informed that we checked your case and found that it is in processing pending verifying your employment documents. Once it is completed we will move forward with your case. Your patience does assist us in accelerating the process.

The Orwellian message was repeated often, even after the Iraqi national had reportedly provided the verifications requested.

The Fallibility Argument isn’t a Paranoia Argument. It merely recognizes the limitations of systems created on this scale and run by a very large organization with unclear oversight. It can be assumed that some of the deficiencies have been corrected, but the Department of Justice Inspector General report on issues at the FBI’s Terrorist Screening Center is worth reading. After all, recent revelations about Prism indicate that there are “117,675 active surveillance targets.”

If a two-year-old toddler could end up on a list, it’s conceivable that the FBI’s data is telling them that one of those targets knows you.

Bush the Elder’s “Vision Thing”

A colleague suggested a TED talk by Simon Sinek on “leadership.” Can any talk or book about “leadership” be credible?
I am suspicious of someone who casually proposes that humans are motivated “by biology not psychology.” As if these could be cleanly partitioned off from one another.

I can perhaps overlook that oversimplification.

But most organizations “believe” many things. Concurrence of employee/vendor teams, if it could be measured, would surely cut across many beliefs and ideas. It would be difficult to prove that what motivates people is directly causative of the success of a given enterprise. Being motivated can lead to good as well as bad results. There are good and bad, successful and unsuccessful visions that can be communicated (or miscommunicated) to prospective cult members. Many a startup with great vision, collective commitment, and focus on “why,” not just “what,” will fail to make the cut.

Inspirational, powerful rhetoric is great (and its absence is painful), but show me what Sinek in his talk disparagingly refers to as “the 12 point plan,” too. A core principle in understanding how people operate, I believe, is the notion that knowledge, and the pursuit of it in an enterprise, is intersubjective. That means, at some level, distrusting not only the expressed beliefs of others, but one’s own instincts to believe.

Maybe Sinek is simply reiterating what Bush the Elder was said to have commented about “the vision thing.” Give it its due, but no more.

Recruiting #fail: On Recruiting for Proficiency

What follows is a position description received this month from a firm, not a recruiter.

Required Technical Skills:

  • Proficiency in all MS Office applications including MS Project
  • Front end development (HTML, Flash, Ajax, Javascript – templates)
  • Back end development (XML, HTTPS, Web Services, Web dav, data mapping)
  • Experience with implementing and managing Demand Ware solutions a plus, Demandware Business Manager, DemandWare UX studio (Eclipse based development environment), DemandWare control center
  • Clear understanding of web technologies like Java, DotNet, PHP, Ruby, SQL, MYSQL, MSSQL, HTML 5, Javascript, IIS, Apache, Performance fine tuning techniques, Flash, AJAX, Mobile platform, CRM, Web services, XML
  • Understanding of Informatica, SAP, Biztalk is a plus

A piece of work, but not about getting work done.

Will $100M Trickle Watson Down to SMB Enterprises?

IBM Watson

Bloomberg News reported that IBM plans to invest an additional $100 million in its Watson technology. Earlier in 2011, Watson exceeded previously unmet expectations for artificial intelligence by easily overwhelming two Jeopardy! champions on national TV. While Watson-like technologies could be used in a variety of settings (e.g., network management or health care), the steep investments IBM has already made suggest that the global services giant has its eye on a revenue stream whose major tributaries are large enterprises: Procter & Gamble, Pfizer, ExxonMobil, JPMorgan Chase.

ArnoldIT’s April Holmes put it this way:

IBM has a Tundra truck stuffed with business intelligence, statistics, and analytics tools [SPSS, InfoSphere Streams and Cognos come to mind – ed.] IBM has no product. IBM . . . has an opportunity to charge big bucks to assemble these components into a system that makes customers wheeze, “No one ever got fired for buying IBM.”

Promising but out of reach? Few have been fired for asking, “Can we afford IBM?” In a recent Technology Review interview, IBM Analytics head Chid Apte admitted that “This technology will form the basis of a new product we will in the future be able to offer all of IBM’s big customers.”

The reasons for the anticipated cost are readily apparent. It has been widely reported that Watson took four years to build, runs on around 2,800 Power7 processor cores, has 15 terabytes of main memory, can operate at 80 teraflops (80 trillion operations per second), and employs IBM’s SONAS file system with a capacity of 21 terabytes. Watson software components included some familiar open source technologies IBM had already adopted elsewhere, such as Eclipse and Apache Hadoop, but new ground was broken in creating a natural language understanding system tailored to perform in the Jeopardy! question and answer format. The cost for that capability alone was considerable.

IBM believes this revenue stream will be substantial. According to the Bloomberg article, IBM projects $16B from “business analytics and optimization.” This estimate is probably not unfounded. A 2011 IBM-sponsored study of 3,000 CIOs reportedly found that 4 out of 5 executives indicated that applying analytics to IT operations was part of their “strategic growth plans.”

But what are the prospects for small and medium-sized businesses (SMBs)? Large data warehouses are not only associated with large enterprises. Small firms, even a one-person consultancy, can easily amass huge quantities of data, and may be even more highly motivated to make sense of that data. However, they are unlikely to have Watson-scale budgets.

Still, there are a few possible scenarios in which Watson technology could reach SMBs:

  • Cloud-based Watson resources, with cost reductions made possible by scale (a la Google search), could become more widely available
  • “Watson Light” — Restricted vocabularies and data sources, possibly sold through IBM partners
  • Bundling of certain Watson components with existing, more affordable IBM products
  • A la carte offerings, such as the CRM-integrated “Next Best Action” recommender systems envisioned by Forrester’s James Kobielus
  • Industry-specific offerings in which the raw Watson capabilities are harnessed behind the scenes by IBM specialists

Providing the robust hardware and software needed to collect, host and access large-scale data warehouses using Watson-like technologies is not a near-term possibility for smaller enterprises. It should be remembered that existing natural language technologies, such as the highly effective speech recognition technology Microsoft seamlessly integrated into Vista and Windows 7, have not been widely adopted, even though they are efficient and easy to use for many types of human-computer interaction. Other obstacles await early adopters: problems of data quality, provenance, standardization, consensus building for metadata, and special scalability problems such as disaster recovery (DR) and privacy concerns. Early adopters may rely on third-party specialists to pull many of the levers.

Nevertheless, some steps can be taken by SMBs to lay a foundation for the Watson Era.

  • Identify the highest-payoff opportunities, then refine enterprise-specific use cases to match
  • Develop canonical, standardized systems for metadata and taxonomies
  • Leverage existing standards while monitoring current work on evolving standards
  • Develop small, prototype projects using current technologies to assess where payoffs are likely to be for your organization (e.g., low cost experiments with Hadoop or similar technologies)
  • Include nontraditional sources, such as email, web traffic, internal and external documents and project management artifacts
  • Begin to address data quality and provenance by improving internal processes and assigning metrics (even if initially manual)
  • Plan for scaling out warehouses several orders of magnitude beyond current forecasts
  • Collaborate with other groups, especially within industry-specific subcommunities
  • Be on the lookout for template-based “blueprints” that work for industry-specific needs (e.g., subscription-based businesses with periodic renewals, or importers whose margins depend greatly upon shipping costs, etc.)
  • Through internal education, networking, consultants and recruitment, improve staff capabilities and awareness

Watson technologies are a force to be reckoned with. Just when they will make themselves felt in the marketplace is still guesswork, but savvy early adopters will likely seize opportunities that won’t be so easy to pluck later in the adoption curve.

Use (Corporate Knowledge) or Lose It

Danger Sidekick (credit Wikimedia Commons)

When a firm decides to shutter operations, the loss of knowledge capital in the form of talent should appear somewhere in the risk assessment. While significant short-term savings may be achieved by closing a division (in the case of Microsoft, perhaps to save $$$ to purchase Skype?), one side effect can be a brain drain that becomes a bonanza for well-heeled competitors. A report from CNN Money today identifies several members of the original Danger (Sidekick) team who are now working at Google’s new innovation wing, “Android Hardware”:

Hershenson and Brit were part of the trio that founded Danger in 2000. The third partner: Android chief Andy Rubin. The three engineers launched pioneering consumer smartphones, like the once-ubiquitous-among-celebrities T-Mobile Sidekick in 2000.

Now all three are working for Google, perhaps with added incentive.

What follows is my comment on David Pogue’s NY Times story announcing the closing of Cisco’s Flip operation.

“DP, you’ve got this mostly right, though I think there is a more disturbing back story that goes beyond this one. It’s the life cycle of smaller to medium sized technology firms whose founders and investors cash out by selling to a major (usually public) company. Another example that comes to mind is Microsoft’s killing off the Sidekick, another neat device paired with an even better cloud service to back it up. What’s gone is more than the idea: seen in its pre-acquisition form, these firms are living, breathing entities, with expert sales and marketing groups, engineers, and idea-makers. Listen up, politicians: THIS is the real “job growth,” not stringing fiber into empty office suites and hosting MS Office training classes for the unemployed. Killing off firms like Danger and Pure Digital aborts the creative offspring that their collective intelligence could manifest. A few among them will have cashed out, but most of those 550 workers will be consigned to endure a personal version of the Flip tragedy. Writ large, it’s the U.S. version of capitalism shooting itself in the foot just when job growth is needed most. Markets dump capital mainly into mega-firms like Cisco, whose far-flung, unwieldy enterprises are far less efficient at converting that cash into good ideas and jobs” (April 14, 2011).