Algorithm Arrogance at Facebook

Pope Paul V - wikipedia, portrait by Caravaggio

A reaction to a Marketplace report on the most recent content stream tweak by Facebook:

It’s algorithm arrogance. There are many data science specialists working at Facebook, but there is reason to believe the new stream tweaks will not appreciably improve the feed. One reason: users have no way to designate content they *do not* want to see (perhaps ever). Another: Facebook search is so unfriendly that it is rarely used to discover what users *do* want to read. (It’s part of the ever-popular toilet paper roll user interface.) In other words, there’s plenty of data but not enough of the right sort to improve personalized relevance. Sure, not everyone would use a recommendation / search facility, but for those who did, the results would improve. The data “science” folks have become so algorithm-arrogant that you’d be hard pressed to even find a resource for personalizing and improving your feed with more data.

Chasing Big Data Variety: Predictive Analytics, Meet Your Market Foe


Linkedin Stock Price Graph - Yahoo Finance via Google Search 20150430 (screenshot)

The graphic shows the behavior of LinkedIn’s stock price in the late afternoon of 2015-04-30. Did your analytics engine (What’s an analytics engine? See International Institute for Analytics) predict this? If not, what (big?) data were you missing?

If your engine missed it, chances are yours was a Big Data Variety problem. Correlating LinkedIn with, for example, only Facebook, Pinterest and other social media platforms may have provided a tipoff, but not enough to forecast a 25% single-day plunge.

And before you reach for the “Sell” button, you might want to revisit this two-year-old story on Forbes, when the stock price also fell. Did your analytics take that into account? The loss was less dramatic, but the cause was similar.

You may need data from other sources, and more than just sniffing URLs from corporate PR departments a la Selerity. Perhaps your forecasting engine treated that earlier drop as just a day’s or a quarter’s data point, without considering the underlying cause. Complex event processing combined with other types of machine intelligence might have produced better results.

Celebrity’s Anonymous Pen Name ‘Outed’ by Software

JGAAP (Java Graphical Authorship Attribution Program)

The role that software plays in the stylistic analysis of text is perhaps less surprising to high school and college students than to the general public. Students must submit their essays to software that checks for plagiarism and sometimes also assesses quality.

In the recent outing of J.K. Rowling as the writer behind the pen name Robert Galbraith, it was mentioned that software had been used to analyze the text of the Galbraith novel. There exists a family of software used by academics for “authorship attribution,” e.g., to determine whether a recently discovered manuscript was a missing chapter of Don Quijote (a fabricated example). One of these applications is JGAAP, the Java Graphical Authorship Attribution Program. The JGAAP wiki page describes the project as

. . . Java-based, modular, program for textual analysis, text categorization, and authorship attribution i.e. stylometry / textometry. JGAAP is intended to tackle two different problems, firstly to allow people unfamiliar with machine learning and quantitative analysis the ability to use cutting edge techniques on their text based stylometry / textometry problems, and secondly to act as a framework for testing and comparing the effectiveness of different analytic techniques’ performance on text analysis quickly and easily. JGAAP is developed by the Evaluating Variation in Language Laboratory (EVL Lab) and released under the AGPLv3.

How this was accomplished was explained by Patrick Juola, one of two academic investigators credited with the analysis (along with some suspicions by reporters at the Sunday Times), in the blog Language Log. Juola refers to this subdiscipline as “forensic stylography.”

A one-paragraph extract from Juola’s blog post follows. Note that, in the usual sense of the word, the analysis doesn’t look directly at “meaning.”

The heart of this analysis, of course, is in the details of the word “compared.” Compared what, specifically, and how, specifically. I actually ran four separate types of analyses focusing on four different linguistic variables. While anything can in theory be an informative variable, my work focuses on variables that are easy to compute and that generate a lot of data from a given passage of language. One variable that I used, for example, is the distribution of word lengths. Each novel has a lot of words, each word has a length, and so one can get a robust vector of <X>% of the words in this document have exactly <Y> letters. Using a distance formula (for the mathematically minded, I used the normalized cosine distance formula instead of the more traditional Euclidean distance you remember from high school), I was able to get a measurement of similarity, with 0.0 being identity and progressively higher numbers being greater dissimilarity.
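
The word-length variable Juola describes can be illustrated with a short Python sketch. This is not JGAAP’s code or Juola’s actual procedure. The function names, the letters-only tokenizer, the 15-letter cap on word length, and the reading of “normalized cosine distance” as one minus cosine similarity are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def word_length_distribution(text, max_len=15):
    """Return a vector whose entry n-1 is the fraction of words with n letters.

    Assumptions (not from the source): words are runs of ASCII letters,
    and any word longer than max_len is counted in the last bucket.
    """
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return [0.0] * max_len
    counts = Counter(min(len(w), max_len) for w in words)
    total = len(words)
    return [counts.get(n, 0) / total for n in range(1, max_len + 1)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity: 0.0 for identical direction, larger = less alike."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cannot compare an empty document")
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 1.0 - dot / (norm_u * norm_v))


if __name__ == "__main__":
    # Hypothetical stand-ins for a disputed text and two candidate authors.
    disputed = "The detective walked slowly along the cold grey river."
    candidate_a = "The wizard walked quickly through the dark stone castle."
    candidate_b = "Notwithstanding considerable organisational difficulties, negotiations continued."
    target = word_length_distribution(disputed)
    for name, sample in [("A", candidate_a), ("B", candidate_b)]:
        print(name, cosine_distance(target, word_length_distribution(sample)))
```

In practice the samples would be whole novels rather than single sentences, and, as Juola notes, word length was only one of four variables he compared.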