The Three Pillars of Social Reader Relevancy (I)

In web search, the ranking of results is primarily determined by their freshness, their relevancy to the search query, and their content quality. Freshness needs little explanation. Relevancy is an approximation of how much of a web site’s data has something to do with the user’s query. Content quality is an indication of how “good” the site’s information is, given factors like PageRank, spam scores and so on.

Once we cross the line into the mobile world, however, these three factors lose much of their usefulness. Text input on mobile devices is largely impractical and traditional web pages don’t render well, so media discovery and consumption on mobile devices is generally inferior to the same experience with printed media like newspapers and magazines. While big-name players attempt to tackle the issue simply by snapping on extra features (Google Mobile Voice Search, Google Instant Preview for Mobile, etc.), the underlying problems remain unresolved because the ranking algorithm is the same as its desktop counterpart. Flipboard is so great because, in my opinion, it has found and defined the new three pillars of relevancy for mobile content consumption: freshness, social, and readability*. And they work wonderfully.

With this in mind, I put some work into the server components of Cassius. What started as a simple script that turns a Tweet into a JSON feed is now a pipeline that also saves documents into a transitional store (MongoDB) and runs a series of quality measurement calculations. The extra processing means we won’t be able to serve the feed in realtime, but I hope the results justify the cost.

How well does your article read?

In Zite or Flipboard, it’s not uncommon to run into articles with summary texts that resemble gibberish (see below). The issue is often the result of raw HTML elements being incorrectly identified as meaningful content, and is very hard to avoid. I have seen attempts to solve this problem using NLP and machine learning classification methods, with varying degrees of success. Since those are beyond my capabilities, I opted for a more traditional way to measure the quality of a piece of writing: taking its readability metrics. According to Wikipedia, readability evaluation refers to “the ease in which text can be read and understood”, and “…various factors to measure readability have been used, such as speed of perception, perceptibility at a distance, perceptibility in peripheral vision, visibility, the reflex blink technique, rate of work, eye movements, and fatigue in reading…”. Tools that measure readability metrics are widely available, and are embedded in word processors and email clients.

source: corporategeek.info

results of bad scraping

In a nutshell, the tools apply different statistical formulas to a piece of English writing, and the resulting scores form an impression of how understandable it is. The formulas typically break text into syntactic components such as words and sentences and count their distribution or frequency in the text being analyzed. The most common readability formulas and descriptions are given below:
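
As one concrete illustration, the Flesch Reading Ease formula scores a text as 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word), with higher scores meaning easier reading. The Python sketch below is only a rough approximation of what a real tool does: the function names are hypothetical, and the syllable counter is a crude vowel-group heuristic assumed for illustration rather than a dictionary lookup.

```python
import re


def count_syllables(word):
    """Crude estimate (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    # Split on sentence-ending punctuation and ignore empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

Scraped gibberish tends to produce very long "sentences" and odd tokens, which is why a score far from the expected range can hint at a bad extraction.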

I find it more pleasing to read blog posts and articles on Flipboard/Zite that are about a page in length. Content that spans multiple pages is too demanding for a casual read, while short tweets or one-liners aren’t worth the two-click effort to expand and shrink them from the page (yes, really). For simplicity, let’s take my reading habits as the standard, and use the following thresholds for computation:

  • Flesch: 50 (Time magazine has a score of about 52)
  • Flesch-Kincaid: 13 (pre-college level)
  • Gunning Fog: 12 (texts for a wide audience have a fog index of less than 12)
  • SMOG: 13 (pre-college level)
  • Coleman-Liau: 13 (pre-college level)
  • ARI: 13 (pre-college level)
The gist here takes the readability scores, evaluates their distances from the thresholds and combines them as a mean. Very straightforward.
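
For readers without access to the gist, the idea can be sketched roughly as follows in Python. This is not the original code: the function name, the dictionary keys, and in particular the choice to divide each distance by its threshold (so that the 0-100 Flesch scale and the grade-level scales contribute comparably) are assumptions made for illustration.

```python
# Thresholds taken from the list above.
THRESHOLDS = {
    "flesch": 50.0,
    "flesch_kincaid": 13.0,
    "gunning_fog": 12.0,
    "smog": 13.0,
    "coleman_liau": 13.0,
    "ari": 13.0,
}


def readability_distance(scores, thresholds=THRESHOLDS):
    """Mean relative distance of each score from its threshold.

    0.0 means every score sits exactly on its threshold; larger values
    mean the text is further from the preferred reading level.
    """
    missing = set(thresholds) - set(scores)
    if missing:
        raise KeyError(f"missing scores: {sorted(missing)}")
    distances = [
        abs(scores[name] - target) / target
        for name, target in thresholds.items()
    ]
    return sum(distances) / len(distances)
```

An article whose scores all sit on the thresholds gets 0.0, so sorting candidates by this value in ascending order would put the most comfortable reads first.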

In the next post we’ll continue to explore the three pillars, and look at some test results to see whether the additional aspect of readability would help us create a feed that is better optimized for the user’s final reading experience.


Knitting a page

When I set out to build the prototype, there were many things in the design I considered fundamental, chief among them a template system flexible enough that no re-installs or updates are necessary when a new page layout combination is desired.

References on the topic are plentiful, but surprisingly the most useful one I came across was a paper published in 1977 by MIT, titled “Computer Assisted Layout of Newspapers”. You can find the full 184 pages here. The paper is a gem to read and even goes into detail on how ad and picture layouts could be automatically assigned to a theoretical newspaper page. I shall definitely return to it for more inspiration, but so far I have based the design of the prototype on Chapter 6, A Symbolic Graphics Language For News Layout.

The diagram below, lifted from P.84 of the paper, tells it all. Pages on Flipboard largely employ a combination of rows and columns, and the powerful template language described in the paper should be able to cover all variations effortlessly.

a simple yet powerful layout language

Note that I cheated a little and defined my version of the template language in JSON, mainly for easier parsing in Objective-C.

Therefore,
P1 || (S1 = S2) is represented with {"columns": [{"type":"P1"}, {"rows":[{"type":"S1"}, {"type":"S2"}]}]} in my app,

and

S3 || (S4=(S5 || S6 || S7)) becomes {"columns":[{"type":"S3"}, {"rows":[{"type":"S4"}, {"columns":[{"type":"S5"}, {"type":"S6"}, {"type":"S7"}]}]}]}.
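
The mapping from the symbolic notation to JSON is mechanical enough to sketch as a tiny recursive-descent parser. The Python below is a stand-in for illustration only (the app itself is written in Objective-C and, as far as this post says, takes the JSON directly). All function names are hypothetical, and the rule that mixing || and = at the same level requires parentheses is an assumption, since no operator precedence is stated here.

```python
import re

TOKEN = re.compile(r"\s*(\|\||=|\(|\)|\w+)")


def tokenize(text):
    """Split a layout expression into names, '||', '=', '(' and ')'."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_term(tokens):
    """A term is either a block name or a parenthesised expression."""
    if not tokens:
        raise ValueError("unexpected end of expression")
    token = tokens.pop(0)
    if token == "(":
        node = parse_expr(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise ValueError("missing closing parenthesis")
        return node
    if token in ("||", "=", ")"):
        raise ValueError(f"unexpected token {token!r}")
    return {"type": token}


def parse_expr(tokens):
    """'||' places terms side by side (columns); '=' stacks them (rows)."""
    items = [parse_term(tokens)]
    operator = None
    while tokens and tokens[0] in ("||", "="):
        current = tokens.pop(0)
        if operator is not None and current != operator:
            raise ValueError("mixing || and = requires parentheses")
        operator = current
        items.append(parse_term(tokens))
    if operator is None:
        return items[0]
    return {"columns" if operator == "||" else "rows": items}


def symbolic_to_template(text):
    """Convert e.g. 'P1 || (S1 = S2)' into the nested dict form."""
    tokens = tokenize(text)
    node = parse_expr(tokens)
    if tokens:
        raise ValueError(f"unexpected trailing token {tokens[0]!r}")
    return node
```

Calling symbolic_to_template("P1 || (S1 = S2)") yields a dict with the same structure as the first JSON example above.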

With a structure like this, we can simply parse the JSON into multi-dimensional arrays (e.g. {"P1", {"S1", "S2"}}), then write classes to traverse the array and return suitable UIViews or collections of UIViews. Only two-level nesting is supported in the code right now.
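
To show the shape of that conversion, here is a small Python sketch (again a stand-in for the Objective-C client, not the app’s code). The function name to_nested is hypothetical, and keeping a leading "columns"/"rows" tag in each list is an assumption added here so that the direction is not lost. Because it recurses, it is not limited to two levels of nesting.

```python
import json


def to_nested(node):
    """Convert a template node into nested lists.

    Leaves become their type string; containers become a list whose
    first element is the direction ("columns" or "rows").
    """
    if "type" in node:
        return node["type"]
    if "columns" in node:
        return ["columns"] + [to_nested(child) for child in node["columns"]]
    if "rows" in node:
        return ["rows"] + [to_nested(child) for child in node["rows"]]
    raise ValueError(f"unrecognised template node: {node!r}")


template = json.loads(
    '{"columns": [{"type":"P1"}, {"rows":[{"type":"S1"}, {"type":"S2"}]}]}'
)
nested = to_nested(template)
# nested is ["columns", "P1", ["rows", "S1", "S2"]]
```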

The UIView generation process itself is just as crude at present. While looping through the array, the type of each stored value is examined, and if it’s a definition like "P1" or "TIA", a helper class creates the corresponding UIView, with the article itself and attributes like the size of the array passed in as arguments for presentation purposes. All of this takes place in the PageLayoutManager class. A whole lot more work will be put in around these classes.
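
The traversal can be illustrated without any UIKit at all by computing a rectangle for each leaf. The Python sketch below is not what PageLayoutManager does; it assumes that siblings split their parent’s space equally, and the function name layout and the 768 × 1024 page size are illustrative choices.

```python
def layout(node, x, y, width, height):
    """Return (type, x, y, width, height) for every leaf in a template.

    Assumption: children of a "columns" node share the width equally,
    and children of a "rows" node share the height equally.
    """
    if "type" in node:
        return [(node["type"], x, y, width, height)]
    if "columns" in node:
        children, horizontal = node["columns"], True
    elif "rows" in node:
        children, horizontal = node["rows"], False
    else:
        raise ValueError(f"unrecognised template node: {node!r}")

    frames = []
    if not children:
        return frames
    step = (width if horizontal else height) / len(children)
    for index, child in enumerate(children):
        if horizontal:
            frames += layout(child, x + index * step, y, step, height)
        else:
            frames += layout(child, x, y + index * step, width, step)
    return frames


page = {"columns": [{"type": "P1"}, {"rows": [{"type": "S1"}, {"type": "S2"}]}]}
frames = layout(page, 0, 0, 768, 1024)
```

Each returned tuple carries everything a view factory would need: which kind of view to build and where to put it. A real client would presumably weight the splits rather than divide evenly.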

I’m hoping that more help from the server side can be used for both template definition and article selection. Analysis of word count, images in the article, source authority, social signals and other relevancy factors should already have been taken into account by the time these articles and templates arrive at the client app.

Finally, here’s the template used for generating the pages shown in the first video. There are four pages altogether, with pages 1 and 4 being row-based and pages 2 and 3 column-based. These layout designs are quite similar to the ones used heavily on tweets-display pages on Flipboard.

{"pages":[
{"rows":[{"type":"TIA"},{"columns":[{"type":"TIA"},{"type":"TIA"},{"type":"TIA"}]}]},
{"columns":[{"type":"TIA"},{"rows":[{"type":"TIA"},{"type":"TIA"}]}]},
{"columns":[{"type":"TIA"}]},
{"rows":[{"type":"TIA"}, {"type":"TIA"}]}
]}




The rendering:

page 1 - row 1 is article, row 2 three columns of articles.
page 2 - column 1 is article, column 2 is 2 rows of articles.
page 3 - 1 column, 1 article.
page 4 - 2 rows of articles
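
One practical use of a template like this is working out how many articles a set of pages needs before any content is chosen. The Python sketch below counts the leaf slots on each page; count_slots is a hypothetical helper, not something taken from the Cassius code.

```python
import json

TEMPLATE = json.loads("""
{"pages":[
{"rows":[{"type":"TIA"},{"columns":[{"type":"TIA"},{"type":"TIA"},{"type":"TIA"}]}]},
{"columns":[{"type":"TIA"},{"rows":[{"type":"TIA"},{"type":"TIA"}]}]},
{"columns":[{"type":"TIA"}]},
{"rows":[{"type":"TIA"}, {"type":"TIA"}]}
]}
""")


def count_slots(node):
    """Count the leaf blocks (article slots) under a template node."""
    if "type" in node:
        return 1
    children = node.get("columns") or node.get("rows") or []
    return sum(count_slots(child) for child in children)


slots_per_page = [count_slots(page) for page in TEMPLATE["pages"]]
total_slots = sum(slots_per_page)
```

For the four pages above this gives 4, 3, 1 and 2 slots, ten articles in total, which matches the rendering described page by page.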

Remote or Local?

A colleague pointed out that the template definitions must be stored locally on the client app, as the app shouldn’t need to fetch a new template from the server when the device changes orientation. I hadn’t thought about that. To me, it makes more sense to have the server pick templates that are better suited to the content being served. I haven’t even started thinking about how to deal with landscape orientation yet.


Cassius is on github

Finally got round to putting together a decent enough client! The code is pretty rough right now and the app crashes after a while, but at least it’s a start, innit?!

Repos:

The article pages were generated by a custom template that’s defined in JSON, and the image on the cover page is grabbed from Instagram’s API:

The iconic page flip effects were lifted straight from AFKPageFlipper. Thanks again Marco!
