31 December 2010
The term computationalism, borrowed from Jaron Lanier's 2010 book You Are Not a Gadget, fits well enough here that it merits some elaboration before continuing to the main topic.
Many other companies have used the same core algorithm and data structure that Google likely used originally, not for ranking pages but for tracking phrases in order to render suitable results for search terms.
This algorithm may be found in Chapter 8 of Paul Graham’s ANSI Common Lisp book. It’s a simple chain of word neighbors with additional meta-data to suit your particular needs. For tracking distance between words found on web pages, the data structure may be used as-is.
For extracting common phrases within a single document, the number of occurrences for each word pair must also be tracked.
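To make that concrete, here is a rough Python sketch of the idea rather than Graham's actual Lisp code: each word maps to its following neighbors, with a count on each pairing as the extra metadata needed for phrase extraction. The function names and the assumption of pre-tokenized documents are mine, purely for illustration.

    from collections import defaultdict

    def build_neighbor_index(documents):
        """Build a chain of word neighbors, counting occurrences of each pair.

        `documents` is assumed to be an iterable of already-tokenized word lists.
        The count per pairing is the extra metadata needed for phrase tracking.
        """
        index = defaultdict(lambda: defaultdict(int))
        for words in documents:
            for left, right in zip(words, words[1:]):
                index[left][right] += 1
        return index

    def common_phrases(index, min_count=2):
        """Yield word pairs seen at least `min_count` times in the corpus."""
        for left, rights in index.items():
            for right, count in rights.items():
                if count >= min_count:
                    yield (left, right, count)

Tracking distance between words, as for the web-page case above, only requires storing positions or offsets alongside the counts; the shape of the structure stays the same.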
Techniques for enriching results include the use of synonyms and sentence diagramming, the sort of exercise familiar to primary school students.
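As one small illustration of the synonym idea (a hypothetical sketch; a real system would draw on a proper thesaurus or WordNet-style resource rather than a hand-written table):

    # A toy synonym table; any real system would use a full thesaurus resource.
    SYNONYMS = {
        "car": {"automobile", "auto"},
        "fast": {"quick", "rapid"},
    }

    def expand_query(terms):
        """Expand each query term with its known synonyms, if any."""
        expanded = set()
        for term in terms:
            expanded.add(term)
            expanded.update(SYNONYMS.get(term, ()))
        return expanded

    # expand_query(["fast", "car"])
    # -> {"fast", "quick", "rapid", "car", "automobile", "auto"}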
The approach is simple enough. The key to success, however, is volume of content used for comparison.
This is significant because Google's core algorithms aren't the core of their business. That's usually the case with software-as-a-service models.
Their basic algorithms for searching and sorting produce rich enough results largely because of the volume of content in their pool. The true complexity lies instead in ancillary components like Bigtable and in leveraging that other Lisp notion: map/reduce.
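In spirit, map/reduce is just the old Lisp idiom of mapping a function over data and then folding the partial results together. A toy word-count sketch in Python, which has nothing to do with Google's actual distributed implementation, looks like this:

    from functools import reduce
    from collections import Counter

    def mapper(document):
        """Map step: emit a count of each word in one document (a string)."""
        return Counter(document.split())

    def reducer(total, partial):
        """Reduce step: fold partial counts into a running total."""
        total.update(partial)  # Counter.update adds counts rather than replacing
        return total

    def word_counts(documents):
        """Count words across all documents in map/reduce style."""
        return reduce(reducer, map(mapper, documents), Counter())

The hard part isn't this idiom; it's running it reliably across thousands of machines, which is where the real engineering lives.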
So while one may respect the magnitude of their genuine accomplishments, the results from their search engine fall outside of that.
Google’s search results are a paper tiger; the puppeteer behind them, however, has teeth and claws to contend with.
Consider a simple experiment to illustrate the point, one which artist Herbert Hoover of Harlem demonstrated in the early 1990s. Take video with fast cuts, such as a scene change approximately once per second; this could be as simple as an automated slide show. Next, add fast music such as 140 BPM techno, 120 BPM disco, or sufficiently complex salsa with rhythm layers beyond the usual structure. (Note: it’s important to compile the video before adding the music. Video editors normally begin with an audio click track precisely to time the video to the audio, but we’re after the opposite effect here.)
You’ll observe that for any slice of approximately 10 seconds, the video will appear to be synchronized to the music. Longer sampling will reveal a mismatch, but few will notice unless somebody points it out.
Key to understanding the effect is that a human (i.e., you) is in the loop. Most people today have relatively short attention spans when it comes to objectively tracking every variable, such as the exact duration of each video clip or the shifts in melody and rhythm throughout the audio track. You’re likely too caught up in the music or images to afford yourself that level of objectivity unless you’re trained in video editing, audio mixing, and the like.
This same effect becomes a factor when viewing internet search results. With a large enough sample of documents, relevance is almost trivial to produce when using the structure and code presented in one and a half pages of Graham’s book.
This becomes obnoxiously apparent when using that same algorithm with sufficiently small data sets.
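To make that concrete with the earlier sketch: a crude relevance signal might simply count how often the query's adjacent word pairs occur anywhere in the indexed corpus. (This scoring function and its name are mine, building on the hypothetical build_neighbor_index above, not anything from Graham's book.)

    def score_query(index, query_terms):
        """Crude relevance signal: total occurrences of the query's adjacent
        word pairs anywhere in the indexed corpus."""
        score = 0
        for left, right in zip(query_terms, query_terms[1:]):
            score += index.get(left, {}).get(right, 0)
        return score

With only a handful of documents, nearly every query ties at or near zero and the results look arbitrary; with millions of documents, frequent pairings dominate and the same trivial scoring starts to look intelligent.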
The larger issue is not the technical challenge but outmaneuvering those who seek to game the system. (This is aimed at you, SEO consultants!)
That is, all other factors being equal, internet search engines would be straightforward enough. (Note: This of course raises the question of what the heck AltaVista was thinking with their lame search results in the 1990s. Back then, it was far more productive to take an educated guess at a possible domain name and simply enter that into your browser than to waste time with their results. That approach served me well until a friend directed me to google.com in 1998 or early 1999. Thanks again, Duncan!)
Caveat emptor: whenever you see a phrase like “all other things being equal,” consider it a red flag, because that phrase completely discounts the human potential to abuse or otherwise exploit the system.
Fending off those who see everything as a loophole to be leveraged is part of the ugliness required for running a search engine today.
And this brings us to the point.
Computation only goes so far. It’s the people in the mix who matter.
It’s people who make the content that appears in search results. It’s people who communicate on the social network. It’s people who comment on a forum.
A company like Google would like you to believe that their algorithmic approach is superior to, say, Yahoo’s hierarchical catalog organized by human classification schemes.
The answer is that it depends on what you need at that particular moment, for that particular inquiry.
For some things, the algorithm may produce the best choice for you. Other times, the catalog will. It doesn’t have to be either-or, all-or-nothing forevermore.
Even “black or white” arguments tend to be just shades of gray in the end. (Note: stating the obvious, the human eye cannot perceive absolute black, so we point to various dark gray colors believing each to be “black,” and anyone who didn’t sleep through middle school science class knows that white is actually composed of a rainbow of colors.)
The conclusion here is that 1) fear no paper tigers if you have a better approach, which doesn’t even have to change the fundamental algorithm; 2) understand that building a better mousetrap isn’t always about the trap/algorithm; 3) people will always be a wild card, whether as your customer, partner, or competition.
On the first point, you might simply have a more cost-effective way of doing the same thing as someone else. Others say that you merely have to “suck less” than the current champion, but that’s a depressing perspective that only contributes to the sleaze factor when we’re capable of so much more.
Some prefer the remix approach, an idea often lost amidst the “mash-up” noise. In an earlier time, the results of such a new mix or synthesis would have led to a paradigm shift. That’s just another way of saying: change the rules.
The one who eventually overtakes Google in search won’t be merely another search engine.
Instead, search results will become part of something else. Not “something else” in the sense of offering different ad types alongside results, but perhaps results within hierarchies of context. Of course, this is already underway at various companies…
For clues on how to get there, look at things differently.
Apply objectivity, such as through the skills of the video editor who would see through the accidental timing of a dance track dumped onto the random clips of our exercise above.
The real guidance here: if you’re a programmer, try ballroom dancing. If a dancer, try astronomy. If an astronomer, write poetry. If a writer, take up ice skating. The counterpoint to what you ordinarily do becomes priceless by offering another point of view on your daily routine.