How Things Work: Fifty Shades of a Search Engine

Steve Zheng, Contributor
Anton McGonnell, Technology Editor

What do the following have in common? The digital archive of the Baker Library, the endless emails from LinkedIn suggesting new connections, the “you may also like” section on Amazon, and the top three most-used mobile apps on your smartphone. One answer is that they are all powered by information retrieval systems, or search engines. Indeed, wherever there is an information need, there is most likely a special-purpose search engine running in the background. And we haven’t even mentioned the company that wants to “organize the world’s information” while promising “don’t be evil.”

In this How Things Work Harbus column, we shall together disentangle the mystical inner workings of a search engine from the perspectives of TOM and LCA. Search engines deliver computationally generated content, but a human comprehension of meaning is required to select appropriate information from the syntactically generated options. Therefore, we shall survey two main technical components of a modern information retrieval architecture: Intent Extraction and Semantic Ranking. We will then pull a mini “LCA in the news” by investigating how each technical component can either create or destroy value for its users … and for society.

Part 1: Intent Extraction

What is the underlying intent for the search term [snowden]? How about [truump universiity]? Extracting user intent from raw search terms has been a red-hot academic research topic in natural language processing and general artificial intelligence over the past decade. Business applications of intent extraction include auto-correction (rewriting [truump universiity] into [trump university]), auto-completion (rewriting [harvard business] automatically into [harvard business school] or [harvard business review]), part-of-speech tagging (identifying the nouns and verbs in [how long does an hbs exam normally take?]), and entity extraction (inferring that the “paris” in [paris net worth] might actually mean Paris Hilton).
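To make auto-correction and auto-completion concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how any commercial engine works: the query list KNOWN_QUERIES and the helper names auto_correct and auto_complete are hypothetical, and the fuzzy matching simply borrows difflib from the Python standard library in place of a model trained on real query logs.

```python
import difflib

# Hypothetical list of known queries; a real engine would mine these from logs.
KNOWN_QUERIES = [
    "trump university",
    "harvard business school",
    "harvard business review",
    "paris hilton net worth",
]

def auto_correct(query, known=KNOWN_QUERIES):
    """Return the closest known query, or the input unchanged if nothing is close."""
    matches = difflib.get_close_matches(query.lower(), known, n=1, cutoff=0.6)
    return matches[0] if matches else query

def auto_complete(prefix, known=KNOWN_QUERIES):
    """Return every known query that starts with the typed prefix."""
    prefix = prefix.lower()
    return [q for q in known if q.startswith(prefix)]

print(auto_correct("truump universiity"))  # trump university
print(auto_complete("harvard business"))   # both HBS-related completions
```

Even in this toy version, the quality of the result depends entirely on which queries are in the known list, which is where user behavior data comes in.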

A reliable intent extraction algorithm gives “emotional intelligence” to a search engine: its users will find the search engine empathetic, efficient, and personal. In an ideal world, the user no longer needs to worry about spelling mistakes, murky memories (e.g., vaguely recalling only fragments of lyrics when searching for a song), or being too lazy to type in the full-length search phrase.

In reality, however, intent extraction is an area troubled by technological and ethical challenges. The first challenge is ambiguity. When typing [fb], was the user searching for the home page of the social media website or the FB stock? When typing [snowden], was the user searching for the Snowden music band or the political dissident? Another challenge is aggressive query alteration for unverified topics. Ever since the 2016 presidential election, [the pope] is auto-completed into [the pope endorses donald trump] in Google Search and Microsoft Bing.

The punchline: intent extraction is no longer (and may never have been) a pure machine learning research topic confined to the ivory tower. Since intent extraction algorithms are “tuned” with massive user behavior data, search engines deduce the user intents (especially for the more ambiguous search terms) that maximize the likelihood of satisfying the information need of the aggregate user base. In other words, intent extraction models are dictated by the “popular votes” of the users, exhibiting a form of “data Darwinism.” As is the case for all forms of Darwinism, the existence of a moral compass is rarely a central theme when it comes to evolution. The evolution of intent extraction algorithms is no exception.
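The “popular vote” idea can be sketched in a few lines of Python. Everything here is illustrative: the click log CLICK_LOG is made-up toy data and majority_intent is a hypothetical helper, standing in for the far richer behavioral models that real engines train.

```python
from collections import Counter

# Made-up toy log: (search term, the intent the user ended up clicking on).
CLICK_LOG = [
    ("fb", "facebook home page"),
    ("fb", "facebook home page"),
    ("fb", "fb stock quote"),
    ("snowden", "edward snowden"),
    ("snowden", "edward snowden"),
    ("snowden", "snowden band"),
]

def majority_intent(term, log=CLICK_LOG):
    """Pick the intent with the most clicks for this term; None if the term is unseen."""
    votes = Counter(intent for logged_term, intent in log if logged_term == term)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

print(majority_intent("fb"))       # facebook home page
print(majority_intent("snowden"))  # edward snowden
```

Note what the sketch leaves out: nothing in it asks whether the winning intent is accurate or verified, only whether it is popular.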

Part 2: Semantic Ranking

Semantic ranking is what bridges user intent with the sea of content in the physical and digital worlds. For Google and Bing, the content is crawled from across the internet on a global scale. For Yelp search, the content is user-generated local business information and reviews. For Uber, the “content” is the live update of vehicle details given a user’s travel need and contextual information.

The success metrics for semantic ranking are dead simple, at least on the surface. Using internet search as an example, ranking effectiveness is measured by the semantic proximity of each web document to the original user intent, with higher weights assigned to the top-ranked documents. Given the search phrase [harvard business school], a good semantic ranking algorithm will likely rank the official HBS website on top, HBS Admissions in second place, and perhaps the HBS Wikipedia page in third place. A semantic ranking algorithm that ranks on top an Instagram photo of Section X’ers on a private jet is probably not as effective in addressing the information need of a typical internet user.
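One common metric with this shape is discounted cumulative gain (DCG), which divides each document’s relevance grade by a logarithm of its rank so that the top positions count the most. The column does not name a specific metric, so treat the Python sketch below as one plausible instance; the relevance grades are hypothetical numbers chosen to mirror the [harvard business school] example.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: sum of rel / log2(rank + 1), ranks starting at 1."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal ordering, so a perfect ranking scores 1.0."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical grades (3 = ideal result, 0 = irrelevant), listed in ranked order.
good_ranking = [3, 2, 1, 0]  # official site, admissions, Wikipedia, jet photo
bad_ranking = [0, 1, 2, 3]   # same pages with the jet photo ranked on top

print(ndcg(good_ranking))  # 1.0
print(ndcg(bad_ranking))   # noticeably lower than 1.0
```

The hard part, of course, is not the arithmetic but obtaining trustworthy relevance grades in the first place.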

Behind the simple success metric above looms a tremendous challenge: it is extremely difficult to measure the semantic proximity of an arbitrary web page to a user intent in a scalable way. A human reading the web page will take minutes, while commercial search engines are fielding tens of thousands of incoming search queries per second. To bypass the curse of scalability, the Googles, Bings, and Baidus of the world took a leap of faith. Instead of directly measuring semantic relevance, they measured the “clickability” of a web page and used that as a proxy indicator. If a user clicked through to a web page and stayed on that page for more than, say, 20 seconds, that web page was assumed to be meeting the intent of the user.
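The click-and-dwell proxy is easy to sketch in Python. The 20-second cutoff is the illustrative figure from the paragraph above; the page names, the toy impression log, and the helpers satisfied_click_rate and rank_by_clicks are all hypothetical, and production systems combine many more signals than this.

```python
from collections import defaultdict

DWELL_THRESHOLD_SECONDS = 20  # illustrative cutoff, not a known production value

def satisfied_click_rate(impressions):
    """Map each page to the share of its impressions that led to a long-dwell click.

    impressions: iterable of (page, clicked, dwell_seconds) tuples.
    """
    shown = defaultdict(int)
    satisfied = defaultdict(int)
    for page, clicked, dwell_seconds in impressions:
        shown[page] += 1
        if clicked and dwell_seconds > DWELL_THRESHOLD_SECONDS:
            satisfied[page] += 1
    return {page: satisfied[page] / shown[page] for page in shown}

def rank_by_clicks(impressions):
    """Order pages from highest to lowest satisfied-click rate."""
    rates = satisfied_click_rate(impressions)
    return sorted(rates, key=rates.get, reverse=True)

# Made-up toy log for a single search term.
LOG = [
    ("official-site", True, 45.0),
    ("official-site", True, 30.0),
    ("official-site", False, 0.0),
    ("jet-photo", True, 3.0),   # clicked, but the user bounced quickly
    ("jet-photo", False, 0.0),
]

print(rank_by_clicks(LOG))  # ['official-site', 'jet-photo']
```

The appeal is that nothing here requires anyone to read a page: the users’ own behavior does the grading.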

How good is such a click-based semantic ranking paradigm? On one hand, it is insanely scalable and generalizable: web pages clicked through by the last 100 users should indeed be more likely to meet the information need of the 101st user issuing the same search term. The end result is a small set of highly click-worthy content ranked in the top positions, with a long tail of web documents that may never be discovered by the user. On the other hand, a similar form of “data Darwinism” kicks in, incentivizing content creators to pursue clickability as the sole true north. Consequently, the race to ultimate virality becomes a one-way street. Higher content click-through rates -> higher digital advertising revenue for the search engine business -> employees get compensated handsomely for “successfully” addressing the search intent of the users -> even less incentive to incorporate non-click signals such as content authority and credibility into the ranking algorithm. As the world hails the arrival of the “semantic web” (a web that uses big data, metadata, and other descriptors to replicate human meaning more accurately), will click-worthiness eventually be trumped (no pun intended) by the intrinsic value of content?

What information theory does not tell us is that the human brain, on seeing a written word, does not process it in terms of its individual letters, but as a single pattern in itself. We associate meaning with these words and create links with other words, not based on similarity of pattern, but based on similarity of meaning. This is very difficult to do computationally, and it is an advantage that human description has over automatic indexing.

But what are the limits of search engines’ capacity to predict and determine human meaning? In 2008, Google co-founder Larry Page told us, “The ultimate search engine would basically understand everything in the world and it would always give you the right thing. And we’re a long, long way from that.” While we are much closer nine years later, search engines that intuitively understand exactly what we are looking for still seem a long way off, and they will undoubtedly transcend text-based search terms, which are still the prevalent tool for intent extraction. For those of us joining or rejoining the tech world after HBS, this evolution in search engines will undoubtedly be a problem worth tackling.

The search engine is the killer application of the internet era, and it has unlocked significant customer and societal value by enabling an efficient and personal way of discovering and consuming content. However, as Professor Cynthia Montgomery of HBS proposed, it is imperative that we guide technological advancements with human insight and a deep sense of responsibility to society. To the Class of 2017 and the HBS community: time has called upon us.

Steve Zheng (HBS ’18) was a program manager at Microsoft’s Silicon Valley office, driving product development in AI and Search. Born and raised in Shanghai, he studied computer science at the Institute down the Charles River.

Anton McGonnell (HBS ’18) is from Ireland and has spent his career to date in health tech and enterprise tech and has contributed to innovation policy in Northern Ireland. He is passionate about artificial intelligence and its application in the future. He plans to found a tech company after HBS and make trillions. He also wants to make Belfast the new startup epicenter of Europe.