The goal of any content-driven website is to provide its users with relevant content through information discovery. Most websites choose ‘search’ as the tool for information discovery (Netflix is a mind-blowing exception to this), and in search, relevance depends heavily on context rather than just on matching strings. Not many websites realise this, and I was surprised to see HBR.org, known for having one of the best content repositories, giving barely useful results for my search queries.
Now, I am a happy subscriber of Harvard Business Review and use it often to enhance my limited knowledge by discovering interesting content on technology, business, leadership and so on. I have absolutely no doubt about the high quality of its content or its really great contributors. But I am disappointed at how difficult that content is becoming to discover in the first place. Here is what its search looks like:
All looks well: it has a very familiar interface, like most other sites, and the first result does contain the word I searched for, “search”. Don’t go away yet, stay with me till the end 🙂
In my quest to learn and follow everything about Artificial Intelligence, I keep looking everywhere for anything related to AI. So I thought, let’s see what the HBR.org repository has on AI. I first searched for artificial intelligence and looked at the results:
The first result is a “Sales and Marketing” case study about an early-stage company, Empathetics (an organization that teaches empathy to healthcare professionals and staff to improve the patient experience), and I wondered why that was the top result (when sorted by relevance!) for what I searched for. I opened it and scratched my head really hard trying to figure out what exactly was related to AI there, but could not find anything. I scrolled down and, to my despair, the other results made no sense to me either. Here they are:
Then I thought of trying the ‘exactness’ trick – I searched for “Artificial Intelligence” and boom!
ZERO results! Apparently it does not support exact phrase search, which it ideally should. I knew the zero could not be right, since HBR.org does have articles on AI. Examples:
So the search returns irrelevant results even though the right content is very much there in the repository.
Moving on, I took my chances with the abbreviation and searched for ‘AI’. Here is what happened:
Do you notice it? There is an author, Ai-Ling Jamila Malone, whose name contains “AI”, and HBR.org is simply showing me all the articles by that author. This means it is (probably) giving a higher score to a match found in the wrong field (the author field) than to a match in the content itself, and that too without any context.
Now, this could have been done deliberately, on the assumption that most people want to search for author names, but hey, that use case CAN be handled in a better way.
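To make the "wrong field" guess concrete, here is a minimal Python sketch of one possible alternative: match the query against whole tokens only and give each field its own weight. I do not know how HBR.org actually ranks results; the field weights, the sample documents and the author "Jane Doe" are all made up for illustration.

```python
import re

# Hypothetical field weights: a match in the title or body counts for more
# than a match in the author field. The real HBR.org ranking is not public.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0, "author": 0.5}

def tokens(text):
    """Lowercase word tokens; a hyphenated name such as 'ai-ling' stays one token."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def score(doc, query):
    """Add up the weight of every field that contains the query as a whole token."""
    q = query.lower()
    return sum(
        weight
        for field, weight in FIELD_WEIGHTS.items()
        if q in tokens(doc.get(field, ""))
    )

# Made-up documents, only the first author name comes from the search results above.
docs = [
    {"title": "Leading a remote team", "body": "Advice for managers.",
     "author": "Ai-Ling Jamila Malone"},
    {"title": "What AI means for strategy", "body": "AI is changing planning.",
     "author": "Jane Doe"},
]

ranked = sorted(docs, key=lambda d: score(d, "AI"), reverse=True)
```

With whole-token matching, "AI" no longer matches inside "Ai-Ling", so the article that is actually about AI comes first. Author search could still be offered separately, for example through an explicit author filter.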
Moving further, I checked another hot topic, Deep Learning, and the default results (sorted by relevance) appear to be relevant (note that it now shows me results for artificial intelligence as well).
But the articles were old and I needed the latest ones. The moment I sorted by publication date, I saw this:
The first result is https://hbr.org/2018/01/the-5-things-your-ai-unit-needs-to-do which may seem somewhat relevant given it has AI in its title, but it is not even remotely related to “Deep Learning”. The only reason it appears in the search results is that the word “deep” occurs in one paragraph somewhere and the word “learning” occurs in another.
The other results are the same. This one tops the chart of weirdness for me:
https://hbr.org/product/ch%C3%A2teau-margaux-launching-the-third-wine-abridged/518070-PDF-ENG
They have the words “deep” and “learning” somewhere, and hence they are shown. An important point to consider: I want the latest but still relevant results, and HBR.org fails at that. It does not identify ‘deep learning’ as a concept made of two words and simply looks up the words appearing anywhere in the content.
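The difference between "both words appear somewhere" and "the two words form one phrase" is easy to show in a few lines of Python. This is only a sketch of the idea, not a description of HBR.org's engine; the sample texts and dates are invented.

```python
import re

def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_any_order(text, query):
    """Bag-of-words match: every query word occurs somewhere in the text."""
    words = set(tokens(text))
    return all(q in words for q in tokens(query))

def matches_phrase(text, query):
    """Phrase match: the query words occur next to each other, in order."""
    t, q = tokens(text), tokens(query)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))

# Invented sample documents; ISO dates sort correctly as plain strings.
docs = [
    {"published": "2018-01-15",
     "text": "A deep dive into the estate. The team kept learning from each vintage."},
    {"published": "2017-06-01",
     "text": "Deep learning models need large labelled datasets."},
    {"published": "2016-03-10",
     "text": "Why deep learning changed speech recognition."},
]

def latest_relevant(docs, query):
    """Keep only phrase matches, then sort them newest first."""
    hits = [d for d in docs if matches_phrase(d["text"], query)]
    return sorted(hits, key=lambda d: d["published"], reverse=True)
```

The wine text passes `matches_any_order` but fails `matches_phrase`, so `latest_relevant(docs, "deep learning")` returns only the two texts that are really about the topic, newest first. Sorting by date after filtering for relevance is what I expected from the site.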
The root causes for all of it can be summarised as follows:
- The search at HBR.org still relies on basic keyword-based scoring and has no Ontology of concepts like “Artificial Intelligence” or “Deep Learning”, or of the relationships between those concepts.
- It does not account for synonyms and hence is unable to understand that “Artificial Intelligence” and “AI” are the same concept. An Ontology makes it much easier to maintain all the synonyms of any given concept.
- It does not identify entities, so it is unable to differentiate between the name of a person, “Ai-Ling Jamila Malone”, and a concept, “AI”.
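As a rough illustration of the three points above, here is a toy Python sketch of a synonym-aware ontology lookup combined with a tiny entity dictionary. It is a deliberately simplified assumption of how such a lookup could work, not how any particular product does it; the `ONTOLOGY` and `PERSONS` tables are hypothetical and hand-written.

```python
# Toy ontology: canonical concept -> surface forms (synonyms).
# These entries are illustrative assumptions, not a real vocabulary.
ONTOLOGY = {
    "artificial intelligence": {"artificial intelligence", "ai"},
    "deep learning": {"deep learning"},
}

# Toy entity dictionary of known person names.
PERSONS = {"ai-ling jamila malone"}

def resolve_concept(query):
    """Map a query to its canonical concept, or None if it is not in the ontology."""
    q = query.strip().lower()
    for concept, forms in ONTOLOGY.items():
        if q in forms:
            return concept
    return None

def classify(query):
    """Decide whether a query names a person, a known concept, or is a plain keyword."""
    q = query.strip().lower()
    if q in PERSONS:
        return ("person", q)
    concept = resolve_concept(q)
    if concept is not None:
        return ("concept", concept)
    return ("keyword", q)
```

Here `classify("AI")` and `classify("Artificial Intelligence")` both resolve to the same concept, while `classify("Ai-Ling Jamila Malone")` is recognised as a person, so each kind of query can be routed to the right field.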
Based on how we have designed information discovery for our Data as a Service (DaaS) platform iPlexus.ai, I can say that the primary cause of all these problems is the missing Ontology, which in turn leads to missing Entity Recognition and disambiguation.
Search is important, but in itself it is not always the best way to discover information. Users are not happy to see millions of results for what they search; that only adds to information overload. What matters is this: if you are telling me there are so many possible results out there, then tell me how they are distributed across the different dimensions of my ‘interest’. Let me choose a direction, and don’t force me to go through all the results in a linear fashion; nobody will live long enough to reach the millionth results page. It is not just Innoplexus: I know there are a few other companies out there following this philosophy and making Information Discovery easier for their users. One example, which I also mentioned above, is Netflix.
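Showing how results are distributed across dimensions is usually called faceting. A minimal Python sketch, assuming each result carries a few categorical fields (the field names and records below are invented for illustration):

```python
from collections import Counter

# Invented search results; 'topic', 'year' and 'type' are hypothetical dimensions.
results = [
    {"title": "Result A", "topic": "Technology", "year": 2018, "type": "Article"},
    {"title": "Result B", "topic": "Technology", "year": 2017, "type": "Case study"},
    {"title": "Result C", "topic": "Leadership", "year": 2018, "type": "Article"},
]

def facet_counts(results, dimensions):
    """Count how many results fall under each value of each dimension."""
    return {dim: Counter(r[dim] for r in results) for dim in dimensions}

def narrow(results, dimension, value):
    """Let the user pick a direction: keep only results with that facet value."""
    return [r for r in results if r[dimension] == value]

facets = facet_counts(results, ["topic", "year", "type"])
technology_only = narrow(results, "topic", "Technology")
```

Instead of a flat list of three (or three million) hits, the user first sees counts per topic, year and type, and can then narrow in whichever direction interests them.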
Thanks to Ravi Ranjan for reviewing the article and helping with the headline 🙂