Knowledge graphs hit the peak of their Gartner hype cycle in 2020. And much as KG teams celebrated the rise to prominence, “hype” carries an implicit risk: what if the touted promise never materializes? After all, network-based databases (including graph DBs) have been in use since the 1960s, though largely confined to academic settings. So what’s working for graph adoption today?
For starters, what do we mean by the “knowledge” in knowledge-as-a-service?
In common parlance, knowledge is understood as information plus context. This is one of the roots of the term “knowledge graph.” While a graph is simply a database structure composed of nodes (entities) and edges (relationships), knowledge graphs are populated so that they provide both information and context.
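As a minimal sketch of that structure, the snippet below builds a tiny graph in Python with networkx. The entities, relationship names, and attributes are invented purely for illustration, not drawn from any particular product.

```python
import networkx as nx

# A knowledge graph is a set of entities (nodes) connected by typed
# relationships (edges); attributes on both supply the "context."
kg = nx.MultiDiGraph()

# Entities carry a type and descriptive attributes (all illustrative).
kg.add_node("acme_corp", type="Organization", industry="Robotics")
kg.add_node("jane_doe", type="Person", title="CEO")
kg.add_node("widget_x", type="Product", launched=2021)

# Edges are the facts: a subject, a relationship, and an object.
kg.add_edge("jane_doe", "acme_corp", relation="works_for")
kg.add_edge("acme_corp", "widget_x", relation="manufactures")

# "Information plus context": traverse from a person to the products
# their employer makes.
for _, org, data in kg.out_edges("jane_doe", data=True):
    if data["relation"] == "works_for":
        for _, product, d in kg.out_edges(org, data=True):
            if d["relation"] == "manufactures":
                print(f"jane_doe -> {org} -> {product}")
```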
Jumping to the present day, knowledge-as-a-service has grown an ecosystem of players providing services at a range of layers:
Build your own knowledge graph:
- Layer one knowledge-as-a-service: graph database management systems
- Layer two knowledge-as-a-service: graph analytics and extraction; query languages and processing engines
- Layer three knowledge-as-a-service: visualization and analysis tools; integrations with BI tools
Utilize an existing knowledge graph:
- Full service knowledge-as-a-service: knowledge graph APIs, visual search and integrations
It’s also worth noting that these two categories aren’t mutually exclusive. Plenty of companies seed, enrich, or update knowledge graphs they’ve built with existing external knowledge graphs. Additionally, existing knowledge graphs are a particularly well-suited format for giving machine learning models context, and those models can in turn help augment or populate internal knowledge graphs with additional (custom) fields.
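The hybrid pattern might look like the sketch below: an internal graph is seeded from an internal source and then enriched from an external knowledge graph. The fetch_external_entity function and the fields it returns are hypothetical placeholders, not any specific vendor’s API.

```python
import networkx as nx

def fetch_external_entity(name: str) -> dict:
    """Hypothetical stand-in for an external knowledge graph lookup.
    A real integration would issue an HTTP request and map the response."""
    return {"employees": 1200, "industry": "Logistics", "founded": 1998}

# Seed the internal graph from an internal source (e.g., a CRM export).
internal_kg = nx.MultiDiGraph()
internal_kg.add_node("northwind_ltd", type="Organization", source="crm")

# Enrich each organization with fields from the external graph.
for node, attrs in list(internal_kg.nodes(data=True)):
    if attrs.get("type") == "Organization":
        external = fetch_external_entity(node)
        # Keep provenance so enriched fields can be traced back later.
        internal_kg.nodes[node].update(
            {f"ext_{k}": v for k, v in external.items()},
            enriched_from="external_kg",
        )

print(internal_kg.nodes["northwind_ltd"])
```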
Whether building, utilizing an external service, or pursuing a hybrid approach, most knowledge projects must grapple with two high-level issues. The first stems from the fact that meaningful knowledge graphs are often of considerable scale: the price (and corresponding process) of fact accumulation. The second involves the actionable output: how do we get existing knowledge in front of the right person at the right time (knowledge workflows)?
The Cost of Fact Accumulation
As mentioned previously, knowledge graphs have been utilized in academic and research settings for a number of decades (beginning in the 1970s). While this has been useful for sketching out the theoretical strengths and weaknesses of this form of data modeling, few academic knowledge graphs were large enough to be useful in commercial settings.
As our ability to quickly store and retrieve data increased exponentially, several enterprise knowledge base companies attempted to build databases of a size useful for a wide range of enterprise applications. In 1984, the long-term Cyc project was launched. The aim: create a knowledge base large enough to remedy the brittleness of AI systems that show initial promise but falter for lack of ongoing data sources. Cyc was able to compile around 21M fields manually (and at great cost).
In the 1990s, with the proliferation of the internet, (unstructured) data was no longer in short supply, growing exponentially year after year. But unstructured web data doesn’t equate to useful information (let alone “knowledge,” particularly on topics of enterprise interest). Metaweb attempted to remedy this by crowdsourcing knowledge. Building off of Wikipedia, its Freebase project grew to roughly 100x Cyc’s field count before being acquired and folded into Google.
Together these two enterprise-focused projects exemplified both the promise of enterprise knowledge graphs and the shortcomings of manual fact accumulation. Because there’s a relatively hard cap on how quickly humans can curate facts, there’s a hard floor on how low the price of manual fact accumulation can drop: you need more people or more time, both of which mean more money.
Today the internet has grown significantly, and we do have examples of massive knowledge bases compiled by humans (IMDb, Yelp). But they tend to cover topics of widespread interest and shared knowledge: another dead end for large-scale manual knowledge graphs for enterprise use. So how do we go about structuring enterprise knowledge efficiently today?
Powerful web crawlers, natural language processing, and machine vision, with machine learning layered on top, have massively increased the size of enterprise-use knowledge bases at a fraction of the price. A huge range of useful enterprise data is already available online and in the news; it simply needs to be structured.
Among general use knowledge graphs, GDELT and Diffbot provide the largest coverage areas. Where GDELT automates the organization of facts from worldwide newspapers, Diffbot crawls the entire web and pulls out entities and facts.
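To make the extraction step concrete, the sketch below uses spaCy’s named entity recognizer and a deliberately naive co-occurrence heuristic to turn one sentence into graph edges. This is an illustrative stand-in, not a description of how GDELT or Diffbot actually operate, and the sample sentence is invented.

```python
# Assumes: pip install spacy networkx
#          python -m spacy download en_core_web_sm
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

text = "Acme Corp hired Jane Doe as CEO after its acquisition of Widget Inc."
doc = nlp(text)

kg = nx.MultiDiGraph()

# Named entities become nodes, labeled with their entity type.
for ent in doc.ents:
    kg.add_node(ent.text, label=ent.label_)

# Naive relation extraction: link entities that share a sentence and
# record the source sentence as crude provenance.
for sent in doc.sents:
    ents = list(sent.ents)
    for i, a in enumerate(ents):
        for b in ents[i + 1:]:
            kg.add_edge(a.text, b.text, relation="co_occurs_with",
                        evidence=sent.text)

print(list(kg.edges(data=True)))
```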
Natural language processing pushed into layers one through three (listed earlier in this article) also enables enterprises to pick and choose where to source their knowledge graphs. The following sources are commonly parsed into enterprise knowledge graphs (see the sketch after this list):
- Firmographic data
- Technographic data
- Internal documents
- News or social media mentions
- Product catalogs
- Review data
- Real world events
- Customer data
- Prospect data
- Investigative data
- Supply chain data
- Among others
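A minimal sketch of combining several of these sources into one graph follows. Every record, field, and date below is invented for illustration; the point is that tagging each fact with its source makes it possible to filter by origin when freshness or trust matters.

```python
import networkx as nx

kg = nx.MultiDiGraph()

# Firmographic record (illustrative fields).
kg.add_node("acme_corp", type="Organization", employees=5400,
            hq_country="DE", source="firmographic_feed")

# Technographic signal: which tools a company is known to run.
kg.add_node("postgresql", type="Technology")
kg.add_edge("acme_corp", "postgresql", relation="uses",
            source="technographic_feed")

# News mention tied to a real-world event (hypothetical).
kg.add_node("acme_expands_apac", type="Event", date="2023-04-12")
kg.add_edge("acme_corp", "acme_expands_apac", relation="subject_of",
            source="news_crawl")

# Because every edge carries a `source`, a team can filter facts by
# where they came from.
news_facts = [(u, v, d) for u, v, d in kg.edges(data=True)
              if d.get("source") == "news_crawl"]
print(news_facts)
```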
While still a work in progress in many hard problem areas, AI has proven to be the first step change in the cost of fact accumulation in modern times. For example, comparing Diffbot (the largest commercial knowledge graph provider) to Cyc or Freebase shows a cost-per-fact improvement of roughly 2,500x.
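That ratio is easiest to see as simple division. The dollar amounts and the automated fact count below are illustrative placeholders chosen only to make the arithmetic concrete; they are not reported figures for Cyc, Freebase, or Diffbot.

```python
# Cost per fact is just total spend divided by facts produced.
# All inputs here are illustrative assumptions, not reported figures.
def cost_per_fact(total_cost_usd: float, facts: int) -> float:
    return total_cost_usd / facts

manual = cost_per_fact(total_cost_usd=120_000_000, facts=21_000_000)       # hand curation
automated = cost_per_fact(total_cost_usd=10_000_000, facts=4_400_000_000)  # crawler + ML

print(f"manual:    ${manual:.4f} per fact")
print(f"automated: ${automated:.6f} per fact")
print(f"improvement: ~{manual / automated:,.0f}x")
```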
From Knowledge Graphs to Knowledge Workflows
One of the main issues that persists even after organizations have well-populated and topical knowledge graphs is a failure to think in terms of knowledge workflows. Much like the often-touted downsides of data lakes, knowledge graphs can fall prey to accumulating huge amounts of information that never gets in front of the right team members at the right time.
As with any other data store, the accumulation of a “larger” knowledge graph can be more of a liability than an asset:
- Data becomes stale
- Data “islands” emerge
- No understanding of overall data structure
- Messy data provenance
The real benefits of knowledge graphs surface when high volume and velocity data can be structured in an automated way to solve data gathering issues for discrete teams.
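One concrete shape a knowledge workflow can take is a scheduled job that pulls only the facts relevant to one team and pushes them to where that team already works. The relation name, webhook URL, and attribute fields below are hypothetical, and the scheduler (cron, Airflow, etc.) is left to the reader.

```python
import datetime
import requests
import networkx as nx

TEAM_WEBHOOK = "https://chat.example.com/hooks/sales-team"  # hypothetical endpoint

def fresh_facts_for(kg: nx.MultiDiGraph, relation: str, max_age_days: int = 7):
    """Yield facts of one relation type observed within the freshness window."""
    cutoff = datetime.date.today() - datetime.timedelta(days=max_age_days)
    for subj, obj, data in kg.edges(data=True):
        observed = datetime.date.fromisoformat(data.get("observed", "1970-01-01"))
        if data.get("relation") == relation and observed >= cutoff:
            yield subj, obj, data

def push_to_team(kg: nx.MultiDiGraph) -> None:
    """Run on a schedule: route fresh facts to the team's channel."""
    lines = [f"{s} {d['relation']} {o} (seen {d['observed']})"
             for s, o, d in fresh_facts_for(kg, relation="announced_funding")]
    if lines:  # only notify when there is something new
        requests.post(TEAM_WEBHOOK, json={"text": "\n".join(lines)}, timeout=10)
```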
While many unique use cases have not yet been explored by major knowledge graph providers, strengths of present knowledge graph architectures include the following (illustrated in the sketch after this list):
- Graph databases are flexible, containing many different entity types
- Graph databases are centered around relationships between entities (linked data providing context)
- Unlike relational databases, which may require complex joins to retrieve linked data, graphs offer performance benefits when traversing relationships
- New fact types can be added “on the fly” as the world they’re representing changes
- Each entity has a universal identifier (disambiguation)
- Data provenance providing explainability to each constituent fact
- High expressivity for modeling complex domains
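The sketch below illustrates several of these strengths in one place: mixed entity types without a fixed schema, a new fact type added on the fly, provenance attached to each fact, and linked-data retrieval as a traversal rather than a join. Identifiers, names, and URLs are invented for the example.

```python
import networkx as nx

kg = nx.MultiDiGraph()

# Mixed entity types live side by side without a fixed schema;
# stable identifiers (here, made-up "Q_" IDs) handle disambiguation.
kg.add_node("Q_acme", type="Organization", name="Acme Corp")
kg.add_node("Q_jane", type="Person", name="Jane Doe")
kg.add_edge("Q_jane", "Q_acme", relation="ceo_of",
            source_url="https://example.com/press-release")  # provenance per fact

# A new fact type can be introduced "on the fly" with no migration.
kg.add_node("Q_award", type="Award", name="Industry Prize 2024")
kg.add_edge("Q_acme", "Q_award", relation="won")

# Linked-data retrieval is a traversal, not a multi-table join.
for _, neighbor, data in kg.out_edges("Q_acme", data=True):
    print(f"Q_acme --{data['relation']}--> {neighbor}")
```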
While knowledge graphs power major services we depend on (Netflix, Spotify, Google Search, etc.), for most organizations a successful implementation of knowledge-as-a-service involves pairing specific KG strengths with specific needs. The largest portion of knowledge-as-a-service’s future promise involves freeing workers from the drudgery of fact accumulation and trend spotting in their roles. And that’s a win that’s well underway!