Guest Author: André Vermeij, Founder of Kenedict Innovation Analytics & Developer of Kenelyze

Organisations focused on innovation come in many forms, including corporations with large Research & Development (R&D) departments, universities, research institutions active in advancing science, and startups working on what could be the next big thing. Innovation-related data has become increasingly important for each of these organisations to inform decision-making and stay ahead of market developments. For example, an R&D-intensive corporation could use data to benchmark its own technology portfolio against those of its direct competitors, while a startup might analyse data to assess previous activity and potential market entry in a sector of interest.

Traditional Innovation Analysis

The traditional way to look at innovation-related data is to report on output within a topic or organisation of interest based on counts and sums of variables of interest. When analysing its competition, a business may for example gather information on a competitor’s recent output and report on the number of documents in each technology domain, produce a list of the companies the competitor has worked with, or generate an overview of the most active inventors or researchers in a field of interest. Although all these analyses can be valuable in their own right, they’re missing out on a key aspect of an innovation ecosystem: the connections between technologies, organisations, and people.

A Graph of Innovation

Viewing innovation and its output as a graph of interconnected data points allows us to get a much deeper understanding of the technology and knowledge structures in a context of interest. Using the metadata in a wide array of innovation-related data sources, which will be discussed more in the following section, it is possible to create graphs of connected documents, organisations and people and gain new insights into the actual underpinnings of innovative activity.

For example, innovation graphs allow us to answer questions such as:

Which clusters of activity can we distinguish within a topic or organisation of interest, and how has this evolved?
What do the organisational collaboration networks in an area of interest look like, and who are the key players in network connectivity?
How are teams of individual experts in a specific field composed, and who are the leading experts in a given topic?

Open Data Sources for Innovation Analytics

Until just a few years ago, quality innovation data was quite hard to come by without a subscription to an expensive database hosting patent information or scientific publications. Luckily, in recent years, there has been a move towards more openly available data, which can serve as an excellent basis for setting up a wide variety of innovation graphs.

Here’s a quick overview of common data sources:

Patents: organisations apply for patents to protect their inventions against commercialisation by third parties. Patent applications and grants are published online by national patent offices around the world, with databases gathering data from all jurisdictions and providing a wide array of metadata. Great open data sources are the European Patent Office's Open Patent Services (OPS) API and the EPO's search platform Espacenet.
Scientific publications: journal publications, conference proceedings, book chapters and various other types of scientific output are gathered in databases which bring together output from many sources. Paid databases such as Scopus are still used often by large organisations – great open alternatives include OpenAlex and Semantic Scholar.
Subsidies & funding programmes: governmental subsidies to stimulate innovation and R&D in specific areas are often structured in openly available data sources. A good example is the European Union’s CORDIS data for the Horizon Europe programme. Many national enterprise agencies also publish their granted subsidies and projects online.
Internal data: the above data sources are often augmented with internal, unpublished data (e.g., internal project reports, unfiled patent applications, scientific output in the review stage) to get a view on very recent activity within an organisation. This is especially valuable when creating knowledge graphs within organisations or carrying out an innovation portfolio analysis for a specific client.

In a typical Innovation Analytics project, combining data from several of the above data sources is often key to gaining the best insights. For example, organisations applying for patents often also have scientific output related to the same theme and may also apply for governmental funding. To get a picture of innovative activity that is as complete as possible, it is therefore important to look at activity from multiple data sources and graph perspectives.

Graphs of Documents: Insight into Technology and Knowledge Clusters

The analysis and visualisation of innovation graphs often starts with looking at the relationships between documents based on a shared characteristic.

Depending on the goals of the analysis, there are various ways to link documents together:

Text similarity: unstructured text data in the form of document titles, abstracts and summaries can be used to connect documents when there is a high similarity between their contents. This relies on vectorisation of the text of interest and subsequent calculation of pairwise cosine similarities, where a link is then drawn between documents based on a minimum similarity score.
Knowledge flows / shared authors: another way to generate clusters of connected documents is to link them when the same people have worked on them. The authorship data on documents can be used to accomplish this. The key assumption here is that documents are part of the same “knowledge cluster” when persons with specific expertise have (co-)written them.
Citations: numerous citations to other documents can be found in both scientific publications and patent applications. We can use these citations to create various types of graphs:
- Shared references: connect documents when they cite the same sources (also known as bibliographic coupling). The number of shared references serves as the link weight, and a minimum weight is often required before a link is drawn.
- Shared citing documents: connect documents when they have been cited by the same other documents (co-citation), again often with a minimum weight set.
- Direct citations: create citation graphs in which a link is drawn between two documents when one cites the other.
Technology classifications: patent documents are categorised using classification codes designating the technology areas which they fall into. These can be used to connect documents when they share one or multiple codes, essentially creating clusters of documents based on technological overlap.
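To make two of these linking approaches more concrete, here is a minimal sketch in plain Python. It assumes TF-IDF as the vectorisation step (the approach above does not prescribe a specific one), uses naive whitespace tokenisation without stop-word removal, and works on made-up documents. All function names, thresholds and document IDs are illustrative assumptions, not part of any particular tool.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(texts):
    """Turn each text into a sparse TF-IDF vector (dict: term -> weight)."""
    tokenised = [text.lower().split() for text in texts]
    n_docs = len(tokenised)
    # Document frequency: in how many texts does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed inverse document frequency, so no weight becomes zero.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_links(docs, min_similarity=0.3):
    """Link documents whose pairwise cosine similarity reaches the threshold."""
    ids = list(docs)
    vectors = tfidf_vectors([docs[doc_id] for doc_id in ids])
    links = {}
    for (id_a, vec_a), (id_b, vec_b) in combinations(list(zip(ids, vectors)), 2):
        score = cosine(vec_a, vec_b)
        if score >= min_similarity:
            links[(id_a, id_b)] = score
    return links


def shared_reference_links(references, min_shared=2):
    """Link documents that cite at least `min_shared` of the same sources."""
    links = {}
    for doc_a, doc_b in combinations(references, 2):
        shared = len(references[doc_a] & references[doc_b])
        if shared >= min_shared:
            links[(doc_a, doc_b)] = shared  # link weight = shared references
    return links


# Hypothetical example data.
abstracts = {
    "P1": "lidar sensor fusion for autonomous vehicles",
    "P2": "sensor fusion with lidar and radar for autonomous driving",
    "P3": "battery degradation models for electric buses",
}
cited_sources = {
    "P1": {"R1", "R2", "R3"},
    "P2": {"R2", "R3", "R4"},
    "P3": {"R3", "R9"},
}

print(similarity_links(abstracts))
print(shared_reference_links(cited_sources))
```

On real data, the similarity threshold and the minimum number of shared references are tuning choices: higher values give sparser graphs with more clearly separated clusters.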

The following graph is an example of a text similarity approach, where scientific publications in the area of autonomous vehicles are connected when they share significant textual content. Colors depict clusters of activity based on the outcomes of a community detection algorithm, and nodes are sized based on the number of times they were cited by other papers:

Figure 1: Graph of scientific publications linked based on text similarity approach

Graphs of Organisations: Insight into Collaboration Ecosystems

Another graph perspective, which is very common in innovation analysis, focuses on mapping the connections between organisations (businesses, universities, research institutions, public bodies, hospitals, etc.).

Many of the data sources above hold extensive metadata on the organisations responsible for the documents: scientific authors are affiliated with their employers, patents are applied for by the parties seeking protection of their invention, and governmental subsidies are often received by consortia of collaborating organisations.

In the resulting graph, organisations are linked when they appear together on the same document. It is common to attach weights to the links based on the number of collaborations between two organisations. Using these weights, it is then possible to filter the graph to focus only on the strongest / most frequently occurring collaborations.
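As a rough illustration of this weighting and filtering step, the sketch below counts how often each pair of organisations appears on the same document and then keeps only the links at or above a minimum weight. The organisation names, the function names and the threshold are hypothetical; it assumes each document has already been reduced to a clean list of organisation names.

```python
from collections import Counter
from itertools import combinations


def collaboration_weights(documents):
    """Count, for every pair of organisations, how many documents they share."""
    weights = Counter()
    for organisations in documents:
        # Deduplicate and sort so each pair has one canonical key.
        for pair in combinations(sorted(set(organisations)), 2):
            weights[pair] += 1
    return weights


def strongest_links(weights, min_weight):
    """Keep only the links whose weight reaches the minimum."""
    return {pair: w for pair, w in weights.items() if w >= min_weight}


# Hypothetical example: organisations listed on four documents.
documents = [
    ["University A", "Hospital B"],
    ["University A", "Hospital B", "Company C"],
    ["Hospital B", "Company C"],
    ["University A"],  # single-organisation document: no link
]

weights = collaboration_weights(documents)
print(strongest_links(weights, min_weight=2))
```

The same counting works for any list of actors per document, so it is not specific to organisations.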

The graph below shows an example of collaboration in radiotherapy innovation, where colors are based on the type of organisation (e.g. blue = universities, green = hospitals and medical centers) and node sizes are based on their betweenness centrality scores:

Figure 2: Collaboration in radiotherapy

Graphs of People: Insight into Expertise and Knowledge Networks

This graph perspective often follows the mapping of organisational collaboration networks and focuses on the actual person-to-person collaborations taking place to produce the analysed output.

Using the author/inventor metadata on documents, we draw links between people when they have co-authored a document. Similar to the organisational networks, we can also attach weights to the links, which correspond to the number of documents which have been worked on jointly by two authors. This perspective can provide a deep understanding of the actual team structures and knowledge networks within and outside of organisations.

Here’s an example of the (relatively large!) network of inventors who have worked on Apple patents. Nodes are sized based on their betweenness centralities, and colors are based on clusters detected by a community detection algorithm:

Figure 3: Apple's inventor network

Graph Metrics & Innovation Insights

The above examples show various ways to convert innovation data into actionable graph visualisations. In the actual analysis and interpretation of these graphs, it is important to make good use of the many metrics available in graph analytics. These metrics can help us understand which clusters are present in a network, and can aid in determining the importance of nodes based on centrality measures.

The following metrics are valuable for analysing the overall graph structure in innovation analysis:

Component analysis: determining the components (subsets of nodes that can all reach each other, directly or indirectly) shows how interconnected the graph is and how the largest connected component compares with the smaller ones.
K-Cores: to determine highly connected subsets of nodes in graphs, k-cores can be used to highlight subgraphs in which every node is connected to at least k other nodes within that subgraph. This makes it possible to focus quickly on densely connected groups of nodes (a looser notion than a clique, in which every node is linked to every other node) and is especially valuable when analysing collaboration and knowledge networks.
Community detection: using an algorithm such as the Leiden community detection algorithm to determine which clusters we can distinguish within the components. These clusters then serve as the basis for graph annotation, where clusters are labeled based on their actual contents (see the labels in the autonomous driving graph above).
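The first two structural metrics are simple enough to sketch in plain Python. The example below is an illustration on a small made-up graph, not a production implementation; in practice a graph library would normally handle this. Function names and node labels are assumptions, links are treated as unweighted and undirected, and community detection is left out because algorithms such as Leiden are considerably more involved.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def connected_components(adjacency):
    """Return the components as sets of nodes, largest first."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        # Breadth-first search collects everything reachable from `start`.
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def k_core(adjacency, k):
    """Return the nodes of the k-core by repeatedly pruning nodes of degree < k."""
    remaining = {node: set(neigh) for node, neigh in adjacency.items()}
    changed = True
    while changed:
        changed = False
        low_degree = [n for n, neigh in remaining.items() if len(neigh) < k]
        for node in low_degree:
            # Remove the node and update the degree of its neighbours.
            for neighbour in remaining.pop(node):
                remaining[neighbour].discard(node)
            changed = True
    return set(remaining)


# Hypothetical graph: a triangle with one pendant node, plus a separate pair.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("E", "F")]
adjacency = build_adjacency(edges)

print(connected_components(adjacency))
print(k_core(adjacency, 2))
```

In this toy graph, pruning for k = 2 removes the pendant node and the isolated pair, leaving only the triangle.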

On the individual node level, degree and betweenness centrality measures can be used to determine the importance of nodes in innovation graphs:

Degree Centrality: determining simple connection counts per node to quickly see which actors are most important in terms of the number of other nodes they are connected to. Since most innovation graphs are weighted (links have weights associated with them), weighted degree centrality is also used regularly.
Betweenness Centrality: this frequently used metric identifies who holds key bridging positions in a graph, i.e. which organisations/people are the “key connectors” between clusters/teams. It is calculated by determining how often each node appears on the shortest paths between all pairs of other nodes in the network.
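Both node-level measures can be sketched in a few lines of plain Python. The example below is a simplified illustration on a hypothetical weighted link list: degree and weighted degree are simple counts and sums, and betweenness is computed with Brandes' algorithm on unweighted shortest paths (link weights are ignored for path length, which is an assumption of this sketch). Scores are raw, unnormalised counts, and all names and data are made up.

```python
from collections import defaultdict, deque

# Hypothetical weighted links: (node, node) -> number of joint documents.
links = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2, ("C", "D"): 1, ("D", "E"): 1}


def degrees(links):
    """Return (degree, weighted degree) per node."""
    degree = defaultdict(int)
    weighted = defaultdict(int)
    for (a, b), weight in links.items():
        for node in (a, b):
            degree[node] += 1
            weighted[node] += weight
    return dict(degree), dict(weighted)


def betweenness(links):
    """Unnormalised betweenness on unweighted shortest paths (Brandes)."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    scores = dict.fromkeys(adjacency, 0.0)
    for source in adjacency:
        # Breadth-first search: count shortest paths from the source.
        order = []
        predecessors = {node: [] for node in adjacency}
        n_paths = dict.fromkeys(adjacency, 0)
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        dependency = dict.fromkeys(adjacency, 0.0)
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # In an undirected graph every pair is visited from both ends.
    return {node: score / 2 for node, score in scores.items()}


degree, weighted_degree = degrees(links)
print(degree)
print(weighted_degree)
print(betweenness(links))
```

In this toy graph, C and D sit between the A-B-C triangle and node E, so they receive the highest betweenness even though B has the highest weighted degree.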

Up Next: Use Cases

Now that you have a first impression of the main ideas behind innovation graphs, the next blog post will showcase practical use cases, real-world client examples and common challenges in innovation graph analysis. Stay tuned!

Innovation as a Graph: Improved Insight into Technology Clusters, Collaboration and Knowledge Networks