A machine learning approach for identifying clusters in real time

Cluster theory forms a key dimension of our Data Diamond model, but how does existing research measure up when defining the boundaries of a cluster as industries evolve? Building on foundational cluster theory, researchers Lucien Chaffa and Thierry Warin have developed a novel quantitative method to identify and dynamically redefine the boundaries of clusters and their constituent industries.

Academics have used various methods to operationalize Porter’s definition of a cluster—a concentration of interconnected businesses, suppliers, and associated institutions within a region—by relying on factors like co-location, shared inputs and similarities in employment and patents to measure industry connectivity. While these factors capture visible inter-industry linkages, unsupervised learning algorithms can uncover hidden structures and relationships that may not be immediately apparent.

The authors take firm creation, previously identified as a measure of interconnectedness, as the basis for a metric of industry growth rates. Growth rate is a useful summary signal because it reflects the combined effect of the economic and non-economic factors acting on an industry. Using a k-means clustering algorithm alongside other unsupervised machine learning techniques, the authors group industries with similar growth patterns into clusters and define dynamic cluster boundaries that indicate strong inter-industry interconnectedness. Because it captures shifts and co-movements in industry performance over time, this approach turns static cluster definitions into dynamic ones.
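To make the grouping step concrete, here is a minimal sketch of k-means applied to industry growth series. It is an illustration under stated assumptions, not the authors' implementation: the industry names, the growth values, the choice of k = 2 and the helper names `sq_dist` and `kmeans` are all hypothetical, and the other unsupervised techniques the authors combine with k-means are not shown.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length growth series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct observations chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each series joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical growth rates per industry over three consecutive periods.
growth = {
    "industry_a": [0.10, 0.12, 0.11],
    "industry_b": [0.09, 0.13, 0.10],
    "industry_c": [-0.05, -0.04, -0.06],
    "industry_d": [-0.06, -0.03, -0.05],
}

names = list(growth)
labels, centroids = kmeans([growth[n] for n in names], k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

Re-running the same procedure each time new growth rates arrive is what would let cluster membership shift over time, although how the authors choose the number of clusters and track boundaries between runs is not covered in this summary.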

To test their methodology, the authors calculate industry growth rates from firm-level data in the Registre des Entreprises du Québec (REQ), which is updated every fortnight. Because the data are near real-time, the method captures changes in an industry’s health and competitiveness almost immediately. This stands in stark contrast to traditional cluster definition methods, which rely on outdated cross-sectional datasets and therefore overlook nascent industries.
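As a rough illustration of how fortnightly registry data could feed a growth-rate metric, the sketch below counts new firm registrations per industry in 14-day windows and takes the period-over-period change. The exact growth-rate definition used by the authors is not given here, so this formula is an assumption, as are the record layout (industry code plus registration date), the sample records and the function name `growth_rates`. None of this reflects the actual REQ schema.

```python
from collections import Counter
from datetime import date


def growth_rates(registrations, start, n_periods, days=14):
    """Period-over-period change in new firm counts, per industry.

    registrations: iterable of (industry_code, registration_date) pairs.
    Returns {industry_code: [rate_1, ..., rate_{n_periods-1}]}; a rate is
    None when the previous period had no registrations (undefined ratio).
    """
    counts = Counter()
    for industry, reg_date in registrations:
        period = (reg_date - start).days // days  # index of the 14-day window
        if 0 <= period < n_periods:
            counts[(industry, period)] += 1

    rates = {}
    for industry in sorted({ind for ind, _ in counts}):
        series = [counts[(industry, p)] for p in range(n_periods)]
        rates[industry] = [
            (curr - prev) / prev if prev else None
            for prev, curr in zip(series, series[1:])
        ]
    return rates


# Hypothetical records: (industry code, date the firm was registered).
records = [
    ("A", date(2024, 1, 1)),
    ("A", date(2024, 1, 16)),
    ("A", date(2024, 1, 20)),
    ("B", date(2024, 1, 2)),
    ("B", date(2024, 1, 3)),
    ("B", date(2024, 1, 17)),
]

rates = growth_rates(records, start=date(2024, 1, 1), n_periods=2)
print(rates)
```

Under this definition, an industry with no registrations in a given fortnight yields an undefined rate for the following period, which is one reason a production metric would likely need smoothing or a different denominator.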

The granularity of this dataset extends across three dimensions: industrial classification, geographical location, and temporal attributes. This allows clusters to be identified at various levels of industrial classification while also considering the geography they occupy. The geospatial data of firms is particularly valuable for understanding clustering behaviour beyond political boundaries. Lastly, the temporal dimension of the dataset offers insights into the evolution of clusters over time.

To sum up, this machine learning-based methodology not only modernizes cluster theory but also paves the way for more dynamic, data-driven analyses of economic geography and regional competitiveness. For policymakers, this approach provides crucial insights into the health of regional industries, enabling informed and targeted interventions to enhance regional economic development.