Bisfico applies decades of expertise in financial markets, investments, governance and technology to solve your problems.

Methodology

SOURCE DATA, CLEANING AND ENHANCEMENT
Yewno Knowledge Graph
  • Source data is constructed from the universal reading of patents and news documents, using Yewno AI and machine learning capabilities. The methodology creates a Knowledge Graph, whereby the nodes, or Concepts, are created by the machine from the reading of all types of documents.
  • “Yewno’s KnowledgeGraph is a powerful and dynamic representation of knowledge across a vast corpus of documents and evolves in real-time.
  • Yewno’s Inference Engine detects and explains the relationships and changes in connections over time. Connections can be traced back to the source document, even down to the sentence.”
  • A set of scores binds the documents to the Concept. The score is set by how often the document recurs within the Concept and by its diffusion in the Knowledge Graph.
  • “A total of five factors are computed measuring exposure of entities to the augmented list of target concepts, allowing for a matrix representation of the data:
  • Importance Scores are based on the number of co-occurrences between the entity and the target concepts or publication of documents by the entities with mentions of target concepts; two scores are provided:
    • Contribution Score is a measure of how much each company was mentioned or published documents related to the target connected Concept relative to all the companies in the asset universe.
    • PurePlay Score is a measure of the percentage of each company’s mentions or document publications related to the target concept relative to its mentions or publications across other concepts.
    • Centrality Score is based on the centrality diffusion (local PageRank) of the network constructed from mentions/publications between concepts. It incorporates second-order connections, favouring central nodes connected to other central nodes.
    • Similarity Score is based on how close the projections of companies and concepts are in the semantic space.
    • The aggregated score is a weighted linear combination of the previous scores normalised by the maximum.
    Data Description and Data Organisation:
    • Inferences Data Description: Besides the numeric exposure scores, text snippets from the content sources are provided. Those inferences showcase the connection between the different concepts and the corresponding entities.
    • Corporate Structure Incorporation: The exposure scores of the corresponding entities are mapped to the parent entities, reflecting the proper corporate structure at any given date.”
    • Concepts can be any string: a keyword, a policy, a technology, a company. There are millions of Concepts; hence it is vital that as Concepts are requested, the list is narrowed down to what is useful.
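As a rough illustration of the aggregated score described above, the following sketch combines the factor scores in a weighted linear combination and normalises by the maximum. The weights and column names are hypothetical placeholders, not Yewno's actual parameters.

    import pandas as pd

    # Hypothetical factor weights -- placeholders, not Yewno's actual parameters.
    WEIGHTS = {
        "contribution": 0.30,
        "pureplay": 0.30,
        "centrality": 0.20,
        "similarity": 0.20,
    }

    def aggregated_score(scores: pd.DataFrame) -> pd.Series:
        """Weighted linear combination of the factor scores, normalised by its maximum."""
        combo = sum(w * scores[col] for col, w in WEIGHTS.items())
        return combo / combo.max()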
Concept selection, Requested Concepts in the Decarbonisation framework

Since the objective is to identify companies innovating within the overall race-to-zero-emissions theme, the Requested Concept (RC) list initially targets Patents as the document source. In that universe, technologies are linked to companies through the patents themselves. The Contribution and Pureplay scores can then rank companies, for each RC, based on how specialised they are in the technology in question (Pureplay) and how much of the RC’s loading they represent (Contribution). The scores are aggregated from document reading over a period of 365 days.

The list of RCs has to be complete enough to capture all aspects of the Decarbonisation theme in question, yet sufficiently orthogonal to avoid repetition, since a single patent can match multiple keywords.

Decarbonisation Themes and SubPortfolios

To facilitate the selection of the Requested Concepts, they are organised in a classification tree, as sketched below. Each branch of the tree comprises a Class, a SubClass, a SubPortfolio and an RC.
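A minimal sketch of one branch of that tree, with hypothetical names used purely for illustration:

    # Hypothetical branch: Class -> SubClass -> SubPortfolio -> Requested Concepts (RCs).
    CLASSIFICATION_TREE = {
        "Energy": {                          # Class
            "Renewable Generation": {        # SubClass
                "Solar Power": [             # SubPortfolio (Theme)
                    "Photovoltaic Cell",     # RCs
                    "Perovskite Solar Cell",
                ],
            },
        },
    }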

 The proprietary RC list was built from the following inputs:

  • Known technologies for each topic gathered through research
  • Incremental concepts from experts
  • Nearby nodes in the Knowledge Graph, identified with help from Yewno data scientists.

The tree is updated for request purposes twice a year, for end-December and end-June data (the Pivot Dates). In March and September (the Rebalancing Dates), the RC set of the previous Pivot Date is reused.

The SubPortfolio level is a scoring aggregation level used in the portfolio construction process, hence its name. A SubPortfolio is a Theme since it groups RCs within a specific topic, such as Solar Power or Self-Driving Vehicles.

Entity Cleaning and Data Enhancement

The entity list generated by the files includes different identifiers for the same company, such as bond tickers. These are non-standard tickers that need to be converted to a usable format. Since we focus on equity investment, each “ticker” is checked against a pricing database, discarding non-equity identifiers and keeping only the exchanges we can trade. As of early 2021, 962 unique tickers are identified. Further cleaning removes double listings (e.g. ADRs) and ambiguous tickers (bearer shares), reducing the sample by roughly another hundred.

Once we have that short(-ish) list, a fundamental company database is updated, including complementary identifiers such as Bloomberg tickers and ISIN codes, as well as country/region and sector classifications. Price, market cap and volume data are requested and translated daily into USD and EUR. Further down the process, a volume filter is applied, depending on the objective, when requesting the score list. At a $10m average daily traded volume (ADTV, over 50 days), about 20% of the stocks are excluded. Note that 14 currencies are represented as of end-2020.
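A minimal sketch of that liquidity screen, assuming a date-indexed series of daily traded value in USD; the $10m threshold and 50-day window follow the text, while the function and variable names are ours.

    import pandas as pd

    ADTV_THRESHOLD_USD = 10_000_000  # $10m, per the text
    ADTV_WINDOW_DAYS = 50

    def passes_volume_filter(daily_traded_value_usd: pd.Series) -> bool:
        """True if the 50-day average daily traded volume clears the $10m threshold."""
        adtv = daily_traded_value_usd.tail(ADTV_WINDOW_DAYS).mean()
        return adtv >= ADTV_THRESHOLD_USD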

DATA NORMALISATION AND AGGREGATION
Patents

Each line of the dataset comprises five critical elements, complemented with unique IDs (examples in brackets):

  • Reference date (31.12.20)
  • Document source type (Patents)
  • Requested Concept (Lithium Battery)
  • Company (3M)
  • Scores

A few days after each Pivot and Rebalancing date (the Date), Bisfico fetches files from Yewno’s AWS S3 bucket and updates its AWS SQL database. Each RC has its own file covering three years of data, representing over 250 files of zero to 20,000 lines each. Once aggregated, the full score table has over a million lines.
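A minimal sketch of that ingestion step, assuming the RC score files arrive as CSVs under a per-Date prefix; the bucket name, prefix layout, file format and table name are placeholders, not the actual setup.

    import io

    import boto3
    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection details -- not the actual bucket or database.
    BUCKET = "yewno-delivery-bucket"
    PREFIX = "decarb/2020-12-31/"  # one folder per reference Date (assumed layout)
    ENGINE = create_engine("postgresql://user:password@host/scores")

    s3 = boto3.client("s3")

    def ingest_score_files() -> None:
        """Download every RC score file for the Date and append it to the score table."""
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
                df = pd.read_csv(io.BytesIO(body))
                df.to_sql("rc_scores", ENGINE, if_exists="append", index=False)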

  1. Period Aggregation
    For each Date, three years of data are aggregated.

  2. Normalisation
    The Contribution and Pureplay scores, which are used for the Patents source, do not follow a well-behaved statistical distribution, and the distribution is also inconsistent across RCs. The goal is to identify the dominant companies in each RC and SubPortfolio, not those with the highest raw score, since raw scores are not normalised. To normalise, we use percentile buckets as ranks: the top 5%, for instance, is ranked best and the bottom 5% worst. These integer classes are then used as scores at the RC level and further aggregated at the SubPortfolio level, the next level up.

  3. RC and Subportfolio Aggregation

    Before creating the rank scores, the RC scores per company are aggregated as the sum of the Pureplay and Contribution scores for that RC over the three years. For Pureplay and Contribution specifically, that means a company with a superior Pureplay/Contribution score over many years will be favoured.

    Rank scores are then created by grouping on each RCID and creating equal-sized bins from 1 (best sum of scores) to 15 (worst), combining Contribution and Pureplay with an allocation key.

    For SubPortfolios, the aggregation uses the median of the three years of Pureplay and Contribution scores. Using the median (rather than the sum) minimises the effect of larger companies being present in multiple RCIDs, or of themes with a large number of RCIDs. The goal is to find the dominant quality player in the technology or theme, not the largest one.

Rank scores are also created by grouping on each SubPortfolio and creating equal-sized bins from 1 (best sum of scores) to 15 (worst), combining Contribution and Pureplay with the same allocation key as for the RCs.
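A minimal sketch of the normalisation and aggregation steps above, in pandas. The column names are assumptions and a plain sum of Pureplay and Contribution stands in for the proprietary allocation key; the sum/median aggregation and the 15 rank buckets (1 = best, 15 = worst) follow the text.

    import numpy as np
    import pandas as pd

    def rank_buckets(scores: pd.Series, bins: int = 15) -> pd.Series:
        """Map aggregated scores to percentile buckets: 1 = best, `bins` = worst."""
        pct = scores.rank(ascending=False, pct=True)  # best score -> smallest percentile
        return np.ceil(pct * bins).astype(int)

    # `df` columns assumed: entity, rcid, subportfolio, year, pureplay, contribution.
    def rc_and_sp_ranks(df: pd.DataFrame) -> tuple[pd.Series, pd.Series]:
        # A plain sum stands in for the proprietary Contribution/Pureplay allocation key.
        df = df.assign(combined=df["pureplay"] + df["contribution"])
        # RC level: sum the combined score over the three years, rank within each RCID.
        rc_sum = df.groupby(["rcid", "entity"])["combined"].sum()
        rc_rank = rc_sum.groupby(level="rcid").transform(rank_buckets)
        # SubPortfolio level: median over the three years, rank within each SubPortfolio.
        sp_median = df.groupby(["subportfolio", "entity"])["combined"].median()
        sp_rank = sp_median.groupby(level="subportfolio").transform(rank_buckets)
        return rc_rank, sp_rank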

Figure 1: RankScore Distribution, Number of Entities per Score Junction, June 2020

As shown in this figure, while there is an unsurprising correlation between the two types of groupings (RC and SP), there are clear differences in their distributions. A company with a decent RC score may or may not have a good SP score as well.

Figure 2: Region and Sector Density, Patents, June 2020

As can be seen in the regional distribution, there is an apparent under-representation of European entities in the scoring file. One standout (pink) feature is the lack of technology stocks, compared with the US and Asia, and the prevalence of industrial companies. While the theme is prevalent in Europe, it is represented by large industrial companies whose scores are diluted because they carry out many other, unrelated developments.

News

The News source uses a subset of the RCs, more generic than the specific technologies involved. The News filter is used at the end of the process to discriminate between large-cap stocks based on how positive their news flow is.

  1. Scoring

    Similarity and Centrality scores best capture News impact. Pureplay and Contribution are pointless for such a wide array of documents, whereas Similarity will naturally capture a particular company's news loading in the space. Also, these scores can be negative, further allowing for differentiation.

    Figure 3: Similarity-Adjusted Density Function, June 2020

    The distributions are more normally spread than for Pureplay and Contribution, but still require a transformation into rank buckets as some right skewness is visible. In any case, the sample of entities with News related to the topics is smaller than for Patents and biased towards large caps. Further work is needed on the RCs to narrow them down to concepts generic enough to resonate more than the narrow, technology-focused list used now (i.e. primarily the one used for Patents). There is value in the News score since it shows how companies are trending within the topic, and it can be negative, indicating the nature of the link.

  2. Aggregation
  • Aggregation is done at the SubPortfolio level only and uses the median as the score, highlighting how strong the company is, on average, in the particular theme (see the sketch below). The sum would lead to larger companies being overwhelmingly represented, since their names resonate across many topics. The aggregation leads to one unique SP score for News per Pivot Date. A score of 7 (average) is assigned to companies that have no score. In the subsequent selection, scores above that level are used as a screen.
  • Each square shows the density of unique entities for each cross of News (rows) and Patents (columns) SP rank scores (ignoring scores of 7). As can be seen, there is no relationship between the News and Patent scores: innovation, as captured by patents, is not reflected in the news. While there could be many reasons, companies' control of their communication can shift a company's perception away from its underlying reality. It could also be that an event, or the industry a company is in (think oil & gas and mining companies), leads to a negative news flow that overwhelms any talk of innovation and transition efforts. In any case, Patents as a source is inherently different from News. We suspect it would also differ from E scores in ESG scoring frameworks; more on that later.
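A minimal sketch of that News aggregation, assuming a long table of per-entity, per-SubPortfolio News rank scores; the column names are assumptions, while the median aggregation and the neutral fill value of 7 follow the text.

    import pandas as pd

    NEUTRAL_RANK = 7  # assigned to entities with no News score

    def news_sp_scores(news: pd.DataFrame, universe: pd.Index) -> pd.DataFrame:
        """Median News rank per (SubPortfolio, entity), neutral-filled for the rest.

        `news` is assumed to have columns: entity, subportfolio, rank_score.
        `universe` is the full list of entities under consideration.
        """
        medians = (news.groupby(["subportfolio", "entity"])["rank_score"]
                       .median()
                       .unstack("entity"))
        # Entities with no News coverage receive the neutral score of 7.
        return medians.reindex(columns=universe).fillna(NEUTRAL_RANK)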

Figure 4: News Rank Score Regional Distribution, June 2020

As can be seen, most European companies score well in News whereas US/Canadian companies are more evenly distributed, as technology companies tend to have evenly spread ranks.

SELECTION

At this juncture, we have generated a normalised rank score for Patents (RCs and SP) and for News (SP). We also have the price and fundamental data for each company. More than 650 companies are available for screening after filtering on ADTV and removing null Pureplay scores.

Objective

Since the portfolio construction aim is to minimise factor exposures and idiosyncratic risk, we translate those objectives into the context of our dataset. Factors are taken to be the themes, or SubPortfolios. Hence the process forces diversification by constructing “sub-portfolios” at the SubPortfolio level and manages entity fuzziness and repetitions so that the idiosyncratic-risk objective is respected.

SubPortfolio construction

Based on the normalised rank score per RC and SP, a multi-step filtering process is applied:

  1. Select a subsample per region, aiming to balance regional exposure. For example, 100% of European-based entities are considered, 90% of Asian and 80% of US entities. The actual level is determined by the sample size per region needed for portfolio construction.
  2. A quantile filter is applied to the scores at each RC level (i.e. the technology level) and at the SP level (i.e. the theme); a sketch of this screen follows below. Since the focus is on finding technology leaders rather than broad participants, the SP filter is more aggressive than the RC filter. Having tested the efficacy of this balance, using only the RC filter outperforms using only the SP filter, as it captures smaller, large-cap-to-be companies.
  3. As an example, in December 2020 the SP levels are at 4% (4% of the entities are kept) while the RC levels are closer to 8%. That means 92% of the companies at each RC level are excluded, and that the sample of companies making it through is very sensitive to a change in the filter level. These levels are determined through an iterative process, so that the final portfolio holds around 100 stocks. Around 350 securities remain in the sample after the RC and SP filters.
  4. In order to minimise turnover, entities already in the portfolio benefit from a less aggressive filter, provided they are still within the aforementioned circa-350-entity sample.
  5. All remaining entities are regrouped at the SubPortfolio level, i.e. all RC-filtered and SP-filtered entities are pooled at the SP level, carrying their scores. Each SP therefore has a different number of entities: from a total sample of around 300 stocks, each SP holds between fewer than 10 and 135 entities. Hence a proportional filter cannot be applied, otherwise some SPs would be vastly overrepresented.
  6. To mitigate this phenomenon, the filter level is adjusted so that it is more aggressive for sample sizes over 50 than for smaller ones.

Also at that level, and only for those securities that have a News source score, a scoring filter is applied. This filter thus segregates larger companies based on the dynamics of their news flow in each theme.
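A minimal sketch of the quantile screens described in the list above; the keep-fractions are illustrative (the text quotes roughly 4% for SP and 8% for RC in December 2020) and the column names are assumptions.

    import pandas as pd

    # Illustrative keep-fractions, following the December 2020 example in the text.
    RC_KEEP = 0.08  # keep roughly the best 8% of entities within each RC
    SP_KEEP = 0.04  # keep roughly the best 4% within each SubPortfolio

    def quantile_screen(ranks: pd.DataFrame, group_col: str, keep: float) -> pd.DataFrame:
        """Keep, within each group, the entities whose rank falls in the best `keep` quantile.

        `ranks` is assumed to have columns: entity, `group_col`, rank_score (1 = best).
        """
        threshold = ranks.groupby(group_col)["rank_score"].transform(
            lambda s: s.quantile(keep))
        return ranks[ranks["rank_score"] <= threshold]

    # Usage sketch: screen at RC and SP level separately, then pool the survivors.
    # rc_survivors = quantile_screen(rc_ranks, "rcid", RC_KEEP)
    # sp_survivors = quantile_screen(sp_ranks, "subportfolio", SP_KEEP)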

Portfolio construction and position sizing

At this stage of the process, each SP comprises 2 to 12 stocks, but the total number of securities is around 100.

Each entity in each SP is assigned an equal weight, i.e. 1% if the portfolio has 100 stocks. Of course, that means the sum of the weights is above 100%, since many securities are repeated across the SPs (around 65%). To satisfy the 100% target, the following steps are applied:

  1. Group securities and add up their weights across the SPs. At this point, the notion of SubPortfolio is folded into the portfolio. If security A is present in 7 SPs, its weight is 7%.
  2. Reduce the weights above 2.5% to that level. Security A now has 2.5%.
  3. Rebalance so that the total weight is 100% and the emerging-market exposure is 0.5% below its limit (a 10% limit in the GBI-Decarb certificate, so 9.5% is the aggregated weight).
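A minimal sketch of these sizing steps; the 2.5% cap and the 9.5% emerging-market target follow the text, while the column names and the pro-rata rescaling in step 3 are assumptions.

    import pandas as pd

    CAP = 0.025        # single-stock weight cap
    EM_TARGET = 0.095  # emerging-market aggregate weight, 0.5% below the 10% limit

    def size_positions(members: pd.DataFrame, unit: float = 0.01) -> pd.Series:
        """Turn SubPortfolio memberships into final portfolio weights.

        `members` is assumed to have columns: entity, subportfolio, is_em (bool);
        each (entity, SubPortfolio) pair carries `unit` weight (1% for ~100 stocks).
        """
        # Step 1: add up the equal weights for each entity across SubPortfolios.
        weights = members.groupby("entity").size() * unit
        is_em = members.groupby("entity")["is_em"].first()

        # Step 2: cap any single name at 2.5%.
        weights = weights.clip(upper=CAP)

        # Step 3: scale emerging-market names to a 9.5% aggregate and the rest to 90.5%.
        # (A sketch only: rescaling can re-breach the cap, which the full process
        # would resolve iteratively.)
        weights.loc[is_em] = weights.loc[is_em] * EM_TARGET / weights.loc[is_em].sum()
        weights.loc[~is_em] = weights.loc[~is_em] * (1 - EM_TARGET) / weights.loc[~is_em].sum()
        return weights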
BackTesting

The methodology described is applied pari passu for each reference date. Since the sample size increases over time, the actual filter numbers differ. The first reference date is 30/06/2020, with data starting from 30/06/2014.

With the turnover-minimisation methodology, which favours existing stocks, the portfolios are to some extent autocorrelated with the first one. That effect dissipates through time, since around 20% of the entities turn over at each rebalancing date (every 6 months).

We are at your disposal for any further information.

Contact us

 

 
