Rely on UK Companies House data? You may have a BIG problem.

This is a guest post from Regulation Technologies, a South West startup that is leading the way in transforming unmanageable open datasets into meaningful networks to uncover high-value insights and help data teams innovate.

Our research indicates that at least 3 million director records, around 20% of the total, are not linked to the right people. (Happily, we have solved the problem.)

First off, let’s state this is not the fault of Companies House. They do a solid job of maintaining the UK’s corporate registry and their various data products are a great success. In the main they are free and, as a result, billions of searches are conducted each year. But Companies House is currently limited by law as to how much ‘post-filing’ processing they are permitted to do.

See this summary of The Registrar’s Powers. Page 9 states the legal position set down in the Companies Act 2006: “The Act specifies the circumstances where the register can be amended or clarified. In most cases the circumstances are very specific and the registrar’s powers are limited.”

And that is why, over time, the data has become less reliable and why, if you depend on it, you are likely exposed to greater risks than you assume.

In this article, we will explain the research we undertook, present the findings, demonstrate how we solved the problem using connected data techniques 🎉, and conclude by discussing the implications.

The Research

We used the Companies House bulk data product ‘Companies Appointments Snapshot’, which contains all company officer and appointment details as of 29th September 2021. It holds approximately:

45 million appointment records
14 million company records
20 million company officer records

Step 1: Select ‘Real’ Directors

Our focus is directors who are ‘real people’, so we exclude companies acting as company officers and people acting as company secretaries.

That leaves us with 15,847,532 records for directors who are (or, sadly, were) real living, breathing humans. Our challenge is to correctly assign these records into one of two categories:

i) single records containing all appointments data for a given person, or
ii) multiple records which, when linked together, contain all appointments data for a given person.

The incorrect assumption made by many users of Companies House data is that all records, or at least the vast majority, fall into the first category.

Step 2: Select & Prepare A Representative Sample

We select 50 surnames using https://britishsurnames.co.uk/random and then add a combination of initials and first names. Next, we filter the sample down to those names with more than 10 officer records but fewer than 100, so the processing work is of a manageable size. This leaves us with a sample of 16 unique names with, on average, 50 director records for each.

For each name we group the records by date of birth. Companies House do not include directors’ full dates of birth; they abbreviate them to YYYYMM format.

The starting condition for the research, therefore, is a set of records with the same name and the same YYYYMM birthdate. There are two main reasons records may share these properties, either:

i) these are records for two or more individuals who share the same name and YYYYMM birthdate, or
ii) these are multiple records for the same person not linked in the dataset.
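To make the starting condition concrete, here is a minimal sketch of the grouping step. The record layout, field names, and sample values are invented for illustration and are not the layout of the Companies House bulk file.

```python
from collections import defaultdict

# Hypothetical, simplified officer records (invented values and field names).
records = [
    {"officer_id": "A1", "name": "Alex Sample", "dob": "197503"},
    {"officer_id": "A2", "name": "ALEX SAMPLE", "dob": "197503"},
    {"officer_id": "A3", "name": "ALEX SAMPLE", "dob": "198211"},
    {"officer_id": "B1", "name": "JO EXAMPLE", "dob": "196001"},
]


def group_by_name_and_dob(records):
    """Map (normalised name, YYYYMM) to the officer ids sharing both values."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["name"].strip().upper(), rec["dob"])
        groups[key].append(rec["officer_id"])
    return dict(groups)


def candidate_groups(records):
    """Keep only groups of two or more records: the candidates for linking."""
    grouped = group_by_name_and_dob(records)
    return {key: ids for key, ids in grouped.items() if len(ids) > 1}


candidates = candidate_groups(records)
```

Each group that survives is either case i) or case ii) above; grouping alone cannot tell them apart, which is why the later steps are needed.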

Step 3: Estimating Multiple Records

In this step, we estimate the likely incidence of the dreaded ‘multiple’ records and in Step 4 calculate the actual number.

This is our sample data which shows for each name:

Officers: the number of director records that exist,
DOBs: the number of unique birthdates in YYYYMM format,
Est Multiples %: the estimated rate of multiple records (we included one legitimate YYYYMM match for each name; read more about the ‘Birthday Paradox’ statistical quirk for why 1 record is an appropriate choice).
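As a rough sketch, the estimate can be read as follows: every record beyond the number of unique birthdates shares a name and YYYYMM with another record, and one of those is allowed to be a genuine coincidence. That reading, the function names, and the 720-month birth window (60 years) used in the birthday-style calculation are our assumptions for illustration, as is the simplification that birth months are equally likely.

```python
from math import comb


def estimated_multiples_pct(officers: int, unique_dobs: int,
                            legit_matches: int = 1) -> float:
    """Estimated share of records that are unlinked duplicates, as a percentage.

    Records beyond the unique YYYYMM count share a birthdate with another
    record; `legit_matches` of them are assumed to be different people.
    """
    extra = officers - unique_dobs - legit_matches
    return 100.0 * max(extra, 0) / officers


def expected_shared_pairs(people: int, months: int) -> float:
    """Expected number of pairs of distinct people sharing a birth month,
    assuming birth months are uniformly spread over `months` possibilities."""
    return comb(people, 2) / months


# Illustrative inputs only, not rows from the sample table.
example_pct = estimated_multiples_pct(officers=40, unique_dobs=33)
pairs_for_50 = expected_shared_pairs(people=50, months=720)
```

Under these assumptions, 50 different people with the same name would be expected to produce between one and two coincidental YYYYMM matches, which is the same order of magnitude as the single legitimate match allowed per name.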

Whilst this estimate indicates a significant rate of ‘multiple’ records (nearly 18%), it is not much help at a practical level. Once the scale of the problem is established, what anyone really wants to know is which records belong to the same real person, because once linked together they give a complete, not a partial, picture of a director’s activity.

Step 4: Calculating Actual Multiple Records

This is where our connected data shines. Because we have connected all UK companies to their company officer records (a network of close to 100m relationships with 20k daily updates), we can use network features to establish which records belong to which people. In this case, we use ‘proximity’: how close entities are in relation to each other.

Here we identify 5 separate director records for an ADAM ROSE with the same birthdate. The fact that 5 records exist in a network of around 20, which is part of a total set of 100m+, makes the chances of any of these being a separate ADAM ROSE vanishingly small. This is a more accurate approach than, say, looking for static properties like an occupation that may occur many thousands of times more frequently.
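The idea of proximity can be sketched with a small officer-company graph and a breadth-first search. This is not our production algorithm: the identifiers, the toy appointments, and the 4-hop threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical appointments: (officer record id, company number).
# R1, R2, R3 are candidate records with the same name and YYYYMM birthdate;
# X and Y are other directors.
appointments = [
    ("R1", "C1"), ("X", "C1"),   # R1 and X sit on the same board
    ("X", "C2"), ("R2", "C2"),   # X also sits with R2, so R1 and R2 are close
    ("R3", "C9"), ("Y", "C9"),   # R3 is in an unrelated part of the network
]


def build_graph(appointments):
    """Undirected bipartite graph linking officer records and companies."""
    graph = defaultdict(set)
    for officer, company in appointments:
        o, c = ("officer", officer), ("company", company)
        graph[o].add(c)
        graph[c].add(o)
    return graph


def hops(graph, start, goal, max_hops):
    """Breadth-first search: hop count from start to goal, or None if the
    goal is not reachable within max_hops."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def close_pairs(graph, candidate_ids, max_hops=4):
    """Pairs of candidate records that lie within max_hops of each other."""
    pairs = []
    for a, b in combinations(sorted(candidate_ids), 2):
        d = hops(graph, ("officer", a), ("officer", b), max_hops)
        if d is not None:
            pairs.append((a, b, d))
    return pairs


graph = build_graph(appointments)
linked = close_pairs(graph, ["R1", "R2", "R3"])
```

In this toy network R1 and R2 are four hops apart (record, company, shared director, company, record) and would be linked, while R3 stays separate because no path connects it to the others.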

An added bonus to this approach is that we extract static properties from any of these ADAM ROSE records and use them to identify other records for the same person which are not connected by company relationships. We use location data, personal data, and temporal data (the dates of the known appointments) to capture the stragglers.
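A minimal sketch of this second pass might pool the properties of the records already linked and then test each remaining record against that profile. The field names, the placeholder postcodes, the one-year window, and the rule that both location and timing must agree are hypothetical choices, not a description of our proprietary queries.

```python
from datetime import date


def build_profile(linked_records):
    """Pool the static properties of records already linked to one person."""
    return {
        "postcodes": {r["postcode"] for r in linked_records},
        "first": min(r["appointed"] for r in linked_records),
        "last": max(r["appointed"] for r in linked_records),
    }


def looks_like_same_person(profile, record, max_gap_days=365):
    """True if the record shares a known postcode and its appointment date
    falls inside, or within max_gap_days of, the known appointment range."""
    same_place = record["postcode"] in profile["postcodes"]
    gap_before = (profile["first"] - record["appointed"]).days
    gap_after = (record["appointed"] - profile["last"]).days
    near_in_time = max(gap_before, gap_after) <= max_gap_days
    return same_place and near_in_time


# Invented example values.
linked_records = [
    {"postcode": "AB1 2CD", "appointed": date(2015, 3, 1)},
    {"postcode": "AB1 2CD", "appointed": date(2018, 6, 15)},
]
profile = build_profile(linked_records)

straggler = {"postcode": "AB1 2CD", "appointed": date(2019, 1, 10)}
stranger = {"postcode": "ZY9 8XW", "appointed": date(2016, 5, 5)}
```

Here `straggler` would be picked up because it shares a postcode and falls within a year of the known appointments, while `stranger` would not because the location differs.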

The Findings

Processing the sample in the manner described gave us the actual number of multiple records. The actual incidence rate is 20.46%, or 1 record in 5, slightly higher than the original estimate.

Applying this rate to the total number of director records of 15,847,532 gives 3,241,948 records that belong to existing directors but are not connected to them.
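The extrapolation is simple arithmetic, and it can be checked in a few lines. The variable names are ours; the assumption, flagged in the comments, is that the quoted total was produced from the unrounded sample rate, which is why multiplying by the published 20.46% lands a few hundred records away.

```python
total_director_records = 15_847_532  # 'real' director records from Step 1
reported_unlinked = 3_241_948        # unlinked records quoted in the findings
published_rate = 0.2046              # 20.46%, rounded to two decimal places

# Applying the rounded, published rate to the full population.
estimate_from_published_rate = round(total_director_records * published_rate)

# Rate implied by the quoted total (assumed to come from the unrounded rate).
implied_rate_pct = 100 * reported_unlinked / total_director_records

difference = abs(estimate_from_published_rate - reported_unlinked)
```

The two figures differ by fewer than 500 records out of more than 3 million, and the implied rate rounds back to 20.46%, so the quoted total is consistent with the published rate.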

The Implications

1. There are many reasons why people use Companies House data. A 2019 report by the Department for Business, Energy & Industrial Strategy states that the goal of 45% of users (both individuals and data solution providers) was to obtain or provide basic company and director information. A lot of users are not getting the full picture.

2. What of decisions based on this fragmented data? There are two scenarios worth mentioning:

i) decisions taken assuming all the information is known when it is not, and
ii) decisions not made because the information is known to be incomplete.

In either case the outcome is sub-optimal: either additional work is invested in searching for, collating, and verifying information to reach an acceptable standard of coverage (what % of the total information do we have) and veracity (what % of the information can be relied on as true), or the decision is abandoned because the data falls short.

3. In our work with a P2P lending client, we found their due diligence checks showed a greater rate of ‘multiple records’: over 30%. This was due to active directors having more appointments, featuring in more Companies House filings, and thus having a greater chance of falling through the cracks. Active directors are, by definition, more likely to apply for financial products. That is where data quality is critical, but it is also where Companies House data, if it features, could be dangerously incomplete.

4. And relevant to data teams: normalising data and entity resolution are regular preparatory steps prior to running downstream analysis, so beware: rubbish in = rubbish out.

Our Solution

Below is how our approach performed on the sample data. The green column shows what percentage of officer records we correctly identified for each name: a 97.76% success rate. This was achieved with an automated process using our proprietary database queries and algorithms. It is not perfect but, in practice, only around 1 in 40 records would need manual review.

If you want to put us to the test, we offer a free sample audit. Please email me at mike@regulationtechnologies.com to get started.

A rudimentary version of the algorithm is available for free testing at our public API docs.