Over the past week, I have written about the Canadian Expeditionary Force and the Commonwealth War Graves Commission data sets. After loading both data sets in a common database, it is finally time to bring them together.
Meet John Smith
One table contains the records of every enlistee in the Canadian Expeditionary Force. The other contains records of every known First World War grave of Canadian soldiers.
A naïve approach to this problem would be to use the name to join the tables. A man who enlisted as John Smith likely has John Smith written on his grave, right?
Well... yes (actually no, but more on that later), but there were 172 John Smiths in the Canadian Expeditionary Force. In fact, there were 17646 Johns and 6498 Smiths on record. In fact, the Johns accounted for 3% of all the Canadian Expeditionary Force.
The Johns could form a division to themselves. As such, first and last names alone cannot be used to differentiate enlistees.
Fortunately, we have other useful information to tell apart our 172 John Smiths.
First, let's have a look at the ID-like values we have in each table. Both tables have a regimental number column. It is called a service number in the war graves table, but a little research confirms that it's simply a more recent term for regimental numbers.
Each man in the Canadian Expeditionary Force was assigned a regimental number, save for nursing sisters and commissioned officers. Each unit had a block of regimental numbers it would assign sequentially to its enlistees. As the Library and Archives Canada documentationpoints out, enlistees with sequential numbers were likely standing next to each other in line at the recruiting booth. This changed in 1917, as numbers were from then on assigned to men in the order in which they were called up for service.
Members of the Royal Newfoundland Regiment did not receive numbers, as they were not part of the Canadian Expeditionary Force (Newfoundland joined Canada in 1949).
Regimental numbers might appear to be great candidates for identifying soldiers, but as Michael O'Leary from The Regimental Rogue points out, these numbers were frequently reassigned to multiple enlistees across the nation, and not every unit assigned numbers within their assigned blocks.
Our data set supports this claim. Some regimental numbers appear over a dozen times in the CEF data set. There are also several thousand records without regimental numbers.
Moreover, many enlistees have more than one regimental number. The most likely reason is that these enlistees were transferred during the war.
Michel, Mike and Michael
Despite the lack of a proper unique identifier for enlistees, it might still be possible to link somerecords.
Some people might share the same regimental number, so let's only merge records where the names match.
|Criteria for merging||Merged records|
|Same regimental number||69 270 (100%)|
|Same regimental number and last name||50 693 (73%)|
|Same regimental number and first name||25 586 (37%)|
|Same regimental number, first name and last name||23 319 (34%)|
By matching only by regimental number, we get more matches than we have war graves, which means several graves match more than one enlistees. The next step is to match soldiers that have the same regimental number, first name and last name.
At a glance, it seems like most records have mismatching first names. A quick look at the mismatching records reveals that empty first name values caused 70% of the first name mismatches.
|CEF first name||CWGC first name|
The remaining first name mismatches seem primarily caused by spelling differences and middle names.
|CEF first name||CWGC first name|
|REGINALD SYDNEY||REGINALD SIDNEY|
|LEONARD J||LEONARD JAMES|
|MATTHEW HUMPHREY||MATHEW HUMPHREY|
|WALTER WILLINGALE||WALTER W.|
|RUSSELL KIRBY||RUSSELL KERBY|
|CLIFFORD H||CLIFFORD H.|
|ALFRED GEORGE EDWARD||ALFRED G. E.|
However, matching by last name alone is a very bad idea, not to mention that they are also prone to spelling differences.
John Smith and John Smith
In addition to these inconsistencies, I discovered that several people with the same first name, last name and regimental number in the Canadian Expeditionary Force data set.
SELECT regiment_nr1, surname, given_name, count(*) FROM cef_enlistees WHERE regiment_nr1 IS NOT NULL GROUP BY regiment_nr1, surname, given_name HAVING count(*) > 1 ORDER BY count(*) DESC
It would be easy to dismiss these as duplicates, but there were indeed two men named Lionel Degre with the same name regimental number. Perhaps this is a transcription error, but they do have distinct documents.
In other cases, there are indeed duplicates. Stephen James Burns appears twice in the database, but both records share the same enlistment papers (based on URL) and birth date.
There are only 70 such duplicates in the database, and many of them can be eliminated by comparing their birth dates and document URLs. We can simply avoid merging war grave records for the remaining duplicates, if they exist.
According to the official documentation on regimental numbers, a soldier can have multiple numbers following a unit transfer during the war. In the original database schema, is stored multiple numbers in separate columns, and that prompted me to store the regimental numbers in a separate table.
What have we learned so far?
- First names and last names are not unique.
- Regimental numbers are not unique, either, but they can help differentiate people with the same name.
- The same person can have its name spelled differently in different data sets.
- A combination of first name, last name and regimental number is not unique either, but such combinations are fairly rare. We can still merge a few thousand other records.
With a better idea of what we are dealing with, it's time to formulate a plan.
First, let's eliminate duplicates in the Canadian Expeditionary Force data set. This will be added to the import script so that we don't have to worry about this in the future.
Second, let's go for the easy victories and merge CEF and CWGC records that have the same first name, last name and regimental number, except for those who are in our list of duplicates.
Third, merge CEF and CWGC records that have variations of the same first and last name using fuzzy matching. More on that later.
To be continued
The methods I used to eliminate duplicates are worth their own article, so I decided to cut this article in half. In the meantime, have a look at the GitHub repository for this project. Every operation I performed on the database are reproducible by using the code hosted there, so I invite you to have a look.