What’s in a Name?: Details and Data Linkage

A year into the Digital Panopticon project we have begun record linkage with some of our key sources relating to Transportation. With several innovative iterations of initial linkage completed, thanks to Jamie McLaughlin, we have been able to trace more than three quarters of those sent for transportation from the Old Bailey, linking them to their voyage details in the British Transportation Registers. For some, we have also been able to link onwards to the Convict Indents compiled for them on board convict ships and once they arrived in Australia. This iterative process has taught us much about the nature of our different record sets, and about the complex job of connecting them together.

One of the biggest challenges in the linking process has been differentiating between the multiple cases of identical names and trials in the Old Bailey. In the coming months, our schedule of record linkage will connect not just our transportation datasets but also imprisonment data and eventually civil data, such as the census and birth, marriage and death information. As it does, being certain of what to link, and how, becomes increasingly difficult.

When confronted with a sea of names, and no consistency in the recording of other contextual information between our diverse datasets, how are we to make the right choices and make sure that the correct history is connected to the right offender?

Between 1780 and 1900 there was only one Mary Ann Dring convicted at the Old Bailey. She was sentenced to five years’ penal servitude in 1865 for feloniously uttering counterfeit coin. She had appeared at the Old Bailey once previously, in 1863, as a witness in the coining trial of another woman, and twenty years later, in 1885, might well have acted as a witness in a manslaughter case.

From a linkage perspective we are fortunate. In all of our criminal datasets there should only be one Old Bailey Mary Ann Dring. Indeed, this is very lucky because, with just two lines of text for her own trial, the information we start with in order to trace her is minimal:

Name: Mary Ann Dring

Approximate year of birth: 1817

Location: London.

Step one is to link to the next big dataset for those who stayed in England to be imprisoned. In this case that is the PCOM 4 female licences for parole. Searching with the information on Mary Ann Dring that we took from the Old Bailey data, we have no problem locating her licence. Those familiar with the licences will know that these documents give us the opportunity to collect a vast amount more information on her. Confident that the right link has been made, we can collect some key contextual detail that will allow us to identify Mary Ann Dring in further datasets.

Licence fields

The future datasets we link to will not, of course, contain the majority of this information, so we must use a few key details that will help us link to new records. For civil data we could certainly use information such as the fact that Mary Ann Dring was recorded as married with two children in 1865. She worked as a charwoman, and had been resident in London, under her married name, since at least 1863 when she had her first conviction.

In the nearest census to Mary Ann’s Old Bailey conviction in 1865 (1861) there are 183 returns for a Mary Ann Dring born in or around 1817. If we make the not unreasonable assumption that our Mary Ann Dring was living in London for the five years prior to her Old Bailey appearance, we can rather luckily reduce that to four viable matches. To most academic researchers or family historians, this is a small and manageable selection of information from which to choose.

MAD census entries

Yet even though we know she was married with two children, we are faced with four married women, two with two children, two with three, all living in London (and none with any occupation listed which is not unusual for a census entry with a male head of household). Given the parameters of most automated systems that might be required to make such a match, any of these census entries could be considered a valid match. Manually, it is possible for an individual researcher to reduce the choices to two viable matches. They are, from a linkage point of view, almost indistinguishable. The dates of birth for the two most likely candidates fall one year either side of 1817. Both are married, both have two children. Both are residents of London. Both have identical names.
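The narrowing described above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the project's linkage code: the `CensusEntry` fields, the two-year birth-year tolerance, and every row in `ENTRIES` are assumptions invented for the example, not transcriptions of the census returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CensusEntry:
    entry_id: str
    name: str
    birth_year: int
    place: str
    marital_status: str
    children: int


# Invented placeholder rows for illustration only; these are NOT
# transcriptions of the real 1861 census returns.
ENTRIES = [
    CensusEntry("c1", "Mary Ann Dring", 1816, "London", "married", 2),
    CensusEntry("c2", "Mary Ann Dring", 1818, "London", "married", 2),
    CensusEntry("c3", "Mary Ann Dring", 1815, "London", "married", 3),
    CensusEntry("c4", "Mary Ann Dring", 1819, "London", "married", 3),
    CensusEntry("c5", "Mary Ann Dring", 1817, "Norfolk", "married", 2),
]


def narrow(entries, name, birth_year, place, year_tolerance=2):
    """Keep entries with the same name and place and a nearby birth year."""
    return [
        e for e in entries
        if e.name == name
        and e.place == place
        and abs(e.birth_year - birth_year) <= year_tolerance
    ]


def refine(entries, marital_status, children):
    """Apply the extra contextual detail taken from the licence."""
    return [
        e for e in entries
        if e.marital_status == marital_status and e.children == children
    ]


shortlist = narrow(ENTRIES, "Mary Ann Dring", 1817, "London")
finalists = refine(shortlist, "married", 2)
print([e.entry_id for e in shortlist])
print([e.entry_id for e in finalists])
```

Even in this toy version the filters stop helping once two entries agree on every field we hold, which is exactly the position the two remaining candidates leave us in.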

In the 1871 census, six years from Mary Ann’s conviction and four years after her release from prison, there are no records that would directly match either of the entries from the 1861 census. Instead there is a choice of five women who all fall within five years of the original Mary Ann Dring’s birth year, but have notable differences in their personal information. Furthermore, depending on which links are made to census data, and what extra contextual information is added to Mary Ann’s case, there is the potential for relevant death records from London and the surrounding counties, spanning a fifteen-year period.

If we just looked for Mary Dring, without the middle name Ann, the choices we would be faced with would be several times the volume. If we looked for a Mary Smith with the same level of contextual detail, we could well be faced with exploring hundreds of potential matches with no way to choose between them.

Each individual record linked to a convict has ramifications for future links. On the micro level this is the dilemma faced by every genealogist or family historian: the difficult decisions that have to be made in matching records to individuals. However, the Digital Panopticon’s task of linking almost 90,000 convicts across multiple datasets is not a micro history, nor a task that can be managed manually. The design of an automated system that can navigate and discern between multiple similar (or even identical) entries in a given dataset is essential. Or perhaps it is a question of ranking and displaying the multiple possible links in case of conflict?

It would seem that our challenge now is that of developing a suitably complex data linkage system: one that can maintain a high rate of matches that we can be confident in, and that at the same time allows us to incorporate possible, contradictory, and conflicting data. Those with common names will no doubt prove our greatest challenge, but even someone as seemingly unique as Mary Ann Dring poses challenges about how we match, what we match, what we keep, and how to store and rank conflicting information across such a wide variety of datasets.
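One way to picture the "rank and display" option is a simple scoring scheme that keeps every candidate and flags a tie rather than forcing a choice. The sketch below is a thought experiment, not a description of the system we will build: the field names, the `WEIGHTS`, the tolerance and the margin are all assumptions chosen for illustration, and the candidate records are invented.

```python
# Assumed weights: how much each agreeing field adds to a candidate's score.
WEIGHTS = {"name": 3.0, "birth_year": 2.0, "place": 1.0, "marital_status": 1.0}


def score(source, candidate, year_tolerance=2):
    """Sum the weights of the fields on which the two records agree."""
    total = 0.0
    if source["name"].lower() == candidate["name"].lower():
        total += WEIGHTS["name"]
    if abs(source["birth_year"] - candidate["birth_year"]) <= year_tolerance:
        total += WEIGHTS["birth_year"]
    for field in ("place", "marital_status"):
        if source[field] == candidate[field]:
            total += WEIGHTS[field]
    return total


def rank_links(source, candidates, margin=0.5):
    """Return (score, id) pairs, best first, plus an ambiguity flag."""
    ranked = sorted(
        ((score(source, c), c["id"]) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Ambiguous when the best two candidates are too close to separate.
    ambiguous = len(ranked) > 1 and ranked[0][0] - ranked[1][0] < margin
    return ranked, ambiguous


# Invented records for illustration only.
convict = {"name": "Mary Ann Dring", "birth_year": 1817,
           "place": "London", "marital_status": "married"}
census = [
    {"id": "a", "name": "Mary Ann Dring", "birth_year": 1816,
     "place": "London", "marital_status": "married"},
    {"id": "b", "name": "Mary Ann Dring", "birth_year": 1818,
     "place": "London", "marital_status": "married"},
    {"id": "c", "name": "Mary Ann Dring", "birth_year": 1817,
     "place": "Norfolk", "marital_status": "widowed"},
]

ranked, ambiguous = rank_links(convict, census)
print(ranked, ambiguous)
```

Storing the full ranked list alongside the ambiguity flag means a conflicting link is kept and displayed rather than silently discarded.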


Seeing things differently: Visualizing patterns of data from the Old Bailey Proceedings


An edition of the Old Bailey Proceedings

The Old Bailey Proceedings are a rich historical resource, almost unimaginably so. They constitute the largest body of texts detailing the lives of non-elite people ever published. Words alone can’t quite do justice to the magnitude of the Proceedings – 197,745 accounts of trials covering 239 years (1674-1913); some 127 million words of text (at an average reading rate of 250 words per minute, this would take eight hours’ solid reading every single day for nearly three years to get through!); details of some 253,382 defendants, including name, gender, age and occupation, as well as details of 223,246 verdicts passed by the juries and 169,243 punishments sentenced by the judges.

The Proceedings clearly contain a huge amount of information, but they don’t record everything – like any historical source, they are selective in what they document. The amount of information that was recorded in the Proceedings on crimes, verdicts, punishments, defendants and so on also varied over time. And whilst the digitization of the Proceedings by The Old Bailey Online has revolutionised the way in which we search and use this rich historical resource, this also has its limits. The marking-up of the text of the Proceedings (assigning tags to particular pieces of information in the text – such as name or crime – so that this information can be systematically searched) makes it possible to undertake sophisticated statistical analysis. Crimes, verdicts, punishments, defendant age and defendant gender can all be counted at the click of a mouse. Nevertheless, marking-up inevitably involves choices (about what information to tag and the level of detail that is tagged), and those choices limit the ways in which the Proceedings can be studied using computers.

Statistical searches of the Proceedings can be carried out through The Old Bailey Online

The question that we might ask, then, is what are the limitations of the Proceedings as a source of data on such things as punishments, defendant age and gender? Taking the Proceedings in their entirety, what are the limits in terms of the information that was recorded in the original trial reports? How frequently, for example, was the age of the defendant recorded? And what are the limits in terms of what we can actually search for systematically using digital technologies? Can we, for instance, systematically determine the lengths of imprisonment which offenders were sentenced to?

These are crucial questions for us because the Digital Panopticon will rely so heavily on the Proceedings as a source: in our effort to trace the life histories of offenders who were sentenced to transportation or imprisonment at the Old Bailey between 1787 and 1875, the Proceedings will obviously be a vital source of information. After identifying those who were sentenced to transportation or imprisonment recorded in the Proceedings we will then try to trace such individuals both before and after their conviction by linking the Proceedings with other sets of records.

In trying to better understand the limitations of the Proceedings as a source of data for the Digital Panopticon project, I have recently been making use of data visualization (‘dataviz’) – using computers to create visual representations of numbers. This includes the traditional graphs and pie charts that we are all familiar with, and which I will be talking about here. But it also includes more complex forms of visualization which I will be looking at in future posts (watch this space!).

Since the Proceedings contain such a vast amount of information, manual counting and tables are inadequate for making sense of the data. Turning the raw numbers into a visual form makes it much easier to see overall patterns in the data. Here I give just a brief example of how dataviz has helped me to see the Proceedings differently, to appreciate the limits of this immense historical resource, and to think about how information from the Proceedings can be used most effectively in the Digital Panopticon project.

A data visualization of the length of trial reports in the Proceedings over time, created by  William J. Turkel as part of the Datamining with Criminal Intent project (created using Mathematica 8)

One of the key things we want to know on the Digital Panopticon is how useful age data might be in helping us to link offenders recorded in the Proceedings with individuals documented in other sets of records (such as the convict transportation registers or census records). In the first instance, links will be made through name searches of the different types of records. But how can we be sure that the John Smith recorded in the Proceedings is the same individual as the John Smith recorded in the prison parole registers, for example? Age data might help us here. If John Smith is recorded as being 24 years old in the Proceedings at the time of his sentence to two years’ imprisonment at the Old Bailey, and the John Smith recorded in the parole registers is stated to be 26 years old, then we can be reasonably confident that this is indeed the same person. By the same token, if the John Smith recorded in the parole registers is said to be 60 years old, this would suggest not.
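The John Smith comparison amounts to checking whether two recorded ages imply roughly the same year of birth. Here is a minimal sketch, assuming each record carries both an age and the year in which that age was written down. The function names, the two-year tolerance and the years 1850 and 1852 are hypothetical, chosen only to make the example concrete.

```python
def implied_birth_year(recorded_age, record_year):
    """Approximate year of birth implied by an age recorded in a given year."""
    return record_year - recorded_age


def ages_compatible(age_a, year_a, age_b, year_b, tolerance=2):
    """True when the two records imply birth years within `tolerance` years."""
    gap = abs(implied_birth_year(age_a, year_a) - implied_birth_year(age_b, year_b))
    return gap <= tolerance


# Hypothetical years: sentenced in 1850, parole record made in 1852.
print(ages_compatible(24, 1850, 26, 1852))  # True: both imply a birth year of 1826
print(ages_compatible(24, 1850, 60, 1852))  # False: implied birth years 34 years apart
```

Comparing implied birth years rather than raw ages lets records made years apart be checked on the same footing, and the tolerance absorbs small recording errors.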

Ages could then be extremely useful, but it depends on how extensively, and how accurately, age data is recorded in the Proceedings (and our other sets of records). By visualizing the results of quantitative searches of the Proceedings we can get a clear sense of this, far more so than through the use of text-heavy tables which can be hard to “read” for patterns. A statistical search using The Old Bailey Online reveals that 171,168 defendants are recorded in the Proceedings in the years 1755-1870. Age is recorded for 101,364 (59.2%) of them. So for the entire period of our study, we have age data for just over half of all the defendants at the Old Bailey.

Further digging into the data and visualisation of the findings reveals some of the deeper patterns in the age data. In the first instance, the recording of ages only began in the year 1790 for defendants found guilty, and from the 1860s for those found not guilty, as shown in the graph below. In the 1790s, we have age data for 65% of guilty defendants, increasing to 90% and above thereafter. By contrast, age data for the not guilty is missing altogether until at least the 1850s, and is not recorded in earnest until the 1860s.

Visualisation demonstrating the extent of age recording over time and by verdict

This gives a sense of how extensively ages are recorded in the Proceedings over time, and according to which categories of offenders. By visualizing the patterns of recorded ages we can also get a feel for how ages were actually recorded. The graph below, for instance, suggests that there was a tendency to revise the defendant’s recorded age up or down slightly to match a round figure. The numbers of defendants whose ages are recorded as 30, 40, 50 and (to a lesser extent) 60 are all significantly above the number we might expect according to the moving average (in other words, when the yellow bar goes above the green line in the graph). By contrast, ages just either side of these figures (such as 29, 31, 39, 41 and 51) are systematically below the average (when the yellow bar is below the green line). It may well also have been the tendency for those in their early twenties to have their recorded ages revised down to 18 or 19, since these two ages are also well above the expected number. In short, many more defendants were recorded as being 30 rather than 31, or 40 rather than 41, and the scale of the difference suggests that this resulted from a deliberate policy of revising the defendant’s age up or down to match the nearest round figure.
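The comparison of each bar against the moving average can be expressed as a short calculation. What follows is a rough sketch rather than the method behind the graph: the five-year centred window and the 25% threshold are assumptions, and the counts are made-up numbers shaped to resemble the "bunching" pattern, not figures from the Proceedings.

```python
def moving_average(counts, window=5):
    """Centred moving average of the counts, by age.

    At the edges of the age range only the available neighbours are used.
    """
    half = window // 2
    averages = {}
    for age in counts:
        neighbours = [
            counts[a] for a in range(age - half, age + half + 1) if a in counts
        ]
        averages[age] = sum(neighbours) / len(neighbours)
    return averages


def heaped_ages(counts, window=5, threshold=1.25):
    """Ages whose count exceeds the local moving average by the threshold."""
    averages = moving_average(counts, window)
    return [
        age for age in sorted(counts)
        if counts[age] > threshold * averages[age]
    ]


# Made-up counts for illustration only (age -> number of defendants).
example_counts = {27: 100, 28: 100, 29: 70, 30: 180, 31: 70, 32: 100, 33: 100}

print(heaped_ages(example_counts))  # [30]
```

In this toy data the spike at 30 is flagged while 29 and 31 fall below their local average, mirroring the yellow bars sitting above and below the green line.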

Visualisation demonstrating the “bunching” of recorded ages at 30, 40, 50 and 60

Together this suggests that age data in the Proceedings will be of much use to us in the Digital Panopticon, particularly for the defendants found guilty and subsequently sentenced to transportation or imprisonment, for whom we have extensive age data from 1790 onwards. For our not guilty control group, however, we have no age data available in the Proceedings before the 1860s, so we will be reliant on other categories of information to link these defendants across datasets. And the seeming tendency for recorded ages to be rounded up or down suggests that, when we use age data to link individuals across datasets, it would be more effective to work within age ranges rather than trying to compare specific numbers.

From these early explorations it seems clear that visualization will be invaluable in helping us to identify the overall patterns in the data of the Proceedings. The first step in this is identifying some of the limitations in terms of the information recorded in the Proceedings. Traditional forms of visualization are useful to this end. But there are also potential benefits in going beyond this, by using more complex forms of visualization to uncover deeper patterns in the data – patterns that would be difficult to detect through simple graphs or charts. This is what I will be turning to next.