14 Apr

Adverts and Gobbledygook

The project is currently hurtling towards the end of stage one and our team of researchers is working flat out to log and process all of the new search pages and results that have recently been uploaded to the British Newspaper Archive’s online database (you can find out more about this unexpected increase in source material here). This new body of source material, however, is not exactly like the previous search results that we have been processing. In fact, our team has encountered two issues as they have begun to log these stories.

The first issue is that many of the search results that have been categorised as articles are in fact advertisements. As we are not looking at advertisements in this heritage project, they are irrelevant to our research – however, as one of our researchers has noted, you could have made a good living at the time by locating lost dogs! We’re not too sure why they have been categorised as articles – perhaps the British Newspaper Archive will correct this at some point (we’ll probably drop them an email to let them know about our discovery!) – and as far as we can see there is nothing we can do to remove them from our corpus of search results. What we are currently doing is logging them on our paper records as advertisements, without recording any specific details as to what type of advertisement they are. We will also not include them in the digital database that we will produce from all of the researchers’ paper logs.

The second problem with this new batch of results is that the information that accompanies each scan/search result (e.g. the title and a blurb) frequently doesn’t make much sense…in fact, in quite a number of instances it is complete gobbledygook. This means that in order to log the title and get a sense of the article (and so determine whether it is an article or in fact an advertisement) the researchers are having to look through the digital scan itself. This takes more time than it would if the brief bit of information about each search result were readable. We can only surmise that the software being used by the British Newspaper Archive is not recognising the typeface/text in the old newspapers, and is therefore not registering a readable title and blurb. We would imagine that the descriptions and information that accompany the search results will be updated/corrected manually at some point in order to provide more accurate details. Again, we will drop the British Newspaper Archive an email to ask about this…but in the meantime, has anyone else encountered these types of problems before, and do you know what the reasons behind them were?

One thought on “Adverts and Gobbledygook”

  1. The British Newspaper Archive has explained why some of the search results had the characteristics outlined in this blog post:

    “When we digitise newspapers, we use computers to read the words on the page and make sense of them – this process is called Optical Character Recognition or OCR. Although machines are not as good as humans at reading text, it would cost far too much and take far too long for people to read and retype the newspaper content.

    The layout and quality of the newspaper image also has an impact on how good the generated text is. When the newspaper image is very clear and the size of the type is large, our results are generally very good. If the image is reasonably hard to read for a human, a machine will have similar issues in correctly interpreting the characters.”

    This explains why some of the blurbs and categorisations are a bit out on the newer results.
