My Research Project

Introduction

As part of my master’s course at Sunderland University, I have been allowed to undertake a research project. The topic I have decided to research is web scraping, applied to searching and filtering recipes. In this blog post, I will explain what web scraping is, what research exists on it, and what research is still to be done.

What is web scraping?

Web scraping is the name given to various techniques for extracting information from web pages. Even manually copying and pasting content from websites has been described as web scraping. Usually, though, web scraping refers to building software that does this automatically. It can be applied whenever a person or organisation wants to extract information that is difficult or tedious to collect by hand, or so large in volume that collection has to be automated. Users include researchers, businesses, financial firms and the media.



The specific type of web scraping depends on the type and format of the content one wants to extract. When a webpage is accessed, the browser interprets the page's HTML document and renders it as the visible webpage, as the above image shows. Webpages are built in extremely varied ways, and the way a page has been built determines which web scraping techniques are needed to extract its content.

Some types of web scraping:

  • HTML parsing
  • DOM parsing
  • Computer vision web-page analysis
  • XPath

For example, if you write a blog post, its text is typically stored in individual <p> (paragraph) tags. One can then download the HTML, copy the text from the paragraph tags and store it in a database. Automate this process, and you have successfully used HTML parsing. However, some websites don’t store their data in <p> tags or other simple HTML tags. Some generate their content ‘dynamically’, meaning the content only appears once the page has been rendered and is not present in the raw HTML file. Accessing this kind of content requires DOM parsing, which means scraping the content after the page has fully loaded in a simulated web browser.
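To make the HTML parsing idea concrete, here is a minimal sketch using only Python's standard library. The sample page, the class name ParagraphExtractor and the function extract_paragraphs are my own illustrative assumptions, not part of any particular tool, and a real scraper would download the HTML rather than use a hard-coded string.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished paragraph strings
        self._buffer = None   # pieces of the <p> currently open, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        # Only keep text while we are inside a <p> element.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None


def extract_paragraphs(html):
    """Return the text of every <p> element in the given HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page standing in for a downloaded blog post.
SAMPLE_HTML = """
<html><body>
  <h1>My blog</h1>
  <p>First paragraph.</p>
  <p>Second <em>paragraph</em>.</p>
</body></html>
"""

print(extract_paragraphs(SAMPLE_HTML))
```

This only works when the text is already present in the HTML that was downloaded. For dynamically generated pages the same string would simply not contain the content, which is why DOM parsing is needed there.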

Web scraping will usually require three components:

  • A web crawler: accesses the web page you want to scrape
  • A web scraper: extracts the data or content
  • A database or storage of some kind: holds the data or content that the scraper has extracted
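The three components can be sketched as three small functions wired together. This is only an illustration under my own assumptions: the FAKE_WEB dictionary stands in for real network requests, the example.com URLs are placeholders, and the names crawl, scrape, store and run_pipeline are hypothetical. The database is an in-memory SQLite database from Python's standard library.

```python
import sqlite3
from html.parser import HTMLParser

# Stand-in for the web (URL -> HTML). A real crawler would download pages.
FAKE_WEB = {
    "https://example.com/soup": "<html><head><title>Tomato soup</title></head></html>",
    "https://example.com/cake": "<html><head><title>Carrot cake</title></head></html>",
}


def crawl(urls):
    """Crawler: yields (url, html) for every page it can reach."""
    for url in urls:
        html = FAKE_WEB.get(url)
        if html is not None:
            yield url, html


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape(html):
    """Scraper: extracts the page title from the HTML."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def store(connection, url, title):
    """Storage: writes one scraped record to the database."""
    connection.execute("INSERT INTO pages (url, title) VALUES (?, ?)", (url, title))


def run_pipeline(urls):
    """Runs crawler -> scraper -> storage and returns the database connection."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)")
    for url, html in crawl(urls):
        store(connection, url, scrape(html))
    connection.commit()
    return connection


db = run_pipeline([
    "https://example.com/soup",
    "https://example.com/cake",
    "https://example.com/missing",  # not in FAKE_WEB, so the crawler skips it
])
rows = db.execute("SELECT url, title FROM pages ORDER BY url").fetchall()
print(rows)
```

Keeping the three parts separate means each one can be swapped out independently, for example replacing the scraper while leaving the crawler and the storage untouched.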

What research is there about web scraping?

Most of the research on web scraping tends to focus on data, which is understandable. For researchers, however, web scraping also presents a real opportunity to automate data collection, and particularly data re-formatting. For example, in Nature (DeVito, et al., 2020), several researchers discussed their use of web scraping to reduce the workload of processing documents such as coroners’ reports. By automating the process with web scraping, they increased the screening rate from 25 reports an hour when done manually to 1000 an hour, a 40-fold improvement in productivity.

Other papers look at using the vast amount of data collected to produce something new rather than merely speeding up existing research. For example, a paper by Bogdan Oancea and Marian Necula (Oancea & Necula, 2019) uses web scraping of e-commerce sites to try to produce an alternative consumer price index (CPI). The CPI is meant to measure inflation and, as with all economic statistics, is very complicated to produce. The idea of an inflation measure is to keep track of how expensive the things people buy are and how quickly their prices are going up or down. Producing a single figure that represents this accurately is conceptually tricky, but more importantly, it is a logistical nightmare: the information has to be collected from a vast and varied array of sources. As a result, inflation data is very costly to produce and tends to be released only monthly or quarterly. This paper instead shows a method for producing a CPI from e-commerce websites using web scraping. While I am unconvinced by it, given the differences between its results and the official CPI, it is undoubtedly an exciting application of web scraping.

Few papers have focused on any web scraping technique that doesn’t use data. Furthermore, few papers have been written comparing languages or approaches to writing web scraping software.

What do I want to research?

My research project is to apply web scraping to recipes, specifically building a web scraper that searches and filters recipes by the 14 allergen groups. Here is a GitHub project that does this already. I want to use this project as an opportunity to compare several different types of web scraping.

Recipes often have metadata attached, which lets them turn up as ‘recipe cards’ in search results.

These recipe cards are possible because the websites they come from have metadata built into their HTML documents.

The above is not displayed in the eventual webpage but allows Google to produce the recipe cards shown earlier. From a web scraping perspective, this metadata is contained in the HTML document itself and can therefore be parsed directly from it.
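As a rough sketch of how such metadata could be read, the example below assumes the site publishes it as schema.org Recipe data in a JSON-LD script block, which is one common form of recipe markup. That is my assumption for illustration, since sites can also use other formats. The sample page and the names JsonLdExtractor and find_recipes are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_json_ld = False

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_json_ld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def find_recipes(html):
    """Return every top-level JSON-LD object whose @type is exactly "Recipe"."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    recipes = []
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed metadata rather than crash
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Recipe":
                recipes.append(item)
    return recipes


# Hypothetical recipe page; the metadata is not part of the visible content.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Peanut noodles",
 "recipeIngredient": ["200g noodles", "2 tbsp peanut butter", "1 tbsp soy sauce"]}
</script>
</head><body><p>Visible page content.</p></body></html>
"""

recipes = find_recipes(SAMPLE_HTML)
print(recipes[0]["name"], recipes[0]["recipeIngredient"])
```

The sketch deliberately ignores some variations seen on real pages, such as recipes nested inside an "@graph" list or an "@type" given as a list, so a practical tool would need to handle those as well.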

I want to build several tools, one that uses this metadata technique and others that use traditional HTML and DOM parsing, and compare their performance. I also want to compare the languages these tools are built in, looking at both quantitative performance and the qualitative ease of development. The most common languages for web scraping are R and Python, and both have many free and open-source tools available.

In total, I want to ask these research questions:

RQ1: Can a web-based tool that filters recipes by allergen group be created in both Python and R?

RQ2: Will a hybrid approach using metadata and HTML parsing be more efficient than HTML parsing alone?

RQ3: Can a machine learning algorithm improve the accuracy of the tool compared to a simple text filter?

RQ4: Is there a performance difference between Python and R when applied to web scraping?
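To give an idea of what the "simple text filter" in RQ3 might look like, here is a minimal keyword-based sketch. It is not the project's implementation. The keyword lists cover only a handful of the 14 allergen groups and are illustrative assumptions, as are the recipe dictionaries with their "name" and "ingredients" keys and the function names.

```python
import re

# Illustrative keyword lists for a few allergen groups. These are assumptions
# for the sketch and are far from complete.
ALLERGEN_KEYWORDS = {
    "milk": ["milk", "butter", "cheese", "cream", "yoghurt"],
    "eggs": ["egg", "eggs", "mayonnaise"],
    "peanuts": ["peanut", "peanuts"],
    "soybeans": ["soy", "soya", "tofu"],
    "cereals containing gluten": ["wheat", "flour", "barley", "rye"],
}


def detect_allergens(ingredients):
    """Return the allergen groups whose keywords appear as whole words."""
    text = " ".join(ingredients).lower()
    found = set()
    for group, keywords in ALLERGEN_KEYWORDS.items():
        pattern = r"\b(" + "|".join(map(re.escape, keywords)) + r")\b"
        if re.search(pattern, text):
            found.add(group)
    return found


def filter_recipes(recipes, excluded):
    """Keep only the recipes that match none of the excluded allergen groups."""
    excluded = set(excluded)
    return [r for r in recipes if not detect_allergens(r["ingredients"]) & excluded]


# Hypothetical scraped recipes.
recipes = [
    {"name": "Pancakes", "ingredients": ["plain flour", "2 eggs", "milk"]},
    {"name": "Peanut noodles", "ingredients": ["rice noodles", "peanut butter", "soy sauce"]},
    {"name": "Fruit salad", "ingredients": ["apple", "banana", "grapes"]},
]

for recipe in filter_recipes(recipes, ["peanuts"]):
    print(recipe["name"])
```

A filter like this is easy to build but makes obvious mistakes. For instance, it flags "peanut butter" as containing milk because of the word "butter", which is the kind of error RQ3 asks whether a machine learning approach could reduce.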

Next week, I will discuss the project management approach to this project and compare some tools to aid in this.

Bibliography

DeVito, N. J., Richards, G. C. & Inglesby, P., 2020. How We Learned to Stop Worrying and Love Web Scraping. Nature, Volume 585.

Oancea, B. & Necula, M., 2019. Web scraping techniques for price statistics – the Romanian experience. Statistical Journal of the IAOS, Volume 35.

