My Research Project
Introduction
As part of my master’s course at Sunderland University, I
have been given the opportunity to undertake a research project. The topic I have decided to
research is web scraping, applied to searching and filtering
recipes. In this blog post, I will explain what web scraping is, what
research there is about it, and what research is still to be done.
What is web scraping?
Web scraping is the name given to a variety of techniques for extracting
information from web pages. Even someone manually copying and pasting content
from websites has been described as web scraping. Usually, though, web scraping
refers to building software that can do this process automatically. Indeed,
any person or organisation wanting to extract information
that is difficult or tedious to collect, or so large that collection needs to be
automated, could apply web scraping. This includes researchers, businesses,
finance, the media, and more.
The specific type of web scraping depends on the type and format of the content one wants to extract. When a webpage is accessed, the browser interprets the HTML document and renders it as the page you see. How these webpages are built can be extremely eclectic and varied, and how a webpage has been designed will determine the web scraping techniques necessary to extract its content.
Some types of web scraping:
- HTML parsing
- DOM parsing
- Computer vision web-page analysis
- XPath
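To make the first two concrete, here is a minimal sketch (assuming the BeautifulSoup and lxml libraries are installed; the HTML fragment is invented for illustration, not taken from any real recipe site) showing HTML parsing and an XPath query extracting the same elements:

# A minimal sketch contrasting two of the techniques above on the same snippet.
# The HTML below is made up for illustration; real recipe pages vary widely.
from bs4 import BeautifulSoup   # HTML/DOM parsing
from lxml import html           # XPath

page = """
<html><body>
  <h2 class="recipe-title">Lemon drizzle cake</h2>
  <ul class="ingredients"><li>Flour</li><li>Eggs</li><li>Lemon</li></ul>
</body></html>
"""

# HTML parsing with BeautifulSoup: navigate by tag and class names.
soup = BeautifulSoup(page, "html.parser")
title = soup.find("h2", class_="recipe-title").get_text(strip=True)
ingredients = [li.get_text(strip=True) for li in soup.select(".ingredients li")]

# XPath with lxml: address the same element with a path expression.
tree = html.fromstring(page)
title_xpath = tree.xpath('//h2[@class="recipe-title"]/text()')[0]

print(title, ingredients, title_xpath)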
Web scraping will usually require three components:
- A web crawler – which accesses the web page you want to scrape
- A web scraper – which extracts the data or content
- A database or storage of some kind – into which the scraper can deposit the data or content that has been scraped
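As a rough sketch of how those three components fit together in Python (the URL, selector and table name are hypothetical placeholders, not a real recipe site):

# 1. Crawler, 2. Scraper, 3. Storage – a toy end-to-end pipeline.
import sqlite3
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/recipes"  # placeholder page to scrape

# 1. Crawler: fetch the page.
response = requests.get(URL, timeout=10)
response.raise_for_status()

# 2. Scraper: extract the content of interest from the HTML.
soup = BeautifulSoup(response.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# 3. Storage: deposit the scraped data in a local SQLite database.
conn = sqlite3.connect("recipes.db")
conn.execute("CREATE TABLE IF NOT EXISTS recipes (title TEXT)")
conn.executemany("INSERT INTO recipes (title) VALUES (?)", [(t,) for t in titles])
conn.commit()
conn.close()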
What research is there about web scraping?
Most of the research on web scraping tends to focus on the data collected, which is understandable. However, for researchers, web scraping presents a real opportunity to automate data collection, particularly data re-formatting. For example, in Nature (DeVito, et al., 2020), several researchers discussed their use of web scraping as a tool to reduce the workload of processing documents such as coroners’ reports. They increased the screening rate of coroners’ reports from 25 an hour, when done manually, to 1,000 an hour by automating the process with web scraping, which is obviously a fantastic productivity improvement.
Other papers look at using this vast data-collection capacity to
produce something new rather than merely speeding up existing research.
For example, a paper by Bogdan Oancea and Marian Necula (Oancea
& Necula, 2019) uses web
scraping of e-commerce sites to try to produce an alternative consumer
price index (CPI). The CPI is meant to measure inflation and, like all economic
statistics, is very complicated to produce. The idea of an inflation
measure is to keep account of how expensive the things people buy are and how
quickly those prices are rising or falling. Producing a single figure that represents
this idea accurately is conceptually tricky, but more importantly, it is
a logistical nightmare: the information has to be collected from a vast and
varied array of sources. Thus, inflation data tends to be released monthly or
quarterly and is very costly to produce. Instead, this paper shows a method to
use web scraping to produce a CPI from e-commerce websites. While I am
not entirely convinced by how much it diverges from the official CPI, it is undoubtedly an
exciting application of web scraping.
Few papers have focused on the web scraping techniques themselves rather
than the data they collect. Furthermore, few papers have been written comparing the languages
or approaches used to write web scraping software.
What do I want to research?
My research project is to apply web scraping to recipes,
specifically building a web scraper that searches and filters recipes by the 14
allergen groups. Here is a GitHub project that does this already. I want to use
this project as an opportunity to compare several different types of web
scraping.
Recipes often have metadata attached, so they turn up in search results as
‘recipe cards’.
These recipe cards are possible because the websites they come from have structured metadata (schema.org Recipe markup, often embedded as JSON-LD) built into their HTML documents.
This metadata is not presented in the rendered webpage, but it is what allows Google to produce those recipe cards. From a web scraping perspective, though, it sits in the HTML document itself and can therefore be parsed straight out of it.
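As a quick sketch of that idea (the JSON-LD fragment below is made up, but it mirrors the kind of schema.org Recipe markup recipe sites embed):

# Reading schema.org Recipe metadata out of a page's JSON-LD block.
import json
from bs4 import BeautifulSoup

page = """
<html><head>
<script type="application/ld+json">
{"@type": "Recipe",
 "name": "Lemon drizzle cake",
 "recipeIngredient": ["flour", "eggs", "lemon"],
 "recipeInstructions": "Mix and bake."}
</script>
</head><body>...</body></html>
"""

soup = BeautifulSoup(page, "html.parser")
block = soup.find("script", type="application/ld+json")
data = json.loads(block.string)

# The metadata never appears in the rendered page, but it is right there in the HTML.
print(data["name"], data["recipeIngredient"])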
I want to build several tools: one that uses this metadata technique
and others that use traditional HTML and DOM parsing, and compare their
performance. I also want to compare the different languages these tools will be
built in, both in quantitative performance and in qualitative ease of
development. The most common languages for this are R and Python, and both have many free,
open-source tools available.
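To give a flavour of the quantitative comparison I have in mind, here is a rough Python sketch (the document and repetition count are arbitrary placeholders) timing the same extraction with two parsing approaches:

# Time the same extraction with BeautifulSoup and with lxml XPath.
import time
from bs4 import BeautifulSoup
from lxml import html

page = "<html><body>" + "<h2 class='t'>Recipe</h2>" * 1000 + "</body></html>"

def time_it(fn, repeats=50):
    # Average wall-clock time per call over a fixed number of repeats.
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

bs_time = time_it(lambda: [h.get_text() for h in BeautifulSoup(page, "html.parser").find_all("h2")])
xpath_time = time_it(lambda: html.fromstring(page).xpath("//h2/text()"))

print(f"BeautifulSoup: {bs_time:.4f}s  lxml XPath: {xpath_time:.4f}s")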
In total, I want to ask these research questions:
RQ1: Can a web-based tool that filters recipes by the 14 allergen groups be created in both Python and R?
RQ2: Will a hybrid approach using metadata and HTML parsing be more efficient than HTML parsing alone?
RQ3: Can a machine learning algorithm improve the accuracy of the tool compared to a simple text filter?
RQ4: Is there a performance difference between Python and R when applied to web scraping?
Next week, I will discuss the project management approach to
this project and compare some tools to aid in this.
Bibliography
DeVito, N. J., Richards, G. C. & Inglesby, P., 2020. How We Learned to Stop Worrying and Love Web Scraping. Nature, Volume 585.
Oancea, B. & Necula, M., 2019. Web scraping techniques for price statistics – the Romanian experience. Statistical Journal of the IAOS, Volume 35.


