Putting it all together
Introduction
In this blog post, I will combine my previous blog posts and
lay out a draft plan for my research project. As I have said before, the basic
idea is to use web scraping techniques to create a tool to filter recipes for
the 14 allergen groups.
As far as this project is concerned, there are two main relevant web
scraping techniques: HTML parsing and DOM parsing.
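To make the HTML parsing approach concrete, here is a minimal sketch in Python using only the standard library's html.parser module. The HTML snippet and the "ingredient" class name are invented for illustration; real recipe sites will use their own markup.

```python
from html.parser import HTMLParser

# Minimal parser that collects the text of every element whose
# class attribute contains "ingredient" (a made-up convention here).
class IngredientParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_ingredient = False
        self.ingredients = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "ingredient" in classes.split():
            self.in_ingredient = True

    def handle_endtag(self, tag):
        self.in_ingredient = False

    def handle_data(self, data):
        if self.in_ingredient and data.strip():
            self.ingredients.append(data.strip())

html = """
<ul>
  <li class="ingredient">200g plain flour</li>
  <li class="ingredient">2 eggs</li>
  <li class="note">Serves four</li>
</ul>
"""

parser = IngredientParser()
parser.feed(html)
print(parser.ingredients)  # ['200g plain flour', '2 eggs']
```

The same idea scales up with a proper scraping framework, but even this stdlib version shows why the approach is fragile: it depends entirely on each site's class names.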
There is the added advantage that, thanks to Google’s recipe
cards, most recipe sites embed metadata in their HTML that lists the
ingredients in a standard format. That should make it easier for a
web scraper to generalise across different websites.
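Google’s recipe cards are built on schema.org Recipe metadata, usually embedded as JSON-LD inside a `<script type="application/ld+json">` tag. A minimal stdlib sketch of pulling the ingredients out of that metadata; the example page and recipe values are invented:

```python
import json
from html.parser import HTMLParser

# Collects the contents of <script type="application/ld+json"> blocks,
# which is where Google-style recipe cards store structured data.
class JsonLdParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld and data.strip():
            self.blocks.append(json.loads(data))

page = """
<html><head>
<script type="application/ld+json">
{"@type": "Recipe", "name": "Pancakes",
 "recipeIngredient": ["200g plain flour", "2 eggs", "300ml milk"]}
</script>
</head><body>...</body></html>
"""

parser = JsonLdParser()
parser.feed(page)
recipe = next(b for b in parser.blocks if b.get("@type") == "Recipe")
print(recipe["recipeIngredient"])  # ['200g plain flour', '2 eggs', '300ml milk']
```

Because the `recipeIngredient` field is standardised, the same extraction code should work unchanged across any site that publishes a recipe card, which is the whole appeal of the metadata approach.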
Most web scrapers are written in either Python or R, and
these languages have the widest range of open-source frameworks built
specifically for web scraping.
There is also the complication that web scrapers often
have to work around anti-bot technology, as website owners usually try to
stop their content from being scraped.
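One polite way to work within those limits is to honour a site's robots.txt before fetching anything. A sketch using Python's urllib.robotparser; the robots.txt contents and the scraper's user-agent string are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt like many recipe sites publish:
# search pages are off-limits to bots, everything else is allowed.
robots_txt = """\
User-agent: *
Disallow: /search/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

ua = "MyRecipeScraper/0.1"  # hypothetical user agent
print(rp.can_fetch(ua, "https://example.com/recipes/pancakes"))  # True
print(rp.can_fetch(ua, "https://example.com/search/?q=eggs"))    # False
print(rp.crawl_delay(ua))  # 5 — seconds to wait between requests
```

In the real tool the rules would be fetched with `rp.set_url(...)` and `rp.read()`; parsing from a string here keeps the sketch runnable offline.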
I want to investigate which web scraping techniques perform
best when applied to scraping recipes online. For example, is the metadata
approach better than HTML or DOM parsing? In addition, I want to know which
language is easier to develop in and whether there are any performance
differences between tools written in Python and R. I also want to investigate
whether anti-bot technology affects the performance of any of these tools and,
if so, to what extent.
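The performance side of these questions could come down to timing each scraper over the same set of pages. A sketch of such a harness; the two scraper functions are hypothetical stand-ins for the real tools:

```python
import time

# Hypothetical stand-ins for the scrapers under test; in the real
# experiment each would parse an actual downloaded recipe page.
def metadata_scraper(page):
    return ["flour", "eggs"]

def dom_scraper(page):
    return ["flour", "eggs"]

def time_scraper(scraper, pages, repeats=100):
    """Return the mean wall-clock time per page for one scraper."""
    start = time.perf_counter()
    for _ in range(repeats):
        for page in pages:
            scraper(page)
    return (time.perf_counter() - start) / (repeats * len(pages))

pages = ["<html>...</html>"]  # placeholder corpus of saved pages
for scraper in (metadata_scraper, dom_scraper):
    print(scraper.__name__, time_scraper(scraper, pages))
```

Timing against saved local pages, rather than live requests, would keep network latency and anti-bot delays out of the parsing comparison; the anti-bot question would then need a separate experiment against the live sites.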
Deliverables and Tasks
The deliverables of this project will therefore be:
- An HTML interface for the tool
- Metadata web scraper
- Python web scraper
- R web scraper
- 10,000-word report
The following additional tasks will also need to be completed:
- A research diary to document the development of each scraper
- An experimental phase in which the performance of each scraper is analysed
While it definitely looks like a lot of work, I already know
the project is possible and have examples of tools and a range of resources in
the form of tutorials and open-source projects.
Breaking the timeline down by month, the research
project could follow this timetable:
Month 1 – HTML interface, Metadata Web Scraper, 2000 words
of the report
Month 2 – Python and R Web Scraper, 2000 words of the report
Month 3 – Experimental Phase, 2000 words of the report
Month 4 – Final 4000 words of the report
This seems a reasonable allocation of time. Assuming a
standard working pattern of 5 days a week and 1 hour per day on each coding
component, each component gets roughly 20 hours over the course of its month.
For months 1-3, the report would require only 100 words to be written
daily.
The final month only requires a writing pace of 200 words
per day, allowing flexibility in completing the other goals.
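The arithmetic behind this budget is easy to check:

```python
# Sanity check of the monthly budget: 5 working days a week over a
# 4-week month, with 1 hour of coding per working day.
working_days = 5 * 4                    # 20 working days in a month
coding_hours_per_component = working_days * 1
print(coding_hours_per_component)       # 20 hours per coding component

print(2000 / working_days)  # 100.0 words per day in months 1-3
print(4000 / working_days)  # 200.0 words per day in month 4
```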
Research Methods
Last week I wrote about different research methodologies. As
part of my project plan, I want to set out which methodology
my project should follow.
Mine is mainly Quantitative: I want to explicitly test the
performance of the tools I will build. Having said that, there is also a
significant Qualitative aspect, as I also want to consider which
web scraping technique or language is better and how easy each was to
build with.
Therefore, some sort of mixed methodology is appropriate.
However, as the Qualitative aspect comes first, the project is best described
as Sequential Exploratory.
Charts and Visual Aids
I really like the Gantt chart as a visual tool for mapping
the whole project, so I will create one in Excel: it is software I already
have access to, I am familiar with it, and it is fully modifiable.
I am also keen on Kanban boards, which will look like
the following.
This breaks down the total tasks into nice chunks and is
visually appealing.
Measuring Success
The main goals of the project are to:
- answer all the research questions
- produce the web scraping tool
- keep to the project deadline
