Putting it all together

 

Introduction



In this blog post, I will bring together my previous blog posts and lay out a draft plan for my research project. As I have said before, the basic idea is to use web scraping techniques to create a tool that filters recipes by the 14 allergen groups.
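To make the filtering idea concrete, here is a minimal sketch of how a recipe's ingredient list could be checked against allergen groups. The keyword lists, the function names `find_allergens` and `is_safe`, and the sample recipe are all my own illustrative assumptions, and only five of the 14 groups are shown. Simple keyword matching like this will produce false positives (for example "peanut butter" or "coconut milk" would be flagged as milk), so it is a starting point rather than a finished design.

```python
import re

# Illustrative keyword lists only (an assumption, not a complete reference):
# a real tool would need fuller lists for all 14 allergen groups.
ALLERGEN_KEYWORDS = {
    "milk": ["milk", "butter", "cheese", "cream", "yoghurt"],
    "eggs": ["egg"],
    "peanuts": ["peanut"],
    "cereals containing gluten": ["wheat", "flour", "barley", "rye"],
    "sesame": ["sesame", "tahini"],
}


def find_allergens(ingredients):
    """Return the set of allergen groups whose keywords appear in any ingredient."""
    found = set()
    for ingredient in ingredients:
        text = ingredient.lower()
        for group, keywords in ALLERGEN_KEYWORDS.items():
            # Match whole words and simple plurals ("egg", "eggs"), not "eggplant"
            if any(re.search(rf"\b{re.escape(k)}s?\b", text) for k in keywords):
                found.add(group)
    return found


def is_safe(ingredients, avoid):
    """True if none of the allergen groups in `avoid` were detected."""
    return not (find_allergens(ingredients) & set(avoid))


recipe = ["200g plain flour", "2 eggs", "300ml milk", "1 tbsp sunflower oil"]
print(sorted(find_allergens(recipe)))  # ['cereals containing gluten', 'eggs', 'milk']
print(is_safe(recipe, ["peanuts"]))    # True
print(is_safe(recipe, ["milk"]))       # False
```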

As far as this project is concerned, there are two main relevant web scraping techniques: HTML parsing and DOM parsing.

There is the added component that, because of Google’s recipe cards, most recipe sites include metadata in their HTML content that lists the ingredients in a standard format. That should make it easier to generalise the recipe scraper across different websites.
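As a rough sketch of the metadata approach, the example below assumes the standard format is schema.org `Recipe` markup embedded as JSON-LD in a `<script type="application/ld+json">` tag, which is one of the formats Google documents for recipe results. Some sites use Microdata or RDFa instead, which this sketch does not handle. The class name `JsonLdExtractor`, the function `extract_ingredients` and the sample page are hypothetical, and only the Python standard library is used.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def _is_recipe(node):
    """True if the node's @type is "Recipe" or a list containing it."""
    types = node.get("@type")
    return types == "Recipe" or (isinstance(types, list) and "Recipe" in types)


def extract_ingredients(html):
    """Return recipeIngredient from the first Recipe object found, else []."""
    parser = JsonLdExtractor()
    parser.feed(html)
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed metadata rather than fail
        # Metadata may be a single object, a list, or wrapped in "@graph"
        candidates = data if isinstance(data, list) else [data]
        nodes = []
        for item in candidates:
            if isinstance(item, dict):
                nodes.append(item)
                nodes.extend(n for n in item.get("@graph", []) if isinstance(n, dict))
        for node in nodes:
            if _is_recipe(node):
                ingredients = node.get("recipeIngredient", [])
                return [ingredients] if isinstance(ingredients, str) else list(ingredients)
    return []


# Hypothetical sample page, made up for illustration
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe", "name": "Pancakes",
 "recipeIngredient": ["200g plain flour", "2 eggs", "300ml milk"]}
</script>
</head><body><h1>Pancakes</h1></body></html>
"""
print(extract_ingredients(sample_html))
```

Because the scraper reads the metadata rather than the visible page layout, the same function should work on any site that publishes its recipes this way, which is the generalisation benefit described above.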

Many web scrapers are written in Python or R, and both languages have a wide range of open-source frameworks built specifically for web scraping.

There is also the added component that web scraping often has to work around anti-bot technology, as website owners usually try to stop the content on their websites from being scraped.

I want to investigate which web scraping techniques perform best when applied to scraping recipes online. For example, is the metadata approach better than HTML and DOM parsing? In addition, I want to know which language is easier to develop in and whether there are any differences in performance between tools written in Python and R. I also want to investigate whether anti-bot technology impacts the performance of any of these tools and, if so, to what extent.

Deliverables and Tasks

The deliverables of this project will therefore be:

  • An HTML interface for the tool
  • Metadata web scraper
  • Python web scraper
  • R web scraper
  • 10,000-word report

The following additional tasks will also need to be completed:

  • A research diary to document the development of each scraper
  • An experimental phase where the performance of each scraper is analysed
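For the experimental phase, one possible way to compare scrapers is sketched below. I have not yet settled what "performance" will mean, so the two measures used here (mean run time per page, and recall, meaning the fraction of expected ingredients that were found) are assumptions. The names `evaluate_scraper` and `line_scraper` and the tiny test pages are hypothetical stand-ins.

```python
import time
from statistics import mean


def evaluate_scraper(scraper, labelled_pages):
    """Run `scraper` over (html, expected_ingredients) pairs.

    Returns the mean run time per page in seconds and the mean recall,
    i.e. the fraction of expected ingredients the scraper found.
    """
    timings, recalls = [], []
    for html, expected in labelled_pages:
        start = time.perf_counter()
        found = set(scraper(html))
        timings.append(time.perf_counter() - start)
        expected_set = set(expected)
        # A page with no expected ingredients counts as fully recalled
        recalls.append(len(found & expected_set) / len(expected_set) if expected_set else 1.0)
    return {"mean_seconds": mean(timings), "mean_recall": mean(recalls)}


# Hypothetical stand-in scraper: treats each non-empty line as an ingredient.
def line_scraper(html):
    return [line.strip() for line in html.splitlines() if line.strip()]


pages = [
    ("flour\neggs\nmilk", ["flour", "eggs", "milk"]),  # all 3 found
    ("flour\nsugar", ["flour", "butter"]),             # 1 of 2 found
]
result = evaluate_scraper(line_scraper, pages)
print(result["mean_recall"])  # 0.75
```

Passing each scraper in as a function means the metadata, Python and R-backed tools could all be scored against the same labelled pages, as long as each can be called from one script.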

While it definitely looks like a lot of work, I already know the project is possible and have examples of tools and a range of resources in the form of tutorials and open-source projects.

Broken down by month, the research project could follow this timetable:

Month 1 – HTML interface, Metadata Web Scraper, 2000 words of the report

Month 2 – Python and R Web Scraper, 2000 words of the report

Month 3 – Experimental Phase, 2000 words of the report

Month 4 – Final 4000 words of the report

This seems a reasonable allocation of time. Assuming a standard working pattern of 5 days a week, spending 1 hour a day on each coding component gives roughly 20 hours per component over the course of its month. For months 1-3, the report would require only 100 words to be written each working day.

The final month only requires a writing pace of 200 words per day, allowing flexibility in completing the other goals.
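The writing pace can be checked with a quick calculation. This sketch assumes about 20 working days in a month (5 days a week for roughly 4 weeks); the variable names are my own.

```python
WORKING_DAYS_PER_MONTH = 20  # assumption: 5 days a week, about 4 weeks a month

# Report words planned for each month, taken from the timetable
monthly_word_targets = {1: 2000, 2: 2000, 3: 2000, 4: 4000}

for month, words in monthly_word_targets.items():
    print(f"Month {month}: {words // WORKING_DAYS_PER_MONTH} words per working day")

total_words = sum(monthly_word_targets.values())
print(f"Total: {total_words} words")  # Total: 10000 words
```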

Research Methods

Last week I wrote about different research methodologies. As part of my project plan, I want to work out which methodology my project should follow.

My project is mainly quantitative: I want to explicitly test the performance of the tools I will build. Having said that, there is also a significant qualitative aspect, as I also want to consider which web scraping technique or language is better and how easy each one was to build with.

Therefore some sort of mixed methodology is appropriate. As the qualitative aspect comes first, the project is best described as sequential exploratory.

Charts and Visual Aids

I really like the Gantt chart as a visual tool for mapping the whole project, so I will adapt one and create it in Excel, as it is software I already have access to, am familiar with, and can fully customise.

I am also keen on Kanban boards, which will look like the following.

This breaks down the total tasks into nice chunks and is visually appealing.

Measuring Success

The main goals of the project are:

  • to answer all the research questions
  • to produce the web scraping tool
  • to keep to the deadline of the project

If I can achieve these, I will consider this project a success.
