Photo by Richard Nolan

How to Scrape JavaScript Webpages with ParseHub

In this guide:

  • ParseHub desktop navigation
  • Scrape a single JavaScript rendered page with ParseHub

Scraping pages has never been easier than with ParseHub. It’s also FREE to start! As my other post on scraping [How to Scrape JavaScript Rendered Websites with Python & Selenium] shows, doing it yourself requires some coding knowledge and more setup. With ParseHub, like Import.io, no coding is needed, just some knowledge of how websites work. However, unlike Import.io, ParseHub can bypass login pages, render JavaScript websites, and parse multiple pages like a champ. (In the meantime, you can also learn How to Scrape Infinite Scrolling Content with ParseHub.)

We will go through some basic but powerful ParseHub features to parse a single page.

This guide assumes you have signed up for an account, downloaded and launched the desktop app, logged in to it, and see a screen like this:

Single page crawl

Let’s start by pasting this link [https://www.target.com/c/furniture/-/N-5xtnr] into the address bar in the desktop app. Going forward, all actions take place in the desktop app.

We want to grab all of the products on this single page:

Our target is Target

Let’s start by clicking the button “New Project” on the top left. Proceed with clicking “Start Project with this URL”.

After starting a new project

The left sidebar stays visible throughout the process; it’s where we’ll build the commands, or steps, we want ParseHub to carry out. We are greeted with:

  • main_template: the default template name we start with
  • Select page: the app initially selects the entire page, which will be our playground
  • Empty selection(0): the starting selection; we begin by selecting an element we want to extract

For our goal of getting all of the products on the page, there are a few routes, but let’s start with this:

  • Select the first product_name on the top left. You’ll notice that it turns green; that is the element we’ll extract. You’ll also notice that the other product_names turn yellow, which marks the elements ParseHub thinks are similar and that you may also want.
Selecting the first product’s product_name
  • If you select the product_name to the right of the one you initially selected (highlighted in green), you’ll notice that some of the other product_names will be highlighted green and some won’t have highlights.
Training the app to select other product_names
  • Click on the ones that don’t have highlights until you see all of the product_names are green aside from the “trending items” at the bottom. You’ll likely just have to click that top right one with the two white chairs.
Confirm we are selecting all of the product_names
  • Next, let’s select the product_image of each product. We do this by clicking the + button to the right of “Begin new entry in selection1” and selecting “Relative Select”.
Relative select
  • Starting with the top left product again. Click on the product_name, and then click on the product_image of the product. In this case, we’d select “Futon with arms — Room Essentials(tm)” and then the futon image.
Relative select the product_image
  • Do this until all of the product_images are selected with green highlights.
Confirm we relative select ALL of the product_images
  • Let’s repeat this Relative Select for the product_price
Relative select ALL of the product_prices
  • Notice that the data preview at the bottom right is now populated.
Notice the JSON data at the bottom right. You can preview the CSV/Excel data too
  • Alright, we have the commands needed to grab the data we want. Let’s make sure we save; the Save option is under the hamburger menu.
Save
  • Let’s run this project. From the same hamburger menu, select Run. Proceed with “Save and Run”.
Run
  • After the job finishes, click on the CSV button to save the .csv file. Success! Your run_results.csv file should look like this.
run_results.csv
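
If you’d like to sanity-check the export outside of ParseHub, here is a minimal Python sketch that counts the rows in run_results.csv and reports how many cells are empty in each column. It is only an illustration: the function name summarize_results is made up for this example, and it makes no assumption about the exact column names ParseHub writes, since those depend on how you named your selections.

```python
import csv
import os
from collections import Counter


def summarize_results(path):
    """Return (row_count, missing), where missing maps each column
    name to the number of rows in which that column is empty."""
    missing = Counter()
    row_count = 0
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Column names come from the CSV header, whatever they happen to be.
        columns = reader.fieldnames or []
        for row in reader:
            row_count += 1
            for column in columns:
                # Short rows yield None for absent cells; treat that as empty.
                if not (row.get(column) or "").strip():
                    missing[column] += 1
    return row_count, {column: missing[column] for column in columns}


if __name__ == "__main__":
    # Assumes the export was saved next to this script under its default name.
    if os.path.exists("run_results.csv"):
        rows, missing = summarize_results("run_results.csv")
        print(f"{rows} products exported")
        for column, count in missing.items():
            print(f"{column}: {count} empty")
```

If a column such as the price shows many empty cells, that usually hints that one of the Relative Select steps did not match every product, and it’s worth going back to the app to click the ones that were missed.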

If you’re having trouble, try saving this as a .phj file and importing it: https://gist.github.com/alex4hoang/14  

 

Challenge

 

Disclaimer: I am not affiliated with ParseHub, Target or target.com, and this guide is for educational purposes only.
