How to Scrape Pages with Infinite Scrolling with ParseHub

In this guide:

  • Scroll feature in ParseHub
  • Scraping a page with infinite scrolling with ParseHub

Infinite scrolling is a common pattern on websites today. Facebook made it popular, and it works like this: as you scroll down, new data loads, and as you keep scrolling, more data loads. This new data is usually delivered by JavaScript, and as covered in my other post [How to Scrape JavaScript Webpages with ParseHub], ParseHub can handle most JavaScript cases. Let’s get started!

This guide assumes you have signed up for an account, downloaded the desktop app, launched it successfully, and logged in.

Single Page with Infinite Scroll

Enter this URL into the ParseHub browser and start a new project:

https://finance.yahoo.com/quote/TSLA?p=TSLA

Our goal is to grab all the article links associated with this company. Scroll down and notice that as you scroll, more articles show up. We want to get all of those articles. If you don’t scroll down and have ParseHub grab only what loads initially, you’ll get just 10-20 article URLs.

We begin by clicking the + sign next to the first “Select Page” and clicking Extract:

Start by extracting the page URL

We want to keep the first extract, which is the page’s URL. It’s always good practice to extract the page URL so you can use it as a reference for future scrapes.

Extracting the page URL

Let’s add a new Select (click + on the first “Select Page”) and select all of the boxes with articles. Be sure to scroll down a few times to load more articles, which teaches ParseHub to select the additional boxes as well. A green highlight means an element is selected:

Select the elements with articles. Be sure to scroll down a few times

Now to add the scrolling, we will click the + sign on the first “Select Page” again and click Select. We will select one of the parent elements that contains the articles. In this case, it’ll be the area with the chart and everything below it. Make sure the green highlight also surrounds the article area (highlighted in blue below):

Select a parent element that encompasses the article elements

I’ve also renamed that select as “select_for_scroll”:

Rename to “select_for_scroll”

Within the “select_for_scroll” command, click the + button and add Scroll:

Use the Scroll action

In the scroll settings at the bottom left, set the scroll to repeat 20 times, keeping the default 2-second wait and the default alignment to the bottom. This means it’ll take at least 40 seconds to process this page:
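
The 40-second figure is just the number of scrolls multiplied by the wait after each one. Here is a minimal sketch of that arithmetic in Python, in case you want to budget time for other settings. The function and parameter names are my own for illustration and are not ParseHub identifiers:

```python
def min_scroll_seconds(scroll_count: int, wait_seconds: float) -> float:
    """Lower bound on time spent scrolling: one wait after each scroll.

    Hypothetical helper; it ignores page load and extraction time,
    so the real run will take longer than this.
    """
    if scroll_count < 0 or wait_seconds < 0:
        raise ValueError("scroll_count and wait_seconds must be non-negative")
    return scroll_count * wait_seconds


# The settings used in this guide: 20 scrolls, 2 seconds each.
print(min_scroll_seconds(20, 2))  # 40
```

Raising the scroll count loads more articles, but the minimum run time grows in step with it.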

Almost ready!

Now drag the “select_for_scroll” command so that it sits right after the first Select Page command. This tells ParseHub to scroll down 20 times before it proceeds to grab data:
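
To see why the order matters, here is a toy model in Python. It is only a sketch of the idea, not how ParseHub works internally: the class, the batch size of 10, and the total of 185 items are all made-up values for illustration.

```python
class ToyFeed:
    """Toy stand-in for an infinite-scroll page (not ParseHub internals)."""

    def __init__(self, total_items: int, batch_size: int):
        self.total_items = total_items
        self.batch_size = batch_size
        # The first batch is already present when the page opens.
        self.loaded = min(batch_size, total_items)

    def scroll(self) -> None:
        """Scrolling to the bottom loads one more batch, if any remain."""
        self.loaded = min(self.loaded + self.batch_size, self.total_items)

    def extract(self) -> list:
        """Return only the items currently loaded on the page."""
        return [f"article-{i}" for i in range(1, self.loaded + 1)]


def scrape(feed: ToyFeed, scrolls: int) -> list:
    """Scroll first, then extract, mirroring the command order above."""
    for _ in range(scrolls):
        feed.scroll()
    return feed.extract()


# Illustrative numbers only.
without_scroll = scrape(ToyFeed(total_items=185, batch_size=10), scrolls=0)
with_scroll = scrape(ToyFeed(total_items=185, batch_size=10), scrolls=20)
print(len(without_scroll), len(with_scroll))  # 10 185
```

Extraction only sees what has already loaded, so the scroll has to come first.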

Move “select_for_scroll” to the top

Now we’re ready to test! Remember to save first. Go to the test page and click Play:

Save and test run

While the test runs, watch the commands appear on the bottom-left side. When it finishes, you can quickly check whether it worked by looking at the bottom-right side and scrolling down slightly. There should be 100+ entries, not just 10-20. (I captured at least 185 entries in the picture):
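
If you export the results as JSON, you can also count the entries with a few lines of Python instead of eyeballing the preview. This is a rough sketch under assumptions: I am assuming the export is an object whose key is the name of your Select command (here a hypothetical "articles") holding a list of items with a "url" field. Check your own export and adjust the key names to match.

```python
import json


def count_unique_urls(export_text: str, list_key: str, url_key: str = "url") -> int:
    """Count distinct URLs in a JSON export.

    list_key and url_key are assumptions about the export layout;
    use the names that appear in your own results file.
    """
    data = json.loads(export_text)
    urls = {item[url_key] for item in data.get(list_key, []) if url_key in item}
    return len(urls)


# Small made-up sample standing in for a real export.
sample = json.dumps({
    "articles": [
        {"name": "A", "url": "https://example.com/a"},
        {"name": "B", "url": "https://example.com/b"},
        {"name": "A again", "url": "https://example.com/a"},
    ]
})
print(count_unique_urls(sample, "articles"))  # 2
```

A count well above 20 suggests the scroll ran before the extraction.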

Success!

If you’re having trouble, try saving this as a .phj file and importing it: [https://gist.github.com/alex4hoang/0e]. Compare it against your own project to see where they differ.

 
