In this guide:
- Scroll feature in ParseHub
- Scraping a page with infinite scrolling with ParseHub
Single Page with Infinite Scroll
Start by entering this URL into the ParseHub browser and starting the project:
Our goal is to grab all the article links associated with this company. Scroll down and notice that the more you scroll, the more articles show up. We want to get all of those articles. If you don’t scroll down and have ParseHub grab just the selected data, you’ll only get 10-20 article URLs.
We begin by clicking the + sign next to the first “Select Page” and clicking Extract:
We want to keep the first extract, which is the page’s URL. It’s always good practice to extract the page URL so you can use it as a reference for future scrapes.
Let’s add a new Select (click + on the first “Select Page”) and select all of the boxes with articles. Be sure to scroll down a few times to load more articles, so that ParseHub learns to select the additional boxes as well. A green highlight means the element is selected:
Now to add the scrolling, we will click the + sign on the first “Select Page” again and click Select. We will select one of the parent elements that contains the articles; in this case, it’ll be the area with the chart and everything below it. Make sure the green highlight also surrounds the article area (highlighted in blue below):
I’ve also renamed that select as “select_for_scroll”:
Within the “select_for_scroll” command, click the + button and add scroll:
In the scroll settings at the bottom left, set the scroll to repeat 20 times, keeping the default 2-second wait and the default alignment to the bottom. This means it’ll take at least 40 seconds (20 scrolls × 2 seconds) to process this page.
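The estimate above is simple arithmetic, and it can help to recompute it when you change the settings. Here is a minimal Python sketch of that calculation; the function name `min_scroll_seconds` is made up for illustration and is not part of ParseHub.

```python
def min_scroll_seconds(scroll_count, wait_seconds):
    """Lower bound on time spent scrolling: one wait per scroll.

    Hypothetical helper for illustration only. The real run will take
    longer, since page loading and extraction are not counted here.
    """
    if scroll_count < 0 or wait_seconds < 0:
        raise ValueError("scroll_count and wait_seconds must be non-negative")
    return scroll_count * wait_seconds


# The settings used in this guide: 20 scrolls, 2 seconds wait each.
print(min_scroll_seconds(20, 2))  # 40
```

If 20 scrolls turn out not to be enough for your page, the same calculation tells you roughly how much longer each run will take when you raise the count.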
Now drag the “select_for_scroll” command so it sits right after the first Select Page command. This tells ParseHub to scroll down 20 times before it proceeds to grab data.
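To see why the order matters, here is a toy Python simulation of scrolling before extracting. It does not show how ParseHub works internally: `FakePage`, its batch size of 10, and the `scrape` function are all assumptions invented for this sketch.

```python
class FakePage:
    """Toy stand-in for an infinite-scroll page (not a real browser)."""

    def __init__(self, total_items, batch_size):
        self.total_items = total_items
        self.batch_size = batch_size
        # The first batch is visible as soon as the page loads.
        self.loaded = min(batch_size, total_items)

    def scroll_to_bottom(self):
        # Each scroll reveals one more batch, until nothing is left.
        self.loaded = min(self.loaded + self.batch_size, self.total_items)

    def visible_items(self):
        return [f"article-{i}" for i in range(self.loaded)]


def scrape(page, scroll_count):
    """Scroll first, then extract whatever is visible."""
    for _ in range(scroll_count):
        page.scroll_to_bottom()
    return page.visible_items()


without_scroll = scrape(FakePage(total_items=500, batch_size=10), scroll_count=0)
with_scroll = scrape(FakePage(total_items=500, batch_size=10), scroll_count=20)
print(len(without_scroll), len(with_scroll))  # 10 210
```

Extracting before scrolling corresponds to the `scroll_count=0` case: only the first batch is on the page when the data is grabbed.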
Now we’re ready to test! Remember to save the project first. Then go to the test page and click Play:
While the test run is in progress, notice the commands showing on the bottom-left side. Once it is done, you can quickly check whether it worked by looking at the bottom-right side and scrolling down slightly. There should be 100+ entries, not just 10-20. (I captured at least 185 entries in the picture.)
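If you export the results as JSON, the same check can be done with a short script. This is a rough sketch only: the top-level key `articles`, the `url` field, and the threshold of 100 are assumptions for illustration, and your export will use whatever names you gave your selections.

```python
import json


def count_entries(results, key):
    """Number of entries stored under `key` (0 if the key is missing)."""
    return len(results.get(key, []))


def unique_urls(results, key, field="url"):
    """Distinct values of `field`, in first-seen order."""
    seen = []
    for entry in results.get(key, []):
        value = entry.get(field)
        if value is not None and value not in seen:
            seen.append(value)
    return seen


def looks_complete(results, key, minimum=100):
    """True when the scroll apparently loaded more than the first batch."""
    return count_entries(results, key) >= minimum


# Hypothetical sample standing in for an exported results file.
sample = json.loads(
    '{"articles": ['
    '{"url": "https://example.com/a"},'
    '{"url": "https://example.com/b"},'
    '{"url": "https://example.com/a"}'
    "]}"
)

print(count_entries(sample, "articles"))     # 3
print(len(unique_urls(sample, "articles")))  # 2
print(looks_complete(sample, "articles"))    # False
```

Counting unique URLs as well as raw entries is a cheap way to notice if the same articles were captured more than once.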
If you’re having trouble, try saving this as a .phj file and importing it: [https://gist.github.com/alex4hoang/0e]. Compare it against your own project to see where the two differ.