Smart Mode – Scrapestorm
https://www.scrapestorm.com
Fri, 21 Apr 2023

How to collect web data in reverse order
https://www.scrapestorm.com/tutorial/how-to-collect-web-data-in-reverse-order-2/
Thu, 20 Apr 2023

When collecting data, it is often necessary to collect in reverse order, that is, from the last page back to the first. This article shows how to use ScrapeStorm's Smart Mode to collect web page data in reverse order.

Case 1: The link changes when the list page is turned, and a link to the last page exists.

Processing method 1: Use the link of the last page of the list as the collection link.

When the link to the last page of the list can be obtained directly, simply copy that link and use it to create the collection task.

1. Navigate to the last page in the browser and copy its link.

2. Create a smart mode task.

3. Set the page-turning button by manually selecting the “Previous” button.

4. Start the task to begin collecting in reverse order.

Processing method 2: Set reverse page numbers in batches

When the link changes as pages are turned but there is no “Previous” button for turning pages backward, you can achieve reverse-order collection by setting page numbers.

1. Copy the link to the second page.

Generally, the link of the first page may differ from those of the second and subsequent pages, so the URL pattern cannot be inferred from the first page alone. It is therefore recommended to copy the link of the second page when creating the task.

2. Use the function of generating URLs in batches to generate links.

As shown in the figure below, “Start” is set to “Last Page”, “End” is set to “First Page”, and “Step” is set to “decrease”.

For more details, please refer to the tutorial: How to use URLs Generator

3. Set paging.

Once URLs have been generated in batches, there is no need to set a page-turning button; it can be set to “None”. If the page needs to be scrolled to display more data, it is recommended to set it to “Scroll to Load”.

4. Start the task to begin collecting in reverse order.
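The batch URL generation above can be sketched in code; a minimal illustration with “Start” at the last page, “End” at the first page, and a decreasing “Step” (the URL template is a hypothetical example, not a real site):

```python
# Sketch: generate list-page URLs from the last page down to the first,
# mirroring Start = last page, End = first page, Step = decrease.
# The URL template is a hypothetical example, not a real site.
def reverse_page_urls(template, last_page, first_page=1, step=1):
    """Return URLs for page numbers counting down from last_page to first_page."""
    return [template.format(page=n) for n in range(last_page, first_page - 1, -step)]

urls = reverse_page_urls("https://example.com/list?page={page}", last_page=5)
# urls[0]  == "https://example.com/list?page=5"
# urls[-1] == "https://example.com/list?page=1"
```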

Case 2: The link remains unchanged when the list page is turned, and there is no link to the last page.

Processing method 1: There is a button to jump to the last page on the web page.

When the link does not change as pages are turned and the link of the last page cannot be obtained directly, you can jump to the last page by clicking its page-turning button, then collect in reverse order.

1. Create a smart mode task.

2. Add a “Click” component in the Pre-executed Operation interface to turn to the last page.

3. Set the page-turning button by manually selecting the “Previous” button.

4. Start the task to begin collecting in reverse order.

Processing method 2: There is a page number input box on the web page

When the link does not change as pages are turned and the link of the last page cannot be obtained directly, you can jump to the last page by entering its page number, then collect in reverse order.

1. Create a smart mode task.

2. Add an “Input” component and a “Click” component in the Pre-executed Operation interface to turn to the last page.

3. Set the page-turning button by manually selecting the “Previous” button.

4. Start the task to begin collecting in reverse order.

How to save the current page as a file when scraping
https://www.scrapestorm.com/tutorial/how-to-save-the-current-page-as-a-file-when-scraping/
Thu, 09 Mar 2023

When scraping data, if scraping stops before all the data has been collected, you can click the “Show Page” button to check whether the page opened abnormally during the scraping process. This article explains how to save the current page as a file while scraping.

Step 1: Click the “Show Page” button

After starting the task, the software automatically opens the Running Interface. Click the “Show Page” button on this interface to see the page currently being scraped.

Through the “Show Page” interface you can confirm the status of the page the task has open: whether the Pre-executed Operation runs normally, whether the page turns normally, whether an advertisement pop-up appears, whether a CAPTCHA is encountered, and so on.

Step 2: Download the current webpage

In the upper right corner of the opened page there is a “Save the current webpage to the file system” button. Click it to download the current page to your computer.
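Conceptually, this button just writes the current page's HTML source to a local file for later inspection. A minimal sketch of the same idea (the function name and path handling are illustrative, not ScrapeStorm's code):

```python
# Sketch: write the current page's HTML source to a local file.
# The function name and path handling are illustrative, not ScrapeStorm's code.
from pathlib import Path

def save_page(html, path):
    """Save the page source as a local .html file and return the bytes written."""
    p = Path(path)
    p.write_text(html, encoding="utf-8")
    return p.stat().st_size
```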

How to scrape links of detail pages
https://www.scrapestorm.com/tutorial/how-to-scrape-links-of-detail-pages/
Mon, 25 Apr 2022

When scraping data, it is often necessary to scrape the links of detail pages. This article explains three ways to scrape detail page links with ScrapeStorm's Smart Mode; the Flowchart Mode works the same way.

First way: Automatic detection

Smart Mode automatically detects the list. Generally, when the list is detected, the link of the detail page is detected as well.

Note: If the automatic detection is inaccurate, you can also select the list manually.

For more details, you can refer to the tutorial:

How to scrape a list page

 

Second way: Through “Scrape In”

Sometimes the link to the detail page cannot be detected during list detection. In this case, you can use “Scrape In” to enter the detail page and scrape its link.

1) After the list is detected, add a field for the data that carries the detail page link. The software generates the field automatically.

Note: The data carrying the link is generally the title of the article, the name of the product, etc. You can confirm this by operating on the page in a browser.

2) Right-click the generated field, set “Extract Type“, and select “Link URL“.

3) Click “Scrape In” to enter the detail page.

For more details, you can refer to the tutorial:

How to Scrape In

4) After entering the detail page, add any field, then right-click the generated field, set “Special Value“, and select “Page URL“.

 

Third way: Splicing to get the link

If none of the above methods can scrape the link of the detail page, but the ID of the detail page can be extracted with XPath or a regular expression, you can use “Modify Data” to splice together the link of the detail page.

Note: If you do not know XPath or regular expressions, please contact us for customization.

Right-click the field, set “Modify Data“, and create a new “Add Prefix“.

In this way, you can obtain the link of the detail page.
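The “Add Prefix” splice above amounts to simple string concatenation; a minimal sketch, assuming a hypothetical detail-page URL prefix:

```python
# Sketch: the "Add Prefix" modification is plain string concatenation.
# The prefix below is a hypothetical example, not a real site.
def add_prefix(detail_id, prefix="https://example.com/item/"):
    """Splice the extracted detail-page ID onto a URL prefix."""
    return prefix + detail_id

link = add_prefix("12345")
# link == "https://example.com/item/12345"
```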

How to set Email Alerts
https://www.scrapestorm.com/tutorial/how-to-set-email-alerts/
Thu, 17 Mar 2022

ScrapeStorm supports the “Email Alerts” function. Once it is enabled, when a task encounters a CAPTCHA or needs to log in to a website, ScrapeStorm reminds the user by email.

Note: Email Alerts is only available for the Business plan and above.

The specific steps are as follows:

1. Configure the mail push service

The prerequisite for using the “Email Alerts” function is configuring the parameters of the email push service. This means you need a mailbox to act as the relay for reminder emails; all reminder emails are pushed through this mailbox. Gmail is used as the example below.

First, select “Show all settings” in Gmail's settings.

Then select “Forwarding and POP/IMAP”, enable IMAP, and save the changes.

 

Then open the settings in ScrapeStorm, where you can see the “Email Alerts“ options.

Finally, enter the four parameters in sequence in the Settings: SMTP server, Port, Email Account and Password.

 

Note:

1) About Gmail’s SMTP parameters, you can refer to this article:

Check Gmail through other email platforms

2) If your Google account has two-step verification enabled, logging in with your regular password may fail, because this login does not support two-step verification. In that case, enter an app password instead. Please refer to the tutorial below for how to generate an app password.

How to Generate an App Password

 

2. Set up “Email Alerts” 

After the parameters of the push mail service are set, you can set the “Email Alerts” in the run settings of the task, and set the recipient’s email address.

An email alert is triggered when the task encounters either of the following two conditions during the scraping process:

1) Detect CAPTCHA During the Scraping

2) Detect Login Prompt During the Scraping

ScrapeStorm supports entering multiple email addresses; separate each address with a comma “,“.
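Under the hood, an alert of this kind amounts to building a message and pushing it through the configured SMTP relay. A minimal sketch, assuming Gmail's SMTP server on port 587 and an app password (the function names are illustrative, not ScrapeStorm's internals):

```python
# Sketch of an email alert: build a message for comma-separated recipients
# and push it through an SMTP relay. Assumes smtp.gmail.com:587 with an
# app password; function names are illustrative, not ScrapeStorm's code.
import smtplib
from email.message import EmailMessage

def build_alert(account, recipients, reason):
    """Build the reminder email; recipients is a comma-separated string, as in the settings."""
    msg = EmailMessage()
    msg["Subject"] = f"ScrapeStorm alert: {reason}"
    msg["From"] = account
    msg["To"] = ", ".join(addr.strip() for addr in recipients.split(","))
    msg.set_content(f"Your scraping task triggered an alert: {reason}. Please check the task.")
    return msg

def send_alert(smtp_server, port, account, app_password, msg):
    """Push the message through the configured SMTP relay."""
    with smtplib.SMTP(smtp_server, port) as server:
        server.starttls()                    # Gmail requires TLS on port 587
        server.login(account, app_password)  # use an app password, not the account password
        server.send_message(msg)

# send_alert("smtp.gmail.com", 587, "me@gmail.com", "app-password",
#            build_alert("me@gmail.com", "a@example.com, b@example.com", "CAPTCHA detected"))
```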

 

3. Check the reminder email

After the “Email Alerts” function is set, click “Start” to run the task. When a CAPTCHA is encountered or a login is required, ScrapeStorm automatically sends an alert email to the configured mailbox, so please check it promptly.

The content of the reminder email is as follows:

 

 

How to Set up Bright Data Zone
https://www.scrapestorm.com/tutorial/how-to-set-up-bright-data-zone/
Fri, 11 Feb 2022

1. Sign up

Open the official website and click to sign up.

Enter the information to create a new account.

After setting your password, the account creation is complete.

2. Activate account

After the account is created successfully, Bright Data will send a verification link to your email.

Please click on the link in your email to verify your account.

After clicking activate account in the email, it will jump to Bright Data.

Recharge on the Billing interface; payment is required to activate IPs. Bright Data supports multiple payment methods. After the payment succeeds, you can activate the proxy.

Then select “Proxies”. There are four zones; “data_center” is generally recommended.

Click the “Inactive” button to enable the zone; the button changes to “Active”, indicating the zone is enabled.

Click “data_center” to enter the settings page.

Click “Access parameters” to switch tabs.

You can see information such as username and password on this page.

3. Set up IP Rotation

Open ScrapeStorm and set IP Rotation in the Run Settings. Copy the information from the Bright Data official website and fill it in using the format username + “:” + password.

e.g., lum-customer-data_center:123456

You can also directly copy this information from the website.

For more details about IP Rotation, you can refer to this tutorial: How to set IP Rotation & Delay
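The username + “:” + password format can be assembled in code; a minimal sketch for a requests-style proxies mapping (the proxy host and port below are placeholders, i.e. assumptions, not official Bright Data endpoints):

```python
# Sketch: assemble the credential in the username + ":" + password format
# described above. Host and port below are placeholders (assumptions),
# not official Bright Data endpoints.
def bright_data_proxies(username, password, host="zproxy.example.com", port=22225):
    proxy_url = f"http://{username}:{password}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}

proxies = bright_data_proxies("lum-customer-data_center", "123456")
# proxies["http"] == "http://lum-customer-data_center:123456@zproxy.example.com:22225"
# A requests call would then pass proxies=proxies.
```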

How to set IP Rotation & Delay
https://www.scrapestorm.com/tutorial/how-to-set-ip-rotation-delay/
Mon, 02 Dec 2019

This setting contains IP Rotation, automatic settings, and manual settings. These functions are mainly used to avoid the various website-blocking problems you may encounter.

In the editing task interface, click the “Start” button to open the running settings, and click “IP Rotation & Delay”.

 

1. IP Rotation

(1) IP Type

ⅰ. Bright Data Zone

Bright Data Zone is an external proxy service; you need to buy IPs on its official website.

How to Set up Bright Data Zone

ⅱ. Custom IP

If you need to use your own IP, please click “Set Up” and set it as required. (Note: the custom IP will be switched in sequence)

(2) Switch IP when the following condition is met:

ⅰ. Interval

The IP switches based on time. For example, if you set the switching condition to “Interval: 3 minutes”, the IP is switched every 3 minutes, and one IP is consumed each time.

ⅱ. Text appears on the page

The IP switches based on text. For example, if you set the switching condition to “Text appears on the page: error”, then whenever the corresponding text appears on the web page, the IP is switched once, and one IP is consumed each time.
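The two switching conditions can be sketched as a small rotation helper: switch when the interval has elapsed, or when the trigger text appears in the page source (a hypothetical illustration, not ScrapeStorm's implementation):

```python
# Sketch of the two switching conditions: rotate after a fixed interval,
# or when trigger text appears in the page source. Hypothetical illustration,
# not ScrapeStorm's implementation.
import time

class ProxyRotator:
    """Rotate through a list of proxies, switching on interval or trigger text."""

    def __init__(self, proxies, interval_s=180, trigger_text="error"):
        self.proxies = list(proxies)
        self.interval_s = interval_s      # e.g. 180 s = "Interval: 3 minutes"
        self.trigger_text = trigger_text  # e.g. "Text appears on the page: error"
        self.index = 0
        self.last_switch = time.monotonic()

    def current(self):
        return self.proxies[self.index % len(self.proxies)]

    def maybe_rotate(self, page_text=""):
        """Switch to the next proxy (consuming one IP) if either condition is met."""
        interval_elapsed = time.monotonic() - self.last_switch >= self.interval_s
        if interval_elapsed or self.trigger_text in page_text:
            self.index += 1
            self.last_switch = time.monotonic()
        return self.current()
```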

 

2. Automatic

For general scraping tasks, just follow the default automatic settings.

 

3. Manual

If the webpage is special and the automatic settings do not work, you can configure the settings manually.

ⅰ. Delay(s)

Some webpages open slowly, which can affect the scraping results. Setting a delay can effectively improve the quality of scraping. The default delay is 1 second and can be modified as needed.

ⅱ. Detect CAPTCHA During the Scraping

The software generally detects CAPTCHAs automatically. In special cases, you can manually set it to detect a CAPTCHA when specific text is encountered.

ⅲ. Detect Login Prompt During the Scraping

Websites that require login may log you out while the task runs, making scraping impossible, and some sites prompt for login after a certain amount of data has been scraped. With this option selected, you will be prompted to log in whenever you are logged out or a login is required.

ⅳ. Only Extract Visible Elements

Some websites mix invalid data with valid data, so many invalid characters appear when scraping; these invalid characters are hidden on the page. In this case, choose this setting to scrape only the visible elements.

Note: If the website does not hide invalid characters, selecting this option may result in incomplete data or no data at all.

ⅴ. Load Page Info one by one

Some websites need to be scrolled to a certain position before the content is displayed; otherwise the data cannot be scraped. Select this function in that case. Note, however, that when this function is checked, the scraping speed is affected.

ⅵ. Switch Browser Regularly

Switching browsers regularly can work around the blocking measures of some websites and achieve an anti-blocking effect.

You can set the interval for switching browser versions, from 30 seconds to 10 minutes. The software automatically switches among various browser versions at that interval.

ⅶ. Clear Cookies Regularly

Clearing cookies regularly can work around the blocking measures of some websites and achieve an anti-blocking effect.

You can set the interval for clearing cookies, from 30 seconds to 10 minutes. The software automatically clears cookies at that interval.

How to set scraping range
https://www.scrapestorm.com/tutorial/how-to-set-scraping-range/
Mon, 02 Dec 2019

If you need to set the scraping range, you can click the “Set scraping range” button on the page.

1. Set start page and end page

2. Set skip data

3. Set task stop triggers

You can set the conditions for stopping the task; these are consistent with the data filtering conditions. You can refer to How to filter data.

How to switch proxy while editing a task
https://www.scrapestorm.com/tutorial/how-to-switch-proxy-while-editing-a-task/
Mon, 02 Dec 2019

When scraping data, you sometimes encounter more serious anti-scraping measures. Generally, you will be unable to open the webpage, or you will encounter a CAPTCHA. If you encounter a CAPTCHA, you can directly use the Solve Captcha function. If there is no prompt, you can try switching the proxy.

Step 1: Open proxy

 

Step 2: Enter your Bright Data Zone username and password

For more details about Bright Data Zone, you can refer to this tutorial:

How to Set up Bright Data Zone

 

Step 3: Click the switch button to turn it on

 

Step 4: If the switch is unsuccessful, click this button to refresh

Introduction to task running interface
https://www.scrapestorm.com/tutorial/introduction-to-task-running-interface/
Mon, 02 Dec 2019

After the task is set up, click the “Start” button to jump to the running interface. On this interface you can view the data, enable automatic export, and speed up, pause, or stop the task. To modify the task, you need to stop it first and open the task editing interface.

 

1. Show Page

Click the “Show Page” button on the page, and a webpage interface pops up; this is the actual state of the webpage.

Click the save button in the upper right corner to save this webpage. This is generally used to save the current webpage in order to examine problems encountered.

2. Logs

Used to view the current running log.

3. Data

Used to view the data currently being scraped.

The data here can only be viewed; it cannot be copied or modified. The page buttons here are also inactive.

4. Pause

Used to pause the current task.

5. Stop

Used to stop the current task.

6. Schedule

If schedule is not set in the running settings, it cannot be set while the task is running.

7. IP Rotation

If IP Rotation is not set in the running settings, it cannot be set while the task is running.

8. Data Deduplication

If Data Deduplication is not set in the running settings, it cannot be set while the task is running.

9. Automatic Export

If Automatic Export is not set in the running settings, you can set a condition for automatic export here.

10. Details of data scraping

11. Speed Up

If the speed boost engine is not set in the running settings, you can click Speed Up to set it. The speed boost engine scrapes roughly 3–10 times faster than the normal scraping speed.

12. Edit task

If, during scraping, you find that the task settings are wrong and need to be reset, stop the task first, then click the button in the upper right corner to open the task editing interface and edit again.

How to set pre-executed operation
https://www.scrapestorm.com/tutorial/how-to-set-pre-executed-operation/
Fri, 29 Nov 2019

Smart Mode does not support clicking elements on the page. If you need to click, use a Pre-executed Operation.

The specific steps are as follows:

1. Click the Pre-executed Operation button

 

2. Drag components

The Pre-executed Operation window is effectively a simplified version of Flowchart Mode; in this window you can operate as you would in a flowchart.

For more details, please refer to the following tutorials:

Flowchart Mode Tutorial
