Flowchart Mode – ScrapeStorm https://www.scrapestorm.com AI-Driven Web Scraping Fri, 21 Apr 2023 01:23:35 +0000

How to collect web data in reverse order https://www.scrapestorm.com/tutorial/how-to-collect-web-data-in-reverse-order/ Thu, 20 Apr 2023 07:54:10 +0000 https://www.scrapestorm.com/?type=post&pid=9751 When collecting data, it is often necessary to collect in reverse order (collecting data from the last page to the first). This article will show you how to use ScrapeStorm’s flowchart mode to collect web page data in reverse order.

Case 1: The link changes when the list page is turned, and a link to the last page exists.

Processing method 1: Use the last page link of the list page as the collection link.

When you can get the link to the last page of the list directly, copy that link and use it to create the collection task.

1. Navigate to the last page in the browser and copy its link.

2. Create a flowchart mode task.

3. After flowchart mode detects the list, the software will ask whether to detect the next-page button. Following the operation tips, manually click the “Previous” button so the task pages backward.

4. Start the task to begin collecting in reverse order.

Processing method 2: Set reverse page numbers in batches

When the website’s link changes as pages turn but there is no “Previous Page” button to page backward, you can collect in reverse order by setting page numbers.

1. Copy the link to the second page.

Generally speaking, the link of the first page may differ from the links of the second and third pages, so no link pattern can be derived from the first page alone. It is therefore recommended to copy the link of the second page to create the task.


2. Use the function of generating URLs in batches to generate links.

As shown in the figure below, set “Start” to the last page number, “End” to the first page number, and “Step” to “decrease”.

For more details, please refer to the tutorial: How to use URLs Generator
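Conceptually, the reverse generation above is just a page-number substitution run backwards. A minimal sketch in Python, assuming a hypothetical `?page=` URL pattern (real sites vary):

```python
# Sketch of reverse URL generation: substitute a decreasing page number
# into the page-2 URL pattern. The pattern below is a made-up example.

def generate_reverse_urls(pattern: str, last_page: int, first_page: int = 1) -> list[str]:
    """Build page URLs from the last page down to the first (step -1)."""
    return [pattern.format(page=n) for n in range(last_page, first_page - 1, -1)]

urls = generate_reverse_urls("https://example.com/list?page={page}", last_page=5)
print(urls[0])   # https://example.com/list?page=5
print(urls[-1])  # https://example.com/list?page=1
```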

3. When URLs have been generated in batches, there is no need to set a page-turning button; select “No, extract only the current webpage” in the operation tips. If the page needs to be scrolled to display more data, it is recommended to set “Scroll to Load”.

4. Start the task to begin collecting in reverse order.

Case 2: The link does not change when the list page is turned, and there is no link to the last page

Processing method 1: There is a button to jump to the last page on the web page.

When the website’s link does not change as pages turn and the last-page link cannot be obtained directly, we can click the last-page button to jump to the last page and then collect in reverse order.

1. Create a flowchart mode task.

2. Add a “Click” component to turn the page to the last page.

3. After the list is detected, the software will ask whether to detect the next-page button. Following the operation tips, manually click the “Previous” button so the task pages backward.

4. Start the task to begin collecting in reverse order.

Processing method 2: There is a page number input box on the web page

When the website’s link does not change as pages turn and the last-page link cannot be obtained directly, we can jump to the last page by entering its page number and then collect in reverse order.

1. Create a flowchart mode task.

2. Add a “Input” component and a “Click” component to turn the page to the last page.

3. After the list is detected, the software will ask whether to detect the next-page button. Following the operation tips, manually click the “Previous” button so the task pages backward.

4. Start the task to begin collecting in reverse order.

How to save the current page as a file when scraping https://www.scrapestorm.com/tutorial/how-to-save-the-current-page-as-a-file-when-scraping/ Thu, 09 Mar 2023 10:51:51 +0000 https://www.scrapestorm.com/?type=post&pid=9405 When scraping data, if scraping stops before all the data has been collected, you can click the “Show Page” button to check whether the page opened abnormally during the scraping process. This article will explain how to save the current page as a file when scraping.

Step 1: Click the “Show Page” button

After starting the task, the software will automatically open the “Running Interface”, click the “Show Page” button on this interface to see the page currently being scraped.

You can confirm the opening status of the current page of the task through the “Show Page” interface, including whether the Pre-execute Operation is running normally, whether the page is turned normally, whether there is an advertisement pop-up window, whether a CAPTCHA is encountered, etc.

Step 2: Download the current webpage

In the upper right corner of the opened current page, there is a “Save the current webpage to the file system” button; click it to download the current page to your computer.

How to scrape data by inputting combined text list https://www.scrapestorm.com/tutorial/how-to-scrape-data-by-input-combined-text-list/ Wed, 21 Dec 2022 08:32:49 +0000 https://www.scrapestorm.com/?type=post&pid=9168
This article will introduce to you how to scrape data by inputting combined text list in ScrapeStorm’s flowchart mode.

Note: The use of this function requires an enterprise plan. For more details, please refer to the introduction on the official website price page.

1. Create a new scraping task

1) Copy the URL

For more details, please refer to the following tutorials:

How to enter a correct URL

2) Create a new flowchart mode task

You can create a new task on the software, or directly import an already created task.

For more details, please refer to the following tutorials:

How to import and export scraping tasks

2. Configure the scraping task

1) Set the combined text list task

After entering the URL in flowchart mode to create a new task, click the input box, then enter text in the operation tips box that appears in the upper left corner.

For an introduction to input text components, please refer to the following tutorials:

Introduction to flowchart components.

Since we need to enter text in multiple input boxes, we choose to click the “Input text list” on the operation tips.

Then select “Combine text list”.

Then click the second input box. After selecting all the input boxes, click “Confirm”.

Fill in the combined text in the pop-up interface, using a comma “,” to separate the values in each row.

After clicking the “OK” button, the software will automatically generate components.
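Conceptually, the combined text list maps each comma-separated row onto the selected input boxes in order. A rough sketch of that mapping, with made-up usernames and passwords:

```python
# Sketch: each comma-separated row of the combined text list supplies one
# value per input box, in the order the boxes were selected.

def parse_combined_text(rows: list[str]) -> list[list[str]]:
    """Split each row on commas into per-input-box values."""
    return [[value.strip() for value in row.split(",")] for row in rows]

rows = ["alice,secret1", "bob,secret2"]       # two sign-in attempts
for username, password in parse_combined_text(rows):
    print(username, password)                 # one loop iteration per row
```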

Then we click the “Sign in” button on the page, and select “Click the element” in the operation tips.

2) Extract fields

After the combined text list loop is set, we set the fields to be extracted. Click the field on the page, and select “Extract all elements in the list” in the operation prompt box in the upper left corner. Then the software will automatically detect the paging, and you can set the paging according to the “operation tips”.

Then we can set the fields according to our needs on this basis.

For more details, please refer to the following tutorials:

How to set the fields

3. Set up and start the scraping task

1) Run settings

According to your needs, you can set Schedule, IP Rotation & Delay, Automatic Export, Download Files, Speed Boost, Data Deduplication, Email Alerts and Developer options.

What is Schedule function

How to set up Automatic Export

How to Download Files

2) Running the task

After the task is started, the data is scraped automatically. You can see the running process and the scraping results directly on the interface, and a reminder will appear when scraping is complete.

4. Export and view data

ScrapeStorm provides various export formats such as EXCEL, CSV, HTML, and TXT, and also supports exporting a specific number of items: select the number of records you want to export, then click “Export”.

How to scrape links of detail pages https://www.scrapestorm.com/tutorial/how-to-scrape-links-of-detail-pages/ Mon, 25 Apr 2022 08:46:02 +0000 https://www.scrapestorm.com/?type=post&pid=8621 When scraping data, it is often necessary to scrape the links of detail pages. This article will explain how to use ScrapeStorm’s smart mode to scrape links to detail pages in three ways; the flowchart mode works the same way.

First way: Automatic detect

The smart mode will automatically detect the list. Generally, when the list is detected, the link of the detail page will also be detected.

Note: If the automatic detection is inaccurate, you can also select the list manually.

For more details, you can refer to the tutorial:

How to scrape a list page

 

Second way: Through “Scrape In”

During list detection, it sometimes happens that the link to the detail page cannot be detected. In that case, we can use “Scrape In” to enter the detail page and scrape its link.

1) After the list is detected, add a field for the data that carries the detail-page link. The software automatically generates the field.

Note: The data with the link is generally the title of the article, or the name of the product, etc. You can confirm it by operating it on the browser.

2) Right-click the generated field, set “Extract Type“, and select “Link URL“.

3) Click “Scrape In” to enter the detail page.

For more details, you can refer to the tutorial:

How to Scrape In

4) After entering the detail page, add a field arbitrarily, then right-click the generated field, set “Special Value“, and select “Page URL“.

 

Third way: Splicing to get the link

If none of the above methods can successfully scrape the link of the detail page, but the ID of the detail page can be extracted with XPath or regular expressions, you can use “Modify Data” to splice together the link of the detail page.

Note: If you do not know XPath or regular expressions, please contact us for customization.

Right-click the field, set “Modify Data“, and create a new “Add Prefix“.

This way, we can get the link of the detail page.
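As a rough illustration of this splicing step, assuming the detail-page ID can be pulled out with a regular expression and the site uses a hypothetical `/item/<id>` URL scheme:

```python
import re

# Sketch of the "Add Prefix" step: extract the detail-page ID from the
# scraped text, then prepend the site's detail-page prefix. Both the
# regex and the prefix are assumptions for illustration.

def splice_detail_link(raw: str, prefix: str = "https://example.com/item/"):
    match = re.search(r"id=(\d+)", raw)
    return prefix + match.group(1) if match else None

print(splice_detail_link("onclick=open('id=12345')"))  # https://example.com/item/12345
```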

How to turn pages by entering page numbers in batches https://www.scrapestorm.com/tutorial/how-to-turn-pages-by-entering-page-numbers-in-batches/ Thu, 21 Apr 2022 11:55:28 +0000 https://www.scrapestorm.com/?type=post&pid=8559 In the setting of a scraping task, it often happens that the webpage has no page-turning button, or that the website cannot turn pages by clicking a next-page button. At this time, we can use ScrapeStorm’s flowchart mode to turn pages by entering page numbers in batches.

1. Create a new task

1) Copy the URL

Note: Please copy the URL of the search result page, not the home page.

How to enter a correct URL

 

2) Create a new flowchart mode task

You can enter the URL from the software home page or import the task you have.

How to import and export scraping task

 

2. Configure the task

1) Set the component to batch input page numbers

Click the page number input box and enter the page number in the operation tips.

Note: Since multiple page numbers need to be entered, please click “Input text list” in the operation tips.

Then select “Single text list“.

Then enter the page number to be set in the popup. For example “1”, “2” and “3”.

Click “OK” and the software will automatically generate a loop component that populates the page number.

Then click the “Go” button on the page, select “Click the element” in the operation tips, jump to the corresponding web page and generate the click component.
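The loop the software generates behaves roughly like the sketch below; `fill_input`, `click` and `extract_list` are stand-ins for the Input, Click and Extract components, not a real API:

```python
# Sketch of the generated loop: for each page number in the single text
# list, fill the input box, click "Go", then extract the list data.

def run_page_loop(pages, fill_input, click, extract_list):
    results = []
    for page in pages:
        fill_input(page)                    # Input component
        click("Go")                         # Click component
        results.extend(extract_list())      # Extract component
    return results

# Demo with stand-in components that just record what happened:
visited = []
rows = run_page_loop(
    ["1", "2", "3"],
    fill_input=visited.append,              # pretend to type the page number
    click=lambda button: None,              # pretend to click "Go"
    extract_list=lambda: [f"page {visited[-1]} data"],
)
print(rows)  # ['page 1 data', 'page 2 data', 'page 3 data']
```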

 

2) Extract the list data

After the page number loop is set, we set to extract the list data, click the field on the web page, and select “Extract all elements in the list” in the operation tips.

The software will automatically detect the paging. This task does not need to set automatic paging, so select “No, extract only the current webpage”.

The scraping field can then be set on this basis.

For more details, please refer to the following tutorial:

How to set the fields

 

3. Set and start the task

1) Start the task

Click the “Start” button to make some settings in the Run Settings, including functions such as “Schedule, IP Rotation & Delay, Automatic Export, Download Files”. The above functions are not used in this operation, just click the “Start” button to start scraping.

Please refer to the tutorial below for how to set each function.

What is Schedule function

How to set up Automatic Export

How to Download Files

Note: Schedule, Automatic Export and Download Images are available for users of Premium plan and above, and Download Other Files is available for users of Business plan and above.

 

2) Run the task to extract data

After the task is started, the data will be automatically scraped. We can intuitively see the running process and scraping results from the interface, and there will be a reminder after the scraping stops.

 

4. Export and view data

After the data scraping is completed, we can export and view the data. ScrapeStorm supports a variety of export methods and export file formats (EXCEL, CSV, HTML, and TXT), and also supports exporting a specific number of records: select the data you want to export, then click “Export”.

 

How to set Email Alerts https://www.scrapestorm.com/tutorial/how-to-set-email-alerts/ Thu, 17 Mar 2022 10:59:42 +0000 https://www.scrapestorm.com/?type=post&pid=8278 ScrapeStorm currently supports the “Email Alerts” function. After enabling Email Alerts, when a user’s task encounters a CAPTCHA or needs to log in to a website, ScrapeStorm will remind the user by email.

Note: Email Alerts is only available for the Business plan and above.

The specific steps are as follows:

1. Configure the mail push service

The prerequisite for using the “Email Alerts” function is configuring the parameters of the email push service. This means you need a mailbox to act as a relay for reminder emails; all reminder emails will be pushed through this mailbox. The following uses Gmail as an example.

First, select “Show all settings” in Gmail’s settings.

Select “Forwarding and POP/IMAP” in Gmail’s Settings and enable IMAP, then save changes.

 

Then, open the settings in ScrapeStorm, you can see the relevant settings of “Email Alerts“.

Finally, enter four parameters in sequence in the Settings: SMTP server, Port, Email Account and Password.

 

Note:

1) About Gmail’s SMTP parameters, you can refer to this article:

Check Gmail through other email platforms

2) If your Google account uses two-step verification, logging in with your regular password will fail, because this SMTP login does not support two-step verification. Enter an app password instead. Please refer to the tutorial below for how to generate an app password.

How to Generate an App Password

 

2. Set up “Email Alerts” 

After the parameters of the push mail service are set, you can set the “Email Alerts” in the run settings of the task, and set the recipient’s email address.

The task encounters the following two conditions during the scraping process, which will trigger an email alert:

1) Detect CAPTCHA During the Scraping

2) Detect Login Prompt During the Scraping

ScrapeStorm supports entering multiple email addresses, and each email address needs to be separated by a comma “,“.
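For reference, the kind of SMTP push these settings describe can be sketched with Python’s standard library; the addresses and credentials below are placeholders, and for Gmail the password would be an app password:

```python
import smtplib
from email.message import EmailMessage

# Sketch of an alert email built from the four settings above
# (SMTP server, Port, Email Account, Password). Placeholder values only.

def build_alert(subject: str, body: str, recipients: str) -> EmailMessage:
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "alerts@example.com"
    # Multiple recipients are comma-separated, as in the setting above.
    msg["To"] = [addr.strip() for addr in recipients.split(",")]
    msg.set_content(body)
    return msg

msg = build_alert("CAPTCHA detected", "The task hit a CAPTCHA.",
                  "a@example.com, b@example.com")
# To actually send it (not run here):
# with smtplib.SMTP_SSL("smtp.gmail.com", 465) as smtp:
#     smtp.login("alerts@example.com", "app-password")
#     smtp.send_message(msg)
```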

 

3. Check the reminder email

After the “Email Alerts” function is set, click “Start” to run the task. When a CAPTCHA is encountered or a login is required, ScrapeStorm will automatically send an alert email to the configured mailbox, so please check your inbox promptly.
The content of the reminder email is as follows:

 

 

How to Set up Bright Data Zone https://www.scrapestorm.com/tutorial/how-to-set-up-bright-data-zone/ Fri, 11 Feb 2022 01:20:24 +0000 https://www.scrapestorm.com/?type=post&pid=8385 1. Sign up

Open the official website and click to sign up.

Enter the information to create a new account.

After setting your password, the account creation is complete.

2. Activate account

After the account is created successfully, Bright Data will send a verification link to your email.

Please click on the link in your email to verify your account.

After clicking activate account in the email, it will jump to Bright Data.

Add funds on the Billing interface; you have to pay to activate IPs. Bright Data supports multiple payment methods. After the payment succeeds, you can activate the proxy.

Then select “Proxies”. There are four zones; it is generally recommended to select “data_center”.

Click the “Inactive” button to enable the zone; it will change to “Active”, indicating that the zone is enabled.

Click “data_center” to enter the settings page.

Click “Access parameters” to switch the tab.

You can see information such as username and password on this page.

3. Set up IP Rotation

Open ScrapeStorm and set IP Rotation in the Run Settings. Copy the information from the Bright Data official website and fill it in using the format username:password.

e.g., lum-customer-data_center:123456

You can also directly copy this information from the website.

For more details about IP Rotation, you can refer to this tutorial: How to set IP Rotation & Delay
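Outside ScrapeStorm, the same `username:password` pair plugs into a standard proxy URL; the host and port below are placeholders, so check your zone’s access parameters for the real ones:

```python
# Sketch of building a proxy URL from Bright Data-style credentials.
# Host and port are made-up placeholders.

def build_proxy_url(username: str, password: str,
                    host: str = "proxy.example.com", port: int = 22225) -> str:
    return f"http://{username}:{password}@{host}:{port}"

proxy = build_proxy_url("lum-customer-data_center", "123456")
print(proxy)  # http://lum-customer-data_center:123456@proxy.example.com:22225
# e.g. requests.get(url, proxies={"http": proxy, "https": proxy})
```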

How to scrape a list page & detail page https://www.scrapestorm.com/tutorial/how-to-scrape-a-list-page-detail-page/ Fri, 16 Jul 2021 06:44:38 +0000 https://www.scrapestorm.com/?type=post&pid=8229 This tutorial will show you how to use the flowchart mode to scrape the data of the list page + detail page.

Step 1: Scrape the content of the list page

For more details, please refer to the following tutorial:

 

Step 2: Scrape In

On the basis of the first step, if you need to scrape the data of the details page, you can scrape into the detail page.

For more details, please refer to the following tutorial:

How to Scrape In

 

Step 3: Set the fields of the detail page
For more details, please refer to the following tutorial:

How to scrape a detail page

 

Step 4: Start the task

After the fields are set, you can start the task.

For more details, please refer to the following tutorial:

How to configure the scraping task

The fields set on the detail page will be automatically added after the fields on the list page.

How to scrape data by entering keywords in batches https://www.scrapestorm.com/tutorial/how-to-scrape-data-by-entering-keywords-in-batches/ Wed, 26 May 2021 08:56:25 +0000 https://www.scrapestorm.com/?type=post&pid=7974 Step 1: Create a new task

1. Copy the URL of the webpage (the URL of the search result page, not the URL of the homepage).

Click here to learn about how to enter a correct URL.

2. Create a new flowchart mode task

You can create new tasks directly on the software, or import tasks.

Click here to learn how to import and export scraping tasks.

 

Step 2: Configure scraping rules

1. Set a loop of multiple keywords

After entering the URL to create a new task in the flowchart mode, click the search box, and then enter the text in the operation tips.

Click here to learn more about the input text component.

Since we need to enter data for multiple keywords, we choose to click “Input text list”.

Then select “Single text list”.

Then enter the text that needs to be set in the pop-up text list.

After clicking the “OK” button, the software will automatically generate a keyword loop.

Then click the search button, and select “Click the element” in the operation tips to jump to the search result page.

2. Set the scraping field

Set the fields to be extracted: click a field on the webpage and select “Extract all elements in the list” in the operation tips. The software will then automatically detect the paging; follow the prompts to set the paging.

Then set the field on this basis, you can set it according to your own needs.

For more details, please refer to the following tutorial:

How to set the fields

3. Scrape In

If you need the data of the detail page, you can use “Scrape In”.

For more details, please refer to the following tutorial:

How to Scrape In

4. Set the fields of the details page
Click the data that needs to be scraped on the page, then click “Extract the element” in the operation tips; the data settings can follow the settings of the list page.

For more details, please refer to the following tutorial:

How to scrape a detail page

5. Complete component process

Step 3: Set up and start the scraping task

1. Start the scraping task

Click the “Start” button to perform some advanced settings in the running settings, including the functions of “Schedule, IP Rotation & Delay, Automatic Export, Download files, Speed Boost, Data Deduplication, and Developer”.

Click here to learn more about Schedule.

Click here to learn more about Automatic Export.

Click here to learn more about Download Images.

2. Run tasks to scrape data

After the task is started, the data will be automatically scraped. From the interface, you can intuitively see the program running process and the scraping result, and there will be a reminder after the scraping is completed.

 

Step 4: Export and view the data

After data scraping is completed, you can view and export the data. ScrapeStorm supports multiple export methods (manual export to local, manual export to database, automatic export to database, automatic export to website) and export file formats (EXCEL, CSV, TXT, and HTML).
It also supports exporting a specific number of items: select the number of items you want to export, then click “Export”.

How to set IP Rotation & Delay https://www.scrapestorm.com/tutorial/how-to-set-ip-rotation-delay/ Mon, 02 Dec 2019 11:56:21 +0000 https://www.scrapestorm.com/?type=post&pid=7029 This setting contains IP Rotation, automatic settings and manual settings. These functions are mainly used to avoid the various blocking problems websites may present.

In the editing task interface, click the “Start” button to open the running settings, and click “IP Rotation & Delay”.

 

1. IP Rotation

(1) IP Type

ⅰ. Bright Data Zone

Bright Data Zone is an external proxy, you need to buy ip on the official website.

How to Set up Bright Data Zone

ⅱ. Custom IP

If you need to use your own IP, please click “Set Up” and set it as required. (Note: the custom IP will be switched in sequence)

(2) Switch IP when the following condition is met:

ⅰ. Interval

IP switches based on time. For example, if you set the switching condition to “Interval: 3 minutes”, the IP will be switched every 3 minutes, and one IP will be consumed each time.

ⅱ. Text appears on the page

Switch according to the text. For example, if you set the switching condition to “Text appears on the page: error”, then when the corresponding text appears on the web page, the IP will be switched once and one IP will be consumed at the same time.

 

2. Automatic

For general scraping tasks, just follow the default automatic settings.

 

3. Manual

If the webpage is unusual and the automatic settings do not work, you can configure the manual settings.

ⅰ. Delay(s)

Some webpages open slowly and sometimes affect the scraping effect. You can set a delay to effectively improve the quality of scraping. The default delay of the system is 1 second, which can be modified according to your needs.

ⅱ. Detect CAPTCHA During the Scraping

Software generally detects captcha automatically. In special cases, you can manually set to detect captcha when encountering specific text.

ⅲ. Detect Login Prompt During the Scraping

Websites that require login may log you out while the task is running, making scraping impossible, or may prompt you to log in after a certain amount of data has been scraped. Select this function to be prompted to log in when you are logged out or a login is required.

ⅳ. Only Extract Visible Elements

Some websites mix invalid data with valid data. When scraping data, many invalid characters appear, and these invalid characters are hidden. In this case, we can choose this setting and only scrape visible elements.

Note: If the website does not hide invalid characters, selecting this option may cause incomplete data or prevent data from being scraped at all.

ⅴ. Load Page Info one by one

Some websites only display content after scrolling to a certain position; otherwise the data cannot be scraped. This function can be selected in that case. Note, however, that checking this function will slow down the scraping.

ⅵ. Switch Browser Regularly

Switching browsers regularly can solve the blocking problem of some websites and achieve an anti-blocking effect.

You can set the interval for switching browser versions, from 30 seconds to 10 minutes. The software will automatically switch between browser versions at that interval.
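Interval-based browser switching can be pictured as cycling through a pool of browser identities, moving to the next one once the interval has elapsed. A hypothetical sketch (the browser version strings are illustrative only):

```python
import itertools
import time

# Sketch: cycle through browser identities, switching once the interval
# has elapsed. An injectable clock makes the behavior easy to test.

class BrowserRotator:
    def __init__(self, agents, interval_s=30, clock=time.monotonic):
        self._cycle = itertools.cycle(agents)
        self._interval = interval_s
        self._clock = clock
        self._current = next(self._cycle)
        self._last_switch = clock()

    def current_agent(self):
        if self._clock() - self._last_switch >= self._interval:
            self._current = next(self._cycle)     # rotate to the next version
            self._last_switch = self._clock()
        return self._current

rotator = BrowserRotator(["Chrome/110", "Firefox/110", "Safari/16"], interval_s=30)
print(rotator.current_agent())  # Chrome/110 until 30 s have passed
```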

ⅶ. Clear Cookies Regularly

Clearing cookies regularly can solve the blocking problem of some websites and achieve an anti-blocking effect.

You can set the interval for clearing cookies, from 30 seconds to 10 minutes. The software will automatically clear cookies at that interval.
