Web Scraping – Octoparse

The text that follows is owned by the site referenced below.

This is only a small part of the article; for more, please follow the link.

SOURCE: http://scraping.pro/octoparse-review/

Octoparse is a modern visual web data extraction tool. It gives users a point-and-click UI for building extraction patterns, which scrapers then apply to structured websites. Both experienced and inexperienced users find it easy to bulk-extract information from websites with Octoparse; most scraping tasks need no coding.

Overview

Octoparse is a Windows application designed to harvest data from both static and dynamic websites (including those whose pages use Ajax). The software simulates human operation to interact with web pages. To make data extraction easier, it can fill out forms, enter a search term into a text box, and so on. You can run your extraction project either on your own machine (Local Extraction) or in the cloud (Cloud Extraction). Octoparse’s cloud service, available only in the paid editions, works well for large-scale extraction. Export formats include CSV, Excel, HTML, TXT, and databases (MySQL, SQL Server, and Oracle).

Plans and pricing

Octoparse’s free and paid editions share the same core set of features. Paid editions additionally let users extract data around the clock using Octoparse’s cloud service. The Standard Edition subscription costs $89/month and is limited to 4 simultaneous threads, while the Professional Edition subscription costs $189/month with 10 simultaneous threads.

Workflow

Octoparse provides a visual operation pane that is user friendly and straightforward, though sometimes laggy. It simulates human browsing behavior such as opening a web page, logging into an account, entering text, and pointing at and clicking a web element. Just click the information you want in the built-in browser and run the extraction to get the structured data you need.

Octoparse offers two modes: Wizard mode and Advanced mode. The latter is well suited to extracting from complex sites. It takes less than half an hour to get started with Octoparse. After you configure some steps, you can drag and drop the blocks inside the workflow designer to reconfigure your project.

The designer pop-up window is a suggestion tool that makes project building easy.

The official site has plenty of video demonstrations and a clear manual. I’ve watched one of the videos and liked it. The tutorials teach you how to apply the scraping features and are useful for both beginners and advanced users.

Cloud service for scraping

Scraping the web at scale with distributed computing is the modern trend, and Octoparse supports it too. To use it, you first have to switch from the free edition to one of the paid editions. After you upload your project configuration to the cloud, the extraction runs concurrently on Octoparse’s cloud servers. If you need to scrape thousands of web pages within a short time, the Octoparse cloud service is ideal. The Standard Edition limits you to 4 concurrent threads (10 in the Professional Edition). Extraction scheduling is also offered.

I’ve tried cloud extraction. The speed of simple link extraction impressed me: over 3000 links in 1.5 minutes (more than 33 links per second).

Advanced scraping

For advanced scraping, the software provides a rich set of tools.

These tools include:

  • Regex generation
  • XPath editing
  • Execution timeouts setting
  • Scrolling down
  • Page anchor hook
  • etc.

RegEx Tool

To improve the user experience, Octoparse provides a built-in regex generator. Refining scraped fields may require regexes, and this tool serves for both generating and verifying them. It has a rather distinctive, all-in-one interface.

To start a project in Advanced mode, create a new task and choose that mode; the advanced features then become available.

API

The Octoparse API makes it easy to connect your system to your scraped data in real time. You can either import the Octoparse data into your own database or use the API to request your account’s data. Just configure the rule for your task, run it in the cloud, and the Octoparse cloud servers will do the rest. API responses are returned as XML.

The Octoparse API lets you extract data by time window, from one datetime to another, with a maximum interval of 1 hour. That is not very convenient. Insert the datetime markers into the link parameters as follows:

http://dataapi.octoparse.com/SkieerDataAPI/GetData?key=fae1dcdd-d740-42c9-952c-0b3e694e915c&from=07-12-2016%2008:00:00&to=07-12-2016%2009:00:00

In this link, the from and to parameters carry the time window (8 am to 9 am in this case); the %20 between date and time is the URL encoding of a space. The cloud task extraction pane shows only the end time and the average time, so it’s a hassle to calculate the start time manually: {start time} = {complete time} – {average used time}.
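Here is a minimal Python sketch of both chores: splitting a longer period into windows of at most one hour, building a link for each window, and working out the start time from the pane's values. The endpoint and parameter names are copied from the link above. Everything else is my assumption: the helper names are made up, the key is a placeholder, and I read "07-12-2016" as month-day-year, which the example alone does not confirm.

```python
from datetime import datetime, timedelta
from urllib.parse import quote

# Endpoint copied from the example link; nothing here calls the API.
BASE_URL = "http://dataapi.octoparse.com/SkieerDataAPI/GetData"
# Assumption: month-day-year. Use "%d-%m-%Y %H:%M:%S" if it is day-month-year.
STAMP_FORMAT = "%m-%d-%Y %H:%M:%S"
MAX_WINDOW = timedelta(hours=1)  # maximum interval mentioned in the review


def build_url(key, start, end):
    """Return a GetData link for one window, with the space encoded as %20."""
    if end - start > MAX_WINDOW:
        raise ValueError("window exceeds the 1-hour maximum")
    frm = quote(start.strftime(STAMP_FORMAT), safe=":")
    to = quote(end.strftime(STAMP_FORMAT), safe=":")
    return f"{BASE_URL}?key={key}&from={frm}&to={to}"


def hourly_windows(start, end):
    """Split the period from start to end into windows of at most one hour."""
    windows = []
    cursor = start
    while cursor < end:
        stop = min(cursor + MAX_WINDOW, end)
        windows.append((cursor, stop))
        cursor = stop
    return windows


def start_time(complete_time, average_used_time):
    """{start time} = {complete time} - {average used time}."""
    return complete_time - average_used_time


if __name__ == "__main__":
    period_start = datetime(2016, 7, 12, 8, 0, 0)
    period_end = datetime(2016, 7, 12, 10, 30, 0)
    for window_start, window_end in hourly_windows(period_start, period_end):
        print(build_url("YOUR-API-KEY", window_start, window_end))
```

Each printed link covers one window, so a longer period simply becomes a list of requests to fetch one after another.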

Proxying

Does it drive you crazy when your IP address gets banned and you can no longer access a website because you scrape it frequently? It happens often, especially when you extract data from business directories, which apply strict bans based on recurring IPs. Octoparse lets you scrape these websites by rotating anonymous HTTP proxy servers. In cloud extraction mode, Octoparse uses more than 500 third-party proxies for automatic IP rotation.

For local extraction, you have to add a list of external proxy addresses manually and configure them for automatic rotation. The original article links to instructions on including IP rotation in a scraping project.

IPs are rotated at a time interval that you set. In this way, you can extract data from the website without the risk of getting your IP address banned, provided you do not overload the site’s bandwidth.
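To illustrate the idea of interval-based rotation in general terms, here is a small Python sketch. It is not how Octoparse implements rotation internally (the review does not say); the class name and the proxy addresses are hypothetical.

```python
import itertools
import time


class TimedProxyRotator:
    """Cycle through a proxy list, switching once the interval has elapsed."""

    def __init__(self, proxies, interval_seconds, clock=time.monotonic):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._interval = interval_seconds
        self._clock = clock  # injectable so the rotation can be tested
        self._current = next(self._cycle)
        self._switched_at = clock()

    def current(self):
        """Return the proxy to use now, moving to the next one if it is due."""
        now = self._clock()
        if now - self._switched_at >= self._interval:
            self._current = next(self._cycle)
            self._switched_at = now
        return self._current


if __name__ == "__main__":
    # Hypothetical proxy addresses, rotated every 30 seconds.
    rotator = TimedProxyRotator(
        ["http://10.0.0.1:8080", "http://10.0.0.2:8080"], interval_seconds=30
    )
    print(rotator.current())
```

Before each request you would ask the rotator for the current proxy and pass it to your HTTP client, for example through `urllib.request.ProxyHandler({"http": rotator.current()})`.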

Support

Customer support is responsive and gives equal assistance to paying and free users. Support is available via phone, email, and Skype (with no limits for free users).

Some testing

Octoparse offers predefined scrape patterns.

Using the “List or Table scrape” pattern, I managed to extract block data within 10 minutes.

For a more precise scrape, you need to apply regex. In Advanced mode, I managed to extract the description separately from the title. In this example, the title and description sit in non-standard HTML, <tagA><tagB>title</tagB>description</tagA>, where only a regex can separate them.
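As a rough sketch of what such a regex could look like outside of Octoparse, here is a Python version. The tag names tagA and tagB are the placeholders from the example above, and the function name is my own; the exact expression used inside Octoparse’s regex tool may differ.

```python
import re

# tagA and tagB are the placeholder tag names from the example above.
PATTERN = re.compile(
    r"<tagA>\s*<tagB>(?P<title>.*?)</tagB>(?P<description>.*?)</tagA>",
    re.DOTALL,  # let the description span several lines
)


def split_title_description(html):
    """Return (title, description) from the first matching block, else None."""
    match = PATTERN.search(html)
    if match is None:
        return None
    return match.group("title").strip(), match.group("description").strip()


if __name__ == "__main__":
    print(split_title_description("<tagA><tagB>title</tagB>description</tagA>"))
```

The title is whatever sits inside the inner tag, and the description is the text left between the inner closing tag and the outer closing tag.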

In Case 2, the scraper was not able to scrape the data; see the test drive results for the block scrape.

Captcha

So far Octoparse does not handle captchas. Hopefully, this new and growing software will catch up.

Conclusion

Octoparse is a feature-rich visual scraping application. It offers a good point-and-click interface, though task handling sometimes lags. It’s easy to master in a short time with the help of the good tutorials. The software can handle modern dynamic sites (in Advanced mode). What impressed me most is Octoparse’s cloud service, which extracts data in the cloud in a short time, though it is not free. In my opinion, it’s worth a try if you are collecting a large amount of data.
