Web Scraping with Concurrency in Golang

Bismo Baruno
The Zeals Tech Blog
5 min read · Mar 21, 2022

Photo by Unsplash

Introduction

Hi everyone, my name is Bismo, though people sometimes call me “Momo.” I work as a Backend Engineer at Zeals, where I mostly take care of the microservices for our RPA project. In this article, I want to share something related to that project: how to do web scraping with concurrency in Golang!

You may be wondering, “Why do we need concurrency?” Sometimes a single request has to collect data from multiple pages. Normally the pages are processed like a queue: the scraper waits for one page to finish before moving on to the next, so the response time grows with every page. With concurrency, we can scrape multiple pages at the same time and avoid this performance issue.

To give some context, this article covers the following points: the data flow design, the use case, a code overview, the code implementation, and a benchmark.

Data Flow Design

As mentioned in the introduction, the common way of scraping multiple pages requires waiting for each page to finish before moving on to the next. The data flow below describes how this common scraping works.

Data Flow without Concurrency

With concurrency, however, the web scraper can scrape multiple pages at the same time within a single request. The data flow below shows how this differs from the common approach.

Data Flow with Concurrency

Use Case

There are many use cases for concurrency in scraping. I can’t share the real use case from our project for confidentiality reasons, so instead let’s get the historical exchange rates for a specific currency.

The website https://www.x-rates.com/ is ideal for this example because it doesn’t have an API. Also, each request for historical rates is limited to a single date.

That means getting a month of historical rates requires sending about 30 requests, and the loading time grows with the number of requests. In this example, we will get the historical exchange rate between two currencies within a date range.

Then, in the benchmark section, we will compare the speed with and without concurrency.

Historical Exchange Rates Website

Code Overview

The complete code is too large to show here. Please check this repository for the details; in this article we will only pick out the important parts.

Basically, to get the content of a page in Golang, we can simply use the net/http package and build the request with the NewRequestWithContext function.

To easily read HTML tags by selector, we can use a Go library called goquery. Details on installation and usage can be found in its repository.

Code Implementation

First, create a reusable function that gets the page content and parses it into an HTML document using the goquery package. To use this function, we just need to pass parameters such as the target URL, the method type, and others if needed (headers, form data, cookies, etc.).
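As a rough illustration of the fetching half of such a function, here is a minimal sketch that uses only the standard library. The names fetchPage and fetchFromLocalServer are hypothetical, and a throwaway local test server stands in for the real website so the example runs offline. It is not the code from the repository.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchPage (hypothetical name) sends a request with the given method and
// headers and returns the raw response body.
func fetchPage(ctx context.Context, method, target string, headers map[string]string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, method, target, nil)
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	for k, v := range headers {
		req.Header.Set(k, v)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("send request: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d for %s", resp.StatusCode, target)
	}
	return io.ReadAll(resp.Body)
}

// fetchFromLocalServer exercises fetchPage against a local test server that
// echoes one request header back inside a tiny HTML page.
func fetchFromLocalServer() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body>"+r.Header.Get("X-Demo")+"</body></html>")
	}))
	defer srv.Close()

	// The context bounds how long the whole request may take.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	body, err := fetchPage(ctx, http.MethodGet, srv.URL, map[string]string{"X-Demo": "hello"})
	return string(body), err
}

func main() {
	body, err := fetchFromLocalServer()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body)
}
```

In a goquery-based version, the response body would be handed to the parser (for example with goquery.NewDocumentFromReader) instead of being read into a byte slice.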

Next, we will parse the currency value. The original target URL will be https://www.x-rates.com/historical/?from=IDR&amount=1&date=2022-03-19. Our service will take parameters such as from, to, and date.
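To make the mapping from parameters to URL concrete, here is a small sketch that builds the target URL with net/url. The function name historicalURL is hypothetical. The to parameter does not appear in the URL shown above, so this sketch leaves it out.

```go
package main

import (
	"fmt"
	"net/url"
)

// historicalURL (hypothetical name) builds the page URL for one base
// currency and one date in YYYY-MM-DD form.
func historicalURL(from, date string) string {
	q := url.Values{}
	q.Set("from", from)
	q.Set("amount", "1")
	q.Set("date", date)

	u := url.URL{
		Scheme:   "https",
		Host:     "www.x-rates.com",
		Path:     "/historical/",
		RawQuery: q.Encode(), // Encode sorts the keys alphabetically
	}
	return u.String()
}

func main() {
	fmt.Println(historicalURL("IDR", "2022-03-19"))
}
```

Because Encode sorts the keys, the query parameters come out in a different order than in the URL above, which does not change what the query string means.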

The final scraping function gets the rate based on these parameters. Because it only scrapes a single date, we need another function to iterate over the date range parameters.
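One way to do this iteration is to expand the start and end dates into a list with one entry per day. The sketch below assumes dates in YYYY-MM-DD form and an inclusive range; the name dateRange is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const dateLayout = "2006-01-02" // Go's reference layout for YYYY-MM-DD

// dateRange (hypothetical name) returns every date from start to end,
// inclusive, formatted as YYYY-MM-DD.
func dateRange(start, end string) ([]string, error) {
	s, err := time.Parse(dateLayout, start)
	if err != nil {
		return nil, fmt.Errorf("parse start date: %w", err)
	}
	e, err := time.Parse(dateLayout, end)
	if err != nil {
		return nil, fmt.Errorf("parse end date: %w", err)
	}
	if e.Before(s) {
		return nil, fmt.Errorf("end date %s is before start date %s", end, start)
	}

	var dates []string
	for d := s; !d.After(e); d = d.AddDate(0, 0, 1) {
		dates = append(dates, d.Format(dateLayout))
	}
	return dates, nil
}

func main() {
	dates, err := dateRange("2022-03-01", "2022-03-10")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(dates), dates)
}
```

Each entry in the returned slice corresponds to one page that has to be scraped.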

After that, we need to implement the concurrency part. We iterate over the date range and use an error group, because we need to handle errors. The final code for this part is in the repository.
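The sketch below illustrates the same pattern with the standard library only. Note the swap: instead of an error group it uses sync.WaitGroup plus sync.Once to keep the first error, and unlike an error group created with a context it does not cancel the remaining work when one scrape fails. The names scrapeConcurrently and stubRate are hypothetical, and stubRate is a stand-in that returns the day of the month instead of scraping a real page.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// stubRate is a hypothetical stand-in for the real scraping function. It
// returns the day of the month as a fake rate so results are easy to check.
func stubRate(date string) (float64, error) {
	if len(date) != len("2006-01-02") {
		return 0, fmt.Errorf("invalid date %q", date)
	}
	day, err := strconv.Atoi(date[8:])
	if err != nil {
		return 0, err
	}
	return float64(day), nil
}

// scrapeConcurrently starts one goroutine per date, waits for all of them,
// and returns the rates in the same order as dates, or the first error.
func scrapeConcurrently(dates []string, fetch func(string) (float64, error)) ([]float64, error) {
	rates := make([]float64, len(dates))
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)

	for i, date := range dates {
		wg.Add(1)
		go func(i int, date string) {
			defer wg.Done()
			rate, err := fetch(date)
			if err != nil {
				// Keep only the first error, like an error group does.
				once.Do(func() { firstErr = fmt.Errorf("scrape %s: %w", date, err) })
				return
			}
			rates[i] = rate // each goroutine writes to its own index
		}(i, date)
	}

	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return rates, nil
}

func main() {
	rates, err := scrapeConcurrently([]string{"2022-03-01", "2022-03-02", "2022-03-03"}, stubRate)
	fmt.Println(rates, err)
}
```

Writing each result to its own slice index keeps the output in date order without needing a mutex.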

In the benchmark section, we will compare the performance with and without concurrency. The version without concurrency simply scrapes each date one after another.
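For comparison, a sequential version can be sketched as a plain loop. The names scrapeSequentially and slowStubRate are hypothetical, and slowStubRate sleeps for a short, made-up delay to imitate a page load.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// slowStubRate is a hypothetical stand-in for scraping one page. The sleep
// imitates network latency; the delay value is arbitrary.
func slowStubRate(date string) (float64, error) {
	time.Sleep(20 * time.Millisecond)
	if len(date) != len("2006-01-02") {
		return 0, fmt.Errorf("invalid date %q", date)
	}
	day, err := strconv.Atoi(date[8:])
	if err != nil {
		return 0, err
	}
	return float64(day), nil
}

// scrapeSequentially fetches one date at a time and stops at the first error.
func scrapeSequentially(dates []string, fetch func(string) (float64, error)) ([]float64, error) {
	rates := make([]float64, 0, len(dates))
	for _, date := range dates {
		rate, err := fetch(date)
		if err != nil {
			return nil, fmt.Errorf("scrape %s: %w", date, err)
		}
		rates = append(rates, rate)
	}
	return rates, nil
}

func main() {
	dates := []string{"2022-03-01", "2022-03-02", "2022-03-03", "2022-03-04", "2022-03-05"}

	start := time.Now()
	rates, err := scrapeSequentially(dates, slowStubRate)
	// Total time is roughly the number of dates times the per-page delay.
	fmt.Println(rates, err, time.Since(start))
}
```

Since every call waits for the previous one, the total time grows roughly linearly with the number of dates.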

That’s all the important code needed to explain the concept of web scraping with concurrency in Golang. The details can be found in the repository mentioned in the Code Overview section.

Benchmark

Finally, we come to the interesting part: benchmarking!

The scenario tests requests with different numbers of date queries (1, 2, 5, 10, 20, and 30), both with and without concurrency. To run the benchmark, we simply run the service and call the currency history endpoint. The number of queries equals the number of days between the start and end dates, inclusive.

For example, a request with 10 queries:

v1/currency/history?from=IDR&to=JPY&start_date=2022-03-01&end_date=2022-03-10

FYI, I used a machine with the following specifications for the test:

  • Mac mini (M1, 2020)
  • Chip Apple M1
  • Memory 16 GB
  • macOS Monterey Version 12.3
  • Internet speed 42.53 (Download) 15.34 (Upload) 29ms (Ping)
  • Internet region: Indonesia

And here is the benchmark result with a line chart:

Curve Result

As predicted, the response time without concurrency increases consistently with the number of requests. With concurrency, the response time also increases, but not significantly.

Note: The response time may also depend on the website’s condition, traffic, internet speed, region, etc.

Closing

Having worked on web scraping in my project at work, I can say the implementation in Golang is very simple and powerful. If we don’t need to manipulate the DOM and only need to get the data, and several pages need to be scraped at the same time, I recommend trying this data flow design!

Note: Some websites implement rate limiting. In that case, the number of concurrent requests should be kept below the rate limit.
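One common way to cap the number of simultaneous requests is a buffered channel used as a semaphore. The sketch below is an assumption about how that could look, not code from the repository. The names scrapeWithLimit and tracker are hypothetical, and tracker is a stub that records how many calls run at once.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tracker is a hypothetical stub that records the peak number of
// simultaneous fetch calls instead of scraping a real page.
type tracker struct {
	mu     sync.Mutex
	active int
	peak   int
}

func (t *tracker) fetch(date string) (float64, error) {
	t.mu.Lock()
	t.active++
	if t.active > t.peak {
		t.peak = t.active
	}
	t.mu.Unlock()

	time.Sleep(10 * time.Millisecond) // simulated page load

	t.mu.Lock()
	t.active--
	t.mu.Unlock()
	return 1, nil
}

// scrapeWithLimit runs at most limit fetches at the same time and returns
// the rates in the same order as dates, or the first error.
func scrapeWithLimit(dates []string, limit int, fetch func(string) (float64, error)) ([]float64, error) {
	if limit < 1 {
		return nil, fmt.Errorf("limit must be at least 1, got %d", limit)
	}
	rates := make([]float64, len(dates))
	sem := make(chan struct{}, limit) // one slot per allowed concurrent fetch
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)

	for i, date := range dates {
		sem <- struct{}{} // blocks while limit fetches are already running
		wg.Add(1)
		go func(i int, date string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when done
			rate, err := fetch(date)
			if err != nil {
				once.Do(func() { firstErr = fmt.Errorf("scrape %s: %w", date, err) })
				return
			}
			rates[i] = rate
		}(i, date)
	}

	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return rates, nil
}

func main() {
	dates := []string{"2022-03-01", "2022-03-02", "2022-03-03", "2022-03-04", "2022-03-05", "2022-03-06"}
	tr := &tracker{}
	rates, err := scrapeWithLimit(dates, 2, tr.fetch)
	fmt.Println(len(rates), err, "peak concurrency:", tr.peak)
}
```

This caps how many requests are in flight, not how many are sent per second, so a strict per-second limit would need additional pacing. If the error group from golang.org/x/sync/errgroup is used, its SetLimit method serves the same purpose.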

Thank you! I hope this article is useful for you!
