Case study: data scraping, or harnessing the potential of big data

How do you monitor 340 different websites for specific information quickly and efficiently? For a training company, we designed and implemented an efficient process for automatically gathering data from the web. Here’s our take on web scraping.
From this text you will learn:
- what automated web scraping is,
- how we helped a client streamline the crawling of several hundred web pages,
- why automated data retrieval is the future.
A training company aiming to consistently acquire clients needs to stay updated on the available funds from the National Training Fund. Such information is typically published on the websites of District Labor Offices (PUP) nationwide.
The problem is that, in order to obtain such data and inform clients about training funding opportunities at their relevant labor office, the training company must sift through hundreds of websites daily. Time and attention are crucial here, as the order of submitting applications matters. It’s a tedious, error-prone, and repetitive task – perfect conditions for a bot to work!
Web scraping – collecting data from websites
Web scraping is an automated process of collecting data from publicly accessible websites. The programs that do the work, called scrapers or bots, are developed and deployed by companies to automate data collection; they typically emulate user interactions or parse a page’s markup to extract data efficiently. A related technique, screen scraping, extracts data from the visual output of legacy systems or applications that lack modern APIs.
Web scraping is used in many areas, such as competitive analysis, market research, or price monitoring. It is generally considered legal when done in compliance with a website’s terms of service, but bad actors may misuse scraping to exploit users or website visitors, which is why many websites monitor incoming requests to detect and block automated scraping attempts.
There are many other creative applications for data scraping; whatever the use case, connecting to the target websites reliably and securely is an important step in the process.
Web scraping tools
Selecting the right data scraper is essential for effective web data extraction. The choice depends on the type of data, how often it needs to be collected, and the desired output format.
There are user-friendly platforms like Octoparse and ParseHub, as well as more advanced frameworks such as Scrapy. For simpler tasks, browser extensions like Data Scraper or point-and-click desktop tools like WebHarvy can be convenient. When choosing a tool, consider ease of use, scalability, and support for dynamic content or anti-bot protections.
Modern tools often include features like AI-powered extraction, real-time monitoring, and integrations with other systems. These are especially useful for competitor analysis, market research, or automating product data feeds.
Remember to ensure your scraping activities comply with website terms of service and respect data ownership rights. Using the right tools and best practices, businesses can efficiently turn web data into valuable competitive advantages.
Common web scraping techniques
There are many ways to perform web scraping, and the best method depends on several factors: how the website is built, how much data you need, whether the content is dynamic, and of course, the legal and ethical boundaries of the project.
One of the most common techniques is DOM parsing, which involves navigating the HTML structure of a webpage to locate and extract specific elements—like product prices, article titles, or descriptions. Scraping tools scan the site’s code, identify the relevant data, and extract it in a structured format. The more consistent and clean the website’s code is, the easier and more reliable the scraping process becomes.
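To make this concrete, here is a minimal sketch of DOM parsing in Java with the jsoup library. The URL and CSS selectors are hypothetical placeholders, not taken from the actual project; in practice they depend entirely on how the target page is structured.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomParsingExample {

    public static void main(String[] args) throws Exception {
        // Fetch the page and parse it into a DOM tree.
        Document doc = Jsoup.connect("https://example.com/announcements").get();

        // Select elements by CSS selector; these selectors are placeholders.
        for (Element item : doc.select("div.announcement")) {
            String title = item.select("h2.title").text();
            String deadline = item.select("span.deadline").text();
            System.out.println(title + " | deadline: " + deadline);
        }
    }
}
```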
For simpler tasks, even Google Sheets can be used. The IMPORTXML function, for instance, allows you to pull data from specific elements on a webpage based on an XPath query.
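For example, a formula such as `=IMPORTXML("https://example.com/announcements", "//h2")` would pull every second-level heading from that (hypothetical) page straight into the sheet.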
Once the data is collected, it’s important to process it—cleaning, organizing, and converting it into a format that’s ready for analysis or integration with other systems.
Extracting data from websites: a training company’s daily routine
At our client, a consulting firm, the person tasked with this data collection had to manually comb through 340 District Labor Office (PUP) web pages daily. Afterward, they filled out an Excel spreadsheet detailing calls for proposals, available budgets, and submission deadlines. Only then was it possible to send out individual client emails.
This method of data retrieval consumes resources and poses significant challenges in the event of vacations or emergencies: if nobody familiar with navigating PUP sites is available, the training company’s sales process can suffer. Compounding the issue is the disparate structure of PUP websites, which follow no standardized format; the required data is scattered across different tabs, often in a non-intuitive manner, which prolongs data retrieval. It’s akin to the challenges faced in cross-border payments, where a new standard aims to streamline the information and data format. For further insights on this topic, refer to our text on the use of ISO 20022 data.
Proposed solution: automated web scraping process
We suggested the company automate web scraping and create a tool to streamline the process of gathering information on training funding. In the first stage, we developed a sophisticated application (bot) in Java using Spring Boot. The aim was to enable the bot to navigate job center websites efficiently, eliminating the need for human involvement in this tedious and repetitive task.
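The project’s code is not public, so the snippet below is only a sketch of the general idea: a Spring Boot component that periodically visits a configured list of pages. The class name, schedule, and page list are illustrative assumptions, and jsoup is assumed for fetching and parsing.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical component: visits each configured PUP page once a day.
@Component
public class PupCrawler {

    // In a real application this list would come from configuration or a database.
    private final List<String> pageUrls = List.of(
            "https://example-pup-1.pl/kfs",
            "https://example-pup-2.pl/kfs"
    );

    // Runs every day at 6:00; requires @EnableScheduling on a configuration class.
    @Scheduled(cron = "0 0 6 * * *")
    public void crawlAll() {
        for (String url : pageUrls) {
            try {
                Document doc = Jsoup.connect(url).get();
                // Per-site extraction logic would go here, e.g. doc.select(...).
                System.out.println("Fetched " + url + ", title: " + doc.title());
            } catch (Exception e) {
                // A failed page should not stop the whole crawl.
                System.err.println("Failed to fetch " + url + ": " + e.getMessage());
            }
        }
    }
}
```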
In stage two, our focus shifted to identifying updates on web pages. Through careful configuration of the bot, we implemented automatic monitoring of changes on specific pages and the detection of new information. This approach allows for rapid access to real-time data on funds available from the National Training Fund through automated web scraping.
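One common way to detect such changes (we are not claiming this is exactly how the client’s bot works) is to store a fingerprint of the relevant page fragment and compare it against the previous run, as in this illustrative sketch:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Illustrative change detector: keeps the last known hash of each page's
// relevant content and reports whether it changed since the previous run.
public class ChangeDetector {

    // In production this map would be persisted, e.g. in a database table.
    private final Map<String, String> lastHashes = new HashMap<>();

    public boolean hasChanged(String url, String extractedContent) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hashBytes = digest.digest(extractedContent.getBytes(StandardCharsets.UTF_8));

        StringBuilder hex = new StringBuilder();
        for (byte b : hashBytes) {
            hex.append(String.format("%02x", b));
        }
        String newHash = hex.toString();

        // put(...) returns the previously stored hash, or null on the first visit.
        String previous = lastHashes.put(url, newHash);
        return previous != null && !previous.equals(newHash);
    }
}
```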
Web scraping, or the advantages of automation
Collecting and automatically retrieving data necessitates standardization and structuring of the information. Our objective was to automate the data analysis process. As part of this initiative, the bot logs into 340 sites automatically, extracts at least 10 pieces of information from each site, and transfers the data to the application. Subsequently, it analyzes the data and identifies current enrollments in the National Training Fund.
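To give a feel for what such a structured record can look like, here is a hypothetical data model for a single funding announcement. The field names are our illustrative choice, not the client’s actual schema.

```java
import java.math.BigDecimal;
import java.time.LocalDate;

// Hypothetical standardized representation of one National Training Fund call.
public record FundingCall(
        String laborOffice,        // which PUP published the announcement
        String sourceUrl,          // page the data was scraped from
        BigDecimal availableBudget,
        LocalDate submissionStart,
        LocalDate submissionDeadline,
        String notes
) {}
```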
At this stage, we’re implementing data standardization, which streamlines the comparison of information across various sources. The transition from dispersed data to a centralized database is key: by consolidating the reports from the labor offices, we make the analysis of training funding calls far more efficient. Our plan is to send email notifications from the application first to the requesting party, and eventually directly to its customers. This approach improves the readability of the information and eliminates errors stemming from inconsistently provided data. Moreover, it enables us to respond promptly to emerging changes and strengthen our client’s competitiveness in the training services market.
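A notification step like this can be built on Spring’s standard mail support. The snippet below is only a sketch of the idea, reusing the hypothetical FundingCall record from above; the addresses and message content are made up.

```java
import org.springframework.mail.SimpleMailMessage;
import org.springframework.mail.javamail.JavaMailSender;
import org.springframework.stereotype.Service;

// Illustrative service: emails a summary of a newly detected funding call.
@Service
public class FundingNotifier {

    private final JavaMailSender mailSender;

    public FundingNotifier(JavaMailSender mailSender) {
        this.mailSender = mailSender;
    }

    public void notifyAbout(FundingCall call, String recipient) {
        SimpleMailMessage message = new SimpleMailMessage();
        message.setTo(recipient);
        message.setSubject("New KFS funding call: " + call.laborOffice());
        message.setText("Budget: " + call.availableBudget()
                + ", deadline: " + call.submissionDeadline()
                + "\nSource: " + call.sourceUrl());
        mailSender.send(message);
    }
}
```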
Web scraping: act faster than the competition
Automation, coupled with GPT and data standardization, enables the effective utilization of even hard-to-access and disparate data. It also addresses the challenge posed by website diversity. However, web scraping isn’t limited to specific types of data; a similar approach can be successfully applied, for instance, to monitor competitors’ prices.
Web scraping makes the process of informing customers faster and easier. Harness the power of web scraping! If your service company is seeking to turbocharge its manual data workflows, look no further. Contact us today, and let’s tailor a solution to fit your unique needs.