Case study: web scraping, or harnessing the potential of big data

How do you quickly and reliably monitor 340 different websites for specific information? For a training company, we designed and implemented an efficient process for automatically gathering data from the web. Here’s our take on web scraping.

From this text you will learn:  

  • what automated web scraping is,
  • how we helped a client streamline the crawling of several hundred web pages, 
  • why automated data retrieval is the future. 

A training company aiming to consistently acquire clients needs to stay updated on the available funds from the National Training Fund. Such information is typically published on the websites of District Labor Offices (PUP) nationwide. 

The problem lies in the fact that in order to obtain such data and inform clients about training funding opportunities in their relevant labor office, the training company must sift through hundreds of websites daily. Time and attention are crucial here, as the order of submitting applications matters. It’s a tedious, error-prone, and repetitive task – perfect conditions for a bot to work! 

Extracting data from websites: a training company’s daily routine 

At our client, a consulting firm, the person responsible for this data had to manually comb through 340 District Labor Office (PUP) websites daily. Afterward, they filled out an Excel spreadsheet detailing calls for proposals, available budgets, and submission deadlines. Only then could individual client emails go out.

This method of data retrieval consumes resources and poses significant challenges in the event of vacations or emergencies. If no one familiar with navigating PUP sites is available, the training company’s sales process could be adversely affected. Compounding the issue is the disparate structure of PUP websites, which lack a standardized format for information. The required data is scattered across different tabs, often in a non-intuitive manner, leading to prolonged data retrieval times. It’s akin to the challenges faced in cross-border payments, where a new standard aims to streamline the information and data format. For further insights on this topic, refer to the text on the use of ISO 20022 data.

Proposed solution: automated web scraping process 

We suggested the company automate web scraping and create a tool to streamline the process of gathering information on training funding. In the first stage, we developed an application (bot) in Java using Spring Boot. The aim was to enable the bot to navigate labor office websites efficiently, eliminating the need for human involvement in this tedious and repetitive task.
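To make the first stage concrete, here is a minimal sketch of the fetching step using Java’s built-in `HttpClient`. It is not the client’s actual Spring Boot code; the class name, method names, and the keyword check are illustrative assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PupFetcher {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    // Download the raw HTML of a single labor office page.
    static String fetchPage(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Pure helper: does the page mention the National Training Fund (KFS)?
    static boolean mentionsTrainingFund(String html) {
        String lower = html.toLowerCase();
        return lower.contains("krajowy fundusz szkoleniowy") || lower.contains("kfs");
    }
}
```

In a real deployment, a scheduler (e.g. Spring’s `@Scheduled`) would run this fetch over the full list of 340 office URLs.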

In stage two, our focus shifted to identifying updates on web pages. Through careful configuration of the bot, we implemented automatic monitoring of changes on specific pages and the detection of new information. This approach allows for rapid access to real-time data on funds available from the National Training Fund through automated web scraping.  
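One common way to detect such updates, shown here as a sketch under the assumption that a content hash per URL is enough to flag a change, is to compare a SHA-256 digest of each page against the last observed version (in production the hashes would live in a database, not in memory):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class ChangeDetector {
    // Last known content hash per monitored URL (illustrative in-memory store).
    private final Map<String, String> lastHashes = new HashMap<>();

    static String sha256(String content) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(content.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    // Returns true when the page content differs from the last observed version.
    boolean hasChanged(String url, String content) throws Exception {
        String hash = sha256(content);
        String previous = lastHashes.put(url, hash);
        return previous == null || !previous.equals(hash);
    }
}
```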

Web scraping, or the advantage of automation

Collecting and automatically retrieving data necessitates standardization and structuring of the information. Our objective was to automate the data analysis process. As part of this initiative, the bot logs into 340 sites automatically, extracts at least 10 pieces of information from each site, and transfers the data to the application. Subsequently, it analyzes the data and identifies current enrollments in the National Training Fund. 
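The extracted pieces of information have to land in a common shape before they can be analyzed. A minimal sketch of such a standardized record (field names are illustrative assumptions, not the client’s actual schema):

```java
import java.time.LocalDate;

// One standardized row of data scraped from a labor office page.
public record FundingCall(
        String officeName,      // which District Labor Office published the call
        String budget,          // available budget, kept as published
        LocalDate deadline,     // application deadline
        boolean enrollmentOpen  // is the call currently accepting applications?
) {
    // A call is "current" when it is open and its deadline has not passed.
    boolean isCurrent(LocalDate today) {
        return enrollmentOpen && !deadline.isBefore(today);
    }
}
```

Once every office’s data is expressed as such records, identifying current enrollments is a simple filter over the collection.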

At this stage, we’re implementing data standardization, streamlining the comparison of information across various sources. This transition from dispersed data to a centralized database is key. By consolidating data reporting from county offices, we enhance the efficiency of analyzing bids for training funding. Our plan is for the application to send email notifications first to our client, and eventually directly to its customers. This approach ensures improved readability of information and eliminates errors stemming from inconsistently provided data. Moreover, it enables us to promptly respond to emerging changes and enhance our client’s competitiveness in the training services market.
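To illustrate what standardization means in practice: different offices publish the same fact, such as a submission deadline, in different formats. A hedged sketch of normalizing those into one canonical date (the list of patterns is an assumed example, not exhaustive):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Locale;

public class DeadlineNormalizer {
    // Offices publish deadlines in several formats; try each in turn.
    private static final List<DateTimeFormatter> FORMATS = List.of(
            DateTimeFormatter.ofPattern("dd.MM.yyyy"),
            DateTimeFormatter.ofPattern("yyyy-MM-dd"),
            DateTimeFormatter.ofPattern("d MMMM yyyy", new Locale("pl")));

    static LocalDate normalize(String raw) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return LocalDate.parse(raw.trim(), format);
            } catch (Exception ignored) { /* try the next format */ }
        }
        throw new IllegalArgumentException("Unrecognized date format: " + raw);
    }
}
```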

Web scraping: act faster than the competition 

Automation, coupled with GPT and data standardization, enables the effective utilization of even hard-to-access and disparate data. It also addresses the challenge posed by website diversity. However, web scraping isn’t limited to specific types of data; a similar approach can be successfully applied, for instance, to monitor competitors’ prices. 

Web scraping makes the process of informing customers faster and easier. Harness the power of web scraping! If your service company is seeking to turbocharge its manual data workflows, look no further. Contact us today, and let’s tailor a solution to fit your unique needs.