Accessing multiple data sources with data scraping.
ChallengeFast and accurate data scraping from multiple sources
SolutionBest practices for robust & resilient web scraping
Technologies and toolsMicrosoft Azure Cloud Services for infrastructure hosting, tuning and administration. Python language with required libraries and frameworks (Azure-sdk, Scrapy, Selenium, etc.) for web sites scraping and crawling process
The client is a non-commercial organization that provides support for African American small businesses and entrepreneurs. They’re proud to provide services that help African American businessmen get grants and achieve success in competitions.
Challenge: fast and accurate data scraping from multiple sources
The client deals with huge amounts of data coming from various sources regularly. So data management has become a concern for them.
They wanted to scrape job offers, mentorship and network opportunities for talented African American entrepreneurs from various websites and publish it on their own platform. So, entrepreneurs can easily discover businesses owned by African Americans and go support them or locate their own.
ESSID Solutions was challenged to develop a strong data scraping solution for the client’s marketplace.
Solution: best practices for robust & resilient web scraping
Our team of engineers have applied their data scraping expertise to enable effective data collection from various sources.
The ESSID Solutions team was to setup infrastructure & code flow for the client:
Git and CI/CD part
For code management there was used AzureDevOps repository with such a pipeline setup that allowed our team to build and push docker images to the registry by using a parallel job agent.
Registry and Logic App part
Next, we created Azure Docker Container Registry on Azure portal to store our docker images.
Then we needed to create docker instances from images by using Azure Logic app to run scraper code in parallel and separately.
During this stage, the ESSID Solutions team created container instances with Logic apps. Then we needed to give each container access to Azure resources and sensitive data such as passwords, connection strings etc. that were stored in Azure KeyVault.
To store scraper outputs, our team decided to create a Storage Account which would be like a cloud folder to save scraped data. After that we were able to start our scrapers in a manual way, but we needed some orchestration, automatisation and postprocessing.
Data Factory and orchestration part
Our engineers ran all our scrapers with time-trigger and in a single pipeline run with Azure Data Factory.
The main pipeline was supposed to start all containers with requests via azure API, then run DataBricks Notebooks to process collected data.
At this stage, we were scrapping all the data from websites (as incremental loading of data from websites is not possible or difficult) and full data processing/saving to the database. Before loading new data to the database, we deleted the existing data.
As a result, the client has got a robust data scraping solution that scrapes data from multiple sites and business listings and gathers information about businesses’ founded by African Americans that are useful for the subscribers of the client’s platform.
Result: data scraping optimization for reduced processing time
Our team of data scientists and engineers drew upon multiple sources to meet the client’s data scraping needs.
Our solution has empowered the client in the following ways:
- Data extraction at scale
- Structured data delivered
- Low-maintenance and speed
- Easy to implement