If you have not heard of web scrapping, it is the term used to extract data from websites. You may have visited a website and they kindly provide data tables for you to download. On other occasions, you may have thought. As you manually copied and pasted the data into your database or spreadsheet there must be a better way.
Warning: Before extracting any data, check you won't be breaking any copyright infringement rules.
Fortunately, there is an array of software tools available.
Sequentum (desktop or server) aimed at the enterprise market. Pricing you need to contact Sequentum. I have no experience with this software. The fact they will not provide pricing on their website suggests it is not for everyone.
Data Miner, a Chrome browser extension. It is free if you scrape below 500 pages a month. I have used this tool and found it to be intuitive and extracts the data required without too much effort.
Dexio is a suite of software to meet all your extraction transform, load (ETL) needs. Again, frightened to provide pricing and I have no experience of using Dexio.
UniPath provides robotic (algorithm) automation process software. An example job candidate submits their resume, a robot automates resume data extraction. Most end-users should be able to use this software.
Web Scraper is a free browser extension with paid options. To collect web data, it has a point and clicks interface. They call this a scraper. Once you have configured a scraper and ran your scraper you can download the data in CSV format.
Web scraping R (open source) used by data scientists. R requires coding knowledge. I have used this with Microsoft Power BI and found R to be robust. R objective is to process large data sets. This you just could not do with other software other than Python.
Web scraping in Python (open source). Again a professional data extraction tool mostly used by data scientists.
To use either R or Python you would have to invest time learning how. Without question R and Python are the most robust solutions but not for everyone.
Web scraping with Power BI, Power Query
As an example, let’s suppose I wanted to extract data from Amazon. I seek books to do with ‘collect combine and transform data.’ I require the book title. Author, the number of reviews, average rating and published date.
Open Microsoft Power BI. Click the Get Data icon to reveal the drop-down and select web.
Paste the URL
You should see the next window. Click ‘Add Table Using Examples.’
You should see the example window below.
Tip: Maximise the window.
In the top half of the window you should be able to scroll through the list to view the available books.
In the bottom half of the window. On the left you should see a table with one column. Named ‘column 1.’ Double click the ‘column 1.’ Rename the column to ‘Title.’ Below the title row click the empty row. Type the first few letters of the book title. In this case ‘Learn Microsoft PowerApps.’ You should not need to type as many letters. Before the title becomes obvious to you. Click book title to add to the column. Add the next book title. On rare occasions you may need to add another book title. To have the column fully populate with webpage book titles.
On the right next to ‘Title’ column is a thin column with an asterisk. Click to add another column. Name the column ‘Author.’ Repeat the previous steps you did to add book title but this time author name. Repeat the same steps to add ‘Average Rating’ and ‘Release Date.’
You should find the table rows self-populate.
Once you completed click ‘OK.’
Power BI should open with the table populated. If you do not see the table. On the left hand of the screen you should see three icons. Click the table icon see below.
You can now manipulate the data to suit your needs see below.
Power Query offers you a lot of flexibility and for many Power Query is all they will ever need.