Scraping is a technique rather than a specific tool or framework: a way to extract information automatically from different sources.
It is often said that everything public is scrapable, starting with HTML content and including digital documents such as spreadsheets, CSV files and other file types. When we narrow the scope to web content we call it web scraping, which refers to applying the scraping technique to websites and public information sources.
The goal is to bring other people's information into our own system. As a result of this technique we can:
- Collect structured information from non-structured information
- Create or discover an API on a site that doesn’t offer a public one
- Collect information for analytics or indexing purposes
- Create an aggregator site
- Detect changes on a site
- Offer a service based on others’ information
Frameworks and Tools for scraping
To get our structured information, we first need a way to fetch the raw content. There are many tools for this:
- wget --recursive --convert-links
- phantomjs.org (WebKit)
- slimerjs.org (Gecko)
- casperjs.org (scripting layer on top of PhantomJS or SlimerJS)
- requests (Python)
- BeautifulSoup (Python HTML parser)
- request (JS)
- cheerio (jQuery-like selectors for server-side JS)
- scrapy.org (Python)
- pjscrape (JS, jQuery and PhantomJS)
- Grab (Python)
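As a minimal illustration of what these tools automate, here is a tiny link extractor built only on Python's standard-library HTML parser (the libraries above offer far more convenience; the HTML snippet is a hardcoded stand-in for a fetched page):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# In a real scraper the HTML would come from an HTTP request.
html = '<ul><li><a href="/page1">One</a></li><li><a href="/page2">Two</a></li></ul>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/page1', '/page2']
```

A dedicated library like BeautifulSoup or cheerio reduces this whole class to a one-line selector, which is why they are usually the better choice.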
Basic steps and structure of WebScraping
I will describe in a few steps how to structure the scraping process.
- Take the URL of the website you plan to scrape.
- Follow the site’s links and try to discover its overall structure.
- Define what you need to extract and what the resulting structure should look like; a dictionary-like object usually works well.
- Detect hidden endpoints and HTML structures that repeat or carry distinctive classes or IDs.
- Find a tool or framework that suits you and helps you automate the process of getting the information.
- Store the information.
- Be happy!
- Revisit your scrapers frequently, because websites receive regular updates.
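The steps above can be sketched as a small pipeline; the parsing step below is a placeholder you would replace with real extraction logic for your target site, and the pages are hardcoded stand-ins for fetched URLs:

```python
import json

def parse_item(html):
    """Placeholder extraction step: turn raw HTML into a
    dictionary-like record. A real implementation would use
    an HTML parser to pull out the fields you defined."""
    return {"raw_length": len(html)}

def scrape(pages):
    # `pages` stands in for content downloaded from the discovered URLs.
    return [parse_item(html) for html in pages]

records = scrape(["<html>one</html>", "<html>two</html>"])

# Store the structured information (here, as a JSON file).
with open("scraped.json", "w") as f:
    json.dump(records, f)
```

Keeping fetching, parsing and storage as separate functions makes it easier to revisit the scraper when the site's markup changes.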
What to look for
For example, if you are looking at an ecommerce site, the product description may appear inside a tag like <div class="product-name">Phone</div>, so you know that you have to look for a div with that class to get the product’s name.
Images may appear as an <img> element or inside a background style of an element.
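With BeautifulSoup (listed among the tools above), pulling the product name out of such markup might look like this; the HTML snippet is a hardcoded stand-in for a fetched product page:

```python
from bs4 import BeautifulSoup

# Snippet standing in for a downloaded ecommerce page.
html = '<div class="product"><div class="product-name">Phone</div></div>'
soup = BeautifulSoup(html, "html.parser")

# Look for the div with the class we spotted while inspecting the site.
name = soup.find("div", class_="product-name").get_text()
print(name)  # Phone
```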
Sometimes it is useful to open the browser’s network tab and filter XHR requests to check whether the site uses an undocumented API. If it does, we can use that API to get the information much more easily.
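When such an endpoint exists, the response is typically JSON and needs no HTML parsing at all. The payload below is a hypothetical example of what you might see captured in the network tab:

```python
import json

# Hypothetical XHR response body observed in the browser's network tab.
body = '{"products": [{"id": 1, "name": "Phone", "price": 199}]}'

# Parsing JSON gives you structured data directly.
data = json.loads(body)
names = [p["name"] for p in data["products"]]
print(names)  # ['Phone']
```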
Pitfalls and how to overcome them
Sometimes websites try to avoid being scraped, so they protect their information by making it difficult to extract:
If your request is blocked by a website, look at the headers, cookies and POST arguments of the requests the browser performs when you browse the site normally. They usually provide useful hints about why you are being blocked. If you are still blocked, try copying the browser’s cookies into your requests. It is also worth checking the user agent you send with the request.
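With the standard library, copying browser headers into a request might look like this; the URL, user agent string and cookie value are placeholders you would replace with the values observed in your browser's developer tools:

```python
import urllib.request

# Mimic a regular browser by copying headers observed in the
# developer tools. All values here are placeholders.
req = urllib.request.Request(
    "http://example.com/",
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Cookie": "sessionid=abc123",          # copied from the browser
        "Accept-Language": "en-US,en;q=0.9",   # match the browser's headers
    },
)

# The request now carries the browser-like headers.
print(req.headers)
```

The `requests` library offers the same idea through a `headers=` argument and persistent `Session` objects.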
Another key point for scraping information without being blocked is concurrency and rate limiting. A person browsing the website normally won’t hit the server very fast, so make sure you don’t trigger alerts by requesting pages too quickly.
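A simple way to respect this is to pause between requests. In this sketch, `fetch` is a placeholder for the real download function:

```python
import time

def polite_fetch(urls, delay=2.0, fetch=lambda url: url):
    """Fetch URLs sequentially, pausing between requests so the
    request rate resembles a human browsing the site.
    `fetch` is a placeholder for the real download function."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)  # throttle: wait before the next request
    return results
```

More sophisticated scrapers add random jitter to the delay or use a token-bucket rate limiter, but a fixed pause already avoids the most common triggers.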
Example and Summary
I’ll show a simple example of using wget as a web scraper to get offline documentation or a site replica. We will use Django’s Read the Docs documentation, http://django.readthedocs.io/en/stable/, as an example.
$ wget -c --recursive --convert-links --domains=django.readthedocs.io --no-parent http://django.readthedocs.io/en/stable/
--recursive tells wget to follow links
--convert-links transforms links within the downloaded content into local file references
--domains restricts the crawl to the given domains
--no-parent restricts the crawl to subpaths of the given URL