Recently I had to work on a project in which I had to scrape data from the website Jabong. Looks easy? No, because websites like Jabong are not fixed-length pages like IMDB, which I used in my earlier blog post. Websites like Jabong automatically expand when you scroll down to the bottom, and keep on expanding until all the products in the category are exhausted.
So in this case, if we use `lxml` as I did in the previous post, `lxml` will work, but the problem is that it will only scrape the products listed on the initial page, around 52 products. To scrape 20,000 products we have to use something extra.
The biggest problem is how to scroll the page to the bottom. So here comes the `selenium` Python package. The Selenium Python bindings provide a simple API to write functional/acceptance tests using Selenium WebDriver. It opens a web browser and your Python program interacts with it. `Selenium` supports various browser actions such as scrolling and selecting elements, including selection by `XPath`.
To install the `Selenium` Python package, type `pip install selenium` in your terminal.
Import the necessary packages.
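The import listing is missing here, so this is only a rough sketch of what the later steps seem to need: `selenium` to drive the browser and `lxml` for XPath, as mentioned above. The helper `missing_packages` and the name `REQUIRED_PACKAGES` are my own, added so that the imports fail with a readable hint instead of a traceback.

```python
import importlib.util

# Third-party packages this post relies on (assumed from the text above):
# selenium drives the browser, lxml runs XPath queries on the HTML.
REQUIRED_PACKAGES = ("selenium", "lxml")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names from `names` that are not installed in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Install these first: pip install " + " ".join(missing))
else:
    # Everything is available, so the actual imports are safe.
    import time                     # pauses between scrolls
    from selenium import webdriver  # opens and controls the browser
    from lxml import html           # parses HTML for XPath queries
    print("All imports succeeded")
```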
Now that you are done importing the necessary packages, launch the webdriver for your browser. I am using Firefox in this case; you can use any browser from Selenium's list of supported browsers. This will open the web browser.
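Launching the browser could look roughly like the sketch below. The function name `open_browser` and its `make_driver` parameter are hypothetical, and I leave the URL as an argument because the exact category page is up to you. `webdriver.Firefox` also needs Firefox and its driver available on your machine.

```python
def open_browser(url, make_driver=None):
    """Start a browser session, load `url`, and return the driver object."""
    if make_driver is None:
        # Imported lazily so the helper can be tried out without a browser.
        from selenium import webdriver
        make_driver = webdriver.Firefox  # e.g. webdriver.Chrome for Chrome
    driver = make_driver()
    driver.get(url)  # the browser window opens and loads the page
    return driver
```

Calling `driver = open_browser(category_url)` with the category page you want to scrape should open a Firefox window on that page.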
Next, you have to scroll down until the end of the page, or a specific number of times depending on how many products you want. I am scrolling 50 times. Each scroll adds 52 products, so 50 × 52 = 2600, and around 2,600 products will be scraped. After each scroll I wait for 3 seconds so that the HTML page loads properly.
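A minimal sketch of the scrolling loop, assuming `driver` is the Selenium driver opened earlier. Here I scroll with a small JavaScript snippet through `execute_script`; sending the END key would be another way to do it, so treat this as one possible approach. The name `scroll_page` and the `sleep` parameter are my own.

```python
import time


def scroll_page(driver, times=50, pause=3.0, sleep=time.sleep):
    """Scroll to the bottom `times` times, waiting `pause` seconds after each scroll.

    Returns the page HTML once all the scrolling is done.
    """
    for _ in range(times):
        # Jump to the current bottom of the page, which triggers the next batch.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the newly requested products time to load.
        sleep(pause)
    return driver.page_source
```

With the defaults this takes at least 50 × 3 = 150 seconds, so lower `times` while you are still experimenting.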
After doing this we have the HTML content for all the products. Now we need to scrape the necessary information from the page; in our case we are scraping the product URL of each product. For this we use an `xpath` selector. If you want to learn about `xpath`, click here.
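Collecting the product URLs with XPath could be sketched like this. Two loud assumptions: the expression in `PRODUCT_LINK_XPATH` is hypothetical, because the real one depends on the site's markup, and I use Selenium's own `find_elements` here instead of `lxml`, which is a substitution on my part. The function name `product_urls` is also mine.

```python
XPATH = "xpath"  # same string as selenium.webdriver.common.by.By.XPATH

# Hypothetical selector: inspect the page and replace it with the real one.
PRODUCT_LINK_XPATH = '//a[@class="product-link"]'


def product_urls(driver, xpath=PRODUCT_LINK_XPATH):
    """Return the unique href values of all links matching `xpath`, in page order."""
    urls, seen = [], set()
    for link in driver.find_elements(XPATH, xpath):
        href = link.get_attribute("href")
        # Skip links without an href and products that were already collected.
        if href and href not in seen:
            seen.add(href)
            urls.append(href)
    return urls
```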
###Writing the whole code at once
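Putting the steps together, a complete script might look like the sketch below. It is my reconstruction under assumptions, not the original listing: the XPath expression is hypothetical, the scrolling is done through JavaScript, and the names `scrape_product_urls` and `main` are my own.

```python
import time

XPATH = "xpath"  # same string as selenium.webdriver.common.by.By.XPATH
PRODUCT_LINK_XPATH = '//a[@class="product-link"]'  # hypothetical selector


def scrape_product_urls(driver, url, scrolls=50, pause=3.0, sleep=time.sleep):
    """Load `url`, scroll `scrolls` times, and return the unique product URLs."""
    driver.get(url)

    # Scroll to the bottom repeatedly so the page keeps loading more products.
    for _ in range(scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # wait for the new products to load

    # Collect the href of every product link, dropping duplicates.
    urls, seen = [], set()
    for link in driver.find_elements(XPATH, PRODUCT_LINK_XPATH):
        href = link.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            urls.append(href)
    return urls


def main(url):
    """Run the scrape in a real Firefox window and always close it afterwards."""
    from selenium import webdriver  # needs selenium, Firefox, and its driver
    driver = webdriver.Firefox()
    try:
        return scrape_product_urls(driver, url)
    finally:
        driver.quit()
```

Call `main(category_url)` with the category page you want; with the default 50 scrolls you should get roughly 2,600 product URLs if each scroll really adds 52 products.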
For any typos or problems, feel free to comment or mail. Peace :)