How To Scrape A Website Which Redirects For Some Time

February 27, 2023

I am trying to scrape a website which shows a DDoS prevention page for about 5 seconds before displaying its content. The website is Koinex. I am using Python 3 and BeautifulSoup, I think I would ne

Solution 1:

The page uses JavaScript to generate a value which is sent to https://koinex.in/cdn-cgi/l/chk_jschl in order to get the cookie cf_clearance. The site then checks this cookie to decide whether to skip the DDoS page.

The JavaScript can generate this value using different parameters and different methods on every request, so it can be easier to use Selenium to get the data:

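from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Firefox()
driver.get('https://koinex.in/')

# give the DDoS prevention page time to finish and redirect
time.sleep(8)

# Selenium 4 removed `find_elements_by_tag_name`; use `find_elements(By.TAG_NAME, ...)`
tables = driver.find_elements(By.TAG_NAME, 'table')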

for item in tables:
    print(item.text)
    #print(item.get_attribute("value"))

Result

VOLUME PRICE/ETH
5.2310 64,300.00
0.0930 64,100.00
10.7670 64,025.01
0.0840 64,000.00
0.3300 63,800.00
0.2800 63,701.00
0.4880 63,700.00
0.7060 63,511.00
0.5020 63,501.00
0.1010 63,500.01
1.4850 63,500.00
1.0000 63,254.00
0.0300 63,253.00
VOLUME PRICE/ETH
1.0000 64,379.00
0.0940 64,380.00
0.9710 64,398.00
0.0350 64,399.00
0.7170 64,400.00
0.3000 64,479.00
5.1650 64,480.35
0.0020 64,495.00
0.2000 64,496.00
9.5630 64,500.00
0.4000 64,501.01
0.0400 64,550.00
0.5220 64,600.00
DATE VOLUME PRICE/ETH
31/12/2017, 12:19:29 0.2770 64,300.00
31/12/2017, 12:19:11 0.5000 64,300.00
31/12/2017, 12:18:28 0.3440 64,025.01
31/12/2017, 12:18:28 0.0750 64,026.00
31/12/2017, 12:17:50 0.0010 64,300.00
31/12/2017, 12:17:47 0.0150 64,300.00
31/12/2017, 12:15:45 0.6720 64,385.00
31/12/2017, 12:15:45 0.2000 64,300.00
31/12/2017, 12:15:45 0.0620 64,300.00
31/12/2017, 12:15:45 0.0650 64,199.97
31/12/2017, 12:15:45 0.0010 64,190.00
31/12/2017, 12:15:45 0.0030 64,190.00
31/12/2017, 12:15:25 0.0010 64,190.00
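
The text printed above is still one string per table. As a minimal sketch (the helper name `parse_table_text` is hypothetical, and it assumes every data row ends with a volume and a price, as in the result shown here), the rows could be turned into numbers like this:

```python
def parse_table_text(text):
    """Turn the text of one table into a list of (volume, price) tuples.

    Assumes the last two whitespace-separated fields of each data row
    are VOLUME and PRICE, and that prices use ',' as thousands separator.
    """
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # empty or malformed line
        try:
            volume = float(parts[-2])
            price = float(parts[-1].replace(',', ''))
        except ValueError:
            continue  # header line such as "VOLUME PRICE/ETH"
        rows.append((volume, price))
    return rows


sample = "VOLUME PRICE/ETH\n5.2310 64,300.00\n0.0930 64,100.00"
print(parse_table_text(sample))
# [(5.231, 64300.0), (0.093, 64100.0)]
```

With Selenium it would be called as `parse_table_text(item.text)` inside the loop. Rows from the third table also work, because the date and time fields at the start of each line are ignored.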

You can also get the HTML from Selenium and use it with BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')

but Selenium can get data using XPath, CSS selectors and other methods, so in most cases there is no need to use BeautifulSoup.

See the Selenium documentation: 4. Locating Elements

EDIT: This code uses cookies from Selenium to load the page with requests, and it has no problem with the DDoS page.

The problem is that the page uses JavaScript to display the tables, so you can't get them using requests + BeautifulSoup. But maybe you can find the URLs used by JavaScript to get the data for the tables, and then requests can be useful.

from selenium import webdriver
import time

# --- Selenium ---

url = 'https://koinex.in/'

driver = webdriver.Firefox()
driver.get(url)

time.sleep(8)

#tables = driver.find_elements_by_tag_name('table')
#for item in tables:
#    print(item.text)

# --- convert cookies/headers from Selenium to Requests ---

cookies = driver.get_cookies()

for item in cookies:
    print('name:', item['name'])
    print('value:', item['value'])
    print('path:', item['path'])
    print('domain:', item['domain'])
    print('expiry:', item.get('expiry'))  # session cookies have no 'expiry' key
    print('secure:', item['secure'])
    print('httpOnly:', item['httpOnly'])
    print('----')

# convert list of dictionaries into dictionary
cookies = {c['name']: c['value'] for c in cookies}

# it has to be full `User-Agent` used in Browser/Selenium (it can't be short 'Mozilla/5.0')
headers = {'User-Agent': driver.execute_script('return navigator.userAgent')}

# --- requests + BeautifulSoup ---

import requests
from bs4 import BeautifulSoup

s = requests.Session()
s.headers.update(headers)
s.cookies.update(cookies)

r = s.get(url)

print(r.text)

soup = BeautifulSoup(r.text, 'html.parser')
tables = soup.find_all('table')

print('tables:', len(tables))

for item in tables:
    print(item.get_text())
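
Both scripts above wait a fixed 8 seconds. A possible alternative, sketched here under the assumption that the appearance of the cf_clearance cookie means the DDoS page has been passed, is to poll the cookies until it shows up. The helper name `wait_for_cookie` and its parameters are hypothetical and not part of Selenium:

```python
import time


def wait_for_cookie(get_cookies, name='cf_clearance', timeout=30.0, poll=0.5):
    """Poll `get_cookies()` until a cookie called `name` appears.

    `get_cookies` is any callable returning a list of dictionaries with
    'name' and 'value' keys, e.g. Selenium's `driver.get_cookies`.
    Returns the cookie value, or None if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        for cookie in get_cookies():
            if cookie.get('name') == name:
                return cookie.get('value')
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# usage with Selenium (not executed here):
# driver.get(url)
# if wait_for_cookie(driver.get_cookies) is None:
#     print('cf_clearance did not appear in time')
```

The tables are rendered by JavaScript, so they may still need a moment to appear after the cookie is set. This only replaces the wait for the DDoS page itself.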
