Crawler: scraping dynamically loaded data with Selenium (Zhihu)
thoustree 2021-06-04 10:41:52

If a site loads its data dynamically, you have to scroll the page down before more content is rendered. Using Selenium to simulate scrolling the browser to the bottom of the page makes it possible to capture this dynamically loaded data.
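The scroll-until-loaded pattern can be sketched as a small helper. This is not from the original post: the function name `scroll_to_bottom` and the stop-when-height-stops-growing refinement are my own additions. It accepts any object with an `execute_script()` method (a Selenium `WebDriver` fits), so it can be exercised without a real browser.

```python
import time

def scroll_to_bottom(driver, max_rounds=200, pause=1.0):
    # `driver` is anything exposing execute_script(); a Selenium WebDriver works.
    # Scroll repeatedly, stopping early once the document height stops growing,
    # which usually means no more content is being lazy-loaded.
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give the page time to fetch and render new items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new appeared; stop scrolling
            break
        last_height = new_height
```

Compared with a fixed loop count, this avoids wasting time once the feed is exhausted, at the cost of stopping early if the site is slow to respond (raise `pause` in that case).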

In this post I want to find some heavily discussed questions, in order to extract the topic keywords each question involves. (Content will be removed on request if it infringes.)

The code below calls driver.execute_script("window.scrollTo(0, document.body.scrollHeight)") to scroll the browser to the bottom of the page 200 times. This collected nearly 900 answers under the "women" topic; after de-duplication (the same question can appear multiple times under one topic), about 600 questions remained.

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions
import time
import random
import re
from pymongo import MongoClient

client = MongoClient('localhost')
db = client['test_db']

def get_links(url, word):
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    option.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"')
    driver = Chrome(options=option)
    time.sleep(10)
    driver.get('https://www.zhihu.com')
    time.sleep(10)
    driver.delete_all_cookies()  # clear existing cookies
    time.sleep(2)
    cookie = {}  # replace this with your own cookie
    driver.add_cookie(cookie)
    driver.get(url)
    time.sleep(random.uniform(10, 11))
    for i in range(0, 200):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(random.uniform(3, 4))
    links = driver.find_elements_by_css_selector('h2.ContentItem-title > div > a')
    print(len(links))
    # de-duplication regex: match question links rather than answer links
    regex = re.compile(r'https://www.zhihu.com/question/\d+')
    links_set = set()
    for link in links:
        try:
            links_set.add(regex.search(link.get_attribute("href")).group())
        except AttributeError:
            pass
    print(len(links_set))
    with open(r'Zhihu image link' + '/' + word + '-'.join(str(i) for i in list(time.localtime())[:5]) + '.txt', 'a') as f:
        for item in links_set:
            f.write(item + '\n')
            db[word + '_' + 'links'].insert_one({"link": item})

if __name__ == '__main__':
    input_word = input('Enter a topic: ')
    input_link = input('Enter the URL for that topic: ')
    get_links(input_link, input_word)
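The de-duplication step above can be demonstrated on its own, without a browser. This is my own illustrative sketch (the helper `unique_questions` and the sample hrefs are not in the original): the regex keeps only the `/question/<id>` part of each href, so several answers to the same question collapse into a single entry.

```python
import re

# Same pattern as the scraper: match the question URL, dropping any
# trailing /answer/... segment.
regex = re.compile(r'https://www\.zhihu\.com/question/\d+')

def unique_questions(hrefs):
    # Collect the distinct question URLs found among the raw answer links.
    found = set()
    for href in hrefs:
        m = regex.search(href or "")  # href may be None for some elements
        if m:
            found.add(m.group())
    return found

hrefs = [
    "https://www.zhihu.com/question/123/answer/456",
    "https://www.zhihu.com/question/123/answer/789",
    "https://www.zhihu.com/question/999",
]
print(unique_questions(hrefs))  # the two answers to question 123 collapse to one link
```

Using a `set` rather than a list makes duplicates disappear automatically, which is why the original code reports fewer questions (about 600) than raw answer links (about 900).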

Please include a link to the original when reprinting. Thanks!