How to run `selenium-chromedriver` in multiple threads


Problem description

I am using selenium and chrome-driver to scrape data from some pages and then run some additional tasks with that information (for example, typing comments on some pages).

My program has a button. Every time it is pressed, it calls thread_(self) (below), starting a new thread. The target function self.main contains the code that runs all the selenium work on a chrome-driver.

def thread_(self):
    th = threading.Thread(target=self.main)
    th.start()

My problem occurs after the user presses the button for the first time. The th thread opens browser A and does some work. While browser A is still working, the user presses the button again, which opens browser B running the same self.main. I want every opened browser to run simultaneously. The problem I face is that when I run that thread function again, the first browser stops and the second browser is opened.

I know my code can create threads without limit, and I know that this will affect the PC's performance, but I am OK with that. I want to speed up the work done by self.main!

`Threading` for `selenium` speed up

Consider the following functions, which illustrate how threads with selenium give some speed-up compared to a single-driver approach. The code below scrapes the HTML title from a page opened by selenium, using BeautifulSoup. The list of pages is links.

import time
from bs4 import BeautifulSoup
from selenium import webdriver
import threading

def create_driver():
   """returns a new chrome webdriver"""
   chromeOptions = webdriver.ChromeOptions()
   chromeOptions.add_argument("--headless") # make it not visible, just comment if you like seeing opened browsers
   return webdriver.Chrome(options=chromeOptions)  

def get_title(url, driver=None):
   """print the html title of url, parsed with BeautifulSoup
   if driver is None, create a new chrome-driver and quit() it afterwards
   otherwise use the driver provided and do not quit() it"""
   def print_title(drv):
      drv.get(url)
      soup = BeautifulSoup(drv.page_source, "lxml")
      item = soup.find('title')
      print(item.string.strip())

   if driver is not None:
      print_title(driver)
   else:
      driver = create_driver()
      try:
         print_title(driver)
      finally:
         driver.quit()  # close the browser even if loading the page fails

links = ["https://www.amazon.com", "https://www.google.com", "https://www.youtube.com/", "https://www.facebook.com/", "https://www.wikipedia.org/", 
"https://us.yahoo.com/?p=us", "https://www.instagram.com/", "https://www.globo.com/", "https://outlook.live.com/owa/"]

Now call get_title on the links above.

Sequential approach

A single chrome driver, with all links passed to it sequentially. This takes 22.3 s on my machine (note: Windows).

start_time = time.time()
driver = create_driver()

for link in links: # could be 'like' clicks 
  get_title(link, driver)  

driver.quit()
print("sequential took ", (time.time() - start_time), " seconds")

Multiple threads approach

Using one thread per link. This takes 10.5 s, more than 2x faster.

start_time = time.time()    
threads = [] 
for link in links: # each thread could be like a new 'click' 
    th = threading.Thread(target=get_title, args=(link,))    
    th.start() # could `time.sleep` between 'clicks' to see what's up without the headless option
    threads.append(th)        
for th in threads:
    th.join() # Main thread wait for threads finish
print("multiple threads took ", (time.time() - start_time), " seconds")

Here and here (better) are some other working examples. The second uses a fixed number of threads in a ThreadPool, and suggests that storing the chrome-driver instance initialized on each thread is faster than creating and starting a new one every time.
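The idea of a fixed-size pool that keeps one driver per thread could be sketched roughly as follows. This is only an illustration under assumptions: FakeDriver, get_driver, fetch_title and urls are hypothetical names that do not appear in the examples referenced above, and FakeDriver stands in for a real browser so that the sketch runs without Chrome installed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeDriver:
    """Hypothetical stand-in for a selenium driver (assumption, not selenium itself)."""

    def get(self, url):
        self.current_url = url

    @property
    def title(self):
        return "title of " + self.current_url

    def quit(self):
        pass


thread_local = threading.local()  # separate attribute storage for each thread
created_drivers = []              # every driver built, so all can be quit at the end
created_lock = threading.Lock()   # protects created_drivers across threads


def get_driver(factory):
    """Return the calling thread's driver, creating it only on first use."""
    driver = getattr(thread_local, "driver", None)
    if driver is None:
        driver = factory()
        thread_local.driver = driver
        with created_lock:
            created_drivers.append(driver)
    return driver


def fetch_title(url, factory=FakeDriver):
    """Load url with the thread's own driver and return the page title."""
    driver = get_driver(factory)
    driver.get(url)
    return driver.title


urls = [
    "https://www.google.com",
    "https://www.wikipedia.org/",
    "https://www.youtube.com/",
    "https://www.amazon.com",
]

# At most two worker threads, hence at most two drivers, however many urls there are.
with ThreadPoolExecutor(max_workers=2) as pool:
    titles = list(pool.map(fetch_title, urls))

for driver in created_drivers:
    driver.quit()

print(titles)
```

With a real browser, a function such as create_driver above could presumably be passed as factory instead of FakeDriver, since a selenium driver also exposes get, title and quit.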

Still, I was not sure this was the optimal approach for getting considerable speed-ups with selenium, since threads running code that is not I/O bound end up executing sequentially (one thread after another). Due to the Python GIL (Global Interpreter Lock), a Python process cannot run threads in parallel (that is, utilize multiple CPU cores).
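Whether the work is I/O bound is what decides if threads help. A minimal sketch of the I/O-bound case, assuming that time.sleep is an acceptable stand-in for a blocking wait on a browser (wait_for_io, run_sequential and run_threaded are illustrative names, not selenium calls):

```python
import threading
import time


def wait_for_io(seconds):
    """Stand-in for a blocking command; time.sleep releases the GIL while waiting."""
    time.sleep(seconds)


def run_sequential(n, seconds):
    """Run n waits one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(n):
        wait_for_io(seconds)
    return time.perf_counter() - start


def run_threaded(n, seconds):
    """Run n waits in n threads at once and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=wait_for_io, args=(seconds,)) for _ in range(n)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return time.perf_counter() - start


sequential_time = run_sequential(4, 0.2)
threaded_time = run_threaded(4, 0.2)
print("sequential:", round(sequential_time, 2), "s")
print("threaded:  ", round(threaded_time, 2), "s")
```

The waits overlap in the threaded run because a sleeping thread does not hold the GIL. A pure-Python computation in place of the sleep would not be expected to overlap in the same way.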

Using the `multiprocessing` package

To try to overcome the Python GIL limitation, I wrote the following code using the multiprocessing package and its Process class, and ran multiple tests. I even added random page hyperlink clicks to the get_title function above. Additional code is here.

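import multiprocessing

def run_with_processes(links):
    """run get_title for each link in its own process, return the elapsed seconds"""
    start_time = time.time()

    processes = []
    for link in links: # each process a new 'click'
        ps = multiprocessing.Process(target=get_title, args=(link,))
        ps.start() # could sleep 1 between 'clicks' with `time.sleep(1)`
        processes.append(ps)
    for ps in processes:
        ps.join() # Main process waits for the processes to finish

    return time.time() - start_time

if __name__ == "__main__": # guard needed on Windows, where each Process re-imports this module
    print("multiple processes took ", run_with_processes(links), " seconds")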

Contrary to what I expected, Python multiprocessing.Process-based parallelism for selenium was on average around 8% slower than threading.Thread. But obviously both were on average more than twice as fast as the sequential approach. I just found out that selenium chrome-driver commands use HTTP requests (like POST, GET), so the work is I/O bound; it therefore releases the Python GIL, which indeed makes it run in parallel across threads.

This is not a definitive answer, as my tests were only a tiny example. Also, I'm using Windows, and multiprocessing has many limitations in this case. Each new Process is not a fork as it is on Linux, which means, among other downsides, that a lot of memory is wasted.

Taking all that into account: it seems that, depending on the use case, threads may be as good as or better than the heavier process-based approach (especially for Windows users).
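The remark about Windows not forking can be checked on any machine. A small sketch using only the standard library, which reports which process start methods the current platform offers (the variable names are illustrative):

```python
import multiprocessing
import sys

# Start methods this platform supports; 'fork' is not offered on Windows.
available = multiprocessing.get_all_start_methods()

# The method that multiprocessing.Process will use unless told otherwise.
default = multiprocessing.get_start_method()

print("platform:", sys.platform)
print("available start methods:", available)
print("default start method:", default)
```

With 'spawn', each child starts a fresh interpreter and re-imports the main module, which is consistent with the higher memory and start-up cost described above.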
