Processes and threads (Part 2)

lc013 2021-09-15 08:32:19

 

Multithreading

As mentioned earlier, a process contains at least one thread; in fact, a process is composed of one or more threads. Threads are the execution units supported directly by the operating system, so high-level languages usually have built-in multithreading support, and Python is no exception. Moreover, Python's threads are real POSIX threads, not simulated threads.

Multithreading has the following advantages :

  • Threads let a program move long-running tasks into the background.

  • The user interface can be more responsive: for example, when the user clicks a button that triggers lengthy processing, a progress bar can pop up to show the progress.

  • The program may run faster.

  • Threads are especially useful for tasks that spend most of their time waiting, such as user input, file reading and writing, or sending and receiving network data; while waiting, precious resources such as memory can be released.

Threads can be divided into:

  • Kernel threads: created and destroyed by the operating system kernel.

  • User threads: implemented in user programs, without kernel support.

Python's standard library provides two modules: _thread and threading. The former is a low-level module; the latter is a higher-level module that wraps _thread. In most cases you only need the threading module, and it is the recommended one.

Here again we take downloading files as an example, this time implemented with multithreading:

from random import randint
from threading import Thread, current_thread
from time import time, sleep

def download(filename):
    print('thread %s is running...' % current_thread().name)
    print('Start the download %s...' % filename)
    time_to_download = randint(5, 10)
    sleep(time_to_download)
    print('%s download complete! It took %d seconds' % (filename, time_to_download))

def download_multi_threading():
    print('thread %s is running...' % current_thread().name)
    start = time()
    t1 = Thread(target=download, args=('Python.pdf',), name='subthread-1')
    t1.start()
    t2 = Thread(target=download, args=('nazha.mkv',), name='subthread-2')
    t2.start()
    t1.join()
    t2.join()
    end = time()
    print('The total cost is %.3f seconds' % (end - start))
    print('thread %s is running...' % current_thread().name)

if __name__ == '__main__':
    download_multi_threading()


The way multithreading is used here is similar to multiprocessing: a thread object is created through the Thread class, where the target parameter is the function to execute, args holds the arguments passed to that function, and name names the thread. By default, threads are named Thread-1, Thread-2, and so on.

Besides, any process starts one thread by default, which we call the main thread; the main thread can in turn start new threads. The threading module provides a function current_thread() that returns an instance of the current thread. The main thread instance is named MainThread; child threads get the name specified at creation time, i.e. the name parameter.

Running results:

thread MainThread is running...
thread subthread-1 is running...
Start the download Python.pdf...
thread subthread-2 is running...
Start the download nazha.mkv...
nazha.mkv download complete! It took 5 seconds
Python.pdf download complete! It took 7 seconds
The total cost is 7.001 seconds
thread MainThread is running...

Lock

The biggest difference between multithreading and multiprocessing is this: with multiple processes, each process has its own copy of every variable, so processes do not affect one another; with multiple threads, all variables are shared by all threads, so any variable can be modified by any thread. The greatest danger of sharing data between threads is that multiple threads may change the same variable at the same time and corrupt its contents.

Here is an example that demonstrates how multiple threads operating on the same variable at the same time can corrupt its value:

from threading import Thread

# Suppose this is your bank account:
balance = 0

def change_it(n):
    # deposit first, then withdraw; the result should be 0:
    global balance
    balance = balance + n
    balance = balance - n

def run_thread(n):
    for i in range(100000):
        change_it(n)

def nolock_multi_thread():
    t1 = Thread(target=run_thread, args=(5,))
    t2 = Thread(target=run_thread, args=(8,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print(balance)

if __name__ == '__main__':
    nolock_multi_thread()


Running results:

-8


The code defines a shared variable balance, then starts two threads that each deposit and then withdraw; in theory the result should be 0. However, because thread scheduling is decided by the operating system, when t1 and t2 run alternately, with enough iterations the final value of balance is no longer guaranteed to be 0.

The reason lies in this statement:

balance = balance + n


Executing this statement takes two steps:

  • compute balance + n and save the result in a temporary variable;

  • assign the value of the temporary variable to balance.

In other words, it can be viewed as:

x = balance + n
balance = x
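
This two-step structure is visible in the interpreter's bytecode. A quick check with the standard dis module (using the same change_it as above, deposit step only) shows that the load, the addition, and the store are separate instructions, so the operating system can switch threads between any of them:

```python
import dis

balance = 0

def change_it(n):
    global balance
    balance = balance + n

# The load of balance, the addition, and the store back to balance
# are distinct bytecode instructions; a thread switch can land
# between any two of them.
dis.dis(change_it)
```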


Normal execution looks like this:

initial value: balance = 0
t1: x1 = balance + 5  # x1 = 0 + 5 = 5
t1: balance = x1      # balance = 5
t1: x1 = balance - 5  # x1 = 5 - 5 = 0
t1: balance = x1      # balance = 0
t2: x2 = balance + 8  # x2 = 0 + 8 = 8
t2: balance = x2      # balance = 8
t2: x2 = balance - 8  # x2 = 8 - 8 = 0
t2: balance = x2      # balance = 0
final result: balance = 0


But in fact the two threads may run alternately, like this:

initial value: balance = 0
t1: x1 = balance + 5  # x1 = 0 + 5 = 5
t2: x2 = balance + 8  # x2 = 0 + 8 = 8
t2: balance = x2      # balance = 8
t1: balance = x1      # balance = 5
t1: x1 = balance - 5  # x1 = 5 - 5 = 0
t1: balance = x1      # balance = 0
t2: x2 = balance - 8  # x2 = 0 - 8 = -8
t2: balance = x2      # balance = -8
final result: balance = -8


In short, because modifying balance requires multiple statements, and a thread may be interrupted between them, multiple threads can leave the shared object in an inconsistent state.

To make sure the calculation is correct, we need to put a lock around change_it(). Once a thread holds the lock, other threads must wait until it finishes and releases the lock before they can execute the function. Since there is only one lock, no matter how many threads there are, at most one thread holds it at any moment. This is implemented with the Lock class from the threading module.

The code is modified as follows:

from threading import Thread, Lock

# Suppose this is your bank account:
balance = 0
lock = Lock()

def change_it(n):
    # deposit first, then withdraw; the result should be 0:
    global balance
    balance = balance + n
    balance = balance - n

def run_thread_lock(n):
    for i in range(100000):
        # acquire the lock first:
        lock.acquire()
        try:
            # modify the shared variable safely:
            change_it(n)
        finally:
            # make sure to release the lock afterwards:
            lock.release()

def lock_multi_thread():
    t1 = Thread(target=run_thread_lock, args=(5,))
    t2 = Thread(target=run_thread_lock, args=(8,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print(balance)

if __name__ == '__main__':
    lock_multi_thread()
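
The acquire/try/finally/release pattern is common enough that Lock supports the with statement, which does exactly the same thing more concisely. A sketch equivalent to run_thread_lock above:

```python
from threading import Thread, Lock

balance = 0
lock = Lock()

def change_it(n):
    global balance
    balance = balance + n
    balance = balance - n

def run_thread_with(n):
    for i in range(100000):
        # 'with' acquires the lock on entry and releases it on exit,
        # even if change_it() raises an exception
        with lock:
            change_it(n)

t1 = Thread(target=run_thread_with, args=(5,))
t2 = Thread(target=run_thread_with, args=(8,))
t1.start()
t2.start()
t1.join()
t2.join()
print(balance)  # always 0
```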


Unfortunately, Python cannot take full advantage of multithreading. To see why, write an endless loop and watch the process's CPU usage in the task manager.

Normally, two dead-loop threads on a multi-core CPU should drive usage to 200%, i.e. fully occupy two cores.

To saturate all N cores of a CPU, you would have to start N dead-loop threads.

The dead-loop code is as follows:

import threading, multiprocessing

def loop():
    x = 0
    while True:
        x = x ^ 1

for i in range(multiprocessing.cpu_count()):
    t = threading.Thread(target=loop)
    t.start()


On a 4-core CPU, however, you can observe that CPU usage stays around 102%, i.e. only one core is used.

Rewrite the same dead loop in another language such as C, C++, or Java, and it saturates all cores directly: a 4-core machine runs at 400%, an 8-core machine at 800%. Why can't Python?

Because although Python's threads are real threads, the interpreter executes code under the GIL, the Global Interpreter Lock. Before any Python thread runs, it must acquire the GIL; the interpreter then releases it periodically (after every 100 bytecode instructions in old versions of CPython, and after a short time interval in Python 3) to give other threads a chance to run. This global lock effectively serializes the execution of all threads' Python code, so multiple threads in Python can only run alternately: even 100 threads on a 100-core CPU can use only one core.

The GIL is a historical legacy of the Python interpreter's design. The interpreter we usually use is the official implementation, CPython; to really use multiple cores, you would have to use an interpreter without a GIL.

Although multithreading cannot fully exploit multiple cores, it still improves program efficiency considerably. If you want true multi-core parallelism, you can achieve it with multiple processes: each Python process has its own GIL, so they do not affect one another.
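
A minimal sketch of spreading a CPU-bound function across all cores with multiprocessing.Pool (the function heavy() here is an illustrative stand-in for real computation):

```python
from multiprocessing import Pool, cpu_count

def heavy(n):
    # CPU-bound work: sum of squares below n
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # each worker process has its own interpreter and its own GIL,
    # so the tasks really run in parallel on separate cores
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(heavy, [100000] * cpu_count())
    print(len(results))
```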

ThreadLocal

When using multithreading, it is better for a thread to work with its own local variables than with globals. As explained above, without locking, multiple threads may change a global variable's value unpredictably, whereas a local variable is visible only to the thread that owns it and does not affect other threads.

However, local variables have their own problem: passing them between function calls is cumbersome, as shown below:

def process_student(name):
    std = Student(name)
    # std is a local variable, but every function needs it,
    # so it must be passed in explicitly:
    do_task_1(std)
    do_task_2(std)

def do_task_1(std):
    do_subtask_1(std)
    do_subtask_2(std)

def do_task_2(std):
    do_subtask_2(std)
    do_subtask_2(std)


Passing local variables down through every layer of function calls is troublesome. Is there a better way?

One idea is to use a global dict keyed by the current thread object, as in the code below:

import threading

global_dict = {}

def std_thread(name):
    std = Student(name)
    # put std into the global variable global_dict:
    global_dict[threading.current_thread()] = std
    do_task_1()
    do_task_2()

def do_task_1():
    # no need to pass std in; look it up by the current thread:
    std = global_dict[threading.current_thread()]
    ...

def do_task_2():
    # any function can find the current thread's std variable:
    std = global_dict[threading.current_thread()]
    ...


This approach works in principle and avoids passing the local variable through every layer of functions, but the lookup code is not elegant. The threading module provides local(), which handles this automatically:

import threading

# create a global ThreadLocal object:
local_school = threading.local()

def process_student():
    # get the student associated with the current thread:
    std = local_school.student
    print('Hello, %s (in %s)' % (std, threading.current_thread().name))

def process_thread(name):
    # bind the ThreadLocal's student attribute:
    local_school.student = name
    process_student()

t1 = threading.Thread(target=process_thread, args=('Alice',), name='Thread-A')
t2 = threading.Thread(target=process_thread, args=('Bob',), name='Thread-B')
t1.start()
t2.start()
t1.join()
t2.join()


Running results:

Hello, Alice (in Thread-A)
Hello, Bob (in Thread-B)


The code defines a global variable local_school, which is a ThreadLocal object. Every thread can read and write its student attribute without affecting other threads, and there is no locking to manage: ThreadLocal handles that internally.

The most common use of ThreadLocal is to bind each thread to its own database connection, HTTP request, user identity, and so on, so that every function a thread calls can conveniently access these resources.
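
A sketch of the database-connection pattern (open_connection here is a stub standing in for a real database connect call):

```python
import threading

conn_holder = threading.local()

def open_connection():
    # stand-in for a real database connect call
    return object()

def get_connection():
    # created lazily, at most once per thread; no lock is needed
    # because conn_holder's attributes are private to each thread
    if not hasattr(conn_holder, 'conn'):
        conn_holder.conn = open_connection()
    return conn_holder.conn
```

Any function the thread calls can use get_connection() and transparently receive that thread's own connection.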

Processes vs. threads

We have now seen both multiprocess and multithreaded implementations. Which one should we choose for concurrent programming, and what are the pros and cons of each?

Multitasking is usually designed in a Master-Worker pattern: the Master distributes tasks and the Workers execute them, so a multitasking environment typically has one Master and multiple Workers.

With a multiprocess Master-Worker, the main process is the Master and the other processes are the Workers.

With a multithreaded Master-Worker, the main thread is the Master and the other threads are the Workers.
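
A threaded Master-Worker sketch using the standard queue module: the main thread (Master) puts tasks on a queue, worker threads take them off and report results (the "task" here is just squaring a number):

```python
import queue
import threading

task_q = queue.Queue()
result_q = queue.Queue()

def worker():
    # Worker: take tasks off the queue until the sentinel arrives
    while True:
        n = task_q.get()
        if n is None:
            break
        result_q.put(n * n)

# Master: start the workers, hand out tasks, then send one sentinel each
workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for n in range(10):
    task_q.put(n)
for _ in workers:
    task_q.put(None)
for w in workers:
    w.join()

results = sorted(result_q.get() for _ in range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```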

The biggest advantage of multiple processes is stability: if a child process crashes, the main process and the other children are unaffected. (If the main process crashes, everything goes down, of course, but the Master only distributes tasks, so its probability of crashing is low.) The famous Apache web server originally used the multi-process model.

The disadvantages are:

  • Creating a process is expensive, especially on Windows; on Unix/Linux, where the system can use fork(), the cost is acceptable.

  • The number of processes the operating system can run at the same time is limited by memory and the CPU.

Multithreading is usually slightly faster than multiprocessing, but not by much. Its drawback is poor stability: all threads share the process's memory, so one crashed thread may bring down the whole process. On Windows, if the code a thread is executing goes wrong, you often see a message like "This program has performed an illegal operation and will be shut down" — frequently only one thread had a problem, but the operatingating system forcibly terminates the entire process.

Process/thread switching

Whichever multitasking model you adopt, the first thing to note is that once there are too many tasks, efficiency stops improving, mainly because switching between processes or threads has a cost.

When the operating system switches processes or threads, the flow is roughly:

  • First, save the current execution context (CPU register state, memory pages, etc.).

  • Then, prepare the execution environment of the new task (restore its register state, switch memory pages, etc.).

  • Finally, run the new task.

Each switch is fast, but it still takes time. With thousands of tasks, the operating system may spend most of its time switching and have little time left to actually run them. The most familiar symptom is the hard disk thrashing while windows stop responding to clicks: the system is effectively hung.

Compute-intensive vs. I/O-intensive

The second factor to consider in multitasking is the type of task. Tasks fall into two categories: compute-intensive and I/O-intensive.

Compute-intensive tasks are characterized by heavy computation that consumes CPU resources, such as video encoding/decoding or format conversion; they depend on the CPU's computing power. You can multitask them, but the more tasks there are, the more time is spent switching, and the less efficiently the CPU works. Because compute-intensive tasks mainly consume CPU, a scripting language like Python executes them inefficiently; C is far better suited, and we mentioned earlier Python's mechanism for embedding C/C++ code. If you must handle them in Python, use multiple processes, preferably with the number of tasks equal to the number of CPU cores.

Apart from compute-intensive tasks, anything involving the network or storage I/O can be regarded as I/O-intensive. Such tasks consume little CPU: most of their time is spent waiting for I/O operations to complete (because I/O is far slower than the CPU and memory). For I/O-intensive tasks, multitasking reduces the time wasted waiting and keeps the CPU busy; multithreading is the usual choice here.
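
For I/O-bound work, the standard library's concurrent.futures provides a ready-made thread pool. A sketch that simulates five slow I/O operations with sleep — because the waits overlap across threads, the whole batch takes about 0.2 seconds rather than 1 second:

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep, time

def fake_io(n):
    sleep(0.2)  # stands in for a network or disk wait
    return n

start = time()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fake_io, range(5)))
elapsed = time() - start
print(results)  # [0, 1, 2, 3, 4]
```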

Asynchronous I/O

The most important improvement modern operating systems have made to I/O is support for asynchronous I/O. Exploiting it fully lets a single-process, single-thread model perform multiple tasks; this new model is called the event-driven model. Nginx is a web server that supports asynchronous I/O: on a single-core CPU its single-process model serves many tasks efficiently, and on a multi-core CPU you can run as many processes as there are cores to use them all. Server-side programs written in Node.js work the same way; this is a major trend in multitasking programming.

In Python, the single-thread + asynchronous I/O programming model is called a coroutine. With coroutine support, you can write efficient, event-driven multitask programs. The biggest advantage of coroutines is execution efficiency: switching between subroutines is not a thread switch but is controlled by the program itself, so there is no thread-switching overhead. The second advantage is that no multithreaded locking is needed: with only one thread there are no simultaneous-write conflicts, so shared resources can be managed just by checking state, which makes execution much faster than multithreading. To exploit multiple cores as well, the simplest approach is multiprocess + coroutine: full use of the cores plus the efficiency of coroutines yields very high performance.
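
A minimal asyncio sketch of the earlier download scenario: the two "downloads" wait concurrently inside a single thread (short sleeps stand in for real network waits):

```python
import asyncio

async def download(filename, seconds):
    # await yields control to the event loop, so both downloads
    # wait at the same time inside one thread
    await asyncio.sleep(seconds)
    return '%s done' % filename

async def main():
    return await asyncio.gather(
        download('Python.pdf', 0.2),
        download('nazha.mkv', 0.2),
    )

print(asyncio.run(main()))  # finishes in about 0.2s, not 0.4s
```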


Reference resources

  • https://www.liaoxuefeng.com/wiki/1016959663602400/1017627212385376

  • https://github.com/jackfrued/Python-100-Days/blob/master/Day01-15/13.%E8%BF%9B%E7%A8%8B%E5%92%8C%E7%BA%BF%E7%A8%8B.md

  • https://www.runoob.com/python3/python3-multithreading.html

 
