Java, Go, Rust: Who Will Win in the Era of High Concurrency?

In the dramatic ups and downs of the Internet traffic era, many tech giants have failed in the face of traffic spikes, and news of games or apps collapsing keeps coming out. Serverless (what is serverless architecture: https://newrelic.com/blog/best-practices/what-is-serverless-architecture), with its elastic capacity for rapid scaling, can calmly handle this kind of colossal traffic, which has put the new technology in the limelight.

Behind the noise of Serverless, Rust seems to be the most popular language, but there are many patterns and lessons worth summarizing under the topic of high concurrency, especially around professional programming frameworks such as Tokio and RxJava, which are very helpful for programmers writing high-performance programs. To discuss high concurrency in depth, this article will focus on Java, C, Go, and Rust, each of which has its own niche when it comes to high concurrency (what is concurrency: https://quietbookspace.com/chapter-1-concurrency-overview-in-c-sharp/).

In fact, concurrency and parallelism are two completely different things. Parallelism means one core is responsible for one task, and its foundation is a multi-core execution architecture. Concurrency is the alternating execution of multiple tasks. That is to say, high concurrency squeezes the performance of the system to its limit: during the window periods spent waiting for IO to return, the CPU is kept working at full load, so that a single core produces the effect of multiple cores.
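To make the difference concrete, here is a minimal Java sketch (the class name and the slowIo helper are hypothetical, purely for illustration) that overlaps two simulated IO waits: the elapsed time comes out close to one wait rather than the sum of both, because the system is free to schedule other work during each wait window.

import java.util.concurrent.CompletableFuture;

public class ConcurrencyDemo {
    // Simulate a blocking IO call that takes roughly `millis` milliseconds.
    static int slowIo(int millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return millis;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Both "IO" waits are in flight at the same time.
        CompletableFuture<Integer> a = CompletableFuture.supplyAsync(() -> slowIo(300));
        CompletableFuture<Integer> b = CompletableFuture.supplyAsync(() -> slowIo(300));
        int total = a.join() + b.join();
        // Elapsed time is close to 300ms, not 600ms: the waits overlap.
        System.out.println("sum=" + total + ", elapsed=" + (System.currentTimeMillis() - start) + "ms");
    }
}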

The Go Programming Language

Unlike Java, Rust, etc., Go is a faction of its own that doesn't need a high concurrency framework, because Go itself is a compelling high concurrency framework. One's first impression of Go is that it is extreme: it has strict requirements on code simplicity, importing packages that are not used in the code is strictly prohibited, and unused variables must also be deleted.

There are many excellent examples of the Go language: Docker, K8s, TiDB, BFE, and so on. However, even without referring to these successful open source projects, the official demos alone can make a few lines of Go perform surprisingly well. With a very limited number of lines of code, Go can achieve what other languages need a whole framework for.

Using Go makes it easy for programmers to create powerful applications, but this simplicity and ease of use has led many developers to mistake the language's efficiency for their own coding prowess. If you look deeply into Go, however, you will find many details hidden behind Goroutine, its high-concurrency artifact. Here are two examples.

In the following code, we launch a goroutine that calls i++ in an infinite loop, continuously adding 1 to the variable i.

package main

import (
    "fmt"
    "time"
)

func main() {
    var i int32
    go func() {
        for {
            i++
        }
    }()
    time.Sleep(time.Second)
    fmt.Println("i=", i)
}

But no matter how long the main goroutine waits, the output will always be i=0.

This is a cache/memory barrier problem. The CPU's operations on variable i are confined to the cache and are never flushed into main memory, so the main goroutine prints only the initial value of 0.

The fix is mind-bogglingly simple: just add to the goroutine's body an if branch that can never execute.

package main

import (
    "fmt"
    "time"
)

func main() {
    var i int32
    z := 0
    go func() {
        for {
            if z > 0 {
                fmt.Println("z is", z)
            }
            i++
        }
    }()
    time.Sleep(time.Second)
    fmt.Println("i=", i)
}

If you look at the assembly with a disassembler, you can see that the if branch implicitly introduces a write barrier operation.

Even though this if branch is never executed, as long as the fragment exists, the goroutine performs the write barrier operation when it is scheduled out of execution, thus flushing the variable from the cache into main memory. This mechanism can hide bugs that are very difficult to troubleshoot.

Closure capture gets slice element values wrong: in daily work, when we want to process each element of a slice or array independently, we are very likely to create a closure in a goroutine that takes one element at a time and handles it alone. But if you don't follow best practice, carelessly written code like the following hides a trap:

package main

import (
    "fmt"
    "time"
)

func main() {
    testSlice := []int{1, 2, 3, 4, 5}
    for _, v := range testSlice {
        go func() {
            fmt.Println(v)
        }()
    }
    time.Sleep(time.Millisecond)
}

Instead of printing each element, the code above is likely to print a single value repeatedly, such as five 3s or five 5s in a row, because all the goroutines share the same loop variable v.

To solve this problem, we need to force value passing, as follows:

go func(v int) {
    fmt.Println(v)
}(v)

These difficult and complicated Go ailments are not our focus today. What the author wants to convey is that the Go language is easy to use, but hard to use to its essence and to its limit. That is what makes Go so interesting: it is easy to start and quick to produce results, but to become a master, you still have a long way to go.

The Concepts of Poll, Epoll, and Future in High Concurrency

Having discussed Go, the maverick of the group, let us return to several important concepts in high concurrency. Not every language we focus on today makes Future its main approach, but the concepts of Future and Poll are so essential that we have to address them first. Since, compared with Java, Rust and Go have more complete Future implementations and more thorough support for its features, the following code uses Rust as an example.

To put it simply, a Future is not a value but a value type: a type whose value will only become available at some point in the future. A Future object must implement the std::future::Future trait in the Rust standard library, whose Output is the value that will not be produced until the Future completes. In Rust, the executor drives a Future forward by calling Future::poll.

A Future is essentially a state machine, and Futures can be nested. In the following example, we instantiate MainFuture and call .await on it. Besides migrating between its own states, MainFuture calls a Delay future, thus achieving Future nesting.

MainFuture uses State0 as its initial state. When the scheduler calls the poll method, MainFuture tries to advance its state as far as possible. Poll::Ready is returned if the Future has completed, or Poll::Pending is returned if MainFuture cannot complete because the Delay it is waiting for is not ready. When the scheduler receives a Pending result, it puts MainFuture back on the queue to be scheduled and later calls poll again to advance execution. Details are as follows:

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::{Duration, Instant};

struct Delay {
    when: Instant,
}

impl Future for Delay {
    type Output = &'static str;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<&'static str> {
        if Instant::now() >= self.when {
            println!("Hello world");
            Poll::Ready("done")
        } else {
            // Busy-wake: ask the executor to poll this task again immediately.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

enum MainFuture {
    State0,
    State1(Delay),
    Terminated,
}

impl Future for MainFuture {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        use MainFuture::*;

        loop {
            match *self {
                State0 => {
                    let when = Instant::now() + Duration::from_millis(1);
                    let future = Delay { when };
                    println!("init status");
                    *self = State1(future);
                }
                State1(ref mut my_future) => {
                    match Pin::new(my_future).poll(cx) {
                        Poll::Ready(out) => {
                            assert_eq!(out, "done");
                            println!("delay finished, this future is ready");
                            *self = Terminated;
                            return Poll::Ready(());
                        }
                        Poll::Pending => {
                            println!("not ready");
                            return Poll::Pending;
                        }
                    }
                }
                Terminated => {
                    panic!("future polled after completion")
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let main_future = MainFuture::State0;
    main_future.await;
}

There is an obvious problem with this implementation of the Future. As the output makes clear, the executor performs many poll operations while waiting. Ideally, poll should be called only when the Future can actually make progress; polled in a constant loop, it degrades into an inefficient select. We'll cover epoll below, so we won't go into details here.

The solution lies in the Context passed to the poll function, namely the Future's waker(). Calling wake signals to the executor that the task should be polled again. The correct approach is to call wake exactly when the Future's state has advanced, notifying the executor at the right moment. This requires changing the Delay part of the code (with std::thread in scope):

let waker = cx.waker().clone();
let when = self.when;

// Spawn a timer thread that calls wake once the deadline has passed.
thread::spawn(move || {
    let now = Instant::now();
    if now < when {
        thread::sleep(when - now);
    }
    waker.wake();
});

Whatever the high concurrency framework, it is essentially a scheduler built on this Task/Poll mechanism, and poll essentially monitors the execution status of a chain of tasks.

With a proper poll mechanism, we can avoid the scheduling algorithm in which the event loop periodically traverses the whole event queue. The essence of poll is to notify the handler corresponding to an event as soon as that event becomes ready, and when developing applications on a poll-based framework such as Tokio, programmers need not care about the underlying message passing at all.

Just use the and_then and spawn methods to set up the task chain and the system gets to work. The well-known epoll multiplexing in Linux is a poll-based high-concurrency mechanism that allows one thread to monitor the status of multiple tasks and, once a task's descriptor becomes ready, to notify the corresponding handler to carry out subsequent operations.
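In Java terms, the closest standard-library analogue to this kind of task chain is CompletableFuture. The sketch below (plain JDK, not Tokio; the values are arbitrary) registers each stage as a callback that runs only once the previous stage's value is ready:

import java.util.concurrent.CompletableFuture;

public class ChainDemo {
    public static void main(String[] args) {
        // Each stage runs only when the previous stage's result is ready,
        // much like chaining and_then on a future.
        CompletableFuture<String> chain = CompletableFuture
                .supplyAsync(() -> 21)          // produce an initial value
                .thenApply(x -> x * 2)          // transform it when ready
                .thenApply(x -> "answer=" + x); // transform again

        System.out.println(chain.join()); // prints "answer=42"
    }
}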

To put it simply, a Future is a value type whose value can be obtained in the future; poll is the method that advances a Future's state migration; and epoll is a multiplexing mechanism that monitors the states of multiple Futures/Tasks using only one thread.
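For Java readers, the same multiplexing idea is exposed by java.nio's Selector, which on Linux is typically implemented on top of epoll. Below is a minimal, illustrative accept loop (the port and class name are arbitrary) in which one thread watches many channels and reacts only to the ready ones:

import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectorDemo {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        // One thread monitors many descriptors, epoll-style.
        while (true) {
            selector.select(); // blocks until at least one channel is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    // Hand the ready channel to its corresponding handler here.
                }
            }
        }
    }
}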

The C Programming Language

There are numerous high-concurrency products written in C. Classic operating systems and databases, from Linux to Redis, are basically developed in C. Even epoll, the high-concurrency gem in Linux we just mentioned, is essentially a C program. The philosophy of C is to trust the programmer. The language has neither syntactic sugar nor strict compile-time checks, so it yields little productivity if you can't truly master it.

However, C has the highest ceiling of any language we'll talk about today, because C has neither a virtual machine nor a garbage collector. Its only limit is the physical limit of the computer.

C is the first language of most programmers in the IT world. Here we take TDengine's connection cache as an example and give a brief interpretation. TaosCache works as follows:

1. Cache initialization (taosOpenConnCache): first initialize the cache object SConnCache, then initialize the hash table connHashList, and call taosTmrReset to reset the timer.

2. Adding a connection to the cache (taosAddConnIntoCache): first compute a hash value from the IP, port, and username, then add the connection info (connInfo) to the pNode node of connHashList corresponding to that hash. pNode itself is a doubly linked list, and connInfo entries with the same hash value are sorted into it by the time they were added. Note that pNode is a node of the connHashList hash table and is itself a linked list. The code is as follows:

void *taosAddConnIntoCache(void *handle, void *data, uint32_t ip, short port, char *user) {
    int         hash;
    SConnHash  *pNode;
    SConnCache *pObj;

    uint64_t time = taosGetTimestampMs();

    pObj = (SConnCache *)handle;
    if (pObj == NULL || pObj->maxSessions == 0) return NULL;
    if (data == NULL) {
        tscTrace("data:%p ip:%p:%d not valid, not added in cache", data, ip, port);
        return NULL;
    }

    hash = taosHashConn(pObj, ip, port, user);  // compute the hash value from ip, port, and user
    pNode = (SConnHash *)taosMemPoolMalloc(pObj->connHashMemPool);
    pNode->ip = ip;
    pNode->port = port;
    pNode->data = data;
    pNode->prev = NULL;
    pNode->time = time;

    pthread_mutex_lock(&pObj->mutex);

    // Insert the connection info at the head of pNode's linked list.
    pNode->next = pObj->connHashList[hash];
    if (pObj->connHashList[hash] != NULL) (pObj->connHashList[hash])->prev = pNode;
    pObj->connHashList[hash] = pNode;

    pObj->total++;
    pObj->count[hash]++;
    taosRemoveExpiredNodes(pObj, pNode->next, hash, time);

    pthread_mutex_unlock(&pObj->mutex);

    tscTrace("%p ip:0x%x:%d:%d:%p added, connections in cache:%d", data, ip, port, hash, pNode, pObj->count[hash]);

    return pObj;
}

3. Fetching a connection from the cache (taosGetConnFromCache): compute the hash value from the IP, port, and username, fetch the pNode corresponding to connHashList[hash], and then take from pNode the element whose IP and port match.

Java's RxJava: the Most Balanced Sword

High-concurrency products written in Java are no less impressive than those in C: Kafka, RocketMQ, and many other classics are Java masterpieces. Compared to Go and C, Java is not too difficult to get started with, and headache-inducing pointer problems and memory leaks are largely absent in the Java world thanks to the garbage collector.

With the JVM, the Java language's lower bound is so high that even beginner programmers can be more productive with Java than intermediate programmers with C. But the JVM is also a limitation: the upper limit of Java is not as high as that of C and Rust. Still, it cannot be denied that Java is the most balanced language in terms of learning difficulty, productivity, performance, and memory consumption.


At present, RxJava is the most popular high concurrency framework for Java. Because Java is so widespread, there are plenty of walkthroughs of it online, so we won't enumerate its code in depth here; instead, at the end of this article, we use Java as the example for discussing the pitfalls that may exist in high concurrency.
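For a taste of the style, here is a minimal sketch (assuming RxJava 3 on the classpath; the numbers and scheduler choices are arbitrary): the pipeline declares what happens to each element, while the schedulers decide which thread pool does the work.

import io.reactivex.rxjava3.core.Observable;
import io.reactivex.rxjava3.schedulers.Schedulers;

public class RxDemo {
    public static void main(String[] args) throws InterruptedException {
        Observable.range(1, 5)
                .subscribeOn(Schedulers.io())        // produce on an IO thread pool
                .map(i -> i * i)                     // transform each element
                .observeOn(Schedulers.computation()) // consume on a computation pool
                .subscribe(sq -> System.out.println("square=" + sq));

        Thread.sleep(500); // keep the JVM alive long enough to see the output
    }
}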

Rust’s Tokio

Rust is a new language that has risen in recent years along with Serverless. On the surface it looks like C, with no JVM and no GC garbage collector. On closer inspection, however, it is not C: the Rust compiler tries to kill the bugs in your program at the build stage, before the executable is ever produced. Because there is no GC, Rust has created unique lifetime and borrowing mechanisms for variables, and developers must always be careful about the lifetimes of their variables.

In Rust, the sender of a multi-producer channel must be cloned before it is used, and a Mutex-protected hash map must be locked and unwrapped before it is accessed. These usages are entirely different from Java and Go, but the goroutine problems we saw earlier simply do not appear in Rust: the equivalent Go-style code would not pass Rust's variable lifetime checks and would fail to compile.

So Rust is tough to get started with, but if you can get started and write a program that compiles, it is very likely to behave correctly. It is an extreme language, with a very high barrier to entry and a very high ceiling.

The most representative of Rust's high concurrency frameworks is Tokio, on which the Future example earlier in this article was built, so I won't go into details here.

According to Tokio's documentation, each Tokio task is only 64 bytes in size, which is orders of magnitude more efficient than issuing network requests directly from native threads. With the help of a high concurrency framework, developers can fully exploit the performance of their hardware.

Be Especially Careful of the Pits in High Concurrency

Whether it is RxJava, Tokio, or Goroutine, however powerful the high concurrency framework, some common problems need special attention when pursuing extreme performance. Here are a few examples.

1. Pay attention to branch prediction: we know that modern CPUs execute instructions in a pipeline; that is, the CPU puts code that may be executed in the future onto the pipeline in advance for decoding and other processing. But when it reaches a code branch, the CPU needs prediction to know which instruction is likely to execute next.

A typical example of branch prediction can be seen in the following code:

public class Main {
    public static void main(String[] args) {
        long timeNow = System.currentTimeMillis();
        int max = 100, min = 0;
        long a = 0, b = 0, c = 0;
        for (int j = 0; j < 10000000; j++) {
            int ran = (int) (Math.random() * (max - min) + min);
            switch (ran) {
                case 0:
                    a++;
                    break;
                case 1:
                    b++;
                    break;
                default:
                    c++;
            }
        }
        long timeDiff = System.currentTimeMillis() - timeNow;
        System.out.println("a is " + a + ", b is " + b + ", c is " + c);
        System.out.println("total time: " + timeDiff + "ms");
    }
}

If you change max in the code above from 100 to 5, execution time increases by at least 30%. That's because with max = 5, ran only ranges from 0 to 4, so the probability distribution across the branches is much more balanced and no single branch dominates; branch prediction therefore fails frequently, reducing CPU execution efficiency. This issue deserves attention in highly concurrent programming scenarios.

2. Align variables with cache lines: at present, the major high concurrency frameworks are all based on the multiplexing mechanism, and programmers basically do not need to care about task scheduling on each CPU core. But in multi-core scenarios, programmers do need to align variables to the cache line size as far as possible, to avoid the false-sharing problem in which the CPUs keep invalidating each other's caches. For example, in the following code, two threads operate on members [0] and [1] of the array arr.

public class Main {
    public static void main(String[] args) {
        final MyData data = new MyData();
        new Thread(new Runnable() {
            public void run() {
                data.add(0);
            }
        }).start();

        new Thread(new Runnable() {
            public void run() {
                data.add(1);
            }
        }).start();

        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        long[] arr = data.getItem();
        System.out.println("arr0 is " + arr[0] + ", arr1 is " + arr[1]);
    }
}

class MyData {
    private long[] arr = {0, 0};

    public long[] getItem() {
        return arr;
    }

    public void add(int j) {
        while (true) {
            arr[j]++;
        }
    }
}

However, if arr is changed into a two-dimensional array and the operated variable is changed from arr[j] to arr[j][0], so that each thread's hot counter sits on its own cache line, the program's running efficiency improves significantly.
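A minimal sketch of that fix (PaddedData is a hypothetical rewrite of the MyData class above; a typical 64-byte cache line is assumed):

class PaddedData {
    // Each row is an independently allocated long[8] (64 bytes of data),
    // so the hot counters arr[0][0] and arr[1][0] no longer share a
    // cache line and the two threads stop invalidating each other.
    private long[][] arr = new long[2][8];

    public long getItem(int j) {
        return arr[j][0];
    }

    public void add(int j) {
        while (true) {
            arr[j][0]++;
        }
    }
}

JDK 8 and later also ship a @Contended annotation for the same purpose, though it only takes effect for application classes when run with -XX:-RestrictContended.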

Performance and efficiency are the eternal pursuits of programmers. Each language, whether C, Java, Rust, or Go, has its own niche. Java pursues stability and balance in all respects; Rust is recommended for teams that pursue ultimate performance; heroic individual geniuses are better off with C. As long as developers choose the framework that suits them, strictly follow best practices, and pay attention to details such as branch prediction and variable alignment, they can achieve outstanding performance.
