A Strange Header Propagation Issue

Ahmet Geymen
6 min read · Jan 16, 2022


I want to walk through a problem I ran into, why it occurred, and how it was resolved. It was a case where WebClient, Spring Cloud Sleuth, Brave, and Kotlin coroutines were all intertwined.

Photo by Jack Anstey on Unsplash

TL;DR, for readers who want to skip the whole story

  • To keep the Sleuth tracing feature, be sure to use the WebClient or WebClient.Builder as a Bean.
  • If you have baggage header fields propagated via Brave, try not to make requests after switching threads.
  • When communicating with more than one service, if different ‘property naming strategies’ come into play, reconsider configuring your codecs on a common builder. Especially with lazily created clients, the client that sets the codec may inadvertently impose it on other clients through this common builder. This may not be the desired behavior.

Here the journey begins

First of all, the problem can be stated simply. Although a language was selected on the client side before the request, the content in the response no longer reflected that preference. We had made an update on the gateway, but what caused this unexpected behavior? It used to work, so why was it broken now?

Let’s take a look at how the system was set up to work correctly. The ‘Accept-Language’ header of the HTTP request made by the mobile client was passed through a gateway to the internal microservice that has access to the database.

Basic representation of the architecture

Even though nothing had changed in the client-side code, the behavior had changed. The first conclusion to draw is that the Accept-Language header was being lost somewhere on the way from the client to “service b”. The component responsible for this transfer is the tracing library spring-cloud-sleuth, which we use on our gateway.

Among Sleuth’s abilities is a propagation feature known as ‘Baggage’, which lets us carry additional fields of our choosing, such as HTTP headers, along with a request. It is set with the configuration property spring.sleuth.baggage.remote-fields. This is how we had the Accept-Language header value forwarded.

So what was the code change that stopped the header from being propagated? Let’s inspect the code that creates the WebClients on our gateway.

On the gateway, we create and use the WebClient through the autowired WebClient.Builder. That setup, however, is for “service a”. What about “service b”, where the header actually needs to arrive?
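The original snippet is not reproduced here. As a minimal sketch of this kind of setup, assuming a Spring Boot gateway written in Kotlin: the class name `ServiceAClientConfig`, the bean name `serviceAWebClient`, and the base URL are hypothetical and only serve the illustration.

```kotlin
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.web.reactive.function.client.WebClient

// Hypothetical configuration class; names and URL are illustrative only.
// proxyBeanMethods = false lets the class stay final in Kotlin.
@Configuration(proxyBeanMethods = false)
class ServiceAClientConfig {

    // The injected WebClient.Builder is the Spring-managed one, so whatever
    // was applied to that bean (for example tracing instrumentation) is
    // carried into the client built from it.
    @Bean
    fun serviceAWebClient(builder: WebClient.Builder): WebClient =
        builder
            .baseUrl("http://service-a.internal") // assumed address
            .build()
}
```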

Our update on the gateway was to use “service b” instead of “service a” for certain requests. In our case, “service b” is written with a different framework than “service a”, and it uses a different property naming strategy in request bodies. To comply with this service, the gateway has to use a matching naming strategy, “snake_case”, for that client.
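The original snippet is not reproduced here. A minimal sketch of how snake_case JSON codecs might be registered on a builder, assuming Jackson 2.12 or later; the extension function `withSnakeCaseJson` is a hypothetical helper, not part of Spring.

```kotlin
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.PropertyNamingStrategies
import org.springframework.http.MediaType
import org.springframework.http.codec.json.Jackson2JsonDecoder
import org.springframework.http.codec.json.Jackson2JsonEncoder
import org.springframework.web.reactive.function.client.WebClient

// Hypothetical helper: registers snake_case JSON codecs on the receiver.
// Note that codecs(...) configures the builder it is called on, so every
// client built from that same builder afterwards gets these codecs too.
fun WebClient.Builder.withSnakeCaseJson(): WebClient.Builder {
    // A plain ObjectMapper keeps the sketch small; a Kotlin project would
    // likely also register the Jackson Kotlin module.
    val mapper = ObjectMapper()
        .setPropertyNamingStrategy(PropertyNamingStrategies.SNAKE_CASE)
    return codecs { configurer ->
        configurer.defaultCodecs()
            .jackson2JsonEncoder(Jackson2JsonEncoder(mapper, MediaType.APPLICATION_JSON))
        configurer.defaultCodecs()
            .jackson2JsonDecoder(Jackson2JsonDecoder(mapper, MediaType.APPLICATION_JSON))
    }
}
```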

When codecs were involved, or when we preferred not to use the autowired builder or did not want to modify an already built client, we used a method like this.

What led us to this choice was that when a codec is configured through the builder, the builder keeps that configuration in an internal list called ‘strategiesConfigurers’.

This matters especially when the clients are lazy: if the first request after the system starts goes to “service b”, the WebClient.Builder used as a Bean starts to carry the ‘naming strategy’ configured for “service b”.

When the clients of other services were then lazily created from this autowired Bean, the same ‘naming strategy’ was applied to them as well. This caused puzzling 400 responses between services. At first, we thought the solution was to use ‘WebClient.builder()’ instead of autowiring, but later realized we were wrong.
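The leak itself can be reproduced without Spring. Here is a minimal sketch with a hypothetical `FakeBuilder` that, like the builder described above, keeps its configurers in an internal mutable list. None of these names are Spring’s API; the class only imitates the behavior.

```kotlin
// Hypothetical stand-in for a mutable client builder; this is NOT Spring's API.
class FakeBuilder {
    private val strategiesConfigurers = mutableListOf<String>()

    // Like a fluent builder method, this mutates the instance and returns it.
    fun codec(name: String): FakeBuilder = apply { strategiesConfigurers += name }

    // A "client" here is just a snapshot of the configurers it was built with.
    fun build(): List<String> = strategiesConfigurers.toList()
}

// Builds the "service b" client first, then the "service a" client,
// both from the same shared builder instance.
fun buildClientsFromSharedBuilder(): Pair<List<String>, List<String>> {
    val shared = FakeBuilder()
    val serviceB = shared.codec("snake_case").build()
    val serviceA = shared.build() // no codec requested, yet it inherits one
    return serviceB to serviceA
}
```

Both lists returned by `buildClientsFromSharedBuilder()` contain `snake_case`, even though only the “service b” client asked for it.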

Instead of using the autowired WebClient.Builder, we were creating a fresh builder instance on first use. Although we thought this saved us from the codec problem, it brought us another one.

You have to register WebClient as a bean so that the tracing instrumentation gets applied. If you create a WebClient instance with a new keyword, the instrumentation does NOT work.

Using ‘WebClient.builder()’ meant that ‘TraceWebClientBeanPostProcessor’ never added the ‘TraceExchangeFilterFunction’ to the builder’s filters, which it does when the WebClient.Builder is initialized as a Bean. As a result, the ‘Accept-Language’ header was not propagated.

So, how can we ensure that other clients are not affected by the codecs while still using the filter function that comes with the Bean? Here, too, the ‘clone()’ method helped us.
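A minimal sketch of that idea, assuming the Spring-managed builder is passed in: `serviceBClient` and the base URL are hypothetical names. As far as I understand, `clone()` copies the builder’s current state, including filters already registered on it, so changes made to the copy do not reach the shared instance.

```kotlin
import org.springframework.web.reactive.function.client.WebClient

// Hypothetical factory: derives a per-service client from the shared builder.
fun serviceBClient(sharedBuilder: WebClient.Builder): WebClient =
    sharedBuilder
        .clone()                              // work on a copy; the shared builder stays untouched
        .baseUrl("http://service-b.internal") // assumed address
        // service-b-specific codecs(...) calls would go here, on the copy only
        .build()
```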

We thought the problem was completely solved after applying these fixes, but unfortunately the language problem persisted in the responses to one particular query. We were now at the point of capturing inter-service requests with Wireshark to find the cause. Why was the issue not fully resolved, and why did it show up only in this single query? We agreed that the crux was that the ‘Accept-Language’ header was not being transmitted, but where exactly was it being lost?

The trouble came from a place where we wanted to make another request in parallel while doing some work, without blocking. The Accept-Language header had to be sent with this pagination request as well.

As we can see here, the page request was made inside ‘withContext(Dispatchers.Default)’, and its response was used afterwards. There seemed to be no apparent problem. But was that really so?

This function uses dispatcher from the new context, shifting execution of the block into the different thread if a new dispatcher is specified, and back to the original dispatcher when it completes.

So a request was being made in a coroutine that ran on another thread. All right, but why is that a problem?

The problem lies below.

Brave, the tracing instrumentation library that Sleuth uses by default, loses its ‘CurrentTraceContext’ when work moves to a different thread, and with it the propagated baggage fields kept in that context.
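The general mechanism can be illustrated with a plain `ThreadLocal`. This is only an analogy for thread-bound state, not Brave’s actual implementation; `currentLanguage` and `languageSeenOnBothThreads` are hypothetical names.

```kotlin
import kotlin.concurrent.thread

// Hypothetical stand-in for thread-bound context that carries a baggage value.
val currentLanguage = ThreadLocal<String?>()

// Sets a value on the calling thread, then reads it from the calling thread
// and from a freshly started thread.
fun languageSeenOnBothThreads(): Pair<String?, String?> {
    currentLanguage.set("tr-TR")
    val onCaller = currentLanguage.get()

    var onOtherThread: String? = "unset"
    // join() guarantees the write below is visible when we read it back.
    thread { onOtherThread = currentLanguage.get() }.join()

    return onCaller to onOtherThread
}
```

Calling `languageSeenOnBothThreads()` yields `tr-TR` for the calling thread and `null` for the other one: a value bound to one thread is simply not there on another unless something copies it over.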

To fix this, we dropped the withContext approach, which moved our request to another thread, and updated the code as follows. With that, all of the problems we had faced were solved.
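The updated code is not reproduced here, so the following is only one possible shape of such a change, not necessarily what we shipped. It is a sketch, assuming kotlinx.coroutines is on the classpath, of running the page request concurrently without naming a dispatcher. `fetchConcurrently` and its two parameters are hypothetical.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope

// Hypothetical helper: starts the page request concurrently with the main
// request. No dispatcher is passed to async, so the child coroutine inherits
// the caller's coroutine context instead of being moved to Dispatchers.Default.
suspend fun <A, B> fetchConcurrently(
    fetchContent: suspend () -> A,
    fetchPage: suspend () -> B,
): Pair<A, B> = coroutineScope {
    val page = async { fetchPage() } // inherits the caller's context
    val content = fetchContent()
    content to page.await()
}
```

Whether the threads actually stay the same still depends on the dispatcher already in effect at the call site; the sketch only avoids introducing a new one.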

I really appreciate everyone who made it to this point without losing the thread or getting bored. To sum up, these are the takeaways:

  • To keep the Sleuth tracing feature, be sure to use the WebClient or WebClient.Builder as a Bean.
  • If you have baggage fields propagated via Brave, try not to make requests after switching threads.
  • When communicating with more than one service, if different ‘property naming strategies’ come into play, reconsider configuring your codecs on a common builder. Especially with lazily created clients, the client that sets the codec may inadvertently impose it on other clients through this common builder. This may not be the desired behavior.

Thank you.
