This topic keeps coming up with my customers, so the purpose of this blog post is to consolidate all the useful information in one place. If you’re stuck on what “pipelined”, “created”, or “transit time” means, you’re in the right spot.
We are not talking about Diagnostic Settings/Log Analytics in this post; I’d describe those as server-side rather than client-side diagnostics, and they are useful in different ways. Client diagnostics help us understand the path a request takes from the application to Cosmos DB and back, along with any latency along that path.
Note: before we get started, check whether you’re using Direct or Gateway mode. Sending your requests in Direct mode in the .NET or Java SDKs will usually result in faster requests.
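If you want to be explicit about it, here’s a minimal sketch of building a client in Direct mode with the Java SDK (the endpoint and key are placeholders); the .NET SDK exposes the same choice through CosmosClientOptions.ConnectionMode.

```java
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;

public class DirectModeClient {
    public static void main(String[] args) {
        // Direct mode talks to the backend replicas over TCP instead of routing
        // every request through the gateway, which usually lowers latency.
        CosmosClient client = new CosmosClientBuilder()
                .endpoint("https://<your-account>.documents.azure.com:443/") // placeholder
                .key("<your-key>")                                           // placeholder
                .directMode()        // omit this (or call .gatewayMode()) to use Gateway mode
                .buildClient();

        client.close();
    }
}
```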
How do we get a Cosmos DB Client Diagnostic?
In order to understand a diagnostic log, you need to capture it. To do so, you simply read the diagnostics object off the response of any request to Cosmos DB; every response, successful or unsuccessful, generates a diagnostic for your application (a minimal Java sketch follows the SDK links below).
- Client Diagnostics for the .NET SDK for Cosmos DB
- Client Diagnostics for the Java SDK for Cosmos DB
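The linked articles walk through each SDK in detail. As a rough illustration, here’s a minimal Java sketch (the database, container, item id, and partition key are made up) that reads the diagnostics off a successful point read and off a failed request:

```java
import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.CosmosException;
import com.azure.cosmos.models.CosmosItemResponse;
import com.azure.cosmos.models.PartitionKey;

public class DiagnosticsCapture {
    public static void main(String[] args) {
        CosmosClient client = new CosmosClientBuilder()
                .endpoint("https://<your-account>.documents.azure.com:443/") // placeholder
                .key("<your-key>")                                           // placeholder
                .directMode()
                .buildClient();

        CosmosContainer container = client.getDatabase("mydb").getContainer("orders"); // made-up names

        try {
            // Every response carries a CosmosDiagnostics object.
            CosmosItemResponse<Order> response =
                    container.readItem("order-1", new PartitionKey("customer-42"), Order.class);
            System.out.println(response.getDiagnostics());
        } catch (CosmosException e) {
            // Failed requests expose diagnostics on the exception as well.
            System.out.println(e.getDiagnostics());
        } finally {
            client.close();
        }
    }

    // Minimal POJO for the example.
    public static class Order {
        public String id;
        public String customerId;
    }
}
```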
Now that we have diagnostics in our application, let’s interpret them
At this point you should have a JSON object that contains fields like “ChannelAcquisitionStarted” and “LatencyInMS.” We can follow the descriptions here to understand each of the steps in the connection; however, a couple of notes before we begin. LatencyInMS represents the duration of the whole request, while BELatency or BackendLatency is the duration spent on the server. It’s normal for those two values to be very different.
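One pattern I like, purely optional, is to only log the (fairly verbose) diagnostic string when a request is slower than some threshold. Here’s a hedged Java sketch; the helper name and the 100 ms threshold are just illustrative, and CosmosDiagnostics.getDuration() should line up with the overall duration reported in the diagnostic string.

```java
import java.time.Duration;

import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosItemResponse;
import com.azure.cosmos.models.PartitionKey;

public class SlowRequestLogger {
    // Only log the verbose diagnostic string for slow requests.
    private static final Duration THRESHOLD = Duration.ofMillis(100); // arbitrary example threshold

    public static <T> CosmosItemResponse<T> readAndLogIfSlow(
            CosmosContainer container, String id, String partitionKey, Class<T> type) {
        CosmosItemResponse<T> response = container.readItem(id, new PartitionKey(partitionKey), type);
        Duration elapsed = response.getDiagnostics().getDuration();
        if (elapsed != null && elapsed.compareTo(THRESHOLD) > 0) {
            System.out.println("Slow read (" + elapsed.toMillis() + " ms): " + response.getDiagnostics());
        }
        return response;
    }
}
```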
From the article linked above, pasted here for convenience:
"ChannelAcquisitionStarted: The time to get or create a new connection. Connections can be created for numerous reasons such as: The previous connection was closed due to inactivity using CosmosClientOptions.IdleTcpConnectionTimeout, the volume of concurrent requests exceeds the CosmosClientOptions.MaxRequestsPerTcpConnection, the connection was closed due to a network error, or the application is not following the Singleton pattern and new instances are constantly created. Once a connection is established, it is reused for subsequent requests, so this should not impact P99 latency unless the previously mentioned issues are happening.
Pipelined: The time spent writing the request into the TCP socket. Requests can only be written to a TCP socket one at a time; a large value indicates a bottleneck on the TCP socket, which is commonly associated with threads locked by the application code or large request sizes.
Transit time: The time spent on the network after the request was written on the TCP socket. Compare this number to the BELatencyInMs. If BELatencyInMs is small, then the time was spent on the network, and not on the Azure Cosmos DB service. If the request failed with a timeout, it indicates how long the client waited without response, and the source is network latency.
Received: The time between when the response was received and when it was processed by the SDK. A large value is normally caused by thread starvation or locked threads."
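The quote mentions the singleton pattern and the .NET CosmosClientOptions knobs; in the Java SDK the rough equivalents live on DirectConnectionConfig. Here’s a sketch of a singleton client with those settings spelled out (the values shown are illustrative, not recommendations):

```java
import java.time.Duration;

import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.DirectConnectionConfig;

public final class CosmosClientHolder {
    // One client per application/process: connections are pooled and reused,
    // so ChannelAcquisitionStarted only pays the connection cost occasionally.
    private static final CosmosClient CLIENT = build();

    private CosmosClientHolder() { }

    private static CosmosClient build() {
        DirectConnectionConfig direct = DirectConnectionConfig.getDefaultConfig()
                // Roughly analogous to CosmosClientOptions.IdleTcpConnectionTimeout in .NET.
                .setIdleConnectionTimeout(Duration.ofMinutes(10))   // illustrative value
                // Roughly analogous to CosmosClientOptions.MaxRequestsPerTcpConnection in .NET.
                .setMaxRequestsPerConnection(30);                   // illustrative value

        return new CosmosClientBuilder()
                .endpoint("https://<your-account>.documents.azure.com:443/") // placeholder
                .key("<your-key>")                                           // placeholder
                .directMode(direct)
                .buildClient();
    }

    public static CosmosClient get() {
        return CLIENT;
    }
}
```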
Based on those descriptions, here are some simple next steps if you have an issue:
- If most of the time is spent in Transit Time: It’s either slow execution on the server or slow network traffic. Check BE/BackendLatency to see if it’s slow server-side. Also check request and response size to see if either one is abnormally large.
- If most of the time is before Transit: It could be a problem acquiring a connection on the network, or it could be a very busy application server. Check memory and CPU statistics to see if your client has enough resources (a quick sketch for snapshotting this follows the list).
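There’s no single right way to check client resources, but as a rough sketch, here’s one way to snapshot basic JVM numbers next to the diagnostics whenever a request looks slow (the helper and threshold are made up for illustration):

```java
import java.time.Duration;

import com.azure.cosmos.CosmosDiagnostics;

public class ClientResourceSnapshot {
    // Log basic JVM resource numbers next to a slow diagnostic so you can
    // tell "busy client" apart from "slow network/server" after the fact.
    public static void logIfSlow(CosmosDiagnostics diagnostics, Duration threshold) {
        Duration elapsed = diagnostics.getDuration();
        if (elapsed == null || elapsed.compareTo(threshold) <= 0) {
            return;
        }
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        System.out.println("Slow request: " + elapsed.toMillis() + " ms"
                + ", availableProcessors=" + rt.availableProcessors()
                + ", usedHeapMb=" + usedMb
                + ", diagnostics=" + diagnostics);
    }
}
```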
What can I do if BE/BackendLatency is high?
There’s a lot of room for tuning in Cosmos DB. My troubleshooting changes dramatically depending on the type of operation.
- For slow queries, check the indexing policy. If it’s set to the default /*, the main options are to add composite indexes or to reduce the size of the request and result set (a sketch of adjusting the policy follows this list).
- For slow creates, updates, and deletes, consider reducing the number of indexes. The default /* indexing policy can mean that requests cost more RUs than necessary, so remove any unnecessary paths after evaluating the indexes your application actually uses.
- For slow point reads, consider checking your document size. This one is the most difficult to troubleshoot, but excessively large documents take longer to transfer and deserialize.
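To make the first two bullets concrete, here’s a hedged sketch of adjusting a container’s indexing policy with the Java SDK: adding a composite index and excluding a path you never query. The paths are made up; evaluate your own query patterns before changing anything in production.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CompositePath;
import com.azure.cosmos.models.CompositePathSortOrder;
import com.azure.cosmos.models.CosmosContainerProperties;
import com.azure.cosmos.models.ExcludedPath;
import com.azure.cosmos.models.IndexingPolicy;

public class IndexingPolicyTuning {
    public static void tune(CosmosContainer container) {
        CosmosContainerProperties properties = container.read().getProperties();
        IndexingPolicy policy = properties.getIndexingPolicy();

        // Composite index to support e.g. ORDER BY c.customerId ASC, c.orderDate DESC (made-up paths).
        CompositePath byCustomer = new CompositePath()
                .setPath("/customerId").setOrder(CompositePathSortOrder.ASCENDING);
        CompositePath byDate = new CompositePath()
                .setPath("/orderDate").setOrder(CompositePathSortOrder.DESCENDING);
        policy.setCompositeIndexes(Arrays.asList(Arrays.asList(byCustomer, byDate)));

        // Exclude a large path you never filter or sort on to cut write RU cost (made-up path),
        // keeping whatever exclusions the container already has.
        List<ExcludedPath> excluded = new ArrayList<>(policy.getExcludedPaths());
        excluded.add(new ExcludedPath("/rawPayload/*"));
        policy.setExcludedPaths(excluded);

        properties.setIndexingPolicy(policy);
        container.replace(properties);
    }
}
```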
Other miscellaneous fields in the Client Diagnostic
There’s often a bunch of abbreviations at the bottom of the Client Diagnostic, like PT0 or PT1H, and I always found these confusing. These values actually describe your application’s connection configuration, and the details can be found here:
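Those PT… strings are ISO-8601 durations (PT0S is zero seconds, PT1H is one hour). If you ever want to sanity-check one, java.time.Duration parses the same format:

```java
import java.time.Duration;

public class IsoDurations {
    public static void main(String[] args) {
        // The "PT..." strings in the diagnostics are ISO-8601 durations.
        System.out.println(Duration.parse("PT1H").toMinutes());   // 60
        System.out.println(Duration.parse("PT0S").toMillis());    // 0
        System.out.println(Duration.parse("PT0.5S").toMillis());  // 500
    }
}
```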
I hope this was useful; I always end up searching through 4 or 5 pages to pull this information together. Stay tuned!